{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# DataFrameExtractor\n", "\n", "\n", "
\n", "\n", "**Warning**: This section is a bit technical and many users won't need this functionality. Also, it is a bit experimental and the API may change in future versions. Proceed with caution.\n", "\n", "
\n", "\n", "The callables [`picks_to_df`, `events_to_df`](../datastructures/events_to_pandas.ipynb), and [`inventory_to_df`](../datastructures/stations_to_pandas.ipynb) are instances of `DataFrameExtractor`, which provides an extensible and customizable way to create callables that extract `DataFrame`s from arbitrary objects.\n", "\n", "To demonstrate, let's create a new extractor that puts the arrival objects in the Crandall catalog into a dataframe. The resulting table can then be joined with the picks table to do some (possibly) interesting things." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2024-02-28T22:20:32.882302Z", "iopub.status.busy": "2024-02-28T22:20:32.882132Z", "iopub.status.idle": "2024-02-28T22:20:35.732130Z", "shell.execute_reply": "2024-02-28T22:20:35.731480Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "8 Event(s) in Catalog:\n", "2007-08-06T08:48:40.010000Z | +39.464, -111.228 | 4.2 mb\n", "2007-08-07T02:14:24.080000Z | +39.463, -111.223 | 1.17 ml\n", "2007-08-07T03:44:18.470000Z | +39.462, -111.215 | 1.68 ml\n", "2007-08-07T07:13:05.760000Z | +39.461, -111.224 | 2.55 ml\n", "2007-08-07T02:05:04.490000Z | +39.465, -111.225 | 2.44 ml\n", "2007-08-06T10:47:25.600000Z | +39.462, -111.232 | 1.92 ml\n", "2007-08-07T21:42:51.130000Z | +39.463, -111.220 | 1.88 ml\n", "2007-08-06T01:44:48.810000Z | +39.462, -111.238 | 2.32 ml\n" ] } ], "source": [ "import obspy\n", "\n", "import obsplus\n", "\n", "crandall = obsplus.load_dataset('crandall_test')\n", "cat = crandall.event_client.get_events()\n", "print(cat)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Start by initializing the extractor with a list of expected columns and data types. This is optional, but helps ensure the output dataframe has a consistent shape and consistent data types. 
The [arrival documentation](https://docs.obspy.org/packages/autogen/obspy.core.event.origin.Arrival.html) may help in understanding these columns. Rather than collecting all the data contained in the `Arrival` instances, only a few columns of interest will be created." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2024-02-28T22:20:35.760584Z", "iopub.status.busy": "2024-02-28T22:20:35.760064Z", "iopub.status.idle": "2024-02-28T22:20:35.764011Z", "shell.execute_reply": "2024-02-28T22:20:35.763424Z" } }, "outputs": [], "source": [ "from collections import OrderedDict\n", "\n", "import obspy.core.event as ev\n", "\n", "# declare dtypes (the ordered keys double as the required columns)\n", "dtypes = OrderedDict(\n", "    resource_id=str,\n", "    pick_id=str,\n", "    event_id=str,\n", "    origin_id=str,\n", "    phase=str,\n", "    time_correction=float,\n", "    distance=float,\n", "    time_residual=float,\n", "    time_weight=float,\n", ")\n", "\n", "# init the DataFrameExtractor\n", "arrivals_to_df = obsplus.DataFrameExtractor(\n", "    ev.Arrival, required_columns=list(dtypes), dtypes=dtypes\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next step is to define some \"extractors\". These are callables that take an `Arrival` instance and return the desired data. An extractor can return either:\n", "\n", "1. A `dict` of values, where each key corresponds to a column name and each value is the row value of that column for the current object.\n", "\n", "2. Anything else, which is interpreted as the row value; the column name is then obtained from the function name (for example, `_get_phase` fills the `phase` column)." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2024-02-28T22:20:35.766166Z", "iopub.status.busy": "2024-02-28T22:20:35.765986Z", "iopub.status.idle": "2024-02-28T22:20:35.769383Z", "shell.execute_reply": "2024-02-28T22:20:35.768816Z" } }, "outputs": [], "source": [ "# an extractor which returns a dictionary\n", "@arrivals_to_df.extractor\n", "def _get_basic(arrival):\n", "    out = dict(\n", "        resource_id=str(arrival.resource_id),\n", "        pick_id=str(arrival.pick_id),\n", "        time_correction=arrival.time_correction,\n", "        distance=arrival.distance,\n", "        time_residual=arrival.time_residual,\n", "        time_weight=arrival.time_weight,\n", "    )\n", "    return out\n", "\n", "\n", "# an extractor which returns a single value\n", "@arrivals_to_df.extractor\n", "def _get_phase(arrival):\n", "    return arrival.phase" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that these extractors have no way of accessing information from the parent `Origin` or `Event` objects, and the extractor does not yet know how to find the arrivals in a `Catalog` object. Both problems are solved by registering a function for the type of data the extractor should operate on (here `obspy.Catalog`) and injecting the event and origin data into the arrival rows." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2024-02-28T22:20:35.771633Z", "iopub.status.busy": "2024-02-28T22:20:35.771205Z", "iopub.status.idle": "2024-02-28T22:20:35.774747Z", "shell.execute_reply": "2024-02-28T22:20:35.774233Z" } }, "outputs": [], "source": [ "@arrivals_to_df.register(obspy.Catalog)\n", "def _get_arrivals_from_catalogs(cat):\n", "    arrivals = []  # a list of arrivals\n", "    extras = {}  # dict of data to inject to arrival level\n", "    for event in cat:\n", "        for origin in event.origins:\n", "            arrivals.extend(origin.arrivals)\n", "            data = dict(event_id=event.resource_id, origin_id=origin.resource_id)\n", "            # key the extras by id(arrival) so they are injected into each arrival row\n", "            extras.update({id(x): data for x in origin.arrivals})\n", "    return arrivals_to_df(arrivals, extras=extras)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next step is to call the extractor on the catalog." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2024-02-28T22:20:35.776772Z", "iopub.status.busy": "2024-02-28T22:20:35.776594Z", "iopub.status.idle": "2024-02-28T22:20:35.801010Z", "shell.execute_reply": "2024-02-28T22:20:35.800419Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
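<table>\n",
"<thead>\n",
"<tr><th></th><th>resource_id</th><th>pick_id</th><th>event_id</th><th>origin_id</th><th>phase</th><th>time_correction</th><th>distance</th><th>time_residual</th><th>time_weight</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>smi:local/6a7f7180-21e5-4bbf-9a4e-20755d212437</td><td>smi:local/21691352</td><td>smi:local/248839</td><td>smi:local/404328</td><td>Pg</td><td>NaN</td><td>0.384</td><td>0.258</td><td>-1.0</td></tr>\n",
"<tr><th>1</th><td>smi:local/5e517baf-02ec-4a33-9225-7233dc109628</td><td>smi:local/21691353</td><td>smi:local/248839</td><td>smi:local/404328</td><td>Pg</td><td>NaN</td><td>0.356</td><td>1.334</td><td>-1.0</td></tr>\n",
"<tr><th>2</th><td>smi:local/83a9d3ca-415d-470b-93bb-34677e9f64af</td><td>smi:local/21691354</td><td>smi:local/248839</td><td>smi:local/404328</td><td>Pg</td><td>NaN</td><td>0.550</td><td>0.158</td><td>-1.0</td></tr>\n",
"<tr><th>3</th><td>smi:local/6593b4e2-8cd1-4ea0-85d7-7c67a0759a3d</td><td>smi:local/21691355</td><td>smi:local/248839</td><td>smi:local/404328</td><td>Sg</td><td>NaN</td><td>0.384</td><td>-1.695</td><td>-1.0</td></tr>\n",
"<tr><th>4</th><td>smi:local/7ee1ae58-d3aa-442c-ad1e-ca2bd9ca56d8</td><td>smi:local/21691356</td><td>smi:local/248839</td><td>smi:local/404328</td><td>Sg</td><td>NaN</td><td>0.356</td><td>0.038</td><td>-1.0</td></tr>\n",
"</tbody>\n",
"</table>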
\n", "
" ], "text/plain": [ " resource_id pick_id \\\n", "0 smi:local/6a7f7180-21e5-4bbf-9a4e-20755d212437 smi:local/21691352 \n", "1 smi:local/5e517baf-02ec-4a33-9225-7233dc109628 smi:local/21691353 \n", "2 smi:local/83a9d3ca-415d-470b-93bb-34677e9f64af smi:local/21691354 \n", "3 smi:local/6593b4e2-8cd1-4ea0-85d7-7c67a0759a3d smi:local/21691355 \n", "4 smi:local/7ee1ae58-d3aa-442c-ad1e-ca2bd9ca56d8 smi:local/21691356 \n", "\n", " event_id origin_id phase time_correction distance \\\n", "0 smi:local/248839 smi:local/404328 Pg NaN 0.384 \n", "1 smi:local/248839 smi:local/404328 Pg NaN 0.356 \n", "2 smi:local/248839 smi:local/404328 Pg NaN 0.550 \n", "3 smi:local/248839 smi:local/404328 Sg NaN 0.384 \n", "4 smi:local/248839 smi:local/404328 Sg NaN 0.356 \n", "\n", " time_residual time_weight \n", "0 0.258 -1.0 \n", "1 1.334 -1.0 \n", "2 0.158 -1.0 \n", "3 -1.695 -1.0 \n", "4 0.038 -1.0 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = arrivals_to_df(cat)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2024-02-28T22:20:35.803088Z", "iopub.status.busy": "2024-02-28T22:20:35.802904Z", "iopub.status.idle": "2024-02-28T22:20:35.807669Z", "shell.execute_reply": "2024-02-28T22:20:35.807185Z" } }, "outputs": [ { "data": { "text/plain": [ "phase\n", "pPn 238\n", "P 224\n", "Pb 129\n", "Sb 87\n", "Pg 79\n", "S 66\n", "Sg 53\n", "Pn 53\n", "Sn 22\n", "pPb 3\n", "Name: count, dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.phase.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If only the P phases were needed, the easiest thing to do is filter the dataframe. For demonstration let's modify our phase extractor so that any row that is not a P phase is skipped. This is done by raising a `SkipRow` exception which is an attribute of the `DataFrameExtractor`." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2024-02-28T22:20:35.809967Z", "iopub.status.busy": "2024-02-28T22:20:35.809526Z", "iopub.status.idle": "2024-02-28T22:20:35.812947Z", "shell.execute_reply": "2024-02-28T22:20:35.812429Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/runner/work/obsplus/obsplus/src/obsplus/structures/dfextractor.py:118: UserWarning: _get_phase is already a registered extractor, overwriting\n", " warnings.warn(msg)\n" ] } ], "source": [ "@arrivals_to_df.extractor\n", "def _get_phase(arrival):\n", "    phase = arrival.phase\n", "    # skip any arrival whose phase is not P (case-insensitive)\n", "    if phase.upper() != 'P':\n", "        raise arrivals_to_df.SkipRow\n", "    return phase" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2024-02-28T22:20:35.815122Z", "iopub.status.busy": "2024-02-28T22:20:35.814780Z", "iopub.status.idle": "2024-02-28T22:20:35.826657Z", "shell.execute_reply": "2024-02-28T22:20:35.826184Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "phase\n", "P 224\n", "Name: count, dtype: int64\n" ] } ], "source": [ "df = arrivals_to_df(cat)\n", "print(df.phase.value_counts())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get a picks dataframe, then left-join it onto the arrivals dataframe using the pick ID:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2024-02-28T22:20:35.828844Z", "iopub.status.busy": "2024-02-28T22:20:35.828492Z", "iopub.status.idle": "2024-02-28T22:20:35.900634Z", "shell.execute_reply": "2024-02-28T22:20:35.900055Z" } }, "outputs": [], "source": [ "# get picks and filter out non-P phases\n", "picks = obsplus.picks_to_df(cat)\n", "picks = picks[picks.phase_hint.str.upper() == \"P\"]" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2024-02-28T22:20:35.903303Z", "iopub.status.busy": "2024-02-28T22:20:35.902883Z", 
"iopub.status.idle": "2024-02-28T22:20:35.907943Z", "shell.execute_reply": "2024-02-28T22:20:35.907357Z" } }, "outputs": [], "source": [ "df_merged = df.merge(picks, how='left', right_on='resource_id', left_on='pick_id')" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "execution": { "iopub.execute_input": "2024-02-28T22:20:35.910348Z", "iopub.status.busy": "2024-02-28T22:20:35.910013Z", "iopub.status.idle": "2024-02-28T22:20:35.923072Z", "shell.execute_reply": "2024-02-28T22:20:35.922494Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
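<table>\n",
"<thead>\n",
"<tr><th></th><th>resource_id_x</th><th>pick_id</th><th>event_id_x</th><th>origin_id</th><th>phase</th><th>time_correction</th><th>distance</th><th>time_residual</th><th>time_weight</th><th>resource_id_y</th><th>...</th><th>agency_id</th><th>event_id_y</th><th>network</th><th>station</th><th>location</th><th>channel</th><th>uncertainty</th><th>lower_uncertainty</th><th>upper_uncertainty</th><th>confidence_level</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>smi:local/ed8b0924-a569-4a9b-be1a-9e837cbaf91e</td><td>smi:local/21691352</td><td>smi:local/248839</td><td>smi:local/404330</td><td>P</td><td>NaN</td><td>0.319</td><td>0.437</td><td>-1.0</td><td>smi:local/21691352</td><td>...</td><td></td><td>smi:local/248839</td><td>TA</td><td>P17A</td><td></td><td>BHZ</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>1</th><td>smi:local/14bccf6c-a0ec-4805-9469-a11c7704aafe</td><td>smi:local/21691353</td><td>smi:local/248839</td><td>smi:local/404330</td><td>P</td><td>NaN</td><td>0.407</td><td>-0.705</td><td>-1.0</td><td>smi:local/21691353</td><td>...</td><td></td><td>smi:local/248839</td><td>TA</td><td>P16A</td><td></td><td>BHZ</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>2</th><td>smi:local/a1c56c67-4c6f-4f4a-af55-c886007821dc</td><td>smi:local/21691354</td><td>smi:local/248839</td><td>smi:local/404330</td><td>P</td><td>NaN</td><td>0.585</td><td>-1.596</td><td>-1.0</td><td>smi:local/21691354</td><td>...</td><td></td><td>smi:local/248839</td><td>TA</td><td>Q16A</td><td></td><td>BHZ</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>3</th><td>smi:local/4f383f7b-56d3-4305-8754-f93e26230698</td><td>smi:local/21691357</td><td>smi:local/248839</td><td>smi:local/404330</td><td>P</td><td>NaN</td><td>0.624</td><td>-0.518</td><td>-1.0</td><td>smi:local/21691357</td><td>...</td><td></td><td>smi:local/248839</td><td>UU</td><td>SRU</td><td></td><td>BHZ</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>4</th><td>smi:local/fbb1b19a-5c2b-44bc-a4bf-7e4edfbbe1d9</td><td>smi:local/21691358</td><td>smi:local/248839</td><td>smi:local/404330</td><td>P</td><td>NaN</td><td>0.755</td><td>-0.598</td><td>-1.0</td><td>smi:local/21691358</td><td>...</td><td></td><td>smi:local/248839</td><td>TA</td><td>O16A</td><td></td><td>BHZ</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"</tbody>\n",
"</table>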
\n", "

5 rows × 34 columns

\n", "
" ], "text/plain": [ " resource_id_x pick_id \\\n", "0 smi:local/ed8b0924-a569-4a9b-be1a-9e837cbaf91e smi:local/21691352 \n", "1 smi:local/14bccf6c-a0ec-4805-9469-a11c7704aafe smi:local/21691353 \n", "2 smi:local/a1c56c67-4c6f-4f4a-af55-c886007821dc smi:local/21691354 \n", "3 smi:local/4f383f7b-56d3-4305-8754-f93e26230698 smi:local/21691357 \n", "4 smi:local/fbb1b19a-5c2b-44bc-a4bf-7e4edfbbe1d9 smi:local/21691358 \n", "\n", " event_id_x origin_id phase time_correction distance \\\n", "0 smi:local/248839 smi:local/404330 P NaN 0.319 \n", "1 smi:local/248839 smi:local/404330 P NaN 0.407 \n", "2 smi:local/248839 smi:local/404330 P NaN 0.585 \n", "3 smi:local/248839 smi:local/404330 P NaN 0.624 \n", "4 smi:local/248839 smi:local/404330 P NaN 0.755 \n", "\n", " time_residual time_weight resource_id_y ... agency_id \\\n", "0 0.437 -1.0 smi:local/21691352 ... \n", "1 -0.705 -1.0 smi:local/21691353 ... \n", "2 -1.596 -1.0 smi:local/21691354 ... \n", "3 -0.518 -1.0 smi:local/21691357 ... \n", "4 -0.598 -1.0 smi:local/21691358 ... 
\n", "\n", " event_id_y network station location channel uncertainty \\\n", "0 smi:local/248839 TA P17A BHZ NaN \n", "1 smi:local/248839 TA P16A BHZ NaN \n", "2 smi:local/248839 TA Q16A BHZ NaN \n", "3 smi:local/248839 UU SRU BHZ NaN \n", "4 smi:local/248839 TA O16A BHZ NaN \n", "\n", " lower_uncertainty upper_uncertainty confidence_level \n", "0 NaN NaN NaN \n", "1 NaN NaN NaN \n", "2 NaN NaN NaN \n", "3 NaN NaN NaN \n", "4 NaN NaN NaN \n", "\n", "[5 rows x 34 columns]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_merged.head()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "execution": { "iopub.execute_input": "2024-02-28T22:20:35.925337Z", "iopub.status.busy": "2024-02-28T22:20:35.924980Z", "iopub.status.idle": "2024-02-28T22:20:35.928942Z", "shell.execute_reply": "2024-02-28T22:20:35.928318Z" } }, "outputs": [ { "data": { "text/plain": [ "Index(['resource_id_x', 'pick_id', 'event_id_x', 'origin_id', 'phase',\n", " 'time_correction', 'distance', 'time_residual', 'time_weight',\n", " 'resource_id_y', 'time', 'seed_id', 'filter_id', 'method_id',\n", " 'horizontal_slowness', 'backazimuth', 'onset', 'phase_hint', 'polarity',\n", " 'evaluation_mode', 'event_time', 'evaluation_status', 'creation_time',\n", " 'author', 'agency_id', 'event_id_y', 'network', 'station', 'location',\n", " 'channel', 'uncertainty', 'lower_uncertainty', 'upper_uncertainty',\n", " 'confidence_level'],\n", " dtype='object')" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_merged.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calculate the fraction of arrivals whose `phase` attribute matches the `phase_hint` of the associated pick. A value below 1.0 means some arrivals disagree with their picks, which could indicate a quality issue." 
] }, { "cell_type": "code", "execution_count": 13, "metadata": { "execution": { "iopub.execute_input": "2024-02-28T22:20:35.931046Z", "iopub.status.busy": "2024-02-28T22:20:35.930865Z", "iopub.status.idle": "2024-02-28T22:20:35.935177Z", "shell.execute_reply": "2024-02-28T22:20:35.934579Z" } }, "outputs": [ { "data": { "text/plain": [ "1.0" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# calculate fraction of phase_hints that match phase\n", "(df_merged['phase'] == df_merged['phase_hint']).sum() / len(df_merged)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" } }, "nbformat": 4, "nbformat_minor": 4 }