{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# DataFrameExtractor\n",
    "\n",
    "\n",
    "<div class=\"alert alert-warning\">\n",
    "\n",
    "**Warning**: This section is a bit technical and many users won't need this functionality. Also, it is a bit experimental and the API may change in future versions. Proceed with caution.\n",
    "\n",
    "</div>\n",
    "\n",
    "The callables [`picks_to_df`, `events_to_df`](../datastructures/events_to_pandas.ipynb), and [`inventory_to_df`](../datastructures/stations_to_pandas.ipynb) are instances of `DataFrameExtractor`, which provides an extensible and customizable way for creating callables that extract `DataFrames` from arbitrary objects.\n",
    "\n",
    "To demonstrate, let's create a new extractor to put arrival objects in the Crandall catalog into a dataframe.  The table can be joined together with the picks table to do some (possibly) interesting things."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-02-28T22:20:32.882302Z",
     "iopub.status.busy": "2024-02-28T22:20:32.882132Z",
     "iopub.status.idle": "2024-02-28T22:20:35.732130Z",
     "shell.execute_reply": "2024-02-28T22:20:35.731480Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "8 Event(s) in Catalog:\n",
      "2007-08-06T08:48:40.010000Z | +39.464, -111.228 | 4.2  mb\n",
      "2007-08-07T02:14:24.080000Z | +39.463, -111.223 | 1.17 ml\n",
      "2007-08-07T03:44:18.470000Z | +39.462, -111.215 | 1.68 ml\n",
      "2007-08-07T07:13:05.760000Z | +39.461, -111.224 | 2.55 ml\n",
      "2007-08-07T02:05:04.490000Z | +39.465, -111.225 | 2.44 ml\n",
      "2007-08-06T10:47:25.600000Z | +39.462, -111.232 | 1.92 ml\n",
      "2007-08-07T21:42:51.130000Z | +39.463, -111.220 | 1.88 ml\n",
      "2007-08-06T01:44:48.810000Z | +39.462, -111.238 | 2.32 ml\n"
     ]
    }
   ],
   "source": [
    "import obspy\n",
    "\n",
    "import obsplus\n",
    "\n",
    "crandall = obsplus.load_dataset('crandall_test')\n",
    "cat =crandall.event_client.get_events()\n",
    "print(cat)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Start by initializing the extractor with a list of expected columns and data types. This is optional, but helps ensure the output dataframe has a consistent shape and data type. The [arrival documentation](https://docs.obspy.org/packages/autogen/obspy.core.event.origin.Arrival.html) may be useful to understand these. Rather than collecting all the data contained in the `Arrival` instances, only a few columns of interest will be created."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-02-28T22:20:35.760584Z",
     "iopub.status.busy": "2024-02-28T22:20:35.760064Z",
     "iopub.status.idle": "2024-02-28T22:20:35.764011Z",
     "shell.execute_reply": "2024-02-28T22:20:35.763424Z"
    }
   },
   "outputs": [],
   "source": [
    "from collections import OrderedDict\n",
    "\n",
    "import obspy.core.event as ev\n",
    "\n",
    "# declare datatypes (order to double as required columns)\n",
    "dtypes = OrderedDict(\n",
    "    resource_id=str, \n",
    "    pick_id=str, \n",
    "    event_id=str,\n",
    "    origin_id= str, \n",
    "    phase=str, \n",
    "    time_correction=float, \n",
    "    distance=float, \n",
    "    time_residual=float,                \n",
    "    time_weight=float,\n",
    ")\n",
    "\n",
    "# init the DataFrameExtractor\n",
    "arrivals_to_df = obsplus.DataFrameExtractor(ev.Arrival, required_columns=list(dtypes), \n",
    "                                            dtypes=dtypes)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next step it to define some \"extractors\". These are callables that will take an `Arrival` instance and return the desired data. The extractors can return:\n",
    "\n",
    "1. A `dict` of values where each key corresponds to a column name and each value is the row value of that column for the current object.\n",
    "\n",
    "2. Anything else, which is interpreted as the row value, and the column name is obtained from the function name."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-02-28T22:20:35.766166Z",
     "iopub.status.busy": "2024-02-28T22:20:35.765986Z",
     "iopub.status.idle": "2024-02-28T22:20:35.769383Z",
     "shell.execute_reply": "2024-02-28T22:20:35.768816Z"
    }
   },
   "outputs": [],
   "source": [
    "# an extractor which returns a dictionary\n",
    "@arrivals_to_df.extractor\n",
    "def _get_basic(arrival):\n",
    "    out = dict(\n",
    "        resource_id=str(arrival.resource_id),\n",
    "        pick_id=str(arrival.pick_id),\n",
    "        time_correction=arrival.time_correction,\n",
    "        distance=arrival.distance,\n",
    "        time_residual=arrival.time_residual,\n",
    "        time_weight=arrival.time_weight,\n",
    "    )\n",
    "    return out\n",
    "\n",
    "\n",
    "# an extractor which returns a single value\n",
    "@arrivals_to_df.extractor\n",
    "def _get_phase(arrival):\n",
    "    return arrival.phase"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that there is no way of extracting information from the parent `Origin` or `Event` objects. The extractor also doesn't know how to find the arrivals in a `Catalog` object. Defining the types of data the extractor can operate on, and injecting the event and origin data into arrival rows will accomplish both of these tasks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-02-28T22:20:35.771633Z",
     "iopub.status.busy": "2024-02-28T22:20:35.771205Z",
     "iopub.status.idle": "2024-02-28T22:20:35.774747Z",
     "shell.execute_reply": "2024-02-28T22:20:35.774233Z"
    }
   },
   "outputs": [],
   "source": [
    "@arrivals_to_df.register(obspy.Catalog)\n",
    "def _get_arrivals_from_catalogs(cat):\n",
    "    arrivals = []  # a list of arrivals\n",
    "    extras = {}  # dict of data to inject to arrival level\n",
    "    for event in cat:\n",
    "        for origin in event.origins:\n",
    "            arrivals.extend(origin.arrivals)\n",
    "            data = dict(event_id=event.resource_id, origin_id=origin.resource_id)\n",
    "            # use arrival id to inject extra to each arrival row\n",
    "            extras.update({id(x): data for x in origin.arrivals})\n",
    "    return arrivals_to_df(arrivals, extras=extras)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next step is to initiate the extractor."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-02-28T22:20:35.776772Z",
     "iopub.status.busy": "2024-02-28T22:20:35.776594Z",
     "iopub.status.idle": "2024-02-28T22:20:35.801010Z",
     "shell.execute_reply": "2024-02-28T22:20:35.800419Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>resource_id</th>\n",
       "      <th>pick_id</th>\n",
       "      <th>event_id</th>\n",
       "      <th>origin_id</th>\n",
       "      <th>phase</th>\n",
       "      <th>time_correction</th>\n",
       "      <th>distance</th>\n",
       "      <th>time_residual</th>\n",
       "      <th>time_weight</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>smi:local/6a7f7180-21e5-4bbf-9a4e-20755d212437</td>\n",
       "      <td>smi:local/21691352</td>\n",
       "      <td>smi:local/248839</td>\n",
       "      <td>smi:local/404328</td>\n",
       "      <td>Pg</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.384</td>\n",
       "      <td>0.258</td>\n",
       "      <td>-1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>smi:local/5e517baf-02ec-4a33-9225-7233dc109628</td>\n",
       "      <td>smi:local/21691353</td>\n",
       "      <td>smi:local/248839</td>\n",
       "      <td>smi:local/404328</td>\n",
       "      <td>Pg</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.356</td>\n",
       "      <td>1.334</td>\n",
       "      <td>-1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>smi:local/83a9d3ca-415d-470b-93bb-34677e9f64af</td>\n",
       "      <td>smi:local/21691354</td>\n",
       "      <td>smi:local/248839</td>\n",
       "      <td>smi:local/404328</td>\n",
       "      <td>Pg</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.550</td>\n",
       "      <td>0.158</td>\n",
       "      <td>-1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>smi:local/6593b4e2-8cd1-4ea0-85d7-7c67a0759a3d</td>\n",
       "      <td>smi:local/21691355</td>\n",
       "      <td>smi:local/248839</td>\n",
       "      <td>smi:local/404328</td>\n",
       "      <td>Sg</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.384</td>\n",
       "      <td>-1.695</td>\n",
       "      <td>-1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>smi:local/7ee1ae58-d3aa-442c-ad1e-ca2bd9ca56d8</td>\n",
       "      <td>smi:local/21691356</td>\n",
       "      <td>smi:local/248839</td>\n",
       "      <td>smi:local/404328</td>\n",
       "      <td>Sg</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.356</td>\n",
       "      <td>0.038</td>\n",
       "      <td>-1.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                      resource_id             pick_id  \\\n",
       "0  smi:local/6a7f7180-21e5-4bbf-9a4e-20755d212437  smi:local/21691352   \n",
       "1  smi:local/5e517baf-02ec-4a33-9225-7233dc109628  smi:local/21691353   \n",
       "2  smi:local/83a9d3ca-415d-470b-93bb-34677e9f64af  smi:local/21691354   \n",
       "3  smi:local/6593b4e2-8cd1-4ea0-85d7-7c67a0759a3d  smi:local/21691355   \n",
       "4  smi:local/7ee1ae58-d3aa-442c-ad1e-ca2bd9ca56d8  smi:local/21691356   \n",
       "\n",
       "           event_id         origin_id phase  time_correction  distance  \\\n",
       "0  smi:local/248839  smi:local/404328    Pg              NaN     0.384   \n",
       "1  smi:local/248839  smi:local/404328    Pg              NaN     0.356   \n",
       "2  smi:local/248839  smi:local/404328    Pg              NaN     0.550   \n",
       "3  smi:local/248839  smi:local/404328    Sg              NaN     0.384   \n",
       "4  smi:local/248839  smi:local/404328    Sg              NaN     0.356   \n",
       "\n",
       "   time_residual  time_weight  \n",
       "0          0.258         -1.0  \n",
       "1          1.334         -1.0  \n",
       "2          0.158         -1.0  \n",
       "3         -1.695         -1.0  \n",
       "4          0.038         -1.0  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = arrivals_to_df(cat)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-02-28T22:20:35.803088Z",
     "iopub.status.busy": "2024-02-28T22:20:35.802904Z",
     "iopub.status.idle": "2024-02-28T22:20:35.807669Z",
     "shell.execute_reply": "2024-02-28T22:20:35.807185Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "phase\n",
       "pPn    238\n",
       "P      224\n",
       "Pb     129\n",
       "Sb      87\n",
       "Pg      79\n",
       "S       66\n",
       "Sg      53\n",
       "Pn      53\n",
       "Sn      22\n",
       "pPb      3\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.phase.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If only the P phases were needed, the easiest thing to do is filter the dataframe.  For demonstration let's modify our phase extractor so that any row that is not a P phase is skipped. This is done by raising a `SkipRow` exception which is an attribute of the `DataFrameExtractor`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-02-28T22:20:35.809967Z",
     "iopub.status.busy": "2024-02-28T22:20:35.809526Z",
     "iopub.status.idle": "2024-02-28T22:20:35.812947Z",
     "shell.execute_reply": "2024-02-28T22:20:35.812429Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/runner/work/obsplus/obsplus/src/obsplus/structures/dfextractor.py:118: UserWarning: _get_phase is already a registered extractor, overwriting\n",
      "  warnings.warn(msg)\n"
     ]
    }
   ],
   "source": [
    "@arrivals_to_df.extractor\n",
    "def _get_phase(arrival):\n",
    "    phase = arrival.phase\n",
    "    if phase.upper() != 'P':\n",
    "        raise arrivals_to_df.SkipRow\n",
    "    return phase"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-02-28T22:20:35.815122Z",
     "iopub.status.busy": "2024-02-28T22:20:35.814780Z",
     "iopub.status.idle": "2024-02-28T22:20:35.826657Z",
     "shell.execute_reply": "2024-02-28T22:20:35.826184Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "phase\n",
      "P    224\n",
      "Name: count, dtype: int64\n"
     ]
    }
   ],
   "source": [
    "df = arrivals_to_df(cat)\n",
    "print(df.phase.value_counts())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get a picks dataframe and perform a left join on the phases:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-02-28T22:20:35.828844Z",
     "iopub.status.busy": "2024-02-28T22:20:35.828492Z",
     "iopub.status.idle": "2024-02-28T22:20:35.900634Z",
     "shell.execute_reply": "2024-02-28T22:20:35.900055Z"
    }
   },
   "outputs": [],
   "source": [
    "# get picks and filter out non-P phases\n",
    "picks = obsplus.picks_to_df(cat)\n",
    "picks = picks[picks.phase_hint.str.upper() == \"P\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-02-28T22:20:35.903303Z",
     "iopub.status.busy": "2024-02-28T22:20:35.902883Z",
     "iopub.status.idle": "2024-02-28T22:20:35.907943Z",
     "shell.execute_reply": "2024-02-28T22:20:35.907357Z"
    }
   },
   "outputs": [],
   "source": [
    "df_merged = df.merge(picks, how='left', right_on='resource_id', left_on='pick_id')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-02-28T22:20:35.910348Z",
     "iopub.status.busy": "2024-02-28T22:20:35.910013Z",
     "iopub.status.idle": "2024-02-28T22:20:35.923072Z",
     "shell.execute_reply": "2024-02-28T22:20:35.922494Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>resource_id_x</th>\n",
       "      <th>pick_id</th>\n",
       "      <th>event_id_x</th>\n",
       "      <th>origin_id</th>\n",
       "      <th>phase</th>\n",
       "      <th>time_correction</th>\n",
       "      <th>distance</th>\n",
       "      <th>time_residual</th>\n",
       "      <th>time_weight</th>\n",
       "      <th>resource_id_y</th>\n",
       "      <th>...</th>\n",
       "      <th>agency_id</th>\n",
       "      <th>event_id_y</th>\n",
       "      <th>network</th>\n",
       "      <th>station</th>\n",
       "      <th>location</th>\n",
       "      <th>channel</th>\n",
       "      <th>uncertainty</th>\n",
       "      <th>lower_uncertainty</th>\n",
       "      <th>upper_uncertainty</th>\n",
       "      <th>confidence_level</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>smi:local/ed8b0924-a569-4a9b-be1a-9e837cbaf91e</td>\n",
       "      <td>smi:local/21691352</td>\n",
       "      <td>smi:local/248839</td>\n",
       "      <td>smi:local/404330</td>\n",
       "      <td>P</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.319</td>\n",
       "      <td>0.437</td>\n",
       "      <td>-1.0</td>\n",
       "      <td>smi:local/21691352</td>\n",
       "      <td>...</td>\n",
       "      <td></td>\n",
       "      <td>smi:local/248839</td>\n",
       "      <td>TA</td>\n",
       "      <td>P17A</td>\n",
       "      <td></td>\n",
       "      <td>BHZ</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>smi:local/14bccf6c-a0ec-4805-9469-a11c7704aafe</td>\n",
       "      <td>smi:local/21691353</td>\n",
       "      <td>smi:local/248839</td>\n",
       "      <td>smi:local/404330</td>\n",
       "      <td>P</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.407</td>\n",
       "      <td>-0.705</td>\n",
       "      <td>-1.0</td>\n",
       "      <td>smi:local/21691353</td>\n",
       "      <td>...</td>\n",
       "      <td></td>\n",
       "      <td>smi:local/248839</td>\n",
       "      <td>TA</td>\n",
       "      <td>P16A</td>\n",
       "      <td></td>\n",
       "      <td>BHZ</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>smi:local/a1c56c67-4c6f-4f4a-af55-c886007821dc</td>\n",
       "      <td>smi:local/21691354</td>\n",
       "      <td>smi:local/248839</td>\n",
       "      <td>smi:local/404330</td>\n",
       "      <td>P</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.585</td>\n",
       "      <td>-1.596</td>\n",
       "      <td>-1.0</td>\n",
       "      <td>smi:local/21691354</td>\n",
       "      <td>...</td>\n",
       "      <td></td>\n",
       "      <td>smi:local/248839</td>\n",
       "      <td>TA</td>\n",
       "      <td>Q16A</td>\n",
       "      <td></td>\n",
       "      <td>BHZ</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>smi:local/4f383f7b-56d3-4305-8754-f93e26230698</td>\n",
       "      <td>smi:local/21691357</td>\n",
       "      <td>smi:local/248839</td>\n",
       "      <td>smi:local/404330</td>\n",
       "      <td>P</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.624</td>\n",
       "      <td>-0.518</td>\n",
       "      <td>-1.0</td>\n",
       "      <td>smi:local/21691357</td>\n",
       "      <td>...</td>\n",
       "      <td></td>\n",
       "      <td>smi:local/248839</td>\n",
       "      <td>UU</td>\n",
       "      <td>SRU</td>\n",
       "      <td></td>\n",
       "      <td>BHZ</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>smi:local/fbb1b19a-5c2b-44bc-a4bf-7e4edfbbe1d9</td>\n",
       "      <td>smi:local/21691358</td>\n",
       "      <td>smi:local/248839</td>\n",
       "      <td>smi:local/404330</td>\n",
       "      <td>P</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.755</td>\n",
       "      <td>-0.598</td>\n",
       "      <td>-1.0</td>\n",
       "      <td>smi:local/21691358</td>\n",
       "      <td>...</td>\n",
       "      <td></td>\n",
       "      <td>smi:local/248839</td>\n",
       "      <td>TA</td>\n",
       "      <td>O16A</td>\n",
       "      <td></td>\n",
       "      <td>BHZ</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 34 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                    resource_id_x             pick_id  \\\n",
       "0  smi:local/ed8b0924-a569-4a9b-be1a-9e837cbaf91e  smi:local/21691352   \n",
       "1  smi:local/14bccf6c-a0ec-4805-9469-a11c7704aafe  smi:local/21691353   \n",
       "2  smi:local/a1c56c67-4c6f-4f4a-af55-c886007821dc  smi:local/21691354   \n",
       "3  smi:local/4f383f7b-56d3-4305-8754-f93e26230698  smi:local/21691357   \n",
       "4  smi:local/fbb1b19a-5c2b-44bc-a4bf-7e4edfbbe1d9  smi:local/21691358   \n",
       "\n",
       "         event_id_x         origin_id phase  time_correction  distance  \\\n",
       "0  smi:local/248839  smi:local/404330     P              NaN     0.319   \n",
       "1  smi:local/248839  smi:local/404330     P              NaN     0.407   \n",
       "2  smi:local/248839  smi:local/404330     P              NaN     0.585   \n",
       "3  smi:local/248839  smi:local/404330     P              NaN     0.624   \n",
       "4  smi:local/248839  smi:local/404330     P              NaN     0.755   \n",
       "\n",
       "   time_residual  time_weight       resource_id_y  ... agency_id  \\\n",
       "0          0.437         -1.0  smi:local/21691352  ...             \n",
       "1         -0.705         -1.0  smi:local/21691353  ...             \n",
       "2         -1.596         -1.0  smi:local/21691354  ...             \n",
       "3         -0.518         -1.0  smi:local/21691357  ...             \n",
       "4         -0.598         -1.0  smi:local/21691358  ...             \n",
       "\n",
       "         event_id_y network station  location  channel uncertainty  \\\n",
       "0  smi:local/248839      TA    P17A                BHZ         NaN   \n",
       "1  smi:local/248839      TA    P16A                BHZ         NaN   \n",
       "2  smi:local/248839      TA    Q16A                BHZ         NaN   \n",
       "3  smi:local/248839      UU     SRU                BHZ         NaN   \n",
       "4  smi:local/248839      TA    O16A                BHZ         NaN   \n",
       "\n",
       "  lower_uncertainty upper_uncertainty confidence_level  \n",
       "0               NaN               NaN              NaN  \n",
       "1               NaN               NaN              NaN  \n",
       "2               NaN               NaN              NaN  \n",
       "3               NaN               NaN              NaN  \n",
       "4               NaN               NaN              NaN  \n",
       "\n",
       "[5 rows x 34 columns]"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_merged.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-02-28T22:20:35.925337Z",
     "iopub.status.busy": "2024-02-28T22:20:35.924980Z",
     "iopub.status.idle": "2024-02-28T22:20:35.928942Z",
     "shell.execute_reply": "2024-02-28T22:20:35.928318Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['resource_id_x', 'pick_id', 'event_id_x', 'origin_id', 'phase',\n",
       "       'time_correction', 'distance', 'time_residual', 'time_weight',\n",
       "       'resource_id_y', 'time', 'seed_id', 'filter_id', 'method_id',\n",
       "       'horizontal_slowness', 'backazimuth', 'onset', 'phase_hint', 'polarity',\n",
       "       'evaluation_mode', 'event_time', 'evaluation_status', 'creation_time',\n",
       "       'author', 'agency_id', 'event_id_y', 'network', 'station', 'location',\n",
       "       'channel', 'uncertainty', 'lower_uncertainty', 'upper_uncertainty',\n",
       "       'confidence_level'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_merged.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Calculate how often the `phase` attribute in the arrival is different from the `phase_hint` in the pick, which could indicate a quality issue."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-02-28T22:20:35.931046Z",
     "iopub.status.busy": "2024-02-28T22:20:35.930865Z",
     "iopub.status.idle": "2024-02-28T22:20:35.935177Z",
     "shell.execute_reply": "2024-02-28T22:20:35.934579Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.0"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# calculate fraction of phase_hints that match phase\n",
    "(df_merged['phase'] == df_merged['phase_hint']).sum() / len(df_merged)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}