DataFrameExtractor

Warning: This section is somewhat technical, and many users won’t need this functionality. It is also experimental, so the API may change in future versions. Proceed with caution.

The callables picks_to_df, `events_to_df <../datastructures/events_to_pandas.ipynb>`__, and `inventory_to_df <../datastructures/stations_to_pandas.ipynb>`__ are instances of DataFrameExtractor, which provides an extensible, customizable way to create callables that extract DataFrames from arbitrary objects.
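For instance, this relationship can be checked directly (a minimal sketch, not an executed cell in this notebook):

import obsplus

# picks_to_df should be a DataFrameExtractor instance, per the paragraph above
print(isinstance(obsplus.picks_to_df, obsplus.DataFrameExtractor))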

To demonstrate, let’s create a new extractor that puts the Arrival objects in the Crandall catalog into a dataframe. The resulting table can then be joined with the picks table to do some (possibly) interesting things.

[1]:
import obspy

import obsplus

crandall = obsplus.load_dataset('crandall_test')
cat = crandall.event_client.get_events()
print(cat)
8 Event(s) in Catalog:
2007-08-06T08:48:40.010000Z | +39.464, -111.228 | 4.2  mb
2007-08-07T02:14:24.080000Z | +39.463, -111.223 | 1.17 ml
2007-08-07T03:44:18.470000Z | +39.462, -111.215 | 1.68 ml
2007-08-07T07:13:05.760000Z | +39.461, -111.224 | 2.55 ml
2007-08-07T02:05:04.490000Z | +39.465, -111.225 | 2.44 ml
2007-08-06T10:47:25.600000Z | +39.462, -111.232 | 1.92 ml
2007-08-07T21:42:51.130000Z | +39.463, -111.220 | 1.88 ml
2007-08-06T01:44:48.810000Z | +39.462, -111.238 | 2.32 ml

Start by initializing the extractor with a list of expected columns and their data types. This is optional, but it helps ensure the output dataframe has a consistent shape and consistent data types. The Arrival documentation may be useful for understanding these fields. Rather than collecting all of the data contained in the Arrival instances, only a few columns of interest will be created.

[2]:
from collections import OrderedDict

import obspy.core.event as ev

# declare datatypes (the key order also doubles as the required column order)
dtypes = OrderedDict(
    resource_id=str,
    pick_id=str,
    event_id=str,
    origin_id=str,
    phase=str,
    time_correction=float,
    distance=float,
    time_residual=float,
    time_weight=float,
)

# init the DataFrameExtractor
arrivals_to_df = obsplus.DataFrameExtractor(
    ev.Arrival, required_columns=list(dtypes), dtypes=dtypes
)

The next step is to define some “extractors”. These are callables that take an Arrival instance and return the desired data. An extractor can return:

  1. A dict of values where each key corresponds to a column name and each value is the row value of that column for the current object.

  2. Anything else, which is interpreted as the row value; the column name is then derived from the function name (a leading _get_ is stripped, so the _get_phase extractor below populates the phase column).

[3]:
# an extractor which returns a dictionary
@arrivals_to_df.extractor
def _get_basic(arrival):
    out = dict(
        resource_id=str(arrival.resource_id),
        pick_id=str(arrival.pick_id),
        time_correction=arrival.time_correction,
        distance=arrival.distance,
        time_residual=arrival.time_residual,
        time_weight=arrival.time_weight,
    )
    return out


# an extractor which returns a single value
@arrivals_to_df.extractor
def _get_phase(arrival):
    return arrival.phase

Notice that there is currently no way to extract information from the parent Origin or Event objects, and the extractor doesn’t yet know how to find the arrivals in a Catalog object. Registering the Catalog type with the extractor, and injecting the event and origin data into the arrival rows, will accomplish both tasks.

[4]:
@arrivals_to_df.register(obspy.Catalog)
def _get_arrivals_from_catalogs(cat):
    arrivals = []  # a list of arrivals
    extras = {}  # extra data to inject into each arrival row, keyed by arrival id
    for event in cat:
        for origin in event.origins:
            arrivals.extend(origin.arrivals)
            data = dict(event_id=event.resource_id, origin_id=origin.resource_id)
            # map each arrival's id to the extra data to inject into its row
            extras.update({id(x): data for x in origin.arrivals})
    return arrivals_to_df(arrivals, extras=extras)

The next step is to invoke the extractor on the catalog; the function registered above finds the arrivals and injects the extra event and origin data.

[5]:
df = arrivals_to_df(cat)
df.head()
[5]:
resource_id pick_id event_id origin_id phase time_correction distance time_residual time_weight
0 smi:local/6a7f7180-21e5-4bbf-9a4e-20755d212437 smi:local/21691352 smi:local/248839 smi:local/404328 Pg NaN 0.384 0.258 -1.0
1 smi:local/5e517baf-02ec-4a33-9225-7233dc109628 smi:local/21691353 smi:local/248839 smi:local/404328 Pg NaN 0.356 1.334 -1.0
2 smi:local/83a9d3ca-415d-470b-93bb-34677e9f64af smi:local/21691354 smi:local/248839 smi:local/404328 Pg NaN 0.550 0.158 -1.0
3 smi:local/6593b4e2-8cd1-4ea0-85d7-7c67a0759a3d smi:local/21691355 smi:local/248839 smi:local/404328 Sg NaN 0.384 -1.695 -1.0
4 smi:local/7ee1ae58-d3aa-442c-ad1e-ca2bd9ca56d8 smi:local/21691356 smi:local/248839 smi:local/404328 Sg NaN 0.356 0.038 -1.0
[6]:
df.phase.value_counts()
[6]:
phase
pPn    238
P      224
Pb     129
Sb      87
Pg      79
S       66
Sg      53
Pn      53
Sn      22
pPb      3
Name: count, dtype: int64

If only the P phases were needed, the easiest thing to do would be to filter the dataframe directly (a one-line sketch is shown below). For demonstration, however, let’s modify the phase extractor so that any row that is not a P phase gets skipped. This is done by raising a SkipRow exception, which is an attribute of the DataFrameExtractor. Since the new function reuses the name _get_phase, it replaces the original extractor, as the warning below indicates.
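For reference, the direct-filter alternative is a one-liner (a minimal sketch using the df built above, not an executed cell):

# keep only the rows whose arrival phase is exactly "P"
df_p = df[df.phase.str.upper() == "P"]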

[7]:
@arrivals_to_df.extractor
def _get_phase(arrival):
    phase = arrival.phase
    if phase.upper() != 'P':
        raise arrivals_to_df.SkipRow
    return phase
/home/runner/work/obsplus/obsplus/src/obsplus/structures/dfextractor.py:118: UserWarning: _get_phase is already a registered extractor, overwriting
  warnings.warn(msg)
[8]:
df = arrivals_to_df(cat)
print(df.phase.value_counts())
phase
P    224
Name: count, dtype: int64

Get a picks dataframe, filter it to P phases, and left join it to the arrivals by matching the arrivals’ pick_id to the picks’ resource_id:

[9]:
# get picks and filter out non-P phases
picks = obsplus.picks_to_df(cat)
picks = picks[picks.phase_hint.str.upper() == "P"]
[10]:
df_merged = df.merge(picks, how='left', right_on='resource_id', left_on='pick_id')
[11]:
df_merged.head()
[11]:
resource_id_x pick_id event_id_x origin_id phase time_correction distance time_residual time_weight resource_id_y ... agency_id event_id_y network station location channel uncertainty lower_uncertainty upper_uncertainty confidence_level
0 smi:local/ed8b0924-a569-4a9b-be1a-9e837cbaf91e smi:local/21691352 smi:local/248839 smi:local/404330 P NaN 0.319 0.437 -1.0 smi:local/21691352 ... smi:local/248839 TA P17A BHZ NaN NaN NaN NaN
1 smi:local/14bccf6c-a0ec-4805-9469-a11c7704aafe smi:local/21691353 smi:local/248839 smi:local/404330 P NaN 0.407 -0.705 -1.0 smi:local/21691353 ... smi:local/248839 TA P16A BHZ NaN NaN NaN NaN
2 smi:local/a1c56c67-4c6f-4f4a-af55-c886007821dc smi:local/21691354 smi:local/248839 smi:local/404330 P NaN 0.585 -1.596 -1.0 smi:local/21691354 ... smi:local/248839 TA Q16A BHZ NaN NaN NaN NaN
3 smi:local/4f383f7b-56d3-4305-8754-f93e26230698 smi:local/21691357 smi:local/248839 smi:local/404330 P NaN 0.624 -0.518 -1.0 smi:local/21691357 ... smi:local/248839 UU SRU BHZ NaN NaN NaN NaN
4 smi:local/fbb1b19a-5c2b-44bc-a4bf-7e4edfbbe1d9 smi:local/21691358 smi:local/248839 smi:local/404330 P NaN 0.755 -0.598 -1.0 smi:local/21691358 ... smi:local/248839 TA O16A BHZ NaN NaN NaN NaN

5 rows × 34 columns

[12]:
df_merged.columns
[12]:
Index(['resource_id_x', 'pick_id', 'event_id_x', 'origin_id', 'phase',
       'time_correction', 'distance', 'time_residual', 'time_weight',
       'resource_id_y', 'time', 'seed_id', 'filter_id', 'method_id',
       'horizontal_slowness', 'backazimuth', 'onset', 'phase_hint', 'polarity',
       'evaluation_mode', 'event_time', 'evaluation_status', 'creation_time',
       'author', 'agency_id', 'event_id_y', 'network', 'station', 'location',
       'channel', 'uncertainty', 'lower_uncertainty', 'upper_uncertainty',
       'confidence_level'],
      dtype='object')

Finally, calculate how often the phase attribute of the arrival matches the phase_hint of the pick; a mismatch could indicate a quality issue.

[13]:
# calculate fraction of phase_hints that match phase
(df_merged['phase'] == df_merged['phase_hint']).sum() / len(df_merged)
[13]:
1.0
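Since the comparison produces a boolean series, the same fraction can also be computed with its mean (an equivalent one-liner, not an executed cell):

# the mean of a boolean series equals the fraction of True values
(df_merged['phase'] == df_merged['phase_hint']).mean()

Here the result is 1.0, i.e. every arrival phase agrees with its pick’s phase_hint.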