obsplus.utils.pd module

Generic Utilities for Pandas

obsplus.utils.pd.apply_funcs_to_columns(df, funcs, inplace=False)[source]

Apply callables to columns.

Parameters:
  • df (DataFrame) – The input dataframe.

  • funcs (Optional[Mapping[str, Callable[[Series], Union[Series, ndarray]]]]) – A mapping of {column_name: function_to_apply}.

  • inplace (bool) – If True, perform operation in place.

Return type:

A new dataframe with the columns replaced with output of the function.
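
The idea can be approximated in plain pandas. This is a minimal sketch, not the obsplus implementation; the helper name `apply_to_columns_sketch` and the sample data are hypothetical.

```python
import pandas as pd


def apply_to_columns_sketch(df, funcs, inplace=False):
    """Hypothetical stand-in: replace each named column with func(column)."""
    out = df if inplace else df.copy()
    for name, func in (funcs or {}).items():
        out[name] = func(out[name])
    return out


df = pd.DataFrame({"station": ["abc", "def"], "depth": [1.0, 2.5]})
new = apply_to_columns_sketch(
    df,
    {"station": lambda s: s.str.upper(), "depth": lambda s: s * 1000},
)
```

With `inplace=False` the sketch works on a copy, so `df` keeps its original values.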

obsplus.utils.pd.cast_dtypes(df, dtype=None, inplace=False)[source]

Cast data types for columns in a dataframe, skipping columns that don’t exist.

The following obsplus specific datatypes are supported:

  • ‘ops_datetime’ - call obsplus.utils.time.to_datetime64() on column

  • ‘ops_timedelta’ - call obsplus.utils.time.to_timedelta64() on column

Notes

This function differs from pandas’ DataFrame.astype because it skips columns which don’t exist and handles custom obsplus dtypes.

Parameters:
  • df (DataFrame) – The input dataframe.

  • dtype (Optional[Mapping[str, Union[type, str]]]) – A dict of columns and datatypes.

  • inplace – If True, perform operation in place.

Return type:

DataFrame
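
The difference from `DataFrame.astype` noted above can be illustrated with plain pandas. This is a sketch of the skip-missing-columns behavior only, under the assumption that filtering the mapping is an adequate stand-in; it does not cover the obsplus-specific dtypes, and the column names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"magnitude": ["1.5", "2.0"], "count": [1.0, 2.0]})
dtypes = {"magnitude": float, "count": int, "not_a_column": str}

# DataFrame.astype rejects mapping keys that are not columns.
try:
    df.astype(dtypes)
    raised = False
except KeyError:
    raised = True

# Keep only the entries whose column is present, then cast.
present = {col: dt for col, dt in dtypes.items() if col in df.columns}
out = df.astype(present)
```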

obsplus.utils.pd.convert_bytestrings(df, columns, inplace=False)[source]

Convert byte string columns to strings.

This removes ‘b’ and quotation marks from string columns. For some reason encode doesn’t work on data returned from hdf5, hence this approach is a bit hacky.

Parameters:
  • df – The input dataframe.

  • columns – The names of the columns to convert to string types.

  • inplace – If True, perform operation in place.
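
As a rough illustration of stripping the ‘b’ prefix and quotation marks, the sketch below uses a regular expression in plain pandas. It assumes the values are already strings that look like byte-string literals; the actual obsplus approach may differ, and the sample data is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"station": ["b'UU'", 'b"TA"'], "count": [1, 2]})

# Match a leading b, an opening quote, the payload, and a closing quote.
pattern = r"""^b['"](.*)['"]$"""

out = df.copy()
for col in ["station"]:
    # Keep only the captured payload between the quotes.
    out[col] = out[col].str.replace(pattern, r"\1", regex=True)
```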

obsplus.utils.pd.filter_df(df, **kwargs)[source]

Determine if each row of the dataframe meets some filter requirements.

Parameters:
  • df (DataFrame) – The input dataframe.

  • kwargs – Any condition to check against columns of df. Can be a single value or a collection of values (to check isin on columns). Str arguments can also use unix style matching.

Return type:

array

Returns:

A boolean array of the same length as df indicating if each row meets the requirements.
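
The three kinds of condition described above (single value, collection, unix-style pattern) can be sketched with plain pandas and the standard library. This is an illustration only, not the obsplus implementation. Combining conditions with a logical AND is an assumption, and the sample data is hypothetical.

```python
import fnmatch

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"network": ["UU", "UU", "TA"], "station": ["TMU", "SRU", "M11A"]}
)

# Single value: plain equality.
single = (df["network"] == "UU").to_numpy()
# Collection of values: membership test with isin.
member = df["station"].isin(["TMU", "M11A"]).to_numpy()
# Unix-style pattern: test each string against the glob.
glob = np.array([fnmatch.fnmatchcase(s, "*U") for s in df["station"]])
# Assumed here: conditions on several columns combine with a logical AND.
mask = single & glob
```

The resulting `mask` can be used directly for selection, for example `df[mask]`.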

obsplus.utils.pd.filter_index(index, network=None, station=None, location=None, channel=None, starttime=None, endtime=None, **kwargs)[source]

Filter a waveform index dataframe based on nslc codes and start/end times.

Parameters:
  • index – A dataframe to filter which should have the corresponding columns to any non-None parameters used in filter.

  • network – A network code as defined by seed standards.

  • station – A station code as defined by seed standards.

  • location – A location code as defined by seed standards.

  • channel – A channel code as defined by seed standards.

  • starttime – The starttime of interest.

  • endtime – The endtime of interest.

  • kwargs – Additional kwargs are used as filters.

Returns:

A numpy array of boolean values indicating if each row met the filter requirements.

obsplus.utils.pd.get_regex(seed_str)[source]

Compile and cache a regex for str queries.
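
A compile-and-cache helper of this kind can be sketched with the standard library. The function name `get_regex_sketch` is hypothetical, and translating unix-style patterns with `fnmatch.translate` is an assumption about what the query strings look like.

```python
import fnmatch
import re
from functools import lru_cache


@lru_cache(maxsize=None)
def get_regex_sketch(seed_str):
    """Hypothetical stand-in: compile a unix-style pattern once, then reuse it."""
    return re.compile(fnmatch.translate(seed_str))


first = get_regex_sketch("BH?")
second = get_regex_sketch("BH?")  # served from the cache
```

Caching pays off when the same pattern is applied to many indexes, since compilation happens only on the first call.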

obsplus.utils.pd.get_seed_id_series(df, null_codes=(None, '--', 'None', 'nan', 'null', nan), subset=None)[source]

Create a series of seed_ids from a dataframe with required columns.

The seed id series contains strings of the form:

network.station.location.channel

Any “nullish” values (defined by the parameter null_codes) will be replaced with an empty string.

Parameters:
  • df (DataFrame) –

    Any Dataframe that has columns with str dtype named:

    network, station, location, channel

  • null_codes (Optional[Any]) – Codes which should be replaced with a blank string.

  • subset (Optional[Sequence[str]]) – Used to select a subset of the full seed_id. For example, (‘network’, ‘station’) would return a series of network.station.

Return type:

A series of concatenated seed_id codes.

Examples

>>> import obsplus
>>> import obspy
>>> from obsplus.utils.pd import get_seed_id_series
>>> # Get a dataframe with only network station location channel columns
>>> inv = obspy.read_inventory()
>>> NSLC = ['network', 'station', 'location', 'channel']
>>> df = obsplus.stations_to_df(inv)[NSLC]
>>> out = get_seed_id_series(df)
>>> # Get a series of network.station
>>> net_sta = get_seed_id_series(df, subset=('network', 'station'))

obsplus.utils.pd.get_waveforms_bulk_args(df, time_dtype='utcdatetime')[source]

Get the inputs for get_waveforms_bulk from a dataframe.

Parameters:
  • df (DataFrame) –

    A dataframe with required columns:

    network, station, location, channel, starttime, endtime

  • time_dtype (str) – Dtype to use for the starttime and endtime

Return type:

A list of tuples [(network, station, location, channel, starttime, endtime),]
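
The shape of the returned list can be illustrated with plain pandas. This sketch only shows how rows map to tuples. It leaves the times as strings rather than converting them to the requested time dtype, and the sample rows are hypothetical.

```python
import pandas as pd

cols = ["network", "station", "location", "channel", "starttime", "endtime"]
df = pd.DataFrame(
    [
        ["UU", "TMU", "01", "HHZ", "2020-01-01T00:00:00", "2020-01-01T00:01:00"],
        ["UU", "SRU", "", "BHZ", "2020-01-01T00:00:00", "2020-01-01T00:02:00"],
    ],
    columns=cols,
)

# One plain tuple per row, in the column order a bulk request expects.
bulk = list(df[cols].itertuples(index=False, name=None))
```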

obsplus.utils.pd.join_str_columns(df, columns, join_char='.')[source]

Join string columns on a dataframe together.

Parameters:
  • df (DataFrame) – The input dataframe with columns listed in columns parameter.

  • columns (Sequence[str]) – The columns to be joined. Must be part of df.

  • join_char (str) – The string to join the columns together.

Return type:

Series
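
An equivalent join in plain pandas can be written with `Series.str.cat`. This is a sketch with hypothetical sample data, not the obsplus implementation.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "network": ["UU", "TA"],
        "station": ["TMU", "M11A"],
        "channel": ["HHZ", "BHZ"],
    }
)
columns = ["network", "station", "channel"]

# Concatenate the first column with the remaining ones, row by row.
joined = df[columns[0]].str.cat([df[c] for c in columns[1:]], sep=".")
```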

obsplus.utils.pd.order_columns(df, required_columns, drop_columns=False, fill_missing=True)[source]

Order a dataframe’s columns and ensure it has required columns.

Parameters:
  • df (DataFrame) – The input dataframe.

  • required_columns (Sequence) – A sequence that contains the column names.

  • drop_columns – If True, drop columns not in required_columns.

  • fill_missing – If True, create missing required columns and fill with nullish values.

Return type:

pd.DataFrame
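
The ordering and filling described above can be approximated with `DataFrame.reindex`. This is a sketch under two assumptions: that required columns come first, followed by the remaining columns in their original order, and that missing columns are filled with NaN. The sample data is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"station": ["TMU"], "extra": [1], "network": ["UU"]})
required = ["network", "station", "channel"]

# Required columns first, then any remaining columns in their original order.
rest = [c for c in df.columns if c not in required]
# reindex creates the missing 'channel' column and fills it with NaN.
ordered = df.reindex(columns=required + rest)
# Restricting to the required list drops everything else.
strict = df.reindex(columns=required)
```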

obsplus.utils.pd.replace_or_swallow(df, replace)[source]

Replace values in a dataframe with new values.

Parameters:
  • df (DataFrame) – The dataframe for which the values will be replaced.

  • replace (dict) – A dict of {old_value: new_value}.

Return type:

DataFrame