obsplus.datasets.dataset module

Module for loading (and downloading) datasets.

class obsplus.datasets.dataset.DataSet(base_path=None)[source]

Bases: ABC

Abstract Base Class for downloading and serving datasets.

This is not intended to be used directly, but rather through subclassing.

Parameters:

base_path – The path to which the dataset will be saved.

data_path

The path containing the data. By default it is base_path / name.

source_path

The path which contains the original files included in the dataset before download. By default this is found in the same directory as the dataset’s code (.py) file in a folder with the same name as the dataset.

Notes

Importantly, each dataset references two directories, the source_path and data_path. The source_path contains all data included within the dataset and should not be altered. The data_path has a copy of everything in the source_path, plus the files created during the downloading process.

The base_path (the parent of data_path) is resolved for each dataset using the following priorities:

  1. The base_path provided to DataSet’s __init__ method.

  2. A .data_path.txt file stored in the data source.

  3. The OPSDATA_PATH environment variable.

  4. The opsdata_path variable from obsplus.constants

By default the data will be downloaded to a folder called “opsdata” in the user’s home directory. This can be changed by setting the OPSDATA_PATH environment variable.
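The priority order above can be sketched as a small function. This is only an illustration of the documented order, not obsplus’s implementation: resolve_base_path is a hypothetical helper, and the assumption that .data_path.txt holds a single path on one line is not confirmed by this page.

```python
import os
import tempfile
from pathlib import Path
from typing import Mapping, Optional, Union

PathLike = Union[str, Path]


def resolve_base_path(
    explicit: Optional[PathLike],
    source_path: PathLike,
    environ: Mapping[str, str] = os.environ,
    default: Optional[PathLike] = None,
) -> Path:
    """Hypothetical helper that mirrors the four documented priorities."""
    # 1. A base_path given explicitly always wins.
    if explicit is not None:
        return Path(explicit)
    # 2. A .data_path.txt file in the data source (assumed: one path per file).
    pointer = Path(source_path) / ".data_path.txt"
    if pointer.exists():
        text = pointer.read_text().strip()
        if text:
            return Path(text)
    # 3. The OPSDATA_PATH environment variable.
    env_value = environ.get("OPSDATA_PATH")
    if env_value:
        return Path(env_value)
    # 4. The library default, described above as "opsdata" in the home directory.
    return Path(default) if default is not None else Path.home() / "opsdata"


# Usage: no explicit path and no pointer file, so the environment variable is used.
with tempfile.TemporaryDirectory() as tmp:
    chosen = resolve_base_path(None, tmp, environ={"OPSDATA_PATH": "/data/ops"})
```

Passing the environment as a mapping keeps the sketch easy to exercise without touching the real process environment.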

check_hashes(check_hash=False)[source]

Check that all of the files are present and have the correct hashes.

Parameters:

check_hash – If True check the hash of the files.

Raises:

check_version()[source]

Check the version of the dataset.

Verifies that the version string in the dataset class definition matches the one saved on disk. Returns True if they match; otherwise raises a DataVersionError.

Parameters:

path – Expected path of the version file.

Raises:

DataVersionError – If any version problems are discovered.

Return type:

bool
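A minimal sketch of the kind of check described above, assuming the version is stored as plain text in a single file. The DataVersionError class below is a local stand-in rather than the obsplus exception, and check_version_on_disk and the file name version.txt are hypothetical.

```python
import tempfile
from pathlib import Path


class DataVersionError(ValueError):
    """Local stand-in for the exception named above; not imported from obsplus."""


def check_version_on_disk(expected: str, version_file: Path) -> bool:
    """Return True if the stored version equals ``expected``, else raise."""
    if not version_file.exists():
        raise DataVersionError(f"{version_file} does not exist")
    found = version_file.read_text().strip()
    if found != expected:
        raise DataVersionError(f"expected version {expected}, found {found}")
    return True


# Usage: a matching version passes, a mismatch raises.
with tempfile.TemporaryDirectory() as tmp:
    version_file = Path(tmp) / "version.txt"  # hypothetical file name
    version_file.write_text("0.1.0\n")
    matches = check_version_on_disk("0.1.0", version_file)
    try:
        check_version_on_disk("0.2.0", version_file)
        mismatch_raised = False
    except DataVersionError:
        mismatch_raised = True
```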

copy(deep=True)[source]

Return a copy of the dataset.

Parameters:

deep – If True deep copy the objects attached to the dataset.

Return type:

TypeVar(DataSetType, bound= DataSet)

Notes

This only copies data in memory, not on disk. If you plan to make any changes to the dataset’s on-disk resources, use copy_to() instead.
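The difference between deep=True and deep=False can be illustrated with the standard copy module. ToyDataSet is a hypothetical stand-in used only for this sketch; how DataSet.copy is actually implemented is not shown on this page.

```python
import copy


class ToyDataSet:
    """Hypothetical stand-in holding one in-memory object."""

    def __init__(self):
        self.cache = {"events": ["ev1"]}

    def copy(self, deep=True):
        # A deep copy duplicates attached objects; a shallow copy shares them.
        return copy.deepcopy(self) if deep else copy.copy(self)


ds = ToyDataSet()
shallow = ds.copy(deep=False)
deep = ds.copy(deep=True)

# Mutating the original's cache is visible through the shallow copy only.
ds.cache["events"].append("ev2")
```

After the append, shallow.cache is still the same dictionary as ds.cache, while deep.cache keeps its own independent list.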

copy_to(destination=None)[source]

Copy the dataset to a destination.

If the destination already exists simply do nothing.

Parameters:

destination (Union[str, Path, None]) – The destination to which the dataset is copied. It will be created if it doesn’t exist. If None is provided, use tempfile to create a temporary directory.

Return type:

A new dataset object which refers to the copied files.
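The documented behaviour (create the destination if needed, fall back to a temporary directory, and do nothing when the destination already exists) can be sketched with shutil and tempfile. copy_tree_to is a hypothetical helper that returns a path; the real method returns a new dataset object instead.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional, Union


def copy_tree_to(source: Path, destination: Optional[Union[str, Path]] = None) -> Path:
    """Hypothetical helper: copy ``source`` to ``destination`` unless it exists."""
    if destination is None:
        # mkdtemp creates the directory, so copy into a child that does not exist yet.
        destination = Path(tempfile.mkdtemp()) / Path(source).name
    destination = Path(destination)
    if destination.exists():
        # An existing destination is left untouched.
        return destination
    shutil.copytree(source, destination)
    return destination


# Usage: the second call is a no-op because the destination already exists.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "toy_dataset"
    source.mkdir()
    (source / "readme.txt").write_text("original")
    target = copy_tree_to(source, Path(tmp) / "copy")
    copied_text = (target / "readme.txt").read_text()
    (source / "readme.txt").write_text("changed")
    copy_tree_to(source, target)
    text_after_second_call = (target / "readme.txt").read_text()
```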

create_sha256_hash(path=None, hidden=False)[source]

Create a sha256 hash of the dataset’s data files.

The output is stored in a simple JSON file. Keys are paths (relative to the dataset base path) and values are file hashes.

If you want to update/create the hash file in the dataset’s source this can be done by passing the dataset’s source_path as the path argument.

Parameters:
  • path – The path to which the hash data is saved. If None use data_path.

  • hidden – If True also include hidden files.

Return type:

dict
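A rough sketch of the idea, using hashlib and json from the standard library. hash_data_files is a hypothetical helper. The traversal order, the treatment of files inside hidden directories, and the name of the output file are assumptions rather than obsplus behaviour.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Dict


def hash_data_files(base: Path, hidden: bool = False) -> Dict[str, str]:
    """Hypothetical helper: map relative file paths to sha256 hex digests."""
    base = Path(base)
    hashes: Dict[str, str] = {}
    for path in sorted(base.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(base)
        # Assumption: a file counts as hidden if any path component starts with ".".
        if not hidden and any(part.startswith(".") for part in rel.parts):
            continue
        hashes[rel.as_posix()] = hashlib.sha256(path.read_bytes()).hexdigest()
    return hashes


# Usage: hash a tiny directory and serialize the result as JSON.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "a.txt").write_bytes(b"hello")
    (base / ".hidden").write_bytes(b"secret")
    visible = hash_data_files(base)
    everything = hash_data_files(base, hidden=True)
    as_json = json.dumps(visible, indent=2)
```

Reading each file fully into memory keeps the sketch short; large waveform files would be better hashed in chunks.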

property data_files: Tuple[Path, ...]

Return a tuple of top-level files associated with the dataset.

Hidden files are ignored.

data_loaded = False

property data_path: Path

Return a path to where the dataset’s data was/will be downloaded.

delete_data_directory()[source]

Delete the datafiles of a dataset.

This will force the data to be re-copied from the source files and download logic to be run.

download_events()[source]

Method to ensure the events have been downloaded.

Events should be written in an obspy-readable format to self.event_path. If not implemented this method will create an empty directory.

Return type:

None

download_stations()[source]

Method to ensure inventories have been downloaded.

Station data should be written in an obspy-readable format to self.station_path. Since there is not yet a functional StationBank, this method must be implemented by subclasses.

Return type:

None

download_waveforms()[source]

Method to ensure waveforms have been downloaded.

Waveforms should be written in an obspy-readable format to self.waveform_path.

Return type:

None

property event_client: EventClient | None

A cached property for an event client.

property event_path: Path

Return the path to the events.

property events_need_downloading: bool

Returns True if event data need to be downloaded.

get_fetcher(**kwargs)[source]

Return a Fetcher from the data.

kwargs are passed to Fetcher’s constructor. See its documentation for acceptable kwargs.

Return type:

Fetcher

classmethod load_dataset(name, silent=False)[source]

Get a loaded dataset.

Will ensure all files are downloaded and the appropriate data are loaded into memory.

Parameters:

name (Union[str, DataSet]) – The name of the dataset to load or a DataSet object. If a DataSet object is passed a copy of it will be returned.

Return type:

TypeVar(DataSetType, bound= DataSet)

Examples

>>> # --- Load an example dataset for testing
>>> import obsplus
>>> ds = obsplus.load_dataset('default_test')
>>> # If you plan to make changes to the dataset be sure to copy it first
>>> # The following will copy all files in the dataset to a tmpdir
>>> ds2 = obsplus.copy_dataset('default_test')
>>> # --- Use dataset clients to load waveforms, stations, and events
>>> cat = ds.event_client.get_events()
>>> st = ds.waveform_client.get_waveforms()
>>> inv = ds.station_client.get_stations()
>>> # --- get a fetcher for more "dataset aware" querying
>>> fetcher = ds.get_fetcher()

abstract property name: str

Name of the dataset.

post_download_hook()[source]

Code to run after any downloads.

pre_download_hook()[source]

Code to run before any downloads.

read_data_version(path=None)[source]

Read the data version from disk.

Return a 3 length tuple from the semantic version string (of the form xx.yy.zz). Raise a DataVersionError if not found.

Return type:

str

property source_path: Path

Return a path to the directory where the data files included with the dataset live.

property station_client: StationClient | None

A cached property for a station client.

property station_path: Path

Return the path to the stations.

property stations_need_downloading: bool

Returns True if station data need to be downloaded.

abstract property version: str

Dataset version. Should be a str of the form x.y.z.

property version_tuple: Tuple[int, int, int]

Return a tuple of the version string.
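As an illustration of why a tuple form is useful, the sketch below parses an x.y.z string into integers so that versions compare numerically rather than lexically. parse_version is a hypothetical helper, not the property’s actual implementation.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Hypothetical helper: turn 'x.y.z' into a tuple of three ints."""
    parts = version.strip().split(".")
    if len(parts) != 3:
        raise ValueError(f"expected a version of the form x.y.z, got {version!r}")
    major, minor, patch = (int(part) for part in parts)
    return major, minor, patch


# Tuples compare element by element, so 0.10.0 is correctly newer than 0.9.1,
# whereas comparing the raw strings would order them the other way round.
newer = parse_version("0.10.0") > parse_version("0.9.1")
```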

property waveform_client: WaveformClient | None

A cached property for a waveform client.

property waveform_path: Path

Return the path to the waveforms.

property waveforms_need_downloading: bool

Returns True if waveform data need to be downloaded.

write_version(path=None)[source]

Write the version string to disk.

obsplus.datasets.dataset.load_dataset(name, silent=False)

Get a loaded dataset.

Will ensure all files are downloaded and the appropriate data are loaded into memory.

Parameters:

name (Union[str, DataSet]) – The name of the dataset to load or a DataSet object. If a DataSet object is passed a copy of it will be returned.

Return type:

TypeVar(DataSetType, bound= DataSet)

Examples

>>> # --- Load an example dataset for testing
>>> import obsplus
>>> ds = obsplus.load_dataset('default_test')
>>> # If you plan to make changes to the dataset be sure to copy it first
>>> # The following will copy all files in the dataset to a tmpdir
>>> ds2 = obsplus.copy_dataset('default_test')
>>> # --- Use dataset clients to load waveforms, stations, and events
>>> cat = ds.event_client.get_events()
>>> st = ds.waveform_client.get_waveforms()
>>> inv = ds.station_client.get_stations()
>>> # --- get a fetcher for more "dataset aware" querying
>>> fetcher = ds.get_fetcher()