obsplus.datasets.dataset module¶
Module for loading (and downloading) data sets.
- class obsplus.datasets.dataset.DataSet(base_path=None)[source]¶
Bases:
ABC
Abstract Base Class for downloading and serving datasets.
This is not intended to be used directly, but rather through subclassing.
- Parameters:
base_path – The path to which the dataset will be saved.
- data_path¶
The path containing the data. By default it is base_path / name.
- source_path¶
The path which contains the original files included in the dataset before download. By default this is found in the same directory as the dataset’s code (.py) file in a folder with the same name as the dataset.
Notes
Importantly, each dataset references two directories, the source_path and data_path. The source_path contains all data included within the dataset and should not be altered. The data_path has a copy of everything in the source_path, plus the files created during the downloading process.
The base_path (the parent of data_path) is resolved for each dataset using the following priorities:
1. The base_path provided to DataSet’s __init__ method.
2. A .data_path.txt file stored in the data source.
3. An environment variable named OPSDATA_PATH.
4. The opsdata_path variable from obsplus.constants.
By default the data will be downloaded to the user’s home directory in a folder called “opsdata”, but this is easily changed by setting the OPSDATA_PATH environment variable.
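The priority order described above can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: `resolve_base_path` and its parameters are hypothetical names, and the `default` argument stands in for the `opsdata_path` constant.

```python
import os
from pathlib import Path
from typing import Optional, Union


def resolve_base_path(
    explicit: Optional[Union[str, Path]] = None,
    source_path: Optional[Union[str, Path]] = None,
    default: Optional[Union[str, Path]] = None,
) -> Path:
    """Hypothetical helper mirroring the documented priority order."""
    # 1. A base_path passed explicitly always wins.
    if explicit is not None:
        return Path(explicit)
    # 2. A .data_path.txt file stored with the dataset source.
    if source_path is not None:
        pointer = Path(source_path) / ".data_path.txt"
        if pointer.exists():
            text = pointer.read_text().strip()
            if text:
                return Path(text)
    # 3. The OPSDATA_PATH environment variable.
    env_value = os.environ.get("OPSDATA_PATH")
    if env_value:
        return Path(env_value)
    # 4. Fall back to a default (assumed here to be ~/opsdata).
    if default is not None:
        return Path(default)
    return Path.home() / "opsdata"
```

Checking the explicit argument first means a caller can always override both the environment and any pointer file.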
- check_hashes(check_hash=False)[source]¶
Check that the files are all there and have the correct hashes.
- Parameters:
check_hash – If True check the hash of the files.
- Raises:
FileHashChangedError – If one of the file hashes is not as expected.
MissingDataFileError – If one of the data files was not downloaded.
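To illustrate the kind of check described here, the sketch below verifies a directory against a mapping of relative paths to sha256 digests. It is an assumption-laden stand-in: `verify_files` and `sha256_of` are hypothetical helpers, and the two exception classes are local placeholders that only borrow the names of the obsplus exceptions.

```python
import hashlib
from pathlib import Path
from typing import Dict


class MissingDataFileError(FileNotFoundError):
    """Local stand-in for the obsplus exception of the same name."""


class FileHashChangedError(ValueError):
    """Local stand-in for the obsplus exception of the same name."""


def sha256_of(path: Path) -> str:
    """Return the hex sha256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_files(base: Path, expected: Dict[str, str], check_hash: bool = False) -> None:
    """Hypothetical check: every expected file exists and, optionally, matches its hash."""
    for rel_path, expected_hash in expected.items():
        path = Path(base) / rel_path
        if not path.exists():
            raise MissingDataFileError(f"{path} was not downloaded")
        # Hashing is optional because it is slow for large waveform archives.
        if check_hash and sha256_of(path) != expected_hash:
            raise FileHashChangedError(f"hash of {path} is not as expected")
```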
- check_version()[source]¶
Check the version of the dataset.
Verifies the version string in the dataset class definition matches the one saved on disk. Returns True if all is well else raises a DataVersionError.
- Parameters:
path – Expected path of the version file.
- Raises:
DataVersionError – If any version problems are discovered.
- Return type:
bool
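A version check of this kind might look roughly like the sketch below. The helper `check_saved_version` is hypothetical, the caller supplies the version file's location (its real name is not stated here), and `DataVersionError` is a local placeholder for the obsplus exception.

```python
from pathlib import Path


class DataVersionError(ValueError):
    """Local stand-in for the obsplus exception of the same name."""


def check_saved_version(version_file: Path, expected: str) -> bool:
    """Hypothetical check: the version on disk must equal the expected one."""
    version_file = Path(version_file)
    if not version_file.exists():
        raise DataVersionError(f"no version file found at {version_file}")
    found = version_file.read_text().strip()
    if found != expected:
        raise DataVersionError(
            f"version on disk ({found}) does not match expected ({expected})"
        )
    return True
```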
- copy(deep=True)[source]¶
Return a copy of the dataset.
- Parameters:
deep – If True deep copy the objects attached to the dataset.
- Return type:
TypeVar
(DataSetType
, bound= DataSet)
Notes
This only copies data in memory, not on disk. If you plan to make any changes to the dataset’s on-disk resources please use copy_to().
- copy_to(destination=None)[source]¶
Copy the dataset to a destination.
If the destination already exists simply do nothing.
- Parameters:
destination (Union[str, Path, None]) – The destination to copy the dataset to. It will be created if it doesn’t exist. If None is provided, use tempfile to create a temporary directory.
- Return type:
A new dataset object which refers to the copied files.
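The copy behaviour described here (do nothing if the destination exists, fall back to a temporary directory when none is given) can be sketched with the standard library alone. `copy_directory` is a hypothetical helper that operates on plain directories rather than dataset objects.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional, Union


def copy_directory(source: Union[str, Path], destination: Optional[Union[str, Path]] = None) -> Path:
    """Hypothetical helper: copy a directory tree unless the target exists."""
    source = Path(source)
    if destination is None:
        # mkdtemp creates the directory itself, so copy into a fresh child of it.
        destination = Path(tempfile.mkdtemp()) / source.name
    destination = Path(destination)
    if destination.exists():
        # Mirror the documented behaviour: an existing destination is left alone.
        return destination
    # copytree creates the destination and any missing parent directories.
    shutil.copytree(source, destination)
    return destination
```

Returning the destination path in both branches lets the caller build a new object around the copied files regardless of whether a copy actually happened.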
- create_sha256_hash(path=None, hidden=False)[source]¶
Create a sha256 hash of the dataset’s data files.
The output is stored in a simple json file. Keys are paths (relative to dataset base path) and values are file hashes.
If you want to update/create the hash file in the dataset’s source this can be done by passing the dataset’s source_path as the path argument.
- Parameters:
path – The path to which the hash data is saved. If None use data_path.
hidden – If True also include hidden files.
- Return type:
dict
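As a rough picture of the resulting file, the sketch below walks a directory and writes a JSON mapping of relative paths to sha256 digests. `build_hash_index` and its `out_path` parameter are hypothetical; the actual file name and traversal rules used by obsplus are not stated here.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict, Union


def build_hash_index(base: Union[str, Path], out_path: Union[str, Path], hidden: bool = False) -> Dict[str, str]:
    """Hypothetical helper: hash every file under base and save the result as JSON."""
    base = Path(base)
    out_path = Path(out_path)
    hashes: Dict[str, str] = {}
    for path in sorted(base.rglob("*")):
        # Skip directories and the index file itself.
        if not path.is_file() or path == out_path:
            continue
        rel = path.relative_to(base)
        # Unless requested, skip anything inside or named as a hidden entry.
        if not hidden and any(part.startswith(".") for part in rel.parts):
            continue
        # read_bytes is fine for a sketch; large files would be hashed in chunks.
        hashes[rel.as_posix()] = hashlib.sha256(path.read_bytes()).hexdigest()
    out_path.write_text(json.dumps(hashes, indent=2))
    return hashes
```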
- property data_files: Tuple[Path, ...]¶
Return a tuple of top-level files associated with the dataset.
Hidden files are ignored.
- data_loaded = False¶
- property data_path: Path¶
Return a path to where the dataset’s data was/will be downloaded.
- delete_data_directory()[source]¶
Delete the data files of a dataset.
This will force the data to be re-copied from the source files and download logic to be run.
- download_events()[source]¶
Method to ensure the events have been downloaded.
Events should be written in an obspy-readable format to self.event_path. If not implemented this method will create an empty directory.
- Return type:
None
- download_stations()[source]¶
Method to ensure inventories have been downloaded.
Station data should be written in an obspy-readable format to self.station_path. Since there is not yet a functional StationBank, this method must be implemented by subclass.
- Return type:
None
- download_waveforms()[source]¶
Method to ensure waveforms have been downloaded.
Waveforms should be written in an obspy-readable format to self.waveform_path.
- Return type:
None
- property event_client: EventClient | None¶
A cached property for an event client
- property event_path: Path¶
Return the path to the events.
- property events_need_downloading: bool¶
Returns True if event data need to be downloaded.
- get_fetcher(**kwargs)[source]¶
Return a Fetcher from the data.
kwargs are passed to Fetcher’s constructor. See its documentation for acceptable kwargs.
- Return type:
Fetcher
- classmethod load_dataset(name, silent=False)[source]¶
Get a loaded dataset.
Will ensure all files are downloaded and the appropriate data are loaded into memory.
- Parameters:
name (Union[str, DataSet]) – The name of the dataset to load or a DataSet object. If a DataSet object is passed, a copy of it will be returned.
- Return type:
TypeVar(DataSetType, bound=DataSet)
Examples
>>> # --- Load an example dataset for testing
>>> import obsplus
>>> ds = obsplus.load_dataset('default_test')
>>> # If you plan to make changes to the dataset be sure to copy it first
>>> # The following will copy all files in the dataset to a tmpdir
>>> ds2 = obsplus.copy_dataset('default_test')
>>> # --- Use dataset clients to load waveforms, stations, and events
>>> cat = ds.event_client.get_events()
>>> st = ds.waveform_client.get_waveforms()
>>> inv = ds.station_client.get_stations()
>>> # --- get a fetcher for more "dataset aware" querying
>>> fetcher = ds.get_fetcher()
- abstract property name: str¶
Name of the dataset
- read_data_version(path=None)[source]¶
Read the data version from disk.
Return a 3 length tuple from the semantic version string (of the form xx.yy.zz). Raise a DataVersionError if not found.
- Return type:
str
- property source_path: Path¶
Return a path to the directory where the data files included with the dataset live.
- property station_client: StationClient | None¶
A cached property for a station client
- property station_path: Path¶
Return the path to the stations.
- property stations_need_downloading: bool¶
Returns True if station data need to be downloaded.
- abstract property version: str¶
Dataset version. Should be a str of the form x.y.z
- property version_tuple: Tuple[int, int, int]¶
Return a tuple of the version string.
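Converting an x.y.z string to a tuple of integers makes versions comparable numerically rather than lexically. A minimal sketch, using a hypothetical `parse_version` helper rather than the property itself:

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Hypothetical helper: split an 'x.y.z' string into three integers."""
    parts = version.strip().split(".")
    if len(parts) != 3:
        raise ValueError(f"expected a version of the form x.y.z, got {version!r}")
    major, minor, patch = (int(part) for part in parts)
    return major, minor, patch
```

Tuples compare element by element, so `parse_version("0.1.10")` sorts after `parse_version("0.1.9")`, whereas the raw strings would sort the other way.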
- property waveform_client: WaveformClient | None¶
A cached property for a waveform client
- property waveform_path: Path¶
Return the path to the waveforms.
- property waveforms_need_downloading: bool¶
Returns True if waveform data need to be downloaded.
- obsplus.datasets.dataset.load_dataset(name, silent=False)¶
Get a loaded dataset.
Will ensure all files are downloaded and the appropriate data are loaded into memory.
- Parameters:
name (Union[str, DataSet]) – The name of the dataset to load or a DataSet object. If a DataSet object is passed, a copy of it will be returned.
- Return type:
TypeVar(DataSetType, bound=DataSet)
Examples
>>> # --- Load an example dataset for testing
>>> import obsplus
>>> ds = obsplus.load_dataset('default_test')
>>> # If you plan to make changes to the dataset be sure to copy it first
>>> # The following will copy all files in the dataset to a tmpdir
>>> ds2 = obsplus.copy_dataset('default_test')
>>> # --- Use dataset clients to load waveforms, stations, and events
>>> cat = ds.event_client.get_events()
>>> st = ds.waveform_client.get_waveforms()
>>> inv = ds.station_client.get_stations()
>>> # --- get a fetcher for more "dataset aware" querying
>>> fetcher = ds.get_fetcher()