WaveBank
WaveBank is an in-process database for accessing seismic time-series data. Any directory structure containing ObsPy-readable waveforms can be used as the data source. WaveBank uses a simple indexing scheme and the Hierarchical Data Format (HDF5) to keep track of each Trace in the directory. Without WaveBank (or a similar program), applications have to implement their own data organization and access logic, which is tedious and clutters application code. WaveBank provides a better way.
Load Example Data
This tutorial demonstrates the use of WaveBank on two different obsplus datasets.
The first dataset, Crandall Canyon, only has event waveform files. The second only has continuous data from two TA stations. We start by loading these datasets, making a temporary copy of each, and getting the paths to their waveform directories.
[1]:
%%capture
import obsplus
# make sure datasets are downloaded and copy them to temporary
# directories to make sure no accidental changes are made
crandall_dataset = obsplus.load_dataset('crandall_test').copy()
ta_dataset = obsplus.load_dataset('ta_test').copy()
# get path to waveform directories
crandall_path = crandall_dataset.waveform_path
ta_path = ta_dataset.waveform_path
[2]:
crandall_path
[2]:
PosixPath('/home/runner/opsdata/crandall_test/waveforms')
Create a WaveBank object
To create a WaveBank instance, simply pass the class a path to the waveform directory.
[3]:
bank = obsplus.WaveBank(crandall_path)
Calling the update_index method on the bank ensures the index is up to date. It iterates through all files with a timestamp later than the last time update_index was run.
Note: If the index has not yet been created, or new files have been added, update_index needs to be called.
[4]:
bank.update_index()
[4]:
WaveBank(base_path=/home/runner/opsdata/crandall_test/waveforms)
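The incremental behaviour described above can be pictured as a scan for files modified since the previous update. The sketch below is not obsplus's implementation; it is a minimal standard-library illustration, and the helper name find_files_newer_than and the timestamps are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def find_files_newer_than(base_path, last_updated):
    """Return sorted paths under base_path modified after last_updated (epoch seconds)."""
    newer = []
    for root, _dirs, files in os.walk(base_path):
        for name in files:
            path = Path(root) / name
            if path.stat().st_mtime > last_updated:
                newer.append(path)
    return sorted(newer)


# Demonstrate with a throwaway directory holding one old and one new file.
with tempfile.TemporaryDirectory() as tmp:
    old_file = Path(tmp) / "old.mseed"
    new_file = Path(tmp) / "new.mseed"
    old_file.touch()
    new_file.touch()
    os.utime(old_file, (1_000, 1_000))  # pretend it was modified long ago
    os.utime(new_file, (3_000, 3_000))  # pretend it was modified recently
    last_update_time = 2_000  # hypothetical time of the previous index update
    files_to_index = find_files_newer_than(tmp, last_update_time)
    new_names = [p.name for p in files_to_index]
    print(new_names)
```

Only files newer than the stored timestamp need to be read again, which is why repeated index updates on a large, mostly unchanged archive can stay cheap.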
Get waveforms
Waveforms can be retrieved from the directory with the get_waveforms method. This method has the same signature as the get_waveforms methods of ObsPy clients, so they can be used interchangeably:
[5]:
import obspy
t1 = obspy.UTCDateTime('2007-08-06T01-44-48')
t2 = t1 + 60
st = bank.get_waveforms(starttime=t1, endtime=t2)
print(st[:5])  # print first 5 traces
5 Trace(s) in Stream:
TA.O15A..BHE | 2007-08-06T01:44:48.000000Z - 2007-08-06T01:45:48.000000Z | 40.0 Hz, 2401 samples
TA.O15A..BHN | 2007-08-06T01:44:47.999998Z - 2007-08-06T01:45:47.999998Z | 40.0 Hz, 2401 samples
TA.O15A..BHZ | 2007-08-06T01:44:47.999998Z - 2007-08-06T01:45:47.999998Z | 40.0 Hz, 2401 samples
TA.O16A..BHE | 2007-08-06T01:44:48.000000Z - 2007-08-06T01:45:48.000000Z | 40.0 Hz, 2401 samples
TA.O16A..BHN | 2007-08-06T01:44:48.000000Z - 2007-08-06T01:45:48.000000Z | 40.0 Hz, 2401 samples
WaveBank can filter on channels, locations, stations, networks, etc. using Unix-style wildcards (such as ? and *) or regular expressions.
[6]:
st2 = bank.get_waveforms(network='UU', starttime=t1, endtime=t2)
# ensure only UU traces were returned
for tr in st2:
    assert tr.stats.network == 'UU'
print(st2[:5]) # print first 5 traces
5 Trace(s) in Stream:
UU.CTU..HHE | 2007-08-06T01:44:47.994000Z - 2007-08-06T01:45:47.994000Z | 100.0 Hz, 6001 samples
UU.CTU..HHN | 2007-08-06T01:44:47.994000Z - 2007-08-06T01:45:47.994000Z | 100.0 Hz, 6001 samples
UU.CTU..HHZ | 2007-08-06T01:44:47.994000Z - 2007-08-06T01:45:47.994000Z | 100.0 Hz, 6001 samples
UU.MPU..HHE | 2007-08-06T01:44:47.992000Z - 2007-08-06T01:45:47.992000Z | 100.0 Hz, 6001 samples
UU.MPU..HHN | 2007-08-06T01:44:47.992000Z - 2007-08-06T01:45:47.992000Z | 100.0 Hz, 6001 samples
[7]:
st = bank.get_waveforms(starttime=t1, endtime=t2, station='O1??', channel='BH[NE]')
# test returned traces
for tr in st:
    assert tr.stats.starttime >= t1 - .00001
    assert tr.stats.endtime <= t2 + .00001
    assert tr.stats.station.startswith('O1')
    assert tr.stats.channel.startswith('BH')
    assert tr.stats.channel[-1] in {'N', 'E'}
print(st)
6 Trace(s) in Stream:
TA.O15A..BHE | 2007-08-06T01:44:48.000000Z - 2007-08-06T01:45:48.000000Z | 40.0 Hz, 2401 samples
TA.O15A..BHN | 2007-08-06T01:44:47.999998Z - 2007-08-06T01:45:47.999998Z | 40.0 Hz, 2401 samples
TA.O16A..BHE | 2007-08-06T01:44:48.000000Z - 2007-08-06T01:45:48.000000Z | 40.0 Hz, 2401 samples
TA.O16A..BHN | 2007-08-06T01:44:48.000000Z - 2007-08-06T01:45:48.000000Z | 40.0 Hz, 2401 samples
TA.O18A..BHE | 2007-08-06T01:44:47.999998Z - 2007-08-06T01:45:47.999998Z | 40.0 Hz, 2401 samples
TA.O18A..BHN | 2007-08-06T01:44:47.999998Z - 2007-08-06T01:45:47.999998Z | 40.0 Hz, 2401 samples
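To see why the patterns 'O1??' and 'BH[NE]' select the traces above, the wildcard rules can be tried out in isolation. The sketch below uses Python's standard fnmatch module, which follows similar Unix-style conventions; WaveBank's own matching code may differ in detail, and the station and channel lists are just illustrative values taken from this tutorial.

```python
from fnmatch import fnmatchcase

stations = ["O15A", "O16A", "O18A", "R16A", "R17A"]
channels = ["BHE", "BHN", "BHZ"]

# '?' matches exactly one character; '[NE]' matches either 'N' or 'E'.
matched_stations = [s for s in stations if fnmatchcase(s, "O1??")]
matched_channels = [c for c in channels if fnmatchcase(c, "BH[NE]")]

print(matched_stations)
print(matched_channels)
```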
WaveBank also has a get_waveforms_bulk method for efficiently retrieving a large number of streams.
[8]:
args = [  # in practice this list may contain hundreds or thousands of requests
    ('TA', 'O15A', '', 'BHZ', t1 - 5, t2 - 5),
    ('UU', 'SRU', '', 'HHZ', t1, t2),
]
st = bank.get_waveforms_bulk(args)
print(st)
2 Trace(s) in Stream:
TA.O15A..BHZ | 2007-08-06T01:44:42.999998Z - 2007-08-06T01:45:42.999998Z | 40.0 Hz, 2401 samples
UU.SRU..HHZ | 2007-08-06T01:44:47.995000Z - 2007-08-06T01:45:47.995000Z | 100.0 Hz, 6001 samples
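Bulk request lists are usually generated rather than typed by hand. The following sketch shows one possible way to build the tuples from SEED identifier strings; make_bulk_requests is a hypothetical helper (not part of obsplus), and datetime objects stand in for obspy.UTCDateTime so the example runs without ObsPy installed.

```python
from datetime import datetime


def make_bulk_requests(seed_ids, starttime, endtime):
    """Split 'NET.STA.LOC.CHA' strings into bulk-request tuples."""
    requests = []
    for seed_id in seed_ids:
        network, station, location, channel = seed_id.split(".")
        requests.append((network, station, location, channel, starttime, endtime))
    return requests


# Stand-in times; with ObsPy these would be UTCDateTime objects.
start = datetime(2007, 8, 6, 1, 44, 48)
end = datetime(2007, 8, 6, 1, 45, 48)
bulk = make_bulk_requests(["TA.O15A..BHZ", "UU.SRU..HHZ"], start, end)
print(bulk[0][:4])
```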
Yield waveforms
WaveBank also provides a generator, yield_waveforms, for iterating over large amounts of continuous waveform data. The following example shows how to get streams of one hour duration with one minute of overlap between slices.
The first step is to create a bank on a dataset which has continuous data. The example below uses the TA dataset.
[9]:
ta_bank = obsplus.WaveBank(ta_path)
[10]:
# iterate over one day of TA data in one-hour chunks
ta_t1 = obspy.UTCDateTime('2007-02-15')
ta_t2 = obspy.UTCDateTime('2007-02-16')
for st in ta_bank.yield_waveforms(starttime=ta_t1, endtime=ta_t2, duration=3600, overlap=60):
    print(f'got {len(st)} streams from {st[0].stats.starttime} to {st[0].stats.endtime}')
got 6 streams from 2007-02-15T00:00:09.999998Z to 2007-02-15T01:00:59.999998Z
got 6 streams from 2007-02-15T00:59:59.999998Z to 2007-02-15T02:00:59.999998Z
got 6 streams from 2007-02-15T01:59:59.999998Z to 2007-02-15T03:00:59.999998Z
got 6 streams from 2007-02-15T02:59:59.999998Z to 2007-02-15T04:00:59.999998Z
got 6 streams from 2007-02-15T03:59:59.999998Z to 2007-02-15T05:00:59.999998Z
got 6 streams from 2007-02-15T04:59:59.999998Z to 2007-02-15T06:00:59.999998Z
got 6 streams from 2007-02-15T05:59:59.999998Z to 2007-02-15T07:00:59.999998Z
got 6 streams from 2007-02-15T06:59:59.999998Z to 2007-02-15T08:00:59.999998Z
got 6 streams from 2007-02-15T07:59:59.999998Z to 2007-02-15T09:00:59.999998Z
got 6 streams from 2007-02-15T08:59:59.999998Z to 2007-02-15T10:00:59.999998Z
got 6 streams from 2007-02-15T09:59:59.999998Z to 2007-02-15T11:00:59.999998Z
got 6 streams from 2007-02-15T10:59:59.999998Z to 2007-02-15T12:00:59.999998Z
got 6 streams from 2007-02-15T11:59:59.999998Z to 2007-02-15T13:00:59.999998Z
got 6 streams from 2007-02-15T12:59:59.999998Z to 2007-02-15T14:00:59.999998Z
got 6 streams from 2007-02-15T13:59:59.999998Z to 2007-02-15T15:00:59.999998Z
got 6 streams from 2007-02-15T14:59:59.999998Z to 2007-02-15T16:00:59.999998Z
got 6 streams from 2007-02-15T15:59:59.999998Z to 2007-02-15T17:00:59.999998Z
got 6 streams from 2007-02-15T16:59:59.999998Z to 2007-02-15T18:00:59.999998Z
got 6 streams from 2007-02-15T17:59:59.999998Z to 2007-02-15T19:00:59.999998Z
got 6 streams from 2007-02-15T18:59:59.999998Z to 2007-02-15T20:00:59.999998Z
got 6 streams from 2007-02-15T19:59:59.999998Z to 2007-02-15T21:00:59.999998Z
got 6 streams from 2007-02-15T20:59:59.999998Z to 2007-02-15T22:00:59.999998Z
got 6 streams from 2007-02-15T21:59:59.999998Z to 2007-02-15T23:00:59.999998Z
got 6 streams from 2007-02-15T22:59:59.999998Z to 2007-02-16T00:00:59.999998Z
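The printed windows suggest that each chunk starts one duration after the previous one and extends by the overlap past its nominal end. The sketch below reproduces that windowing with the standard library only. It is inferred from the output above rather than taken from obsplus's source, time_chunks is a hypothetical name, and datetime stands in for obspy.UTCDateTime.

```python
from datetime import datetime, timedelta


def time_chunks(start, end, duration, overlap=0.0):
    """Yield (chunk_start, chunk_end) pairs; each chunk is padded by overlap seconds."""
    step = timedelta(seconds=duration)
    pad = timedelta(seconds=overlap)
    current = start
    while current < end:
        # Clip the nominal end to the requested end, then add the overlap.
        yield current, min(current + step, end) + pad
        current += step


chunks = list(time_chunks(datetime(2007, 2, 15), datetime(2007, 2, 16), 3600, 60))
print(len(chunks))
print(chunks[0])
```

Under these assumptions, consecutive chunks share the last 60 seconds of data, which helps when processing (filtering, detection) needs some context at the window edges.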
Put waveforms
Files can be added to the bank by passing a stream or trace to the bank.put_waveforms method. WaveBank does not merge files, so overlapping data may occur if care is not taken.
[11]:
# show that no data for RJOB is in the bank
st = bank.get_waveforms(station='RJOB')
assert len(st) == 0
print(st)
0 Trace(s) in Stream:
[12]:
# add the default stream to the archive (which contains data for RJOB)
bank.put_waveforms(obspy.read())
st_out = bank.get_waveforms(station='RJOB')
# test output
assert len(st_out)
for tr in st_out:
    assert tr.stats.station == 'RJOB'
print(st_out)
3 Trace(s) in Stream:
BW.RJOB..EHE | 2009-08-24T00:20:03.000000Z - 2009-08-24T00:20:32.990000Z | 100.0 Hz, 3000 samples
BW.RJOB..EHN | 2009-08-24T00:20:03.000000Z - 2009-08-24T00:20:32.990000Z | 100.0 Hz, 3000 samples
BW.RJOB..EHZ | 2009-08-24T00:20:03.000000Z - 2009-08-24T00:20:32.990000Z | 100.0 Hz, 3000 samples
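Because the bank does not merge files, one way to take care is to compare the time span of new data with spans already stored for the same channel before adding it. The check below is a minimal sketch of that interval test; spans_overlap is a hypothetical helper, the times are illustrative, and datetime stands in for obspy.UTCDateTime.

```python
from datetime import datetime


def spans_overlap(start_a, end_a, start_b, end_b):
    """Return True if the closed intervals [start_a, end_a] and [start_b, end_b] intersect."""
    return start_a <= end_b and start_b <= end_a


# A span already in the archive and two candidate spans to be added.
existing = (datetime(2009, 8, 24, 0, 20, 3), datetime(2009, 8, 24, 0, 20, 32))
candidate_overlapping = (datetime(2009, 8, 24, 0, 20, 30), datetime(2009, 8, 24, 0, 21, 0))
candidate_later = (datetime(2009, 8, 24, 0, 21, 0), datetime(2009, 8, 24, 0, 22, 0))

has_overlap = spans_overlap(*existing, *candidate_overlapping)
no_overlap = spans_overlap(*existing, *candidate_later)
print(has_overlap, no_overlap)
```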
Availability
WaveBank can be used to get the availability of data. The output can be either a dataframe or a list of tuples of the form [(network, station, location, channel, min_starttime, max_endtime)].
[13]:
# get a dataframe of availability by seed ids and timestamps
bank.get_availability_df(channel='BHE', station='[OR]*')
[13]:
| | network | station | location | channel | starttime | endtime |
|---|---|---|---|---|---|---|
| 0 | TA | O15A | | BHE | 2007-08-06 01:44:38.825000 | 2007-08-07 21:43:51.124998 |
| 1 | TA | O16A | | BHE | 2007-08-06 01:44:38.825000 | 2007-08-07 21:43:51.125000 |
| 2 | TA | O18A | | BHE | 2007-08-06 01:44:38.824998 | 2007-08-07 21:43:51.125000 |
| 3 | TA | R16A | | BHE | 2007-08-07 02:04:54.500000 | 2007-08-07 21:43:51.125000 |
| 4 | TA | R17A | | BHE | 2007-08-06 01:44:38.825000 | 2007-08-07 21:43:51.125000 |
[14]:
# get list of tuples of availability
bank.availability(channel='BHE', station='[OR]*')
[14]:
[('TA',
'O15A',
'',
'BHE',
2007-08-06T01:44:38.825000Z,
2007-08-07T21:43:51.124998Z),
('TA',
'O16A',
'',
'BHE',
2007-08-06T01:44:38.825000Z,
2007-08-07T21:43:51.125000Z),
('TA',
'O18A',
'',
'BHE',
2007-08-06T01:44:38.824998Z,
2007-08-07T21:43:51.125000Z),
('TA',
'R16A',
'',
'BHE',
2007-08-07T02:04:54.500000Z,
2007-08-07T21:43:51.125000Z),
('TA',
'R17A',
'',
'BHE',
2007-08-06T01:44:38.825000Z,
2007-08-07T21:43:51.125000Z)]
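Conceptually, each availability tuple is the earliest start time and latest end time over all indexed segments of a channel. The sketch below illustrates that reduction on hypothetical index rows; it is not WaveBank's implementation, summarize_availability is an invented name, and times are plain floats (seconds) to keep the example dependency-free.

```python
from collections import defaultdict

# Hypothetical index rows: (network, station, location, channel, starttime, endtime).
index_rows = [
    ("TA", "O15A", "", "BHE", 0.0, 70.0),
    ("TA", "O15A", "", "BHE", 500.0, 570.0),
    ("TA", "O16A", "", "BHE", 10.0, 80.0),
]


def summarize_availability(rows):
    """Collapse rows to one (net, sta, loc, cha, min_start, max_end) tuple per channel."""
    spans = defaultdict(lambda: [float("inf"), float("-inf")])
    for net, sta, loc, cha, start, end in rows:
        span = spans[(net, sta, loc, cha)]
        span[0] = min(span[0], start)
        span[1] = max(span[1], end)
    return [key + (span[0], span[1]) for key, span in sorted(spans.items())]


availability = summarize_availability(index_rows)
print(availability)
```

Note that a summary of this kind says nothing about gaps between the first and last sample, which is what the gap and uptime methods are for.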
Get Gaps and uptime
WaveBank can return a dataframe of missing data with the get_gaps_df method, and a dataframe of reliability statistics with the get_uptime_df method. These are useful for assessing the completeness of an archive of continuous data.
[15]:
bank.get_gaps_df(channel='BHE', station='O*').head()
[15]:
| | network | station | location | channel | starttime | endtime | sampling_period | path | gap_duration |
|---|---|---|---|---|---|---|---|---|---|
| 0 | TA | O15A | | BHE | 2007-08-06 01:45:48.799998 | 2007-08-06 08:48:30.024998 | 0 days 00:00:00.025000 | TA.O15A..BHE__20070806T014438Z__20070806T01454... | 0 days 07:02:41.225000 |
| 1 | TA | O15A | | BHE | 2007-08-06 08:49:39.999998 | 2007-08-06 10:47:15.624998 | 0 days 00:00:00.025000 | TA.O15A..BHE__20070806T084830Z__20070806T08494... | 0 days 01:57:35.625000 |
| 2 | TA | O15A | | BHE | 2007-08-06 10:48:25.599998 | 2007-08-07 02:04:54.499998 | 0 days 00:00:00.025000 | TA.O15A..BHE__20070806T104715Z__20070806T10482... | 0 days 15:16:28.900000 |
| 3 | TA | O15A | | BHE | 2007-08-07 02:06:04.474998 | 2007-08-07 02:14:14.100000 | 0 days 00:00:00.025000 | TA.O15A..BHE__20070807T020454Z__20070807T02060... | 0 days 00:08:09.625002 |
| 4 | TA | O15A | | BHE | 2007-08-07 02:15:24.074998 | 2007-08-07 03:44:08.474998 | 0 days 00:00:00.025000 | TA.O15A..BHE__20070807T021414Z__20070807T02152... | 0 days 01:28:44.400000 |
[16]:
ta_bank.get_uptime_df()
[16]:
| | network | station | location | channel | starttime | endtime | duration | gap_duration | uptime | availability |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TA | M11A | | VHE | 2007-02-15 00:00:09.999998 | 2007-02-24 23:59:59.999998 | 9 days 23:59:50 | 0 days | 9 days 23:59:50 | 1.0 |
| 1 | TA | M11A | | VHN | 2007-02-15 00:00:09.999998 | 2007-02-24 23:59:59.999998 | 9 days 23:59:50 | 0 days | 9 days 23:59:50 | 1.0 |
| 2 | TA | M11A | | VHZ | 2007-02-15 00:00:09.999998 | 2007-02-24 23:59:59.999998 | 9 days 23:59:50 | 0 days | 9 days 23:59:50 | 1.0 |
| 3 | TA | M14A | | VHE | 2007-02-15 00:00:00.000003 | 2007-02-25 00:00:00.000003 | 10 days 00:00:00 | 0 days | 10 days 00:00:00 | 1.0 |
| 4 | TA | M14A | | VHN | 2007-02-15 00:00:00.000003 | 2007-02-25 00:00:00.000003 | 10 days 00:00:00 | 0 days | 10 days 00:00:00 | 1.0 |
| 5 | TA | M14A | | VHZ | 2007-02-15 00:00:00.000004 | 2007-02-25 00:00:00.000004 | 10 days 00:00:00 | 0 days | 10 days 00:00:00 | 1.0 |
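The relationship between segments, gaps, and the availability fraction can be sketched in a few lines. This is an illustration of the idea only, not how get_gaps_df or get_uptime_df are implemented: the real methods may, for example, account for the sampling period or a minimum gap length, which this sketch ignores. The function names and the segment times (plain seconds) are hypothetical.

```python
def find_gaps(segments):
    """Return (gap_start, gap_end) pairs between sorted (start, end) segments."""
    gaps = []
    ordered = sorted(segments)
    current_end = ordered[0][1]
    for start, end in ordered[1:]:
        if start > current_end:
            gaps.append((current_end, start))
        current_end = max(current_end, end)
    return gaps


def availability_fraction(segments):
    """Fraction of the total span (first start to last end) covered by data."""
    ordered = sorted(segments)
    total = max(end for _, end in ordered) - ordered[0][0]
    gap_total = sum(end - start for start, end in find_gaps(segments))
    return (total - gap_total) / total


# Hypothetical (start, end) times in seconds for one channel.
segments = [(0.0, 60.0), (60.0, 120.0), (150.0, 200.0)]
gaps = find_gaps(segments)
fraction = availability_fraction(segments)
print(gaps, fraction)
```

Here the uptime is the total span minus the summed gap durations, and the availability is uptime divided by the total span, so a channel with no gaps has an availability of 1.0, as in the table above.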
Read index
WaveBank can return a dataframe of the index with the read_index method, although in most cases this shouldn’t be needed.
[17]:
ta_bank.read_index().head()
[17]:
| | network | station | location | channel | starttime | endtime | sampling_period | path |
|---|---|---|---|---|---|---|---|---|
| 0 | TA | M11A | | VHN | 2007-02-19 14:59:59.999998 | 2007-02-19 15:59:59.999998 | 0 days 00:00:10 | TA/M11A/VHN/2007-02-19T15-00-00.mseed |
| 1 | TA | M14A | | VHN | 2007-02-19 15:00:00.000003 | 2007-02-19 16:00:00.000003 | 0 days 00:00:10 | TA/M11A/VHN/2007-02-19T15-00-00.mseed |
| 2 | TA | M11A | | VHN | 2007-02-15 23:59:59.999998 | 2007-02-16 00:59:59.999998 | 0 days 00:00:10 | TA/M11A/VHN/2007-02-16T00-00-00.mseed |
| 3 | TA | M14A | | VHN | 2007-02-16 00:00:00.000003 | 2007-02-16 01:00:00.000003 | 0 days 00:00:10 | TA/M11A/VHN/2007-02-16T00-00-00.mseed |
| 4 | TA | M11A | | VHN | 2007-02-20 15:59:59.999998 | 2007-02-20 16:59:59.999998 | 0 days 00:00:10 | TA/M11A/VHN/2007-02-20T16-00-00.mseed |
Similar Projects
WaveBank is a useful tool, but it may not be a good fit for every application. Check out the following projects as well:
ObsPy has a way to visualize the availability of waveform data in a directory using obspy-scan. If you prefer a graphical option to working with DataFrames, this might be for you.
ObsPy also has a filesystem client for working with SeisComP structured archives.
IRIS released a miniSEED indexing program called mseedindex, which has an ObsPy API.