WaveBank

WaveBank is an in-process database for accessing seismic time-series data. Any directory structure containing ObsPy-readable waveforms can be used as the data source. WaveBank uses a simple indexing scheme and the Hierarchical Data Format (HDF5) to keep track of each Trace in the directory. Without WaveBank (or a similar program), applications have to implement their own data organization and access logic, which is tedious and clutters application code. WaveBank provides a better way.

Load Example Data

This tutorial will demonstrate the use of WaveBank on two different obsplus datasets.

The first dataset, Crandall Canyon, contains only event waveform files. The second contains only continuous data from two TA stations. We start by loading these datasets, making temporary copies, and getting the paths to their waveform directories.

[1]:
%%capture
import obsplus

# make sure datasets are downloaded and copy them to temporary
# directories to make sure no accidental changes are made
crandall_dataset = obsplus.load_dataset('crandall_test').copy()
ta_dataset = obsplus.load_dataset('ta_test').copy()

# get path to waveform directories
crandall_path = crandall_dataset.waveform_path
ta_path = ta_dataset.waveform_path
[2]:
crandall_path
[2]:
PosixPath('/home/runner/opsdata/crandall_test/waveforms')

Create a WaveBank object

To create a WaveBank instance, pass the class a path to the waveform directory.

[3]:
bank = obsplus.WaveBank(crandall_path)

Calling the update_index method on the bank ensures the index is up to date. The method iterates through all files with timestamps later than the last time update_index was run.

Note: If the index has not yet been created or new files have been added, update_index needs to be called.

[4]:
bank.update_index()
[4]:
WaveBank(base_path=/home/runner/opsdata/crandall_test/waveforms)
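
The idea of only visiting files that changed since the last indexing pass can be sketched with the standard library alone. This is not WaveBank's implementation; the function name `find_files_newer_than`, the file names, and the timestamps are all hypothetical, and the real index stores far more than file paths.

```python
import os
import tempfile
from pathlib import Path


def find_files_newer_than(base_path, last_update):
    """Return files under base_path modified after last_update (epoch seconds)."""
    return sorted(
        path
        for path in Path(base_path).rglob("*")
        if path.is_file() and path.stat().st_mtime > last_update
    )


# Demonstrate with a throwaway directory and explicitly set timestamps.
with tempfile.TemporaryDirectory() as tmp:
    old_file = Path(tmp) / "old.mseed"
    new_file = Path(tmp) / "sub" / "new.mseed"
    new_file.parent.mkdir()
    old_file.write_bytes(b"")
    new_file.write_bytes(b"")
    os.utime(old_file, (1_000, 1_000))  # (atime, mtime) in epoch seconds
    os.utime(new_file, (3_000, 3_000))

    last_update = 2_000  # hypothetical time of the previous indexing pass
    to_index = [p.name for p in find_files_newer_than(tmp, last_update)]

print(to_index)  # ['new.mseed']
```

Only the file modified after the previous pass is selected, which is why a second update on an unchanged directory has little work to do.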

Get waveforms

Waveforms can be retrieved from the directory with the get_waveforms method. This method has the same signature as the get_waveforms methods of ObsPy clients, so they can be used interchangeably:

[5]:
import obspy

t1 = obspy.UTCDateTime('2007-08-06T01-44-48')
t2 = t1 + 60
st = bank.get_waveforms(starttime=t1, endtime=t2)
print(st[:5])  # print first 5 traces
5 Trace(s) in Stream:
TA.O15A..BHE | 2007-08-06T01:44:48.000000Z - 2007-08-06T01:45:48.000000Z | 40.0 Hz, 2401 samples
TA.O15A..BHN | 2007-08-06T01:44:47.999998Z - 2007-08-06T01:45:47.999998Z | 40.0 Hz, 2401 samples
TA.O15A..BHZ | 2007-08-06T01:44:47.999998Z - 2007-08-06T01:45:47.999998Z | 40.0 Hz, 2401 samples
TA.O16A..BHE | 2007-08-06T01:44:48.000000Z - 2007-08-06T01:45:48.000000Z | 40.0 Hz, 2401 samples
TA.O16A..BHN | 2007-08-06T01:44:48.000000Z - 2007-08-06T01:45:48.000000Z | 40.0 Hz, 2401 samples

WaveBank can filter on channels, locations, stations, networks, etc. using Unix-style wildcard strings or regular expressions.

[6]:
st2 = bank.get_waveforms(network='UU', starttime=t1, endtime=t2)

# ensure only UU traces were returned
for tr in st2:
    assert tr.stats.network == 'UU'

print(st2[:5])  # print first 5 traces
5 Trace(s) in Stream:
UU.CTU..HHE | 2007-08-06T01:44:47.994000Z - 2007-08-06T01:45:47.994000Z | 100.0 Hz, 6001 samples
UU.CTU..HHN | 2007-08-06T01:44:47.994000Z - 2007-08-06T01:45:47.994000Z | 100.0 Hz, 6001 samples
UU.CTU..HHZ | 2007-08-06T01:44:47.994000Z - 2007-08-06T01:45:47.994000Z | 100.0 Hz, 6001 samples
UU.MPU..HHE | 2007-08-06T01:44:47.992000Z - 2007-08-06T01:45:47.992000Z | 100.0 Hz, 6001 samples
UU.MPU..HHN | 2007-08-06T01:44:47.992000Z - 2007-08-06T01:45:47.992000Z | 100.0 Hz, 6001 samples
[7]:
st = bank.get_waveforms(starttime=t1, endtime=t2, station='O1??', channel='BH[NE]')

# test returned traces
for tr in st:
    assert tr.stats.starttime >= t1 - .00001
    assert tr.stats.endtime <= t2 + .00001
    assert tr.stats.station.startswith('O1')
    assert tr.stats.channel.startswith('BH')
    assert tr.stats.channel[-1] in {'N', 'E'}

print(st)
6 Trace(s) in Stream:
TA.O15A..BHE | 2007-08-06T01:44:48.000000Z - 2007-08-06T01:45:48.000000Z | 40.0 Hz, 2401 samples
TA.O15A..BHN | 2007-08-06T01:44:47.999998Z - 2007-08-06T01:45:47.999998Z | 40.0 Hz, 2401 samples
TA.O16A..BHE | 2007-08-06T01:44:48.000000Z - 2007-08-06T01:45:48.000000Z | 40.0 Hz, 2401 samples
TA.O16A..BHN | 2007-08-06T01:44:48.000000Z - 2007-08-06T01:45:48.000000Z | 40.0 Hz, 2401 samples
TA.O18A..BHE | 2007-08-06T01:44:47.999998Z - 2007-08-06T01:45:47.999998Z | 40.0 Hz, 2401 samples
TA.O18A..BHN | 2007-08-06T01:44:47.999998Z - 2007-08-06T01:45:47.999998Z | 40.0 Hz, 2401 samples
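
To get a feel for how the wildcard patterns in the queries above behave, the standard library's fnmatch module applies the same Unix-style rules (`?` for one character, `*` for any run, `[NE]` for a character set). This is only an illustration on hypothetical lists of codes; how WaveBank evaluates patterns internally may differ.

```python
from fnmatch import fnmatchcase

# Hypothetical station and channel codes to test the patterns against.
stations = ["O15A", "O16A", "O18A", "R16A", "R17A"]
channels = ["BHE", "BHN", "BHZ", "HHE"]

# '?' matches exactly one character; '[NE]' matches either 'N' or 'E'.
matched_stations = [s for s in stations if fnmatchcase(s, "O1??")]
matched_channels = [c for c in channels if fnmatchcase(c, "BH[NE]")]

print(matched_stations)  # ['O15A', 'O16A', 'O18A']
print(matched_channels)  # ['BHE', 'BHN']
```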

WaveBank also has a get_waveforms_bulk method for efficiently retrieving a large number of streams.

[8]:
args = [  # in practice this list may contain hundreds or thousands of requests
    ('TA', 'O15A', '', 'BHZ', t1 - 5, t2 - 5,),
    ('UU', 'SRU', '', 'HHZ', t1, t2,),
]
st = bank.get_waveforms_bulk(args)
print(st)
2 Trace(s) in Stream:
TA.O15A..BHZ | 2007-08-06T01:44:42.999998Z - 2007-08-06T01:45:42.999998Z | 40.0 Hz, 2401 samples
UU.SRU..HHZ  | 2007-08-06T01:44:47.995000Z - 2007-08-06T01:45:47.995000Z | 100.0 Hz, 6001 samples
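
In practice the bulk request list is usually generated rather than typed by hand. The sketch below builds one request per channel and time window using only the standard library. The channel list, the second window start, and the 60 second window length are hypothetical, and the naive datetime objects (read as UTC) would likely need to be wrapped in obspy.UTCDateTime before being passed to get_waveforms_bulk.

```python
from datetime import datetime, timedelta
from itertools import product

# Hypothetical inputs, for illustration only.
seed_ids = [("TA", "O15A", "", "BHZ"), ("UU", "SRU", "", "HHZ")]
window_starts = [
    datetime(2007, 8, 6, 1, 44, 48),
    datetime(2007, 8, 6, 8, 48, 40),
]
window = timedelta(seconds=60)

# One (network, station, location, channel, starttime, endtime) tuple
# for every combination of channel and window start.
requests = [
    (net, sta, loc, chan, t0, t0 + window)
    for (net, sta, loc, chan), t0 in product(seed_ids, window_starts)
]

for request in requests:
    print(request)
```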

Yield waveforms

WaveBank also provides a generator, yield_waveforms, for iterating over large amounts of continuous waveform data. The following example shows how to get streams of one hour duration with one minute of overlap between consecutive slices.

The first step is to create a bank on a dataset which has continuous data. The example below will use the TA dataset.

[9]:
ta_bank = obsplus.WaveBank(ta_path)
[10]:
# iterate over one day of TA data in hourly chunks
ta_t1 = obspy.UTCDateTime('2007-02-15')
ta_t2 = obspy.UTCDateTime('2007-02-16')

for st in ta_bank.yield_waveforms(starttime=ta_t1, endtime=ta_t2, duration=3600, overlap=60):
    print(f'got {len(st)} streams from {st[0].stats.starttime} to {st[0].stats.endtime}')
got 6 streams from 2007-02-15T00:00:09.999998Z to 2007-02-15T01:00:59.999998Z
got 6 streams from 2007-02-15T00:59:59.999998Z to 2007-02-15T02:00:59.999998Z
got 6 streams from 2007-02-15T01:59:59.999998Z to 2007-02-15T03:00:59.999998Z
got 6 streams from 2007-02-15T02:59:59.999998Z to 2007-02-15T04:00:59.999998Z
got 6 streams from 2007-02-15T03:59:59.999998Z to 2007-02-15T05:00:59.999998Z
got 6 streams from 2007-02-15T04:59:59.999998Z to 2007-02-15T06:00:59.999998Z
got 6 streams from 2007-02-15T05:59:59.999998Z to 2007-02-15T07:00:59.999998Z
got 6 streams from 2007-02-15T06:59:59.999998Z to 2007-02-15T08:00:59.999998Z
got 6 streams from 2007-02-15T07:59:59.999998Z to 2007-02-15T09:00:59.999998Z
got 6 streams from 2007-02-15T08:59:59.999998Z to 2007-02-15T10:00:59.999998Z
got 6 streams from 2007-02-15T09:59:59.999998Z to 2007-02-15T11:00:59.999998Z
got 6 streams from 2007-02-15T10:59:59.999998Z to 2007-02-15T12:00:59.999998Z
got 6 streams from 2007-02-15T11:59:59.999998Z to 2007-02-15T13:00:59.999998Z
got 6 streams from 2007-02-15T12:59:59.999998Z to 2007-02-15T14:00:59.999998Z
got 6 streams from 2007-02-15T13:59:59.999998Z to 2007-02-15T15:00:59.999998Z
got 6 streams from 2007-02-15T14:59:59.999998Z to 2007-02-15T16:00:59.999998Z
got 6 streams from 2007-02-15T15:59:59.999998Z to 2007-02-15T17:00:59.999998Z
got 6 streams from 2007-02-15T16:59:59.999998Z to 2007-02-15T18:00:59.999998Z
got 6 streams from 2007-02-15T17:59:59.999998Z to 2007-02-15T19:00:59.999998Z
got 6 streams from 2007-02-15T18:59:59.999998Z to 2007-02-15T20:00:59.999998Z
got 6 streams from 2007-02-15T19:59:59.999998Z to 2007-02-15T21:00:59.999998Z
got 6 streams from 2007-02-15T20:59:59.999998Z to 2007-02-15T22:00:59.999998Z
got 6 streams from 2007-02-15T21:59:59.999998Z to 2007-02-15T23:00:59.999998Z
got 6 streams from 2007-02-15T22:59:59.999998Z to 2007-02-16T00:00:59.999998Z
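
The printed time spans suggest that each slice covers duration plus overlap seconds and that consecutive slices start duration seconds apart. A minimal sketch of that windowing arithmetic follows, assuming this reading of the output is right. The helper `sliding_windows` is hypothetical and not part of obsplus, and it does not reproduce details such as how the real generator treats the final window.

```python
from datetime import datetime, timedelta


def sliding_windows(start, end, duration, overlap):
    """Yield (window_start, window_end) pairs; duration and overlap in seconds."""
    step = timedelta(seconds=duration)
    extra = timedelta(seconds=overlap)
    current = start
    while current < end:
        # Each window is extended by `overlap` so it shares data with the next.
        yield current, current + step + extra
        current += step


windows = list(
    sliding_windows(datetime(2007, 2, 15), datetime(2007, 2, 16), 3600, 60)
)

print(len(windows))  # 24
print(windows[0])
print(windows[1])
```

Each window ends one minute after the next one begins, which gives the overlap between neighbouring slices.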

Put waveforms

Data can be added to the bank by passing a stream or trace to the bank.put_waveforms method. WaveBank does not merge files, so the archive may end up with overlapping data if care is not taken.

[11]:
# show that no data for RJOB is in the bank
st = bank.get_waveforms(station='RJOB')

assert len(st) == 0

print(st)
0 Trace(s) in Stream:

[12]:
# add the default stream to the archive (which contains data for RJOB)
bank.put_waveforms(obspy.read())
st_out = bank.get_waveforms(station='RJOB')

# test output
assert len(st_out)
for tr in st_out:
    assert tr.stats.station == 'RJOB'


print(st_out)
3 Trace(s) in Stream:
BW.RJOB..EHE | 2009-08-24T00:20:03.000000Z - 2009-08-24T00:20:32.990000Z | 100.0 Hz, 3000 samples
BW.RJOB..EHN | 2009-08-24T00:20:03.000000Z - 2009-08-24T00:20:32.990000Z | 100.0 Hz, 3000 samples
BW.RJOB..EHZ | 2009-08-24T00:20:03.000000Z - 2009-08-24T00:20:32.990000Z | 100.0 Hz, 3000 samples
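
Because the bank does not merge data, one way to take care is to compare the time span of new data against spans already in the archive before writing. The sketch below shows only the interval test, on hypothetical spans given in seconds. In a real workflow the existing spans would have to come from the bank itself, and the helper `overlaps` is not an obsplus function.

```python
def overlaps(span_a, span_b):
    """Return True if two (start, end) spans share any time."""
    return span_a[0] < span_b[1] and span_b[0] < span_a[1]


# Hypothetical spans (seconds) already archived for one channel.
existing_spans = [(0.0, 30.0), (60.0, 90.0)]

new_span = (25.0, 50.0)    # starts before the first archived span ends
safe_span = (30.0, 60.0)   # fits exactly between the archived spans

new_conflicts = [span for span in existing_spans if overlaps(span, new_span)]
safe_conflicts = [span for span in existing_spans if overlaps(span, safe_span)]

print(new_conflicts)   # [(0.0, 30.0)]
print(safe_conflicts)  # []
```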

Availability

WaveBank can be used to get the availability of data. The output can be either a dataframe or a list of tuples of the form [(network, station, location, channel, min_starttime, max_endtime)].

[13]:
# get a dataframe of availability by seed ids and timestamps
bank.get_availability_df(channel='BHE', station='[OR]*')
[13]:
   network  station  location  channel  starttime                   endtime
0  TA       O15A               BHE      2007-08-06 01:44:38.825000  2007-08-07 21:43:51.124998
1  TA       O16A               BHE      2007-08-06 01:44:38.825000  2007-08-07 21:43:51.125000
2  TA       O18A               BHE      2007-08-06 01:44:38.824998  2007-08-07 21:43:51.125000
3  TA       R16A               BHE      2007-08-07 02:04:54.500000  2007-08-07 21:43:51.125000
4  TA       R17A               BHE      2007-08-06 01:44:38.825000  2007-08-07 21:43:51.125000
[14]:
# get list of tuples of availability
bank.availability(channel='BHE', station='[OR]*')
[14]:
[('TA',
  'O15A',
  '',
  'BHE',
  2007-08-06T01:44:38.825000Z,
  2007-08-07T21:43:51.124998Z),
 ('TA',
  'O16A',
  '',
  'BHE',
  2007-08-06T01:44:38.825000Z,
  2007-08-07T21:43:51.125000Z),
 ('TA',
  'O18A',
  '',
  'BHE',
  2007-08-06T01:44:38.824998Z,
  2007-08-07T21:43:51.125000Z),
 ('TA',
  'R16A',
  '',
  'BHE',
  2007-08-07T02:04:54.500000Z,
  2007-08-07T21:43:51.125000Z),
 ('TA',
  'R17A',
  '',
  'BHE',
  2007-08-06T01:44:38.825000Z,
  2007-08-07T21:43:51.125000Z)]
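
Conceptually, each availability tuple is the earliest start and latest end over all index entries that share a seed id. The following sketch performs that reduction on hypothetical rows with times in seconds. It illustrates the idea only and makes no claim about how WaveBank computes the result.

```python
# Hypothetical index rows: (network, station, location, channel, start, end)
rows = [
    ("TA", "O15A", "", "BHE", 10.0, 80.0),
    ("TA", "O15A", "", "BHE", 500.0, 570.0),
    ("TA", "O16A", "", "BHE", 12.0, 82.0),
]

# Track the earliest start and latest end seen for each seed id.
extent = {}
for net, sta, loc, chan, start, end in rows:
    key = (net, sta, loc, chan)
    first, last = extent.get(key, (start, end))
    extent[key] = (min(first, start), max(last, end))

availability = [key + span for key, span in sorted(extent.items())]

print(availability)
```

Note that the result says nothing about gaps between the first and last sample.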

Get gaps and uptime

WaveBank can return a dataframe of missing data with the get_gaps_df method, and a dataframe of reliability statistics with the get_uptime_df method. These are useful for assessing the completeness of an archive of continuous data.

[15]:
bank.get_gaps_df(channel='BHE', station='O*').head()
[15]:
   network  station  location  channel  starttime                   endtime                     sampling_period         path                                               gap_duration
0  TA       O15A               BHE      2007-08-06 01:45:48.799998  2007-08-06 08:48:30.024998  0 days 00:00:00.025000  TA.O15A..BHE__20070806T014438Z__20070806T01454...  0 days 07:02:41.225000
1  TA       O15A               BHE      2007-08-06 08:49:39.999998  2007-08-06 10:47:15.624998  0 days 00:00:00.025000  TA.O15A..BHE__20070806T084830Z__20070806T08494...  0 days 01:57:35.625000
2  TA       O15A               BHE      2007-08-06 10:48:25.599998  2007-08-07 02:04:54.499998  0 days 00:00:00.025000  TA.O15A..BHE__20070806T104715Z__20070806T10482...  0 days 15:16:28.900000
3  TA       O15A               BHE      2007-08-07 02:06:04.474998  2007-08-07 02:14:14.100000  0 days 00:00:00.025000  TA.O15A..BHE__20070807T020454Z__20070807T02060...  0 days 00:08:09.625002
4  TA       O15A               BHE      2007-08-07 02:15:24.074998  2007-08-07 03:44:08.474998  0 days 00:00:00.025000  TA.O15A..BHE__20070807T021414Z__20070807T02152...  0 days 01:28:44.400000
[16]:
ta_bank.get_uptime_df()
[16]:
   network  station  location  channel  starttime                   endtime                     duration          gap_duration  uptime            availability
0  TA       M11A               VHE      2007-02-15 00:00:09.999998  2007-02-24 23:59:59.999998  9 days 23:59:50   0 days        9 days 23:59:50   1.0
1  TA       M11A               VHN      2007-02-15 00:00:09.999998  2007-02-24 23:59:59.999998  9 days 23:59:50   0 days        9 days 23:59:50   1.0
2  TA       M11A               VHZ      2007-02-15 00:00:09.999998  2007-02-24 23:59:59.999998  9 days 23:59:50   0 days        9 days 23:59:50   1.0
3  TA       M14A               VHE      2007-02-15 00:00:00.000003  2007-02-25 00:00:00.000003  10 days 00:00:00  0 days        10 days 00:00:00  1.0
4  TA       M14A               VHN      2007-02-15 00:00:00.000003  2007-02-25 00:00:00.000003  10 days 00:00:00  0 days        10 days 00:00:00  1.0
5  TA       M14A               VHZ      2007-02-15 00:00:00.000004  2007-02-25 00:00:00.000004  10 days 00:00:00  0 days        10 days 00:00:00  1.0
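
The relationship between the columns can be illustrated with a small standalone calculation, assuming that uptime is the total duration minus the gap duration and that availability is uptime divided by duration. The spans below are hypothetical and in seconds. Unlike a real gap analysis, this sketch ignores the sampling period, and `find_gaps` is not an obsplus function.

```python
def find_gaps(spans):
    """Return (gap_start, gap_end) pairs between (start, end) spans."""
    gaps = []
    covered_until = None
    for start, end in sorted(spans):
        if covered_until is not None and start > covered_until:
            gaps.append((covered_until, start))
        # Overlapping spans extend the covered range without creating a gap.
        covered_until = end if covered_until is None else max(covered_until, end)
    return gaps


# Hypothetical data spans for one channel; the middle two overlap.
spans = [(0.0, 70.0), (100.0, 170.0), (160.0, 230.0), (400.0, 470.0)]

gaps = find_gaps(spans)
duration = max(end for _, end in spans) - min(start for start, _ in spans)
gap_duration = sum(end - start for start, end in gaps)
uptime = duration - gap_duration
availability = uptime / duration

print(gaps)  # [(70.0, 100.0), (230.0, 400.0)]
print(duration, gap_duration, uptime)  # 470.0 200.0 270.0
print(round(availability, 3))  # 0.574
```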

Read index

WaveBank can return a dataframe of the index with the read_index method, although in most cases this shouldn’t be needed.

[17]:
ta_bank.read_index().head()
[17]:
   network  station  location  channel  starttime                   endtime                     sampling_period  path
0  TA       M11A               VHN      2007-02-19 14:59:59.999998  2007-02-19 15:59:59.999998  0 days 00:00:10  TA/M11A/VHN/2007-02-19T15-00-00.mseed
1  TA       M14A               VHN      2007-02-19 15:00:00.000003  2007-02-19 16:00:00.000003  0 days 00:00:10  TA/M11A/VHN/2007-02-19T15-00-00.mseed
2  TA       M11A               VHN      2007-02-15 23:59:59.999998  2007-02-16 00:59:59.999998  0 days 00:00:10  TA/M11A/VHN/2007-02-16T00-00-00.mseed
3  TA       M14A               VHN      2007-02-16 00:00:00.000003  2007-02-16 01:00:00.000003  0 days 00:00:10  TA/M11A/VHN/2007-02-16T00-00-00.mseed
4  TA       M11A               VHN      2007-02-20 15:59:59.999998  2007-02-20 16:59:59.999998  0 days 00:00:10  TA/M11A/VHN/2007-02-20T16-00-00.mseed

Similar Projects

WaveBank is a useful tool, but it may not be a good fit for every application. Check out the following items as well:

ObsPy can visualize the availability of waveform data in a directory using obspy-scan. If you prefer a graphical option to working with DataFrames, this might be for you.

ObsPy also has a filesystem client for working with SeisComP structured archives.

IRIS released a miniSEED indexing program called mseedindex, which has an ObsPy API.