adjtomo/pysep

Pre-processing segmented traces

SeismoFelix opened this issue · 7 comments

I have downloaded data directly from the IRIS Wilber service and again with Pysep, to double-check that all the data that should be available can be streamed. Unfortunately, some traces downloaded from Wilber are segmented, and this is a problem for Pysep when it comes to rotation, because the trace lengths are inconsistent even among the components of the same station. I got the following error:

File "pysep/lib/python3.11/site-packages/obspy/signal/rotate.py", line 222, in rotate2zne
    raise ValueError("The given directions are not linearly independent, "

One way to address this would be to check whether the requested seismograms are split across several independent files for a single trace. If so, the files could be merged into a single trace before rotating. The segmentation could also be due to gaps in the requested time window; filling gaps may exceed the current scope of Pysep, in which case the trace could be rejected, or the rotation step simply skipped and the data downloaded as is.

I tried setting the rotate option to null in the YAML file, but I got the following error:

File "/Users/felix/Documents/Software/pysep/pysep/pysep.py", line 1199, in run
    self.st = quality_check_waveforms_after_processing(self.st)
File "/Users/felix/Documents/Software/pysep/pysep/utils/curtail.py", line 108, in quality_check_waveforms_after_processing
    st_out = remove_stations_for_insufficient_length(st_out)
File "/Users/felix/Documents/Software/pysep/pysep/utils/curtail.py", line 219, in remove_stations_for_insufficient_length
    expected_length = vals[np.argmax(counts)]
File "<__array_function__ internals>", line 200, in argmax
File "/Users/felix/opt/anaconda3/envs/pysep/lib/python3.11/site-packages/numpy/core/fromnumeric.py", line 1242, in argmax
    return _wrapfunc(a, 'argmax', axis=axis, out=out, **kwds)
File "/Users/felix/opt/anaconda3/envs/pysep/lib/python3.11/site-packages/numpy/core/fromnumeric.py", line 57, in _wrapfunc
    return bound(*args, **kwds)
ValueError: attempt to get argmax of an empty sequence

So, I assume some rotation option must be selected for the quality control to proceed.

Thanks for the fantastic package,

Félix

bch0w commented

Hi @SeismoFelix, thanks for raising this issue.

Just to re-state to make sure I understand: you have a number of Traces in a Stream that correspond to the same component (e.g., NN.SSS..HHZ has 3 traces), with each trace defining a different time segment (e.g., 0-10s, 12-15s, 20-30s), which causes rotation to fail because something in the rotation function is probably trying to include all of these traces.

I think having a check on multiple traces for a given component would be useful to let the user know that the data are segmented. I would be hesitant to put a merge in as different users may want different behavior for this data, but we could add an option to allow this if explicitly stated.

Looks like the second error you encountered is from the following quality control function

def remove_stations_for_insufficient_length(st):
    """
    Remove stations if the length does not match the mode of all other
    lengths in the stream, which is assumed to be the expected length

    :type st: obspy.core.stream.Stream
    :param st: Stream to check for data gaps and insufficient start and
        end times
    """
    st_out = st.copy()
    # Assuming that the mode of the stream lengths represents the actual value
    stream_lengths = [tr.stats.endtime - tr.stats.starttime for tr in st_out]
    vals, counts = np.unique(stream_lengths, return_counts=True)
    expected_length = vals[np.argmax(counts)]
    logger.debug(f"assuming that the expected stream length is: "
                 f"{expected_length}s")
    for tr, length in zip(st_out[:], stream_lengths):
        if length != expected_length:
            logger.debug(f"{tr.get_id()} has insufficient time length of "
                         f"{length}s, removing")
            st_out.remove(tr)
    return st_out

I have a feeling the ValueError occurs because the Stream has zero length, probably because the earlier operations removed all traces from it. I think there is also a small bug related to setting rotate as null, which causes the rotate function to be applied anyway. I'll fix that asap.
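A guard along these lines would avoid the argmax-on-empty error (the helper name `safe_expected_length` is hypothetical, not the actual fix in Pysep):

```python
import numpy as np

def safe_expected_length(stream_lengths):
    """Return the modal trace length, or None for an empty input.

    Hypothetical sketch of a guard for the quality-control step: if all
    traces were already removed upstream, there is no mode to compute, so
    bail out instead of calling np.argmax on an empty array.
    """
    if not stream_lengths:
        return None
    vals, counts = np.unique(stream_lengths, return_counts=True)
    return vals[np.argmax(counts)]
```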

Here's the to do list for a corresponding PR:

  • Add check on segmented traces and warn user if a component has multiple matching traces
  • Optional parameter to merge segmented traces with NaN or fill value, otherwise reject
  • Add intermediate checks on Stream length to avoid unexpected errors with trying to quality control empty Stream
  • Fix rotation check bug (8e33bcf)
bch0w commented

@SeismoFelix Could you provide me an example of segmented data in the form of an mseed or SAC file, or a Wilber download link? It would be very useful for development, and I could include it in the test data to ensure future versions of PySEP can deal with it.

Thanks @bch0w,

I guess that, in addition to "the rotation function probably trying to include all of these traces", the rotation matrix is going to fail if the components have different lengths. So a sanity check should probably also verify that the traces have the same length. I also noticed (sorry for not raising a new issue; I can do that as well) that some stations do not have the CMPAZ value set properly. For the east component, CMPAZ should be 90, but in some cases it is set to zero. This could be an error on the data provider's side, but an extra check should verify that CMPAZ and CMPINC are set properly for the 3 components.
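Such an orientation check could be sketched as follows, assuming standard SEED conventions (azimuth 90/0/0 and dip 0/0/-90 for E/N/Z). The helper is hypothetical; a real check would read these values from the station inventory or the SAC headers CMPAZ/CMPINC:

```python
def check_orientation(component, azimuth, dip, tol=1.0):
    """Return True if azimuth/dip match SEED conventions for E/N/Z.

    Hypothetical sketch: compares a channel's reported orientation
    against the nominal values, within a tolerance in degrees.
    """
    expected = {"E": (90.0, 0.0), "N": (0.0, 0.0), "Z": (0.0, -90.0)}
    if component not in expected:
        return True  # unknown component code, nothing to check
    exp_az, exp_dip = expected[component]
    return abs(azimuth - exp_az) <= tol and abs(dip - exp_dip) <= tol
```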

This is the YAML file I made for streaming the data using Pysep:

event_tag: 20230208142025_Turkey
config_file: null
client: IRIS
client_debug: false
timeout: 600
taup_model: ak135
event_selection: default
origin_time: '2023-02-08T14:20:25.0000Z'
seconds_before_event: 120
seconds_after_event: 1800
event_latitude: 37.985
event_longitude: 37.391
event_depth_km: 17.0
event_magnitude: 5
networks: '*'
stations: DARE,KMRS,GAZ,SARI,ANDN,URFA,BNN,SVSK,DYBB,MAZI,RSDY,MERS,BNGB,SVAN,KAMT,KVT,KTUT,AFSR,CANT,LMOS,CHOM,HDMB,LOD,DHK1,KARS,AZNN,BORA,GNI,ELL,CHBG,SHMT,MSBI,KIV,YER,EIL,APE,MV03,KTHR,VLX,FSK,AGG,AOS2,THL,VRI
channels: '*'
locations: '*'
reference_time: '2023-02-08T14:20:25.0000Z'
seconds_before_ref: 120
seconds_after_ref: 180
phase_list: null
mindistance: 0
maxdistance: 1500.0
minazimuth: 0
maxazimuth: 360
minlatitude: null
maxlatitude: null
minlongitude: null
maxlongitude: null
demean: true
detrend: true
taper_percentage: 0
rotate:
- ENZ
remove_response: true
output_unit: VEL
water_level: 60
pre_filt: default
scale_factor: 1.0
resample_freq: 50
remove_clipped: false
log_level: DEBUG 

And here is the Python Script:

from pysep import Pysep
sep = Pysep(config_file='yaml_files/20230208142025.yaml')
sep.run()

If you want to stream the data directly from Wilber use this link. The parameters are the same set in the yaml file. Check the DARE station and you will see the segmented traces.

bch0w commented

Hi @SeismoFelix, can you update your version of PySEP to the latest devel version and see if the additions in #84 addressed your issue? #84 details the new parameters you can add to your parameter file to retain gapped data, e.g.,

fill_data_gaps: interpolate
gap_fraction: 0.2

See: https://github.com/adjtomo/pysep/blob/devel/pysep/pysep.py#L227-L254 for how to implement

Also, yes could you please open another issue with the azimuth and inclination issue, to help better keep track of things. thanks!

Hi @bch0w, I updated Pysep and the script runs. Still, Pysep cannot grab all the data I can stream directly from Wilber or with the ObsPy mass downloader. For instance, if I refrain from setting use_mass_download: true and try to fetch data from the KOERI client (client: KOERI), only a few stations are streamed, because:

[2023-02-27 11:20:05] - pysep - DEBUG: KO.DARE..HHE has insufficient time length of 301.66s, removing

However, if I go to Wilber, I definitely can stream those seismograms.

KOERI, among other clients, serves data that is segmented and has gaps, but I am assuming that after setting fill_data_gaps: interpolate and gap_fraction: 0.5, Pysep should be able to merge the seismograms and interpolate the gaps.

It is also interesting that if I use the mass downloader, the KOERI client is ignored. These are the stations I got using the mass downloader:

2023-02-08T142025_TURKEY.AB.MV03.00.HHE.sac 2023-02-08T142025_TURKEY.GO.AZNN.20.LNN.sac
2023-02-08T142025_TURKEY.AB.MV03.00.HHN.sac 2023-02-08T142025_TURKEY.GO.AZNN.20.LNZ.sac
2023-02-08T142025_TURKEY.AB.MV03.00.HHZ.sac 2023-02-08T142025_TURKEY.GO.CHBG.00.HHE.sac
2023-02-08T142025_TURKEY.AB.MV03.00.LHE.sac 2023-02-08T142025_TURKEY.GO.CHBG.00.HHN.sac
2023-02-08T142025_TURKEY.AB.MV03.00.LHN.sac 2023-02-08T142025_TURKEY.GO.CHBG.00.HHZ.sac
2023-02-08T142025_TURKEY.AB.MV03.00.LHZ.sac 2023-02-08T142025_TURKEY.GO.CHBG.00.LHE.sac
2023-02-08T142025_TURKEY.AV.KVT..BHE.sac 2023-02-08T142025_TURKEY.GO.CHBG.00.LHN.sac
2023-02-08T142025_TURKEY.AV.KVT..BHN.sac 2023-02-08T142025_TURKEY.GO.CHBG.00.LHZ.sac
2023-02-08T142025_TURKEY.AV.KVT..BHZ.sac 2023-02-08T142025_TURKEY.GO.CHBG.20.HNE.sac
2023-02-08T142025_TURKEY.GO.AZNN.00.HHE.sac 2023-02-08T142025_TURKEY.GO.CHBG.20.HNN.sac
2023-02-08T142025_TURKEY.GO.AZNN.00.HHN.sac 2023-02-08T142025_TURKEY.GO.CHBG.20.HNR.sac
2023-02-08T142025_TURKEY.GO.AZNN.00.HHZ.sac 2023-02-08T142025_TURKEY.GO.CHBG.20.HNT.sac
2023-02-08T142025_TURKEY.GO.AZNN.00.LHE.sac 2023-02-08T142025_TURKEY.GO.CHBG.20.HNZ.sac
2023-02-08T142025_TURKEY.GO.AZNN.00.LHN.sac 2023-02-08T142025_TURKEY.GO.CHBG.20.LNE.sac
2023-02-08T142025_TURKEY.GO.AZNN.00.LHZ.sac 2023-02-08T142025_TURKEY.GO.CHBG.20.LNN.sac
2023-02-08T142025_TURKEY.GO.AZNN.20.HNE.sac 2023-02-08T142025_TURKEY.GO.CHBG.20.LNZ.sac
2023-02-08T142025_TURKEY.GO.AZNN.20.HNN.sac 2023-02-08T142025_TURKEY.MP.DHK1.00.HHE.sac
2023-02-08T142025_TURKEY.GO.AZNN.20.HNR.sac 2023-02-08T142025_TURKEY.MP.DHK1.00.HHN.sac
2023-02-08T142025_TURKEY.GO.AZNN.20.HNT.sac 2023-02-08T142025_TURKEY.MP.DHK1.00.HHZ.sac
2023-02-08T142025_TURKEY.GO.AZNN.20.HNZ.sac 2023-02-08T142025_TURKEY.UW.ELL..EHZ.sac
2023-02-08T142025_TURKEY.GO.AZNN.20.LNE.sac

As you can see, no KOERI (KO) stations are streamed. On the other hand, if I set client: KOERI, this is what I get:

2023-02-08T142025_TURKEY.KO.AFSR..HHE.sac 2023-02-08T142025_TURKEY.KO.KARS..HHE.sac
2023-02-08T142025_TURKEY.KO.AFSR..HHN.sac 2023-02-08T142025_TURKEY.KO.KARS..HHN.sac
2023-02-08T142025_TURKEY.KO.AFSR..HHR.sac 2023-02-08T142025_TURKEY.KO.KARS..HHR.sac
2023-02-08T142025_TURKEY.KO.AFSR..HHT.sac 2023-02-08T142025_TURKEY.KO.KARS..HHT.sac
2023-02-08T142025_TURKEY.KO.AFSR..HHZ.sac 2023-02-08T142025_TURKEY.KO.KARS..HHZ.sac

So, the questions are: why did I not get those traces with the mass downloader? And why, when I fetch from a single client, is most of the data rejected instead of being merged and interpolated?

This is the parameter file I used:

event_tag: 20230208142025_Turkey
config_file: null
client: KOERI
#use_mass_download: true 
client_debug: false
timeout: 600
taup_model: ak135
event_selection: default
origin_time: '2023-02-08T14:20:25.0000Z'
seconds_before_event: 120
seconds_after_event: 1800
event_latitude: 37.985
event_longitude: 37.391
event_depth_km: 17.0
event_magnitude: 5
networks: '*'
stations: DARE,KMRS,GAZ,SARI,ANDN,URFA,BNN,SVSK,DYBB,MAZI,RSDY,MERS,BNGB,SVAN,KAMT,KVT,KTUT,AFSR,CANT,LMOS,CHOM,HDMB,LOD,DHK1,KARS,AZNN,BORA,GNI,ELL,CHBG,SHMT,MSBI,KIV,YER,EIL,APE,MV03,KTHR,VLX,FSK,AGG,AOS2,THL,VRI
channels: '*'
locations: '*'
reference_time: '2023-02-08T14:20:25.0000Z'
seconds_before_ref: 120
seconds_after_ref: 180
phase_list: null
mindistance: 0
maxdistance: 1500.0
minazimuth: 0
maxazimuth: 360
minlatitude: null
maxlatitude: null
minlongitude: null
maxlongitude: null
demean: true
detrend: true
taper_percentage: 0
rotate:
- ENZ
- RTZ
remove_response: true
output_unit: VEL
water_level: 60
pre_filt: default
scale_factor: 1.0
resample_freq: 50
remove_clipped: false
log_level: DEBUG
fill_data_gaps: interpolate
gap_fraction: 0.5

Thanks a lot for chiming in on this.

bch0w commented

Hi @SeismoFelix thanks for checking that, sorry it's still not working as expected. Trying to identify the multiple issues you're having:

  1. Insufficient time length removing stations
  2. Merging gapped data not always working
  3. Inconsistent results between bulk request and mass downloader (could you please open this as a new issue as it is quite different from the current issue, thanks!)

(1) small bug: since the input time bounds (seconds_before_ref and seconds_after_ref) are integers in your Config file, you are getting floating-point rounding errors causing sub-second inconsistency in trace lengths, which leads one of the checks to throw out some of the stations. I will implement a bug fix that forces these values to be floating points, regardless of input parameter type.

To get around this, you can also use the parameter remove_insufficient_length to ignore that check for uniform stream lengths.

remove_insufficient_length: false

However, it seems that the ObsPy trim function has some trouble cutting traces with the same sub-second accuracy:

KO.MAZI..HHZ | 2023-02-08T14:18:25.010000Z - 2023-02-08T14:23:24.990000Z | 50.0 Hz, 15000 samples
KO.MERS..HHE | 2023-02-08T14:18:25.000000Z - 2023-02-08T14:23:25.000000Z | 50.0 Hz, 15001 samples

I'll need to investigate this a bit further to ensure that we are truly getting the same start times for all traces.


For (2), I see gap merging working in some cases and not in others:

[2023-02-27 09:12:30] - pysep - INFO: KO.YER..HHN has data gaps, filling with: interpolate
[2023-02-27 09:12:30] - pysep - WARNING: KO.BNGB..HHZ has data gaps, removing

This is because filling data gaps only fills values between the start and end time of traces, but if the data gap is at the very beginning or end of the trace, there was no way to address it.

The bugfix is to allow the parameter fill_data_gaps to fill in those boundary data gaps.

bch0w commented

@SeismoFelix I tried to address points 1 and 2 in the above comment with the latest PR (#88). I tried this out with the example YAML file you provided and I managed to get ~200 waveforms returned which seems like more than double the original.

Could you test that out and see how it works? Similarly if you could let me know how many waveforms Wilber returns, I will probably have a better idea of what number we should be aiming for with PySEP.

Also please submit point 3 (mass downloader) as a separate issue and I'll get to that when I can.