Suggestions mostly about I/O

Dear Bryant,

I would like to suggest and implement the following changes:

Add support for observations in SAC format. Sometimes it can be useful to read metadata included on the SAC file headers. For instance, reading the source-receiver distance in order to compute a measurement window based on a predefined apparent velocity.
Allow observations and synthetics to have a different data format. The above modification would require this. Currently, both observations and synthetics must have the same format and SPECFEM cannot output waveforms in SAC.
Remove requirement: all synthetics must have a corresponding observation. According to "_setup_quantify_misfit" (preprocess/default.py line 344), the number of observations and synthetics must be the same. However, there might be situations where not all synthetics have an observation. For instance, one event might not be recorded at one or more stations defined on the STATIONS file.
Initialize adjoint traces by reading the STATIONS file. The function "initialize_adjoint_traces" (preprocess/default.py line 296) reads the variable "data_filenames" which lists the file names of the observations. This will cause adjoint sources from stations with no observations not to be initialized. This can be avoided by initializing the adjoint traces by setting their name according to the STATIONS file.

Additionally, I have a question. Are the adjoint sources being initialized at all? I see the function "initialize_adjoint_traces" (preprocess/default.py line 296) is defined but I cannot find any part of the code where it is called.

Finally, on which branch I should include my changes? There are more additions I would like to add in the future.

Best,
Eduardo Valero.
PhD student at KAUST.

Hi @evcano, this all looks great! Thank you for taking the time to look through the code and determine points for improvement, I admit the default preprocessing module could use some work so I'd be very happy for you to implement these changes.

Add support for observations in SAC format.

I agree this would be useful. My tendency has been to grab metadata from the Specfem STATIONS and CMTSOLUTION (or other source types) files and validate via station names since it's guaranteed that these files will be available, but I think a separate check on the availability of SAC headers would be useful.

I also think that using this metadata to compute measurement windows would be a very nice feature and a great addition to the module.

Allow observations and synthetics to have a different data format.

Yes, totally agree this would be useful. Because I tend to use Pyatoa to do my preprocessing, which has some logic trees to determine data format, this is not something I thought much about.

Although I will say that the 3D_GLOBE version of SPECFEM can output SAC formatted seismograms, so it could be useful to check how those are formatted and ensure SeisFlows can handle them.

Remove requirement: all synthetics must have a corresponding observation

Yes you're right, there is no hard check that ensures that the obs data available matches the syn data.

I think the restriction can be left in, however there needs to be a check at L355 here

seisflows/seisflows/preprocess/default.py

Lines 353 to 359 in 4365b9c

    
           observed = sorted(glob(os.path.join(obs_path, "*"))) 
        
           synthetic = sorted(glob(os.path.join(syn_path, "*"))) 
        
           assert(len(observed) == len(synthetic)), ( 
        
               f"number of observed traces does not match length of synthetic for " 
        
               f"source: {source_name}" 
        
           )

that discards observations which do not have a matching synthetic, otherwise the "quantify_misfit" function may mismatch files as it loops through data.

Initialize adjoint traces by reading the STATIONS file.

Good catch, I think my plan was to scrap the adjoint source initialization entirely which is why this function is not called. Instead I would prefer to let the "quantify_misfit" function write everything out based on the available synthetic data.

The initialization is a leftover from the old codebase and I am not sure it is necessary (although it may be useful for priming the filesystem?).

Finally, on which branch I should include my changes? There are more additions I would like to add in the future.

Please feel free to submit PRs to the devel branch, and we can code review and iterate from there. Happy to talk through the process to determine how changes should be implemented here in this issue.

Thanks again for proposing these changes, looking forward to seeing the PR :)

Hi @evcano, just letting you know I had a look at your previous PR and it looked great. Looking forward to the updated PR

The suggestions stated on this issue were merged on PR #155. Issue closed.

	observed = sorted(glob(os.path.join(obs_path, "*")))
	synthetic = sorted(glob(os.path.join(syn_path, "*")))

	assert(len(observed) == len(synthetic)), (
	f"number of observed traces does not match length of synthetic for "
	f"source: {source_name}"
	)