UCBerkeleySETI/turbo_seti

Correct find_event.calc_freq_range()

telegraphic opened this issue · 6 comments

The function calc_freq_range uses hardcoded values MAX_DRIFT_RATE and OBS_LENGTH. These should instead be derived from the data.

For the most part, this will mean extra events reported, not fewer events. The effect of using hardcoded MAX_DRIFT_RATE=2.0 will mean the hit boundaries are calculated assuming +/- 2 Hz, and area used for RFI coincidence detection will be too small for higher drift rates (meaning additional events detected). OBS_LENGTH assumes 5 minute observations, which will mean coincident detection is too small for longer observations.

@telegraphic

Torn between projects. Cryptography is more time-consuming lately. Not sure when I can have a look at this.

The hit object argument needs to be dumped by a debugger or print statements in real-time to get a feel for which data could be used.

@telegraphic

I see your points now. I assigned this to you for the moment, if just for a review.

Maximum Drift Rate vs Actual Drift Rate

We need the defaulted or supplied maximum drift rate only known to FindDoppler (and then discarded) in this situation inside find_events:

    #Now find how much RFI is within a frequency range of the hit 
    #by comparing the ON to the OFF observations. Update RFI_in_range
    if len(off_table) == 0:
        print('Length of off_table = 0')
        snr_adjusted_table['RFI_in_range'] = 0
    else:
        snr_adjusted_table['RFI_in_range'] = snr_adjusted_table.apply(
            lambda hit: 
                len(off_table[((off_table['Freq'] > calc_freq_range(hit)[0]) 
                               & (off_table['Freq'] < calc_freq_range(hit)[1])
                               )]), axis=1)

In follow_event, the actual drift rate is used so no issue there.

Observation Length

Essentially, the header.nsamp and header.n_ints_in_file are needed to multiply together, producing the observation length. This is required in both of the 2 situations described earlier. Unfortunately, like the maximum drift rate, these 2 pieces of information are not available.

Recommendation for Remediation

  • Amend turbo_seti to always write out a supplementary CSV file containing its run-time parameter values which would include, at minimum:
    -- max_drift_rate
    -- n_ints_in_file
    -- tsamp

  • The full path of this CSV file would be an additional argument for find_event_pipeline and find_events. Note that find_events can be called directly.

  • The following routines need to be amended to work with the 3 parameters from the CSV file:
    -- find_event_pipeline just pass along the CSV path
    -- find_events df = pandas.read_csv
    -- find_events use the df to pass along field values as needed
    -- calc_freq_range new max_drift_rate and obs_length parameters
    -- follow_event new obs_length parameter

Comments/reactions?

Hi @texadactyl, I certainly agree that a supplementary file -- or putting this into the header of the existing CSV file -- would be a nice feature. Also for reproducability we could store the version of turboseti + blimpy used.

I am leaning slightly towards putting this into the existing CSV file. The issue here is that older versions of turboseti won't be able to read it, but it's one fewer file to keep track of! However if we decided to log a lot more information (e.g. dumping pip freeze output, IP address of server, the kitchen sink) then I a new file is a good idea.

This would instead mean updating FileWriter and read_dat

class FileWriter(GeneralWriter):

https://github.com/UCBerkeleySETI/turbo_seti/blob/master/turbo_seti/find_event/find_event.py#L43

As a sidenote, I will think about how hyperseti should output hits in CSV form, as it would be nice to keep as much interoperability as possible.

@telegraphic

Existing CSV file format sample:

,TopHitNum,DriftRate,SNR,Freq,ChanIndx,FreqStart,FreqEnd,CoarseChanNum,FullNumHitsInRange,FileID,Source,MJD,RA,DEC,DELTAT,DELTAF,Hit_ID,status,in_n_ons,RFI_in_range,delta_t
0,1,-0.019133,250459.347795,8419.921874,0,8419.921874,8419.920245,32,11256,Voyager1.single_coarse.fine_res.h5,VOYAGER1,57650.782094907408,17h10m03.984s,12d10m58.8s,18.253611,-2.793968,VOYAGER1_0,on_table_1,0,0,0.0
0,1,-0.019133,250459.347795,8419.921874,0,8419.921874,8419.920245,32,11256,Voyager1.single_coarse.fine_res.h5,VOYAGER1,57650.782094907408,17h10m03.984s,12d10m58.8s,18.253611,-2.793968,VOYAGER1_1,on_table_1,0,0,0.0
1,2,3.97966,15718.665595,8419.921801,26,8419.921874,8419.920172,32,11256,Voyager1.single_coarse.fine_res.h5,VOYAGER1,57650.782094907408,17h10m03.984s,12d10m58.8s,18.253611,-2.793968,VOYAGER1_1,on_table_1,0,0,0.0
2,3,-3.577867,15718.458163,8419.876161,16361,8419.877796,8419.8761,32,11256,Voyager1.single_coarse.fine_res.h5,VOYAGER1,57650.782094907408,17h10m03.984s,12d10m58.8s,18.253611,-2.793968,VOYAGER1_2,on_table_1,0,0,0.0
3,4,-0.392226,153.731099,8419.319368,2653,8419.321003,8419.31774,45,72620,Voyager1.single_coarse.fine_res.h5,VOYAGER1,57650.782094907408,17h10m03.984s,12d10m58.8s,18.253611,-2.793968,VOYAGER1_3,on_table_1,0,0,0.0
4,5,-0.373093,1237.702395,8419.297028,10649,8419.298662,8419.295399,45,72620,Voyager1.single_coarse.fine_res.h5,VOYAGER1,57650.782094907408,17h10m03.984s,12d10m58.8s,18.253611,-2.793968,VOYAGER1_4,on_table_1,0,0,0.0
5,6,-0.392226,156.779181,8419.274374,2373,8419.276009,8419.272745,46,38009,Voyager1.single_coarse.fine_res.h5,VOYAGER1,57650.782094907408,17h10m03.984s,12d10m58.8s,18.253611,-2.793968,VOYAGER1_5,on_table_1,0,0,0.0

One way to put the new information in the existing CSV file is through DAT-style comments. Ugh!
An alternative is to repeat the new information as new columns in each row. Ugh-ugh!
I'd go with Ugh number 1 (DATish).

Unsolved issue: We can get the header.nsamp and header.n_ints_in_file out of an existing HDF5 file but how do we provide the max_drift_rate to find_event_pipeline? Specifying that value twice (1. FindDoppler, 2. find_event_pipeline) is a risk of inconsistency.

Current/old versions of turbo_seti

  • The current/old versions of turbo_seti should not be impacted at all as they should not generate nor read the modified CSV file.
  • The current processing (2.0.23) successfully ignores comments (Pandas dataframe read_csv()). So, the new CSV file format will not bother turbo_seti 2.0.23.
  • If somehow (?) a new-format CSV file appeared in a current/old version of turbo_seti, the new DAT-style header (# statements) would not be processed as they appear to be comments. Life would go on in the old design flaw manner.

After an upgrade to fix this issue

  • The new find_event_pipeline() should be re-run to produce the new CSV file format. Then, the new plot_event_pipeline() will be re-run to process the new CSV file format.
  • In the case of upgrading from a very old turbo_seti, it might be advisable to re-run FindDoppler to produce new DAT files.
  • If for some reason, the new plot_event_pipeline() is run on an old-format CSV file, it will fail due to missing information that it is expecting. Part of the error report should note that find_event_pipeline() needs to be run first to recreate the CSV in the new file format.

In the hinesight-is-powerful department

I wish that turbo_seti had used sqlite instead of DAT & CSV files. Then, we could have 2 independent tables (header, details). This is a common way to develop among my cryptography peers and others.

Source Modification Approach

data_handler.py

  • Already calculated: obs_length = header.nsamp * header.n_ints_in_file. Move it into the header for FindDoppler.search().

find_doppler.py and file_writers.py

  • max_drift_rate is already available as a parameter to FindDoppler object instantiation.
  • obs_length is already available from data_handler.py.
  • Append the DAT "# DELTAT" line with max_drift_rate and obs_length.

find_events.py

  • read_dat: extract max_drift_rate and obs_length from the DAT file and pass them back to caller inside the dataframe (expanded).
  • make_table: merge into read_dat. So, read_dat returns to find_events in 2 places: on-table and off-table.
  • find_events: no change in interface to calc_freq_range since the new parameters are part of each hit row.
  • calc_freq_range: use parametric values from the hit row instead of hard-coded constants.

No need to change any other source files.

Impacted software

turbo_seti/find_doppler/

  • data_handler.py
  • file_writers.py
  • find_doppler.py

turbo_seti/find_event/

  • find_event.py

test/

  • test_find_event.py
  • test_pipelines_3.py