justincely/lightcurve_pipeline

Resolve targets

Output lightcurves are separated based on TARGNAME, but many of the TARGNAMEs differ even though they refer to the same target. For example,

SATURN-VISIT11-SLEW-ORBIT1
SATURN-VISIT11-SLEW-ORBIT2
SATURN-VISIT21-SLEW-ORBIT1
SATURN-VISIT21-SLEW-ORBIT2
SATURN-VISIT32-SLEW-ORBIT1
SATURN-VISIT32-SLEW-ORBIT2

presumably all refer to the same target. It would be useful to resolve the targets during ingest and place equivalent targets in the same directory in the filesystem, which would enable more complete composite lightcurves.

Cases that differ only in hyphenation or punctuation are potentially easier to resolve:

  1. GD71 vs. GD-71
  2. WD-0308-565 vs. WD0308-565

Some of these cases are for targets that have large amounts of exposure time, so keeping them together would be very beneficial.
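A minimal sketch of how the punctuation-only cases could be flagged for review. The function names `grouping_key` and `group_by_key` are hypothetical and not part of the pipeline. The key drops every non-alphanumeric character, including genuine minus signs, so it is only meant for finding candidate duplicates to check by hand, not for producing the final target name.

```python
import re
from collections import defaultdict


def grouping_key(targname):
    """Uppercase the name and drop every non-alphanumeric character."""
    return re.sub(r'[^A-Z0-9]', '', targname.upper())


def group_by_key(targnames):
    """Collect distinct TARGNAME spellings that share a grouping key."""
    groups = defaultdict(set)
    for name in targnames:
        groups[grouping_key(name)].add(name)
    # Keep only keys that collect more than one distinct spelling
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}


names = ['GD71', 'GD-71', 'WD-0308-565', 'WD0308-565', 'SATURN']
candidates = group_by_key(names)
for key, spellings in candidates.items():
    print(key, spellings)
```

Because signs are discarded, two different targets whose names differ only by `+` versus `-` would land in the same group, which is why the output should be reviewed rather than applied automatically.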

Perhaps we can hack into the archive's target resolver?

It is probably best to resolve some of these targets by hand. One way we could do this is to build a dictionary whose keys are the TARGNAMEs that we want to treat as canonical and whose values are lists of alternative TARGNAMEs for that key. For example:

targ_dict['SATURN'] = ['SATURN-VISIT11-SLEW-ORBIT1', 'SATURN-VISIT11-SLEW-ORBIT2', ...]

This dictionary can then be inverted so that each alternative name maps directly to its canonical name, which saves computing time at lookup:

targ_dict['SATURN-VISIT11-SLEW-ORBIT1'] = 'SATURN'
targ_dict['SATURN-VISIT11-SLEW-ORBIT2'] = 'SATURN'

Then, in the pipeline, before the TARGNAME is added to the database, it can be checked to see if it exists as a key in this dictionary, and if it does, the TARGNAME can be replaced with the dictionary value.
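The inversion and ingest-time lookup described above could be sketched roughly as follows. The helper names `invert` and `resolve_targname`, and the `GD71` entry, are illustrative assumptions rather than existing pipeline code.

```python
def invert(canonical_to_aliases):
    """Turn {canonical: [alias, ...]} into {alias: canonical}."""
    alias_to_canonical = {}
    for canonical, aliases in canonical_to_aliases.items():
        for alias in aliases:
            alias_to_canonical[alias] = canonical
    return alias_to_canonical


def resolve_targname(targname, alias_to_canonical):
    """Return the canonical name if one is known, else the name unchanged."""
    return alias_to_canonical.get(targname, targname)


# Hand-built mapping (example entries only)
canonical_to_aliases = {
    'SATURN': ['SATURN-VISIT11-SLEW-ORBIT1', 'SATURN-VISIT11-SLEW-ORBIT2'],
    'GD71': ['GD-71'],
}
alias_to_canonical = invert(canonical_to_aliases)

print(resolve_targname('SATURN-VISIT11-SLEW-ORBIT2', alias_to_canonical))
print(resolve_targname('NGC-1234', alias_to_canonical))  # unknown name passes through
```

Keeping the hand-edited dictionary in the canonical-to-aliases form and inverting it once at startup means the readable form is what gets maintained, while each ingest only pays for a single dictionary lookup.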

This will help us build a TARGNAME vs EXPTIME plot, as mentioned in issue #9.

@justincely found an online target resolving service that perhaps can help us resolve some of the targets. He made a wrapper function resolve() in the resolve.py module that takes a target name and returns a set of resolved target names.

I've made a notebook that experiments with the target resolver. It appears that only ~20% of the target names can be resolved.

@bourque 20% of the targets is fine - that actually makes a good deal of sense. HST time is very competitive, and duplications need to be well justified. So it's definitely going to be in the minority when a target is observed twice with different names.

I created a dictionary in utils.targname_dict that stores some manually resolved targets. This dictionary was made with two general rules:

  1. Targets with hyphens (true hyphens, not dashes that indicate "negative") had the hyphens removed.
  2. Targets that contained COPY, REPEAT, etc., or were numbered sequentially (e.g. JUPITER-NORTH1, JUPITER-NORTH2, etc.) were changed to just the nominal target name.

For targets that do not need to be resolved, the dictionary values are blank (e.g. ''). This way, one can diff the TARGNAMEs in future database instances against the dictionary keys to see which targets still need to be added.
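As a rough illustration of that diff, assuming the dictionary maps each known TARGNAME to either a resolved name or a blank string: the function names and the sample entries below are hypothetical, and the real contents of utils.targname_dict may differ.

```python
def unreviewed_targnames(database_targnames, targname_dict):
    """TARGNAMEs present in the database but not yet keys of the dictionary."""
    return sorted(set(database_targnames) - set(targname_dict))


def resolved_name(targname, targname_dict):
    """Use the dictionary value unless it is blank or the key is missing."""
    return targname_dict.get(targname) or targname


# Hypothetical stand-in for the manually curated dictionary
targname_dict = {
    'GD-71': 'GD71',                  # hyphen removed
    'JUPITER-NORTH1': 'JUPITER-NORTH',  # sequential number dropped
    'GD71': '',                       # blank: already the nominal name
}

# Hypothetical TARGNAMEs pulled from a later database instance
database_targnames = ['GD71', 'GD-71', 'JUPITER-NORTH1', 'NEW-TARGET', 'GD71']

print(unreviewed_targnames(database_targnames, targname_dict))
print(resolved_name('GD-71', targname_dict))
```

Storing a blank value for names that were reviewed and left alone is what lets the set difference distinguish "checked, no change needed" from "never looked at".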

@justincely found the resolver that MAST uses: http://mastresolver.stsci.edu/Santa-war/. This could help us resolve targets even further.