resource.projectTitle.controlled
gothub opened this issue · 16 comments
Description
Check if DOE Project name associated with the data package comes from controlled list.
Priority
- ESSDIVE: Required
Issues
- List of issues to be resolved
Procedure
Get project name from metadata and check that the project name is included within a controlled list.
Initially this controlled list will be provided from https://data.ess-dive.lbl.gov/js/themes/ess-dive/data/projects.json, but in the future the controlled list check could be performed by calling an API provided by ESS-DIVE.
The check will fail if an exact match is not found for the project name (title).
Requested check messages
On failure: "Warning. The DOE project name listed is not from the controlled list of projects. When entering project name, use the autocomplete feature to choose from the existing projects. If you can not find your project name, try entering the PI name."
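The procedure above can be sketched roughly as follows. This is a minimal illustration, not the check as implemented: the function name `check_project_title`, the `SUCCESS`/`FAILURE` status strings, and the wording of the success message are assumptions, and the controlled titles are assumed to have already been extracted from projects.json into a plain list.

```python
# Failure text copied from the "Requested check messages" section.
FAILURE_MESSAGE = (
    "Warning. The DOE project name listed is not from the controlled list of "
    "projects. When entering project name, use the autocomplete feature to "
    "choose from the existing projects. If you can not find your project "
    "name, try entering the PI name."
)

# Assumed wording; the issue only specifies the failure message at this point.
SUCCESS_MESSAGE = "The project title was found in the list of known project titles."


def check_project_title(title, controlled_titles):
    """Return (status, message); passes only on an exact, case-sensitive match."""
    if title in set(controlled_titles):
        return "SUCCESS", SUCCESS_MESSAGE
    return "FAILURE", FAILURE_MESSAGE


if __name__ == "__main__":
    controlled = [
        "ExaSheds",
        "Free Air CO2 Enrichment Model Data Synthesis (FACE-MDS)",
    ]
    print(check_project_title("ExaSheds", controlled))
    print(check_project_title("exasheds", controlled))
```

Because the comparison is exact, differences in case, punctuation, or trailing text (such as an appended PI name) all cause the check to fail.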
Note that this check is referenced in this ESS-DIVE issue.
It is also mentioned in this ESS-DIVE issue.
This check has been renamed to resource.projectTitle.controlled, as the XML element being tested is /eml/dataset/project/title.
@vchendrix The R package you suggested appears to work well for fuzzy matches between /eml/dataset/project/title and the projectTitle entries in https://data.ess-dive.lbl.gov/js/themes/ess-dive/data/projects.json.
The check will find the closest match. If the match is exact, the output message will be:
The project title was found in the list of known project titles.
If an exact match was not found, the output message will be:
The project title was not found in the list of known project titles.
The closest match was "<insert closest match here>".
The closest match found can be wildly different from the submitted title, depending on the length of the title and whether anything similar actually exists in the list.
Here are a few titles obtained from ESS-DIVE Solr, each with its closest match in the controlled list and the calculated values:
Project Title: WHONDRS
Closest dist: 8, closest match: ExaSheds
percent diff: 1.142857
Project Title: Trace Metal Dynamics and Limitations on Biogeochemical Cycling in Wetland Soils and Hyporheic Zones, PI Jeffrey G. Catalano
Closest dist: 24, closest match: Trace Metal Dynamics and Limitations on Biogeochemical Cycling in Wetland Soils and Hyporheic Zones
percent diff: 0.195122
Project Title: SPRUCE
Closest dist: 8, closest match: ExaSheds
percent diff: 1.333333
Project Title: Free-Air CO2 Enrichment Model Data Synthesis
Closest dist: 12, closest match: Free Air CO2 Enrichment Model Data Synthesis (FACE-MDS)
percent diff: 0.272727
Please let me know if this is sufficient, or if you have any ideas on what to print or calculate when there is not a close match.
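The values above are calculated as follows (note that "percent diff" is a ratio, not a percentage):
library(stringdist)  # provides stringdist()
# Levenshtein distance, normalized by the length of the submitted title
dist <- stringdist(projectTitles[[iproj]], controlledProjectTitles[[ictrl]], method = "lv")
percentDiff <- dist / nchar(projectTitles[[iproj]])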
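For readers without R at hand, the same calculation can be sketched in Python using only the standard library. This is an illustrative re-implementation, not the script used in this thread: `levenshtein` and `closest_match` are hypothetical names, and ties between equally close titles are resolved by list order.

```python
def levenshtein(a, b):
    """Levenshtein (edit) distance, keeping only two rows of the DP table."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (free if the chars match)
            ))
        previous = current
    return previous[-1]


def closest_match(title, controlled_titles):
    """Return (closest title, distance, distance / len(title)).

    Assumes a non-empty title and a non-empty controlled list.
    """
    best = min(controlled_titles, key=lambda c: levenshtein(title, c))
    dist = levenshtein(title, best)
    return best, dist, dist / len(title)


if __name__ == "__main__":
    controlled = [
        "ExaSheds",
        "Free Air CO2 Enrichment Model Data Synthesis (FACE-MDS)",
    ]
    for title in ("WHONDRS", "Free-Air CO2 Enrichment Model Data Synthesis"):
        print(title, "->", closest_match(title, controlled))
```

Dividing by the length of the submitted title means the ratio can exceed 1 whenever the nearest controlled title is longer than the submitted one and shares almost nothing with it, as with the short acronyms in the samples above.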
We may want to have a cutoff on the percent diff. For example, if it is >= 1, then there is no useful match? Otherwise, this looks good to me!
👍 Agreed on the percent match cutoff. That makes sense if we find a good cutoff level. Something like 0.5 or 0.75 might produce fewer spurious suggestions.
Yes, 0.5 - 0.75 cutoffs seem more reasonable. Can you try the 0.5 or so cutoff and see what happens there?
Here are some results. A tolerance of .70 seems good, as there are fewer 'no matches', and the 'close matches' that are obviously not the intended title are few, at least for the number of titles currently present. This fuzzy matching is character based, and I'm realizing it would be much better if it were token based. Anyway, here are the counts for the different tolerances (cutoffs):
tolerance: .75
Exact matches: 39
Close matches (within tolerance of 0.750000 percent): 65
No matches: 18
tolerance: .70
Exact matches: 39
Close matches (within tolerance of 0.700000 percent): 41
No matches: 42
tolerance: .60
Exact matches: 39
Close matches (within tolerance of 0.600000 percent): 25
No matches: 58
tolerance: .50
Exact matches: 39
Close matches (within tolerance of 0.500000 percent): 13
No matches: 70
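The tallies above can be reproduced in outline with a sketch like the following. It assumes that a "close match" means a nonzero percent diff less than or equal to the tolerance (the thread does not say whether the comparison is strict), and the names `classify` and `tally` are hypothetical. The sample inputs are the four percent diff values posted earlier in this thread plus one exact match, not the full set of 122 titles.

```python
from collections import Counter


def classify(percent_diff, tolerance):
    """Bucket one title by its normalized edit distance to the closest match."""
    if percent_diff == 0:
        return "exact"  # distance 0 means the title is in the controlled list
    if percent_diff <= tolerance:  # assumed inclusive cutoff
        return "close"
    return "none"


def tally(percent_diffs, tolerances):
    """Count exact / close / none for each tolerance."""
    return {
        tol: Counter(classify(d, tol) for d in percent_diffs)
        for tol in tolerances
    }


if __name__ == "__main__":
    # 0.0 stands in for an exact match; the rest are the posted sample values.
    samples = [0.0, 1.142857, 0.195122, 1.333333, 0.272727]
    for tol, counts in tally(samples, [0.75, 0.70, 0.60, 0.50]).items():
        print(f"tolerance: {tol:.2f}")
        print(f"  Exact matches: {counts['exact']}")
        print(f"  Close matches: {counts['close']}")
        print(f"  No matches: {counts['none']}")
```

The exact-match count does not depend on the tolerance, which is why it stays at 39 in every run above; only the split between close matches and no matches moves.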
I can provide the raw results if desired.
I think it would be helpful to see the results for those four cases, so that several people can judge whether the "close match" list is reasonable or includes outliers (i.e., false positives), and whether the "No match" list contains titles that should have been a close match (i.e., false negatives). The threshold is really about minimizing both false negatives and false positives.
For those of you who would like to view the sample results from the test script, here they are.
The diffTolerance*.lis files are the debug output of the R program runs that compare project titles obtained from the ESS-DIVE Solr service against the controlled list.
The controlledProjectTitles.lis file is the list of project titles obtained from the projects.json controlled list.
We decided to stick with exact matches for now, and to add instructions on how to find the appropriate project name in our UI. We do not have a suitable public-facing project list other than the autocomplete feature in our data submission UI.
@gothub The message for now is:
On failure: "Warning. The DOE project name listed is not from the controlled list of projects. When entering project name, use the autocomplete feature to choose from the existing projects. If you can not find your project name, try entering the PI name."
Regarding the ESS-DIVE project list located at https://data.ess-dive.lbl.gov/js/themes/ess-dive/data/projects.json - add a mechanism to periodically (for example, daily) refresh the copy of this file that is stored locally in metadig-engine. This may have to be implemented at the metadig-engine level and not the check level.
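One way such a refresh could work is sketched below. This is only an illustration under stated assumptions, not how metadig-engine does or will do it: the function names, the 24-hour age limit, and the use of the file's modification time as the freshness marker are all hypothetical. The URL is the one given in this thread and may not resolve.

```python
import os
import time
import urllib.request

# URL as given in this issue; treat it as a placeholder if it does not resolve.
PROJECTS_URL = "https://data.ess-dive.lbl.gov/js/themes/ess-dive/data/projects.json"
MAX_AGE_SECONDS = 24 * 60 * 60  # "daily", as suggested above


def is_stale(path, max_age=MAX_AGE_SECONDS, now=None):
    """True if the local copy is missing or older than max_age seconds."""
    if not os.path.exists(path):
        return True
    now = time.time() if now is None else now
    return now - os.path.getmtime(path) > max_age


def download(url):
    """Fetch the raw bytes at url."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def refresh_if_stale(path, url=PROJECTS_URL, fetch=download, max_age=MAX_AGE_SECONDS):
    """Replace the local copy when it is stale; return True if it was refreshed.

    A failed download raises before the existing copy is touched, so the
    previous list stays in place for the check to use.
    """
    if not is_stale(path, max_age):
        return False
    data = fetch(url)
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as handle:
        handle.write(data)
    os.replace(tmp_path, path)  # swap in the new file in one step
    return True
```

Passing `fetch` as a parameter keeps the staleness logic testable without network access, and the same function could be called from a scheduler or lazily before each check run.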
This check is now in the ESS-DIVE 1.1.0 suite.
As discussed in #440, it might be best to pull from an external source as opposed to metadig/data so that the file can more easily be updated.
@vchendrix or @mburrus can you point me to a stable location of that file hosted somewhere? Is https://data.ess-dive.lbl.gov/js/themes/ess-dive/data/projects.json best? (Note: that URL doesn't actually resolve for me)
@jeanetteclark See issue #438, which provides the new request URI to update the service. I think we should close this issue, and plan the new work over in #438 and #440.