NCEAS/metajam

Handle Dryad URLs


Because of the way Dryad data URLs are constructed, they currently don't work with our function. check_version ends up searching for nonsensical results because it keeps chunking the URL and eventually looks for anything that matches 1. I've changed the stopping condition to nchar(pid) > 5 (instead of 0) to account for this to some extent (commit 4163fb9).
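To make the failure mode concrete, here is a minimal sketch of the chunking described above. It assumes each step drops everything up to and including the next "/" or "="; the actual pattern used inside check_version may differ. candidate_chunks and min_chars are hypothetical names used only for illustration.

```r
# Hypothetical helper (not part of metajam): list the successively shorter
# candidate identifiers produced by stripping one leading segment at a time.
candidate_chunks <- function(pid, min_chars = 0) {
  chunks <- character(0)
  while (nchar(pid) > min_chars) {
    chunks <- c(chunks, pid)
    # Drop the leading segment and its trailing delimiter(s).
    # Every non-empty string loses at least one character, so the loop ends.
    pid <- sub("^[/=]*[^/=]*[/=]*", "", pid)
  }
  chunks
}

url <- "https://datadryad.org/bitstream/handle/10255/dryad.181477/experiement1.txt?sequence=1"

# With the original stopping condition (0), the last candidate is just "1"
tail(candidate_chunks(url, min_chars = 0), 1)

# With the new stopping condition (5), chunking stops before reaching "1"
tail(candidate_chunks(url, min_chars = 5), 1)
```

Under these assumptions, the threshold of 5 only prevents the very short trailing chunk from being searched; it does not make any of the remaining chunks a valid DataONE identifier.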

Not sure what the logic of Dryad URLs is, so more investigation is needed!

download_d1_data("https://datadryad.org/bitstream/handle/10255/dryad.181477/experiement1.txt?sequence=1", ".")

For some related issues on the structure of Dryad identifiers in DataONE, see https://redmine.dataone.org/issues/7896

@brunj7 what is the origin of the URL in the above example from @isteves ? It doesn't look like a DataONE Dryad identifier or a DataONE URL. The changes that we discussed to make check_version more efficient would only work for DataONE identifiers or DataONE URLs.

@gothub sorry for the confusion. The idea is that scientists could also go to each data repository and get the URL from there. The KNB call check_version("https://knb.ecoinformatics.org/knb/d1/mn/v2/object/msleckman.40.1") seems to conform to what we discussed, but we should also handle PASTA URLs such as check_version("https://pasta.lternet.edu/package/data/eml/edi/195/2/51abf1c7a36a33a2a8bb05ccbf8c81c6").

The Dryad URL comes from this package https://datadryad.org/resource/doi:10.5061/dryad.7ns4pk2 for the dataset experiment_1.txt. It seems that https://datadryad.org/bitstream/handle/10255/dryad.181477/experiement1.txt will also resolve, and if I search for dryad.181477 on their repo I find the corresponding data package, so this is most likely their internal identifier?

Side note: when I search on DataONE for this DOI (10.5061/dryad.7ns4pk2) I get 5 hits, most likely related to the problem Matt mentioned, but if I search for the Dryad dataset identifier (dryad.181477) I get 0 hits.

So we might have to understand the logic behind Dryad URLs if we want to support them.

Here is the corresponding DataONE URL for the above Dryad id: https://cn.dataone.org/cn/v2/resolve/https://doi.org/10.5061/dryad.7ns4pk2/1/bitstream

@gothub following our discussion, I think it would make sense to add a rule that prioritizes DataONE URLs and falls back to the current system only if that fails, to make the function more efficient.
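A rough sketch of what such a rule could look like, using only base R. It assumes that DataONE URLs can be recognized by a /v1/ or /v2/ path segment followed by object/ or resolve/, as in the KNB and cn.dataone.org examples in this thread; other DataONE URL shapes are not covered. extract_d1_pid is a hypothetical name, not an existing metajam function.

```r
# Hypothetical helper (not part of metajam): return the identifier embedded
# in a DataONE-style URL, or NA if the URL does not look like one.
extract_d1_pid <- function(url) {
  # Assumed pattern: ".../v1|v2/object/<pid>" or ".../v1|v2/resolve/<pid>"
  pattern <- "^.*?/v[12]/(object|resolve)/"
  if (!grepl(pattern, url, perl = TRUE)) {
    return(NA_character_)
  }
  # Identifiers may be percent-encoded inside the URL
  utils::URLdecode(sub(pattern, "", url, perl = TRUE))
}

# KNB member node URL: the identifier is recovered in one step
extract_d1_pid("https://knb.ecoinformatics.org/knb/d1/mn/v2/object/msleckman.40.1")

# PASTA URL: no DataONE pattern, so the caller would fall back to chunking
extract_d1_pid("https://pasta.lternet.edu/package/data/eml/edi/195/2/51abf1c7a36a33a2a8bb05ccbf8c81c6")
```

The caller would try this first and run the existing chunking search only when the result is NA, so well-formed DataONE URLs avoid the repeated queries.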

That said, this does not solve the problem of mapping Dryad URLs to their corresponding DataONE URLs.