tdwg/rs.tdwg.org

Harvesting DwC terms (all of them)


Hi,

I am creating a tool in which I want to dynamically harvest the Darwin Core terms, their descriptions, and so on. I want to pull these from a robust, reliable source that is unlikely to change over time.

I am told that this is the source upon which everything else is built:
https://github.com/tdwg/rs.tdwg.org/blob/master/terms/terms.csv

To pull this CSV into Python, I can do something like this:

import pandas as pd

# Read the raw CSV straight from GitHub
url = 'https://raw.githubusercontent.com/tdwg/rs.tdwg.org/master/terms/terms.csv'
df = pd.read_csv(url)

An issue with this is that it sometimes runs very quickly and sometimes takes far too long, and I haven't gotten to the bottom of why. However, I have noticed that it runs very quickly (as expected) if I have the URL open in my browser, and very slowly (or not at all) if I don't. Maybe something to do with caching?
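One way to narrow this down would be to time the raw HTTP fetch separately from pandas' own fetch-and-parse. A minimal sketch, using the same terms.csv URL as above:

import time

import pandas as pd
import requests

url = 'https://raw.githubusercontent.com/tdwg/rs.tdwg.org/master/terms/terms.csv'

# Time the HTTP fetch on its own
t0 = time.perf_counter()
response = requests.get(url)
print(f'fetch: {time.perf_counter() - t0:.2f}s, {len(response.content)} bytes')

# Time pandas' combined fetch-and-parse for comparison
t0 = time.perf_counter()
df = pd.read_csv(url)
print(f'read_csv: {time.perf_counter() - t0:.2f}s, {len(df)} rows')

If the plain requests call is consistently fast while read_csv is the one that stalls, that would point at how pandas fetches the URL rather than at the server.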

Regardless, this is not ideal.

I think the source should be stored somewhere else in a more easily machine-readable format. For example, as you already provide here: https://rs.gbif.org/core/dwc_event_2022-02-02.xml

import pandas as pd

# Read the GBIF Event Core definition XML
url = 'https://rs.gbif.org/core/dwc_event_2022-02-02.xml'
df = pd.read_xml(url)

But instead, a version like this that covers all of the terms.

Is this possible? Am I overlooking a more suitable option that is already available?

Thanks

I think the CSV is very machine-readable and nice. How about you clone the repo and use the file locally?
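For completeness, a sketch of that approach (the local path assumes the clone sits next to your script):

# First, outside Python:
#   git clone https://github.com/tdwg/rs.tdwg.org.git

import pandas as pd

df = pd.read_csv('rs.tdwg.org/terms/terms.csv')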

You are correct that that CSV is the source from which all the other serializations are generated, so loading it with pandas as you suggested is probably the most straightforward way to access the data. I frequently read CSVs from GitHub using pandas and haven't noticed a problem like the one you describe. It doesn't seem like browser caching would affect your use in Python.

I have two suggestions. One would be to use caching directly in Python. I have used the requests_cache module successfully: it caches repeated HTTP requests (at least those made with the requests library; I'm not sure whether pandas' read_csv goes through requests), and you can set the cache expiration time to whatever you want. Here's code I've used:

import requests_cache

# Cache responses in a local SQLite file and expire them after 300 seconds
requests_cache.install_cache(
    'wqs_cache',
    backend='sqlite',
    expire_after=300,
    allowable_methods=['GET', 'POST'],
)
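If read_csv doesn't go through requests (I believe pandas fetches URLs with urllib internally, but treat that as an assumption), a workaround is to do the GET yourself, where the cache does apply, and hand the text to pandas. A sketch (the cache name is arbitrary):

import io

import pandas as pd
import requests
import requests_cache

requests_cache.install_cache('dwc_cache', backend='sqlite', expire_after=300)

url = 'https://raw.githubusercontent.com/tdwg/rs.tdwg.org/master/terms/terms.csv'

# Repeat calls within 300 seconds are served from the local SQLite cache
response = requests.get(url)
df = pd.read_csv(io.StringIO(response.text))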

The second suggestion is to just use the machine-readable RDF that is made available for this purpose. It's available as RDF/XML and Turtle; you can read more about that here. Of course, you'd have to use one of the Python RDF libraries to get what you want from the triples you acquire. But those files are cached for, I think, 30 days, so after the first time you retrieve them they will come from the rs.tdwg.org server's cache very quickly.
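Something like this with rdflib, for example. Check the documentation for the exact term-list IRI and predicates to use; the sketch below assumes Turtle is served via content negotiation and that definitions are carried by rdfs:comment:

import requests
from rdflib import Graph, RDFS

# Assumed IRI: rs.tdwg.org content-negotiates Turtle for the term list
iri = 'http://rs.tdwg.org/dwc/terms/'
response = requests.get(iri, headers={'Accept': 'text/turtle'})

g = Graph()
g.parse(data=response.text, format='turtle')

# Assumes term definitions are carried by rdfs:comment
for term, definition in g.subject_objects(RDFS.comment):
    print(term, definition)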

I think the CSV is very machine-readable and nice. How about you clone the repo and use the file locally?

This would not be suitable in my case. I want to be able to pull the latest version of the DwC terms each time the code is run. I don't want to have to pull the repo each time, or intermittently, because the version of the DwC terms I am using will become outdated if this is not carefully monitored.

@baskaufs I will try again, thanks for the tips.

I found a solution. Perhaps it was a bug in the version of pandas that I was using. See the answer here:

https://stackoverflow.com/a/75601780/14125020

Full working code for transparency:

import pandas as pd
import requests
import io

url = 'https://raw.githubusercontent.com/tdwg/rs.tdwg.org/master/terms/terms.csv'

# Fetch with requests, then hand the decoded text to pandas
response = requests.get(url)
df = pd.read_csv(io.StringIO(response.content.decode('utf-8')))

An issue has been submitted to pandas:
pandas-dev/pandas#51711

I was too quick to assume that this issue was resolved; the code above is still not solving my problem. However, I don't think the issue has anything to do with this specific CSV. Perhaps it is specific to my setup, or a bug in pandas and/or requests. I will close this issue again now. Thanks for your time.
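In case it helps anyone else who hits the same symptom: giving the request an explicit timeout and a few retries would at least make a stalled connection fail fast instead of hanging indefinitely. A sketch (the timeout and retry values are arbitrary):

import io

import pandas as pd
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

url = 'https://raw.githubusercontent.com/tdwg/rs.tdwg.org/master/terms/terms.csv'

# Retry up to 3 times, with backoff, on connection-level failures
session = requests.Session()
session.mount('https://', HTTPAdapter(max_retries=Retry(total=3, backoff_factor=1)))

# Give up after 10 seconds instead of hanging
response = session.get(url, timeout=10)
response.raise_for_status()
df = pd.read_csv(io.StringIO(response.text))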