AtlasOfLivingAustralia/galah-python

is API access with JWT tokens working yet?

Closed this issue · 11 comments

Hey acbuyan,

I have been silent for a while because our project was interrupted for a few months, but I am now back to working on an automated download of the BioCollect data for a specific project, including all available fields as well as the media (photos of the koalas).

First I generated a list of all field names (n=760) to use instead of the "basic" setting when downloading atlas occurrences for my research UID.
After receiving the error "Expecting value: line 1 column 1 (char 0)", a JSON decode error raised by the request call in atlas_occurrences.py, I added a line to that Python module to print the raw server response.
That showed me that querying all 760 fields produced a URL too long for the server to handle (HTTP status 414).
I therefore wrote a loop around the call to download the fields in chunks and stitch the data back together. This, however, caused a new error (now with status: in Queue), thrown at this line:
l 311: return pd.read_csv(zipfile.ZipFile(io.BytesIO(zipURL.content)).open('data.csv'),low_memory=False)
Apparently this was a pandas data error: "No columns to parse from file".
Along with this I received an automatic email from ALA Support saying "Occurrence download failed - data". I replied to it but haven't heard back yet.
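For readers hitting the same 414 error, the chunk-and-stitch loop described above can be sketched roughly as follows. This is only an illustration, not the code used in this thread: `fetch` is a hypothetical callable standing in for whatever performs the actual download (for example a wrapper around atlas_occurrences that converts the result to one dict per record), and using "recordID" as the join key is an assumption.

```python
from typing import Callable, Dict, Iterator, List, Sequence

Row = Dict[str, str]


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `size` entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def download_in_chunks(
    fields: Sequence[str],
    fetch: Callable[[List[str]], List[Row]],
    key: str = "recordID",  # assumed join key; adjust to your data
    chunk_size: int = 50,
) -> List[Row]:
    """Request `fields` in batches and merge the returned rows on `key`.

    `fetch` is hypothetical: it takes a list of field names and returns
    one dict per record containing those fields.
    """
    wanted = [f for f in fields if f != key]
    merged: Dict[str, Row] = {}
    for batch in chunked(wanted, chunk_size):
        # Always request the key so rows from different batches can be matched.
        for row in fetch([key] + batch):
            merged.setdefault(row[key], {}).update(row)
    return list(merged.values())
```

Keeping the key field in every batch is what makes the stitching safe; relying on row order alone would break if the server returned records in a different order between requests. If the batches arrive as pandas DataFrames instead, the same idea applies with a merge on the key column.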

When I read through the ALA API docs, I found that the API I (likely) need in order to access all occurrence data for that project may be protected: /occurrences/{recordUuid} (retrieve full record details).
I am unsure if this is the correct API though.
Yet it seems that JWT token access to protected APIs is included in galah-python but not yet functional, judging from the galah_config.py and generate_jwt_token.py scripts?

I have already applied and received a ClientID and Secret and would love to see if this allows me to get the required data of the project...
You have already done such incredible work with this package that I would rather ask how far JWT token support has been implemented so far, and whether this is the way to solve my access and download issues, which reach far beyond the "basic" field list.
I am not a software developer and have no experience with APIs; I am more of a data analyst. In this case, however, I can't reach the data I need to analyse, haha, so if there is any way I could help with this issue or with further development, please let me know.

Hope you can point me in the correct direction, happy to give more specific project details if needed.

Thanks again!
Jojo

Hi Jojo,

Good to hear from you, and great to hear that you continue to use galah-python! It makes my day :)

First things first: there have been a couple of updates to galah-python; have you updated to the latest version? I don't think this will solve your issue, but I have fixed some bugs along the way and added other options too.

As for your question on JWT tokens, we’re holding off on JWT token integration until we deploy Cognito. That should be soon, but we’re not sure. In addition, JWT tokens are only for access to sensitive data, and only to data that you have personally been approved to receive. If these conditions do not apply, then JWT tokens are not what's causing your problem.

Unfortunately, galah-python doesn’t support BioCollect APIs, so you can only use galah to get occurrences once they have been passed to biocache.

I may have asked you this before, but what is the reason that you are downloading all fields? It is unfortunately never a good idea to download all of the ALA fields, as they include spatial data that cannot be relevant to every application. They also include Darwin Core fields that are rarely populated and may not be useful to you. If you know which data fields you need, you can pass them as an argument to atlas_occurrences.

Hi again Jojo,

I get what you mean about the headings. If you're not overly familiar with the concept of Darwin Core (https://dwc.tdwg.org/) headings, it can feel overwhelming.

Out of curiosity, have you tried the search_all function in galah? It could help you narrow down your field search. You can use show_all to show all the terms, and search_all to find terms with keywords in either the term (id) itself or the description. It looks like this:

                    id                                     description   type link
0        _nest_parent_                                             NaN  field  NaN
1          _nest_path_                                             NaN  field  NaN
2               _root_                                             NaN  field  NaN
3       abcdTypeStatus                   ABCD field in use by herbaria  field  NaN
4    acceptedNameUsage  http://rs.tdwg.org/dwc/terms/acceptedNameUsage  field  NaN
..                 ...                                             ...    ...  ...
755  multimediaLicence                              Media filter field  media     
756             images                              Media filter field  media     
757             videos                              Media filter field  media     
758             sounds                              Media filter field  media     
759                qid                Reference to pre-generated query  other     

[760 rows x 4 columns]
>>> galah.search_all(fields="latitude")
                    id                                        description   type                                               link
0      decimalLatitude  The decimal latitude associated with this reco...  field  https://github.com/AtlasOfLivingAustralia/ala-...
1     verbatimLatitude      http://rs.tdwg.org/dwc/terms/verbatimLatitude  field                                                NaN
2  raw_decimalLatitude  The decimal latitude as supplied by the data p...  field                                                NaN

As for getting the data, I don't think the problem was with what you were doing; I think it was with how galah-python was constructing the URLs. I've patched it and pushed my changes to the Python Package Index, so all you should need to do is update your installation and it should work.

Yes - do you mind double-checking what version of galah-python you have? It should be 0.8.3. If it is that version and it is still not working, then I'll have to go back to the drawing board.

Ok, this might sound weird, but ... try putting decimalLatitude at the beginning of your list of terms and see if that works
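A minimal sketch of that reordering, for anyone applying the same workaround to a long field list. The helper name `with_field_first` is hypothetical and not part of galah-python; it only rearranges a plain Python list before it is passed on.

```python
from typing import List, Sequence


def with_field_first(fields: Sequence[str], first: str = "decimalLatitude") -> List[str]:
    """Return a copy of `fields` with `first` at index 0.

    The field is added if it was missing, and any later occurrence of it
    is dropped so it appears exactly once.
    """
    rest = [f for f in fields if f != first]
    return [first] + rest
```

For example, `with_field_first(["eventDate", "decimalLatitude", "scientificName"])` returns `["decimalLatitude", "eventDate", "scientificName"]`.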

That works!
Interesting fix :)

What did that change?

I don't know - I'll ask the systems team and see why.

Thank you!

Hi Jojo,
I wanted to let you know that the problem is on the ALA side of things. I'm not sure when a fix is coming, so for now keep using the decimalLatitude fix. I'm going to close the issue for now, but do let me know if anything else comes up!