AtlasOfLivingAustralia/galah-python

atlas_media() MissingSchema error when collect=True

JojoReikun opened this issue · 10 comments

Hey @acbuyan,

I have recently progressed with the script I'm writing for automatic data download from an ALA resource stemming from a citizen science project.
I need to download the media as well as the occurrence data, but when testing a download of records added since the last download, the atlas_media() function raised an error from an internal module:

A few images run successfully [...], but then the error occurs, see console output below.
It doesn't seem to be caught by your check in atlas_media() for collect=True, where you test whether the image was retrieved?

Console output:

`running Koala Watch Dashboard Script...
downloading RCC data since last download.
last_download_date: 2023-07-10T14:23:19Z
['dataResourceUid==dr19064', 'eventDate>=2023-07-10T14:23:19Z']
koala_watch_occurrences_count: totalRecords
0 73
URL for querying:

https://biocache-ws.ala.org.au/ws/occurrences/offline/download?email=jojo.schultz%40outlook.de&dwcHeaders=True&reasonTypeId=4&emailNotify=false&disableAllQualityfilters=true&fields=basisOfRecord%2CcatalogNumber%2Ccl1048%2Ccl21%2Ccl959%2Cclass%2CcollectionCode%2CcollectionName%2CcollectionUid%2CcoordinatePrecision%2CcoordinateUncertaintyInMeters%2Ccountry%2CdataGeneralizations%2CdataResourceUid%2Cday%2Cdcterms:license%2CdecimalLatitude%2CdecimalLongitude%2CeventDate%2Cfamily%2Cgenus%2CindividualCount%2CinformationWithheld%2CinstitutionCode%2CinstitutionName%2CinstitutionUid%2Ckingdom%2Clocality%2CmaximumDepthInMeters%2CmaximumElevationInMeters%2CminimumDepthInMeters%2CminimumElevationInMeters%2Cmonth%2CoccurrenceStatus%2Corder%2Cphylum%2Cpreparations%2Craw_sex%2Craw_vernacularName%2CrecordID%2CrecordedBy%2CscientificName%2CspatiallyValid%2Cspecies%2CstateProvince%2Csubspecies%2CtaxonConceptID%2CtaxonRank%2CverbatimBasisOfRecord%2CverbatimCoordinateSystem%2CverbatimLatitude%2CverbatimLongitude%2CverbatimScientificName%2CvernacularName%2Cyear&&fq=%28lsid%3Ahttps%3A//biodiversity.org.au/afd/taxa/e9d6fbbd-1505-4073-990a-dc66c930dad6%29%20AND%20%28%28dataResourceUid%3A%22dr19064%22%29%20AND%20%28eventDate%3A%5B2023-07-10T14:23:19Z%20TO%20%2A%5d%29%29&qa=none&

Data for download:

https://biocache.ala.org.au/biocache-download/7d58aab5-1e0f-384d-ba9a-f632fb210b51/1691384065893/data.zip

size of dataframe: (73, 53)
Do you want to download the media files (photos etc.)? (y/n): y

[...]

URL for querying:

https://images.ala.org.au/ws/image/a6177554-e6d0-43c7-8ab9-b279aadec230

Traceback (most recent call last):
File "D:\Jojo\DDC\KoalaWatchDashboard\KoalaDashboardScript\KoalaWatch_main_.py", line 14, in <module>
main()
File "D:\Jojo\DDC\KoalaWatchDashboard\KoalaDashboardScript\KoalaWatch_main_.py", line 9, in main
download_ala_data()
File "D:\Jojo\DDC\KoalaWatchDashboard\KoalaDashboardScript\KoalaWatch\operations\galah_data_download.py", line 304, in download_ala_data
subsequent_download_ala_data(collect_bool=False)
File "D:\Jojo\DDC\KoalaWatchDashboard\KoalaDashboardScript\KoalaWatch\operations\galah_data_download.py", line 246, in subsequent_download_ala_data
df_occurrences_media = galah.atlas_media(taxa="Phascolarctos cinereus", filters=filter_date_new,
File "C:\Users\JojoS\Miniconda3\envs\KoalaDashboardScript\lib\site-packages\galah\atlas_media.py", line 230, in atlas_media
response = requests.get(image,stream=True)
File "C:\Users\JojoS\Miniconda3\envs\KoalaDashboardScript\lib\site-packages\requests\api.py", line 73, in get
return request("get", url, params=params, **kwargs)
File "C:\Users\JojoS\Miniconda3\envs\KoalaDashboardScript\lib\site-packages\requests\api.py", line 59, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\JojoS\Miniconda3\envs\KoalaDashboardScript\lib\site-packages\requests\sessions.py", line 573, in request
prep = self.prepare_request(req)
File "C:\Users\JojoS\Miniconda3\envs\KoalaDashboardScript\lib\site-packages\requests\sessions.py", line 484, in prepare_request
p.prepare(
File "C:\Users\JojoS\Miniconda3\envs\KoalaDashboardScript\lib\site-packages\requests\models.py", line 368, in prepare
self.prepare_url(url, params)
File "C:\Users\JojoS\Miniconda3\envs\KoalaDashboardScript\lib\site-packages\requests\models.py", line 439, in prepare_url
raise MissingSchema(
requests.exceptions.MissingSchema: Invalid URL '0_IMG_0437.JPG': No scheme supplied. Perhaps you meant https://0_IMG_0437.JPG?

Process finished with exit code 1
`
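For what it's worth, the string in the last traceback line is a bare file name rather than a URL, which is why it is rejected before any network call is made. Below is a minimal standard-library sketch of that check. The helper name `has_http_scheme` is made up for illustration and is not part of galah or requests.

```python
from urllib.parse import urlparse


def has_http_scheme(candidate: str) -> bool:
    """Return True if the string parses as an http(s) URL with a host."""
    parts = urlparse(candidate)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


# The value from the traceback is a file name, so it has no scheme.
print(has_http_scheme("0_IMG_0437.JPG"))
# The metadata URL printed by verbose=True is a proper URL.
print(has_http_scheme("https://images.ala.org.au/ws/image/a6177554-e6d0-43c7-8ab9-b279aadec230"))
```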

Do you know about this issue? Might it be because image naming could differ in a citizen science project?
I'm not sure if this error is something you can fix, or might result from the ALA server itself.
Any pointer is greatly appreciated.

Kind regards,
Jojo

Hi @JojoReikun, thanks as always for your feedback! Could you provide the code you are using to collect the media from atlas_media()?

Hey @acbuyan,

That would have been helpful hey, here it is:
filter_date_new = ["dataResourceUid==dr19064", "eventDate>=" + str(last_download_date)]
df_occurrences_media = galah.atlas_media(taxa="Phascolarctos cinereus", filters=filter_date_new, verbose=True, collect=collect_bool, path="redlands_media")

and one more question: what is the download date?

The last download date I used that threw the error was the 10th of July 2023

hm....one other question (because I'm not sure why this is happening, to be honest): does your download script rename anything? Or does it just call galah?

it calls the function pretty much immediately.
I was wondering whether it is a problem related to the file naming of the specific photo?
For the 20 or so photos before the error, the URL printed by verbose works without a problem, yet the media folder remains completely empty.
Would there be an option to skip these unsuccessful url requests if we can't find the issue? :)
I'm wondering why it wouldn't be picked up from your check loop.
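A rough sketch of the "skip and carry on" idea, written against the standard library rather than galah itself. `download_all` and its `fetch` parameter are hypothetical names for illustration; nothing here is an existing galah option.

```python
from typing import Callable, Dict, Iterable, List, Tuple
from urllib.request import urlopen


def _default_fetch(url: str) -> bytes:
    # urlopen raises ValueError for strings without a scheme (e.g. a bare
    # file name) and URLError, an OSError subclass, for network failures.
    with urlopen(url, timeout=30) as response:
        return response.read()


def download_all(
    urls: Iterable[str],
    fetch: Callable[[str], bytes] = _default_fetch,
) -> Tuple[Dict[str, bytes], List[Tuple[str, str]]]:
    """Fetch every URL, recording failures instead of stopping at the first one."""
    downloaded: Dict[str, bytes] = {}
    skipped: List[Tuple[str, str]] = []
    for url in urls:
        try:
            downloaded[url] = fetch(url)
        except (ValueError, OSError) as err:
            # Keep the reason so skipped items can be reported afterwards.
            skipped.append((url, f"{type(err).__name__}: {err}"))
    return downloaded, skipped


# A bare file name is rejected locally, so this line needs no network access.
downloaded, skipped = download_all(["0_IMG_0437.JPG"])
print(skipped)
```

If the download goes through requests instead, the same `except` clause should still apply, since `requests.exceptions.MissingSchema` derives from `ValueError`.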

The xml if I click on the link of the image looks like this, maybe that helps you spot anything that could cause the issue?

<map>
<entry key="success">true</entry>
<entry key="imageIdentifier">a6177554-e6d0-43c7-8ab9-b279aadec230</entry>
<entry key="mimeType">image/jpeg</entry>
<entry key="originalFileName">
processed_7972B1C4-7D02-4567-97D8-A1FE04F19601.jpeg
</entry>
<entry key="sizeInBytes">1252606</entry>
<entry key="rights"/>
<entry key="rightsHolder">Scott Bretherton</entry>
<entry key="dateUploaded">2023-07-30 10:25:21</entry>
<entry key="dateTaken">2023-07-30 10:25:21</entry>
<entry key="imageUrl">
https://images.ala.org.au/store/0/3/2/c/a6177554-e6d0-43c7-8ab9-b279aadec230/original
</entry>
<entry key="tileUrlPattern">
https://images.ala.org.au/store/0/3/2/c/a6177554-e6d0-43c7-8ab9-b279aadec230/tms/{z}/{x}/{y}.png
</entry>
<entry key="mmPerPixel"/>
<entry key="height">4032</entry>
<entry key="width">3024</entry>
<entry key="tileZoomLevels">8</entry>
<entry key="description"/>
<entry key="title">
processed_7972B1C4-7D02-4567-97D8-A1FE04F19601.jpeg
</entry>
<entry key="type">StillImage</entry>
<entry key="audience"/>
<entry key="references"/>
<entry key="publisher"/>
<entry key="contributor"/>
<entry key="created"/>
<entry key="source"/>
<entry key="creator">Scott Bretherton</entry>
<entry key="license">http://creativecommons.org/licenses/by/4.0/</entry>
<entry key="recognisedLicence">
<entry key="acronym">CC BY 4.0</entry>
<entry key="name">Creative Commons Attribution (International)</entry>
<entry key="url">https://creativecommons.org/licenses/by/4.0/</entry>
<entry key="imageUrl">https://licensebuttons.net/l/by/4.0/88x31.png</entry>
</entry>
<entry key="dataResourceUid">dr19064</entry>
<entry key="occurrenceID">b19f7027-3cda-4fc8-ad6e-768b9d209734</entry>
</map>
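In case it helps with debugging, here is a small sketch of reading the two relevant fields out of a response like this one with the standard library. It assumes the response is wrapped in a `<map>` root element, as the closing tag suggests, and `top_level_entries` is a hypothetical helper name. Note that `imageUrl` appears twice, once at the top level and once inside `recognisedLicence`, so only direct children are read.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the response above, assuming a <map> root element.
SAMPLE = """<map>
<entry key="imageIdentifier">a6177554-e6d0-43c7-8ab9-b279aadec230</entry>
<entry key="originalFileName">
processed_7972B1C4-7D02-4567-97D8-A1FE04F19601.jpeg
</entry>
<entry key="imageUrl">
https://images.ala.org.au/store/0/3/2/c/a6177554-e6d0-43c7-8ab9-b279aadec230/original
</entry>
<entry key="recognisedLicence">
<entry key="imageUrl">https://licensebuttons.net/l/by/4.0/88x31.png</entry>
</entry>
</map>"""


def top_level_entries(xml_text: str) -> dict:
    """Map each direct child <entry key=...> to its stripped text."""
    root = ET.fromstring(xml_text)
    # findall("entry") only matches direct children, so the nested
    # licence-button imageUrl does not overwrite the real one.
    return {e.get("key"): (e.text or "").strip() for e in root.findall("entry")}


fields = top_level_entries(SAMPLE)
print(fields["originalFileName"])
print(fields["imageUrl"])
```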

I'll try debug this problem myself further too. Should I find anything, I'll let you know.

Cheers,
Jojo

this is a silly question, but did you provide your email before you ran the query?

yup, I use galah_config() defining atlas and email as the first thing :)

I have tried to avoid the buggy photo by manually selecting download windows on eventDate, but other photos then trigger the same error.
I know that the rights holders of these photos submit many data points, so it really seems to happen only on a few occasions.
The photos that trigger the MissingSchema error lie about 5-6 days apart...

ok, this is good information to know. I'll look into it, and it should be fixed in the next release of galah-python. I'm sorry that I'm not sure what to tell you in the meantime.

Update:
Hey @acbuyan:
I have been trying to troubleshoot this a little. Maybe some of this helps. I still haven't been able to find a solution.

Looking closer, it seems I am actually not able to download any media from this data set. It is always the first image URL that gets stuck in the Python requests call. I have added some prints to the atlas_media() function just to see.

URL for querying:
https://images.ala.org.au/ws/image/3968dc89-8daf-44ad-b7db-4addb81bf657
temp_dict: 
 {'success': [True], 'imageIdentifier': ['3968dc89-8daf-44ad-b7db-4addb81bf657'], 'mimeType': ['image/jpeg'], 'originalFileName': ['20210728131206_IMG_4429.JPG'], 'sizeInBytes': [1024005], 'rights': [nan], 'rightsHolder': ['Reagan Bettell'], 'dateUploaded': ['2021-07-28 14:41:59'], 'dateTaken': ['2021-07-28 14:41:59'], 'imageUrl': ['https://images.ala.org.au/store/7/5/6/f/3968dc89-8daf-44ad-b7db-4addb81bf657/original'], 'tileUrlPattern': ['https://images.ala.org.au/store/7/5/6/f/3968dc89-8daf-44ad-b7db-4addb81bf657/tms/{z}/{x}/{y}.png'], 'mmPerPixel': [nan], 'height': [1536], 'width': [2048], 'tileZoomLevels': [5], 'description': [nan], 'title': ['20210728131206_IMG_4429.JPG'], 'type': ['StillImage'], 'audience': [nan], 'references': [nan], 'publisher': [nan], 'contributor': [nan], 'created': [nan], 'source': [nan], 'creator': ['Reagan Bettell'], 'license': ['Creative Commons Attribution 3.0'], 'recognisedLicence': [{'acronym': 'CC BY 4.0', 'name': 'Creative Commons Attribution (International)', 'url': 'https://creativecommons.org/licenses/by/4.0/', 'imageUrl': 'https://licensebuttons.net/l/by/4.0/88x31.png'}], 'dataResourceUid': [nan], 'occurrenceID': ['fc819ba9-968e-4ce6-ac5f-39bc7353bc3a']}

[...]

Downloading image 1 of 15: 20210728131206_IMG_4429.JPG

[...]
  File "C:\Users\JojoS\Miniconda3\envs\KoalaDashboard\lib\site-packages\galah\atlas_media.py", line 237, in atlas_media
    response = requests.get(image,stream=True)
  File "C:\Users\JojoS\Miniconda3\envs\KoalaDashboard\lib\site-packages\requests\api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "C:\Users\JojoS\Miniconda3\envs\KoalaDashboard\lib\site-packages\requests\api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\JojoS\Miniconda3\envs\KoalaDashboard\lib\site-packages\requests\sessions.py", line 575, in request
    prep = self.prepare_request(req)
  File "C:\Users\JojoS\Miniconda3\envs\KoalaDashboard\lib\site-packages\requests\sessions.py", line 486, in prepare_request
    p.prepare(
  File "C:\Users\JojoS\Miniconda3\envs\KoalaDashboard\lib\site-packages\requests\models.py", line 368, in prepare
    self.prepare_url(url, params)
  File "C:\Users\JojoS\Miniconda3\envs\KoalaDashboard\lib\site-packages\requests\models.py", line 439, in prepare_url
    raise MissingSchema(
requests.exceptions.MissingSchema: Invalid URL '20210728131206_IMG_4429.JPG': No scheme supplied. Perhaps you meant https://20210728131206_IMG_4429.JPG?

I have changed the dict key that gets appended to the image_urls list (in atlas_media.py), just to see what happens. If I do this:
image_urls.append(temp_dict['imageUrl'][0])
instead of: image_urls.append(temp_dict['originalFileName'][0])
it passes the requests call but then throws a FileNotFound error, as it's now the wrong URL to query, even though it has the "https://" prefix.
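One way to read the experiment above, offered as a guess rather than a diagnosis of galah's code: the value of `imageUrl` is what can be requested, while `originalFileName` is only suitable as the name of the file on disk. Here is a minimal sketch that keeps the two apart, using the `temp_dict` shape printed earlier. `plan_download` is a hypothetical helper, not a galah function.

```python
from pathlib import Path
from typing import Tuple


def plan_download(temp_dict: dict, folder: str) -> Tuple[str, Path]:
    """Return (url to request, local path to write) for one image record."""
    url = temp_dict["imageUrl"][0]
    # Use only the final path component so the name cannot escape `folder`.
    file_name = Path(temp_dict["originalFileName"][0]).name
    return url, Path(folder) / file_name


# Values copied from the temp_dict output above.
record = {
    "imageUrl": ["https://images.ala.org.au/store/7/5/6/f/3968dc89-8daf-44ad-b7db-4addb81bf657/original"],
    "originalFileName": ["20210728131206_IMG_4429.JPG"],
}
url, target = plan_download(record, "redlands_media")
print(url)
print(target)
```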

I tried a completely different data set of bandicoots but the media and photos were empty, so I guess that's why it worked then.
I have also tried a completely different date range from back in 2021, but the same media error came up, so it's not caused by anything that has changed recently.

Cheers,
JOJO