Some Images cannot be relinked back to occurrences for Biocollect
patkyn opened this issue · 6 comments
A total of 102,517 images were harvested from the Biocollect job. However, not all of them can be linked back to their occurrences.
As discussed with @djtfmartin
The reason is that the original_filename of the image differs from the image URL. There are currently 41,428 images that cannot be re-linked.
Comparing the image identifiers in the Darwin Core archive multimedia extension against the export CSV from https://images.ala.org.au/ws/exportDataset/dr364 shows that 26,316 of those images can be re-linked using the title or image identifier (Examples 1 and 2).
However, the remaining 15,112 images are not in the export CSV list. Some of them have been deleted from the image service (Example 4), and some are still in the image service but do not have data_resource_uid populated (Example 3).
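The comparison described above is essentially a set difference between two identifier lists. A minimal sketch, assuming both files have been read as CSV text; the column names passed to `load_column` are placeholders for illustration and may not match the real headers of multimedia.csv or the export CSV:

```python
import csv
import io


def load_column(csv_text, column):
    """Return the set of non-empty values found in one CSV column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[column].strip() for row in reader if row.get(column)}


def split_relinkable(dwca_ids, exported_ids):
    """Split DwCA identifiers into those present in the export and those missing."""
    relinkable = dwca_ids & exported_ids  # can be matched back to an occurrence
    missing = dwca_ids - exported_ids     # deleted, or exported under no/another dr
    return relinkable, missing
```

Calling `split_relinkable(load_column(multimedia_text, "identifier"), load_column(export_text, "imageIdentifier"))` would give the two groups; the sizes of those sets correspond to the kind of counts quoted above.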
Example 1:
https://biocache.ala.org.au/occurrences/d8c0ee4c-f912-4738-adcb-1fd6e7513446. This occurrence record is supposed to have an additional image: https://images.ala.org.au/image/details?imageId=4a494402-ea72-49ed-ac7c-79392e7ab3f2.
However, in this case the identifier in the multimedia extension is already a URL from the image service 😕 http://images.ala.org.au/image/proxyImageThumbnailLarge?imageId=4a494402-ea72-49ed-ac7c-79392e7ab3f2
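When the multimedia identifier is itself an image service URL, the image ID can be pulled out of it and matched directly. A rough sketch, assuming only the two URL shapes that appear in this issue (an `imageId` query parameter, or `/image/<uuid>` in the path); `extract_image_id` is a hypothetical helper, not part of image-sync:

```python
import uuid
from urllib.parse import parse_qs, urlparse

IMAGE_SERVICE_HOST = "images.ala.org.au"


def _is_uuid(value):
    """True if the value parses as a UUID."""
    try:
        uuid.UUID(value)
        return True
    except ValueError:
        return False


def extract_image_id(url):
    """Return the image ID embedded in an image service URL, or None."""
    parsed = urlparse(url)
    if parsed.hostname != IMAGE_SERVICE_HOST:
        return None
    # Query form: ...?imageId=<uuid>
    candidates = parse_qs(parsed.query).get("imageId", [])
    # Path form: /image/<uuid>
    segments = [s for s in parsed.path.split("/") if s]
    if len(segments) == 2 and segments[0] == "image":
        candidates.append(segments[1])
    for candidate in candidates:
        if _is_uuid(candidate):
            return candidate
    return None
```

URLs from any other host return `None`, so they can be collected separately for follow-up.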
Querying from Image Service DB
Example 2:
https://biocache.ala.org.au/occurrences/92b44228-9def-4469-ae79-22eb3f0a238b
The image already exists in the image service: https://images.ala.org.au/image/2adce2cd-9c8f-43fc-be2f-8d8badf783ce
Querying from Image Service DB
Example 3:
Six of the images in the Darwin Core archive are not linked. In this case, it is because the images in the image service are not populated with data_resource_uid. Hence they did not appear in the export CSV list used by image-sync.
https://biocache.ala.org.au/occurrences/c4fb90bb-925c-4de3-a30b-8b758cd9e9f5
Example 4:
There are some image URLs that are not in images.ala.org.au. These should be deleted from ecodata:
http://images.ala.org.au/image/proxyImageThumbnailLarge?imageId=0f5da8eb-f94c-4602-9ff5-854fde52f018
After downloading the full export from https://images.ala.org.au/ws/#/Export/exportCSV and comparing it with the imageIdentifiers referenced in multimedia.csv of the DwCA dr364.zip, there are stray images in images.ala.org.au that do not belong to dr364. See existingStrayImages.csv and a summary here
existingStrayImages.csv
Some of these images are also used by other valid drs. dr5486 currently has no occurrence records; it is an old OzAtlas dr that is no longer in use: https://collections.ala.org.au/dataResource/show/dr5486
The images attached to dr5486 will be transferred to dr364.
dataResourceUid | imageIdentifier count | Transfer images to dr364? | Remarks |
---|---|---|---|
dr13290 | 13 | N | |
dr14002 | 4 | N | |
dr14317 | 1 | N | non-existent dr, and the image is deleted https://images.ala.org.au/image/details?imageId=0f5da8eb-f94c-4602-9ff5-854fde52f018 |
dr16778 | 1 | N | |
dr1902 | 51 | N | |
dr2696 | 3 | N | |
dr3147 | 3 | N | |
dr4701 | 1 | N | |
dr5486 | 1647 | Y | (invalid dr previously ozatlas) |
No drs | 31498 | Y | |
An SQL script was generated to transfer the images to dr364:
updateStrayImages.txt
The remaining 1,366 images have imageIdentifiers but do not exist in images.ala.org.au:
notExistingStrayImages.csv
Biocollect legacy url update AtlasOfLivingAustralia/biocollect#1343
Ran the update on images prod using this modified SQL, which updates images in batches of 500 image_identifiers:
updateStrayImages.txt
updateStrayImages-dr5486.txt
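For readers without access to the attachments, the batching idea can be sketched as follows. This is not the attached script: the table name `image` is an assumption, and the column names `image_identifier` and `data_resource_uid` are taken from the wording in this issue rather than checked against the schema.

```python
def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_update_statements(image_identifiers, target_dr="dr364", batch_size=500):
    """Build one UPDATE statement per batch of image identifiers."""
    statements = []
    for batch in batched(sorted(image_identifiers), batch_size):
        # Double any single quotes so each identifier is a valid SQL literal.
        quoted = ", ".join("'" + i.replace("'", "''") + "'" for i in batch)
        statements.append(
            # Assumed table name: image
            f"UPDATE image SET data_resource_uid = '{target_dr}' "
            f"WHERE image_identifier IN ({quoted});"
        )
    return statements
```

Keeping each statement to a fixed number of identifiers bounds the size of every `IN (...)` list and lets each batch be run and checked separately.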
A recent discovery concerning a recently uploaded image from a Biocollect sighting: the image URL from Biocollect in the DwCA is of the form 'https://biocollect.ala.org.au/image?id=...' instead of an https://images.ala.org.au URL, hence it cannot be synced back.
Refer to Example 2 above
https://biocache.ala.org.au/occurrences/61fdeced-46ed-4653-bbae-edcdc4429076
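To gauge how many records are affected, the multimedia identifiers could be tallied by host. A small sketch, assuming the identifier URLs have already been read into a list; `count_hosts` is a hypothetical helper:

```python
from collections import Counter
from urllib.parse import urlparse


def count_hosts(identifier_urls):
    """Count multimedia identifier URLs by host name."""
    return Counter(urlparse(u).hostname or "(no host)" for u in identifier_urls)
```

Any host other than images.ala.org.au in the result points at identifiers that the sync cannot currently match.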
Requires AtlasOfLivingAustralia/la-pipelines#516 to be fixed
Closing ... @patkyn please reopen if this isn't resolved