AtlasOfLivingAustralia/data-management

Some Images cannot be relinked back to occurrences for Biocollect

patkyn opened this issue · 6 comments

A total of 102,517 images were harvested from the Biocollect job. However, not all of them can be linked back to their occurrences.

As discussed with @djtfmartin, the reason is that the original_filename for the image differs from the image URL. There are currently 41,428 images that cannot be re-linked.

Comparing the image identifiers in the Darwin Core archive multimedia extension against the export CSV from https://images.ala.org.au/ws/exportDataset/dr364 shows that 26,316 of those images can be re-linked using the title or image identifier (Examples 1 and 2).

However, the remaining 15,112 images are not in the export CSV. Some of them have been deleted from the image service (Example 4), and some are still in the image service but do not have data_resource_uid populated (Example 3).
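A minimal sketch of the comparison described above, assuming the DwCA multimedia identifiers and the export CSV identifiers have already been loaded into Python lists. The function name and the sample identifiers are hypothetical, and the real export columns may differ.

```python
def split_by_export(dwca_identifiers, export_identifiers):
    """Partition DwCA image identifiers by whether the export CSV lists them."""
    export_set = set(export_identifiers)
    in_export = [i for i in dwca_identifiers if i in export_set]
    not_in_export = [i for i in dwca_identifiers if i not in export_set]
    return in_export, not_in_export


# Hypothetical sample data, not real identifiers.
dwca = ["id-a", "id-b", "id-c"]
export = ["id-b", "id-c", "id-z"]
relinkable, missing = split_by_export(dwca, export)
```

Identifiers in `relinkable` correspond to the images that can be matched through the export, and `missing` to the ones that need further investigation.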

Example 1:
https://biocache.ala.org.au/occurrences/d8c0ee4c-f912-4738-adcb-1fd6e7513446

This occurrence record is supposed to have an additional image: https://images.ala.org.au/image/details?imageId=4a494402-ea72-49ed-ac7c-79392e7ab3f2

However, in this case the identifier in the multimedia extension is already a URL from the image service 😕: http://images.ala.org.au/image/proxyImageThumbnailLarge?imageId=4a494402-ea72-49ed-ac7c-79392e7ab3f2

image

image

Querying from Image Service DB
image
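For identifiers like the one in Example 1, where the DwCA identifier is itself an image-service URL, the image ID can be read from the `imageId` query parameter. This is only an illustrative sketch; the helper name is hypothetical and it is not necessarily how image-sync resolves identifiers.

```python
from urllib.parse import urlparse, parse_qs


def image_id_from_url(url):
    """Return the imageId query parameter of a URL, or None if it is absent."""
    values = parse_qs(urlparse(url).query).get("imageId")
    return values[0] if values else None


url = ("http://images.ala.org.au/image/proxyImageThumbnailLarge"
       "?imageId=4a494402-ea72-49ed-ac7c-79392e7ab3f2")
image_id = image_id_from_url(url)
```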

Example 2:
https://biocache.ala.org.au/occurrences/92b44228-9def-4469-ae79-22eb3f0a238b

image

The image already exists in the image service: https://images.ala.org.au/image/2adce2cd-9c8f-43fc-be2f-8d8badf783ce

Querying from Image Service DB
image

Example 3:
Six of the images in the Darwin Core archive are not linked. In this case, the images in the image service do not have data_resource_uid populated, so they did not appear in the export CSV used by image-sync.

https://biocache.ala.org.au/occurrences/c4fb90bb-925c-4de3-a30b-8b758cd9e9f5

image

image

Example 4:
Some image URLs point to images that are not in images.ala.org.au. These should be deleted from ecodata:
http://images.ala.org.au/image/proxyImageThumbnailLarge?imageId=0f5da8eb-f94c-4602-9ff5-854fde52f018

After downloading the full export from https://images.ala.org.au/ws/#/Export/exportCSV and comparing it with the imageIdentifiers referenced in the multimedia.csv of the DwCA dr364.zip, there are stray images in images.ala.org.au that do not belong to dr364. See existingStrayImages.csv and the summary below.
existingStrayImages.csv

Some of these images are also used by other valid data resources (drs). dr5486 currently has no occurrence records; it is an old OzAtlas dr that is no longer in use: https://collections.ala.org.au/dataResource/show/dr5486
The images attached to dr5486 will be transferred to dr364.

| dataResourceUid | imageIdentifier count | Transfer images to dr364? | Remarks |
| --- | --- | --- | --- |
| dr13290 | 13 | N | |
| dr14002 | 4 | N | |
| dr14317 | 1 | N | non existent dr but image is deleted https://images.ala.org.au/image/details?imageId=0f5da8eb-f94c-4602-9ff5-854fde52f018 |
| dr16778 | 1 | N | |
| dr1902 | 51 | N | |
| dr2696 | 3 | N | |
| dr3147 | 3 | N | |
| dr4701 | 1 | N | |
| dr5486 | 1647 | Y | (invalid dr previously ozatlas) |
| No drs | 31498 | Y | |
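A per-data-resource summary like the one above could be produced roughly as follows. This is a sketch only: it assumes the stray-image rows are dicts (for example from `csv.DictReader`) with a `dataResourceUid` key, and the sample rows are hypothetical.

```python
from collections import Counter


def summarise_by_data_resource(rows, expected_uid="dr364"):
    """Count rows per dataResourceUid, skipping rows already on expected_uid."""
    counts = Counter()
    for row in rows:
        # Rows with an empty or missing uid are grouped under "No drs".
        uid = row.get("dataResourceUid") or "No drs"
        if uid != expected_uid:
            counts[uid] += 1
    return counts


# Hypothetical sample rows, not real data.
rows = [
    {"imageIdentifier": "id-1", "dataResourceUid": "dr5486"},
    {"imageIdentifier": "id-2", "dataResourceUid": ""},
    {"imageIdentifier": "id-3", "dataResourceUid": "dr364"},
    {"imageIdentifier": "id-4", "dataResourceUid": "dr5486"},
]
summary = summarise_by_data_resource(rows)
```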

SQL scripts were generated to transfer the images to dr364:
updateStrayImages.txt

updateStrayImages-dr5486.txt

The remaining 1,366 images have imageIdentifiers but do not exist in images.ala.org.au:
notExistingStrayImages.csv

Biocollect legacy URL update: AtlasOfLivingAustralia/biocollect#1343

Ran the update on images prod using this modified SQL, which updates images in batches of 500 image_identifiers:
updateStrayImages.txt
updateStrayImages-dr5486.txt
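The attached scripts are the authoritative version. As a rough illustration of the batching idea only, the sketch below generates one UPDATE statement per 500 identifiers. The table name `image` and column name `image_identifier` are assumptions, not confirmed against the image service schema.

```python
def batched(items, size=500):
    """Yield consecutive slices of items, each with at most size elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_update_statements(image_identifiers, target_uid="dr364", size=500):
    """Build one UPDATE per batch (table and column names are assumed)."""
    statements = []
    for batch in batched(image_identifiers, size):
        # Double any single quotes so each identifier is a valid SQL literal.
        quoted = ", ".join("'" + i.replace("'", "''") + "'" for i in batch)
        statements.append(
            f"UPDATE image SET data_resource_uid = '{target_uid}' "
            f"WHERE image_identifier IN ({quoted});"
        )
    return statements


# Hypothetical identifiers, used only to show the batch sizes.
ids = [f"id-{n}" for n in range(1201)]
statements = build_update_statements(ids)
```

Keeping each statement to a bounded number of identifiers limits the size of any single update, so a failure part-way through affects only one batch.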

A recent discovery concerning a recently uploaded image from a Biocollect sighting: the image URL from Biocollect in the DwCA is of the form 'https://biocollect.ala.org.au/image?id=...' rather than an https://images.ala.org.au URL, so it cannot be synced back.
Refer to Example 2 above.

https://biocache.ala.org.au/occurrences/61fdeced-46ed-4653-bbae-edcdc4429076

image

images db in image service
image
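To find how many DwCA identifiers are affected by this, the URLs could be split by host. A minimal sketch, assuming the identifiers are plain URL strings; the helper name and the `example-id` value are hypothetical.

```python
from urllib.parse import urlparse

IMAGE_SERVICE_HOST = "images.ala.org.au"


def is_image_service_url(url):
    """True when the URL's host is the image service, regardless of scheme."""
    return urlparse(url).hostname == IMAGE_SERVICE_HOST


identifiers = [
    "https://images.ala.org.au/image/2adce2cd-9c8f-43fc-be2f-8d8badf783ce",
    "https://biocollect.ala.org.au/image?id=example-id",  # hypothetical id
]
not_syncable = [u for u in identifiers if not is_image_service_url(u)]
```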

Requires AtlasOfLivingAustralia/la-pipelines#516 to be fixed

temi commented

@patkyn this should be fixed now.

Closing ... @patkyn please reopen if this isn't resolved