ElasticSearch index bugs
The ElasticSearchService has two ways of indexing an image: as an individual record (ElasticSearchService.indexImage), and as a bulk index of all records in the database (ScheduleReindexAllImagesTask, using ImageService.exportIndexToFile and ElasticSearchService.bulkIndexImageInES). The ES documents these produce are inconsistent in the fields they provide, in the data types of some of those fields (i.e. in the bulk index all fields are strings, while the individual document has ints for width, height, etc.) and in the names of some fields (contentmd5hash, contentsha1hash and originalfilename in the bulk index vs contentMD5Hash, contentSHA1Hash and originalFilename in the individual index).
I suggest updating the bulk index to match the individual index in fields, field names and data types.
More generally, a well-defined schema for the index would help keep the two indexing paths consistent.
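As a rough illustration of what "match the individual index" could mean, here is a minimal Python sketch that renames and retypes a bulk-style document. It is not code from the image service: normalise_bulk_document, RENAMES, INT_FIELDS and BOOL_FIELDS are hypothetical names, and the field lists are only what can be read off the two example documents in this issue, so they may be incomplete.

```python
# Hypothetical helper, not part of the image service. Field names are taken
# from the example documents in this issue and may be incomplete.

# Bulk-index field names mapped to the names used by indexImage.
RENAMES = {
    "contentmd5hash": "contentMD5Hash",
    "contentsha1hash": "contentSHA1Hash",
    "originalfilename": "originalFilename",
}

# Fields that the indexImage example emits as integers.
INT_FIELDS = {"fileSize", "height", "width", "zoomLevels", "thumbHeight", "thumbWidth"}

# Fields that the indexImage example emits as booleans.
BOOL_FIELDS = {"harvestable"}


def normalise_bulk_document(doc):
    """Return a copy of a bulk-style document using indexImage names and types."""
    result = {}
    for key, value in doc.items():
        key = RENAMES.get(key, key)
        if value is not None:
            if key in INT_FIELDS:
                value = int(value)
            elif key in BOOL_FIELDS and isinstance(value, str):
                value = value.strip().lower() == "true"
        result[key] = value
    return result


bulk_doc = {
    "contentmd5hash": "866ff2eeebf50518c2f25b19cdf7645a",
    "fileSize": "958021",
    "height": "1360",
    "width": "2048",
    "harvestable": "false",
}
normalised = normalise_bulk_document(bulk_doc)
print(normalised)
```

In the service itself the conversion would more naturally happen where the bulk export is written or read, so that both paths build the document from the same definition. The sketch only shows the renames and casts involved.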
Example indexImage document:
{
"imageIdentifier" : "d7db130f-3416-430d-acb8-dca966f61a9e",
"contentMD5Hash" : "97ab347bee9ed365963ea1eebd402e3c",
"contentSHA1Hash" : "00e354b58a4c5e3baf5d8e69ef0ff823414410ec",
"format" : "image/jpeg",
"originalFilename" : "https://inaturalist-open-data.s3.amazonaws.com/photos/218415139/original.jpeg",
"extension" : "jpeg",
"dateUploaded" : "2022-08-01T10:41:56Z",
"dateTaken" : "2022-08-01T10:41:56Z",
"fileSize" : 785423,
"height" : 2048,
"width" : 1365,
"zoomLevels" : 5,
"dataResourceUid" : "dr1411",
"creator" : "Grace Keast",
"title" : null,
"description" : null,
"rights" : null,
"rightsHolder" : "Grace Keast",
"license" : "http://creativecommons.org/licenses/by-nc/4.0/",
"thumbHeight" : 300,
"thumbWidth" : 200,
"harvestable" : false,
"recognisedLicence" : "CC BY-NC 4.0",
"occurrenceID" : null,
"dateUploadedYearMonth" : "2022-08",
"fileType" : "image",
"imageSize" : "2m"
}
Example bulkIndexImageInES document:
{
"imageIdentifier" : "277e29e6-eea0-454d-a81c-4d90d374a72a",
"contentmd5hash" : "866ff2eeebf50518c2f25b19cdf7645a",
"contentsha1hash" : "fc1706d73208d297dd83820132627a56312edb24",
"format" : "image/jpeg",
"originalfilename" : "https://static.inaturalist.org/photos/32546837/original.jpg",
"extension" : "jpg?1552086948",
"dateUploaded" : "2019-11-15",
"dateTaken" : "2019-11-15",
"fileSize" : "958021",
"height" : "1360",
"width" : "2048",
"zoomLevels" : "5",
"dataResourceUid" : "dr1411",
"creator" : "Rolf Lawrenz",
"rightsHolder" : "Rolf Lawrenz",
"license" : "http://creativecommons.org/licenses/by/4.0/",
"thumbHeight" : "199",
"thumbWidth" : "300",
"harvestable" : "false",
"occurrenceID" : "4e48e22f-b9c6-494b-bb9d-0db9f621548b",
"type" : "StillImage",
"created" : "2019-03-06T12:36:50-08:00",
"references" : "https://www.inaturalist.org/photos/32546837",
"dateUploadedYearMonth" : "2019-11",
"fileType" : "image",
"recognisedLicence" : "unrecognised_licence",
"imageSize" : "2m"
}
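The differences between the two documents above can also be listed mechanically. The following Python sketch is a hypothetical helper (compare_documents is not part of the image service) that matches field names case-insensitively and reports fields present on only one side, names that differ only in case, and values whose JSON types differ. It is run here on a small subset of the fields from the two examples.

```python
def compare_documents(individual, bulk):
    """Summarise differences in field names and value types between two documents."""
    # Index each document's keys by lower-cased name so that
    # contentMD5Hash and contentmd5hash are matched to each other.
    ind_keys = {k.lower(): k for k in individual}
    bulk_keys = {k.lower(): k for k in bulk}

    report = {
        "only_individual": [],
        "only_bulk": [],
        "name_mismatch": [],
        "type_mismatch": [],
    }
    for lower in sorted(ind_keys.keys() | bulk_keys.keys()):
        ik, bk = ind_keys.get(lower), bulk_keys.get(lower)
        if bk is None:
            report["only_individual"].append(ik)
            continue
        if ik is None:
            report["only_bulk"].append(bk)
            continue
        if ik != bk:
            report["name_mismatch"].append((ik, bk))
        iv, bv = individual[ik], bulk[bk]
        # A null value carries no type information, so skip it.
        if iv is not None and bv is not None and type(iv) is not type(bv):
            report["type_mismatch"].append((ik, type(iv).__name__, type(bv).__name__))
    return report


# A subset of the fields from the two example documents.
individual = {
    "contentMD5Hash": "97ab347bee9ed365963ea1eebd402e3c",
    "height": 2048,
    "harvestable": False,
    "title": None,
}
bulk = {
    "contentmd5hash": "866ff2eeebf50518c2f25b19cdf7645a",
    "height": "1360",
    "harvestable": "false",
    "type": "StillImage",
}
report = compare_documents(individual, bulk)
for category, entries in report.items():
    print(category, entries)
```

Loading the full documents with json.loads and passing them to the same function should surface the remaining differences. Differences in value format, such as the date-only dateUploaded in the bulk document, are not detected because both values are strings.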