AtlasOfLivingAustralia/image-service

ElasticSearch index bugs

The ElasticSearchService has two ways of indexing an image: indexing an individual record (ElasticSearchService.indexImage), and bulk indexing all records in the database (ScheduleReindexAllImagesTask, using ImageService.exportIndexToFile and ElasticSearchService.bulkIndexImageInES). The ES documents these produce are inconsistent in three ways: the fields they provide; the data types of some of those fields (e.g. the bulk index writes every field as a string, while the individual document has ints for width, height, etc.); and the names of some fields (contentmd5hash and contentsha1hash in the bulk index vs contentMD5Hash and contentSHA1Hash in the individual index).

I suggest updating the bulk index to match the individual index in fields, field names and data types.
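
To make the suggestion concrete, here is a rough sketch of the kind of normalisation I mean. It is Python purely for illustration and is not the service's own code. `normalise_bulk_doc`, `RENAMES`, `INT_FIELDS` and `BOOL_FIELDS` are hypothetical names, and the field lists are only what I could infer from the two example documents in this issue, so they may be incomplete.

```python
# Hypothetical helper: the names and field lists below are inferred from the
# example documents in this issue, not taken from the image-service code base.
RENAMES = {
    "contentmd5hash": "contentMD5Hash",
    "contentsha1hash": "contentSHA1Hash",
    "originalfilename": "originalFilename",
}
INT_FIELDS = {"fileSize", "height", "width", "zoomLevels", "thumbHeight", "thumbWidth"}
BOOL_FIELDS = {"harvestable"}


def normalise_bulk_doc(doc):
    """Return a copy of a bulk-index document using the individual-index names and types."""
    out = {}
    for key, value in doc.items():
        # Map the lower-cased bulk field names onto the camelCase names.
        key = RENAMES.get(key, key)
        if value is not None and key in INT_FIELDS:
            # Bulk export writes numbers as strings, e.g. "2048".
            value = int(value)
        elif isinstance(value, str) and key in BOOL_FIELDS:
            # Bulk export writes booleans as "true" / "false".
            value = value.strip().lower() == "true"
        out[key] = value
    return out
```

This only covers field names and the int/boolean types. The differing date formats and the fields that exist in only one of the two documents would still need a decision.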

A well-defined schema for the index would help prevent this kind of drift.

Example indexImage document:

{
  "imageIdentifier" : "d7db130f-3416-430d-acb8-dca966f61a9e",
  "contentMD5Hash" : "97ab347bee9ed365963ea1eebd402e3c",
  "contentSHA1Hash" : "00e354b58a4c5e3baf5d8e69ef0ff823414410ec",
  "format" : "image/jpeg",
  "originalFilename" : "https://inaturalist-open-data.s3.amazonaws.com/photos/218415139/original.jpeg",
  "extension" : "jpeg",
  "dateUploaded" : "2022-08-01T10:41:56Z",
  "dateTaken" : "2022-08-01T10:41:56Z",
  "fileSize" : 785423,
  "height" : 2048,
  "width" : 1365,
  "zoomLevels" : 5,
  "dataResourceUid" : "dr1411",
  "creator" : "Grace Keast",
  "title" : null,
  "description" : null,
  "rights" : null,
  "rightsHolder" : "Grace Keast",
  "license" : "http://creativecommons.org/licenses/by-nc/4.0/",
  "thumbHeight" : 300,
  "thumbWidth" : 200,
  "harvestable" : false,
  "recognisedLicence" : "CC BY-NC 4.0",
  "occurrenceID" : null,
  "dateUploadedYearMonth" : "2022-08",
  "fileType" : "image",
  "imageSize" : "2m"
}

Example bulkIndexImageInES document:

{
  "imageIdentifier" : "277e29e6-eea0-454d-a81c-4d90d374a72a",
  "contentmd5hash" : "866ff2eeebf50518c2f25b19cdf7645a",
  "contentsha1hash" : "fc1706d73208d297dd83820132627a56312edb24",
  "format" : "image/jpeg",
  "originalfilename" : "https://static.inaturalist.org/photos/32546837/original.jpg",
  "extension" : "jpg?1552086948",
  "dateUploaded" : "2019-11-15",
  "dateTaken" : "2019-11-15",
  "fileSize" : "958021",
  "height" : "1360",
  "width" : "2048",
  "zoomLevels" : "5",
  "dataResourceUid" : "dr1411",
  "creator" : "Rolf Lawrenz",
  "rightsHolder" : "Rolf Lawrenz",
  "license" : "http://creativecommons.org/licenses/by/4.0/",
  "thumbHeight" : "199",
  "thumbWidth" : "300",
  "harvestable" : "false",
  "occurrenceID" : "4e48e22f-b9c6-494b-bb9d-0db9f621548b",
  "type" : "StillImage",
  "created" : "2019-03-06T12:36:50-08:00",
  "references" : "https://www.inaturalist.org/photos/32546837",
  "dateUploadedYearMonth" : "2019-11",
  "fileType" : "image",
  "recognisedLicence" : "unrecognised_licence",
  "imageSize" : "2m"
}
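
For anyone wanting to check other documents, the differences above can be listed mechanically. The following is a small illustrative sketch (Python, not part of the service; `compare_docs` is a hypothetical helper) that takes two documents already parsed into dicts and reports field names that differ only by case, fields whose JSON types differ, and fields present on only one side. The two examples are different images, so only names and types are compared, not values.

```python
def compare_docs(individual, bulk):
    """Summarise differences in field names and JSON types between two ES documents."""
    # Index both documents by lower-cased field name so that
    # contentMD5Hash and contentmd5hash are treated as the same field.
    ind_lower = {k.lower(): k for k in individual}
    bulk_lower = {k.lower(): k for k in bulk}

    report = {
        "case_mismatch": [],
        "type_mismatch": [],
        "only_individual": [],
        "only_bulk": [],
    }
    for lower, ind_key in sorted(ind_lower.items()):
        bulk_key = bulk_lower.get(lower)
        if bulk_key is None:
            report["only_individual"].append(ind_key)
            continue
        if bulk_key != ind_key:
            report["case_mismatch"].append((ind_key, bulk_key))
        a, b = individual[ind_key], bulk[bulk_key]
        # Nulls carry no type information, so skip them.
        if a is not None and b is not None and type(a) is not type(b):
            report["type_mismatch"].append(
                (ind_key, type(a).__name__, type(b).__name__)
            )
    report["only_bulk"] = sorted(
        bulk_lower[name] for name in bulk_lower if name not in ind_lower
    )
    return report
```

Run against the two examples (loaded with `json.loads`), this should flag the hash and filename fields as case mismatches, the numeric and boolean fields as type mismatches, and fields such as title, description and rights (individual only) or type, created and references (bulk only) as one-sided.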