hubmapconsortium/search-api

Remove top-level copied field `files`

Closed this issue · 9 comments

During index/reindex runtime, the Dataset `ingest_metadata.files` field gets copied to a top-level field `files`. Recently we've come across some datasets that contain a large number of `ingest_metadata.files` entries (the Dataset field `ingest_metadata` gets renamed to `metadata` at index runtime); fbf3af732f53b00f20a9ecc1ecc3c52b, for instance, has a payload size of about 2MB.
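
For context, here is a minimal sketch of what the copy looks like during the index transform; the function name and field handling are illustrative assumptions, not the actual search-api code:

```python
# Illustrative sketch only -- not the actual search-api transform code.
def transform_dataset_doc(entity: dict) -> dict:
    doc = dict(entity)

    # The Dataset field ingest_metadata is renamed to metadata at index runtime.
    if "ingest_metadata" in doc:
        doc["metadata"] = doc.pop("ingest_metadata")

    # Current behavior: the files list is also copied to the top level,
    # so a potentially large array is stored twice in every indexed Dataset document.
    files = (doc.get("metadata") or {}).get("files")
    if files is not None:
        doc["files"] = files

    return doc
```

Whichever copy ends up being kept, the fix amounts to dropping one side of that duplication so the array is only stored once per document.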

[Screenshot, 2024-05-22, showing the duplicated entries in the indexed document]

Such duplicates have caused:

  • bigger response JSON payloads (> 10MB)
  • longer search query execution and reindex times

We should remove the original `metadata.files` and only keep the copied top-level version.

@lchoy @john-conroy @NickAkhmetov @bherr2 will this change affect any of your UI handling?

Having the files only at the top level of the doc would break our UI and require some work in the portal-ui.

We read from metadata.files. Does this affect that?
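
For reference, a minimal sketch of a consumer query that only pulls `metadata.files`; the endpoint URL is an illustrative assumption, not the actual portal-ui or ccf-ui code. Removing only the top-level copy would not affect a query shaped like this:

```python
# Illustrative consumer-side sketch -- the endpoint URL is an assumption, not real config.
import requests

query = {
    "query": {"term": {"uuid": "fbf3af732f53b00f20a9ecc1ecc3c52b"}},
    # Ask Elasticsearch to return only the nested copy of the file list;
    # the top-level `files` field is never requested.
    "_source": {"includes": ["uuid", "metadata.files"]},
}

resp = requests.post("https://search.api.example.org/v3/search", json=query, timeout=30)
for hit in resp.json()["hits"]["hits"]:
    files = hit["_source"].get("metadata", {}).get("files", [])
    print(hit["_source"]["uuid"], len(files), "file entries")
```

The same `_source` filtering also keeps response payloads small while the duplicate still exists in the index.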

PS. Here are the fields we query for / use: https://github.com/hubmapconsortium/ccf-ui/blob/main/projects/ccf-database/src/lib/xconsortia/xconsortia-data-import.ts#L17-L38

@john-conroy @bherr2 does this mean the portal-ui and ccf-ui are not consuming the top-level `files` (copied from `metadata.files`) at all?

On the ccf-ui side, that's correct.

@bherr2 @john-conroy if you are sure you don't use the top-level `files` field, we'll plan to remove it. Is that fine with you?

There will also be further changes to the Dataset `metadata.files` and `metadata.metadata` fields in the near future. We'll discuss and come up with a plan.

Fine by me

I'll have to look through our repos before I can fully confirm.

Closing this issue; this will be handled separately.