Investigate what caused mapping explosion

Question

Closed this issue 9 months ago · 3 comments

With the total fields limit set to 4500 on DEV, documents failed to be indexed in Elasticsearch with the following error:

{"root_cause":[{"type":"illegal_argument_exception","reason":"Limit of total fields [4500] has been exceeded"}]}

The limit was bumped to 5500, then 6500, then 7500, just so the documents could be indexed.
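
For reference, the limit in the error message corresponds to the Elasticsearch index setting `index.mapping.total_fields.limit`, which can be changed through the index `_settings` endpoint. The sketch below only builds the request and does not send it. The index name is an assumption taken from the mapping file names in this thread, and the helper name is hypothetical.

```python
import json


def total_fields_limit_request(index_name: str, limit: int) -> dict:
    """Describe the settings update that raises the total fields limit.

    Nothing is sent here; the returned dict can be passed to whatever
    HTTP client is used to talk to the cluster.
    """
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    return {
        "method": "PUT",
        "path": f"/{index_name}/_settings",
        "body": {"index.mapping.total_fields.limit": limit},
    }


# Assumed index name (derived from hm_dev_consortium_portal.json).
request = total_fields_limit_request("hm_dev_consortium_portal", 7500)
print(request["method"], request["path"])
print(json.dumps(request["body"]))
```

Raising the limit only unblocks indexing; it does not explain where the extra fields come from, which is what the rest of this thread looks into.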

Answer 1 · 2024-03-07T18:52:05.000Z

hm_dev_consortium_entities.json
hm_dev_consortium_portal.json
hm_prod_consortium_entities.json
hm_prod_consortium_portal.json

Compared the portal index fields on PROD and DEV: 2034 new fields were added on DEV.
new_fields.txt
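
A comparison like this can be sketched as follows. This is a minimal illustration, not necessarily how the 2034 count was produced: it assumes each mapping is the `mappings` object of an index (with nested `properties` and optional multi-field `fields`), and the two small inline mappings are made-up stand-ins for the attached JSON files.

```python
def field_paths(mapping_node: dict, prefix: str = "") -> set:
    """Collect dotted paths of every field declared under a mapping node."""
    paths = set()
    for name, spec in mapping_node.get("properties", {}).items():
        path = f"{prefix}{name}"
        paths.add(path)
        # Object fields declare their children under a nested "properties".
        paths |= field_paths(spec, prefix=f"{path}.")
        # Multi-fields (e.g. a "keyword" sub-field) live under "fields".
        for sub_name in spec.get("fields", {}):
            paths.add(f"{path}.{sub_name}")
    return paths


# Made-up, heavily abridged mappings for illustration only.
prod_mapping = {
    "properties": {
        "uuid": {"type": "keyword"},
        "metadata": {
            "properties": {
                "sample_id": {
                    "type": "text",
                    "fields": {"keyword": {"type": "keyword"}},
                }
            }
        },
    }
}
dev_mapping = {
    "properties": {
        "uuid": {"type": "keyword"},
        "metadata": {
            "properties": {
                "sample_id": {
                    "type": "text",
                    "fields": {"keyword": {"type": "keyword"}},
                },
                "code": {"type": "long"},
                "description": {
                    "properties": {"hubmap_id": {"type": "keyword"}}
                },
            }
        },
    }
}

# Fields present on DEV but not on PROD.
new_fields = sorted(field_paths(dev_mapping) - field_paths(prod_mapping))
for path in new_fields:
    print(path)
```

Sorting the difference makes shared prefixes such as `metadata.description` easy to spot, which points at the part of the document that is growing.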

On further investigation, the following Donor and Samples are causing the issue because they have INCORRECT metadata values.

╒═══════════════╤══════════════════════════════════╕
│"n.entity_type"│"n.uuid"                          │
╞═══════════════╪══════════════════════════════════╡
│"Donor"        │"c624abbe9836c7e3b6a8d8216a316f30"│
├───────────────┼──────────────────────────────────┤
│"Sample"       │"4d750b2cdad1579bb3bd5d89b9223431"│
├───────────────┼──────────────────────────────────┤
│"Sample"       │"1de4ad9bd516660043b8e190e8978045"│
├───────────────┼──────────────────────────────────┤
│"Sample"       │"dfe7f5dcac2a6d0efac3c3655dd8e29b"│
├───────────────┼──────────────────────────────────┤
│"Sample"       │"8de5e146b6bf4cdb6fcbd66edecbc2fc"│
├───────────────┼──────────────────────────────────┤
│"Sample"       │"a1b225e9f285dfef8101094f4c98b43f"│
├───────────────┼──────────────────────────────────┤
│"Sample"       │"6d2f7e6ea8b95c690db6df8b70ef84e8"│
└───────────────┴──────────────────────────────────┘

Answer 2 · 2024-03-07T20:09:05.000Z

It turned out that these entities now have a metadata field that looks like the one below (from Sample 6d2f7e6ea8b95c690db6df8b70ef84e8). The metadata.description field, with its nested sub-fields, introduces 2K+ new fields during the Elasticsearch mapping process. It SEEMS someone took an HTTP response (note the "code": 200) and stored it as the metadata value.

{
  "code": 200,
  "description": {
    "created_by_user_displayname": "Haitham Mohamed Abdelazim",
    "created_by_user_email": "haitham.mohameda@ufl.edu",
    "created_by_user_sub": "68cc1d4b-66c5-4ed1-9da9-33f260fc12f0",
    "created_timestamp": 1701454862364,
    "data_access_level": "consortium",
    "direct_ancestor": {
      "created_by_user_displayname": "Haitham Mohamed Abdelazim",
      "created_by_user_email": "haitham.mohameda@ufl.edu",
      "created_by_user_sub": "68cc1d4b-66c5-4ed1-9da9-33f260fc12f0",
      "created_timestamp": 1701450436361,
      "data_access_level": "consortium",
      "entity_type": "Donor",
      "group_name": "TC - University of Florida",
      "group_uuid": "9d7be1c8-20ea-11ee-b9e0-4f7fcf0abd92",
      "hubmap_id": "HBM982.BFJP.379",
      "label": "NHK#39401771",
      "last_modified_timestamp": 1701450436361,
      "last_modified_user_displayname": "Haitham Mohamed Abdelazim",
      "last_modified_user_email": "haitham.mohameda@ufl.edu",
      "last_modified_user_sub": "68cc1d4b-66c5-4ed1-9da9-33f260fc12f0",
      "protocol_url": "https://dx.doi.org/10.17504/protocols.io.e6nvwdo6dlmk",
      "submission_id": "TCUF0076",
      "uuid": "85e4171bc51aea25204b087c19219b88"
    },
    "entity_type": "Sample",
    "group_name": "TC - University of Florida",
    "group_uuid": "9d7be1c8-20ea-11ee-b9e0-4f7fcf0abd92",
    "hubmap_id": "HBM992.ZTGQ.825",
    "lab_tissue_sample_id": "NA",
    "last_modified_timestamp": 1701454862364,
    "last_modified_user_displayname": "Haitham Mohamed Abdelazim",
    "last_modified_user_email": "haitham.mohameda@ufl.edu",
    "last_modified_user_sub": "68cc1d4b-66c5-4ed1-9da9-33f260fc12f0",
    "metadata": {
      "histological_report": "",
      "lab_id": "block",
      "metadata_schema_id": "3e98cee6-d3fb-467b-8d4e-9ba7ee49eeff",
      "notes": "",
      "pathology_distance_unit": "cm",
      "pathology_distance_value": "42",
      "preparation_condition": "frozen in liquid nitrogen",
      "preparation_medium": "Methanol",
      "preparation_protocol_doi": "https://dx.doi.org/10.17504/protocols.io.eq2lyno9qvx9/v1",
      "processing_time_unit": "minute",
      "processing_time_value": "",
      "quality_criteria": "",
      "sample_id": "HBM347.VVBC.274",
      "source_id": "HBM992.ZTGQ.825",
      "source_storage_duration_unit": "minute",
      "source_storage_duration_value": "33",
      "storage_medium": "Methanol",
      "storage_method": "frozen in liquid nitrogen",
      "tissue_weight_unit": "kg",
      "tissue_weight_value": "42",
      "volume_unit": "mm^3",
      "volume_value": "1"
    },
    "organ": "LK",
    "protocol_url": "https://dx.doi.org/10.17504/protocols.io.e6nvwdo6dlmk",
    "sample_category": "organ",
    "submission_id": "TCUF0076-LK",
    "uuid": "e6726f3c27cf021b2f61903b93c9a552"
  },
  "name": "OK",
  "pathname": "xwdkzwxmp23lm8xqvs6z/HIDSAMPMETA.csv",
  "file_row": 3
}
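
A check along the following lines could flag this kind of metadata before it reaches the index. It is only a sketch: the heuristic (a top-level integer "code" next to an object-valued "description" and a "name") is an assumption based on the single example above, both function names are hypothetical, and the sample values are abridged from the payload shown.

```python
def looks_like_response_envelope(metadata: dict) -> bool:
    """Heuristic: does the metadata look like a stored HTTP-style response?"""
    return (
        isinstance(metadata.get("code"), int)
        and isinstance(metadata.get("description"), dict)
        and "name" in metadata
    )


def leaf_paths(value, prefix: str = "") -> set:
    """Collect the dotted path of every leaf value in a JSON-like object."""
    if isinstance(value, dict):
        paths = set()
        for key, child in value.items():
            paths |= leaf_paths(child, f"{prefix}.{key}" if prefix else key)
        return paths
    if isinstance(value, list):
        # Array items share the path of the array itself.
        paths = set()
        for item in value:
            paths |= leaf_paths(item, prefix)
        return paths
    return {prefix}


# Abridged from the bad Sample metadata above.
bad_metadata = {
    "code": 200,
    "description": {
        "entity_type": "Sample",
        "direct_ancestor": {"entity_type": "Donor"},
        "metadata": {"sample_id": "HBM347.VVBC.274"},
    },
    "name": "OK",
    "pathname": "xwdkzwxmp23lm8xqvs6z/HIDSAMPMETA.csv",
    "file_row": 3,
}
# Abridged from the normal Sample metadata shown in this thread.
good_metadata = {"sample_id": "STAN0003-LI-2-6", "vital_state": "deceased"}

print(looks_like_response_envelope(bad_metadata))
print(looks_like_response_envelope(good_metadata))
print(sorted(leaf_paths(bad_metadata)))
```

Every distinct leaf path under metadata.description can end up as its own mapped field, which is why wrapping a whole entity (including its direct_ancestor and its own metadata) inflates the field count so quickly.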

The normal metadata of a Sample should look similar to:

{
  "sample_id": "STAN0003-LI-2-6",
  "vital_state": "deceased",
  "health_status": "relatively healthy",
  "organ_condition": "healthy",
  "procedure_date": "2019-03-14",
  "perfusion_solution": "HTK",
  "pathologist_report": "normal, no cancer, no necrosis",
  "warm_ischemia_time_value": "0",
  "warm_ischemia_time_unit": "minutes",
  "cold_ischemia_time_value": "237",
  "cold_ischemia_time_unit": "minutes",
  "specimen_preservation_temperature": "Freezer (-80 Celsius)",
  "specimen_quality_criteria": "H&E",
  "specimen_tumor_distance_value": "",
  "specimen_tumor_distance_unit": ""
}

OR

{
  "organ_donor_data": [
    {
      "start_datetime": "0",
      "end_datetime": "",
      "graph_version": "UMLS2019AA",
      "concept_id": "C0086287",
      "code": "1086007",
      "sab": "SNOMEDCT_US",
      "data_type": "Nominal",
      "data_value": "",
      "numeric_operator": "",
      "units": "",
      "preferred_term": "Female",
      "grouping_concept": "C1522384",
      "grouping_concept_preferred_term": "Sex",
      "grouping_code": "57312000",
      "grouping_sab": "SNOMEDCT_US"
    },
    {
      "start_datetime": "0",
      "end_datetime": "",
      "graph_version": "UMLS2019AA",
      "concept_id": "C0001779",
      "code": "424144002",
      "sab": "SNOMEDCT_US",
      "data_type": "Numeric",
      "data_value": "66",
      "numeric_operator": "EQ",
      "units": "years",
      "preferred_term": "Age",
      "grouping_concept": "C0001779",
      "grouping_concept_preferred_term": "Age",
      "grouping_code": "424144002",
      "grouping_sab": "SNOMEDCT_US"
    },
    ...
  ]
}

Answer 3 · 2024-03-11T15:01:51.000Z

It turned out this was caused by testing the Sample Metadata Upload feature on DEV and TEST. Bill fixed the data in Neo4j, and things are back to normal after a full reindex. We still bumped the total mapped fields limit to 6000 to accommodate future increases.

But once we finalize the mapping updates described in #761 (comment), we should have fewer fields mapped.