Investigate what caused mapping explosion
Closed this issue · 3 comments
The total fields limit on DEV was 4500, and documents failed to be indexed in Elasticsearch with:
{"root_cause":[{"type":"illegal_argument_exception","reason":"Limit of total fields [4500] has been exceeded"}]}
Bumped the limit to 5500, then 6500, then 7500 just so the documents could be indexed.
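For reference, the limit in that error message corresponds to the Elasticsearch index setting `index.mapping.total_fields.limit`. Below is a minimal sketch of how such a settings update could be built with the Python standard library. The host `http://localhost:9200` and the index name `hm_dev_consortium_portal` are assumptions for illustration (the index name is guessed from the attachment filename), and the request is only constructed, not sent.

```python
import json
import urllib.request


def build_limit_request(base_url, index, limit):
    """Build (but do not send) a PUT <index>/_settings request raising the field limit."""
    body = json.dumps({"index.mapping.total_fields.limit": limit}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical host and index name, for illustration only.
req = build_limit_request("http://localhost:9200", "hm_dev_consortium_portal", 7500)
# urllib.request.urlopen(req) would actually send the update.
```

Raising the limit only unblocks indexing; it does not address whatever is creating the extra fields.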
hm_dev_consortium_entities.json
hm_dev_consortium_portal.json
hm_prod_consortium_entities.json
hm_prod_consortium_portal.json
Compared the portal index fields on PROD and DEV: 2034 new fields had been added.
new_fields.txt
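A comparison like this can be sketched as a set difference over flattened field paths. The snippet below assumes each input is the part of a mapping that holds the top-level `properties` key; the tiny `prod` and `dev` dictionaries are made-up stand-ins, not the contents of the attached files.

```python
def mapped_fields(mapping, prefix=""):
    """Return the set of dotted field paths in an Elasticsearch mapping fragment."""
    paths = set()
    for name, spec in mapping.get("properties", {}).items():
        path = f"{prefix}{name}"
        paths.add(path)
        # Object fields nest further "properties"; recurse into them.
        paths |= mapped_fields(spec, prefix=path + ".")
        # Multi-fields (for example a "keyword" sub-field) live under "fields".
        for sub in spec.get("fields", {}):
            paths.add(f"{path}.{sub}")
    return paths


# Made-up miniature mappings, for illustration only.
prod = {"properties": {
    "uuid": {"type": "keyword"},
    "metadata": {"properties": {
        "sample_id": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
    }},
}}
dev = {"properties": {
    "uuid": {"type": "keyword"},
    "metadata": {"properties": {
        "sample_id": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
        "code": {"type": "long"},
        "description": {"properties": {"uuid": {"type": "keyword"}}},
    }},
}}

new_fields = sorted(mapped_fields(dev) - mapped_fields(prod))
# new_fields == ["metadata.code", "metadata.description", "metadata.description.uuid"]
```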
On further investigation, the following Donor and Samples are causing the issue with INCORRECT metadata values.
╒═══════════════╤══════════════════════════════════╕
│"n.entity_type"│"n.uuid" │
╞═══════════════╪══════════════════════════════════╡
│"Donor" │"c624abbe9836c7e3b6a8d8216a316f30"│
├───────────────┼──────────────────────────────────┤
│"Sample" │"4d750b2cdad1579bb3bd5d89b9223431"│
├───────────────┼──────────────────────────────────┤
│"Sample" │"1de4ad9bd516660043b8e190e8978045"│
├───────────────┼──────────────────────────────────┤
│"Sample" │"dfe7f5dcac2a6d0efac3c3655dd8e29b"│
├───────────────┼──────────────────────────────────┤
│"Sample" │"8de5e146b6bf4cdb6fcbd66edecbc2fc"│
├───────────────┼──────────────────────────────────┤
│"Sample" │"a1b225e9f285dfef8101094f4c98b43f"│
├───────────────┼──────────────────────────────────┤
│"Sample" │"6d2f7e6ea8b95c690db6df8b70ef84e8"│
└───────────────┴──────────────────────────────────┘
It turned out these entities now have a metadata field that looks like the one below (from Sample 6d2f7e6ea8b95c690db6df8b70ef84e8). This metadata.description field, with its nested sub-fields, is introducing 2K+ new fields during the Elasticsearch mapping process. It SEEMS someone took an HTTP response (note the "code": 200) and stored it as the metadata value.
{
"code": 200,
"description": {
"created_by_user_displayname": "Haitham Mohamed Abdelazim",
"created_by_user_email": "haitham.mohameda@ufl.edu",
"created_by_user_sub": "68cc1d4b-66c5-4ed1-9da9-33f260fc12f0",
"created_timestamp": 1701454862364,
"data_access_level": "consortium",
"direct_ancestor": {
"created_by_user_displayname": "Haitham Mohamed Abdelazim",
"created_by_user_email": "haitham.mohameda@ufl.edu",
"created_by_user_sub": "68cc1d4b-66c5-4ed1-9da9-33f260fc12f0",
"created_timestamp": 1701450436361,
"data_access_level": "consortium",
"entity_type": "Donor",
"group_name": "TC - University of Florida",
"group_uuid": "9d7be1c8-20ea-11ee-b9e0-4f7fcf0abd92",
"hubmap_id": "HBM982.BFJP.379",
"label": "NHK#39401771",
"last_modified_timestamp": 1701450436361,
"last_modified_user_displayname": "Haitham Mohamed Abdelazim",
"last_modified_user_email": "haitham.mohameda@ufl.edu",
"last_modified_user_sub": "68cc1d4b-66c5-4ed1-9da9-33f260fc12f0",
"protocol_url": "https://dx.doi.org/10.17504/protocols.io.e6nvwdo6dlmk",
"submission_id": "TCUF0076",
"uuid": "85e4171bc51aea25204b087c19219b88"
},
"entity_type": "Sample",
"group_name": "TC - University of Florida",
"group_uuid": "9d7be1c8-20ea-11ee-b9e0-4f7fcf0abd92",
"hubmap_id": "HBM992.ZTGQ.825",
"lab_tissue_sample_id": "NA",
"last_modified_timestamp": 1701454862364,
"last_modified_user_displayname": "Haitham Mohamed Abdelazim",
"last_modified_user_email": "haitham.mohameda@ufl.edu",
"last_modified_user_sub": "68cc1d4b-66c5-4ed1-9da9-33f260fc12f0",
"metadata": {
"histological_report": "",
"lab_id": "block",
"metadata_schema_id": "3e98cee6-d3fb-467b-8d4e-9ba7ee49eeff",
"notes": "",
"pathology_distance_unit": "cm",
"pathology_distance_value": "42",
"preparation_condition": "frozen in liquid nitrogen",
"preparation_medium": "Methanol",
"preparation_protocol_doi": "https://dx.doi.org/10.17504/protocols.io.eq2lyno9qvx9/v1",
"processing_time_unit": "minute",
"processing_time_value": "",
"quality_criteria": "",
"sample_id": "HBM347.VVBC.274",
"source_id": "HBM992.ZTGQ.825",
"source_storage_duration_unit": "minute",
"source_storage_duration_value": "33",
"storage_medium": "Methanol",
"storage_method": "frozen in liquid nitrogen",
"tissue_weight_unit": "kg",
"tissue_weight_value": "42",
"volume_unit": "mm^3",
"volume_value": "1"
},
"organ": "LK",
"protocol_url": "https://dx.doi.org/10.17504/protocols.io.e6nvwdo6dlmk",
"sample_category": "organ",
"submission_id": "TCUF0076-LK",
"uuid": "e6726f3c27cf021b2f61903b93c9a552"
},
"name": "OK",
"pathname": "xwdkzwxmp23lm8xqvs6z/HIDSAMPMETA.csv",
"file_row": 3
}
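To get a rough feel for how a wrapped document like the one above adds field paths, the nested keys can be flattened into dotted paths. This is only a sketch: the `bad_metadata` value is a trimmed, hand-written subset of the example, and the real number of mapped fields also depends on how Elasticsearch dynamic mapping treats object fields and multi-fields.

```python
def leaf_paths(value, prefix=""):
    """Collect dotted paths to every scalar value in a nested JSON-like structure."""
    paths = set()
    if isinstance(value, dict):
        for key, child in value.items():
            paths |= leaf_paths(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        # List items share the path of the list itself.
        for item in value:
            paths |= leaf_paths(item, prefix)
    else:
        paths.add(prefix)
    return paths


# Trimmed subset of the wrapped metadata shown above, for illustration only.
bad_metadata = {
    "code": 200,
    "description": {
        "entity_type": "Sample",
        "direct_ancestor": {"entity_type": "Donor", "uuid": "85e4171bc51aea25204b087c19219b88"},
        "metadata": {"lab_id": "block"},
    },
    "name": "OK",
    "file_row": 3,
}

paths = leaf_paths(bad_metadata)
```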
The normal metadata of a Sample should look similar to:
{
"sample_id": "STAN0003-LI-2-6",
"vital_state": "deceased",
"health_status": "relatively healthy",
"organ_condition": "healthy",
"procedure_date": "2019-03-14",
"perfusion_solution": "HTK",
"pathologist_report": "normal, no cancer, no necrosis",
"warm_ischemia_time_value": "0",
"warm_ischemia_time_unit": "minutes",
"cold_ischemia_time_value": "237",
"cold_ischemia_time_unit": "minutes",
"specimen_preservation_temperature": "Freezer (-80 Celsius)",
"specimen_quality_criteria": "H&E",
"specimen_tumor_distance_value": "",
"specimen_tumor_distance_unit": ""
}
OR
{
"organ_donor_data": [
{
"start_datetime": "0",
"end_datetime": "",
"graph_version": "UMLS2019AA",
"concept_id": "C0086287",
"code": "1086007",
"sab": "SNOMEDCT_US",
"data_type": "Nominal",
"data_value": "",
"numeric_operator": "",
"units": "",
"preferred_term": "Female",
"grouping_concept": "C1522384",
"grouping_concept_preferred_term": "Sex",
"grouping_code": "57312000",
"grouping_sab": "SNOMEDCT_US"
},
{
"start_datetime": "0",
"end_datetime": "",
"graph_version": "UMLS2019AA",
"concept_id": "C0001779",
"code": "424144002",
"sab": "SNOMEDCT_US",
"data_type": "Numeric",
"data_value": "66",
"numeric_operator": "EQ",
"units": "years",
"preferred_term": "Age",
"grouping_concept": "C0001779",
"grouping_concept_preferred_term": "Age",
"grouping_code": "424144002",
"grouping_sab": "SNOMEDCT_US"
},
...
]
}
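Given the difference between the wrapped value and the two normal shapes, a simple guard could flag metadata that looks like a stored HTTP response before it is written. The check below is a hypothetical heuristic based only on the keys visible in the examples here (`code`, `description`, `name`); it is not part of any existing validation in the codebase.

```python
# Top-level keys seen in the wrapped example; an assumption, not a schema.
RESPONSE_WRAPPER_KEYS = {"code", "description", "name"}


def looks_like_http_response(metadata):
    """Heuristically detect a response wrapper stored in place of real metadata."""
    return (
        isinstance(metadata, dict)
        and RESPONSE_WRAPPER_KEYS.issubset(metadata)
        and isinstance(metadata["code"], int)
        and isinstance(metadata["description"], dict)
    )


wrapped = {"code": 200, "description": {"entity_type": "Sample"}, "name": "OK", "file_row": 3}
normal_sample = {"sample_id": "STAN0003-LI-2-6", "vital_state": "deceased"}
normal_donor = {"organ_donor_data": [{"code": "1086007", "preferred_term": "Female"}]}

flags = [looks_like_http_response(m) for m in (wrapped, normal_sample, normal_donor)]
# flags == [True, False, False]
```

The nested `code` inside `organ_donor_data` entries is a string at a deeper level, so it does not trigger the top-level check.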
It turned out this happened while testing the Sample Metadata Upload feature on DEV and TEST. Bill fixed the data in Neo4j, and things are back to normal after a full reindex. The total mapped fields limit is still bumped to 6000 to accommodate future increases.
But once we finalize the mapping updates described in #761 (comment), we should have fewer fields mapped.