OHDSI/ETL-Synthea

Performance issue running ETLSyntheaBuilder::LoadEventTables: insert_condition_occurrence.sql

NACHC-CAD opened this issue · 10 comments

I’m having a performance issue running ETLSyntheaBuilder::LoadEventTables: insert_condition_occurrence.sql is taking over an hour to run for the patients in the 1k COVID sample file available at https://synthea.mitre.org/downloads.

This is running on PostgreSQL.

Can someone advise which indexes should or should not exist when this script is used for this bulk load, and/or whether there are other PostgreSQL settings that should be modified to get this to run better?
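For context, I'm aware of session-level knobs like work_mem and maintenance_work_mem (both standard PostgreSQL parameters), but I'm not sure what's appropriate here. The values below are purely illustrative, not something validated in this thread:

-- Illustrative session-level settings sometimes raised for bulk loads;
-- the values are placeholders to tune to your hardware, not recommendations.
set work_mem = '256MB';             -- memory available per sort/hash operation
set maintenance_work_mem = '1GB';   -- speeds up create index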

The indexes that I have in place when ETLSyntheaBuilder::LoadEventTables runs, along with more details, are in the OHDSI forum post here:
https://forums.ohdsi.org/t/etl-synthea-performance-question-etlsyntheabuilder-loadeventtables-and-insert-condition-occurrence-sql/18452

It looks like the query shown below is the one run by insert_condition_occurrence.sql, and it was causing the poor performance (roughly 2 hours to execute).


select
	row_number() over (order by p.person_id)   condition_occurrence_id,
	p.person_id                                person_id,
	srctostdvm.target_concept_id               condition_concept_id,
	c.start                                    condition_start_date,
	c.start                                    condition_start_datetime,
	c.stop                                     condition_end_date,
	c.stop                                     condition_end_datetime,
	38000175                                   condition_type_concept_id,
	cast(null as varchar)                      stop_reason,
	pr.provider_id                             provider_id,
	fv.visit_occurrence_id_new                 visit_occurrence_id,
	fv.visit_occurrence_id_new + 1000000       visit_detail_id,
	c.code                                     condition_source_value,
	srctosrcvm.source_concept_id               condition_source_concept_id,
	null                                       condition_status_source_value,
	0                                          condition_status_concept_id
from
	etl_synthea_1k_synthea_native.conditions c
join
	etl_synthea_1k.source_to_standard_vocab_map srctostdvm
		on srctostdvm.source_code              = c.code
		and srctostdvm.target_domain_id        = 'Condition'
		and srctostdvm.target_vocabulary_id    = 'SNOMED'
		and srctostdvm.source_vocabulary_id    = 'SNOMED'
		and srctostdvm.target_standard_concept = 'S'
		and srctostdvm.target_invalid_reason is null
join
	etl_synthea_1k.source_to_source_vocab_map srctosrcvm
		on srctosrcvm.source_code           = c.code
		and srctosrcvm.source_vocabulary_id = 'SNOMED'
		and srctosrcvm.source_domain_id     = 'Condition'
left join
	etl_synthea_1k.final_visit_ids fv
		on fv.encounter_id = c.encounter
left join
	etl_synthea_1k_synthea_native.encounters e
		on c.encounter = e.id
		and c.patient  = e.patient
left join
	etl_synthea_1k.provider pr
		on e.provider = pr.provider_source_value
join
	etl_synthea_1k.person p
		on c.patient = p.person_source_value
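One way to see where the time goes, assuming you can run the statement interactively, is to put it under EXPLAIN. This is plain PostgreSQL, not part of the ETL package; the cut-down join below is just a sketch against the same tables:

-- Show the actual plan, row counts, and I/O for the expensive join.
explain (analyze, buffers)
select count(*)
from etl_synthea_1k_synthea_native.conditions c
join etl_synthea_1k.source_to_standard_vocab_map srctostdvm
	on srctostdvm.source_code = c.code;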

Adding the indexes shown below, together with all of the indexes created by the CDM script OMOPCDM_postgresql_5.4_indices.sql, resolved this issue for me.

I'm currently adding the indexes by hand. To do this I needed to break LoadEventTables up into two methods, CreateVocabTables and LoadEventTables. The calls to CreateVocabMapTables and CreateVisitRollupTables that were previously in LoadEventTables were moved to the new method. This code has been checked into the NACHC-CAD fork at https://github.com/NACHC-CAD/ETL-Synthea.


create index source_to_standard_vocab_map_sctditvisvitsctir on source_to_standard_vocab_map(
	source_code,
	target_domain_id,
	target_vocabulary_id,
	source_vocabulary_id,
	target_standard_concept,
	target_invalid_reason
);

create index source_to_standard_vocab_map_scsvsd on source_to_standard_vocab_map (
	source_code,
	source_vocabulary_id,
	source_domain_id
);

create index source_to_source_vocab_map_scsvisdi on source_to_source_vocab_map (
	source_code,
	source_vocabulary_id,
	source_domain_id
);

create index provider_psv on provider (
	provider_source_value
);

create index person_psv on person (
	person_source_value
);
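A hedged footnote: after creating indexes like these, it can also help to refresh the planner statistics so the new indexes are costed correctly; ANALYZE is standard PostgreSQL:

-- Refresh statistics so the planner can take the new indexes into account.
analyze source_to_standard_vocab_map;
analyze source_to_source_vocab_map;
analyze provider;
analyze person;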

This is excellent @NACHC-CAD -- I ran into this exact issue and am using your fork without any problems. Thanks!!!

Worked like a charm for me as well!

Is there a way that we can integrate this change as a PR? @katy-sadowski
But "we" I mean not me? 😄

We're currently working on adding support for additional versions of Synthea so this would be on the back burner until that is completed.

Yeah! I think most of the pieces are in place already to put something together:

  • @AnthonyMolinaro's PR adding an option to create the standard CDM indices - #153
  • Code in @NACHC-CAD 's fork which splits out creation of the intermediate vocab & visit tables into a separate function (may want to rename it from createVocabTables to createVocabAndVisitTables 😉)
  • @NACHC-CAD 's additional index queries - could add a param to createVocabTables that gives the option to add these indices, or make a separate function for them

@NACHC-CAD if you want to make a PR of your fork to split out the intermediate table creation, I could work on the 3rd item to incorporate the index queries; what do you think?

Also wanted to note that I had performance issues with the cost table insert query for Synthea 3.0.0. I'm using Postgres and a Synthea 3.0.0 dataset I generated with 1000 patients (default parameters). I canceled the query after it had been running for at least an hour. Adding this index on claims_transactions brought the runtime down to 17 seconds:

create index claims_transactions_cpap on source.claims_transactions (
	claimid,
	patientid,
	appointmentid,
	providerid
);
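To double-check that a new index is actually being picked up, PostgreSQL's statistics views can help. This is a generic sketch, not part of the ETL package:

-- idx_scan counts how many times each index on the table has been used.
select indexrelname, idx_scan
from pg_stat_user_indexes
where relname = 'claims_transactions';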

This made me wonder if the index issue will be a game of whack-a-mole across all the different DBMSs, configurations, and dataset sizes & contents, even just within the Synthea universe. With a thorough review of all the ETL queries we could probably come up with a decently comprehensive list of potentially beneficial indexes. On the other hand, we could also just update the docs to advise users that they may need to inspect the SQL & query plans and add indexes themselves when they hit long-running queries on their system. @fdefalco what are your thoughts on that?
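For anyone following that advice, a minimal sketch of spotting the long-running statement from a second session; pg_stat_activity is standard PostgreSQL:

-- List active statements ordered by how long they have been running.
select pid, now() - query_start as runtime, state, left(query, 100) as query
from pg_stat_activity
where state <> 'idle'
order by runtime desc;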

Yes, it will always be whack a mole, but that's ok, we can improve from good to better without worrying about it not being perfect. I've merged Molinaro's PR and will take a look at incorporating the fork updates if @NACHC-CAD is unable to submit a PR.

Sounds good! And I can take a stab at a function for the extra indices 😃

@fdefalco I just put up PR for the index piece!

Hi, I am also apparently running into performance issues. I manually applied the pull request provided by @NACHC-CAD and @katy-sadowski (thanks a lot!) - however, I have now been stuck for a couple of days (!) at the step of Running: insert_cost_v300.sql. I created a population of 100,000 patients - I was hoping to eventually go to 1M+. Are these numbers completely beyond what is feasible? Is there something I might be doing wrong?

Re-opening issue to ensure work to split out the creation of the intermediate vocab & visit tables into a separate function is captured