OHDSI/ETL-Synthea

ETLSyntheaBuilder::LoadSyntheaTables Fails on large set of data

guybartal opened this issue · 11 comments

Tying to load 1M patient OMOP data which I've generated with synthea using RStudio and ETL-Synthea,
getting the following output when running ETLSyntheaBuilder::LoadSyntheaTables

Connecting using PostgreSQL driver
Loading: allergies.csv
  |==============================================================================================| 100%
Inserting data took 53.8 secs
|--------------------------------------------------|
|==================================================|
Loading: careplans.csv
  |==============================================================================================| 100%
Inserting data took 3.67 mins
|--------------------------------------------------|
|==================================================|
|--------------------------------------------------|
|==================================================|
Loading: claims.csv
  |==============================================================================================| 100%
Inserting data took 3.57 hours
Error in data.table::fread(file = paste0(syntheaFileLoc, "/", csv), stringsAsFactors = FALSE,  : 
  Opened 285.3GB (306376172821 bytes) file ok but could not memory map it. This is a 64bit process. There is probably not enough contiguous virtual memory available.

Machine memory became full:
Screenshot 2023-06-08 235620

Was this with the bulk load option enabled (bulkLoad = TRUE) or default (bulkLoad = FALSE). If false suggest trying it with bulkLoad = TRUE.

Thanks @fdefalco, tried again with this option on, but got exception on something else, is there a way to change default temp path so I could direct it to different disk? system disk is too small to handle such files:

ETLSyntheaBuilder::LoadSyntheaTables(connectionDetails,syntheaSchema,syntheaFileLoc, bulkLoad = TRUE)
Connecting using PostgreSQL driver
Loading: allergies.csv
Attempting to use bulk loading...
COPY 826480
Bulk load to PostgreSQL took 13.2 secs
|--------------------------------------------------|
|==================================================|
Loading: careplans.csv
Attempting to use bulk loading...
COPY 3835221
Bulk load to PostgreSQL took 1.39 mins
|--------------------------------------------------|
|==================================================|
|--------------------------------------------------|
|==================================================|
Loading: claims.csv
Attempting to use bulk loading...
could not stat file "C:/Users/OHDSI_~1/AppData/Local/Temp/2/RtmpsPrbDY/pdw_insert_2b4c48447c5f.csv": value too large
Error in `bulkLoadPostgres()`:
! Error while bulk uploading data, psql returned a non zero status. Status =  1
Run `rlang::last_trace()` to see where the error occurred.
> rlang::last_trace()
<error/rlang_error>
Error in `bulkLoadPostgres()`:
! Error while bulk uploading data, psql returned a non zero status. Status =  1
---
Backtrace:
    ▆
 1. └─ETLSyntheaBuilder::LoadSyntheaTables(...)
 2.   ├─base::suppressWarnings(...)
 3.   │ └─base::withCallingHandlers(...)
 4.   ├─DatabaseConnector::insertTable(...)
 5.   └─DatabaseConnector:::insertTable.default(...)
 6.     └─DatabaseConnector:::bulkLoadPostgres(connection, sqlTableName, sqlFieldNames, sqlDataTypes, data)
Run rlang::last_trace(drop = FALSE) to see 1 hidden frame.

Both R and RStudio installed on system drive, postgresql binaries are in the data disk.

Trying again with large disk failed on the same error, any Idea why? F drive has 31TB free space:

|==================================================|
Loading: careplans.csv
Attempting to use bulk loading...
COPY 3835221
Bulk load to PostgreSQL took 1.12 mins
|--------------------------------------------------|
|==================================================|
|--------------------------------------------------|
|==================================================|
Loading: claims.csv
Attempting to use bulk loading...
could not stat file "f:/temp/RtmpEdfh85/pdw_insert_7d448ce12d7.csv": value too large
Error in `bulkLoadPostgres()`:
! Error while bulk uploading data, psql returned a non zero status. Status =  1
Run `rlang::last_trace()` to see where the error occurred.
> rlang::last_trace()
<error/rlang_error>
Error in `bulkLoadPostgres()`:
! Error while bulk uploading data, psql returned a non zero status. Status =  1
---
Backtrace:
    ▆
 1. └─ETLSyntheaBuilder::LoadSyntheaTables(...)
 2.   ├─base::suppressWarnings(...)
 3.   │ └─base::withCallingHandlers(...)
 4.   ├─DatabaseConnector::insertTable(...)
 5.   └─DatabaseConnector:::insertTable.default(...)
 6.     └─DatabaseConnector:::bulkLoadPostgres(connection, sqlTableName, sqlFieldNames, sqlDataTypes, data)
Run rlang::last_trace(drop = FALSE) to see 1 hidden frame.

It appears to be related to a size limitation in certain versions of Postgres. This post claims the issue is fixed in v14 and above.

https://stackoverflow.com/questions/53523051/error-could-not-stat-file-xx-csv-unknown-error

Thanks @fdefalco, I'll check that

I've tried v15 on a smaller set of generated synthea data set and got:

> ETLSyntheaBuilder::LoadSyntheaTables(connectionDetails,syntheaSchema,syntheaFileLoc, bulkLoad = TRUE)
Connecting using PostgreSQL driver
Loading: allergies.csv
Attempting to use bulk loading...
COPY 82641
Bulk load to PostgreSQL took 4.57 secs
Loading: careplans.csv
Attempting to use bulk loading...
COPY 382391
Bulk load to PostgreSQL took 5.67 secs
|--------------------------------------------------|
|==================================================|
|--------------------------------------------------|
|==================================================|
Loading: claims.csv
Attempting to use bulk loading...
COPY 11028076
Bulk load to PostgreSQL took 4.95 mins
|--------------------------------------------------|
|==================================================|
Loading: claims_transactions.csv
Attempting to use bulk loading...
COPY 68749626
Bulk load to PostgreSQL took 24.8 mins
Loading: conditions.csv
Attempting to use bulk loading...
COPY 3734250
Bulk load to PostgreSQL took 48.8 secs
Loading: devices.csv
Attempting to use bulk loading...
COPY 8530
Bulk load to PostgreSQL took 0.388 secs
|--------------------------------------------------|
|==================================================|
|--------------------------------------------------|
|==================================================|
Loading: encounters.csv
Attempting to use bulk loading...
COPY 5887089
Bulk load to PostgreSQL took 1.67 mins
|--------------------------------------------------|
|==================================================|
Loading: imaging_studies.csv
Attempting to use bulk loading...
COPY 10557776
Bulk load to PostgreSQL took 2.29 mins
Loading: immunizations.csv
Attempting to use bulk loading...
COPY 1679662
Bulk load to PostgreSQL took 13.2 secs
Loading: medications.csv
Attempting to use bulk loading...
COPY 5140987
Bulk load to PostgreSQL took 1.3 mins
|--------------------------------------------------|
|==================================================|
Loading: observations.csv
Attempting to use bulk loading...
COPY 50819943
Bulk load to PostgreSQL took 8.42 mins
Loading: organizations.csv
Attempting to use bulk loading...
COPY 9113
Bulk load to PostgreSQL took 0.487 secs
Loading: patients.csv
Attempting to use bulk loading...
COPY 114424
Bulk load to PostgreSQL took 1.69 secs
|--------------------------------------------------|
|==================================================|
Loading: payer_transitions.csv
Attempting to use bulk loading...
ERROR:  invalid byte sequence for encoding "UTF8": 0xa9
CONTEXT:  COPY payer_transitions, line 2550396
Error in `bulkLoadPostgres()`:
! Error while bulk uploading data, psql returned a non zero status. Status =  1
Run `rlang::last_trace()` to see where the error occurred.
> rlang::last_trace()
<error/rlang_error>
Error in `bulkLoadPostgres()`:
! Error while bulk uploading data, psql returned a non zero status. Status =  1
---
Backtrace:
    ▆
 1. └─ETLSyntheaBuilder::LoadSyntheaTables(...)
 2.   ├─base::suppressWarnings(...)
 3.   │ └─base::withCallingHandlers(...)
 4.   ├─DatabaseConnector::insertTable(...)
 5.   └─DatabaseConnector:::insertTable.default(...)
 6.     └─DatabaseConnector:::bulkLoadPostgres(connection, sqlTableName, sqlFieldNames, sqlDataTypes, data)
Run rlang::last_trace(drop = FALSE) to see 1 hidden frame

This looks like the payer_transistions.csv file has a non-utf8 character. Which version of synthea are you using to generate your data? The data I see in my files resembles the following:

PATIENT,START_YEAR,END_YEAR,PAYER,OWNERSHIP
b13b0ba7-6c0b-a31f-d223-89711e261c13,1991,1991,7c4411ce-02f1-39b5-b9ec-dfbea9ad3c1a,Self
f5208a83-3665-db8f-3cc0-07fcdb1299d6,1977,1995,5059a55e-5d6e-34d1-b6cb-d83d16e57bcf,Guardian

Can you check what is occurring on line 2550396 and see what character might be causing an issue?

I'm using Synthea v3.0.0, on line 2550396 I see the following:

93c2080a-a682-0003-1c80-0cdf77574f76,1d2c536d-533a-78c1-f7f4-1c0f0f02b044,1994-12-20T07:25:13Z,1995-12-26T07:25:13Z,4d71f845-a6a9-3c39-b242-14d25ef86a8d,,Self,Ester635 Henríquez109

I guess this is the character which causes the problem: í

BTW, those characters appear much before this line (copied some examples from line 3103):

36af2cb3-2291-01b4-6120-cf8dfe10cf93,5372a4e2-7c9c-1080-27dd-aefda85f181f,1945-03-25T09:34:44Z,1946-03-31T09:34:44Z,047f6ec3-6215-35eb-9608-f9dda363a44c,,Guardian,Alfredo17 Suárez24
36af2cb3-2291-01b4-6120-cf8dfe10cf93,ad0418ff-2816-f4d5-7873-196151aa0c34,1946-03-31T09:34:44Z,1947-04-06T09:34:44Z,4d71f845-a6a9-3c39-b242-14d25ef86a8d,,Guardian,Alfredo17 Suárez24
36af2cb3-2291-01b4-6120-cf8dfe10cf93,ced3a66d-4762-777d-73a5-d761b4bf3e2a,1947-04-06T09:34:44Z,1948-04-11T09:34:44Z,5059a55e-5d6e-34d1-b6cb-d83d16e57bcf,,Guardian,Yolanda648 Griego535
36af2cb3-2291-01b4-6120-cf8dfe10cf93,ced3a66d-4762-777d-73a5-d761b4bf3e2a,1948-04-11T09:34:44Z,1949-04-17T09:34:44Z,5059a55e-5d6e-34d1-b6cb-d83d16e57bcf,,Guardian,Yolanda648 Griego535
36af2cb3-2291-01b4-6120-cf8dfe10cf93,5482ba34-683e-2cf6-b073-3178a48e72f4,1949-04-17T09:34:44Z,1950-04-23T09:34:44Z,6e2f1a2d-27bd-3701-8d08-dae202c58632,,Self,Rosalia943 Suárez24
36af2cb3-2291-01b4-6120-cf8dfe10cf93,905aa6c4-9236-d4df-7175-5ba7e44e69d3,1950-04-23T09:34:44Z,1951-03-04T09:34:44Z,5059a55e-5d6e-34d1-b6cb-d83d16e57bcf,,Self,Rosalia943 Suárez24
36af2cb3-2291-01b4-6120-cf8dfe10cf93,,1951-03-04T09:34:44Z,1951-09-02T09:34:44Z,b1c428d6-4f07-31e0-90f0-68ffa6ff8c76,,,
36af2cb3-2291-01b4-6120-cf8dfe10cf93,,1951-09-02T09:34:44Z,1952-09-07T09:34:44Z,b1c428d6-4f07-31e0-90f0-68ffa6ff8c76,,,
36af2cb3-2291-01b4-6120-cf8dfe10cf93,18cecf47-2282-bb4b-3c9f-589c89485419,1952-09-07T09:34:44Z,1953-06-07T09:34:44Z,42c4fca7-f8a9-3cd1-982a-dd9751bf3e2a,,Self,Rosalia943 Suárez24
36af2cb3-2291-01b4-6120-cf8dfe10cf93,,1953-06-07T09:34:44Z,1954-06-13T09:34:44Z,b1c428d6-4f07-31e0-90f0-68ffa6ff8c76,,,
36af2cb3-2291-01b4-6120-cf8dfe10cf93,42281269-8d3a-5170-efb3-34548d8a2134,1954-06-13T09:34:44Z,1955-06-19T09:34:44Z,6e2f1a2d-27bd-3701-8d08-dae202c58632,,Self,Rosalia943 Suárez24
36af2cb3-2291-01b4-6120-cf8dfe10cf93,1b9fe166-f6fd-842e-c05d-6154834179df,1955-06-19T09:34:44Z,1956-06-24T09:34:44Z,5059a55e-5d6e-34d1-b6cb-d83d16e57bcf,,Self,Rosalia943 Suárez24
36af2cb3-2291-01b4-6120-cf8dfe10cf93,ed0b7387-c60e-79ce-e244-45549ab9cecb,1956-06-24T09:34:44Z,1957-06-30T09:34:44Z,42c4fca7-f8a9-3cd1-982a-dd9751bf3e2a,,Self,Rosalia943 Suárez24
36af2cb3-2291-01b4-6120-cf8dfe10cf93,bc5d0f8c-b0e7-44c5-2b4c-20e964f52247,1957-06-30T09:34:44Z,1958-07-06T09:34:44Z,4d71f845-a6a9-3c39-b242-14d25ef86a8d,,Self,Rosalia943 Suárez24
36af2cb3-2291-01b4-6120-cf8dfe10cf93,bc5d0f8c-b0e7-44c5-2b4c-20e964f52247,1958-07-06T09:34:44Z,1959-03-01T09:34:44Z,4d71f845-a6a9-3c39-b242-14d25ef86a8d,,Self,Rosalia943 Suárez24
36af2cb3-2291-01b4-6120-cf8dfe10cf93,,1959-03-01T09:34:44Z,1959-09-06T09:34:44Z,b1c428d6-4f07-31e0-90f0-68ffa6ff8c76,,,
36af2cb3-2291-01b4-6120-cf8dfe10cf93,,1959-09-06T09:34:44Z,1960-09-11T09:34:44Z,b1c428d6-4f07-31e0-90f0-68ffa6ff8c76,,,
36af2cb3-2291-01b4-6120-cf8dfe10cf93,87e115e4-e640-68ac-19e4-6b9cb2b78e83,1960-09-11T09:34:44Z,1961-09-17T09:34:44Z,4d71f845-a6a9-3c39-b242-14d25ef86a8d,,Spouse,Mr. Mejía318
36af2cb3-2291-01b4-6120-cf8dfe10cf93,64fe08cb-1954-18d1-b2ef-4d490300c03f,1961-09-17T09:34:44Z,1962-09-23T09:34:44Z,42c4fca7-f8a9-3cd1-982a-dd9751bf3e2a,,Self,Rosalia943 Mejía318
36af2cb3-2291-01b4-6120-cf8dfe10cf93,3e168f68-73c0-3783-95f3-61148067de59,1962-09-23T09:34:44Z,1963-09-29T09:34:44Z,4d71f845-a6a9-3c39-b242-14d25ef86a8d,,Self,Rosalia943 Mejía318

I believe the files are suppose to be UTF-8. Are you running with the same property settings listed here?

https://github.com/synthetichealth/synthea/blob/f838c51c90fa9cd4d7494fe2d46b9b5ad724bc1e/src/main/resources/synthea.properties#L38

If Synthea is generating non UTF-8 characters, maybe a quick issue on their repo to see if this is intended. They are usually very responsive.

After configuring Synthea to use UTF-8, I've managed to run this.

exporter.encoding = UTF-8