This is a script testbed for importing GTFS CSV files from Transport for Ireland into a SQLite database.
Performance was compared for push based stream event parsing with popular CSV parsing libraries PapaParse and csv-parse for Node.js as well as pull based processing with pipeline
, and for await (...)
using PapaParse.
Parsing with csv-parse link-to-file followed excellent stream documentation example method here. Lines are formatted during each event and pushed to an array for batch database insertion. Passing the parser cast
option a separate castValues function proved more than twice as slow as performing type converion as part of the formatLine call.
PapaParse also supports file streams and documents setup options but does not provide complete code examples for their use. Current verion 5.4.0
of PapaParse also proved broken for stream inputs when using the header
option needed to return parsed lines as an object. Rolling back to version 5.3.0
proved successful.
PapaParse link-to-file performed around 50% faster compared to csv-parse when processing 13M records. (65 seconds vs. 117 seconds when skipping database insertion).
Pipeline with a for await loop greatly simplifies code complexity when working with streams by not requiring direct management of event listeners. I had high hopes, but pipeline
for await (const record of stream) {...}
proved not up to the task. PapaParse w/ pipeline link-to-file took 467 seconds compared to just 65 when compared to the event based code above. And this still doesn't include database inserts! This article by Dan Vanderkam here goes into the problem in more detail. Discussion is still active, such as this issue Performance of for await of (async iteration).
Visible feedback is important for long running tasks such as this project. Don't, however, take the naive approach of updating progress on every event record. Logging itself has a significant impact on performance.
PapaParse takes 54 seconds to parse 13M records when only logging the completion of each csv file. Adding a real time progress count when a SQL batch size limit is reached increases parse time to 65 seconds. Logging every event took 2446 seconds...40.76 minutes of wasted time.
Inserted in batches of 9_000 records
Parser | Notes | Average runtime |
---|---|---|
PapaParse | events | 4.70 |
PapaParse | pipeline & for await(...) | 7.24 |
csv-parse | events | 5.56 |
csv-parse | w/cast values function | 13.74 |
csv-parse | node-gtfs | 5.10 |
Parser | Notes | Average Runtime | in Minutes |
---|---|---|---|
PapaParse | w/ SQL PRAGMA | 300 | 5.05 |
PapaParse | w/ SQL PRAGMA & transactions | 287 | 4.78 |
node-gtfs | Batch limit: 800 | 1087 | 18.11 |
node-gtfs | Batch limit: 8_000 | 510 | 8.65 |
node-gtfs | Batch limit: 20_000 | 397 | 6.61 |
node-gtfs | Batch limit: 30_000 | 406.5 | 6.78 |
Parser | Notes | Runtime seconds |
---|---|---|
PapaParse | 65 | |
csv-parse | 118 | |
PapaParse | pipeline & for await(...) | 467 😢 |
PapaParse | log all 13M events | 2446 😵 40.76 minutes |
node-gtfs | Batch limit: 800 | 213 |
node-gtfs | Batch limit: 8_000 | 171 |
node-gtfs | Batch limit: 20_000 | 162 |
Parser | Notes | Runtime seconds |
---|---|---|
PapaParse | w/ PRAGMA & global transaction | 4.64 |
PapaParse | w/ PRAGMA transaction BATCH | 4.79 |
PapaParse | w/ global transaction | 4.718 |
PapaParse | no pragma or transactions | 4.72 |
npm install
npm run build
npm run import
- node-gtfs - Import and Export GTFS transit data into SQLite. Query or change routes, stops, times, fares and more.