go-sql-import

Import 1 million records into SQL from CSV or fixed-length format files with good performance

In this demo, we import 1 million records into PostgreSQL from CSV and fixed-length format files with high performance in GOLANG. This article outlines the batch-processing strategies that make the import fast.

Import into SQL from CSV or fixed-length format files

Test Info

  • RAM: 12 GB
  • Disk: SSD KINGSTON SA400S37240G ATA Device
  • Executable file size: 12 MB
  • Database: PostgreSQL 16
  • Total rows: 1,018,584
  • Total columns: 76
    • 20% of the columns are datetime, so parsing strings into datetime values takes significant time.
  • Power Usage: Moderate

Results for importing 1,018,584 rows into PostgreSQL:

Type          File Size  CPU   RAM    Disk      Without data validation  With data validation
Fixed-length  1.15 GB    6.6%  33 MB  3.1 MB/s  5 min 16 sec             6 min 10 sec
Delimiter     0.975 GB   8%    34 MB  2.8 MB/s  5 min 9 sec              5 min 57 sec
CSV           0.975 GB   5.2%  26 MB  3 MB/s    5 min 12 sec             5 min 58 sec

Test data

  • For the fixed-length format file, unzip the 'data/fixedlength_big.7z' file (1.15 GB)
  • For the CSV file, unzip the 'data/delimiter_big.7z' file (0.975 GB)

Batch jobs

Batch jobs differ from online processing:

  • Long-running, often at night after working hours
  • Non-interactive, often including logic for handling errors
  • Large volumes of data

Challenges in Batch Processing

  • Data Size: Large volumes can strain memory and I/O performance.
  • Database Constraints: SQL databases (PostgreSQL, MySQL, SQL Server, ...) may struggle with high transaction loads.

High-Performance Strategy

Streaming and Buffered I/O:

  • Instead of loading the entire file into memory, stream it line by line using Go's buffered I/O. This keeps memory consumption low.
  • Example: use bufio.Scanner to read records line by line, as sketched below.
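
A minimal sketch, assuming a hypothetical input path; the scanner's buffer is widened because a 76-column row can exceed the default 64 KB token limit:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
)

func main() {
	f, err := os.Open("data/delimiter_big.csv") // hypothetical path
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	// Wide rows can exceed the default 64 KB token limit, so give the
	// scanner a 1 MB buffer.
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024)

	count := 0
	for scanner.Scan() {
		line := scanner.Text()
		_ = line // parse and insert the record here
		count++
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
	fmt.Println("lines read:", count)
}
```

Because only one line is held in memory at a time, memory usage stays flat regardless of the 1.15 GB file size.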

Bulk Inserts:

Bulk inserts in SQL are typically faster because they reduce the overhead associated with individual inserts. Here’s why:

  • Transaction Handling: In a bulk insert, multiple rows are inserted in a single transaction, reducing the need for multiple commits, which can be expensive in terms of I/O operations.
  • Logging: Bulk inserts often minimize logging overhead, especially if the database is configured to use minimal logging for bulk operations (like in SQL Server with the "BULK_LOGGED" recovery model).
  • Constraints: When inserting data in bulk, constraints such as foreign keys and unique indexes may be deferred or optimized by the database engine.
  • Index Updates: Instead of updating indexes after each row insert, bulk inserts allow the database to update indexes in batches, improving performance.

However, it’s important to note that bulk inserts still need to ensure data integrity. Some databases provide options to temporarily disable constraints or logging to optimize performance further, but this can lead to data consistency issues if not handled properly.
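
To make this concrete, here is a minimal multi-row INSERT sketch using database/sql with PostgreSQL placeholders. The users table, its columns, the DSN, and the lib/pq driver are assumptions for illustration, not the repository's actual schema:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"
	"strings"

	_ "github.com/lib/pq" // hypothetical driver choice
)

type User struct {
	ID   int
	Name string
}

// execer is satisfied by both *sql.DB and *sql.Tx, so the same helper
// works standalone or inside a transaction.
type execer interface {
	Exec(query string, args ...interface{}) (sql.Result, error)
}

// bulkInsert sends all rows in a single multi-row INSERT statement.
func bulkInsert(e execer, users []User) error {
	placeholders := make([]string, 0, len(users))
	args := make([]interface{}, 0, len(users)*2)
	for i, u := range users {
		placeholders = append(placeholders, fmt.Sprintf("($%d, $%d)", i*2+1, i*2+2))
		args = append(args, u.ID, u.Name)
	}
	query := "INSERT INTO users (id, name) VALUES " + strings.Join(placeholders, ", ")
	_, err := e.Exec(query, args...)
	return err
}

func main() {
	db, err := sql.Open("postgres", "postgres://user:pass@localhost/db?sslmode=disable") // hypothetical DSN
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := bulkInsert(db, []User{{1, "Ada"}, {2, "Alan"}}); err != nil {
		log.Fatal(err)
	}
}
```

Note that PostgreSQL caps a single statement at 65,535 bind parameters, so with 76 columns a single statement can hold roughly 860 rows at most.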

Batch Inserts:

  • Insert records into PostgreSQL in batches (e.g., 1000 records per transaction). This reduces transaction overhead and improves performance.
  • In this sample, we use batch inserts and still achieve very good performance (see the sketch below).
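
A minimal sketch of the batching loop, reusing the User type and bulkInsert helper from the sketch above (batchSize would be 1000 in this example):

```go
// importInBatches commits every batchSize records in one transaction,
// so a million rows cost about a thousand commits instead of a million.
func importInBatches(db *sql.DB, users []User, batchSize int) error {
	for start := 0; start < len(users); start += batchSize {
		end := start + batchSize
		if end > len(users) {
			end = len(users)
		}
		tx, err := db.Begin()
		if err != nil {
			return err
		}
		// bulkInsert accepts the transaction because *sql.Tx satisfies
		// the execer interface from the previous sketch.
		if err := bulkInsert(tx, users[start:end]); err != nil {
			tx.Rollback()
			return err
		}
		if err := tx.Commit(); err != nil {
			return err
		}
	}
	return nil
}
```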

Error Handling and Logging:

  • Implement robust error handling and logging. Track failed records so they can be fixed and reprocessed later, as sketched below.
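
One possible shape for this, reusing the helpers above: if a whole batch fails, retry its rows one by one and log the offenders (to a hypothetical failLog) for later reprocessing:

```go
// insertWithFallback retries a failed batch row by row so one bad
// record does not reject the other 999, and records each failure.
func insertWithFallback(db *sql.DB, batch []User, failLog *log.Logger) error {
	if err := bulkInsert(db, batch); err == nil {
		return nil
	}
	for _, u := range batch {
		if err := bulkInsert(db, []User{u}); err != nil {
			failLog.Printf("insert failed id=%d: %v", u.ID, err)
		}
	}
	return nil
}
```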

Conclusion

By carefully handling file I/O, database interactions, and error management, you can ensure high performance when importing large datasets into SQL.

Import flow

Import flow with data validation

Common Architectures

Layer Architecture

  • Popular for web development


Hexagonal Architecture

  • Suitable for Import Flow


Based on the flow, there are 4 main components (4 main ports):

  • Reader, Validator, Transformer, Writer (sketched as Go interfaces below)
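
One way to express the 4 ports as Go interfaces; the method signatures below are illustrative assumptions, not the repository's actual API:

```go
package ports

import "context"

// The 4 ports of the import flow. Adapters (file readers, schema
// validators, transformers, SQL writers) plug into these interfaces.
type Reader interface {
	Read(ctx context.Context) (line string, err error)
}

type Transformer interface {
	Transform(ctx context.Context, line string) (record interface{}, err error)
}

type Validator interface {
	Validate(ctx context.Context, record interface{}) error
}

type Writer interface {
	Write(ctx context.Context, record interface{}) error
}
```
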
Reader

Reader Adapter Sample: File Reader. We provide 2 file reader adapters:

  • Delimiter (CSV format) File Reader
  • Fixed-Length File Reader
Validator

  • Validator Adapter Sample: Schema Validator
  • We provide a schema validator based on GOLANG struct tags, as sketched below
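
A minimal sketch of tag-driven validation with the standard reflect package; the Person struct and the `validate:"required"` tag convention are hypothetical:

```go
package main

import (
	"fmt"
	"reflect"
)

type Person struct {
	Name  string `validate:"required"`
	Email string `validate:"required"`
	Note  string // untagged: optional
}

// validateRequired rejects a struct whose fields tagged
// `validate:"required"` still hold their zero value.
func validateRequired(v interface{}) error {
	rv := reflect.ValueOf(v) // assumes v is a struct
	rt := rv.Type()
	for i := 0; i < rt.NumField(); i++ {
		if rt.Field(i).Tag.Get("validate") == "required" && rv.Field(i).IsZero() {
			return fmt.Errorf("field %s is required", rt.Field(i).Name)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateRequired(Person{Name: "Ada"})) // field Email is required
}
```
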
Transformer

We provide 2 transformer adapters

  • Delimiter Transformer (CSV)
  • Fixed-Length Transformer (sketched below)
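
A minimal sketch of the fixed-length side: cut each line into named fields by byte offsets. The field layout is hypothetical:

```go
package main

import (
	"fmt"
	"strings"
)

// field describes one fixed-width column: its name and the byte
// offsets [start, end) within a line.
type field struct {
	name       string
	start, end int
}

// transformFixedLength slices a line into trimmed column values.
func transformFixedLength(line string, fields []field) map[string]string {
	out := make(map[string]string, len(fields))
	for _, f := range fields {
		out[f.name] = strings.TrimSpace(line[f.start:f.end])
	}
	return out
}

func main() {
	layout := []field{{"id", 0, 5}, {"name", 5, 15}}
	fmt.Println(transformFixedLength("00042Ada Lovela", layout))
}
```
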
Writer

We provide many writer adapters: