noonchen/STDF-Viewer

Scalability improvements

Opened this issue · 1 comment

Hi, thanks for writing a nice and useful program. We've found it useful, and I would like to share two improvements that helped us. I'll describe them here rather than open a pull request, since the changes are only a few lines of code and it's easier to share them this way.

Scalability: memory use when building the database from a large input file. We've been using some large input files (gigabytes) with many PTR records.

In deps/rust_stdf_helper/src/lib.rs, a channel is used to communicate between the STDF parser and the database writer.

The channel has an unbounded buffer, so if the database write is slower than the parsing, the parser thread can end up buffering the complete contents of the STDF file in the channel(!). For us, this exhausted all the memory of a laptop, and it used an unnecessary amount of memory on bigger machines.

The fix, inside generate_database:

  1. Use a bounded channel - this keeps the memory use bounded
  2. As an extra, switch to crossbeam-channel for improved CPU utilization; we used crossbeam-channel = "0.5.10"
  3. The final code looked like this for us:
    const CHANNEL_MAX_BUFFER: usize = 10_000;

    // prepare channel for multithreading communication
    // use bounded channel so that memory use stays bounded
    let (tx, rx) = crossbeam_channel::bounded(CHANNEL_MAX_BUFFER);

Before: memory usage was gigabytes for a gigabytes-sized input file. After: we can build the database with less than 100 MB of resident RAM, as it should be. 🙂
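For reference, here is a minimal standalone sketch of the bounded-channel pattern (assuming crossbeam-channel 0.5 as a dependency; the record type and thread bodies are illustrative stand-ins, not the actual parser or database writer from generate_database):

    use std::thread;

    // Illustrative stand-in for a parsed STDF record.
    struct ParsedRecord {
        test_id: u32,
        value: f32,
    }

    const CHANNEL_MAX_BUFFER: usize = 10_000;

    fn main() {
        // Bounded channel: send() blocks once the buffer holds
        // CHANNEL_MAX_BUFFER records, so the parser can never run
        // arbitrarily far ahead of the database writer.
        let (tx, rx) = crossbeam_channel::bounded::<ParsedRecord>(CHANNEL_MAX_BUFFER);

        // Parser thread (producer): pushes records into the channel.
        let parser = thread::spawn(move || {
            for i in 0..1_000_000u32 {
                let rec = ParsedRecord { test_id: i, value: i as f32 };
                // Blocks when the buffer is full -> backpressure.
                if tx.send(rec).is_err() {
                    break; // receiver is gone, stop parsing
                }
            }
            // tx is dropped here, which closes the channel.
        });

        // Database writer (consumer): drains the channel.
        let mut written = 0u64;
        let mut checksum = 0.0f64;
        for rec in rx {
            // Stand-in for the real database insert.
            checksum += f64::from(rec.test_id) + f64::from(rec.value);
            written += 1;
        }

        parser.join().unwrap();
        println!("wrote {written} records (checksum {checksum})");
    }

The key point is the backpressure: with a bounded channel the parser stalls whenever it is more than CHANNEL_MAX_BUFFER records ahead of the writer, so the in-flight data stays small no matter how big the input file is.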

Scalability: database table keys and indexes. We experimented with adding indexes to speed up the program and the export to Excel/CSV. The PTR_Data table is the most impactful one, of course, since it has so many records. We ended up changing its primary key like this (MPR_Data would probably also be worth changing):

    PRIMARY KEY (DUTIndex, TEST_ID)) WITHOUT ROWID;  -- old
    PRIMARY KEY (TEST_ID, DUTIndex)) WITHOUT ROWID;  -- new

The new order corresponds better to the database lookups on TEST_ID: those are accelerated by the native ordering of the primary key when TEST_ID is the first component of the key. (A separate index would also work, but it requires more space.)
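To illustrate the idea, a small self-contained sketch (assuming the rusqlite crate for SQLite access; the column set is trimmed down and is not the actual PTR_Data schema, it only shows the new key order and the kind of query that benefits from it):

    use rusqlite::{params, Connection, Result};

    fn main() -> Result<()> {
        let conn = Connection::open_in_memory()?;

        // Simplified PTR_Data table with the new key order: TEST_ID first,
        // so rows with the same TEST_ID are stored contiguously and a lookup
        // by TEST_ID becomes a range scan over the clustered primary key,
        // without needing a separate index.
        conn.execute_batch(
            "CREATE TABLE PTR_Data (
                 DUTIndex INTEGER,
                 TEST_ID  INTEGER,
                 RESULT   REAL,
                 PRIMARY KEY (TEST_ID, DUTIndex)) WITHOUT ROWID;",
        )?;

        conn.execute(
            "INSERT INTO PTR_Data (DUTIndex, TEST_ID, RESULT) VALUES (?1, ?2, ?3)",
            params![1, 42, 0.5],
        )?;

        // The kind of lookup that benefits from TEST_ID leading the key,
        // e.g. collecting all results of one test for export.
        let mut stmt =
            conn.prepare("SELECT DUTIndex, RESULT FROM PTR_Data WHERE TEST_ID = ?1")?;
        let rows = stmt.query_map(params![42], |row| {
            Ok((row.get::<_, i64>(0)?, row.get::<_, f64>(1)?))
        })?;
        for row in rows {
            let (dut, result) = row?;
            println!("DUT {dut}: {result}");
        }
        Ok(())
    }

Whether the project uses rusqlite or another SQLite binding, the schema change itself is the same: only the order of the primary-key components differs.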

Hope this is helpful.

Hi bluss,

Thank you so much for sharing! My test files are limited and it is very hard to cover all the scenarios, so your input is really appreciated.

It looks like you have already implemented the changes and tested them with your files; you're most welcome to create a pull request and I would gladly review it!

Thanks again.