Transaction Processor

Code challenge to implement a simple transaction processor that takes a CSV of transactions as input and writes another CSV with the resulting account states to stdout.

Running

Processing

You can process inputs in two modes (remember to run gen.sh first):

  1. Single-threaded: cargo run --release inputs/big/random.csv > result.csv
  2. Multi-threaded: cargo run --features multithreaded --release inputs/big/random.csv > result.csv

You can also enable verbose output of invalid transactions with --features stderr, but it slows down processing considerably, so it should only be used on smaller inputs, e.g. cargo run --features stderr --release inputs/complicated.csv > result.csv.
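For context, a cargo feature like this usually gates per-record logging behind conditional compilation; every invalid transaction then produces a write to stderr, which is where the slowdown comes from. A minimal sketch of how such gating can look (illustrative only, not the actual code):

// With the "stderr" feature enabled the macro forwards to eprintln!,
// otherwise it compiles away to nothing.
#[cfg(feature = "stderr")]
macro_rules! log_invalid {
    ($($arg:tt)*) => { eprintln!($($arg)*) };
}

#[cfg(not(feature = "stderr"))]
macro_rules! log_invalid {
    ($($arg:tt)*) => {};
}

// Usage somewhere in the processing loop (identifiers hypothetical):
// log_invalid!("rejected transaction {}: insufficient funds", tx_id);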

Generating

You can generate input CSV data with two commands; both of them generate 10,000,000 transactions.

  1. Random: cargo run --release inputs/big/random.csv genrandom
  2. "Smart" random: cargo run --release inputs/big/random.csv gen, or run gen.sh

The weird ordering of the arguments is because of challenge constraints.
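Based on point 6 under development decisions, the difference between the two is that the "smart" generator skews the transaction-type distribution so that chargebacks (which lock accounts) stay rare. A hypothetical sketch of that kind of weighted generation, not the actual generator code:

// Hypothetical sketch using the rand crate. The column layout
// (type,client,tx,amount) is inferred from the Transaction type shown
// further below and may not match the real output exactly.
use rand::Rng;

fn random_line(rng: &mut impl Rng, tx: u32) -> String {
    let client: u16 = rng.gen_range(1..=1_000);
    // Deposits dominate; chargebacks are deliberately rare so that most
    // accounts stay unlocked and the processing stage keeps real work to do.
    match rng.gen_range(0..1_000) {
        0 => format!("chargeback,{},{},", client, tx),
        1..=20 => format!("dispute,{},{},", client, tx),
        _ => format!("deposit,{},{},{:.4}", client, tx, rng.gen_range(0.01..100.0)),
    }
}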

Docs

You can generate documentation by running doc.sh; it will open automatically in a browser tab.

Benchmarking

You can time the execution of the multi-threaded and single-threaded processing with time_multi.sh and time.sh respectively.

Profiling

You can generate a flamegraph profile of the single-threaded implementation by running profile.sh; it requires that you have previously run cargo install flamegraph. Remember to add

[profile.release]
debug = true

to Cargo.toml, or the flamegraph will not contain debug symbols.

Testing

Run cargo test to execute unit tests.

Development decisions

  1. From the start I chose the async approach because, in my experience, it often ends up utilizing system resources better.
  2. In the beginning I opted for async-std as the runtime because csv-async did not support tokio.
  3. Also in the beginning I used csv-async + serde for both serialization and deserialization, for simplicity.
  4. After I had verified that the processing worked correctly, I started searching for bottlenecks and found the first one in the runtime; switching to smol doubled performance.
  5. After switching runtimes, the flamegraph made it apparent that there were two distinct operations that could happen in parallel, parsing and processing (mostly hashmap lookups), so I implemented the multi-threaded version and switched to a faster hashmap (a rough channel-based sketch of this split is at the end of this section).
  6. While testing the multi-threaded implementation, it was initially slower. I later found out that with enough (10 million) transactions and an even distribution of types, most accounts end up locked, so the processing step degenerates into a single hashmap lookup per transaction. Reducing the probability of chargeback transactions in the generator made the multi-threaded approach faster.
  7. The next bottleneck seemed to be parsing, so I switched to a custom parser. This also allowed me to change the transaction representation to a better one, from
struct Transaction {
    ty: TransactionType,
    client: ClientId,
    tx: TransactionId,
    amount: Option<f64>,
}

to

pub enum Transaction {
    Deposit {
        client: ClientId,
        tx: TransactionId,
        amount: f64,
    },
    Dispute {
        client: ClientId,
        tx: TransactionId,
    },
    // etc.
}

Parsing is one of the most dangerous steps in computing, so if this were a production system, a lot more testing and scrutiny would have to go into the parser as well as its lower-level dependencies (atoi, fast-float, parse-display) and of course the smol runtime itself.
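To make the shape of such a parser concrete, here is a simplified, illustrative sketch only (not the project's parser); the placeholder type widths and the column order type,client,tx,amount are assumptions inferred from the enum above:

// Placeholder widths assumed for the sketch.
type ClientId = u16;
type TransactionId = u32;

enum Transaction {
    Deposit { client: ClientId, tx: TransactionId, amount: f64 },
    Dispute { client: ClientId, tx: TransactionId },
    // remaining variants elided
}

// Splits a single CSV line on commas at the byte level and parses the numeric
// fields with atoi / fast-float, returning None on any malformed input.
fn parse_line(line: &[u8]) -> Option<Transaction> {
    let mut fields = line.split(|&b| b == b',');
    let ty = fields.next()?;
    let client: ClientId = atoi::atoi(fields.next()?)?;
    let tx: TransactionId = atoi::atoi(fields.next()?)?;
    match ty {
        b"deposit" => {
            let amount: f64 = fast_float::parse(fields.next()?).ok()?;
            Some(Transaction::Deposit { client, tx, amount })
        }
        b"dispute" => Some(Transaction::Dispute { client, tx }),
        _ => None, // remaining variants elided
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn parses_a_deposit() {
        assert!(matches!(
            parse_line(b"deposit,1,42,10.5"),
            Some(Transaction::Deposit { client: 1, tx: 42, .. })
        ));
    }
}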

  8. If I had more time I would try to parallelize the parsing step by partitioning the file by lines and feeding a chunk of lines to each task, roughly as sketched below.
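A rough sketch of that line partitioning (illustrative only; the chunk count and buffer handling are assumptions, not existing code):

// Splits the raw input buffer into roughly equal chunks that always end on a
// newline, so each chunk can be handed to a separate parsing task.
fn partition_lines(data: &[u8], chunks: usize) -> Vec<&[u8]> {
    let target = data.len() / chunks.max(1) + 1;
    let mut out = Vec::with_capacity(chunks);
    let mut start = 0;
    while start < data.len() {
        let tentative_end = (start + target).min(data.len());
        // Extend the chunk to the next newline so no line is split in half.
        let end = data[tentative_end..]
            .iter()
            .position(|&b| b == b'\n')
            .map(|i| tentative_end + i + 1)
            .unwrap_or(data.len());
        out.push(&data[start..end]);
        start = end;
    }
    out
}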
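For reference, the two-stage split from points 5 and 6 amounts to a producer/consumer pipeline: one side parses lines into transactions while the other applies them to the per-client account map. The sketch below uses std threads and a channel for brevity; the real implementation runs on smol with a different hashmap, and the Account fields here are placeholders:

use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

type ClientId = u16; // placeholder width

#[derive(Default)]
struct Account {
    balance: f64,
    locked: bool,
}

enum Transaction {
    Deposit { client: ClientId, amount: f64 },
    // other variants elided
}

// Stand-in for the custom parser sketched earlier.
fn parse_line(_line: &[u8]) -> Option<Transaction> {
    None
}

fn run_pipeline(lines: Vec<Vec<u8>>) -> HashMap<ClientId, Account> {
    let (sender, receiver) = mpsc::channel::<Transaction>();

    // Stage 1: parsing runs on its own thread.
    let parser = thread::spawn(move || {
        for line in lines {
            if let Some(tx) = parse_line(&line) {
                let _ = sender.send(tx); // ignore errors if the consumer hung up
            }
        }
    });

    // Stage 2: processing is mostly hashmap lookups; locked accounts are the
    // cheap fast path described in point 6.
    let mut accounts: HashMap<ClientId, Account> = HashMap::new();
    for tx in receiver {
        match tx {
            Transaction::Deposit { client, amount } => {
                let account = accounts.entry(client).or_default();
                if !account.locked {
                    account.balance += amount;
                }
            }
        }
    }
    parser.join().unwrap();
    accounts
}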