/hack-evening-2024-4

the first rust munich meetup where we hack one evening with a friendly competition

Primary LanguageRustGNU General Public License v3.0GPL-3.0

Billion Row Challenge

This challenge is inspired by the original Java version, including the rules, dataset and task. Thank you Gunnar Morling for coming up with the original idea!

This repository contains a template and different solutions for the Billion Row Challenge we ran for the Rust Munich meetup in December 2024. We tried to keep deviations from the original challenge to a minimum for the sake of comparability, but since the event takes place in a single evening there are slight adaptations.

Table of Contents

Challenge

Task Description

original source

Your mission, should you choose to accept it, is to write a program that retrieves temperature measurement values from a text file and calculates the min, mean, and max temperature per weather station. There's just one caveat: the file has 1,000,000,000 rows! That's more than 10 GB of data! 😱

The text file has a simple structure with one measurement value per row:

Hamburg;12.0
Bulawayo;8.9
Palembang;38.8
Hamburg;34.2
St. John's;15.2
Cracow;12.6
... etc. ...

The program should print out the min, mean, and max values per station, alphabetically ordered. The format that is expected varies slightly from language to language, but the following example shows the expected output for the first three stations:

{
  Abha=5.0/18.0/27.4,
  Abidjan=15.7/26.0/34.1,
  AbΓ©chΓ©=12.1/29.4/35.6,
  Accra=14.7/26.4/33.1,
  Addis Ababa=2.1/16.0/24.3,
  Adelaide=4.1/17.3/29.7,
  <...>
}

Oh, and this input.txt is different for each submission since it's generated on-demand. So no hard-coding the results! πŸ˜‰

Rules and Limitations

original source

  1. You may use external libraries (like optimized data structures, parsers, etc.) with the exception of libraries that can solve the full challenge by themselves (e.g. polars, embedded DuckDB, etc.) This rule deviates from the original challenge due to the short duration of the event.
  2. Implementations must implemented in main.rs/utils.rs only. Try to keep it relatively short; don't copy-paste a forbidden library into your solution as a cheat.
  3. The computation must happen at application runtime; you cannot process the measurements file at build time.
  4. Input value ranges are as follows:
    • Station name: non null UTF-8 string of min length 1 character and max length 100 bytes (i.e. this could be 100 one-byte characters, or 50 two-byte characters, etc.)
    • Temperature value: non-null double between -99.9 (inclusive) and 99.9 (inclusive), always with exactly one fractional digit
  5. There is a maximum of 10,000 unique station names.
  6. Implementations must not rely on specifics of a given data set. Any valid station name as per the constraints above and any data distribution (number of measurements per station) must be supported.

Project Particularities

Getting Started

Note: running export GITHUB_HANDLE=<your-github-handle> should make these instructions easier.

Please run the following commands:

git clone git@github.com:rust-meetup-munich/hack-evening-2024-4.git
cd hack-evening-2024-4
git checkout -b solutions/${GITHUB_HANDLE}
cargo new solutions/${GITHUB_HANDLE}

The commands above will create a new git branch and initialize a new binary Rust project for your solution. It will have the following simple structure; we expect your solution to be implemented in the main.rs file.

solutions/your-handle/
β”œβ”€β”€ Cargo.lock
β”œβ”€β”€ Cargo.toml
β”œβ”€β”€ src
β”‚   └── main.rs

How To Submit

-> Check out the submission guide for more details.

Helpful Commands

Task Folder Command
generate 1B rows of data data-generator cargo run --release -- --rows 1000000000
run a debug build solutions/${GITHUB_HANDLE} cargo run -- weather_1M.csv
run a release build solutions/${GITHUB_HANDLE} cargo run --release -- weather_1B.csv
run tests solutions/${GITHUB_HANDLE} cargo test
run flamegraph solutions/${GITHUB_HANDLE} sudo cargo flamegraph --root -- weather_1B.csv

Tips & Tricks

  • This event is rather short, so we recommend going for a functionally correct, simple solution first and then iteratively improve parts.
  • When developing, use debug builds and smaller datasets (e.g. 10M rows) to avoid long computation times.
  • Keep your intermediate solutions by committing them, both for discussions and comparisons.
  • Use a profiler like flamegraph to find out the actual bottlenecks in your program.

Advanced Techniques

This section lists a few advanced techniques you may want to consider:

  • use preallocations and avoid additional ones
  • implement specialized float parsing
  • custom hash functions
  • concurrency
    • reading the file with multiple threads in parallel
    • concurrent hashmaps like dash
    • SIMD instructions for parsing
  • I/O techniques
    • buffered I/O
    • memory mapped files
    • asynchronous I/O with io_uring

Setup Issues?

In case you have setup issues and cannot program locally, we prepared a Rust playground template to give you a chance to still develop an algorithmic idea and dip your toe into Rust πŸ¦€. Proper I/O, large input data and extensive benchmarking are not possible, so a proper submission will be tricky.

Rust playground

Please check out our Discord channel #hack-evening-2024-4 for further help and assistance.

Additional Resources