A small demo app that performs literal string searches against large compressed files, using Elixir and the Flow library for parallel computation.
- Elixir's dynamic, functional nature makes it a joy to compose data processing pipelines.
- Elixir's out-of-the-box support for multicore machines, along with Flow's additional parallel processing features (e.g. backpressure support via GenStage), makes it well suited to safely processing large volumes of data.
- Elixir is built on top of the Erlang BEAM VM, a highly fault-tolerant, distributed computation environment that has been powering critical infrastructure such as telephony systems for over 30 years.
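As a rough illustration of the points above, here is a minimal sketch of a Flow pipeline that filters lines in parallel. It is not taken from the lfs source: the `FlowSketch` module, the `matching_lines/2` helper, the `max_demand` value, and the `"~> 1.2"` version requirement are all assumptions made for the example.

```elixir
# Assumption: Flow is fetched from Hex when the script starts;
# the "~> 1.2" requirement is illustrative, not the version lfs pins.
Mix.install([{:flow, "~> 1.2"}])

defmodule FlowSketch do
  # Hypothetical helper, not part of lfs.
  # Returns the lines that contain `term`, ignoring case.
  def matching_lines(lines, term) do
    needle = String.downcase(term)

    lines
    # Each stage requests at most 50 lines at a time, so a slow consumer
    # limits how much the producer reads (GenStage backpressure).
    |> Flow.from_enumerable(max_demand: 50)
    # The filter runs concurrently across Flow's stages.
    |> Flow.filter(fn line -> String.contains?(String.downcase(line), needle) end)
    |> Enum.to_list()
    # Stages emit results in no fixed order, so sort for stable output.
    |> Enum.sort()
  end
end

lines = ["Mexico City", "Oslo", "New Mexico", "Lima"]
IO.inspect(FlowSketch.matching_lines(lines, "mexico"))
```

Saved as a script, this should run with `elixir flow_sketch.exs` on a machine with network access to Hex. By default Flow starts one stage per scheduler, which is typically one per core.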
git clone https://github.com/chadfennell/lfs.git
cd lfs
mix deps.get
mix compile
Locate a large compressed file and place it in the data directory. For example, download a portion of the Digital Public Library of America corpus:
aws s3 cp s3://dpla-provider-export/2020/04/all.jsonl/part-00001.gz data/part-00001.gz
Run the search:
mix lfs.search part-00001.gz "Mexican American|Mexico|El Salvador|Hispanic|Chicano|All Souls"
- The first argument is the name of the file in the data directory to search.
- The second argument is a pipe-delimited (|) set of search terms (searches are literal string matches and case insensitive).
- Matches are added to the data/matches.json file.
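The argument handling and matching rules described above can be sketched with the Elixir standard library alone. This is an assumption-laden illustration rather than the lfs implementation: the `SearchSketch` module and its `parse_terms/1`, `matches?/2`, and `search/2` functions are hypothetical names, and the sketch runs sequentially instead of through Flow.

```elixir
defmodule SearchSketch do
  # Hypothetical helpers; none of these names come from the lfs source.

  # Split "a|b|c" into lowercase terms, dropping empty entries.
  def parse_terms(arg) do
    arg
    |> String.split("|", trim: true)
    |> Enum.map(&String.downcase/1)
  end

  # True when the line contains any of the terms as a literal substring,
  # ignoring case. `terms` must already be lowercase.
  def matches?(line, terms) do
    String.contains?(String.downcase(line), terms)
  end

  # Lazily stream lines from a gzip-compressed file and keep the matches.
  # The :compressed mode lets File.stream!/2 read gzip data directly.
  def search(path, terms) do
    path
    |> File.stream!([:compressed])
    |> Stream.filter(&matches?(&1, terms))
  end
end

terms = SearchSketch.parse_terms("Mexican American|Mexico|El Salvador")
IO.inspect(terms)
IO.inspect(SearchSketch.matches?("A photograph from MEXICO, 1910", terms))
```

Because `search/2` returns a lazy stream, the file is read line by line as results are consumed, and the decompressed contents are never held in memory all at once.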
Get a running count of records added to the matches.json file:
watch "cat data/matches.json | wc -l"
Open a new terminal and run htop to see Elixir taking advantage of multiple cores.