Large File Searcher

A small demo app that performs literal string searches against large compressed files, using Elixir and the Flow library for parallel computation.

Why Elixir and Flow For This Sort of Task?

Elixir's dynamic, functional nature makes it a joy to compose data processing pipelines.
Elixir's out-of-the-box support for multicore machines, along with Flow's additional parallel processing features (e.g. backpressure support via GenStage), makes it well suited to safely processing large volumes of data.
Elixir is built on top of the Erlang BEAM VM, a highly fault-tolerant, distributed computation environment that has been powering critical infrastructure such as telephony systems for over 30 years.
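
As a rough illustration of the pipeline style described above, here is a minimal sketch of a parallel filter built with Flow. It is not the code from this repo: the FlowSketch module, its search/2 function, the max_demand value, and the Flow version requirement are all assumptions made for the example. Flow is pulled in with Mix.install so the snippet can run as a standalone script.

```elixir
# Assumption: Flow is fetched from Hex when the script starts;
# the version requirement is illustrative.
Mix.install([{:flow, "~> 1.2"}])

defmodule FlowSketch do
  # Hypothetical module, not part of lfs.
  # Returns the lines that contain any of the lowercase `terms`.
  def search(lines, terms) do
    lines
    # Stages pull work from the source on demand; max_demand caps how many
    # items a stage asks for at once, which is how backpressure shows up here.
    |> Flow.from_enumerable(max_demand: 100)
    # The filter runs concurrently across Flow's stages.
    |> Flow.filter(fn line ->
      String.contains?(String.downcase(line), terms)
    end)
    # Flow does not guarantee output order across stages.
    |> Enum.to_list()
  end
end
```

Because the source can be any enumerable, the same function could be handed a lazy file stream instead of an in-memory list, so the whole file never needs to be loaded at once.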

Requirements

Elixir

Installation

git clone https://github.com/chadfennell/lfs.git
cd lfs
mix deps.get
mix compile

Demo

Locate a large compressed file and place it in the data directory. For example, download a portion of the Digital Public Library of America corpus:

aws s3 cp s3://dpla-provider-export/2020/04/all.jsonl/part-00001.gz data/part-00001.gz

Run the search:

mix lfs.search part-00001.gz "Mexican American|Mexico|El Salvador|Hispanic|Chicano|All Souls"

The first argument is the name of the file in the data directory to be searched.
The second argument is a pipe-delimited (|) set of search terms. Searches are literal string matches and are case insensitive.
Matches are added to the data/matches.json file.
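
The behavior described above can be sketched in plain Elixir, without Flow, to show what "literal, case-insensitive, pipe-delimited" means in practice. This is an illustration under stated assumptions rather than the task's actual implementation: the LfsSketch module and its function names are hypothetical, and the sketch assumes one record per line in the compressed file.

```elixir
defmodule LfsSketch do
  # Hypothetical helpers; names are not taken from the lfs source.

  # Split "a|b|c" into lowercase literal terms, dropping empty entries.
  def parse_terms(raw) do
    raw
    |> String.split("|", trim: true)
    |> Enum.map(&String.downcase/1)
  end

  # True when the line contains any of the terms, ignoring case.
  # Terms are compared as plain substrings, not as regular expressions.
  def matches?(line, terms) do
    String.contains?(String.downcase(line), terms)
  end

  # Lazily read a gzip-compressed file line by line and append every
  # matching line to out_path.
  def search(gz_path, out_path, raw_terms) do
    terms = parse_terms(raw_terms)

    gz_path
    |> File.stream!([:compressed])
    |> Stream.filter(&matches?(&1, terms))
    |> Stream.into(File.stream!(out_path, [:append]))
    |> Stream.run()
  end
end
```

With this sketch, a call such as LfsSketch.search("data/part-00001.gz", "data/matches.json", "Mexico|El Salvador") would append each line containing either term, in any letter case, to the output file.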

Get a running count of records added to the matches.json file:

watch "cat data/matches.json | wc -l"

Open a new terminal and run htop to see Elixir taking advantage of multiple cores.