mostlygeek/go-csv-gz-test

How can can we filter .csv.gz files?

Go

This was a fun little team project to see how we can filter S3 inventory .csv.gz files fastest!

Other implementations:

Usage

# get some working data, downloads 1GB from S3 into testdata/ subdirectory
> ./download.sh


# Processing using a one file at a time
> go run ./filter.go


# Processing in parallel (workers = num cpus)
> GOPAR=1 go run ./filter.go

My results (on my late 2017 13" MBP)

Strategy: One file at a time ...
Total: 31521045, Matched: 710093, Ratio: 2.25%
Time: 52.740166887s

Strategy: Parallel, 4 Workers ...
Total: 31521045, Matched: 710093, Ratio: 2.25%
Time: 27.207802611s