Phenix Challenge

This is my take on Carrefour's Phenix Challenge, just for fun!

Here is an overview of the algorithm implemented.

And here is a description of the challenge itself.

Test

sbt test

Run With SBT

sbt "run data results 2017-05-14"

Package

sbt dist

The zipped package is located under target/universal/phenix-challenge-<VERSION>.zip

Once unzipped, you can run the program with a limited heap size like this:

phenix-challenge data results 2017-05-14 -J-Xmx512m

Load Test

You can generate a large amount of data with the script under generator. Change the hardcoded values in generator/generator.scala if needed, then run:

sbt run

And voilà: gigabytes of data are generated under generator/data/.

Performance

On my computer, it takes 4685 seconds (about 1h18m) to process 129GB of data:

  • CPU: Intel(R) Core(TM) i7-6560U CPU @ 2.20GHz
  • RAM: capped at 512MB
  • stores = 3000
  • transactions per day = 1 million
  • references = 500,000
  • days = 7
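The figures above imply a sustained throughput of roughly 28 MB/s. A quick back-of-the-envelope check:

```scala
// Throughput implied by the load test above: 129 GB processed in 4685 seconds.
object Throughput {
  def main(args: Array[String]): Unit = {
    val megabytes = 129.0 * 1024 // 129 GB expressed in MB
    val seconds   = 4685.0
    println(f"${megabytes / seconds}%.1f MB/s") // ≈ 28.2 MB/s
  }
}
```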

Possible Improvements

  • There are probably a lot of untested corner cases to iron out.
  • Parallelize computations (if disk access is not the bottleneck).
  • The current implementation uses more temporary files than necessary when merging each day's aggregate.
  • Some intermediary results could be memoized (see combineByProduct).
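On the memoization point, one possible shape is a small generic memoizer wrapped around the expensive function. This is only a sketch: the actual signature of combineByProduct is not shown here, so the key/value types and the demo function below are assumptions, not the project's real code.

```scala
import scala.collection.concurrent.TrieMap

object Memo {
  // Wraps any pure function with a thread-safe cache, so repeated calls
  // with the same key (e.g. the same day's aggregate) are computed once.
  def memoize[K, V](f: K => V): K => V = {
    val cache = TrieMap.empty[K, V]
    key => cache.getOrElseUpdate(key, f(key))
  }
}

object MemoDemo {
  def main(args: Array[String]): Unit = {
    var calls = 0
    // Stand-in for an expensive per-day computation (hypothetical).
    val square = Memo.memoize { (n: Int) => calls += 1; n * n }
    println(square(12)) // computed: 144
    println(square(12)) // served from cache: 144, calls is still 1
    println(calls)      // 1
  }
}
```

A per-key cache like this trades memory for recomputation, so with the 512MB heap cap above it would have to be bounded or spilled to disk.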