This is my take on Carrefour's Phenix Challenge, just for fun!
Here is an overview of the algorithm implemented.
And here is a description of the challenge itself.
Run the tests:

```
sbt test
```
sbt "run data results 2017-05-14"
Build a standalone package:

```
sbt dist
```
The zipped package is located under `target/universal/phenix-challenge-<VERSION>.zip`.
Once unzipped, you can run the program with a limited heap size like this:
```
phenix-challenge data results 2017-05-14 -J-Xmx512m
```
You can generate a large amount of data with the script under `generator`. Change the hardcoded values in `generator/generator.scala` if needed, then run:
```
sbt run
```
And voilà: gigabytes of data are generated under `generator/data/`.
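For reference, here is a minimal sketch of what such a generator could look like. The object name, output file names and pipe-separated record format below are assumptions for illustration, not the actual contents of `generator/generator.scala`; the hardcoded parameters mirror the benchmark setup described below.

```scala
import java.io.PrintWriter
import java.nio.file.{Files, Paths}
import scala.util.Random

object GeneratorSketch extends App {
  // Hardcoded parameters to tweak before running `sbt run`.
  val stores             = 3000
  val transactionsPerDay = 1000000
  val references         = 500000
  val days               = 7

  Files.createDirectories(Paths.get("generator/data"))
  for (day <- 1 to days) {
    // One pipe-separated transaction file per day (illustrative format only).
    val out = new PrintWriter(s"generator/data/transactions_day$day.data")
    try {
      for (txId <- 1 to transactionsPerDay) {
        val store   = Random.nextInt(stores) + 1
        val product = Random.nextInt(references) + 1
        val qty     = Random.nextInt(10) + 1
        out.println(s"$txId|$store|$product|$qty")
      }
    } finally out.close()
  }
}
```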
On my computer, it takes 4685 seconds (about 1h18m) to process 129GB of data:
- CPU: Intel(R) Core(TM) i7-6560U CPU @ 2.20GHz
- RAM: capped at 512MB
- stores = 3000
- transactions per day = 1 million
- references = 500 000
- days = 7
- There are probably a lot of untested corner cases to iron out.
- Parallelize computations (if disk access is not the bottleneck); see the first sketch after this list.
- The current implementation uses more temporary files than necessary when merging each day's aggregate.
- Some intermediary results could be memoized (see `combineByProduct`); a hedged sketch follows this list.
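For the parallelization point, here is a minimal sketch using Scala Futures, assuming the per-store work is independent; `processStore` and the store ids are hypothetical names, not this project's actual API.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object ParallelSketch extends App {
  // Hypothetical unit of work: read, aggregate and write one store's data.
  def processStore(storeId: String): Unit = ()

  val storeIds: Seq[String] = (1 to 3000).map(i => f"store-$i%04d")

  // Run the independent per-store computations concurrently and wait for all of them.
  val work = Future.traverse(storeIds)(id => Future(processStore(id)))
  Await.result(work, Duration.Inf)
}
```

Whether this helps in practice depends on the point raised in the list: if disk access dominates, adding threads will not speed things up.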
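For the memoization point, here is a sketch of caching an aggregate computation. The signature of `combineByProduct` and the cache key used here are assumptions for illustration, not the function as it exists in this project's sources.

```scala
import scala.collection.concurrent.TrieMap

object MemoSketch {
  // Cache keyed by whatever identifies one aggregate, e.g. (store id, day).
  private val cache = TrieMap.empty[(String, String), Map[String, Long]]

  // Hypothetical signature: sum quantities per product for one store and day,
  // recomputing only on a cache miss.
  def combineByProduct(storeId: String, day: String,
                       load: () => Seq[(String, Long)]): Map[String, Long] =
    cache.getOrElseUpdate(
      (storeId, day),
      load().groupBy(_._1).map { case (product, lines) => product -> lines.map(_._2).sum }
    )
}
```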