You will need `docker` at a minimum.
This project was tested with:
Docker version 20.10.12, build 20.10.12-0ubuntu2~18.04.1
To run the app natively on your machine, install:
- `node` v19
- `yarn`
First, copy the example env file: `cp example.env .env`
Then change the postgres password if you wish.
Then bring everything up with `docker compose up`. This should automatically start the `postgres` database, run the migrations, and then start the app.
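For orientation, the compose setup can be pictured roughly like the sketch below; aside from the `postgres` service name, the image, volume, and service details are assumptions and not the project's actual `docker-compose.yml`:

```
# Illustrative sketch only -- not the project's actual docker-compose.yml.
services:
  postgres:
    image: postgres:14
    env_file: .env
    volumes:
      - pgdata:/var/lib/postgresql/data
  app:
    build: .
    env_file: .env
    depends_on:
      - postgres

volumes:
  pgdata:
```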
By default, the app will ingest stats from any games that have already taken place on the current day, then monitor for live games and ingest stats in real time indefinitely until it is manually terminated.
To ingest games from past dates or seasons, use one of the supported environment variable combinations to configure the dates to ingest:
- `DATE=yyyy-mm-dd` - to ingest game stats from a particular day in the past
- `START_DATE=yyyy-mm-dd END_DATE=yyyy-mm-dd` - to ingest game stats from a range of days in the past
- `SEASON=yyyyyyyy` (ex: `20222023`) - to ingest all game stats (for final games) for a given season
When running the app via `docker`, these environment variables can be set in the `.env` file.
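For example, to ingest an entire past season when running via docker, the `.env` file could include (the values below are illustrative):

```
# Ingest all game stats for a past season
SEASON=20222023

# Or ingest a range of past dates instead
# START_DATE=2023-01-01
# END_DATE=2023-01-31
```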
After stats for all games from the specified time period are ingested, the app will exit.
If you wish to run the app natively, first install the dependencies with `yarn`.
Use docker to run just `postgres`:

`docker compose up -d postgres`
Then run the db migrations if you haven't already (see `package.json` for the command).
Change the `POSTGRES_HOST` value in `.env` to `localhost`.
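In other words, `.env` should contain:

```
POSTGRES_HOST=localhost
```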
Then start the app with `yarn start`.
Configure the date environment variables on the command line like:

`DATE=2022-03-06 yarn start`
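The other supported variable combinations work the same way on the command line (the dates below are illustrative):

```
START_DATE=2023-01-01 END_DATE=2023-01-31 yarn start
SEASON=20222023 yarn start
```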
I added a sampling of tests for demonstration purposes, but did not take the time to write full unit tests for all functionality.
In addition, I didn't take the time to set up mocks for database calls, so the `postgres` container will need to be running before executing tests.
Also, `POSTGRES_HOST` should be set to `localhost` when running tests natively.
Run the tests with `yarn test`.
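If `POSTGRES_HOST` hasn't already been changed in `.env`, it can presumably be supplied inline, in the same way as the date variables above:

```
POSTGRES_HOST=localhost yarn test
```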
The data ingest/pipeline app consists of two primary components:
- The coordination logic in `coordinator.js`
- The game-specific ingest logic in `lib/ingest.js`
I opted to run everything in a single OS process and relied heavily on async logic using promises for pseudo-parallelism. Since the app is very IO-heavy, I didn't see enough value in the extra complexity of running separate threads or processes. The app should not be limited by raw parallel compute capacity (which is what separate threads or processes would improve); instead, it spends most of its time waiting on IO -- in the form of db queries and http requests -- which node's async capabilities are well suited for.
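As a rough illustration of the promise-based pseudo-parallelism (a minimal sketch only; the function names below are hypothetical and are not the actual exports of `coordinator.js` or `lib/ingest.js`):

```
// Hypothetical sketch of single-process, promise-based pseudo-parallelism.

// Stand-in for the per-game ingest logic (http requests + db writes).
async function ingestGameStats(gameId) {
  // Simulated IO wait; in the real app this would be http calls to the
  // stats API and queries against postgres.
  await new Promise((resolve) => setTimeout(resolve, 100));
  return gameId;
}

// All game handlers run "in parallel" on a single thread: each one
// yields back to the event loop whenever it is waiting on IO.
async function ingestAll(gameIds) {
  return Promise.all(gameIds.map((id) => ingestGameStats(id)));
}

ingestAll([2022020001, 2022020002, 2022020003]).then(console.log);
```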
I did run into issues with high IO concurrency when ingesting game stats for entire seasons. Too many separate game handlers would start trying to make http calls and db queries at the same time, leading to a lot of http and db connection timeouts. To mitigate this issue, I added a batch size to limit the number of games for which stats are ingested in parallel.
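The batching idea can be pictured roughly like this (again a sketch; `BATCH_SIZE` and `ingestInBatches` are illustrative names that reuse the hypothetical `ingestGameStats` from the snippet above, not the app's actual identifiers):

```
// Hypothetical sketch of batched ingestion to cap IO concurrency.
const BATCH_SIZE = 10;

async function ingestInBatches(gameIds) {
  for (let i = 0; i < gameIds.length; i += BATCH_SIZE) {
    const batch = gameIds.slice(i, i + BATCH_SIZE);
    // Only BATCH_SIZE game handlers make http calls and db queries at
    // once, keeping connection counts (and timeouts) under control.
    await Promise.all(batch.map((id) => ingestGameStats(id)));
  }
}
```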
The `postgres` database can be queried from any standard database client, or from the `psql` CLI tool inside the container, which you can access (with the `postgres` container running) with:

`docker exec -it nhl-stats_postgres_1 psql -U postgres`
Note: Database data is persisted across container restarts inside of the configured docker volume.
The database schema can be found in `db/schema.prisma`, which should be straightforward to interpret.
Here's an example query for team stats for a particular game:
`SELECT * FROM team_games WHERE game_id=<id>`
Or for player stats from a particular game:
`SELECT * FROM player_games WHERE game_id=<id>`