Data mining project

This project is for a data mining class and analyzes the NBA dataset.

The order in which to run the scripts

To generate the figures and data, run the scripts in this order:

  • notebooks/Preprocessing.py
  • notebooks/Datacubing.py
  • notebooks/AnalysisBasicPlots.py
  • notebooks/AnalysisTDWeights.py
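The run order above could also be scripted. The following is only a sketch, not part of the project: the helper name run_all is hypothetical, and it assumes each script is started from inside the notebooks folder, as in the commands further down.

```python
import subprocess
import sys
from pathlib import Path

# Script order as listed above; names are relative to the notebooks folder.
SCRIPTS = [
    "Preprocessing.py",
    "Datacubing.py",
    "AnalysisBasicPlots.py",
    "AnalysisTDWeights.py",
]


def run_all(notebooks_dir="notebooks", scripts=SCRIPTS):
    """Run each script in order and stop at the first one that fails.

    Hypothetical helper: assumes the scripts expect notebooks/ as the
    working directory.
    """
    notebooks = Path(notebooks_dir)
    for name in scripts:
        print(f"Running {name} ...")
        # check=True raises CalledProcessError if the script exits non-zero,
        # so later scripts never run on top of stale or partial output.
        subprocess.run([sys.executable, name], cwd=notebooks, check=True)
```

Calling run_all() from the project root would then run all four scripts with the same interpreter that runs the helper.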

Then you can start building the reports. Some of the reports depend on figures and tables that are generated by the Python scripts. Please do not commit generated figures or tables to GitHub. The scripts are built so that, if we discover a mistake in the data or figures, rerunning everything will automatically update the report.

Setup

Go to https://www.kaggle.com/wyattowalsh/basketball and download the basketball dataset. It should be a file named archive.zip. If you extract it, there should be a folder called archive. Copy this folder into the root of this project to set it up.

archive/basketball.sqlite
archive/daily_execution_pipeline.yml
...

Having the archive in the right place is essential so that everybody uses the same file path. Do not commit the database to GitHub, as the file is too large (>50MB) to be uploaded.
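Since every script relies on the same path, a small check can fail early with a readable message when the archive is missing. This is a sketch only: the function name find_database is hypothetical, and it assumes it is called with the project root as its argument.

```python
from pathlib import Path


def find_database(project_root="."):
    """Return the path to archive/basketball.sqlite below project_root.

    Hypothetical helper: raises FileNotFoundError if the dataset has not
    been extracted into the project root yet.
    """
    db_path = Path(project_root) / "archive" / "basketball.sqlite"
    if not db_path.is_file():
        raise FileNotFoundError(
            f"{db_path} not found; extract archive.zip and copy the "
            "archive folder into the root of this project"
        )
    return db_path
```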

Required software

Make sure you have Python >3.5 and numpy, matplotlib, ipykernel, and pandas installed:

pip install -r requirements.txt
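To confirm that the packages named above can be imported in the current environment, something like the following could be used. It is a sketch: the helper missing_packages is hypothetical, and it only checks that each package can be found, not which version is installed.

```python
from importlib.util import find_spec

# Package names taken from the list above.
REQUIRED = ["numpy", "matplotlib", "ipykernel", "pandas"]


def missing_packages(names=REQUIRED):
    """Return the names from the list that cannot be found for import."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in names if find_spec(name) is None]
```

An empty list from missing_packages() means all four packages were found.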

Preprocessing

The preprocessing pipeline is in notebooks/Preprocessing.py. Running it with the VS Code Python extension or as a normal Python script creates notebooks/reducedDataset.sqlite.

cd notebooks
python3 Preprocessing.py
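After the script has run, the resulting database can be inspected from Python. The sketch below lists the tables in a SQLite file using only the standard library. The helper name list_tables is hypothetical, and nothing is assumed here about which tables Preprocessing.py actually writes.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return the table names in a SQLite file, sorted alphabetically."""
    # Open read-only so that a typo in the path raises an error instead of
    # silently creating a new, empty database file.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

For example, list_tables("reducedDataset.sqlite") can be called from inside the notebooks folder.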

Data cubing and figure generation

The project report depends on figures that are generated by the Datacubing notebook. Do not commit the figures to GitHub, as they are updated automatically in the project report if we decide to change anything in the Datacubing script. Run

cd notebooks
python3 Datacubing.py

to generate the figures. It also generates four files, cubeGames.csv, cubePlayers.csv, biometricCube.csv, and biometricCubeRaw.csv, which are used in subsequent analysis notebooks.
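Because the later analysis notebooks need these CSV files, it can help to check that they exist before running them. This is a sketch: the helper missing_cube_files is hypothetical, and the default output folder (notebooks) is an assumption, since the location of the files is not stated here.

```python
from pathlib import Path

# File names as listed above.
CUBE_FILES = [
    "cubeGames.csv",
    "cubePlayers.csv",
    "biometricCube.csv",
    "biometricCubeRaw.csv",
]


def missing_cube_files(output_dir="notebooks"):
    """Return the cube files that are not present in output_dir.

    Assumption: Datacubing.py writes its CSV files into output_dir.
    """
    out = Path(output_dir)
    return [name for name in CUBE_FILES if not (out / name).is_file()]
```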

Building Project reports manually

Python code highlighting requires the LaTeX package minted, which depends on Pygments. If Pygments is not already installed, it can be installed using

pip install Pygments

The PDF file can then be compiled using

cd reports/1
pdflatex -shell-escape Group3_project1.tex

If that does not work, the report can be compiled on Overleaf, which has Pygments pre-installed.

Building Project reports with Docker

If you have Docker installed, run

./dockerBuild.sh
./dockerTex.sh

to compile the report. Building the Docker image can easily take an hour because of the texlive-full installation.