MH6151-project

NTU-SPMS-MH6151 Data Mining project

1. Set-up

  • Install requirements with the following command:
pip install -r requirements.txt

2. Run the modelling files

  • Run the modelling script in the form python modelling.py --model_name <model_name> --output_file <path>, saving the output under ./outputs. For example, to run the random forest classifier and save its output, execute the following command:
python modelling.py --model_name random_forest --output_file outputs/random_forest.txt
  • To add an oversampling step for the training data, append the --oversampling option to the command:
python modelling.py --model_name random_forest --output_file outputs/random_forest.txt --oversampling
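
For readers unfamiliar with the technique, the following is a minimal sketch of random oversampling, the approach named in the first reference. It is not taken from modelling.py, whose implementation is not shown here; the function name random_oversample and the toy data are hypothetical, and only the standard library is used.

```python
import random
from collections import Counter


def random_oversample(features, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class.

    Hypothetical helper for illustration; not the repository's code.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    out_features = list(features)
    out_labels = list(labels)
    for cls, count in counts.items():
        # Row indices belonging to this class in the original data.
        indices = [i for i, label in enumerate(labels) if label == cls]
        # Sample with replacement; k is 0 for the majority class.
        for i in rng.choices(indices, k=target - count):
            out_features.append(features[i])
            out_labels.append(labels[i])
    return out_features, out_labels


# Toy training set: four rows of class 0 and one row of class 1.
X_train = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y_train = [0, 0, 0, 0, 1]

X_res, y_res = random_oversample(X_train, y_train)
print(Counter(y_res))
```

Oversampling is applied to the training data only, as stated above; resampling the evaluation data would distort the reported metrics.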

3. Mass-run the modelling

3.1. For Linux

scripts/modelling.sh && scripts/modelling_oversampling.sh

3.2. For Windows

.\scripts\modelling.bat
.\scripts\modelling_oversampling.bat
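
As a platform-independent alternative, the same kind of batch run could be sketched in Python. This is an illustration only and does not reproduce the contents of the scripts above: the helper build_commands, the model name adaboost, and the _oversampling suffix on the output file are assumptions. Only the flags documented in section 2 are used.

```python
import sys


def build_commands(model_names, oversampling=False):
    """Build one modelling.py command per model name.

    Hypothetical helper; the output file naming is an assumption.
    """
    commands = []
    for name in model_names:
        suffix = "_oversampling" if oversampling else ""
        cmd = [
            sys.executable, "modelling.py",
            "--model_name", name,
            "--output_file", f"outputs/{name}{suffix}.txt",
        ]
        if oversampling:
            cmd.append("--oversampling")
        commands.append(cmd)
    return commands


# "random_forest" comes from the example in section 2;
# "adaboost" is an assumed name used only for illustration.
models = ["random_forest", "adaboost"]
plain = build_commands(models)
oversampled = build_commands(models, oversampling=True)

for cmd in plain + oversampled:
    print(" ".join(cmd))
```

To execute the commands rather than print them, each list can be passed to subprocess.run(cmd, check=True) from the repository root.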

4. Get the final performance metrics

python modelling_insights.py > outputs/performance.txt

References

  • Random oversampling and undersampling for imbalanced classification: Link.
  • AdaBoost Algorithm: Understand, Implement and Master AdaBoost: Link.
  • AdaBoost clearly explained (Josh Starmer): Link (YouTube).