PLAsTiCC Astronomical Classification 3rd-place solution

Overview of solution

https://www.kaggle.com/c/PLAsTiCC-2018/discussion/75131

environment

I used pyenv virtualenv to set up the environment. I think catboost==0.10.4.1 is the important pin; the versions of the other libraries should not affect the score much.

$ pyenv install 3.5.1
$ pyenv virtualenv 3.5.1 plasticc
$ pyenv activate plasticc
$ pip install --upgrade pip
$ pip install cython==0.27.3
$ pip install numpy==1.13.0
$ pip install PyYAML==3.12
$ pip install -r requirements.txt

I used an n1-standard-64 instance on Google Compute Engine, which has 64 vCPUs and 240GB of RAM.
OS/Platform : Ubuntu 16.04
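
To confirm that the pinned versions took effect, here is a quick sanity check (not one of the repository's scripts):

import catboost
import numpy

print("catboost:", catboost.__version__)   # the solution pins 0.10.4.1
print("numpy:", numpy.__version__)         # pinned to 1.13.0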

datasets & result files

I will upload prepare.zip for the host, which contains these directories.

  • buckets:
    It contains nyanp's train & test features.

  • data:
    It contains the Kaggle datasets. You can also download them via

kaggle competitions download -c PLAsTiCC-2018
  • features:
    It contains all of my train & test features.

  • fi:
    It contains the feature names and the numbers of boosting rounds used for training (see the loading sketch after this list).

    • exp_*.npy
      numpy array that contains feature names.
    • exp_*rounds.pkl
      pickle object that contains the number of rounds.
    • whole_fn_s.npy
      numpy array that contains all feature names.
    • mamas_feature_names_*.npy
      the names of features that yuval used.
  • models:
    It contains trained models.

    • exp*.cbm
      trained catboost model.
  • others:
    It contains class weights.

    • W.npy
      numpy array that contains class weights.
  • sub:
    It contains submission files.

    • experiment57_59(th985)_61_62.csv
      nyanp's averaged submission file.
    • pred*.csv
      yuval's submission file.
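
The fi/, models/, and others/ artifacts are plain numpy, pickle, and catboost files. Here is a minimal loading sketch; the concrete file names (exp_01.npy, exp_01rounds.pkl, exp01.cbm) are hypothetical instances of the exp_* patterns above, and the paths assume you run from scripts/:

import pickle

import numpy as np
from catboost import CatBoostClassifier

feature_names = np.load("../fi/exp_01.npy")    # feature names for one experiment
with open("../fi/exp_01rounds.pkl", "rb") as f:
    n_rounds = pickle.load(f)                  # number of boosting rounds

model = CatBoostClassifier()
model.load_model("../models/exp01.cbm")        # a trained catboost model

class_weights = np.load("../others/W.npy")     # class weights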

scripts

  • utils.py :
    It contains utility functions.
  • preprocess_*.py :
    I did simple preprocessing here, like converting .csv files into .feather files (see the sketch after this list).
  • save_features_train_*.py :
    I saved train features here.
  • save_features_test_*.py :
    I saved test features here.
  • save_features_nyanp.py :
    I saved nyanp's train & test features here.
  • train.py :
    I trained models here.
  • predict.py :
    I made predictions here.
  • postprocess.py :
    I did postprocessing here, like ensembling and class-99 handling (a hedged class-99 sketch follows the full-version commands below).
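
For reference, the .csv-to-.feather conversion in preprocess_*.py amounts to something like this (a minimal sketch; the paths are assumptions, and the real scripts may also process the very large test_set.csv in chunks):

import pandas as pd

# Convert the competition CSVs to .feather for fast reloading.
for name in ["training_set", "training_set_metadata", "test_set_metadata"]:
    df = pd.read_csv("../data/{}.csv".format(name))
    df.to_feather("../data/{}.f".format(name))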

usage

  • full version :
    It would take a few months to run on a single machine (64 cores, 240GB RAM).
    I do not recommend running it.
cd mamas/
unzip prepare.zip
cp -r prepare/* .
rm features/*
rm models/*
cd ../scripts
python preprocess_01.py
python preprocess_02.py
python save_features_train_01.py
python save_features_train_02.py
python save_features_train_03.py
python save_features_train_04.py
python save_features_train_05.py
python save_features_train_06.py
python save_features_test_01.py
python save_features_test_02.py
python save_features_test_03.py
python save_features_test_04.py
python save_features_test_05.py
python save_features_test_06.py
python save_features_nyanp.py
python save_features_for_yuval.py
python train.py
python predict.py
python postprocess.py

Then, mamas/sub/host_sub.csv.gz will be generated.
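
postprocess.py also scores the unseen class 99. Its exact logic is not reproduced here; the sketch below shows one widely shared heuristic from the competition forums (class 99 proportional to the product of (1 - p) over the known classes, then renormalized per row), which may differ from what this solution actually does. "preds.csv" is a hypothetical input file.

import numpy as np
import pandas as pd

sub = pd.read_csv("preds.csv")
class_cols = [c for c in sub.columns
              if c.startswith("class_") and c != "class_99"]

probs = sub[class_cols].values
sub["class_99"] = np.prod(1.0 - probs, axis=1)    # unseen-class heuristic

all_cols = class_cols + ["class_99"]
sub[all_cols] = sub[all_cols].div(sub[all_cols].sum(axis=1), axis=0)
sub.to_csv("sub_with_class99.csv", index=False)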

  • short version :
    It takes about 4 hours, using the pre-extracted features and the trained models.
cd mamas/
unzip prepare.zip
cp -r prepare/* .
cd ../scripts
python preprocess_01.py
python preprocess_02.py
python predict.py
python postprocess.py

Then, mamas/sub/host_sub.csv.gz will be generated.
It should score 0.680 on the public LB and 0.700 on the private LB.

directory

  • preds/ :
    It contains prediction files.
  • curve/ :
    It contains linearly interpolated curve files made with yuval's method (see the sketch below).
  • fe_extract/ :
    It contains the feature extraction library.
  • notebook/ :
    It contains .ipynb files.
  • scripts/ :
    It contains scripts.
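
yuval's exact interpolation method is not documented in this README. As a generic illustration only, linearly resampling one object's light curve per passband could look like this (column names follow the competition's training_set.csv; the .feather path assumes preprocess_*.py has run):

import numpy as np
import pandas as pd

lc = pd.read_feather("../data/training_set.f")
obj = lc[lc["object_id"] == lc["object_id"].iloc[0]]   # one object

grid = np.linspace(obj["mjd"].min(), obj["mjd"].max(), 256)  # regular time grid
curves = {}
for band, g in obj.groupby("passband"):
    g = g.sort_values("mjd")
    curves[band] = np.interp(grid, g["mjd"].values, g["flux"].values)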