/MHG

Source code for the paper ``Maximum Homogeneity Grouping for High-Cardinality Categorical Variables in Binary Classification''

Primary language: Python. License: MIT.

Project Description

This repository contains the Python source code and data files for our paper ``Maximum Homogeneity Grouping for High-Cardinality Categorical Variables in Binary Classification''.

Files

dataset/                Datasets downloaded from various repositories
src/
    data/               Dataset preprocessing scripts
        util.py         Common functions
        *.py            One file per dataset: loads the data file, converts it into a pandas.DataFrame, sets column types, drops unneeded columns, and takes a sample (for large datasets)
    datasets.py         Loads datasets for experiments
    encoder.py          Maximum Homogeneity Encoder (MHE)
    preprocessor.py     Encodes categorical columns by MHE, target encoding, or one-hot encoding
    timer.py            Utility that runs another function in a new thread up to a specified time limit and kills the thread if the limit is exceeded
    exp2_mhe_vs_pca.py  Main program that runs all computational experiments comparing MHE with PCA + one-hot encoding
    result_analysis.py  Compiles experiment results and generates tables and figures
results/
    datasets            Results generated by datasets.py
    exp22_04_19         Experiment results generated by exp2_mhe_vs_pca.py
    exp2_mhe_vs_pca     Tables and figures generated by result_analysis.py
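
The time-limit behavior described for timer.py can be illustrated with a minimal sketch. This is not the repository's implementation: the names `run_with_time_limit` and `TimeLimitExceeded` are hypothetical, and the stopping mechanism (a CPython-specific call through `ctypes`) is an assumption about one way to interrupt a thread running Python code.

```python
import ctypes
import threading


class TimeLimitExceeded(Exception):
    """Raised inside the worker thread when its time budget runs out."""


def run_with_time_limit(func, seconds, *args, **kwargs):
    """Run func(*args, **kwargs) in a new thread for at most `seconds`.

    Returns (finished, value). `value` is the function's result when
    `finished` is True, otherwise None.
    """
    outcome = {}

    def worker():
        try:
            outcome["value"] = func(*args, **kwargs)
        except TimeLimitExceeded:
            pass  # stopped by the caller; leave no result

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(seconds)  # wait up to the time limit
    if thread.is_alive():
        # CPython-specific: request that an exception be raised in the worker.
        # It is delivered only while the worker executes Python bytecode.
        ctypes.pythonapi.PyThreadState_SetAsyncExc(
            ctypes.c_ulong(thread.ident), ctypes.py_object(TimeLimitExceeded)
        )
        thread.join(1.0)  # give the worker a moment to unwind
        return False, None
    return True, outcome.get("value")


if __name__ == "__main__":
    print(run_with_time_limit(sum, 5, [1, 2, 3]))
```

A worker blocked inside a long-running C call will not see the exception until it returns to Python code, so in this sketch the time limit is best-effort for such functions.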

Required Python packages

  1. category_encoders

    conda install -c conda-forge category_encoders

  2. colorama

    conda install -c anaconda colorama
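
After installing, a quick check can confirm that both packages are importable. The snippet below is a small convenience sketch and not part of the repository; the helper name `missing_packages` is hypothetical.

```python
from importlib.util import find_spec

# Import names of the two packages listed above.
REQUIRED = ["category_encoders", "colorama"]


def missing_packages(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```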

Run experiments

  1. Analyze the datasets and generate variable importances. In the src/ directory, run (about 1 hour):

    python datasets.py

    The generated results are stored in results/datasets/**

  2. Run the experiments in the src/ directory (about 1 month):

    python exp2_mhe_vs_pca.py

    Since there are eight datasets and six classifiers, the full experiment takes a very long time. We therefore split the experiments by dataset and classifier and distributed them over multiple computers. Before running, edit the script to comment out the datasets and algorithms you do not need. The experiment results are stored in results/exp22_04_19/**

  3. Run result_analysis.py in the src/ directory to generate the tables and figures in the paper (about 10 minutes):

    python result_analysis.py

    Generated files are stored in results/exp2_mhe_vs_pca/**
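
For step 2, one way to plan the split across computers is to enumerate the dataset and classifier combinations and assign them round-robin. This is only a planning sketch under stated assumptions: the dataset and classifier names are placeholders (only the counts, eight and six, come from the description above), and `jobs_for_machine` is a hypothetical helper that does not exist in exp2_mhe_vs_pca.py.

```python
from itertools import product

# Placeholder names: only the counts (eight datasets, six classifiers) are known here.
DATASETS = [f"dataset_{i}" for i in range(1, 9)]
CLASSIFIERS = [f"classifier_{j}" for j in range(1, 7)]


def jobs_for_machine(machine_index, n_machines):
    """Return the (dataset, classifier) pairs assigned to one machine, round-robin."""
    all_jobs = list(product(DATASETS, CLASSIFIERS))
    return all_jobs[machine_index::n_machines]


if __name__ == "__main__":
    # List the combinations that machine 0 of 4 would keep uncommented.
    for dataset, classifier in jobs_for_machine(machine_index=0, n_machines=4):
        print(dataset, classifier)
```

Each machine then keeps only its assigned combinations uncommented in the script before running it.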