This repository contains the code for a stacking framework designed for the Linking Writing Processes to Writing Quality Kaggle competition.
The folder structure of the framework is as follows:
bin
: This folder includes the binary file for DuckDB.
data
: This folder includes the competition data.
src
: This folder includes the framework's source code.
The framework consists of five sequential layers, each needing the previous layer's output as input.
The first layer includes a few Python scripts that primarily implement logging routines and define global settings for
the framework. The names of the scripts in this layer start with l0_, such as l0_settings.py.
The second layer consists of Python scripts that mainly implement data preprocessing, feature engineering, and simple
feature selection routines. The names of the scripts in this layer start with l1_, such as l1_filter_features.py.
The third layer includes the code for the base models used for stacking. Each Python script in this layer
implements one base model. The names of the scripts in this layer start with l2_, such as l2_base_model_xgboost.py.
The fourth layer includes the code for the meta-models (the stacking models). Each Python script in this layer
implements one meta-model. The names of the scripts in this layer start with l3_, such as l3_meta_model_xgboost.py.
The fifth layer includes the code for an ensemble model using simple blending. The Python script in this layer
implements blending by computing a weighted average of the predictions of the meta-models from the previous layer. The
names of the scripts in this layer start with l4_, such as l4_blending_meta_models.py.
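As a rough illustration of the blending step, the sketch below computes a weighted average over the predictions of several meta-models. It is not the code from l4_blending_meta_models.py: the function name blend_predictions, the sample predictions, and the weights are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
from typing import List, Sequence


def blend_predictions(
    predictions: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Return the weighted average of several models' predictions.

    predictions[i][j] is the prediction of model i for sample j.
    The weights are normalized, so they do not need to sum to 1.
    """
    if not predictions:
        raise ValueError("need at least one model")
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per model")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all models must predict the same samples")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return [
        sum(w * p[j] for w, p in zip(weights, predictions)) / total
        for j in range(n_samples)
    ]


# Hypothetical predictions from two meta-models for three essays.
meta_model_a = [3.0, 4.0, 2.0]
meta_model_b = [4.0, 4.0, 3.0]

# Hypothetical weights: the first model counts three times as much.
blended = blend_predictions([meta_model_a, meta_model_b], weights=[3.0, 1.0])
print(blended)  # [3.25, 4.0, 2.25]
```

In practice the weights would typically be tuned on validation data rather than fixed by hand.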
To install the dependencies, you need to have Poetry installed. You can install Poetry via pip using the following command:
pip install poetry
To set up the Poetry environment and install the dependencies, run the following command in a shell from the root folder of this repository after downloading it:
poetry install
After that, enter the Poetry environment by invoking Poetry's shell using the following command:
poetry shell
To run the entire framework as an end-to-end pipeline, execute driver_script.py.
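The internals of driver_script.py are not reproduced here. Purely to illustrate the layer ordering implied by the l0_ to l4_ naming convention, the following sketch sorts a list of script names by their numeric layer prefix and skips files without one. The helper order_by_layer and the regular expression are assumptions made for this example, not part of the framework.

```python
import re
from typing import Iterable, List

# Matches names such as "l2_base_model_xgboost.py" and captures the layer number.
LAYER_PATTERN = re.compile(r"^l(\d+)_.*\.py$")


def order_by_layer(script_names: Iterable[str]) -> List[str]:
    """Sort script names by layer number, then alphabetically within a layer.

    Names that do not follow the l<N>_ convention are ignored.
    """
    layered = []
    for name in script_names:
        match = LAYER_PATTERN.match(name)
        if match:
            layered.append((int(match.group(1)), name))
    return [name for _, name in sorted(layered)]


# Script names taken from the examples in this README, deliberately shuffled.
example = [
    "l4_blending_meta_models.py",
    "l2_base_model_xgboost.py",
    "driver_script.py",
    "l0_settings.py",
    "l3_meta_model_xgboost.py",
    "l1_filter_features.py",
]
print(order_by_layer(example))
```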
The framework generates the following two main output files after running successfully as a pipeline:
src/submission.csv
: This is the final submission file for the competition.
src/experiment_records.csv
: This file contains information about the performance of the models in the framework, including the performance of the base and meta/stacking models.
You can use DuckDB to work with the CSV files and see the performance of the models, which is recorded in
the experiment_records.csv file. See the picture below for an example.
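If you prefer to query the records from Python rather than from the DuckDB binary in bin, a minimal sketch could look like the one below. It assumes the duckdb Python package is installed separately (pip install duckdb), which this repository does not state. The helper names and the column name rmse are hypothetical; check the header of experiment_records.csv for the actual column names.

```python
def build_records_query(csv_path: str, order_by: str, limit: int = 10) -> str:
    """Build a DuckDB query that reads a CSV file and sorts it by one column."""
    # Escape quotes so the path and column name are safe inside the SQL text.
    safe_path = csv_path.replace("'", "''")
    safe_column = order_by.replace('"', '""')
    return (
        f"SELECT * FROM read_csv_auto('{safe_path}') "
        f'ORDER BY "{safe_column}" ASC LIMIT {int(limit)}'
    )


def show_records(csv_path: str, order_by: str, limit: int = 10) -> list:
    """Run the query in an in-memory DuckDB database and return the rows."""
    import duckdb  # imported lazily; requires the duckdb package

    connection = duckdb.connect()
    try:
        query = build_records_query(csv_path, order_by, limit)
        return connection.execute(query).fetchall()
    finally:
        connection.close()


# "rmse" is a hypothetical column name used only for illustration.
print(build_records_query("src/experiment_records.csv", "rmse", limit=5))
```

Calling show_records("src/experiment_records.csv", "rmse", limit=5) after a successful pipeline run would then return the five rows with the lowest value in that column, provided such a column exists.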
The data stored here (such as the processed CSV files) are licensed under a Creative Commons license. Please visit the competition's webpage for the licenses that apply to the original competition data.
The code in this repository is available under the Apache License.