This is the supervisor repo of the "Data Science Laboratory Course" at KIT in 2023. Students worked on two subtasks:
- a practice competition from the platform drivendata.org
- one of two research tasks: a) active learning for SAT solving b) meta-learning for encoder selection
The repo provides files for preparing the datasets, some basic exploration, course-internal splitting, scoring, and demo submissions for both subtasks.
Additionally, `Surveys/` contains exports of questionnaires (created with ILIAS v7) to evaluate the students' satisfaction (one survey at the start of the course, one after the first task, one after the second task).
We use Python 3.8. We recommend setting up a virtual environment to install the dependencies, e.g., with `virtualenv`:

```
python -m virtualenv -p <path/to/right/python/executable> <path/to/env/destination>
```
or with `conda`:

```
conda create --name ds-lab-2023 python=3.8
```
Next, activate the environment with either

- `conda activate ds-lab-2023` (conda),
- `source <path/to/env/destination>/bin/activate` (virtualenv, Linux), or
- `<path\to\env\destination>\Scripts\activate` (virtualenv, Windows).
Install the dependencies with

```
python -m pip install -r requirements.txt
```

If you make changes to the environment and you want to persist them, run

```
python -m pip freeze > requirements.txt
```

To make this environment available for notebooks, run

```
ipython kernel install --user --name=ds-lab-2023-kernel
```

To actually launch Jupyter Notebook, run

```
jupyter notebook
```
We solve the challenge "Richter's Predictor: Modeling Earthquake Damage". It is an imbalanced ordinal classification/regression problem (with three classes), scored by micro-averaged F1 score (= accuracy here). Most features are categorical, many of them binary.
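The equality of micro-averaged F1 and accuracy holds for any single-label multiclass prediction, as a quick scikit-learn check confirms:

```python
from sklearn.metrics import accuracy_score, f1_score

# For single-label multiclass prediction, every false positive for one class
# is a false negative for another, so micro-averaged F1 collapses to accuracy.
y_true = [1, 2, 3, 3, 2, 1]
y_pred = [1, 2, 2, 3, 2, 3]
assert f1_score(y_true, y_pred, average="micro") == accuracy_score(y_true, y_pred)
```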
To use our code, download the files `train_values.csv`, `test_values.csv`, and `train_labels.csv` from the competition platform (no need to store the 4th file, which demonstrates the submission format). Place the data files in the folder `data/` within the current folder.
The notebook `Exploration.ipynb` contains some basic exploration (mainly statistics) of the CSVs.
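For orientation, loading the three files with pandas could look like the following minimal sketch; `building_id` and `damage_grade` are the competition's ID and target columns:

```python
import pandas as pd

# Minimal loading sketch; "building_id" is the competition's ID column and
# "damage_grade" its (ordinal, three-valued) target column.
train_values = pd.read_csv("data/train_values.csv", index_col="building_id")
train_labels = pd.read_csv("data/train_labels.csv", index_col="building_id")
test_values = pd.read_csv("data/test_values.csv", index_col="building_id")
print(train_values.dtypes.value_counts())           # mostly int/object features
print(train_labels["damage_grade"].value_counts())  # three imbalanced classes
```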
The competition itself uses the micro-averaged F1 score, which corresponds to accuracy for single-label prediction. For course-internal scoring, we use Matthews Correlation Coefficient (MCC) instead.
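Unlike accuracy, MCC is centered at 0 for uninformative predictors, which makes it more telling on imbalanced data; a small scikit-learn example:

```python
from sklearn.metrics import accuracy_score, matthews_corrcoef

# A constant majority-class predictor looks decent in accuracy terms on
# imbalanced data, but MCC exposes it as uninformative (score 0).
y_true = [1, 1, 1, 1, 2, 3]
y_pred = [1, 1, 1, 1, 1, 1]
print(accuracy_score(y_true, y_pred))     # ~0.67
print(matthews_corrcoef(y_true, y_pred))  # 0.0
```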
- `split.py` creates one stratified holdout split (sketched below).
- `score.py` scores submissions for the holdout split.
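The split itself might be created roughly as in the following sketch; the 80/20 ratio and the seed are illustrative assumptions, not necessarily what `split.py` uses:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Sketch of a stratified holdout split; the 80/20 ratio and the seed are
# illustrative assumptions, not necessarily the values split.py uses.
labels = pd.read_csv("data/train_labels.csv", index_col="building_id")
train_ids, holdout_ids = train_test_split(
    labels.index, test_size=0.2, stratify=labels["damage_grade"], random_state=25
)
```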
We provide small demo scripts for creating submissions. The submission format is valid for the competition website as well as our course-internal scoring.
- `predict_majority.py` creates a baseline solution that constantly predicts the majority class.
- `predict_tree.py` uses a scikit-learn decision tree (you can easily switch the model). The only preprocessing is encoding of categorical features (a sketch follows below).
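A condensed version of the tree demo might look as follows; the choice of `OrdinalEncoder` is an assumption, as the actual script may encode categories differently:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Sketch of the tree demo: encode categorical (string) features, fit a tree.
# OrdinalEncoder is an assumption; the actual script may encode differently.
X = pd.read_csv("data/train_values.csv", index_col="building_id")
y = pd.read_csv("data/train_labels.csv", index_col="building_id")["damage_grade"]
cat_cols = X.select_dtypes(include="object").columns
X[cat_cols] = OrdinalEncoder().fit_transform(X[cat_cols])
model = DecisionTreeClassifier(random_state=25).fit(X, y)
```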
This task stems from the field of SAT solving. We use features of SAT instances from the Global Benchmark Database (GBD) and runtimes from the SAT Competition 2022 Anniversary Track.
We have two prediction targets:
- Is the instance satisfiable or not (column `result` in database `meta`)?
- Does a solver run into a timeout on an instance or not (based on the runtime data; see the sketch below)?
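For illustration, the binary timeout target could be derived from the runtime data as in the following sketch; the file and column names are hypothetical, and the 5000 s limit is an assumption based on typical SAT Competition wall-clock limits:

```python
import pandas as pd

# Hypothetical derivation of the binary timeout target; the file and column
# names are illustrative, and the 5000 s limit is an assumption.
runtimes = pd.read_csv("data/runtimes.csv", index_col="hash")
TIMEOUT = 5000  # seconds
y_timeout = runtimes["some_solver"] >= TIMEOUT  # True = solver timed out
```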
Besides exploring the data, students should use classification approaches. We investigate a traditional "passive" as well as an active-learning scenario for predictions. For active learning, we consider "standard" queries (which always return a label) as well as "timeout" queries (which only run for a limited amount of time). The latter can potentially save query costs but might not return a class label (if the solver did not finish within the given timeout).
`prepare_data.py` pre-processes the dataset (a simplified sketch follows the list):
- download databases with meta data and instance features from GBD
- download runtime data from the SAT-Competition website
- merge databases
- filter instances:
  - 2022 Anniversary Track
  - known satisfiability result
  - no NAs in instance features
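In pandas terms, the merge-and-filter part might be sketched like this; the file names and the exact track/result values are assumptions about the downloaded data (GBD keys instances by hash):

```python
import pandas as pd

# Simplified merge-and-filter sketch; file names and the exact track/result
# values are assumptions about the downloaded data.
meta = pd.read_csv("data/meta.csv")          # meta data from GBD
features = pd.read_csv("data/features.csv")  # instance features from GBD
merged = meta.merge(features, on="hash")     # GBD keys instances by hash
merged = merged[merged["track"].str.contains("anni_2022", na=False)]  # Anniversary Track
merged = merged[merged["result"].isin(["sat", "unsat"])]  # known satisfiability
merged = merged.dropna(subset=features.columns.drop("hash"))  # no NAs in features
```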
The notebook `Active_Learning_Demo.ipynb` demonstrates "standard" active learning with `scikit-activeml`.
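A toy version of such a loop, following `scikit-activeml`'s documented usage pattern (not the notebook's exact code), could look like this:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from skactiveml.classifier import SklearnClassifier
from skactiveml.pool import UncertaintySampling
from skactiveml.utils import MISSING_LABEL

# Toy pool-based active-learning loop in scikit-activeml style; unlabeled
# instances are marked with MISSING_LABEL and revealed one query at a time.
X = np.random.rand(200, 5)
y_true = (X[:, 0] > 0.5).astype(int)
y = np.full(len(y_true), MISSING_LABEL)
clf = SklearnClassifier(DecisionTreeClassifier(), classes=[0, 1])
qs = UncertaintySampling()
for _ in range(20):  # query budget of 20 labels
    query_idx = qs.query(X=X, y=y, clf=clf)
    y[query_idx] = y_true[query_idx]  # a "standard" oracle always answers
    clf.fit(X, y)
```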
The notebook `Timeout_Active_Learning_Demo.ipynb` uses a timeout-aware active-learning oracle that we implemented in the module `al_oracle`, while the query strategies still originate from `scikit-activeml`.
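Conceptually, such an oracle answers a query only if the solver finished within the granted time budget. The following class is a purely hypothetical illustration of that idea, not the actual interface of `al_oracle`:

```python
# Hypothetical illustration of a timeout-aware oracle; this is NOT the
# actual interface of the al_oracle module.
class TimeoutOracleSketch:
    def __init__(self, runtimes, labels):
        self.runtimes = runtimes  # true solver runtime per instance
        self.labels = labels      # true class label per instance

    def query(self, idx, budget):
        # We always pay the time spent, but only get a label if the solver
        # finished within the budget; otherwise the query is "wasted".
        cost = min(self.runtimes[idx], budget)
        label = self.labels[idx] if self.runtimes[idx] <= budget else None
        return label, cost
```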
We provide a scoring mechanism for the "passive" prediction scenario, using MCC as metric:

- `split.py` creates one stratified holdout split for both targets, satisfiability and timeouts. From the 28 available solvers, we choose the winner of the Anniversary Track for timeout prediction (which is not the fastest solver on our filtered instance sample but close to it).
- `score.py` scores submissions for the holdout split, considering both prediction targets.
For the active-learning scenario, we do not have dedicated splitting and scoring scripts since students should plot and compare complete learning curves. Thus, students should determine the prediction quality for each learning iteration on their own (a sketch follows below).
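Such a learning-curve evaluation could be sketched as follows (toy data; the query strategy and budget are arbitrary choices for illustration):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from skactiveml.classifier import SklearnClassifier
from skactiveml.pool import UncertaintySampling
from skactiveml.utils import MISSING_LABEL

# Toy learning-curve sketch: after each queried label, evaluate MCC on a
# fixed test set and plot prediction quality over the number of queries.
X = np.random.rand(300, 5)
y_true = (X[:, 0] + X[:, 1] > 1).astype(int)
X_pool, X_test, y_pool, y_test = train_test_split(X, y_true, test_size=0.5, random_state=25)
y = np.full(len(y_pool), MISSING_LABEL)
clf = SklearnClassifier(DecisionTreeClassifier(random_state=25), classes=[0, 1])
qs = UncertaintySampling()
mcc_curve = []
for _ in range(30):  # query budget of 30 labels
    idx = qs.query(X=X_pool, y=y, clf=clf)
    y[idx] = y_pool[idx]
    clf.fit(X_pool, y)
    mcc_curve.append(matthews_corrcoef(y_test, clf.predict(X_test)))
plt.plot(range(1, len(mcc_curve) + 1), mcc_curve)
plt.xlabel("Number of queried labels")
plt.ylabel("MCC on test set")
plt.show()
```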
For active learning with timeouts, `al_oracle.py` provides splitting and (MCC-based) scoring functionality in the class `ALOracle`, but there still is no scoring script.
We provide small demo prediction scripts for the "passive" scenario, using the same approaches as for Task 1:

- `predict_majority.py` creates a baseline solution that constantly predicts the majority class.
- `predict_tree.py` uses a scikit-learn decision tree (you can easily switch the model) without any preprocessing.