Exploring Hyperparameter Usage and Tuning in Machine Learning Research

This repository provides additional material for our paper: "Exploring Hyperparameter Usage and Tuning in Machine Learning Research", including metadata for the research papers and the statistics of the analyzed code repositories.

API Crawler

We developed an API crawler for three widely used ML libraries: scikit-learn, TensorFlow, and PyTorch. The API crawler can be found here.

Code Repository Analysis

We developed a plugin for each ML library. Each plugin applies static code analysis as well as control- and data-flow analysis to locate API calls from the corresponding library and extract their configuration settings. The plugins are integrated into CfgNet; its implementation can be found here. Our analysis script relies on CfgNet and assumes that it is running on our Slurm cluster if the hostname is tesla or starts with brown. You can find our evaluation script in analysis/.

You can start the analysis by running run.sh. It takes one optional parameter, a Git tree-ish (e.g. main), that selects the version of CfgNet to use.

This analysis requires the ml branch of CfgNet, because only this branch contains the ML library plugins that extract the API calls.
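The host check that the analysis script relies on (Slurm cluster if the hostname is tesla or starts with brown) can be sketched in Python as follows. This is an illustration of the rule only, not the code in analysis/; the function name runs_on_cluster is hypothetical.

```python
import socket


def runs_on_cluster(hostname: str) -> bool:
    """Apply the rule: hostname is exactly 'tesla' or starts with 'brown'."""
    return hostname == "tesla" or hostname.startswith("brown")


if __name__ == "__main__":
    # socket.gethostname() returns the name of the machine running this script.
    host = socket.gethostname()
    mode = "Slurm cluster" if runs_on_cluster(host) else "local machine"
    print(f"{host}: assuming {mode}")
```

On any other machine the check fails, so the Slurm assumption does not apply there.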

The result files will be in results/. You can find the modified repositories in out/.

Data and Scripts

The data/ directory contains all the data used in this paper, while the src/ directory contains all scripts used to process the data.

  • data/dblp/: contains the data crawled from the DBLP digital bibliography

  • data/library_data/: contains the API data of the ML libraries

  • data/paper_analysis/: contains the metadata for each paper

  • data/statistics/: contains the statistics for the analyzed code repositories

  • src/cross_validation/: contains the script to calculate the inter-annotator agreement

  • src/dblp_results/: contains the script to calculate the number of papers dealing with hyperparameter importance and tuning

  • src/library_stats/: contains the script to calculate the total number of API calls and parameters for each library

  • src/repos/: contains the script to identify suitable code repositories from the Papers with Code corpus

  • src/: contains the scripts to process the data extracted from the code repositories and the respective research papers
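For readers unfamiliar with inter-annotator agreement, the following sketch computes Cohen's kappa for two annotators. Which agreement measure the script in src/cross_validation/ actually uses is not stated here, so Cohen's kappa is an assumption, and the function name and example labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)

    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement by chance, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    annotator_1 = ["yes", "yes", "no", "no"]
    annotator_2 = ["yes", "no", "no", "no"]
    print(cohens_kappa(annotator_1, annotator_2))
```

A value of 1 means perfect agreement, while 0 means agreement no better than chance.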