This repo contains the code for the paper "An End-to-End System for Reproducibility Assessment of Source Code Repositories via Their Readmes" (2023).
Demo page is available at: https://repro-der.streamlit.app/
- An End-to-End System for Reproducibility Assessment of Source Code Repositories via Their Readmes
  - Content
  - How to Start?
  - Project Structure
  - Training
  - Evaluation
  - Results
## How to Start?

- Clone the repo.
- Prepare the data. Models are also available at HuggingFace.
- Download all the data and models from the links provided above, unzip/unarchive them, and then copy the `data` and `model` folders into the main directory of the repo.
- Make sure you have Python 3.9.13 installed on your system.
- To use the GitHub API, rename the `example.config.ini` file to `config.ini` and enter your API token.
- Follow the steps specified in Requirements:
  - Use the `pip install poetry` command to install Poetry.
  - Install all the necessary dependencies with the `poetry install` command.
  - Use the `poetry shell` command to enter the Poetry virtual environment.
  - (Optional) Install the pytorch-cuda version with the `poe install` command to run PyTorch on a GPU. Note: CUDA must be installed on your system for this step.
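Once `config.ini` is in place, the token can be read with Python's standard library. A minimal sketch, assuming the file keeps a typical INI layout — the `github` section and `api_token` key below are illustrative assumptions, not necessarily the exact names used in `example.config.ini`:

```python
import configparser

def load_github_token(path="config.ini", section="github", key="api_token"):
    """Read the GitHub API token from an INI-style config file.

    The section/key names here are assumptions based on a typical
    example.config.ini layout; adjust them to match the repo's file.
    """
    config = configparser.ConfigParser()
    read_files = config.read(path)
    if not read_files:
        raise FileNotFoundError(f"Could not read config file: {path}")
    return config[section][key]
```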
## Project Structure

### data

- acl: Contains the Readme files for the articles collected from the ACL Anthology site, and the training data composed of these Readme contents.
- constants: The static data used during development.
- paperswithcode: The data used during the system's evaluation.

### model

The directory where the pretrained models used in the system are located.
### notebooks

- data-analysis-statistics: Data analyses and statistical analysis code.
- data-gathering: Data collection code.
- data-preparation: Data preparation code.
- e2e-system: System evaluation code.
- labelling: Data labelling and agreement calculation code.
- training: Model training code.
The main code files of the system are located in their own directory; you can import the classes defined there and use them in other code. A separate helpers directory contains utility classes such as the Readme parser and the GitHub helper.
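To illustrate what a Readme parser of this kind does, here is a minimal sketch that splits a markdown Readme into (header, content) sections. The function name and behaviour below are hypothetical, not the repo's actual API:

```python
import re

def split_readme_sections(markdown_text):
    """Split a markdown Readme into (header, content) pairs.

    Hypothetical sketch: the real parser in the repo may differ.
    Text before the first header is returned with header None.
    """
    sections = []
    current_header, current_lines = None, []
    for line in markdown_text.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            # Close the previous section before starting a new one.
            if current_header is not None or current_lines:
                sections.append((current_header, "\n".join(current_lines).strip()))
            current_header, current_lines = match.group(2).strip(), []
        else:
            current_lines.append(line)
    sections.append((current_header, "\n".join(current_lines).strip()))
    return sections
```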
## Training

In the notebooks/training directory you can find the code for training both the hierarchical model and the BERT model. It can be run after editing the data and model paths.
## Evaluation

In the notebooks/e2e-system directory you can find the code for evaluating the system. It can be run after editing the data and model paths.
## Results

The performance results of the system are given below.
Labelling Method | Labelling Content | Scoring Type | Correlation | Agreement | Accuracy |
---|---|---|---|---|---|
Text Sim. | Content | Base | 0.549 | 0.521 | 0.665 |
Text Sim. | Content | Consecutive | 0.554 | 0.521 | 0.665 |
Text Sim. | Grouped | Base | 0.579 | 0.542 | 0.697 |
Text Sim. | Grouped | Consecutive | 0.581 | 0.542 | 0.697 |
Text Sim. | Header + Content | Base | 0.578 | 0.523 | 0.685 |
Text Sim. | Header + Content | Consecutive | 0.571 | 0.523 | 0.685 |
Text Sim. | Parent + Header + Content | Base | 0.568 | 0.528 | 0.692 |
Text Sim. | Parent + Header + Content | Consecutive | 0.569 | 0.528 | 0.692 |
Text Sim. | Parent + Header | Base | 0.602 | 0.613 | 0.668 |
Text Sim. | Parent + Header | Consecutive | 0.597 | 0.613 | 0.668 |
Text Sim. | Header | Base | 0.497 | 0.479 | 0.637 |
Text Sim. | Header | Consecutive | 0.473 | 0.479 | 0.637 |
Zero-Shot | Content | Base | 0.582 | 0.563 | 0.662 |
Zero-Shot | Content | Consecutive | 0.586 | 0.563 | 0.662 |
Zero-Shot | Grouped | Base | 0.651 | 0.648 | 0.697 |
Zero-Shot | Grouped | Consecutive | 0.661 | 0.648 | 0.697 |
Zero-Shot | Header + Content | Base | 0.631 | 0.631 | 0.665 |
Zero-Shot | Header + Content | Consecutive | 0.626 | 0.631 | 0.665 |
Zero-Shot | Parent + Header + Content | Base | 0.617 | 0.556 | 0.698 |
Zero-Shot | Parent + Header + Content | Consecutive | 0.624 | 0.556 | 0.698 |
Zero-Shot | Parent + Header | Base | 0.594 | 0.540 | 0.608 |
Zero-Shot | Parent + Header | Consecutive | 0.587 | 0.540 | 0.608 |
Zero-Shot | Header | Base | 0.399 | 0.419 | 0.587 |
Zero-Shot | Header | Consecutive | 0.383 | 0.419 | 0.587 |
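A correlation figure like those in the table compares the system's scores against reference judgments. As an illustration only (this is not the repo's evaluation code, and the paper may use a different correlation measure), a stdlib sketch of the Pearson correlation between two score lists:

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```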
## Citation

If you find this repository useful in your research, please cite it as below:

    @misc{akdeniz2023reproder,
        title={An End-to-End System for Reproducibility Assessment of Source Code Repositories via Their Readmes},
        author={Eyüp Kaan Akdeniz and Selma Tekir and Malik Nizar Asad Al Hinnawi},
        year={2023},
        eprint={2310.09634},
        archivePrefix={arXiv},
        primaryClass={cs.CL}
    }