This is the source code for constructing the CODE4ML dataset.
To load data from Kaggle, you need a kaggle.json file with your username and key specified.
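As a quick sanity check before running the scripts, the credentials file can be validated with a few lines of Python. This is a minimal sketch: the helper name `parse_kaggle_credentials` is hypothetical and not part of this repository, and the sample values are placeholders. It only checks that the two fields named above, `username` and `key`, are present and non-empty.

```python
import json

# The two fields the credentials file is expected to contain.
REQUIRED_FIELDS = ("username", "key")


def parse_kaggle_credentials(text):
    """Parse the contents of a kaggle.json file and check the required fields.

    Hypothetical helper for illustration; it is not part of this repository.
    """
    credentials = json.loads(text)
    missing = [field for field in REQUIRED_FIELDS if not credentials.get(field)]
    if missing:
        raise ValueError("kaggle.json is missing: " + ", ".join(missing))
    return credentials


# Placeholder values, shown only to illustrate the expected file layout.
sample = '{"username": "your_username", "key": "your_api_key"}'
credentials = parse_kaggle_credentials(sample)
```

The Kaggle API client normally looks for this file at `~/.kaggle/kaggle.json`, so the text passed to the helper would usually be read from that path.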
An example of collecting code snippets from Kaggle can be found [here].
This is the official repository for collecting code snippets from Kaggle kernels.
You can find the instructions below.
- Collecting kernel links into the kernel_lists directory.
mkdir kernel_lists
python collect_kernels_from_competitions.py
The script does the following:
- Collects links to the Kaggle competitions into a .csv table
- Runs competition_kernels.sh, which collects kernel information for every competition
- Saves .csv files with the Kaggle kernel links to the kernel_lists directory
Output: .csv files with the Kaggle kernel links
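The per-competition step can be pictured roughly as follows. The contents of competition_kernels.sh are not shown here, so this sketch assumes it relies on the official Kaggle command-line client (`kaggle kernels list` with the `--competition`, `--page`, `--page-size` and `--csv` options). The function names, the page size and the output file naming are all hypothetical.

```python
import os
import subprocess


def kernels_list_command(competition, page=1, page_size=20):
    """Build a Kaggle CLI call that lists one page of a competition's kernels as CSV.

    Assumption: the kernels are listed with the official `kaggle` client.
    """
    return [
        "kaggle", "kernels", "list",
        "--competition", competition,
        "--page", str(page),
        "--page-size", str(page_size),
        "--csv",
    ]


def save_kernels_page(competition, page, out_dir="kernel_lists"):
    """Run the command and store its CSV output in out_dir (hypothetical file naming)."""
    out_path = os.path.join(out_dir, f"{competition}_{page}.csv")
    with open(out_path, "w", encoding="utf-8") as out_file:
        # Needs the kaggle client installed and a valid kaggle.json.
        subprocess.run(kernels_list_command(competition, page), stdout=out_file, check=True)
    return out_path
```

Calling `save_kernels_page("titanic", 1)` would then write one page of results to `kernel_lists/titanic_1.csv`, given a working Kaggle client and credentials.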
- Combining the kernel link tables into one .csv table
python unite_kernel_lists.py
Input: .csv files with the Kaggle kernel links (from the kernel_lists directory)
Output: .csv table with the links to the Kaggle kernels
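The combining step amounts to concatenating several .csv files that share a header. The following is a minimal sketch of that idea using only the standard library; it is not the code of unite_kernel_lists.py. The function name matches the step only for readability, and dropping exact duplicate rows is an assumption about the desired behaviour.

```python
import csv
import glob
import os


def unite_kernel_lists(input_dir, output_path):
    """Merge every .csv file in input_dir into one table and return the row count.

    Assumptions: all files share the same header, and exact duplicate rows
    should be kept only once. output_path should lie outside input_dir so a
    previous result is not read back in.
    """
    header = None
    seen = set()
    rows = []
    for path in sorted(glob.glob(os.path.join(input_dir, "*.csv"))):
        with open(path, newline="", encoding="utf-8") as in_file:
            reader = csv.reader(in_file)
            file_header = next(reader, None)
            if file_header is None:
                continue  # skip empty files
            if header is None:
                header = file_header
            elif file_header != header:
                raise ValueError(f"{path} has a different header: {file_header}")
            for row in reader:
                key = tuple(row)
                if key not in seen:
                    seen.add(key)
                    rows.append(row)
    with open(output_path, "w", newline="", encoding="utf-8") as out_file:
        writer = csv.writer(out_file)
        if header is not None:
            writer.writerow(header)
        writer.writerows(rows)
    return len(rows)
```

For example, `unite_kernel_lists("kernel_lists", "kernels.csv")` would write the merged table to a hypothetical `kernels.csv` in the current directory.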
- Extracting code blocks from the kernels
python code_blocks_extraction.py
Input: .csv table with the links to the Kaggle kernels
Output: .csv table with the following columns: "kernel_id", "code_block", "code_block_id".
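To illustrate the shape of the output table, here is a minimal sketch of how one notebook could be turned into rows with the three columns listed above. It assumes the kernel is available as Jupyter notebook JSON (a `cells` list whose entries have `cell_type` and `source`). The function name is hypothetical, and numbering `code_block_id` from 0 within each kernel is an assumption; code_blocks_extraction.py may do this differently.

```python
import json


def extract_code_blocks(kernel_id, notebook_text):
    """Return one row per non-empty code cell of a notebook.

    Assumption: code_block_id is the position of the block among the
    kernel's code cells, counted from 0.
    """
    notebook = json.loads(notebook_text)
    rows = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        if isinstance(source, list):
            # Notebook JSON may store the source as a list of lines.
            source = "".join(source)
        if not source.strip():
            continue  # skip empty code cells
        rows.append({
            "kernel_id": kernel_id,
            "code_block": source,
            "code_block_id": len(rows),
        })
    return rows
```

The resulting list of dictionaries can be written out with `csv.DictWriter`, using the three column names as field names.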