/bigcode-analysis

Repository for analysis notebooks and experiments of the BigCode project.

Primary language: Jupyter Notebook. License: Apache License 2.0 (Apache-2.0).

BigCode Analysis

This repository contains the analyses done in the BigCode project. It covers datasets, models, architecture choices, and more.

Contents

  • Data analysis: In the folder data_analysis, we analyze these two datasets: python-all-license and python-safe-license. We provide the following statistics:
    • percentage of near duplicates
    • percentage of configuration/test and uncommon files
    • file size distribution
    • loss analysis
    • natural language distribution in comments/docstrings
    • number of files that can be successfully compiled
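
As a rough illustration of the last statistic, the sketch below counts how many Python source strings compile under the running interpreter. The helper names `compiles` and `compile_rate` are hypothetical and are not taken from this repository; the notebooks in data_analysis may measure this differently.

```python
def compiles(source: str) -> bool:
    """Return True if `source` compiles as Python under the running interpreter.

    Hypothetical helper for illustration; not part of this repository.
    """
    try:
        # compile() only parses and byte-compiles; it does not execute the code.
        compile(source, "<string>", "exec")
        return True
    except (SyntaxError, ValueError):
        # SyntaxError covers invalid code; ValueError covers e.g. null bytes
        # on some Python versions.
        return False


def compile_rate(sources) -> float:
    """Fraction of the given source strings that compile (0.0 if empty)."""
    sources = list(sources)
    if not sources:
        return 0.0
    return sum(compiles(s) for s in sources) / len(sources)
```

Note that the result depends on the interpreter version: Python 2 files, for example, will usually fail to compile under Python 3.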

We also provide code to run near-deduplication and to detect the natural language of comments in Python datasets.
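
To make the idea of near-deduplication concrete, here is a minimal sketch that flags pairs of files whose token shingles have a high Jaccard similarity. It is an exact, quadratic-time illustration and not the code shipped in this repository; the function names, the shingle size `k=5`, and the `threshold=0.85` default are all assumptions chosen for the example. At dataset scale, an approximate method such as MinHash is typically used instead.

```python
import re


def shingles(text: str, k: int = 5) -> set:
    """Set of k-token shingles of `text` (hypothetical helper, k is an assumption)."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < k:
        # Short documents become a single shingle (or none if empty).
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicate_pairs(docs, threshold: float = 0.85, k: int = 5):
    """Return index pairs (i, j), i < j, whose similarity is >= threshold."""
    sets = [shingles(d, k) for d in docs]
    pairs = []
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if jaccard(sets[i], sets[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```

The percentage of near duplicates can then be estimated from the number of files that appear in at least one returned pair.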

  • Multi-query attention experiments; for details, refer here.
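
For context on why multi-query attention is of interest: all query heads share a single key/value head, so the key/value cache kept during generation shrinks by a factor equal to the number of heads. The sketch below only does this arithmetic; the function name `kv_cache_bytes` and the example sizes are hypothetical and do not describe the models or results in these experiments.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, bytes_per_elem: int = 2) -> int:
    """Size of the key/value cache in bytes (hypothetical helper).

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes 16-bit floats.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Illustrative sizes only (assumptions, not taken from the experiments).
n_heads = 4
multi_head = kv_cache_bytes(n_layers=2, n_kv_heads=n_heads, head_dim=8, seq_len=16)
multi_query = kv_cache_bytes(n_layers=2, n_kv_heads=1, head_dim=8, seq_len=16)

# With a single shared key/value head, the cache is n_heads times smaller.
ratio = multi_head // multi_query
```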