/bigcode-analysis

Repository for analysis and experiments in the BigCode project.

Primary language: Jupyter Notebook. License: Apache License 2.0 (Apache-2.0).

BigCode Analysis

This repository contains the analysis done in the BigCode project, including analysis of datasets, models, architecture choices, and more.

Contents

  • Data analysis: the data_analysis folder contains code for:

    • Near deduplication
    • Python data analysis:
      • Natural language distribution in comments/docstrings
      • Data decontamination for HumanEval and MBPP benchmarks
      • Percentage of files that can be successfully compiled
      • Percentage of configuration and test files
      • Exploration of unimax sampling on The Stack
      • Some notebooks with early data and model loss analysis
  • Multi-Query Attention experiments (for details, please refer to multi_query_experiments/README.md)
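
As an illustration of the "percentage of files that can be successfully compiled" item above, here is a minimal sketch of how such a check could be done with Python's built-in `compile`. The function names `compiles` and `compile_rate` and the sample inputs are hypothetical and are not taken from the repository's code, which may apply a different method or different criteria.

```python
def compiles(source: str) -> bool:
    """Return True if `source` compiles as a Python module (hypothetical helper)."""
    try:
        # compile() only parses and compiles the source; it does not execute it.
        compile(source, "<string>", "exec")
        return True
    except (SyntaxError, ValueError):
        # ValueError covers sources containing null bytes on older Python versions.
        return False


def compile_rate(sources: list[str]) -> float:
    """Percentage of sources that compile; 0.0 for an empty list."""
    if not sources:
        return 0.0
    ok = sum(compiles(s) for s in sources)
    return 100.0 * ok / len(sources)


# Toy inputs for illustration only: the second entry has a syntax error.
samples = ["x = 1\n", "def f(:\n    pass\n", "print('hi')\n"]
rate = compile_rate(samples)
print(f"{rate:.1f}% of samples compile")
```

Note that the result depends on the Python version running the check, since syntax that is valid in one version may be rejected by another.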