bigcode-project/bigcode-dataset

Which languages to include?

lvwerra opened this issue · 20 comments

Which programming languages should be included in the pretraining dataset?

In contrast to natural language, multilinguality does not increase the vocab size significantly, so I think it could be interesting to be on the liberal side here to open the door to interesting downstream use cases.

Do you mean for training? Or also for releasing data?

Is there a language missing from the list below?

  • Assembly
  • Batchfile
  • C
  • C#
  • C++
  • CMake
  • CSS
  • Dockerfile
  • FORTRAN
  • Go
  • Haskell
  • HTML
  • Java
  • JavaScript
  • Julia
  • Lua
  • Makefile
  • Markdown
  • Perl
  • PHP
  • PowerShell
  • Python
  • Ruby
  • Rust
  • Scala
  • Shell
  • SQL
  • TeX
  • TypeScript
  • Visual Basic

@harm-devries MATLAB and Octave come to mind.

@harm-devries Maybe also add hardware-related programming languages (e.g., Verilog).

@harm-devries reStructuredText (RST) might be useful to include, as many Read the Docs sites use it.

@harm-devries There was a good question on Twitter about languages to be supported. Can you provide guidance, please?

https://twitter.com/sardaaroo/status/1574479628944613376?s=20&t=113Vvos6hJRbn-DoLuMsfQ

Are you guys open to any language? For instance, here at MuleSoft (part of Salesforce) we would love to contribute to this project for the DataWeave language.

We do prefer widely adopted programming languages over niche languages for which there is little data. But we're pretty open to proposals, especially if you can make a compelling case for it.

@cakiki I thought MATLAB was in binary format?
@hajipour interesting, I'd love to learn more about what kind of use cases that could enable!

From the MultiPL-E evaluation suite I saw our list of languages was missing Swift, D, R and Racket.

Randl commented

(Relatively) popular languages missing in the list above (some already mentioned):

  • R
  • Swift
  • Kotlin
  • Verilog/SystemVerilog
  • MATLAB
  • CUDA
  • OCaml

Also, it may be worth including some rarer languages with unique features to diversify the data, e.g., Isabelle/Agda/Coq/Idris (dependent types), Mathematica (symbolic calculations), Prolog (logic programming).

Prolog and Lisp would be cool.

I second Prolog and Lisp. If nothing else, most of the "classical AI" literature is written with these languages in mind. Leaving them out would exclude significant early work.

Totally agree. Interestingly, one can find a lot of code still lying around on various FTP servers and in Usenet group archives.

It would be cool to see what's crawlable from Software Heritage (https://www.softwareheritage.org/). There's tons of old COBOL code around, for example, and given the shortage of legacy-language programmers, it would be a fantastic use case if Copilot-like models could help future generations fix legacy code.
Not sure about stuff like Forth, but I've heard it's still super useful in embedded systems.

A few others that seem relevant to me:

  • Ada
  • Clojure
  • Crystal
  • Cython
  • Elixir
  • Elm
  • Erlang
  • fish language (the shell)
  • F#
  • Nim
  • Solidity
  • Terraform (HCL)
  • WebAssembly
  • Zig

Here is a summary of the languages proposed so far. There is a language-to-extension list here. I added extensions where languages are missing from that list:

Math & Statistics:

  • Mathematica
  • Maple (.mpl)
  • MATLAB
  • Octave (.oct)
  • SPSS (.sps)
  • SAS
  • R

Theorem proof assistants

  • OCaml
  • Isabelle
  • Lean

Natural language

  • reStructuredText

AI languages

  • Prolog
  • Common Lisp (PicoLisp, NewLisp, Emacs Lisp?)

Specialized Languages

  • SystemVerilog
  • Verilog
  • CUDA

Frontend

  • Elm
  • WebAssembly (.wat)

Crypto

  • Solidity (.sol)

DevOps

  • Terraform (.tf)

Audio

  • Csound (.csd)

Others

  • Erlang
  • Kotlin
  • Swift
  • D
  • Racket
  • Clojure
  • COBOL
  • Crystal
  • Cython
  • Elixir
  • fish
  • F# (.fs, .fsi, .ml, .mli, .fsx, .fsscript)
  • Nim (.nim, .nims, .nimble)
  • Zig (.zig)
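
For what it's worth, the extensions added in the summary above could be turned into a lookup table for a first-pass, extension-based classification. This is only a rough sketch: the names `EXTENSIONS`, `EXT_TO_LANGS` and `candidate_languages` are made up for illustration, and it returns a list of candidates because one extension can belong to more than one language (for example, `.ml` and `.mli` are listed under F# above but are also the usual OCaml extensions).

```python
from pathlib import PurePosixPath

# Extensions copied from the summary above. The table covers only the
# languages for which an extension was given there; it is not complete.
EXTENSIONS = {
    "Maple": [".mpl"],
    "Octave": [".oct"],
    "SPSS": [".sps"],
    "WebAssembly": [".wat"],
    "Solidity": [".sol"],
    "Terraform": [".tf"],
    "Csound": [".csd"],
    "F#": [".fs", ".fsi", ".ml", ".mli", ".fsx", ".fsscript"],
    "Nim": [".nim", ".nims", ".nimble"],
    "Zig": [".zig"],
}

# Invert the table: extension -> list of candidate languages.
EXT_TO_LANGS = {}
for lang, exts in EXTENSIONS.items():
    for ext in exts:
        EXT_TO_LANGS.setdefault(ext, []).append(lang)


def candidate_languages(path):
    """Return the languages whose extension matches the file path (may be empty)."""
    ext = PurePosixPath(path).suffix.lower()
    return EXT_TO_LANGS.get(ext, [])
```

Files without a known extension simply get an empty list, so they can be routed to a content-based check or dropped.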

Theorem proof assistants

  • OCaml (this should be in the Other category)
  • Suggested addition: Coq (I believe it has a substantially larger community than Isabelle or Lean. However, good luck with the classification: it uses .v, which is the same extension as Verilog, IIRC.)
  • Isabelle
  • Lean
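
To illustrate the `.v` clash mentioned above, one possible approach is a keyword heuristic on the file contents. This is just a sketch under assumptions: the keyword lists and the function name `guess_v_language` are hypothetical, and a real classifier would need many more signals than a handful of line-initial keywords.

```python
import re

# Hypothetical hint lists: a few keywords that typically start a line in
# each language. They are illustrative only, not an exhaustive grammar.
COQ_HINTS = re.compile(
    r"^\s*(?:Theorem|Lemma|Proof|Qed|Require\s+Import|Inductive|Definition)\b",
    re.MULTILINE,
)
VERILOG_HINTS = re.compile(
    r"^\s*(?:module|endmodule|always|assign|wire|reg)\b",
    re.MULTILINE,
)


def guess_v_language(source):
    """Guess whether the text of a .v file is Coq or Verilog by counting hints."""
    coq_hits = len(COQ_HINTS.findall(source))
    verilog_hits = len(VERILOG_HINTS.findall(source))
    if coq_hits > verilog_hits:
        return "Coq"
    if verilog_hits > coq_hits:
        return "Verilog"
    return "unknown"  # tie or no hints: leave it for manual review
```

Returning "unknown" on a tie keeps ambiguous files out of both buckets rather than guessing.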

On the HF hub, someone was also interested in YAML files.

Hi,
I do think that all assembly languages could be included (at least x86, PowerPC and ARM, maybe m68k as an embedded platform). There are tons of sources for those.

I'm also not seeing any of the BASIC languages (real BASIC, GW-BASIC, Visual Basic, etc.) or Pascal-based languages (Turbo Pascal, Delphi, Modula, etc.)?

V also seems to be gaining popularity.

+1 on YAML. It can be useful for infra-related stuff: Kubernetes definitions are all YAML, and Ansible uses YAML (and maybe others I forget).

Svelte for frontend, though I'm not sure you can count that as another language; it's kind of a special flavor of JS/TS in some way (same goes for Vue, React or Angular).
Maybe look at extensions such as JSX and whatnot.

+1 on YAML, please.

I have been working on compiling a dataset of shader code, primarily from Shadertoy.com using their API. The project is still in progress, as I am currently looking into annotating licenses, but an older version is already available on Hugging Face (https://huggingface.co/datasets/Vipitis/Shadertoys), and I will hopefully have a newer version in a few days. If you aren't afraid of crawling the site, there might be more than twice as many files that the API doesn't have access to.
Shadertoy uses a fragment shader subset of WebGL.

@Vipitis This looks great; I think shader code would be a fantastic addition!

However, inclusion in the training data will have to be contingent upon the code being very clearly permissively licensed, and I can see quite a few rows with no entry for the license. (Side note: this also means you probably cannot license the dataset under cc-by-nc-sa-3.0.)
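
As a rough idea of what such a license filter could look like, here is a minimal sketch. Everything in it is an assumption for illustration: the `license` field name, the `keep_permissive` helper, and the allowlist of SPDX identifiers are hypothetical and are not taken from the Shadertoys dataset or from any project decision.

```python
# Hypothetical allowlist of permissive SPDX identifiers (lowercased).
# The list a project would actually accept is a policy decision.
PERMISSIVE = {
    "mit",
    "apache-2.0",
    "bsd-2-clause",
    "bsd-3-clause",
    "cc0-1.0",
    "unlicense",
}


def keep_permissive(rows, license_field="license"):
    """Keep only rows whose license entry is present and on the allowlist."""
    kept = []
    for row in rows:
        # Missing, None, or empty license entries are treated as not permissive.
        value = (row.get(license_field) or "").strip().lower()
        if value in PERMISSIVE:
            kept.append(row)
    return kept
```

Rows with no license entry are dropped rather than assumed permissive, which matches the "very clearly permissively licensed" requirement.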