Which languages to include?

Question

lvwerra opened this issue 2 years ago · 20 comments

Which programming languages should be included in the pretraining dataset?

In contrast to natural language, multilinguality does not increase the vocab size significantly, so I think it could be interesting to be on the liberal side here to open the door to interesting downstream use cases.

Answer 1 · 2022-09-23T12:14:14.000Z

Do you mean for training? Or also for releasing data?

Is there a language missing from the list below?

Assembly
Batchfile
C
C#
C++
CMake
CSS
Dockerfile
FORTRAN
Go
Haskell
HTML
Java
JavaScript
Julia
Lua
Makefile
Markdown
Perl
PHP
PowerShell
Python
Ruby
Rust
Scala
Shell
SQL
TeX
TypeScript
Visual Basic

Answer 2 · 2022-09-27T12:39:45.000Z

@harm-devries MATLAB and Octave come to mind.

Answer 3 · 2022-09-27T15:02:47.000Z

@harm-devries Maybe also add hardware-related programming languages (e.g., Verilog).

Answer 4 · 2022-09-30T16:45:44.000Z

@harm-devries reStructuredText (RST) might be useful to include, as many Read the Docs sites use it.

Answer 5 · 2022-09-30T19:30:32.000Z

@harm-devries There was a good question on Twitter about languages to be supported. Can you provide guidance, please?

https://twitter.com/sardaaroo/status/1574479628944613376?s=20&t=113Vvos6hJRbn-DoLuMsfQ

Answer 6 · 2022-10-03T16:58:59.000Z

Are you guys open to any language? For instance, here at MuleSoft (part of Salesforce) we would love to contribute to this project for the DataWeave language.

We do prefer widely adopted programming languages over niche languages for which there is little data. But we're pretty open to proposals, especially if you can make a compelling case for it.

@cakiki I thought MATLAB was in binary format?
@hajipour interesting, I'd love to learn more about what kind of use cases that could enable!

From the MultiPL-E evaluation suite I saw our list of languages was missing Swift, D, R and Racket.

Answer 7 · 2022-10-04T13:33:12.000Z

(Relatively) popular languages missing in the list above (some already mentioned):

R
Swift
Kotlin
Verilog/SystemVerilog
MATLAB
CUDA
OCaml

Also, it may be worth including some rarer languages with unique features to diversify the data, e.g., Isabelle/Agda/Coq/Idris (dependent types), Mathematica (symbolic computation), Prolog (logic programming).

Answer 8 · 2022-10-11T16:01:58.000Z

In case this is useful: https://madnight.github.io/githut/#/pull_requests/2022/3

Answer 9 · 2022-10-27T14:19:36.000Z

Prolog and Lisp would be cool.

Answer 10 · 2022-10-27T21:53:36.000Z

I second Prolog and Lisp. If nothing else, most of the "classical AI" literature is written with these languages in mind. Leaving them out would exclude significant early work.

Answer 11 · 2022-10-27T22:03:57.000Z

Totally agree. Interestingly, one can find a lot of code still lying around on various FTP servers and in Usenet group archives.

It would be cool to see what's crawlable from Software Heritage (https://www.softwareheritage.org/). There's tons of old COBOL code around, for example, and given the shortage of legacy-language programmers, it would be a fantastic use case if Copilot-like models could help future generations fix legacy code.
Not sure about stuff like Forth, but I've heard it's still super useful in embedded systems.

Answer 12 · 2022-10-28T12:35:33.000Z

A few others that seem relevant to me:

Ada
Clojure
Crystal
Cython
Elixir
Elm
Erlang
fish language (the shell)
F#
Nim
Solidity
Terraform (HCL)
WebAssembly
Zig

Answer 13 · 2022-11-03T10:27:08.000Z

Here is a summary of the languages proposed so far. There is a language-to-extension list here. I added extensions where languages are missing from the list:

Math & Statistics:

Mathematica
Maple (.mpl)
MATLAB
Octave (.oct)
SPSS (.sps)
SAS
R

Theorem proof assistants

OCaml
Isabelle
Lean

Natural language

reStructuredText

AI languages

Prolog
Common Lisp (PicoLisp, NewLisp, Emacs Lisp?)

Specialized Languages

SystemVerilog
Verilog
CUDA

Frontend

Elm
WebAssembly (.wat)

Crypto

Solidity (.sol)

DevOps

Terraform (.tf)

Audio

Csound (.csd)

Others

Erlang
Kotlin
Swift
D
Racket
Clojure
COBOL
Crystal
Cython
Elixir
fish
F# (.fs, .fsi, .ml, .mli, .fsx, .fsscript)
Nim (.nim, .nims, .nimble)
Zig (.zig)
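
The extensions noted in the summary above could feed a simple lookup when tagging files by language. Below is a minimal sketch, assuming a hand-written table built only from the extensions written out in this comment (F# is left out because its listed extensions include .ml and .mli, which are also used by OCaml). The names `EXTENSION_TO_LANGUAGE` and `guess_language` are hypothetical and not part of any existing tooling:

```python
from pathlib import Path
from typing import Optional

# Hypothetical table, copied from the extensions listed in this comment.
# It is an illustration, not the project's actual language-to-extension list.
EXTENSION_TO_LANGUAGE = {
    ".mpl": "Maple",
    ".oct": "Octave",
    ".sps": "SPSS",
    ".wat": "WebAssembly",
    ".sol": "Solidity",
    ".tf": "Terraform",
    ".csd": "Csound",
    ".nim": "Nim",
    ".nims": "Nim",
    ".nimble": "Nim",
    ".zig": "Zig",
}


def guess_language(filename: str) -> Optional[str]:
    """Return the language for a filename, or None if the extension is unknown."""
    # Lowercase the suffix so "Contract.SOL" and "contract.sol" match alike.
    return EXTENSION_TO_LANGUAGE.get(Path(filename).suffix.lower())
```

A pure extension lookup like this cannot separate languages that share an extension, so it would only be a first pass.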

Answer 14 · 2022-11-03T11:12:51.000Z

Theorem proof assistants

OCaml (this should be in the Other category)
Suggested addition: Coq (I believe it has a substantially larger community than Isabelle or Lean. However, good luck with the classification: it uses .v, which is the same extension as Verilog, IIRC.)
Isabelle
Lean
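
To illustrate the .v classification problem mentioned above, here is a rough sketch of a keyword-count heuristic for telling Coq from Verilog sources. The marker keywords are real keywords of each language, but which ones to use, the tie-breaking rule, and the `classify_v_file` name are all assumptions made for illustration; a real pipeline would need something more robust:

```python
import re

# Assumed marker keywords at the start of a line. Both sets are deliberately
# small; this is a heuristic sketch, not a reliable classifier.
COQ_MARKERS = re.compile(
    r"^\s*(Theorem|Lemma|Proof|Qed|Require|Definition|Inductive)\b",
    re.MULTILINE,
)
VERILOG_MARKERS = re.compile(
    r"^\s*(module|endmodule|always|assign|wire|reg)\b",
    re.MULTILINE,
)


def classify_v_file(source: str) -> str:
    """Guess whether the text of a .v file is Coq or Verilog."""
    coq_hits = len(COQ_MARKERS.findall(source))
    verilog_hits = len(VERILOG_MARKERS.findall(source))
    if coq_hits > verilog_hits:
        return "Coq"
    if verilog_hits > coq_hits:
        return "Verilog"
    # Ties (including files with no markers at all) are left undecided.
    return "unknown"
```

The matching is case-sensitive on purpose, so Coq's capitalized `Module` does not count as Verilog's `module`.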

Answer 15 · 2022-11-03T11:25:45.000Z

On the HF hub, someone was also interested in YAML files.

Answer 16 · 2022-11-03T15:36:54.000Z

Hi,
I do think that all assembly languages could be included (at least x86, PowerPC and ARM, maybe m68k as an embedded platform). There are tons of sources for those.

I'm not seeing any instance of BASIC languages (REALbasic, GW-BASIC, Visual Basic, etc.) or Pascal-based languages (Turbo Pascal, Delphi, Modula, etc.) either?

Answer 17 · 2022-11-03T16:21:58.000Z

V also seems to be gaining popularity.

+1 on YAML; it can be useful for infra-related stuff: Kubernetes definitions are all YAML, and Ansible uses YAML (and maybe others I forget).

Svelte for frontend, though I'm not sure you can count that as another language, it's kind of a special flavor of JS/TS in some way (same goes for Vue, React or Angular).
Maybe look at extensions like JSX and whatnot.

Answer 18 · 2022-11-03T18:24:23.000Z

+1 on YAML, please.

Answer 19 · 2022-11-05T21:15:00.000Z

I have been working on compiling a dataset of shader code, primarily from Shadertoy.com using their API. The project is still in progress, as I am currently looking into annotating licenses, but an older version is already available on Hugging Face (https://huggingface.co/datasets/Vipitis/Shadertoys), and I will hopefully have a newer version in a few days. If you aren't afraid of crawling the site, there might be more than twice as many files that the API doesn't have access to.
Shadertoy uses a fragment shader subset of WebGL.

Answer 20 · 2022-11-06T09:31:31.000Z

@Vipitis This looks great; I think shader code would be a fantastic addition!

However, inclusion in the training data will have to be contingent upon very clearly permissively licensed code and I can see quite a few rows with no entry for the license. (sidenote: this also means you probably cannot license the dataset under cc-by-nc-sa-3.0)
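
As a rough illustration of that constraint, rows without a clearly permissive license could be dropped before training. This is only a sketch: it assumes each row is a dict with a `license` field (a hypothetical column name; the actual dataset schema may differ), and the allowlist below is an illustrative set of SPDX identifiers, not the project's actual policy:

```python
from typing import Dict, Iterable, List, Optional

# Illustrative allowlist of SPDX identifiers, lowercased for comparison.
# The real set of accepted licenses would be a project decision.
PERMISSIVE_LICENSES = {
    "mit",
    "apache-2.0",
    "bsd-2-clause",
    "bsd-3-clause",
    "cc0-1.0",
    "unlicense",
}


def is_permissive(license_id: Optional[str]) -> bool:
    """True only if a license is present and appears in the allowlist."""
    if not license_id:
        # Missing or empty license entries are treated as not permitted.
        return False
    return license_id.strip().lower() in PERMISSIVE_LICENSES


def keep_permissive(rows: Iterable[Dict]) -> List[Dict]:
    """Keep rows whose (assumed) `license` field is clearly permissive."""
    return [row for row in rows if is_permissive(row.get("license"))]
```

Treating a missing license as "exclude" is the conservative choice here, which matches the concern about rows with no license entry.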