bigcode-project/bigcode-dataset

TF-Update The Stack with new languages and licenses

lvwerra opened this issue · 5 comments

The first version of The Stack included weak copyleft licenses, and we should exclude them.

FWIW, you may want to consider a more up-to-date license detection engine such as scancode-toolkit (which I maintain); otherwise you may have several undetected licenses.

@pombredanne thanks! I didn't know about scancode-toolkit. We'll use that in the future :)

@harm-devries I am hopelessly biased of course, but it is considered not too shabby. It is used by SWH in https://annex.softwareheritage.org/public/dataset/license-blobs/ (which may be something you would fancy), tern, ORT, and many more. Ping me if you need help!

@lvwerra since you closed this, can you elaborate on what exactly has been done?

For the current iteration we used the same license information as for the original Stack, but extended it to ~150 permissive licenses and added ~300 programming languages. There will likely be more updates to The Stack, where we can update the license detector.
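
For readers following along, here is a rough sketch of what allowlist-based license filtering can look like. It is not the actual BigCode pipeline: the `PERMISSIVE` set, the `is_permissive` and `filter_repos` helpers, and the record layout are all hypothetical, and the real allowlist of ~150 licenses is not reproduced here. The sketch assumes each repository already carries a list of detected SPDX license identifiers.

```python
# Hypothetical, tiny stand-in for the real allowlist of ~150 permissive licenses.
PERMISSIVE = {"MIT", "Apache-2.0", "BSD-3-Clause"}


def is_permissive(licenses):
    """Return True only if at least one license was detected and all are allowlisted."""
    return bool(licenses) and all(lic in PERMISSIVE for lic in licenses)


def filter_repos(repos):
    """Keep repositories whose detected licenses are all permissive."""
    return [repo for repo in repos if is_permissive(repo["licenses"])]


# Illustrative records (assumed layout); MPL-2.0 and LGPL-3.0-only are weak copyleft.
repos = [
    {"name": "a", "licenses": ["MIT"]},
    {"name": "b", "licenses": ["MPL-2.0"]},
    {"name": "c", "licenses": ["MIT", "LGPL-3.0-only"]},
    {"name": "d", "licenses": []},
]

kept = [repo["name"] for repo in filter_repos(repos)]
print(kept)
```

In this sketch a repository with no detected license is dropped as well, which is a conservative design choice rather than something stated in this thread.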