TF-Update The Stack with new languages and licenses

Question

lvwerra opened this issue 2 years ago · 5 comments

The first version of The Stack included weak copyleft licenses, and we should exclude them.

Answer 1 · 2022-11-03T12:22:49.000Z

FWIW, you may want to consider a more up-to-date license detection engine such as scancode-toolkit (that I maintain) as you may otherwise have several undetected licenses.

Answer 2 · 2022-11-03T13:42:05.000Z

@pombredanne thanks! I didn't know about scancode-toolkit. We'll use that in the future :)

Answer 3 · 2022-11-03T13:47:48.000Z

@harm-devries I am hopelessly biased of course, but this is considered not too shabby. Used by SWH in https://annex.softwareheritage.org/public/dataset/license-blobs/ (which likely may be something you would fancy), tern, ORT and many more. Ping me if you need help!

Answer 4 · 2022-11-21T11:06:10.000Z

@lvwerra since you closed this can you elaborate what has been done exactly?

Answer 5 · 2022-11-21T11:09:09.000Z

For the current iteration we used the same license information as for the original Stack but extended it to ~150 permissive licenses and added roughly 300 programming languages. There will likely be more updates to The Stack where we can update the license detector.
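
To make the filtering described above concrete, here is a minimal sketch of keeping only repositories whose detected licenses are on a permissive allowlist. It is not the actual pipeline: the three-entry `PERMISSIVE_LICENSES` set, the `is_permissive` helper, the sample `repos` records, and the policy of dropping repositories with no detected license are all assumptions made for illustration.

```python
# Hypothetical allowlist of SPDX identifiers. The real list of ~150
# permissive licenses is not given in this thread.
PERMISSIVE_LICENSES = {"MIT", "Apache-2.0", "BSD-3-Clause"}


def is_permissive(detected_licenses):
    """Return True only if at least one license was detected and
    every detected license is on the allowlist (assumed policy)."""
    return bool(detected_licenses) and all(
        lic in PERMISSIVE_LICENSES for lic in detected_licenses
    )


# Made-up records standing in for per-repository license detection output.
repos = [
    {"name": "repo-a", "licenses": ["MIT"]},
    {"name": "repo-b", "licenses": ["MPL-2.0"]},  # weak copyleft, excluded
    {"name": "repo-c", "licenses": []},           # nothing detected, excluded here
]

kept = [r["name"] for r in repos if is_permissive(r["licenses"])]
print(kept)
```

With this shape, swapping in a different license detector only changes how the `licenses` field is produced; the allowlist check itself stays the same.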