common-voice/cv-sentence-extractor

Legal issues

RobinvanderVliet opened this issue · 2 comments

As a non-American contributor to Common Voice and Wikipedia, this project worries me.

Has the use of this tool been properly discussed? As fair use does not exist in most parts of the world, the resulting dataset would not be public domain in a lot of countries, for example in my home country, the Netherlands. I actually want to use the resulting dataset as public domain. I don't want to worry that the dataset gets contaminated with sources that are not public domain. Because of this project, I cannot do this anymore in my home country.

Besides that, American fair use law only permits very limited use of copyrighted material. I think copying 3 whole sentences from each and every article is way too much.

Has this project been cleared and analyzed by international copyright law experts?

A little side note to this: since some languages have already made massive use of sentences from Wikipedia, there is no real way back for them. If the dataset can only be distributed as CC0 in the US, it becomes much more important that the project provides pretrained models for more languages.

Would it be possible to provide the dataset as CC BY-SA in the rest of the world?

Hello,

We consulted Mozilla legal about the extraction process and also communicated it to Wikipedia. Our dataset remains Public Domain worldwide.

https://discourse.mozilla.org/t/extending-our-sentence-collection-capabilities/38783

Feel free to open a conversation on the Common Voice Discourse if you have other concerns, since we only use GitHub for technical issues about this script.

Thanks for your feedback.