Add WikiSource as target
MichaelKohler opened this issue · 4 comments
As per https://discourse.mozilla.org/t/legal-sentence-extraction-can-i-use-wikisource-cc0-for-sentence-collection/79917/8 we are allowed to take 3 sentences per article (same as for Wikipedia). For now we do not treat CC0 content any differently, so that we can guarantee the legal requirements are met. For the same reason, exports will need to be done through the Sentence Extractor (also the same as for Wikipedia).
Open questions:
- How similar is the dump file to Wikipedia?
- Can the WikiExtractor be used reliably for WikiSource as well?
If it's very similar to Wikipedia, we can probably reuse the same logic. However, it should still be a separate target so that both exports can be run separately.
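One encouraging data point: the WikiSource dumps follow the same naming scheme as the Wikipedia ones on dumps.wikimedia.org, so the download step should only differ in the project suffix. A minimal sketch of that pattern (the `build_dump_url` helper is hypothetical, purely for illustration):

```python
# Sketch: Wikipedia and WikiSource dumps share the same naming scheme on
# dumps.wikimedia.org; only the project suffix differs. The helper name
# `build_dump_url` is hypothetical.
DUMP_BASE = "https://dumps.wikimedia.org"

def build_dump_url(language: str, project: str = "wiki") -> str:
    """Latest pages-articles dump URL for a language/project.

    project is "wiki" for Wikipedia and "wikisource" for WikiSource.
    """
    name = f"{language}{project}"
    return f"{DUMP_BASE}/{name}/latest/{name}-latest-pages-articles.xml.bz2"

# https://dumps.wikimedia.org/dewiki/latest/dewiki-latest-pages-articles.xml.bz2
print(build_dump_url("de"))
# https://dumps.wikimedia.org/dewikisource/latest/dewikisource-latest-pages-articles.xml.bz2
print(build_dump_url("de", "wikisource"))
```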
The GitHub Action logic should be straightforward and can possibly be generalized a bit to reuse most of the Wikipedia pipeline.
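A rough sketch of what that generalization could look like, assuming a manually triggered workflow; the `dump-type` input and the `extract.sh` wrapper are assumptions, not the actual repository layout:

```yaml
# Hypothetical workflow sketch: one pipeline, parameterized by dump type.
name: Extraction
on:
  workflow_dispatch:
    inputs:
      language:
        description: "Dump language code, e.g. de"
        required: true
      dump-type:
        description: "wiki for Wikipedia, wikisource for WikiSource"
        default: wiki
jobs:
  extract:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      # extract.sh is a hypothetical wrapper around the shared
      # download + WikiExtractor + Sentence Extractor steps.
      - name: Run extraction
        run: ./extract.sh "${{ github.event.inputs.language }}" "${{ github.event.inputs.dump-type }}"
```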
Hey, I would like to test this for German and Esperanto. Both languages need improved rules files, since the existing ones were written for an older version of the extractor. This mainly means switching from a list of disallowed symbols to a whitelist based on the language's alphabet. I hope I can test this during the week and give some feedback afterwards.
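To illustrate the difference in a conceptual Python sketch (not the Sentence Extractor's actual rules format): a blacklist lets through any symbol nobody thought to forbid, while an alphabet whitelist rejects everything outside the language's character set by default.

```python
# Conceptual sketch of blacklist vs. whitelist filtering; the actual
# Sentence Extractor uses per-language rules files, not this code.

# Blacklist: reject sentences containing known-bad symbols. Anything
# nobody thought of (e.g. stray Cyrillic in a German dump) slips through.
DISALLOWED = set("<>{}[]|#&")

def passes_blacklist(sentence: str) -> bool:
    return not any(ch in DISALLOWED for ch in sentence)

# Whitelist: only accept symbols from the language's alphabet plus basic
# punctuation; everything unexpected is rejected by default.
GERMAN_ALLOWED = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "äöüÄÖÜß"
    " .,;:!?'\"-()"
)

def passes_whitelist(sentence: str) -> bool:
    return all(ch in GERMAN_ALLOWED for ch in sentence)

print(passes_blacklist("Это немецкий?"))  # True  - blacklist misses it
print(passes_whitelist("Это немецкий?"))  # False - whitelist rejects it
```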
That's great to hear! I'm happy to help if you have any questions. I don't know how much time I'll have, but I should be available to answer at least some of them.
Wikisource is a valid resource; for Italian we are using it on the model side for the predictive part.
For our exporter in Python: https://github.com/MozillaItalia/DeepSpeech-Italian-Model/blob/master/MITADS/wikisource_importer.py
I have tested this with (part of) the bn WikiSource dump file. I was able to use the WikiExtractor and Sentence Extractor as explained in the README here, so I don't think we need many changes. The output obviously wasn't perfect, as I used the English rules file, which is completely useless for a different script. However, I think that with a dedicated rules file for bn this would work out as well. I didn't test with English or German, as I'm currently not able to download larger files.
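For reference, the flow I followed is roughly the one from the README, sketched below in Python; the exact flags and paths are assumptions based on the usual WikiExtractor and Sentence Extractor invocations, so check the README for the authoritative commands.

```python
# Sketch of the manual test flow for a bn WikiSource dump. Flags and
# paths are assumptions; the README remains the authoritative reference.
import subprocess

DUMP = "bnwikisource-latest-pages-articles.xml.bz2"

# Step 1: extract plain-text articles from the dump as JSON files
# (older WikiExtractor checkouts are run as "python WikiExtractor.py").
subprocess.run(
    ["python", "-m", "wikiextractor.WikiExtractor",
     "--json", "-o", "extracted", DUMP],
    check=True,
)

# Step 2: run the (Rust) Sentence Extractor over the extracted files and
# append the resulting sentences to a per-language output file.
with open("wiki.bn.txt", "a") as out:
    subprocess.run(
        ["cargo", "run", "--", "extract", "-l", "bn", "-d", "extracted/"],
        check=True,
        stdout=out,
    )
```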
I will integrate it into the automation process in the next few days.