common-voice/cv-sentence-extractor

Add de-duplication GitHub Actions step

MichaelKohler opened this issue · 1 comments

With #101 I've introduced a way to run the Wikipedia extraction automatically. If a language has multiple split archive files, we run the script once for each part. This means that the script's de-duplication only works within each part, not across all parts together.

Because of that, we should add a final step to the GitHub Action that takes the extracted sentences file and de-duplicates it, so that no duplicate lines remain in the file. The de-duplicated file should then be uploaded, either under the same name so that it overwrites the original, or under a new name with the original file deleted.
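
For illustration, here is a minimal sketch of what such a de-duplication step could run. It is not taken from the repository: the function name `deduplicate_lines` and the file paths passed to it are hypothetical, and it assumes the extracted sentences file is UTF-8 text with one sentence per line and small enough to hold in memory. It keeps the first occurrence of each line and preserves the original order.

```python
def deduplicate_lines(input_path: str, output_path: str) -> int:
    """Write each distinct line of input_path to output_path exactly once.

    The first occurrence of a line is kept and the original order is
    preserved. Returns the number of duplicate lines that were dropped.
    """
    # Read everything before writing, so that input_path and output_path
    # may point at the same file (the "overwrite" variant).
    with open(input_path, encoding="utf-8") as source:
        lines = [line.rstrip("\n") for line in source]

    seen = set()
    dropped = 0
    with open(output_path, "w", encoding="utf-8") as target:
        for line in lines:
            if line in seen:
                dropped += 1
                continue
            seen.add(line)
            target.write(line + "\n")
    return dropped
```

Passing the same path for both arguments gives the "same name and overwriting" variant; passing a different output path leaves the original file in place so the workflow can delete it afterwards.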