Add de-duplication GitHub Actions step
MichaelKohler opened this issue · 1 comments
With #101 I've introduced a way to run the Wikipedia extraction automatically. If a language has multiple split archive files, we run the script once for each part. This means that the script's de-duplication only works within each part, not across all parts together.
Because of that, we should add a final step to the GitHub Action which takes the extracted sentences file and de-duplicates it, so that no duplicate lines remain in the file. The de-duplicated file should then be uploaded, either under the same name (overwriting the original) or under a new name with the other file deleted.
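For illustration, here is a rough sketch of what such a final step could run. It is not the code from #101. It assumes the extracted sentences are in a single UTF-8 text file with one sentence per line, that the first occurrence of each sentence should be kept in its original position, and that empty lines can be dropped. The names `dedupe_lines` and `dedupe_file` are made up for this example.

```python
import os
import tempfile
from typing import Iterable, Iterator


def dedupe_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield each distinct sentence once, in order of first appearance."""
    seen = set()
    for line in lines:
        # Strip the line ending so "foo" and "foo\n" count as the same sentence.
        sentence = line.rstrip("\r\n")
        # Assumption: empty lines carry no sentence and are dropped.
        if sentence and sentence not in seen:
            seen.add(sentence)
            yield sentence


def dedupe_file(path: str) -> None:
    """Rewrite the file at `path` without duplicate lines (same name, overwritten)."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, then swap it into place,
    # so a failure halfway through does not leave a truncated sentences file.
    with open(path, encoding="utf-8") as source, tempfile.NamedTemporaryFile(
        "w", encoding="utf-8", dir=directory, delete=False
    ) as target:
        for sentence in dedupe_lines(source):
            target.write(sentence + "\n")
    os.replace(target.name, path)
```

Calling `dedupe_file` on the combined output file before the upload step would cover the "same name and overwriting" option. Note that the set of seen sentences is held in memory, which may or may not be acceptable for the largest Wikipedia dumps.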
That was already added in the initial Pull Request: https://github.com/Common-Voice/cv-sentence-extractor/pull/101/files#diff-2b6fa25316d8999e4b319e6f99403c65R41