Implement filtering by title
MichaelKohler opened this issue · 0 comments
To be able to re-run extractions from Wikipedia only for new articles, we need to implement filtering by title. There is already a workflow file querying Wikipedia for new articles: https://github.com/common-voice/cv-sentence-extractor/blob/main/.github/workflows/manual-dispatch-wikipedia-rerun.yml.
This needs to be extended in the following way:
- Add a new command flag to indicate which file to use for the titles
- If that flag is set, only articles matching a title in that file should be used
- Integrate the extraction into the workflow file linked above so that it can be used end-to-end
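To make the first two points concrete, here is a minimal sketch of the filtering logic in Python. It only illustrates the idea and is not the extractor's actual code. It assumes the titles file contains one title per line and that each article is a dict with a `title` key; the names `load_titles` and `filter_articles` are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Optional, Set


def load_titles(path: str) -> Set[str]:
    """Read one title per line (assumed file format), skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def filter_articles(
    articles: Iterable[Dict[str, str]],
    titles: Optional[Set[str]],
) -> Iterator[Dict[str, str]]:
    """Yield only articles whose title is in `titles`.

    If `titles` is None (the flag was not set), every article is yielded,
    so the existing behaviour stays unchanged.
    """
    if titles is None:
        yield from articles
        return
    for article in articles:
        # Exact, case-sensitive match; any normalisation is left open here.
        if article["title"] in titles:
            yield article
```

Using a set keeps each lookup constant-time, which matters when the full dump is scanned against a long list of new titles. Whether titles need normalisation (underscores versus spaces, case) before comparison is an open question this sketch does not answer.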
Additionally, the script needs some improvements. For example, there is a bug where the termination condition is wrong, so the last page might be fetched over and over again.
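One possible shape for a safer pagination loop is sketched below in Python. It is an assumption about how the fix could look, not a description of the current script: `fetch_page` is a hypothetical callable that takes a continuation token (`None` for the first page) and returns the page's titles plus the next token, or `None` when there are no more pages.

```python
from typing import Callable, List, Optional, Set, Tuple

# (titles on this page, token for the next page or None when finished)
Page = Tuple[List[str], Optional[str]]


def fetch_all_titles(fetch_page: Callable[[Optional[str]], Page]) -> List[str]:
    """Collect titles from every page, stopping exactly once at the end."""
    titles: List[str] = []
    seen_tokens: Set[str] = set()
    token: Optional[str] = None
    while True:
        page_titles, next_token = fetch_page(token)
        titles.extend(page_titles)
        # Stop when there is no next page, or when a token repeats,
        # which would otherwise request the same page indefinitely.
        if next_token is None or next_token in seen_tokens:
            break
        seen_tokens.add(next_token)
        token = next_token
    return titles
```

The repeated-token check is a defensive guard: even if the end-of-results signal is missed, the loop cannot request the same page forever.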