common-voice/cv-sentence-extractor

Implement filtering by title

MichaelKohler opened this issue · 0 comments

To re-run extractions from Wikipedia for new articles only, we need to implement filtering by title. There is already a workflow file that queries Wikipedia for new articles: https://github.com/common-voice/cv-sentence-extractor/blob/main/.github/workflows/manual-dispatch-wikipedia-rerun.yml.

This needs to be extended in the following way:

  • Add a new command-line flag that specifies which file to read the titles from
  • If that flag is set, only articles matching a title in that file should be used
  • Integrate the extraction into the workflow file linked above so that it can be used end-to-end
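
To make the intended behavior concrete, here is a minimal, language-agnostic sketch in Python of the filtering step. It is not the extractor's actual code: the function names `parse_titles` and `filter_articles` are hypothetical, and it assumes the title file holds one title per line and that matching is exact and case-sensitive.

```python
from typing import Iterable, Iterator, Optional, Set, Tuple

Article = Tuple[str, str]  # (title, text); a hypothetical shape for illustration


def parse_titles(lines: Iterable[str]) -> Set[str]:
    """Collect trimmed, non-empty titles (assumed format: one title per line)."""
    return {line.strip() for line in lines if line.strip()}


def filter_articles(
    articles: Iterable[Article], titles: Optional[Set[str]]
) -> Iterator[Article]:
    """Yield only articles whose title is in `titles`.

    `titles=None` models the flag being unset: every article passes through.
    """
    for title, text in articles:
        if titles is None or title in titles:
            yield title, text


# Example input standing in for a parsed Wikipedia dump and a title file.
articles = [("Berlin", "..."), ("Paris", "..."), ("Rome", "...")]
wanted = parse_titles(["Berlin\n", "\n", "  Rome  \n"])

selected = [title for title, _ in filter_articles(articles, wanted)]
unfiltered = [title for title, _ in filter_articles(articles, None)]
```

Holding the titles in a set keeps each lookup constant-time, which matters when the filter runs against every article in a full dump.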

Additionally, the script needs some improvements. For example, there is a bug in its termination conditions that can cause the last page to be fetched over and over again.
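
One possible shape for a fixed loop, sketched in Python under assumptions: `fetch_page` is a hypothetical callable that takes a continuation token and returns the page's items plus the next token (or `None` when there are no more pages). The real script's API calls and response format may differ. The loop stops when no token is returned, and also when a token repeats, which guards against the last page being requested again and again.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical signature: token in, (items, next_token) out.
FetchPage = Callable[[Optional[str]], Tuple[List[str], Optional[str]]]


def fetch_all(fetch_page: FetchPage) -> List[str]:
    """Fetch every page exactly once, following continuation tokens."""
    results: List[str] = []
    seen_tokens = set()
    token: Optional[str] = None
    while True:
        items, next_token = fetch_page(token)
        results.extend(items)
        # Stop when there is no next page, or when the server hands back a
        # token we already followed (which would otherwise loop forever).
        if next_token is None or next_token in seen_tokens:
            break
        seen_tokens.add(next_token)
        token = next_token
    return results


# Fake paginated source whose last page keeps pointing at itself.
PAGES = {
    None: (["A", "B"], "p2"),
    "p2": (["C"], "p3"),
    "p3": (["D"], "p3"),
}
calls: List[Optional[str]] = []


def fake_fetch(token: Optional[str]) -> Tuple[List[str], Optional[str]]:
    calls.append(token)
    return PAGES[token]


titles = fetch_all(fake_fetch)
```

In this fake source the final page returns its own token, and the loop still requests it only once.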