common-voice/cv-sentence-extractor

Excluding footnotes and literature

tarnh opened this issue · 1 comments

tarnh commented

I was just given the "sentence" Seltmann, Guntram; Holst, Otto: "The Bacterial Cell Wall. to review in German.

It's a reference from the Wikipedia article Murein-Lipoprotein, but it is not wrapped in <ref>..</ref> as it should be.

IMHO it would be safer to exclude everything under == Literatur == ("Literature") or == Quellen == ("Sources").
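To make the idea concrete, here is a minimal sketch of dropping such sections, assuming access to the raw wikitext with its headings still intact. The function name `drop_sections` and the `EXCLUDED_SECTIONS` set are hypothetical and not part of cv-sentence-extractor or WikiExtractor.

```python
import re

# Hypothetical list of section titles to drop; extend per language.
EXCLUDED_SECTIONS = {"Literatur", "Quellen"}

# Matches a level-2 wikitext heading such as "== Literatur ==".
# Deeper headings ("=== ... ===") do not match, so subsections
# stay part of whichever level-2 section they belong to.
HEADING = re.compile(r"^==\s*([^=].*?)\s*==\s*$")


def drop_sections(wikitext, excluded=EXCLUDED_SECTIONS):
    """Return the wikitext without the excluded level-2 sections."""
    kept = []
    skipping = False
    for line in wikitext.splitlines():
        match = HEADING.match(line)
        if match:
            # A new level-2 heading starts: decide whether to skip it.
            skipping = match.group(1) in excluded
        if not skipping:
            kept.append(line)
    return "\n".join(kept)
```

This only works before the headings are removed, so it would have to run on the dump itself rather than on already extracted text.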

I'd be very happy to have these parts excluded, but that might get tricky. I've looked at a few articles, and in them the "Literatur" parts were just plain headers and content, with nothing to tell them apart from other sections. Header information is stripped out by the WikiExtractor, so apart from the fairly tricky lookback check this would need, it's not even easy to identify the relevant headers.

However, given that these references follow quite similar patterns, adding those patterns to the abbreviation config to filter them out might be the way to go.
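As a rough illustration of what such a pattern could look like, the sketch below flags sentences that start with "Surname, Given name" pairs separated by semicolons and followed by a colon. It is plain Python rather than the extractor's actual rule format, and the names `NAME`, `CITATION` and `looks_like_citation` are assumptions made up for this example.

```python
import re

# One name part: letters only, optionally several words joined by
# spaces or hyphens (e.g. "Holst" or "Müller-Schmidt").
NAME = r"[^\W\d_]+(?:[ -][^\W\d_]+)*"

# Hypothetical citation pattern: "Surname, Given; Surname, Given: ..."
CITATION = re.compile(rf"^{NAME}, {NAME}(?:; {NAME}, {NAME})*:")


def looks_like_citation(sentence):
    """Return True if the sentence starts like an author list."""
    return CITATION.match(sentence.strip()) is not None
```

A pattern this simple will also catch some ordinary sentences that happen to open with "word, word:", so it would need tuning against real data before being used as a filter.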