Improve sentence separation
There are a lot of partial sentences such as:
- "is applied to species he described."
- "a fully documented genealogy e.g."
- "are verbs."
These seem to occur commonly after an abbreviation that ends with a full stop/period, or after something like "i.e." or "e.g.". It seems that the script is interpreting these full stops as the end of a sentence.
A way to improve this would be to only split up the sentence if the next character after the full stop is a capital letter.
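To make the suggestion concrete, here is a minimal Python sketch of that heuristic. This is not code from the extractor, and the function name is made up for the example:

```python
import re

def split_on_capitalized_stops(text):
    # Hypothetical helper illustrating the heuristic above; not the extractor's code.
    # Only treat a full stop as a sentence boundary when the next non-space
    # character is an uppercase letter.
    sentences = []
    start = 0
    for match in re.finditer(r"\.\s+", text):
        following = text[match.end():match.end() + 1]
        if following.isupper():
            sentences.append(text[start:match.end()].strip())
            start = match.end()
    remainder = text[start:].strip()
    if remainder:
        sentences.append(remainder)
    return sentences

print(split_on_capitalized_stops(
    "The name is applied to species he described, e.g. beetles. It stuck."
))
# ['The name is applied to species he described, e.g. beetles.', 'It stuck.']
```

Note that this simple rule would still mis-split after an abbreviation that happens to be followed by a capitalized word (e.g. "i.e. Germany"), so it is only a heuristic.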
These come from the WikiExtractor, so there is not much we can do here. I'd be glad if somebody wanted to take a look at the WikiExtractor to improve that, though.
Can you try again? I think this was solved with the rules that define how a sentence should start and end.
Good sentence tokenization can be hard. There is a long list of rules for different languages in SRX format. I know of implementations in Java and C++.
@fiji-flo do you remember what you used here? Can you see potential improvements from implementing SRX files?
We don't split sentences in this script. The sentences are already getting split by the WikiExtractor.
Are you sure? After the WikiExtractor we get whole articles in JSON format with the fields url, text, and title.
Yep, never mind me. Sorry, I'm too tired to think, it seems. We're currently using the Punkt sentence tokenizer: https://github.com/Common-Voice/common-voice-wiki-scraper/blob/master/src/extractor.rs#L30
Regarding punkt, NLTK docs mention:
Punkt is designed to learn parameters (a list of abbreviations, etc.) unsupervised from a corpus similar to the target domain. The pre-packaged models may therefore be unsuitable
Maybe it would be a good idea to first train Punkt on the whole text and then use that model instead of the pre-trained one. At least for Polish, the results are very poor with the pre-trained models. Very common abbreviations in encyclopedia-style sentences are wrongly recognized as sentence endings. My best guess is that the NLTK models were trained on texts like books and articles, which use far fewer (and different) abbreviations.
Edit: I did some testing and the problem isn't with Punkt but with rust-punkt. I checked how the Python implementation of Punkt behaves on the same text. The model loaded as a pickle from NLTK splits sentences correctly. The model loaded from JSON also works with the Python implementation. The same model loaded into rust-punkt produces wrong output.
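For reference, the training approach suggested above looks roughly like this with NLTK's Python implementation of Punkt. This is only a sketch, not the project's code, and the input file name is hypothetical:

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# "wiki_extract.txt" is a hypothetical dump of the extracted article text.
with open("wiki_extract.txt", encoding="utf-8") as f:
    corpus = f.read()

# Passing raw text to the constructor makes Punkt learn its parameters
# (abbreviation list etc.) unsupervised from that corpus.
tokenizer = PunktSentenceTokenizer(corpus)

for sentence in tokenizer.tokenize(corpus[:1000]):
    print(sentence)
```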
Thanks @Scarfmonster. That's unfortunate. I'd be happy to replace the library with something else if there is something which has better results. It's a small part of the code base and should be easy to replace.
Actually, rust-punkt currently has no maintainer.
Summary:
- We are using rust-punkt: https://github.com/Common-Voice/cv-sentence-extractor/blob/master/src/extractor.rs#L6
- We do not do any additional training for it
- rust-punkt is unmaintained - and does not work well with some of the languages we want to support
- Python's NLTK Punkt implementation works better than rust-punkt (thanks to @Scarfmonster for testing this) - ferristseng/rust-punkt#16
Here are my thoughts:
- We should use something that works for as many different languages as possible, as there are quite a few languages for which we could leverage bigger data sets
- Preferably there would be a Rust implementation that fits our needs - however, I couldn't find anything suitable yet
- Using NLTK might be an option - though I'm not sure yet how we could best integrate it so that we don't end up with "just another script that needs to be run separately", while also making sure it can be used for different sources (see the rough sketch after this list)
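One possible shape for such an integration, purely as a sketch and not an agreed design, would be a small Python wrapper that reads raw text on stdin and writes one sentence per line, so the extractor could pipe any source through it:

```python
#!/usr/bin/env python3
# Hypothetical wrapper sketch: read raw text on stdin, print one sentence per
# line on stdout, so the Rust extractor could shell out to it for any source.
import sys

from nltk.tokenize.punkt import PunktSentenceTokenizer

text = sys.stdin.read()
# Train on the incoming text itself so the learned abbreviations match the source.
tokenizer = PunktSentenceTokenizer(text)
for sentence in tokenizer.tokenize(text):
    print(sentence)
```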
The rust-punkt page says "The Punkt algorithm allows you to derive all the necessary data to perform sentence tokenization from the document itself." Is there a way we can use this for the languages that don't already have compiled data from NLTK?
Well, given @Scarfmonster's comment above it doesn't sound like we can use the precompiled models from NLTK. Or did I misunderstand that comment? And for training rust-punkt, we might need to create the models depending on the source. If we train it on Wikipedia data, that might or might not work for other sources. So worst case we might need to integrate the training itself into the extraction and always use the given source for it.
I certainly won't have time to look into this in the next 3 weeks.
Hi, I just came across this issue while searching for a fast sentence boundary detection library in Rust for nlprule. For your use case, nnsplit might work; it has some speed issues when used from Rust (~6 ms for 40 chars on my i5 8600K), but the splits should be of high quality.
An SRX parser or working Punkt implementation in Rust would be great. I might take a look at writing one in the near future (bminixhofer/nlprule#15).
I don't think there will, in the end, be any sentence splitter that works well for every language.
We should probably allow each language to pick its preferred splitter in the configuration file.
--
For Thai sentence splitters, we have a few:
- thai-segmenter https://github.com/Querela/thai-segmenter - trained on the ORCHID corpus (formal/scientific documents), with features including POS tags
- CRFcut https://pythainlp.github.io/docs/2.3/api/tokenize.html#module-pythainlp.tokenize.crfcut - trained on TED Talk subtitles, so the output can be more like a clause than a sentence (though it's very debatable in Thai linguistic circles anyway whether the language has a construct that is really a "sentence" like in some other languages); see the usage sketch after this list
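For reference, calling CRFcut through PyThaiNLP looks roughly like this (a sketch assuming pythainlp 2.x, where "crfcut" is one of the available sent_tokenize engines):

```python
from pythainlp.tokenize import sent_tokenize

thai_text = "..."  # Thai article text goes here
for segment in sent_tokenize(thai_text, engine="crfcut"):
    print(segment)
```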
Should have linked https://discourse.mozilla.org/t/future-of-the-sentence-extractor-your-input-is-required/78139 here as well.
Thanks everyone for the input here. I've added the possibility to use Python-based segmenters instead of rust-punkt in #150. You can find more information about it here: https://github.com/common-voice/cv-sentence-extractor#using-a-different-segmenter-to-split-sentences. Note that this is experimental for now, but would solve quite a few issues. I'll close this issue here, and we can file more issues if further adjustments are needed.