common-voice/cv-sentence-extractor

Punkt dependency and related issues

dodomorandi opened this issue · 6 comments

I was talking with @Mte90 about the needs for the nightly compiler on this project, and looking carefully this is probably the last problem.

You already know that punkt, which is the reason cv-sentence-extractor requires nightly compiler, is not currently maintained. This is obviously an issue, if you look carefully things get worse:

  1. There is a bad usage of unsafe blocks, which cause undefined behavior. I tried to demonstrate it using miri, but I needed to create a specific test (miri can be painfully slow and resource hungry for a normal test).
  2. The main reason for which a nightly compiler was needed is related to proc-macro hygiene. This was already solved in the past by David Tolnay with proc_macro_hack, but the feature has also been implemented in stable rust since 1.45.
  3. Benchmarks using nightly can be replaced with criterion and a bit of visibilty management depending on a feature.

I decided to fix all the previous points on my fork. If the dependency is valuable, I think that Mozilla should consider to help punkt in some way, because, if I am not wrong, it can be very important for the ecosystem of Common Voice. Keep in mind that, IMHO, there are lots of things that could be done to improve punkt, first of all replace all the cryptographically secure hashmaps and hashsets with something like fnv.

Mte90 commented

Let's see what @MichaelKohler say about this.

Thanks for filing this. These are definitely valid points, and should indeed be addressed if we keep using rust-punkt. However I currently doubt that keeping to use rust-punkt provides major value for the Sentence Extractor, and we most likely should re-evaluate architectural choices we've made in the past. One of that decisions IMHO is the usage of Rust for this project.

This opinion is mostly based on the following two points:

  • Availability of language related helper tools, such as tokenizers
  • Sentence splitting issues: #11

While fixing rust-punkt to work nicely for the languages it supports would be a first step, I'm concerned about supporting languages that rust-punkt does not support. Note that this is for now my opinion, but as it is right now, I personally think we'd be better of if we'd use Python for this project due to its massive usage for language related tasks.

Keep in mind that, IMHO, there are lots of things that could be done to improve punkt, first of all replace all the cryptographically secure hashmaps and hashsets with something like fnv.

While this is a laudable goal and possibly would improve rust-punk itself, this would not really help us in our current situation.

I understand your points, and obviously the amount of tools already available in a specific ecosystem can be determinant for a project like this.
The only thing I would like to point out is that the issue you linked is more than two years old, and performing major changes to the project (like changing from a tokenizer from another one) takes time. My two cents is that it is much better to adopt temporary solution than leave the project with a bugged and unmaintained library. Moreover, from my experience you only face the downsides of a new shiny tool only after it's being implemented -- this just to say that maybe a Python library will work flawlessly, but you need to be prepared for some sort of breakage.

And just a personal opinion, given by someone totally unaware of the problems around this project (so take it with a grain of salt): I searched for rust tokenizer and I found these two projects. Maybe these are not good solutions for cv-sentence-extractor, but at least it could worth give them a look.

Of course. My point was more about my own resources, I'd spend time investigating more general approaches rather than investing time in rust-punkt. Didn't make that that clear, sorry.

Maybe these are not good solutions for cv-sentence-extractor, but at least it could worth give them a look.

Thanks!

@dodomorandi I'm working on a way to use Python as sentence splitting scripts to solve #11. However I might still be interested in incorporating your fixes to rust-punkt and using that for the default instead of the currently published crate. Would you say your fork would be ready to be used? If so, I'll have a look at your changes.

It should be fine enough: as you can imagine from my initial post, I just improved the situation a little bit, in order to make the project compile on stable. I just decided to fix some clippy lints, just to cleanup things a bit more. If you find any particular issue just let me know.