common-voice/cv-sentence-extractor

Missing numbers in sentences

lovasoa opened this issue · 26 comments

If this is not the right place to report this, then feel free to close this issue. I am not really sure about where to report this.

When using Common Voice in French, I regularly get sentences that do not mean anything, and in these cases it often feels like there was a number in the original sentence that somehow got stripped out.

Could it be because of the way sentences are handled by this scraper?

Hi, thanks for filing this issue. Do you happen to remember one of these sentences? If not, can you please add one here the next time you encounter it? This will help us figure out where exactly it came from.

In theory, if there is no bug, sentences with numbers get rejected completely; we do not just remove the number. However, in the Sentence Collector you get a message that it got rejected, and maybe somebody just removed the numbers and submitted anyway?

Is there a way to access the list of sentences that have been reported as containing a grammatical error on voice.mozilla.org?

That may be faster than waiting until I encounter another instance of the problem among the sentences I am presented with.

Is there a way to access the list of sentences that have been reported as containing a grammatical error on voice.mozilla.org?

Maybe @phirework could help here :)

@mbransn I was told devs are busy, can you give access to these sentences? We're interested in the reported French sentences.

@MichaelKohler we're still formalizing access to reported sentences to ensure privacy & data compliance, and sadly it's not as simple as granting access to a specific segment, e.g. French sentences with grammar errors. Building the tagging mechanism was one thing; building in access to view reports is another that sadly is in our backlog. 😬

For this to happen currently we'd have to employ programming time to dig for those, and as you've heard, the devs are quite focused on infrastructure and code base improvements to support our current scale. Thanks for your patience as we triage and work up to features like these. cc @phirework for visibility.

If privacy is a problem, maybe you could quickly have a look at the reported (English) sentences yourself and paste a few here for which the scraper seems to be the issue? That could already be very helpful.

Let's not make a mess here. This is not per se about access to all reported sentences, let's keep that in an issue in the voice-web repo, if there isn't one already. I've changed the title back. Having access to these is only one way to investigate the issue here. @lovasoa please report here once you encounter another one of these.

have a look at the reported (English) sentences

Why English? I thought the initial report was for French sentences?

@MichaelKohler community members with access to Kibana can query the reported_sentences table, so it might be possible to generate a report. Can you check with the contributors who have access?

Hi all, apologies as I misunderstood the initial ask. I've attached a text file containing the French sentences that have been reported for having a grammar or spelling issue in 2020, hope that helps.
reported_fr_grammar.txt

Thank you :)

Hey @phirework, could you create a file like this for German and Esperanto too, please? I am preparing another PR for these anyway, so I could delete the sentences along the way.

It would be great if a list like this could be made available for every language around two months before every dataset release, so that the communities have a chance to clean their dataset.

@phirework Great, thank you very much, this is exactly what we needed!

Here is an example of a sentence with a missing number:

Elle se rencontre à d'altitude dans la municipalité de Guaraqueçaba.

There should be a number and a unit of measurement between à and d'altitude. This must be a problem with template expansion.

Bruce Toussaint est né le à Asnières-sur-Seine.

Here, a date is missing after le.

Le vin de pays des Alpilles labellise environ hectolitres par an.

A number is missing after environ.

La rencontre a lieu le au Costa Rica.

A date is missing after le.

Le quatuor français compte jusqu'à à de la ligne d'arrivée.

Two numbers are missing: before and after à.

Enfin, les censeurs du début du se soucient énormément de la protection de l'enfance.

A Roman numeral is missing.

The majority of the sentences in the file are missing numbers. This is good news, because if this is indeed coming from a single bug, fixing it will drastically reduce the overall number of errors.

Also, I would advise checking other languages, because this doesn't look like a French-specific issue.

I've quickly checked where these are coming from, and they all come from the Wikipedia extraction. So this issue is indeed in the right place. My assumption right now is that the WikiExtractor drops certain formatting. Of course, that's just an assumption.

Thanks for providing these examples; I was able to figure out what is going wrong here. Here are two source code examples from Wikipedia:

```
des Alpilles labellise environ {{formatnum:6000}} hectolitres
Bruce Toussaint est né le {{date de naissance-|17|octobre|1973}}
```

These involve Magic Words, which the WikiExtractor seems to have issues with. There is a bug filed for this, but there doesn't seem to be a solution we could use here: attardi/wikiextractor#189. This unfortunately means we can't do anything about this :(
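To illustrate the symptom (this is a hypothetical reproduction, not WikiExtractor's actual code): if an unhandled `{{...}}` template is dropped wholesale, the number vanishes while the surrounding sentence survives, which is exactly what the reported sentences look like:

```python
import re

# Hypothetical illustration, not WikiExtractor's actual code: dropping any
# unhandled {{...}} template wholesale reproduces the reported symptom.
def drop_templates(text):
    return re.sub(r"\{\{[^{}]*\}\}", "", text)

wikitext = "Le vin de pays des Alpilles labellise environ {{formatnum:6000}} hectolitres par an."
print(drop_templates(wikitext))
# The number is gone, leaving "... labellise environ  hectolitres par an."
```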

Maybe we could preprocess the Wikipedia dump before passing it to the WikiExtractor?
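A rough sketch of what such a preprocessing pass might look like, purely hypothetical and untested on a real dump (the helper name and dump file name are placeholders):

```python
import re

# Hypothetical sketch: expand {{formatnum:...}} before WikiExtractor runs.
FORMATNUM = re.compile(r"\{\{formatnum:\s*([.,0-9]+)\}\}")

def preprocess_line(line):
    # Replace the template with its bare numeric argument.
    return FORMATNUM.sub(r"\1", line)

# A full dump could then be streamed line by line, e.g. with bz2.open() on
# frwiki-latest-pages-articles.xml.bz2, writing each preprocessed line out.

print(preprocess_line("labellise environ {{formatnum:6000}} hectolitres"))
# labellise environ 6000 hectolitres
```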

Edit: does not work as expected. Work continues.

This could work; I have not tested it. Place to put it (also to be tested): before `# Drop tables`, in `def wiki2text(self, text):`

```python
text = re.sub(r'{{formatnum:\s?(?P[.,0-9]*)}}', '\g', text)
```

If this works, it could be added permanently. I have suggested the same solution for the 'As Of' tags, also to be tested.

Maybe we could preprocess the Wikipedia dump before passing it to the WikiExtractor?

Given how long it already takes to download, extract and run the WikiExtractor, I don't think any additional script is feasible. Additionally, I strongly think this should be fixed on the WikiExtractor side so others can benefit from it too.

Additionally, I do not see this as a major issue. While we of course want the error rate to be as low as possible, there is a threshold we're comfortable with.

Place to put it (also to be tested): before `# Drop tables`, in `def wiki2text(self, text):`

```python
text = re.sub(r'{{formatnum:\s?(?P[.,0-9]*)}}', '\g', text)
```

@HjalmarrSv thanks for this idea. I've only analyzed two of the reported sentences, and it seems it's not only about formatnum. There is, for example, also `{{date de naissance-|17|octobre|1973}}`.
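For the date case, a similar substitution could reassemble the pipe-separated arguments. This is equally hypothetical and untested against the full range of date templates; the pattern and function name are illustrative only:

```python
import re

# Hypothetical sketch for French date templates such as
# {{date de naissance-|17|octobre|1973}} -> "17 octobre 1973".
DATE = re.compile(r"\{\{date[^|{}]*\|(\d+)\|([^|{}]+)\|(\d+)[^{}]*\}\}")

def expand_date(text):
    return DATE.sub(r"\1 \2 \3", text)

print(expand_date("Bruce Toussaint est né le {{date de naissance-|17|octobre|1973}} à Asnières-sur-Seine."))
# Bruce Toussaint est né le 17 octobre 1973 à Asnières-sur-Seine.
```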

Let's keep the discussion in attardi/wikiextractor#189 though.

Edit: does not work as expected

Sorry about that! I should have escaped the `<` tags in this comment field; it looks like the variable names `?P<num>` and `\g<num>` were stripped. The corrected line is:

```python
text = re.sub(r'{{formatnum:\s?(?P<num>[.,0-9]*)}}', '\g<num>', text)
```

It may be easier to just clean out the tags instead of using a capture group.

Note that this solution is not [[TEMPLATE]]-friendly. For template expansion to work, this can be tested:

```python
text = re.sub(r'{{formatnum:\s?(?P<num>[^}]*)}}', '\g<num>', text)
```
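A quick self-check of the two substitutions discussed in this thread (wrapped in hypothetical helper functions for testing; the patterns themselves are the ones quoted above):

```python
import re

def strip_formatnum(text):
    # Numeric-argument variant from the corrected comment above.
    return re.sub(r'{{formatnum:\s?(?P<num>[.,0-9]*)}}', r'\g<num>', text)

def strip_formatnum_any(text):
    # Template-friendly variant: keeps whatever is inside the braces.
    return re.sub(r'{{formatnum:\s?(?P<num>[^}]*)}}', r'\g<num>', text)

print(strip_formatnum("environ {{formatnum:6000}} hectolitres"))
# environ 6000 hectolitres
print(strip_formatnum_any("environ {{formatnum:6 000}} hectolitres"))
# environ 6 000 hectolitres
```

Note the raw-string replacement `r'\g<num>'`: newer Python versions warn about the unescaped `\g` in a plain string literal.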

@HjalmarrSv: You can write code in GitHub comments by putting it between backticks:

```python
# your python code here
```

@MichaelKohler Is it in the scope of Mozilla Common Voice to work on fixing upstream issues like this one? Is someone from Mozilla going to work on this?

Let's not make a mess here. This is not per se about access to all reported sentences, let's keep that in an issue in the voice-web repo, if there isn't one already. I've changed the title back.

@MichaelKohler I definitely mistook this repo for voice-web, apologies. Thanks for the correction. 🤦‍♂ Agreed re: keeping here.

Is it in the scope of Mozilla Common Voice to work on fixing upstream issues like this one? Is someone from Mozilla going to work on this?

@lovasoa the scope of the Wiki Extractor work is in discussion with the Mozilla Common Voice team and on our roadmap for review and scoping in Q1 (the outlook is February for initial discussions). We are looking to balance the needs of the Sentence Collector and the Wiki Extractor alongside voice.mozilla.org.