common-voice/cv-sentence-extractor

A lot of names

Mte90 opened this issue · 4 comments

Mte90 commented

In the extracted sentences I see a lot of lines that contain only a person's name and surname, like:

Luís Rocha
Beatriz Ortiz
Laura López
Jon Elmore
Fatal Fury
Brian Joy
Pedro Sass

So I don't think cv-tools is failing here. I think it is the scraper, which maybe takes them from the page title or from a link to a bio?

These are very difficult for Italians to read, and people don't understand why they are there.

Without knowing where exactly these are coming from, it's hard to come to a conclusion here.

Mte90 commented

They are extracted by the scraper. It seems that cvtools cannot detect them, probably because some of the surnames are also verbs, for example.
I think an option in the scraper that detects names could help to keep only sentences and not names, for example:

  • if the sentence is two words
  • one of them is a name like John ignore it

I don't know why there are lines with only a name. Maybe it is the title of the page, which is extracted by the tool itself.
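To make the two conditions above concrete, here is a rough sketch in Python. It is not code from cv-sentence-extractor or cvtools: the function names and the `KNOWN_FIRST_NAMES` list are made up for illustration, and a real list of first names would have to be much larger and language-specific.

```python
# Hypothetical filter, NOT part of cv-sentence-extractor or cvtools.
# KNOWN_FIRST_NAMES is a tiny illustrative list (an assumption).
KNOWN_FIRST_NAMES = {"john", "luís", "beatriz", "laura", "jon", "brian", "pedro"}


def looks_like_name_only(sentence, first_names=KNOWN_FIRST_NAMES):
    """Return True for a two-word line where at least one word is a known first name."""
    words = sentence.split()
    if len(words) != 2:
        return False
    # Strip surrounding punctuation and compare case-insensitively.
    return any(word.strip(".,;:!?").lower() in first_names for word in words)


def drop_name_lines(sentences, first_names=KNOWN_FIRST_NAMES):
    """Keep only the lines that do not look like a bare name."""
    return [s for s in sentences if not looks_like_name_only(s, first_names)]
```

With this small list, "Luís Rocha" and "Jon Elmore" would be dropped, but "Fatal Fury" would stay because neither word is in the list. A real two-word sentence such as "John runs." would also be dropped, so a check like this can only be a rough heuristic.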

if the sentence is two words

That is already configurable.

one of them is a name like John ignore it

That will be very hard to detect.

I think this could be the same as #57, so I'm closing this as a duplicate for now.

Mte90 commented

Yes, the two-word limit is configurable, but that control alone is not enough for our needs.