common-voice/cv-sentence-extractor

Convert symbols / abbreviations to words

Closed this issue · 8 comments

Some sentences have symbols or abbreviations that need to be spelled out directly, such as:

The main provider of international bus connection in Bosnia & Herzegovina is Eurolines.

He was buried at St. Peter’s, and was succeeded by Pope Paschal I.

First publications of his compositions included the String Quartet No.

I found sentences with the following:

& - should be "and"
Jr. - should be "Junior"
Sr. - should be "Senior"
No. - should be "Number"
Mt. - should be "Mount"
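
To make the list above concrete, here is a minimal sketch of a find-and-replace pass in Python. It is not the extractor's actual code: the `REPLACEMENTS` table and the `spell_out` helper are hypothetical names, and the lookbehind that stops a match from starting in the middle of a word is my own assumption about what would be safe.

```python
import re

# Hypothetical replacement table built from the pairs listed above.
REPLACEMENTS = [
    ("&", "and"),
    ("Jr.", "Junior"),
    ("Sr.", "Senior"),
    ("No.", "Number"),
    ("Mt.", "Mount"),
]


def spell_out(sentence: str) -> str:
    """Replace each known symbol or abbreviation with its spelled-out word."""
    for abbreviation, word in REPLACEMENTS:
        # (?<!\w) keeps the match from starting inside a longer word.
        pattern = r"(?<!\w)" + re.escape(abbreviation)
        sentence = re.sub(pattern, word, sentence)
    return sentence


print(spell_out("The main provider of international bus connection "
                "in Bosnia & Herzegovina is Eurolines."))
```

The match is case-sensitive, so an ordinary word that happens to end a sentence with "no." is left alone.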

There are also a lot of Roman numerals, but I'm not really sure what to do with them as it's not quite as simple as just find and replace. An "I" in a sentence can mean "one", "the first" or the literal letter "I".
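
A rough sketch of why this is hard, assuming we only want to flag candidate tokens for review rather than replace them. The `ROMAN` pattern and the `roman_tokens` helper are hypothetical names for illustration; the pattern is the usual textbook one for numerals up to 3999.

```python
import re

# Standard-form Roman numerals up to 3999; also matches the empty string,
# so empty tokens are filtered out separately below.
ROMAN = re.compile(r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")


def roman_tokens(sentence: str) -> list[str]:
    """Return the words in a sentence that look like Roman numerals."""
    found = []
    for word in sentence.split():
        token = word.strip(".,;:!?\"'")
        if token and ROMAN.fullmatch(token):
            found.append(token)
    return found


print(roman_tokens("He was succeeded by Pope Paschal I."))
print(roman_tokens("I found sentences with Louis XIV."))
```

Both sentences flag "I", although only the first one is a numeral. A pattern alone cannot tell the pronoun from the number, which is why a blind replacement is not safe here.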

For the German rules I just discard known abbreviations. However, it might be nice to provide a way to configure replacements so we do not lose sentences (mostly relevant to languages with smaller Wikipedias).

I agree. German has a large Wikipedia, so it can afford to skip sentences, but the Frisian wiki corpus is very small, so every sentence is needed. The list of known abbreviations is probably not large; I guess around 10 abbreviations.

@MichaelKohler please ensure that the replacements are applied before abbreviation_patterns is checked.

Can someone tell me what the status of abbreviations is while scraping sentences?
Are we going to allow them for scraping?

  • If not, how do we skip these sentences (given that a "." is not always the end of a sentence)?
  • If so, do we keep them within the sentences ‘as is’ and let the user read out the full form of the abbreviation (e.g. "etc." is spoken as "etcetera"), or are we going to do a search and replace somewhere during the scraping process?
    I would really like to know before proceeding with the next run.

@Fjoerfoks @dabinat I've created a PR for this. Would you mind having a quick look at the README changes at https://github.com/Common-Voice/common-voice-wiki-scraper/pull/70/files#diff-04c6e90faac2675aa89e2176d2eec7d8 and tell me if that's how you imagined it working?

That's the idea indeed.
My first thought is to use an empty replacement for abbreviations with multiple possible expansions, like ‘Mr.’, which can be ‘mister’ or ‘master’. Leaving it blank will still give a valid sentence. (The pattern probably needs to include the space after the "." too.)
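
A small sketch of the whitespace concern, using a made-up sentence. The `drop_abbreviation` helper is hypothetical and only illustrates one way to tidy up after an empty replacement; including the trailing space in the pattern itself would work as well.

```python
def drop_abbreviation(sentence: str, abbreviation: str) -> str:
    """Remove an abbreviation and collapse the whitespace it leaves behind."""
    without = sentence.replace(abbreviation, "")
    return " ".join(without.split())


sentence = "Yesterday Mr. Smith arrived."  # made-up sample sentence

# Replacing only "Mr." leaves a double space behind.
print(repr(sentence.replace("Mr.", "")))

# Either include the trailing space in the pattern ...
print(repr(sentence.replace("Mr. ", "")))

# ... or normalise whitespace afterwards.
print(repr(drop_abbreviation(sentence, "Mr.")))
```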

Previous question still stands: Are we going to allow abbreviations for scraping?
If we don't, will the script just jump to the next line of the Wikipedia article and, if it is valid, put it in the text file? If that is the case, it might be better to disallow abbreviations, which I think would speed up the scraping process significantly. I guess the total number of sentences would not decrease then.

Are we going to allow abbreviations for scraping?

No, that's what abbreviation_patterns is for. Anything matching it will be filtered out. Since that filters out quite a few sentences, the PR allows you to replace abbreviations before the filter check is done, so these sentences are no longer lost (provided there is a replacement pattern for them).
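
The ordering can be sketched roughly as follows. This is an illustration in Python, not the extractor's implementation: `ABBREVIATION_PATTERN`, `extract`, and the sample sentences are all hypothetical, and the real abbreviation_patterns rule for a given language will look different.

```python
import re

# Hypothetical stand-in for a language's abbreviation_patterns rule.
ABBREVIATION_PATTERN = re.compile(r"\b(?:Jr|Sr|No|Mt|Mr)\.")

# Made-up sample sentences for illustration.
SAMPLES = [
    "They climbed Mt. Everest in spring.",
    "Mr. Jones left early.",
    "The bus leaves at noon.",
]


def extract(sentences, replacements):
    """Apply replacements first, then drop anything that still matches."""
    kept = []
    for sentence in sentences:
        for abbreviation, word in replacements:
            sentence = sentence.replace(abbreviation, word)
        if ABBREVIATION_PATTERN.search(sentence):
            continue  # still contains an abbreviation: filtered out
        kept.append(sentence)
    return kept


print(extract(SAMPLES, []))                   # filter only
print(extract(SAMPLES, [("Mt.", "Mount")]))   # replace, then filter
```

Without replacements only the last sentence survives. With the "Mt." replacement the first sentence is kept as well, while the "Mr." sentence is still dropped because nothing replaces it.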