Convert symbols / abbreviations to words

Question

Closed this issue 5 years ago · 8 comments

dabinat commented 5 years ago

Some sentences have symbols or abbreviations that need to be spelled out directly, such as:

The main provider of international bus connection in Bosnia & Herzegovina is Eurolines.

He was buried at St. Peter’s, and was succeeded by Pope Paschal I.

First publications of his compositions included the String Quartet No.

I found sentences with the following:

& - should be "and"
Jr. - should be "Junior"
Sr. - should be "Senior"
No. - should be "Number"
Mt. - should be "Mount"

There are also a lot of Roman numerals, but I'm not really sure what to do with them as it's not quite as simple as just find and replace. An "I" in a sentence can mean "one", "the first" or the literal letter "I".
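The simple cases in the list above can be sketched as a plain find and replace. This is only an illustration in Python, not the scraper's actual code: the `REPLACEMENTS` table and the `spell_out` function are hypothetical names, and Roman numerals are deliberately left untouched because they need context to resolve.

```python
# Hypothetical replacement table built from the list in this comment.
REPLACEMENTS = [
    ("&", "and"),
    ("Jr.", "Junior"),
    ("Sr.", "Senior"),
    ("No.", "Number"),
    ("Mt.", "Mount"),
]


def spell_out(sentence: str) -> str:
    """Replace each known symbol or abbreviation with its spoken form.

    This is a naive, case-sensitive substring replacement. It does not
    check word boundaries and it does not touch Roman numerals.
    """
    for abbreviation, word in REPLACEMENTS:
        sentence = sentence.replace(abbreviation, word)
    return sentence


print(spell_out("The highest peak in the area is Mt. Olympus."))
```

Note that the replacement consumes the period of the abbreviation, so a sentence that ends in "No." loses its final full stop. A real rule set would probably need to handle that case separately.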

Answer 1 · 2019-07-12T05:06:48.000Z

For the German rules I just discard sentences with known abbreviations. However, it might be nice to provide a way to configure replacements so we do not lose sentences (mostly relevant to non-major Wikipedia languages).

Answer 2 · 2019-12-11T10:23:37.000Z

I agree. German is a large language, so we can afford to skip sentences, but for Frisian the wiki corpus is very small, so every sentence is needed. The list of known abbreviations is probably short, I guess around 10 entries.

Answer 3 · 2020-01-07T07:12:06.000Z

@MichaelKohler please ensure that the replacements are applied before abbreviation_patterns is checked.

Answer 4 · 2020-01-12T15:28:48.000Z

Can someone tell me what the status of abbreviations is while scraping sentences?
Are we going to allow them for scraping?

If not, how do we skip these sentences (given that a "." is not always the end of a sentence)?
If yes, do we keep them within the sentences "as is" and let the user read out the full form of the abbreviation (e.g. "etc." is spoken as "etcetera"), or are we going to do a search and replace somewhere during the scraping process?
I would really like to know before proceeding with the next run.

Answer 5 · 2020-01-12T16:16:33.000Z

@Fjoerfoks answered at https://discourse.mozilla.org/t/technical-feedback-needed-wikipedia-extractor-script-beta/42983/45

Answer 6 · 2020-01-16T22:27:26.000Z

@Fjoerfoks @dabinat I've created a PR for this. Would you mind having a quick look at the README changes at https://github.com/Common-Voice/common-voice-wiki-scraper/pull/70/files#diff-04c6e90faac2675aa89e2176d2eec7d8 and tell me if that's how you imagined it working?

Answer 7 · 2020-01-17T12:55:23.000Z

That's the idea indeed.
My first thought is to use an empty replacement for abbreviations with multiple possible expansions, like ‘Mr.’, which can be ‘mister’ or ‘master’. Leaving the replacement blank will still give a valid sentence (the pattern probably needs a space after the "." too).
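A minimal sketch of the empty-replacement idea, written in Python for illustration only. The table and function names are hypothetical and do not reflect the scraper's real configuration format. The keys include the trailing space so that removing an abbreviation does not leave a double space behind.

```python
# Hypothetical table: an empty string drops the abbreviation entirely.
REPLACEMENTS = {
    "Mt. ": "Mount ",
    "Mr. ": "",  # ambiguous ("mister" or "master"), so it is removed
}


def apply_replacements(sentence: str) -> str:
    """Apply every configured replacement to the sentence, in table order."""
    for abbreviation, replacement in REPLACEMENTS.items():
        sentence = sentence.replace(abbreviation, replacement)
    return sentence


print(apply_replacements("Then Mr. Smith climbed Mt. Etna."))
# prints "Then Smith climbed Mount Etna."
```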

My previous question still stands: are we going to allow abbreviations for scraping?
If we don't, will the script just jump to the next line of the Wikipedia article and, if it is valid, put it in the text file? If that is the case, it might be better to disallow abbreviations, which I think would speed up the scraping process significantly. I guess the total number of sentences would not decrease then.

Answer 8 · 2020-01-17T14:00:17.000Z

Are we going to allow abbreviations for scraping?

No, that's what abbreviation_patterns is for. Anything matching it will be filtered out. Since that filters out quite a few sentences, we now have that PR, which allows you to replace abbreviations before the filter check is done, so these sentences are not lost (if there is a replacement pattern for them).
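The ordering described here (replace first, then filter) can be sketched as follows. This is a Python illustration under assumptions, not the scraper's implementation: `REPLACEMENTS`, `ABBREVIATION_PATTERNS` and `process` are hypothetical names, and the regular expression is a made-up stand-in for whatever a language's rules actually define.

```python
import re
from typing import Optional

# Hypothetical rules; real values would come from the language's rule file.
REPLACEMENTS = [("Mt.", "Mount")]
# Made-up filter: a capitalised word of two or three letters followed by
# a period and whitespace, e.g. "St. " or "Mr. ".
ABBREVIATION_PATTERNS = [re.compile(r"\b[A-Z][a-z]{1,2}\.\s")]


def process(sentence: str) -> Optional[str]:
    """Apply replacements first, then drop the sentence if any
    abbreviation pattern still matches."""
    for abbreviation, word in REPLACEMENTS:
        sentence = sentence.replace(abbreviation, word)
    if any(pattern.search(sentence) for pattern in ABBREVIATION_PATTERNS):
        return None  # filtered out
    return sentence


print(process("He climbed Mt. Everest."))        # kept, with "Mount"
print(process("He was buried at St. Peter's."))  # no replacement, filtered
```

With the order reversed, the first sentence would also be filtered out, because "Mt. " matches the pattern before it can be replaced.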