Convert symbols / abbreviations to words
Closed this issue · 8 comments
Some sentences have symbols or abbreviations that need to be spelled out directly, such as:
The main provider of international bus connection in Bosnia & Herzegovina is Eurolines.
He was buried at St. Peter’s, and was succeeded by Pope Paschal I.
First publications of his compositions included the String Quartet No.
I found sentences with the following:
& - should be "and"
Jr. - should be "Junior"
Sr. - should be "Senior"
No. - should be "Number"
Mt. - should be "Mount"
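For what it's worth, the list above could be sketched roughly like this in Python. This is only an illustration, not how the scraper does it: `REPLACEMENTS` and `expand_abbreviations` are made-up names, and the "standalone token" rule is my own assumption.

```python
import re

# Hypothetical replacement table built from the list above.
REPLACEMENTS = {
    "&": "and",
    "Jr.": "Junior",
    "Sr.": "Senior",
    "No.": "Number",
    "Mt.": "Mount",
}


def expand_abbreviations(sentence: str) -> str:
    """Replace each known symbol/abbreviation with its spelled-out word."""
    for abbreviation, word in REPLACEMENTS.items():
        # Only match when the abbreviation is not glued to other word
        # characters, so "AT&T" or "No.5" are left untouched.
        pattern = r"(?<!\w)" + re.escape(abbreviation) + r"(?!\w)"
        sentence = re.sub(pattern, word, sentence)
    return sentence


print(expand_abbreviations("He climbed Mt. Everest & came back."))
```

One caveat: when an abbreviation ends the sentence (as in the "String Quartet No." example), the replacement also swallows the final period, so that would need separate handling.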
There are also a lot of Roman numerals, but I'm not really sure what to do with them as it's not quite as simple as just find and replace. An "I" in a sentence can mean "one", "the first" or the literal letter "I".
For the German rules I just discard known abbreviations. However, it might be nice to provide a way to configure replacements so we do not lose sentences (mostly relevant to non-major Wikipedia languages).
I agree. German is a large language, so it is easy to skip sentences, but for Frisian the wiki-corpus is very small, so every sentence is needed. A list with known abbreviations is probably not large, I guess around 10 abbreviations.
@MichaelKohler please ensure that the replacement is performed before abbreviation_patterns is applied.
Can someone tell me what the status is of abbreviations while scraping sentences?
Are we going to allow them for scraping?
- If no, how do we skip these sentences (given that a "." is not always the end of a sentence)?
- If yes, do we keep them within the sentences ‘as is’ and let the user read out the full form of the abbreviation (e.g.: "etc." is spoken as "etcetera"), or are we going to do a search and replace somewhere during the scraping process?
I would really like to know before proceeding with the next run.
@Fjoerfoks @dabinat I've created a PR for this. Would you mind having a quick look at the README changes at https://github.com/Common-Voice/common-voice-wiki-scraper/pull/70/files#diff-04c6e90faac2675aa89e2176d2eec7d8 and tell me if that's how you imagined it working?
That's the idea indeed.
First thought is that I am going to use the empty replacement for abbreviations with multiple possible replacements, like ‘Mr.’, which can be ‘mister’ or ‘master’. Leaving it blank will still give a valid sentence. (probably needs a space after the . too).
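To illustrate the point about the trailing space, here is a small Python sketch of an "empty replacement". The function name `drop_abbreviation` is made up for this example and is not part of the scraper; it just shows why the whitespace after the "." has to go as well.

```python
import re


def drop_abbreviation(sentence: str, abbreviation: str) -> str:
    """Remove an ambiguous abbreviation together with the whitespace after it."""
    # Without the trailing \s*, "I saw Mr. Smith." would become
    # "I saw  Smith." (two spaces).
    pattern = r"(?<!\w)" + re.escape(abbreviation) + r"\s*"
    return re.sub(pattern, "", sentence)


print(drop_abbreviation("I saw Mr. Smith.", "Mr."))
```

In a plain find-and-replace config, the same effect could presumably be had by listing the abbreviation including its trailing space, but that depends on how the replacement rules end up being matched.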
Previous question still stands: Are we going to allow abbreviations for scraping?
If we don't, will the script just jump to the next line of the Wikipedia article and, if valid, put it in the text file? If that is the case it might be better to disallow abbreviations, which I think will speed up the scraping process significantly. The total number of sentences will not decrease then, I guess.
Are we going to allow abbreviations for scraping?
No, that's what abbreviation_patterns is for. Anything matching that will be filtered out. Since that filters out quite a few sentences, we will now have that PR, which allows you to replace abbreviations before the filter check is done, therefore not losing these sentences (if there is a replacement pattern for it).
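As a rough illustration of that ordering (replace first, then filter), here is a Python sketch. The names `REPLACEMENTS`, `ABBREVIATION_PATTERN` and `process_sentence` and the concrete pattern are assumptions for the example only; the actual rule format is whatever the README in the PR describes.

```python
import re
from typing import Optional

# Hypothetical per-language rules, for illustration only.
REPLACEMENTS = [("Mt.", "Mount"), ("Jr.", "Junior")]
ABBREVIATION_PATTERN = re.compile(r"\b(?:Jr|Sr|No|Mt|St)\.")


def process_sentence(sentence: str) -> Optional[str]:
    """Apply replacements first, then drop the sentence if an abbreviation remains."""
    # Step 1: spell out every abbreviation we have a replacement for.
    for abbreviation, word in REPLACEMENTS:
        sentence = sentence.replace(abbreviation, word)
    # Step 2: the filter check runs on the already-replaced text, so only
    # sentences with abbreviations we could not replace are lost.
    if ABBREVIATION_PATTERN.search(sentence):
        return None
    return sentence


print(process_sentence("He climbed Mt. Everest."))       # kept, with "Mount"
print(process_sentence("He was buried at St. Peter's."))  # filtered out
```

If the two steps were swapped, the first sentence would already be discarded by the filter before the replacement had a chance to rescue it.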