German IPA transcriptions from the English Wiktionary are incomplete
Opened this issue · 9 comments
The German .tsv files only have around 35,000 transcriptions.
However, there are certainly more than 600,000 IPA transcriptions in the German Wiktionary. I recently obtained ~670,000 IPA transcriptions with this script:
Download as .txt: https://mega.nz/file/cVg1iAKJ#xB_qoctX9eYaD5JDfiqLnYh4PWKdpEOL_J4fkBGubFY
Therefore, it seems that less than 10% of the IPA transcriptions from the German Wiktionary are published in the .tsv. The same is true of the French IPA transcriptions: I recently obtained ~286,000 IPA transcriptions with another script, while the French .tsv of this project contains only ~57,000.
It seems that many IPA transcriptions are being ignored by WikiPron.
Would it be possible to comprehensively scrape all the IPA transcriptions?
I am interested in making pronunciation dictionaries for GoldenDict that are useful to language learners. I have already made German and French dictionaries and published them freely.
I would like to obtain a comprehensive list of IPA transcriptions for other languages (English, Italian, etc.) to convert them into the .dsl format for GoldenDict.
Hi, I have a few questions to make this actionable on our side:
- Is it possible that the German data has been enriched in the last year or so, since we last ran the pipeline to generate the pronunciation TSV files? If so, that would account for the issue and could be fixed by us simply running the pipeline again. (One could likely test this just by running `wikipron --phonemic ger` or `wikipron --phonetic ger` and seeing how much it pulls down in November 2021 vs. last year.)
- When a pronunciation is in that dump you mention but not in WikiPron, how does the page differ? Often we need to use custom HTML extractors for particular languages, and it could be that German is one of them.
I'm not familiar with GoldenDict, but you're encouraged to repackage the data for whatever purpose, so long as licensing permits.
> Is it possible that German data has been enriched in the last year or so since we last ran the pipeline to generate the pronunciation TSV files?
No. I am sure that the German Wiktionary already had more than 500,000 IPA transcriptions last year. The German editors are very proactive about IPA and audio recordings: almost all entries have IPA, and there are currently more than 700,000 audio recordings.
> When a pronunciation is in that dump you mention but not in WikiPron, how does the page differ?
As an example, the adjective "wonnetrunken" (blissful, merry) is missing from the .tsv. German adjectives have declensions, and in this case the IPA transcriptions of the declined forms are also missing.
https://de.wiktionary.org/wiki/wonnetrunken
Wiki pages (Wiktionary, Wikipedia, etc.) can be downloaded as .zim files and used in GoldenDict. The word "wonnetrunken" shows the IPA in the .zim format:
For more examples of missing entries, please check the .txt file that I uploaded above.
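To see exactly which entries are missing, one could diff the word lists of the dump and the TSV. A minimal sketch: the sample entries below are in-memory stand-ins for the real files, which can be passed in as open file handles instead.

```python
def load_words(lines, sep="\t"):
    """Return the set of words in the first column of delimited lines."""
    words = set()
    for line in lines:
        line = line.strip()
        if line:
            words.add(line.split(sep)[0])
    return words

# Tiny stand-ins for the dump .txt and the WikiPron .tsv:
dump = load_words(["wonnetrunken\tˈvɔnəˌtʁʊŋkn̩", "Haus\thaʊ̯s"])
tsv = load_words(["Haus\thaʊ̯s"])

# Entries present in the dump but absent from the TSV:
missing = dump - tsv
print(sorted(missing))
```

With the real files, one would pass `open(path, encoding="utf-8")` directly, since file handles iterate line by line.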
The .tsv files containing IPA transcriptions in WikiPron are incomplete. Many languages are only being scraped partially, and German is a good example.
@bmilde and @tatuylonen
Would you have any ideas to help us solve this issue?
I understand that each Wiktionary edition has different templates, and it might be very hard to deal with those variations. Please let me know if I can help! Thanks for the hard work!
Oh, I understand now. A larger number of IPA pronunciations for German are available, but only on the German edition of Wiktionary itself. We don't have the means to scrape anything but the English edition; it's the richest overall, though often the X edition of X pages (where X is a language) has more information than the English edition of X pages. It's something I'd like to target eventually, but I haven't given it much thought yet.
Thanks for the help clarifying the issue.
The WikiPron script is also missing IPA transcriptions of English terms in the English Wiktionary.
It seems that there are more than 100,000 IPA transcriptions in the .json files published by @tatuylonen:
https://kaikki.org/index.html
The .tsv file of WikiPron only contains around 50,000.
I hope the issue can be solved. Thanks again for your great work!
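For anyone who wants to verify the count, the kaikki.org files are JSON Lines (one entry per line). A sketch of counting IPA-bearing entries, assuming each entry's pronunciations sit under a `"sounds"` list with `"ipa"` keys; that field layout is my reading of the format and should be checked against an actual dump:

```python
import json

def count_ipa_entries(lines):
    """Count JSON Lines entries that carry at least one IPA transcription."""
    count = 0
    for line in lines:
        entry = json.loads(line)
        # "sounds" may be absent; each sound dict may or may not have "ipa".
        if any("ipa" in sound for sound in entry.get("sounds", [])):
            count += 1
    return count

# Two stand-in entries: one with IPA, one without.
sample = [
    '{"word": "word", "sounds": [{"ipa": "/w\\u025c\\u02d0d/"}]}',
    '{"word": "silent"}',
]
print(count_ipa_entries(sample))  # → 1
```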
This is really interesting, and it could potentially be done for German, but it seems like it might be impractical to implement generally.
The script extracts the IPA from a word page through an XPath expression. The DOM for word pages on the German Wiktionary is structured totally differently from that of the English Wiktionary, so we'd have to write a different XPath to extract the IPA. We'd also have to do this for the other 100+ language editions of Wiktionary. Also, languages that use other writing systems sometimes don't use IPA on Wiktionary (e.g. Chinese and Bopomofo), so that would be a whole other problem.
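For illustration, here is a minimal stdlib-only extractor for the kind of markup such a page might use. The `<span class="ipa">` selector is an assumption about the German Wiktionary's markup, not WikiPron's actual XPath; the point is only that each edition would need its own selector.

```python
from html.parser import HTMLParser

class IPAExtractor(HTMLParser):
    """Collect the text inside <span class="ipa"> elements."""

    def __init__(self):
        super().__init__()
        self._depth = 0  # >0 while inside an IPA span (handles nested spans)
        self.ipas = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            if tag == "span":
                self._depth += 1
        elif tag == "span" and dict(attrs).get("class") == "ipa":
            self._depth = 1
            self.ipas.append("")

    def handle_endtag(self, tag):
        if self._depth and tag == "span":
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.ipas[-1] += data

# Tiny stand-in for a page fragment (hypothetical markup):
page = '<p>IPA: <span class="ipa">ˈvɔnəˌtʁʊŋkn̩</span></p>'
parser = IPAExtractor()
parser.feed(page)
print(parser.ipas)
```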
We could still try to do this for a few languages, even if we can't generalize it fully. I'll try that in the morning.
So, I don't think we want (at present) to support non-English Wiktionary editions. It's just too big a lift, and where would it end?
However, if what's called for is that the German DOM is actually different, we already have a relatively mature plug-in interface for this:
- default extractor
- example of a non-default extractor for Mandarin (many more in that directory---all of the existing ones are for East Asian languages)
- how we register the non-default extractors
All that would be needed is to create a non-default extractor `deu.py` for German and then register it and rescrape.
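As a toy model of that plug-in pattern: all names here ("EXTRACTORS", "extract_deu", the `Aussprache:` prefix) are illustrative only, and the real interface lives in WikiPron's extract directory.

```python
from typing import Callable, Dict, List

# Registry mapping ISO 639 codes to language-specific extractors.
EXTRACTORS: Dict[str, Callable[[str], List[str]]] = {}

def register(code: str):
    """Decorator that registers a non-default extractor for a language."""
    def wrapper(func):
        EXTRACTORS[code] = func
        return func
    return wrapper

def default_extract(page_text: str) -> List[str]:
    """Fallback: treat every line starting with 'IPA:' as a pronunciation."""
    return [line.removeprefix("IPA:").strip()
            for line in page_text.splitlines()
            if line.startswith("IPA:")]

@register("deu")
def extract_deu(page_text: str) -> List[str]:
    """German pages (hypothetically) mark pronunciations differently."""
    return [line.removeprefix("Aussprache:").strip()
            for line in page_text.splitlines()
            if line.startswith("Aussprache:")]

def extract(code: str, page_text: str) -> List[str]:
    """Dispatch to the registered extractor, or fall back to the default."""
    return EXTRACTORS.get(code, default_extract)(page_text)

print(extract("deu", "Aussprache: ˈvɔnəˌtʁʊŋkn̩"))
print(extract("eng", "IPA: ˈwɜːd"))
```

The dispatch step is the useful part: languages without a registered extractor silently get the default, so adding `deu.py` wouldn't disturb any other language.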
> However, if what's called for is that the German DOM is actually different we already have a relatively mature plug-in interface for this.
I don't think this is necessary: the German extractor seems to be working mostly fine, so I don't see the need for a custom one. Also, this wouldn't solve the issue of IPA transcriptions that are only included on the German Wiktionary not being scraped, which seems to be the main reason German is missing so many transcriptions.
Okay, shall I close this then?
Yep