German IPA transcriptions from the English Wiktionary are incomplete
Opened this issue · 9 comments
The German .tsv files only have around 35,000 transcriptions.
However, there are certainly more than 600,000 IPA transcriptions in the German Wiktionary. I recently obtained ~670,000 IPA transcriptions with this script:
Download as .txt: https://mega.nz/file/cVg1iAKJ#xB_qoctX9eYaD5JDfiqLnYh4PWKdpEOL_J4fkBGubFY
Therefore, it seems that less than 10% of the IPA transcriptions from the German Wiktionary are published in the .tsv. The same is true of the French IPA transcriptions: I recently obtained ~286,000 IPA transcriptions with another script, while the French .tsv of this project contains only ~57,000.
It seems that many IPA transcriptions are being ignored by WikiPron.
Would it be possible to comprehensively scrape all the IPA transcriptions?
I am interested in making pronunciation dictionaries for GoldenDict that are useful to language learners. I have already made German and French dictionaries and published them freely.
I would like to obtain a comprehensive list of IPA transcriptions for other languages (English, Italian, etc.) to convert them into the .dsl format for GoldenDict.
Hi, I have a few questions to make this actionable on our side:
- Is it possible that the German data has been enriched in the last year or so, since we last ran the pipeline to generate the pronunciation TSV files? If so, that would account for the issue and could be fixed by us simply running the pipeline again. (One could likely test this just by running `wikipron --phonemic ger` or `wikipron --phonetic ger` and seeing how much it pulls down in November 2021 vs. last year.)
- When a pronunciation is in that dump you mention but not in WikiPron, how does the page differ? Often we need to use custom HTML extractors for particular languages, and it could be that German is one of them.
I'm not familiar with GoldenDict, but you're encouraged to repackage the data for whatever purpose, so long as licensing permits.
> Is it possible that German data has been enriched in the last year or so since we last ran the pipeline to generate the pronunciation TSV files?
No. I am sure that the German Wiktionary already had more than 500,000 IPA transcriptions last year. The German editors are very proactive about IPA and audio recordings: almost all entries have IPA, and there are currently more than 700,000 audio recordings.
> When a pronunciation is in that dump you mention but not in WikiPron, how does the page differ?
As an example, the adjective "wonnetrunken" (blissful, merry) is missing from the .tsv. German adjectives have declensions, and in this case the IPA transcriptions of the declined forms are also missing.
https://de.wiktionary.org/wiki/wonnetrunken
Wiki pages (Wiktionary, Wikipedia, etc.) can be downloaded as .zim files and used in GoldenDict. The word "wonnetrunken" shows the IPA in the .zim format:
For more examples of missing entries, please check the .txt file that I uploaded above.
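To see exactly which entries are missing, one could diff the word lists of the dump and the TSV. A minimal sketch: the sample entries below are in-memory stand-ins for the real files, which can be passed in as open file handles instead.

```python
def load_words(lines, sep="\t"):
    """Return the set of words in the first column of delimited lines."""
    words = set()
    for line in lines:
        line = line.strip()
        if line:
            words.add(line.split(sep)[0])
    return words

# Tiny stand-ins for the dump .txt and the WikiPron .tsv:
dump = load_words(["wonnetrunken\tˈvɔnəˌtʁʊŋkn̩", "Haus\thaʊ̯s"])
tsv = load_words(["Haus\thaʊ̯s"])

# Entries present in the dump but absent from the TSV:
missing = dump - tsv
print(sorted(missing))
```

With the real files, one would pass `open(path, encoding="utf-8")` directly, since file handles iterate line by line.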
The .tsv files containing IPA transcriptions in WikiPron are incomplete. Many languages are only being scraped partially, and German is a good example.
@bmilde and @tatuylonen
Would you have any ideas to help us solve this issue?
I understand that each Wiktionary edition has different templates, and it might be very hard to deal with those variations. Please let me know if I can help! Thanks for the hard work!
Oh, I understand now. A larger number of IPA pronunciations for German are available, but only on the German edition of Wiktionary itself. We don't have the means to scrape anything but the English edition; it's the richest overall, though often the X edition of X pages (where X is a language) has more information than the English edition of X pages. It's something I'd like to target eventually, but I haven't given it much thought yet.
Thanks for the help clarifying the issue.
The WikiPron script is also missing IPA transcriptions of English terms in the English Wiktionary.
It seems that there are more than 100,000 IPA transcriptions in the .json files published by @tatuylonen:
https://kaikki.org/index.html
The .tsv file of WikiPron only contains around 50,000.
I hope the issue can be solved. Thanks again for your great work!
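For anyone who wants to verify the count, the kaikki.org files are JSON Lines (one entry per line). A sketch of counting IPA-bearing entries, assuming each entry's pronunciations sit under a `"sounds"` list with `"ipa"` keys; that field layout is my reading of the format and should be checked against an actual dump:

```python
import json

def count_ipa_entries(lines):
    """Count JSON Lines entries that carry at least one IPA transcription."""
    count = 0
    for line in lines:
        entry = json.loads(line)
        # "sounds" may be absent; each sound dict may or may not have "ipa".
        if any("ipa" in sound for sound in entry.get("sounds", [])):
            count += 1
    return count

# Two stand-in entries: one with IPA, one without.
sample = [
    '{"word": "word", "sounds": [{"ipa": "/w\\u025c\\u02d0d/"}]}',
    '{"word": "silent"}',
]
print(count_ipa_entries(sample))  # → 1
```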
This is really interesting, and it could potentially be done for German, but it seems like it might be impractical to implement generally.
The script extracts the IPA from a word page through an XPath expression. The DOM for word pages on the German Wiktionary is structured totally differently from that of the English Wiktionary, so we'd have to write a different XPath to extract the IPA. We'd also have to do this for the other 100+ language editions of Wiktionary. Also, languages that use other writing systems sometimes don't use IPA on Wiktionary (e.g. Chinese and Bopomofo), so that would be a whole other problem.
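For illustration, here is a minimal stdlib-only extractor for the kind of markup such a page might use. The `<span class="ipa">` selector is an assumption about the German Wiktionary's markup, not WikiPron's actual XPath; the point is only that each edition would need its own selector.

```python
from html.parser import HTMLParser

class IPAExtractor(HTMLParser):
    """Collect the text inside <span class="ipa"> elements."""

    def __init__(self):
        super().__init__()
        self._depth = 0  # >0 while inside an IPA span (handles nested spans)
        self.ipas = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            if tag == "span":
                self._depth += 1
        elif tag == "span" and dict(attrs).get("class") == "ipa":
            self._depth = 1
            self.ipas.append("")

    def handle_endtag(self, tag):
        if self._depth and tag == "span":
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.ipas[-1] += data

# Tiny stand-in for a page fragment (hypothetical markup):
page = '<p>IPA: <span class="ipa">ˈvɔnəˌtʁʊŋkn̩</span></p>'
parser = IPAExtractor()
parser.feed(page)
print(parser.ipas)
```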
We could still try to do this for a few languages, even if we can't generalize it fully. I'll try that in the morning.
So, I don't think we want (at present) to support non-English Wiktionary editions. It's just too big a lift, and where would it end?
However, if what's called for is that the German DOM is actually different, we already have a relatively mature plug-in interface for this:
- default extractor
- example of a non-default extractor for Mandarin (many more in that directory---all of the existing ones are for East Asian languages)
- how we register the non-default extractors
All that would be needed is to create a non-default extractor `deu.py` for German and then register it and rescrape.
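As a toy model of that plug-in pattern: all names here ("EXTRACTORS", "extract_deu", the `Aussprache:` prefix) are illustrative only, and the real interface lives in WikiPron's extract directory.

```python
from typing import Callable, Dict, List

# Registry mapping ISO 639 codes to language-specific extractors.
EXTRACTORS: Dict[str, Callable[[str], List[str]]] = {}

def register(code: str):
    """Decorator that registers a non-default extractor for a language."""
    def wrapper(func):
        EXTRACTORS[code] = func
        return func
    return wrapper

def default_extract(page_text: str) -> List[str]:
    """Fallback: treat every line starting with 'IPA:' as a pronunciation."""
    return [line.removeprefix("IPA:").strip()
            for line in page_text.splitlines()
            if line.startswith("IPA:")]

@register("deu")
def extract_deu(page_text: str) -> List[str]:
    """German pages (hypothetically) mark pronunciations differently."""
    return [line.removeprefix("Aussprache:").strip()
            for line in page_text.splitlines()
            if line.startswith("Aussprache:")]

def extract(code: str, page_text: str) -> List[str]:
    """Dispatch to the registered extractor, or fall back to the default."""
    return EXTRACTORS.get(code, default_extract)(page_text)

print(extract("deu", "Aussprache: ˈvɔnəˌtʁʊŋkn̩"))
print(extract("eng", "IPA: ˈwɜːd"))
```

The dispatch step is the useful part: languages without a registered extractor silently get the default, so adding `deu.py` wouldn't disturb any other language.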
> However, if what's called for is that the German DOM is actually different we already have a relatively mature plug-in interface for this.
I don't think this is necessary: the German extractor seems to be working mostly fine, so I don't see the need for a custom one. Also, this wouldn't solve the issue of IPA transcriptions that are only included on the German Wiktionary not being scraped, which seems to be the main reason German is missing so many transcriptions.
Okay, shall I close this then?
Yep