Scraping POS
mjuliao-dot opened this issue · 1 comment
mjuliao-dot commented
(suggestion) It would be great to have access to the POS of the words,
since there are many cases where the same spelling is pronounced differently depending on the POS, e.g. per'mit (verb) vs. 'permit (noun).
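To make the request concrete, here is a minimal sketch of what POS-aware output could look like. Everything in it is an assumption for illustration: the `LEXICON` entries, the POS labels, and the `pronunciations` helper are hand-written and hypothetical, not something the scraper produces, and the transcriptions are approximate.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical lexicon keyed by (spelling, POS). The entries are hand-written
# for this example and the IPA strings are approximate.
LEXICON: Dict[Tuple[str, str], List[str]] = {
    ("permit", "verb"): ["pɚˈmɪt"],
    ("permit", "noun"): ["ˈpɝmɪt"],
}


def pronunciations(word: str, pos: Optional[str] = None) -> List[str]:
    """Returns pronunciations of `word`, restricted to `pos` when given."""
    return [
        pron
        for (spelling, tag), prons in LEXICON.items()
        if spelling == word and (pos is None or tag == pos)
        for pron in prons
    ]
```

With a lexicon like this, `pronunciations("permit", "verb")` would return only the verb reading, while `pronunciations("permit")` would return both.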
kylebgorman commented
The reason we can scrape pronunciations the way we do is that, for all but a handful of East Asian languages (which we hard-coded, with great effort), the HTML format of the pronunciations is identical. I do not think this is true of POS. Our experience with UniMorph (https://unimorph.github.io/) is that morphological coding on Wiktionary is more or less language-specific and requires a domain expert’s input. In your example of “permit”, they don’t say, in some standard syntax, that one pronunciation is a verb and one is a noun; they say “most verb senses”, etc. So I think this is way out of scope: it’s just too hard.
If you just care about English, I have a corpus labeled for that (https://github.com/google-research-datasets/TextNormalizationCoveringGrammars). It is not terribly common for homographs to be purely POS-disambiguated across the world’s languages, and it’s an open question whether morphology can help for g2p.
On Dec 7, 2022, at 10:36 AM, mjuliao-dot wrote:
(suggestion) It would be great to have access to the POS of the words,
since there are many cases where the same spelling is pronounced differently depending on the POS, e.g. per'mit (verb) vs. 'permit (noun).