A text file containing 12209 English words that look like pinyin, i.e., that can be segmented into Chinese syllables. See pinyin_like_english_words.txt.
Certain English words can be segmented into Chinese syllables and interpreted as pinyin, for example
- cache as "ca che",
- siren as "si ren",
- genre as "gen re",
- Chihuahua as "chi hua hua",
Sometimes the dramatic difference between the English and the resulting pinyin pronunciation is funny to me.
Out of curiosity, I wrote a simple Python script, main.py, to find these pinyin-like English words.
words_alpha.txt is obtained from english-words and contains about 370k English words.
pinyin.txt is obtained from ISO 7098:2015 (Annex A: Table of Chinese syllable forms) and contains 410 Chinese syllables.
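
The segmentation check can be sketched roughly as follows. This is a minimal illustration, not necessarily how main.py does it: the function name `segment` and the tiny `SYLLABLES` set are made up for the example, and in practice the set would be loaded from pinyin.txt (assumed here to hold one syllable per line). It tries every syllable that matches at the current position and backtracks when the rest of the word cannot be segmented.

```python
from functools import lru_cache

# Illustrative subset only; the real pinyin.txt lists 410 syllables.
# Assumed loading code: SYLLABLES = frozenset(open("pinyin.txt").read().split())
SYLLABLES = frozenset({"ca", "che", "si", "ren", "gen", "re", "chi", "hua"})


def segment(word, syllables=SYLLABLES):
    """Return one way to split `word` into syllables, or None if impossible."""
    word = word.lower()
    max_len = max(map(len, syllables))

    @lru_cache(maxsize=None)
    def solve(start):
        # Reaching the end of the word means everything before it was matched.
        if start == len(word):
            return ()
        # Try every candidate syllable beginning at `start`.
        for end in range(start + 1, min(len(word), start + max_len) + 1):
            piece = word[start:end]
            if piece in syllables:
                rest = solve(end)
                if rest is not None:
                    return (piece,) + rest
        # No syllable fits here, so the caller has to backtrack.
        return None

    result = solve(0)
    return None if result is None else list(result)


if __name__ == "__main__":
    for w in ["cache", "siren", "genre", "Chihuahua", "python"]:
        print(w, segment(w))
```

Note that "siren" needs the backtracking step: "si" + "re" leaves a dangling "n", so the search falls back to "si" + "ren". A word may also have more than one valid segmentation; this sketch only returns the first one it finds.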
- The longest pinyin-like word is humuhumunukunukuapuaa. Its pronunciations in English and in pinyin are quite similar.
- About 3.3% (12209/370103) of the words are pinyin-like.
I'll write more about this in a blog post. It'll be in Chinese, though.