How to extract uncommon words from eng_dict.txt?

Question

Closed this issue 3 years ago · 2 comments

Hi, thank you very much for the extension; it is extremely helpful for me!

For me, 25% is the best setting in the extension. Would it be possible to export all uncommon words (the remaining 75%) as a txt file? Or could you please explain a bit about how you set up and adjust the word frequency?

With that, I could write a script for extracting uncommon words from PDF files, so we wouldn't be limited to web pages.

Thanks so much for the help!

Answer 1 · 2022-01-24T02:40:27.000Z

Thanks for the feedback!
The words are stored in the eng_dict.txt file, sorted by frequency.
Since there are 110151 entries in the file, you can just take all entries starting from line 0.25 * 110151 ≈ 27537.
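
The slicing described above could be sketched in Python roughly as follows. This is only an illustration, not the extension's own code: it assumes eng_dict.txt holds one entry per line with the most frequent entries first, and the function names and the output file name `uncommon_words.txt` are made up for the example.

```python
from pathlib import Path


def uncommon_entries(lines, common_fraction=0.25):
    """Drop the first `common_fraction` of a frequency-sorted list.

    Assumes the most frequent entries come first, so everything after
    the cutoff is treated as "uncommon".
    """
    # Ignore blank lines so they do not shift the cutoff.
    entries = [line.strip() for line in lines if line.strip()]
    cutoff = int(len(entries) * common_fraction)  # 110151 entries -> 27537
    return entries[cutoff:]


def export_uncommon(src="eng_dict.txt", dst="uncommon_words.txt",
                    common_fraction=0.25):
    """Write the uncommon entries of `src` to `dst`; return how many were written.

    Both file names are placeholders for this sketch.
    """
    lines = Path(src).read_text(encoding="utf-8").splitlines()
    rare = uncommon_entries(lines, common_fraction)
    Path(dst).write_text("\n".join(rare) + "\n", encoding="utf-8")
    return len(rare)
```

Calling `export_uncommon()` from the folder that contains eng_dict.txt would write everything past the 25% mark to the new file. If the entries in the file carry extra fields besides the word itself, the lines would need to be split accordingly.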

To get the word frequencies I used a Reddit corpus: https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/ - as I remember, I just calculated the frequencies of all valid English words in the corpus.
Hope it helps!
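
For anyone who wants to build a similar list from their own corpus, a frequency ranking of this kind might look like the sketch below. It is not the script that was actually used: the tokenizing regex, the `valid_words` set, and the function name are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical tokenizer: lowercase runs of letters, with an optional apostrophe part.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def rank_by_frequency(texts, valid_words):
    """Return the valid words seen in `texts`, most frequent first.

    `texts` is any iterable of strings (for example, comment bodies) and
    `valid_words` is a set of lowercase words accepted as English.
    Ties are broken alphabetically so the output is deterministic.
    """
    counts = Counter()
    for text in texts:
        for token in WORD_RE.findall(text.lower()):
            if token in valid_words:
                counts[token] += 1
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked]
```

Writing the returned list to a file, one word per line, gives a frequency-sorted list that can be cut at any percentage.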

Answer 2 · 2022-01-24T05:54:52.000Z

Many thanks for the reply! You've clearly done a lot of work, and your reply is incredibly useful!