How to extract uncommon words from eng_dict.txt?
Hi, thank you very much for the extension; it is extremely helpful for me!
For me, 25% is the best setting in the extension. Would it be possible to export all uncommon words (the remaining 75%) as a txt file? Or could you please explain a bit about how you set up and adjust the word frequency?
With that, I could write a script to extract uncommon words from PDF files, so we wouldn't be limited to web pages.
Thanks so much for the help!
Thanks for the feedback!
The words are stored in the eng_dict.txt file, sorted by frequency.
Since there are 110151 entries in the file, you can just take all entries starting from line 0.25 * 110151 ≈ 27537.
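A minimal sketch of that cutoff in Python, in case it helps. It assumes eng_dict.txt has one entry per line with the most frequent entries first; the exact line format is not described here, so each line is copied through unchanged. The function names and the output file name `uncommon_words.txt` are made up for illustration.

```python
from pathlib import Path


def uncommon_entries(lines, fraction=0.25):
    """Drop the first `fraction` of the entries and return the rest.

    Assumes `lines` is ordered from most to least frequent.
    """
    entries = [line.strip() for line in lines if line.strip()]
    cutoff = int(len(entries) * fraction)  # 110151 entries -> 27537
    return entries[cutoff:]


def export_uncommon(src="eng_dict.txt", dst="uncommon_words.txt", fraction=0.25):
    """Write the less frequent entries of `src` to `dst`; return how many."""
    text = Path(src).read_text(encoding="utf-8")
    tail = uncommon_entries(text.splitlines(), fraction)
    Path(dst).write_text("\n".join(tail) + "\n", encoding="utf-8")
    return len(tail)
```

Calling `export_uncommon()` next to eng_dict.txt would then produce a plain word list that a PDF-processing script can load into a set.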
To get the word frequencies I used a Reddit comment corpus: https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/ - as far as I remember, I just calculated the frequencies of all valid English words in the corpus.
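Roughly, that kind of counting can be sketched as below. This is an illustration, not necessarily the script that produced eng_dict.txt: the tokenization regex, the `valid_words` set, and the function name `rank_by_frequency` are all assumptions.

```python
import re
from collections import Counter


def rank_by_frequency(comments, valid_words):
    """Return the valid words seen in `comments`, most frequent first.

    `comments` is any iterable of strings; `valid_words` is a set of
    lowercase words to keep (a hypothetical English word list).
    """
    counts = Counter()
    for comment in comments:
        # Naive tokenization: runs of lowercase letters and apostrophes.
        for token in re.findall(r"[a-z']+", comment.lower()):
            if token in valid_words:
                counts[token] += 1
    return [word for word, _ in counts.most_common()]
```

Words from the valid list that never occur in the corpus are simply absent from the result here; how such words were handled in the real file is not stated.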
Hope it helps!
Many thanks for the reply! Apparently you've done a lot of work and your reply is incredibly useful!