Why are there ~1500 duplicate words here?
farzher opened this issue · 5 comments
farzher commented
Shouldn't the list be deduplicated?
kylemcdonald commented
Yes, it looks like 20k.txt has 1470 duplicates, and the usa file has 10:
$ wc -l < 20k.txt
19999
$ sort 20k.txt | uniq | wc -l
18529
$ wc -l < google-10000-english-usa.txt
9999
$ sort google-10000-english-usa.txt | uniq | wc -l
9989
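As an aside, the duplicated words themselves can be listed with uniq -d, which prints only the lines that occur more than once (a quick sketch; outputs omitted since they depend on the file revision):

$ sort 20k.txt | uniq -d | wc -l
$ sort 20k.txt | uniq -d | head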
whitten commented
I don't know. Is it a case-sensitivity issue? Does sort | uniq keep only one of "this" and "This"?
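For what it's worth, sort and uniq compare case-sensitively by default, so "this" and "This" would both survive the pipeline above. A case-insensitive recount would settle it; sort -f folds case while sorting and uniq -i ignores it when collapsing (a sketch, assuming GNU or BSD coreutils):

$ sort -f 20k.txt | uniq -i | wc -l

If this prints the same count as the case-sensitive version (18529), case is not the cause.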
farzher commented
koseki commented
It seems that two different sources were combined into 20k.txt.

I checked the frequency rankings of this list using 20k.txt, and plotted the result [graph not captured]. The original count_1w.txt shows a straight graph.
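If 20k.txt really is two lists concatenated, each duplicated word should show up at two widely separated ranks. grep can check this by printing the line number of every exact whole-line match ("the" below is just a stand-in for any word reported by uniq -d):

$ grep -nx 'the' 20k.txt

Two hits far apart would be consistent with a second source appended after the first.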
worldlywisdom commented
Great catch - not sure why the original source has duplicates. I appreciate the fix.
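For completeness, here is a one-liner that drops repeats while keeping the first (highest-ranked) occurrence of each word; plain sort | uniq would destroy the frequency ordering, which is the whole point of the list. The output filename is just an example, and this is a sketch rather than necessarily the fix that was merged:

$ awk '!seen[$0]++' 20k.txt > 20k-dedup.txt
$ wc -l < 20k-dedup.txt

awk keeps a line only the first time it is seen (on the first occurrence seen[$0]++ evaluates to 0, so !0 is true), leaving the relative order of the surviving words untouched.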