first20hours/google-10000-english

Why are there ~1500 duplicate words here?

farzher opened this issue · 5 comments

Shouldn't the list be deduplicated?

Yes, it looks like 20k.txt has 1470 duplicates, and the USA file has 10:

$ wc -l < 20k.txt 
   19999
$ sort 20k.txt | uniq | wc -l
   18529
$ wc -l < google-10000-english-usa.txt 
    9999
$ sort google-10000-english-usa.txt | uniq | wc -l
    9989

I don't know. Is it a case-sensitivity issue?
Does sort or uniq collapse "this" and "This" into one entry?
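
A quick way to check, assuming GNU coreutils (sort -f folds case, uniq -i compares case-insensitively):

$ sort -f 20k.txt | uniq -i | wc -l   # if this still prints 18529, case isn't the cause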

It's not a case issue; they're exact duplicates. Check with any dedupe tool.
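
Or with just coreutils, something along these lines should list the repeats (a rough sketch, case-sensitive on purpose):

$ sort 20k.txt | uniq -d | wc -l             # distinct entries that appear more than once
$ sort 20k.txt | uniq -c | sort -rn | head   # the most-repeated entries with their counts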


Apparently Word is in there 9 times

It looks like two different sources were combined into 20k.txt.

I checked the frequency rankings of the words in 20k.txt, and the result looks like this:

[plot: word frequencies of 20k.txt, by rank]

By contrast, the original count_1w.txt gives a smooth curve:

[plot: word frequencies of count_1w.txt, by rank]
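
To reproduce this roughly, assuming count_1w.txt is the original word/count list (one "word<TAB>count" per line):

$ awk 'NR==FNR { f[$1] = $2; next } { print f[$1] + 0 }' count_1w.txt 20k.txt > 20k-freqs.txt
# one count per line, in 20k.txt order; words missing from count_1w.txt come out as 0
# plot 20k-freqs.txt on a log scale to compare against count_1w.txt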


Great catch - not sure why the original source has duplicates. I appreciate the fix.
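
For reference, a one-liner that drops the duplicates while keeping the original order (a rough sketch; the 20k-dedup.txt name is just an example):

$ awk '!seen[$0]++' 20k.txt > 20k-dedup.txt
$ wc -l < 20k-dedup.txt   # should match the sort | uniq count above (18529)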