Word length filter faulty? (TICCL-indexer and/or TICCL-LDcalc)
martinreynaert opened this issue · 3 comments
We observe that short words may exceed the limit imposed by the --low= parameter. In a prior run we had specified --low=4 for TICCL-indexer, but not for TICCL-LDcalc. This resulted in longer word strings being included in the file *short.ldcalc. We took this to be due to the fact that TICCL-LDcalc has a default value of 5.
We repeated this run with the --low and --high parameters set to 4 and 60, respectively, for both tools.
reynaert@red:~$ cut -d '~' -f 1 RUNNEW2HYPHEN.wordfreqlist.1to3.tsv.clean.Low4High60.short.ldcalc |sort -u >bla ^C
reynaert@red:~$ grep '^.$' bla |wc
0 0 0
reynaert@red:~$ cd /reddata/PILOTS/MORSE/RUNNEW2HYPHEN/zzz/TICCL
reynaert@red:/reddata/PILOTS/MORSE/RUNNEW2HYPHEN/zzz/TICCL$ cut -d '~' -f 1 RUNNEW2HYPHEN.wordfreqlist.1to3.tsv.clean.Low4High60.short.ldcalc |sort -u >bla ^C
reynaert@red:/reddata/PILOTS/MORSE/RUNNEW2HYPHEN/zzz/TICCL$ grep '^.$' bla |wc
18 18 36
reynaert@red:/reddata/PILOTS/MORSE/RUNNEW2HYPHEN/zzz/TICCL$ grep '^..$' bla |wc
173 173 541
reynaert@red:/reddata/PILOTS/MORSE/RUNNEW2HYPHEN/zzz/TICCL$ grep '^...$' bla |wc
504 504 2056
reynaert@red:/reddata/PILOTS/MORSE/RUNNEW2HYPHEN/zzz/TICCL$ grep '^....$' bla |wc
748 748 3797
reynaert@red:/reddata/PILOTS/MORSE/RUNNEW2HYPHEN/zzz/TICCL$ grep '^.....$' bla |wc
1016 1016 6192
reynaert@red:/reddata/PILOTS/MORSE/RUNNEW2HYPHEN/zzz/TICCL$ grep '^......$' bla |wc
0 0 0
reynaert@red:/reddata/PILOTS/MORSE/RUNNEW2HYPHEN/zzz/TICCL$ grep '^.......$' bla |wc
0 0 0
We see we still get many variants of length 5 incorporated in *short.ldcalc.
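The per-length grep counts above can also be collected in one pass. The following is only a minimal sketch, assuming that the first '~'-separated field of each line is the variant (as the cut command above does) and that the file is UTF-8 encoded. The function name variant_length_histogram is made up for illustration and is not part of TICCL.

```python
from collections import Counter


def variant_length_histogram(lines, sep="~"):
    """Count distinct variants (assumed: field 1) by length in characters."""
    variants = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        variants.add(line.split(sep, 1)[0])
    # len() on a Python str counts characters, so 'ñones' has length 5
    return Counter(len(v) for v in variants)


# Two rows copied from the *short.ldcalc output quoted in this issue.
sample = [
    "ñones~4~4~AGNES~1000000261~1000022384~0~2~3~1~0~1~0~2",
    "ñones~4~4~Agnes~1000022114~1000022384~0~2~3~1~0~1~0~6",
]
hist = variant_length_histogram(sample)
for length in sorted(hist):
    print(length, hist[length])
```

To run it on the real file, the list can be replaced by an open file handle, for example open(path, encoding="utf-8"), where path points to the *short.ldcalc file.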
We are not sure on what basis a pair is currently sent to *short.ldcalc. We assume that this may also be caused by the length of the CC. However, we think pairs should be sent to *short.ldcalc purely on the basis of the variant's length.
Also we observe that the same variant is sometimes (?) assigned to *short.ldcalc as well as to *ldcalc proper.
reynaert@red:/reddata/PILOTS/MORSE/RUNNEW2HYPHEN/zzz/TICCL$ grep '^ñones' /reddata/PILOTS/MORSE/RUNNEW2HYPHEN/zzz/TICCL/RUNNEW2HYPHEN.wordfreqlist.1to3.tsv.clean.Low4High60.short.ldcalc |head
ñones~4~4~AGNES~1000000261~1000022384~0~2~3~1~0~1~0~2
ñones~4~4~Agnes~1000022114~1000022384~0~2~3~1~0~1~0~6
ñones~4~4~Anes~1000000163~1000000214~0~2~3~1~0~1~0~1
ñones~4~4~Annes~1000000675~1000000678~0~2~3~1~0~1~0~4
ñones~4~4~Arnes~1000000016~1000000033~0~2~3~1~0~1~0~1
ñones~4~4~BANES~1000000030~1000000337~0~2~3~1~0~1~0~1
ñones~4~4~BOCES~1000000192~1000000203~0~2~3~1~0~1~0~3
ñones~4~4~BOEs~1000000010~1000000088~0~2~3~1~0~1~0~1
ñones~4~4~BONE~1000000049~1000050928~0~2~3~1~0~0~0~1
ñones~4~4~BONES~1000000073~1000045592~0~1~4~1~0~1~0~3
reynaert@red:/reddata/PILOTS/MORSE/RUNNEW2HYPHEN/zzz/TICCL$ grep '^ñones' /reddata/PILOTS/MORSE/RUNNEW2HYPHEN/zzz/TICCL/RUNNEW2HYPHEN.wordfreqlist.1to3.tsv.clean.Low4High60.ldcalc |head
ñones~4~4~AGNES~1000000261~1000022384~10528371854~2~3~1~0~1~0~0
ñones~4~4~ALONEs~1000000001~1000000013~21138833638~2~4~1~0~1~0~0
ñones~4~4~ANES~1000000013~1000000214~13335164745~2~3~1~0~1~0~0
ñones~4~4~ANEs~1000000001~1000000214~13335164745~2~3~1~0~1~0~0
ñones~4~4~ARNES~1000000016~1000000033~1358116023~2~3~1~0~1~0~0
ñones~4~4~A_ones~1000000002~1000000006~14556224838~2~4~1~0~1~0~0
ñones~4~4~Agnes~1000022114~1000022384~10528371854~2~3~1~0~1~0~0
ñones~4~4~Aines~1000000005~1000000005~47091031~2~3~1~0~1~0~0
ñones~4~4~Alones~1000000008~1000000013~21138833638~2~4~1~0~1~0~0
ñones~4~4~Anes~1000000163~1000000214~13335164745~2~3~1~0~1~0~0
This is not at all desirable. Also, the sets of CCs are different...
It would be nice if this could be sorted out first. Please give this priority.
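To quantify the overlap between the two files, something like the following could be used. It is only a sketch under assumptions: it treats field 1 as the variant and field 4 as the CC, which is what the rows above suggest, and the helper name pair_keys is hypothetical.

```python
def pair_keys(lines, sep="~"):
    """Return the set of (variant, CC) pairs, assumed to be fields 1 and 4."""
    keys = set()
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) >= 4:  # ignore blank or malformed lines
            keys.add((fields[0], fields[3]))
    return keys


# Rows copied from the *short.ldcalc output quoted in this issue.
short_rows = [
    "ñones~4~4~AGNES~1000000261~1000022384~0~2~3~1~0~1~0~2",
    "ñones~4~4~Agnes~1000022114~1000022384~0~2~3~1~0~1~0~6",
    "ñones~4~4~Anes~1000000163~1000000214~0~2~3~1~0~1~0~1",
    "ñones~4~4~Annes~1000000675~1000000678~0~2~3~1~0~1~0~4",
]
# Rows copied from the *.ldcalc output quoted in this issue.
main_rows = [
    "ñones~4~4~AGNES~1000000261~1000022384~10528371854~2~3~1~0~1~0~0",
    "ñones~4~4~ALONEs~1000000001~1000000013~21138833638~2~4~1~0~1~0~0",
    "ñones~4~4~Agnes~1000022114~1000022384~10528371854~2~3~1~0~1~0~0",
    "ñones~4~4~Anes~1000000163~1000000214~13335164745~2~3~1~0~1~0~0",
]

short_pairs = pair_keys(short_rows)
main_pairs = pair_keys(main_rows)

in_both = short_pairs & main_pairs     # pairs written to both files
only_short = short_pairs - main_pairs  # pairs only in *short.ldcalc
only_main = main_pairs - short_pairs   # pairs only in *.ldcalc

print(len(in_both), len(only_short), len(only_main))
```

On the full files, a non-empty in_both set would confirm the duplication, and the two difference sets would show where the sets of CCs diverge.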
Ok, this is based on a misconception:
the --low and --high filters are only imposed on the words from the .clean file.
The recently added .short file option doesn't use this value, but an implicit value of 5.
I will adapt LDcalc to use the same 'low' value there too.