LanguageMachines/ticcltools

Word length filter faulty? (TICCL-indexer and/or TICCL-LDcalc)

martinreynaert opened this issue · 3 comments

We observe that short words may exceed the limit imposed by the --low= parameter. In a prior run we had specified --low=4 for TICCL-indexer, but not for TICCL-LDcalc. This resulted in longer word strings being included in the file *short.ldcalc. We took this to be due to the fact that TICCL-LDcalc has a default value of 5.

We repeated this run with the --low and --high parameters set to 4 and 60, respectively, for both tools.

reynaert@red:~$ cut -d '~' -f 1 RUNNEW2HYPHEN.wordfreqlist.1to3.tsv.clean.Low4High60.short.ldcalc |sort -u >bla ^C
reynaert@red:~$ grep '^.$' bla |wc
      0       0       0
reynaert@red:~$ cd /reddata/PILOTS/MORSE/RUNNEW2HYPHEN/zzz/TICCL
reynaert@red:/reddata/PILOTS/MORSE/RUNNEW2HYPHEN/zzz/TICCL$ cut -d '~' -f 1 RUNNEW2HYPHEN.wordfreqlist.1to3.tsv.clean.Low4High60.short.ldcalc |sort -u >bla ^C
reynaert@red:/reddata/PILOTS/MORSE/RUNNEW2HYPHEN/zzz/TICCL$ grep '^.$' bla |wc
     18      18      36
reynaert@red:/reddata/PILOTS/MORSE/RUNNEW2HYPHEN/zzz/TICCL$ grep '^..$' bla |wc
    173     173     541
reynaert@red:/reddata/PILOTS/MORSE/RUNNEW2HYPHEN/zzz/TICCL$ grep '^...$' bla |wc
    504     504    2056
reynaert@red:/reddata/PILOTS/MORSE/RUNNEW2HYPHEN/zzz/TICCL$ grep '^....$' bla |wc
    748     748    3797
reynaert@red:/reddata/PILOTS/MORSE/RUNNEW2HYPHEN/zzz/TICCL$ grep '^.....$' bla |wc
   1016    1016    6192
reynaert@red:/reddata/PILOTS/MORSE/RUNNEW2HYPHEN/zzz/TICCL$ grep '^......$' bla |wc
      0       0       0
reynaert@red:/reddata/PILOTS/MORSE/RUNNEW2HYPHEN/zzz/TICCL$ grep '^.......$' bla |wc
      0       0       0

We see we still get many variants of length 5 incorporated in *short.ldcalc.
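
The per-length grep checks above can also be done in one pass. This is only a minimal sketch: it assumes, as the cut command above does, that fields are separated by '~' and that the variant is field 1. The function name is hypothetical and not part of ticcltools. Lengths are counted in Unicode characters, so a variant such as ñones counts as 5.

```python
from collections import Counter


def variant_length_histogram(lines, sep="~"):
    """Count unique variants (field 1) per length in characters.

    Mirrors `cut -d '~' -f 1 FILE | sort -u` followed by one grep per length.
    """
    variants = {
        line.rstrip("\n").split(sep, 1)[0]
        for line in lines
        if line.strip()  # skip blank lines
    }
    return Counter(len(v) for v in variants)


# Two records copied from the *short.ldcalc listing in this report.
sample = [
    "ñones~4~4~AGNES~1000000261~1000022384~0~2~3~1~0~1~0~2",
    "ñones~4~4~Agnes~1000022114~1000022384~0~2~3~1~0~1~0~6",
]
for length, count in sorted(variant_length_histogram(sample).items()):
    print(length, count)
```

To run it on a real file, pass an open file handle (opened with encoding="utf-8") instead of the sample list.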

We are not sure on what basis a pair is currently sent to *short.ldcalc. We assume that this may also be caused by the length of the CC. However, we think pairs should be sent to *short.ldcalc purely on the basis of the variant's length.

Also we observe that the same variant is sometimes (?) assigned to *short.ldcalc as well as to *ldcalc proper.

reynaert@red:/reddata/PILOTS/MORSE/RUNNEW2HYPHEN/zzz/TICCL$ grep '^ñones' /reddata/PILOTS/MORSE/RUNNEW2HYPHEN/zzz/TICCL/RUNNEW2HYPHEN.wordfreqlist.1to3.tsv.clean.Low4High60.short.ldcalc |head
ñones~4~4~AGNES~1000000261~1000022384~0~2~3~1~0~1~0~2
ñones~4~4~Agnes~1000022114~1000022384~0~2~3~1~0~1~0~6
ñones~4~4~Anes~1000000163~1000000214~0~2~3~1~0~1~0~1
ñones~4~4~Annes~1000000675~1000000678~0~2~3~1~0~1~0~4
ñones~4~4~Arnes~1000000016~1000000033~0~2~3~1~0~1~0~1
ñones~4~4~BANES~1000000030~1000000337~0~2~3~1~0~1~0~1
ñones~4~4~BOCES~1000000192~1000000203~0~2~3~1~0~1~0~3
ñones~4~4~BOEs~1000000010~1000000088~0~2~3~1~0~1~0~1
ñones~4~4~BONE~1000000049~1000050928~0~2~3~1~0~0~0~1
ñones~4~4~BONES~1000000073~1000045592~0~1~4~1~0~1~0~3
reynaert@red:/reddata/PILOTS/MORSE/RUNNEW2HYPHEN/zzz/TICCL$ grep '^ñones' /reddata/PILOTS/MORSE/RUNNEW2HYPHEN/zzz/TICCL/RUNNEW2HYPHEN.wordfreqlist.1to3.tsv.clean.Low4High60.ldcalc |head
ñones~4~4~AGNES~1000000261~1000022384~10528371854~2~3~1~0~1~0~0
ñones~4~4~ALONEs~1000000001~1000000013~21138833638~2~4~1~0~1~0~0
ñones~4~4~ANES~1000000013~1000000214~13335164745~2~3~1~0~1~0~0
ñones~4~4~ANEs~1000000001~1000000214~13335164745~2~3~1~0~1~0~0
ñones~4~4~ARNES~1000000016~1000000033~1358116023~2~3~1~0~1~0~0
ñones~4~4~A_ones~1000000002~1000000006~14556224838~2~4~1~0~1~0~0
ñones~4~4~Agnes~1000022114~1000022384~10528371854~2~3~1~0~1~0~0
ñones~4~4~Aines~1000000005~1000000005~47091031~2~3~1~0~1~0~0
ñones~4~4~Alones~1000000008~1000000013~21138833638~2~4~1~0~1~0~0
ñones~4~4~Anes~1000000163~1000000214~13335164745~2~3~1~0~1~0~0

This is not at all desirable. Also, the sets of CCs are different...
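
To see how widespread this is, the two files can be compared per variant. This is a rough sketch under assumptions: the variant is taken to be field 1 and the CC field 4, which is read off the sample output above rather than from any format documentation. The function names are hypothetical.

```python
from collections import defaultdict


def candidates_by_variant(lines, sep="~"):
    """Map each variant (field 1) to the set of its CCs (assumed field 4)."""
    table = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) >= 4:  # ignore blank or truncated lines
            table[fields[0]].add(fields[3])
    return table


def shared_variants(short_lines, main_lines):
    """Report variants present in both files and how their CC sets differ."""
    short = candidates_by_variant(short_lines)
    main = candidates_by_variant(main_lines)
    report = {}
    for variant in short.keys() & main.keys():
        report[variant] = {
            "only_short": short[variant] - main[variant],
            "both": short[variant] & main[variant],
            "only_main": main[variant] - short[variant],
        }
    return report


# Records copied from the two listings above.
short_sample = [
    "ñones~4~4~AGNES~1000000261~1000022384~0~2~3~1~0~1~0~2",
    "ñones~4~4~BONE~1000000049~1000050928~0~2~3~1~0~0~0~1",
]
main_sample = [
    "ñones~4~4~AGNES~1000000261~1000022384~10528371854~2~3~1~0~1~0~0",
    "ñones~4~4~ALONEs~1000000001~1000000013~21138833638~2~4~1~0~1~0~0",
]
for variant, diff in shared_variants(short_sample, main_sample).items():
    print(variant, {key: sorted(value) for key, value in diff.items()})
```

An empty report would mean that no variant occurs in both files.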

It would be nice if this could be sorted out first. Please give this priority.

  • reformatted for readability.
  • split the second problem into separate issue: #31

Ok, this is based on a misconception:
the --low and --high filters are only imposed on the words from the .clean file

The recently added .short file option doesn't use this value, but an implicit value of 5.
I will adapt LDcalc to use the same 'low' value there too.

Fixed in Git now.
#31 was caused by this, after all.