some regex not making any change in train/test data
aarora8 opened this issue · 2 comments
Laia/egs/iam/steps/iam_tokenize.py
Line 18 in c6bb8ab
Hi Joan,
To better understand the function of each regex in steps/iam_tokenize.py, I did the following:
For file a):
- commented out all regexes in steps/iam_tokenize.py.
- ran steps/prepare_iam_text.sh
For file b):
- commented out all regexes except one in steps/iam_tokenize.py.
- ran steps/prepare_iam_text.sh
Then I diffed file a and file b.
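The same check can also be done in memory, without editing the script for every regex. This is only a minimal sketch: the rule names and patterns in `RULES` are made up for illustration and are not copied from steps/iam_tokenize.py, and `rule_report` is a hypothetical helper.

```python
import re

# Hypothetical rule table: names and patterns are illustrative only,
# NOT the actual contents of steps/iam_tokenize.py.
RULES = {
    "PERIOD_ABBREV": (re.compile(r"\b(Mr|Mrs|Dr)\."), r"\1 ."),
    "CONTRACTION_NT": (re.compile(r"(?i)\b(\w+)(n't)\b"), r"\1 \2"),
    "DOUBLE_BACKTICK": (re.compile(r"(``)"), r" \1 "),
}


def count_changed_lines(lines, pattern, repl):
    """Count the lines that differ after applying a single substitution."""
    return sum(1 for line in lines if pattern.sub(repl, line) != line)


def rule_report(lines):
    """Apply each rule on its own and report how many lines it changes."""
    return {
        name: count_changed_lines(lines, pattern, repl)
        for name, (pattern, repl) in RULES.items()
    }


sample = [
    "Mr. Foot said he don't know.",
    "They cannot go.",
    "Dr. Smith won't come.",
]
report = rule_report(sample)
for name, changed in report.items():
    print(f"{name}: {changed} line(s) changed")
```

A rule that reports 0 changed lines on the full train/test text corresponds to an empty diff between file a and file b.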
For some regexes, I could see that they make the following changes:
- Mr. -> Mr .
- settlers' -> settlers '
- Foot's -> Foot 's
- don't -> do n't
- cannot -> can not
- Bru"cke -> Bru " cke
- you've -> you 've
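For reference, substitutions of roughly the following shape would produce the changes listed above. This is a simplified sketch written for this comment, assuming Penn Treebank style splitting; the patterns are my own approximations, not the ones in steps/iam_tokenize.py, and `split_tokens` is a hypothetical name.

```python
import re

# Simplified, illustrative rules (assumptions, not the script's regexes).
# They are applied in order; each one inserts a space at a split point.
RULES = [
    (re.compile(r"\b(can)(not)\b", re.IGNORECASE), r"\1 \2"),  # cannot -> can not
    (re.compile(r"(\w)(n't)\b"), r"\1 \2"),                    # don't -> do n't
    (re.compile(r"(\w)('s|'ve|'re|'ll|'d|'m)\b"), r"\1 \2"),   # Foot's -> Foot 's
    (re.compile(r"(\w)'(?=\s|$)"), r"\1 '"),                   # settlers' -> settlers '
    (re.compile(r'"'), r' " '),                                # Bru"cke -> Bru " cke
    (re.compile(r"\b(Mr|Mrs|Dr)\."), r"\1 ."),                 # Mr. -> Mr .
]


def split_tokens(text):
    """Apply every rule in order, then collapse repeated whitespace."""
    for pattern, repl in RULES:
        text = pattern.sub(repl, text)
    return " ".join(text.split())


for word in ["Mr.", "settlers'", "Foot's", "don't", "cannot", 'Bru"cke', "you've"]:
    print(f"{word} -> {split_tokens(word)}")
```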
But some regexes do not make any change, for example:
- STARTING_QUOTES
- PARENS_BRACKETS
- CONTRACTIONS3
Next to STARTING_QUOTES, there is a comment: # This line changes: do not replace "
But I am not able to understand the purpose of these regexes. Are they meant for the Brown and Wellington corpora?
Thanks,
Ashish
Hi!
The script was inherited from NLTK's Penn Treebank tokenizer, so there are parts that may not be used in practice. I'm not sure whether the parts you pointed out are used or not.
I just want to stress that tokenization is only important for Language Modeling, and the CER/WER that we report is always based on the original GT.
Thanks,
Yeah, I also observed that the tokenization file is only used in decode_lm and when preparing the external text.
decode_net uses only the ground truth data. Only minor modifications, such as removing the space between words and joining contractions back together, are applied to the ground truth data.