misbehavior of data_utils.tokenize

Question

misbehavior of data_utils.tokenize

Luolc opened this issue 7 years ago · 4 comments

Hi! I am working on personalized task-oriented dialog as well and investigating on your model.

I find a minor bug:
The data_utils.tokenize cannot treat "middle-aged" as one word.

Actually, tokenize('male middle-aged') returns ['male', 'middle', '-', 'aged']. I think it should be ['male', 'middle-aged'] instead. This might affect the performance of baseline memN2N

Answer 1 · 2017-12-09T06:02:50.000Z

Thanks for the interest and for pointing this out! I shall work on a fix and run tests to confirm if this is actually having an impact on the performance itself.

On initial thought, I don't think it is. Our experiments showed us that accuracy of all models for the 6 profiles were almost the same. (For example, see the section on Multi-task Learning in the paper: https://arxiv.org/abs/1706.07503.) In this case, the model must have trained embeddings for 'middle' and 'aged' separately, but when storing the profile memory, it takes a sum of the various embeddings it is composed of. What do you think?

Regardless, splitting 'middle-aged' is not the behavior we intended, so I shall fix soon.

Answer 2 · 2017-12-09T06:05:40.000Z

P.S. Please make a pull request if you have already made a fix for it! :)

Answer 3 · 2017-12-09T09:44:09.000Z

I did fix it on my local code, but I also changed the code a lot to build my own model. It might be hard to create PR, sorry for that.
Actually it is a one-line fix at:
https://github.com/chaitjo/personalized-dialog/blob/master/MemN2N/data_utils.py#L56
re.split('(\W+)?', sent) -> re.split('([^A-Za-z0-9-]+)', sent)
add - and the ? is unnecessary.

P.S. should also on data_utils in other folders.

Answer 4 · 2017-12-09T09:48:09.000Z

I also found that the accuracy won't be affected by this minor thing based on my experiment. It is reasonable as 'middle-aged' could be seen as the sum of 'middle' and 'aged' in the embedding space. I agree with this.