error tokenize

Question


HoiBunCa opened this issue 5 years ago · 1 comments

After checking the code in alignment_utils.py, I noticed that bpe_tokens and other_tokens differ for the phrase "gì vậy".
Line 1 is bpe_tokens
Line 2 is other_tokens
Line 3 is ''.join(bpe_tokens)
Line 4 is ''.join(other_tokens)
The phrase "gì vậy" is tokenized into two tokens, "g" and " unk ", which leads to the "cannot align" error.
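
To make the mismatch easier to see, here is a minimal sketch of the kind of comparison described above. It assumes that the alignment step requires ''.join(bpe_tokens) and ''.join(other_tokens) to be identical, which is what the four printed lines suggest; the actual check in alignment_utils.py is not shown in this issue. The helper names `check_alignable` and `first_mismatch` are hypothetical, and the token values (including the "<unk>" spelling of the unknown token) are illustrative only.

```python
def check_alignable(bpe_tokens, other_tokens):
    """Compare the two token lists after joining them without separators.

    Returns (ok, joined_bpe, joined_other). This mirrors the assumed
    precondition for alignment: both joined strings must be equal.
    """
    joined_bpe = "".join(bpe_tokens)
    joined_other = "".join(other_tokens)
    return joined_bpe == joined_other, joined_bpe, joined_other


def first_mismatch(a, b):
    """Return the index of the first differing character, or -1 if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # No differing character in the shared prefix: strings differ only in length.
    return -1 if len(a) == len(b) else min(len(a), len(b))


# Illustrative values: the BPE side has lost characters to an unknown token.
bpe_tokens = ["g", "<unk>"]
other_tokens = ["gì", "vậy"]

ok, joined_bpe, joined_other = check_alignable(bpe_tokens, other_tokens)
if not ok:
    pos = first_mismatch(joined_bpe, joined_other)
    print(f"cannot align: strings diverge at character {pos}")
    print(f"  bpe:   {joined_bpe!r}")
    print(f"  other: {joined_other!r}")
```

Once any character is replaced by an unknown token on the BPE side, the joined strings can no longer match, so a check of this kind fails regardless of how the other tokenizer split the text.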

Other phrases, such as "gì thế" or "gì cơ", do not trigger this error.
I would appreciate any help in resolving this.

Answer 1 · 2020-03-04T16:55:14.000Z

The solution is to use a different tokenizer, e.g. rdrsegmenter.