StonyBrookNLP/ircot

Dataset encoding format

foreverlove944 opened this issue · 1 comment

What encoding is used for the dataset you provided? I opened it as UTF-8. English characters display correctly, but Russian and other non-English text comes out garbled.
(Screenshot 2024-04-03 205121)

The contexts/paragraphs were taken from the original source datasets. However, I did apply ftfy at runtime; see commaqa/inference/dataset_readers.py for an example. You might want to give it a try.
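
For anyone landing here with the same problem, a minimal sketch of what applying ftfy to a string might look like. This is not copied from commaqa/inference/dataset_readers.py; `repair_text` and the sample string are hypothetical, and the sketch assumes the garbling is UTF-8 text that was mis-decoded as Windows-1252, which may not be what the screenshot shows. `ftfy.fix_text` is ftfy's general-purpose entry point; the fallback branch is only a rough stand-in for when ftfy is not installed.

```python
def repair_text(text: str) -> str:
    """Try to undo mojibake in a single string (hypothetical helper)."""
    try:
        import ftfy  # third-party: pip install ftfy
    except ImportError:
        # Assumption: the text is UTF-8 that was wrongly decoded as cp1252.
        # This round trip is far less general than ftfy.
        try:
            return text.encode("cp1252").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Not this kind of mojibake (or already clean): leave it alone.
            return text
    return ftfy.fix_text(text)


# Simulate the problem: Russian text encoded as UTF-8, then read as cp1252.
garbled = "Привет".encode("utf-8").decode("cp1252")
repaired = repair_text(garbled)
```

Here `garbled` is an unreadable run of accented Latin characters and `repaired` is the original Cyrillic word. Text that is already clean should pass through unchanged, so calling this on every paragraph as it is loaded should be safe.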