Dataset encoding format

Question

foreverlove944 opened this issue 9 months ago · 1 comment

What encoding is used for the dataset you provided? I opened it as UTF-8. English characters display correctly, but Russian and other non-English text comes out garbled.

Answer 1 · 2024-06-12T01:41:59.000Z

The contexts/paragraphs were taken as-is from the original source datasets, so any encoding problems come from those sources. However, I did apply ftfy at runtime; see commaqa/inference/dataset_readers.py for an example. You might want to give it a try.
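
For anyone landing here with the same problem, below is a minimal sketch of what applying ftfy at load time could look like. It is not the code from commaqa/inference/dataset_readers.py; the function names `fix_with_stdlib` and `repair` are made up for illustration. `ftfy.fix_text` is the library's general-purpose repair call. The standard-library fallback only handles one common case (UTF-8 bytes that were wrongly decoded as Windows-1252 or Latin-1) and is an assumption about what went wrong, not a diagnosis of this dataset.

```python
# Hypothetical helpers for repairing mojibake in dataset paragraphs.
try:
    import ftfy  # third-party: pip install ftfy
except ImportError:  # fall back to the stdlib-only repair below
    ftfy = None


def fix_with_stdlib(text: str) -> str:
    """Undo UTF-8 text that was mistakenly decoded as cp1252 or Latin-1.

    Returns the input unchanged if neither round trip succeeds, which is
    what happens for text that was already decoded correctly.
    """
    for wrong_encoding in ("cp1252", "latin-1"):
        try:
            return text.encode(wrong_encoding).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue
    return text


def repair(text: str) -> str:
    """Prefer ftfy when it is installed; otherwise use the simple fallback."""
    if ftfy is not None:
        return ftfy.fix_text(text)
    return fix_with_stdlib(text)
```

In practice you would call `repair(paragraph)` on each context string right after reading it from the JSON file, opened with `encoding="utf-8"`. If the text still looks wrong after that, the damage probably happened in a different way than the fallback assumes.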