De-tokenized text
Hi @afshinrahimi, @yuan-li, do you still have the raw (un-tokenized) data? Also, which tokenizer did you use for this dataset?
I need to work with the raw form of the text, and de-tokenizing is non-trivial for many languages (e.g. Korean, Japanese).
Hi Thang,
Have you looked at the data? It was a long time ago. What I remember is that we did not use any special tokenizer. For this reason, we didn't use JP, KO, or other languages for which a simple whitespace tokenizer wouldn't work (see the small example after the list). The list of languages used is given in a footnote of the paper:
af, ar, bg, bn, bs, ca, cs, da, de, el, en, es, et, fa, fi, fr, he, hi, hr, hu, id, it, lt, lv, mk, ms, nl, no, pl, pt, ro, ru, sk, sl, sq, sv, ta, tl, tr, uk and vi.
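To make that concrete, here is a minimal sketch (mine, not from the original pipeline) of why plain whitespace splitting works for the listed languages but breaks down for Japanese or Korean:

```python
# Whitespace tokenization: fine for space-delimited languages,
# useless for scripts that don't put spaces between words.
en = "Barack Obama was born in Hawaii ."
ja = "バラク・オバマはハワイで生まれた。"  # example sentence, no spaces between words

print(en.split())  # ['Barack', 'Obama', 'was', 'born', 'in', 'Hawaii', '.']
print(ja.split())  # ['バラク・オバマはハワイで生まれた。'] -- the whole sentence is one "token"
```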
This is the original paper that published the WikiAnn dataset: https://aclanthology.org/P17-1178.pdf. The paper contains a link to the original dataset that no longer works, but you might be able to find it through archive.org's Wayback Machine: https://web.archive.org/.
Also, take a look at the data (not the copy hosted on Hugging Face) here: https://www.dropbox.com/s/12h3qqog6q4bjve/panx_dataset.tar. If there is a ja dataset in there, I believe we did not tokenize it.
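If it helps, here is a minimal sketch for checking whether that tarball contains a ja split. The member naming (e.g. `ja.tar.gz`) is an assumption about the archive layout, so list the members first and adjust:

```python
# Quick check for ja-related members inside the downloaded panx_dataset.tar.
import tarfile

with tarfile.open("panx_dataset.tar") as tar:
    members = tar.getnames()
    print(members[:10])  # inspect the actual layout first
    ja_members = [m for m in members if m.split("/")[-1].startswith("ja")]
    print("ja-related members:", ja_members)
```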
Apologies, I can't help more than this.