MiuLab/Taiwan-LLM

Poor quality of the dataset


I would like to know where the dataset https://huggingface.co/datasets/yentinglin/zh_TW_c4 was collected from.
This dataset appears to contain a lot of advertisements, spam, and biased text; using such data for training may reduce the model's capabilities.

zh_TW_c4 was derived from https://huggingface.co/datasets/mc4 with language filtering.
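For reference, a minimal sketch of what such language filtering could look like. The actual zh_TW_c4 pipeline is not documented in this thread; the OpenCC-based Traditional/Simplified heuristic and its threshold below are assumptions:

```python
# Minimal sketch of mC4 language filtering -- NOT the actual zh_TW_c4
# pipeline. Assumes a Traditional-vs-Simplified heuristic via OpenCC:
# if converting Traditional -> Simplified changes enough characters,
# the page was probably written in Traditional Chinese.
from datasets import load_dataset  # pip install datasets
from opencc import OpenCC          # pip install opencc-python-reimplemented

t2s = OpenCC("t2s")  # Traditional -> Simplified converter

def looks_traditional(text: str, min_changed_ratio: float = 0.02) -> bool:
    """Heuristic Traditional-Chinese detector; the threshold is a guess."""
    if not text:
        return False
    simplified = t2s.convert(text)
    changed = sum(1 for a, b in zip(text, simplified) if a != b)
    return changed / len(text) >= min_changed_ratio

# Stream the Chinese split of mC4 and keep Traditional-looking pages only.
mc4_zh = load_dataset("mc4", "zh", split="train", streaming=True)
zh_tw_pages = (ex for ex in mc4_zh if looks_traditional(ex["text"]))
```

Note that a character-level heuristic like this only distinguishes scripts; it does nothing to remove advertisements or spam, which is consistent with the boilerplate observed in the dataset.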

The quality of the instruction dataset is also puzzling. For example, at the end of the dataset:

{
    "id": "translation_fff9af4e72601abcab2f835f7846f37672812d96",
    "conversations": [
        { "from": "human", "value": "https://news.ltn.com.tw/news/world/paper/1218762" },
        { "from": "gpt", "value": "《中英對照讀新聞》Koko, The World’s Most Famous Signing Gorilla, Dies At 46 全球最知名的手語大猩猩「可可」享壽46歲" }
    ]
},
{
    "id": "translation_fffe65e81bf47973d3343aae8fdad3546e5a69fe",
    "conversations": [
        { "from": "human", "value": "中英對照讀新聞》The Age of Quantum AI 量子人工智慧時代來臨" },
        { "from": "gpt", "value": "https://news.ltn.com.tw/news/world/paper/1349153" }
    ]
}

This doesn't make sense: the user only provided a news link without any further instructions. I suspect this is what leads to strange responses like these:

[screenshots of the model's responses]

The actual news title is "《中英對照讀新聞》Parrot teaches boy of 4 to talk鸚鵡教4歲男童說話". The model's response changes every session, even for non-existent links. I think this kind of data could lead to significant and dangerous hallucinations. I'm wondering where the instruction dataset comes from.

Further data cleaning may be needed; otherwise, the frequency of nonsensical responses will narrow down the model's usefulness...
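If another cleaning pass is done, a cheap first filter would be to drop pairs whose human turn is a bare URL. A minimal sketch against the schema in the excerpt above (the file name and the URL regex are illustrative assumptions, not part of any actual pipeline):

```python
# Hypothetical cleaning pass over the instruction data shown above: drop
# any example whose human turn is nothing but a bare URL, since such pairs
# teach the model to invent article titles. The file name is illustrative;
# the schema matches the JSON excerpt in this thread.
import json
import re

URL_ONLY = re.compile(r"\s*https?://\S+\s*")

def is_degenerate(example: dict) -> bool:
    """True if any human turn in the conversation is only a URL."""
    return any(
        turn["from"] == "human" and URL_ONLY.fullmatch(turn["value"])
        for turn in example["conversations"]
    )

with open("instructions.json", encoding="utf-8") as f:
    data = json.load(f)

cleaned = [ex for ex in data if not is_degenerate(ex)]
print(f"kept {len(cleaned)} of {len(data)} examples")
```

A stricter pass could also drop pairs where the gpt turn is a bare URL, like the second example in the excerpt above.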