Poor quality of the dataset
Closed this issue · 3 comments
I would like to know where the dataset https://huggingface.co/datasets/yentinglin/zh_TW_c4 was collected from?
This dataset appears to contain a lot of advertisements, spam and bias, and such data used for training may reduce the capabilities of the model.
zh_TW_c4 was from https://huggingface.co/datasets/mc4 with language filtering
The quality of the instruction dataset is also puzzling. For example at the end of the dataset:
{
"id": "translation_fff9af4e72601abcab2f835f7846f37672812d96",
"conversations": [
{ "from": "human", "value": "https://news.ltn.com.tw/news/world/paper/1218762" },
{ "from": "gpt", "value": "《中英對照讀新聞》Koko, The World’s Most Famous Signing Gorilla, Dies At 46 全球最知名的手語大猩猩「可可」享壽46歲" }
]
},
{
"id": "translation_fffe65e81bf47973d3343aae8fdad3546e5a69fe",
"conversations": [
{ "from": "human", "value": "中英對照讀新聞》The Age of Quantum AI 量子人工智慧時代來臨" },
{ "from": "gpt", "value": "https://news.ltn.com.tw/news/world/paper/1349153" }
]
}
This doesn't make sense, since the user only provided a news link without any further instructions. I think this led to the strange response like:
The actual news title is "《中英對照讀新聞》Parrot teaches boy of 4 to talk鸚鵡教4歲男童說話". This response changes every session, even for non-existent links. I think this kind of data could lead to significant and dangerous illusions. I'm wondering where the instruction dataset comes from?
Maybe a further data cleaning is needed, otherwise the frequency of nonsensical response narrow down the usage...