Error when using Simplified Chinese material
cauzp opened this issue · 4 comments
cauzp commented
Aidenzich commented
Hi! Thanks for raising this issue. Could you share the data you are using? I suspect the cause may be that the data is not encoded as "utf-8".
cauzp commented
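Thank you very much for your reply. I am using UTF-8 encoding, and the error only appears in topic3, which I find very confusing. I am sharing the data I used through the following GitHub repository: https://github.com/cauzp/data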
Aidenzich commented
OK, I'll look into it.
Aidenzich commented
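Hello, I checked your file with the code below, and some of its lines contain data that is not UTF-8 encoded:

import chardet

def detect_line_encoding(line):
    # chardet.detect takes raw bytes and returns a dict with 'encoding' and 'confidence'
    return chardet.detect(line)

file_path = 'YOUR_FILE_PATH.csv'

# Open in binary mode so each line reaches chardet as undecoded bytes
with open(file_path, 'rb') as f:
    for line_number, line in enumerate(f, start=1):
        encoding_info = detect_line_encoding(line)
        encoding = encoding_info['encoding']
        confidence = encoding_info['confidence']
        # Report every line whose guess is not exactly 'utf-8' (pure-ASCII lines are reported too)
        if encoding != 'utf-8':
            print(f"Line {line_number}: Detected encoding: {encoding}, Confidence: {confidence}")
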
Results:
Line 1: Detected encoding: ascii, Confidence: 1.0
Line 6: Detected encoding: None, Confidence: 0.0
Line 93: Detected encoding: None, Confidence: 0.0
Line 104: Detected encoding: None, Confidence: 0.0
Line 109: Detected encoding: None, Confidence: 0.0
Line 113: Detected encoding: None, Confidence: 0.0
Line 144: Detected encoding: None, Confidence: 0.0
Line 155: Detected encoding: None, Confidence: 0.0
Line 162: Detected encoding: None, Confidence: 0.0
Line 177: Detected encoding: None, Confidence: 0.0
Line 366: Detected encoding: None, Confidence: 0.0
Line 369: Detected encoding: None, Confidence: 0.0
Line 373: Detected encoding: None, Confidence: 0.0
Line 404: Detected encoding: None, Confidence: 0.0
Line 448: Detected encoding: None, Confidence: 0.0
Line 452: Detected encoding: None, Confidence: 0.0
Line 474: Detected encoding: None, Confidence: 0.0
Line 497: Detected encoding: None, Confidence: 0.0
Line 508: Detected encoding: None, Confidence: 0.0
Line 598: Detected encoding: None, Confidence: 0.0
Line 647: Detected encoding: None, Confidence: 0.0
Line 701: Detected encoding: None, Confidence: 0.0
Line 710: Detected encoding: None, Confidence: 0.0
Line 746: Detected encoding: None, Confidence: 0.0
Line 759: Detected encoding: None, Confidence: 0.0
Line 770: Detected encoding: None, Confidence: 0.0
Line 819: Detected encoding: None, Confidence: 0.0
Line 827: Detected encoding: None, Confidence: 0.0
Line 835: Detected encoding: None, Confidence: 0.0
Line 889: Detected encoding: None, Confidence: 0.0
Line 892: Detected encoding: TIS-620, Confidence: 0.20892844569841748
Line 959: Detected encoding: None, Confidence: 0.0
Line 990: Detected encoding: None, Confidence: 0.0
Line 992: Detected encoding: None, Confidence: 0.0
Line 1043: Detected encoding: None, Confidence: 0.0
Line 1056: Detected encoding: None, Confidence: 0.0
Line 1105: Detected encoding: None, Confidence: 0.0
Line 1290: Detected encoding: None, Confidence: 0.0
Line 1365: Detected encoding: None, Confidence: 0.0
Line 1404: Detected encoding: None, Confidence: 0.0
Line 1426: Detected encoding: None, Confidence: 0.0
Line 1463: Detected encoding: None, Confidence: 0.0
Line 1490: Detected encoding: None, Confidence: 0.0
Line 1507: Detected encoding: Windows-1252, Confidence: 0.2509375
Line 1528: Detected encoding: None, Confidence: 0.0
Line 1545: Detected encoding: None, Confidence: 0.0
Line 1551: Detected encoding: None, Confidence: 0.0
Line 1554: Detected encoding: None, Confidence: 0.0
Line 1575: Detected encoding: None, Confidence: 0.0
Line 1581: Detected encoding: None, Confidence: 0.0
Line 1591: Detected encoding: None, Confidence: 0.0
Line 1597: Detected encoding: None, Confidence: 0.0
Line 1602: Detected encoding: None, Confidence: 0.0
Line 1626: Detected encoding: None, Confidence: 0.0
Line 1711: Detected encoding: None, Confidence: 0.0
Line 1712: Detected encoding: None, Confidence: 0.0
Line 1713: Detected encoding: TIS-620, Confidence: 0.30224981811037727
Line 1734: Detected encoding: None, Confidence: 0.0
Line 1758: Detected encoding: None, Confidence: 0.0
Line 1808: Detected encoding: None, Confidence: 0.0
Line 1817: Detected encoding: None, Confidence: 0.0
Line 1827: Detected encoding: None, Confidence: 0.0
Line 1845: Detected encoding: None, Confidence: 0.0
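Chinese data that was scraped directly or comes from older sources often contains content in non-UTF-8 encodings. Since most language models are trained on UTF-8 encoded data, you first need to make sure the data is decoded correctly and converted to UTF-8.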
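As a rough sketch of that conversion step: the snippet below tries a strict UTF-8 decode on each raw line and only falls back to another codec when that fails. The fallback `gb18030` is purely my assumption (it is a common legacy encoding for Simplified Chinese text) and has not been confirmed for this dataset. The helper names `decode_line` and `convert_file_to_utf8` are hypothetical.

```python
def decode_line(raw: bytes, fallbacks=("gb18030",)):
    """Return (text, encoding_used) for one raw line.

    The fallback list is an assumption; adjust it to whatever
    encodings the source data is actually expected to contain.
    """
    try:
        # Strict decode: raises if the bytes are not valid UTF-8
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    for enc in fallbacks:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Last resort: keep the line, replacing undecodable bytes with U+FFFD
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


def convert_file_to_utf8(src_path, dst_path):
    """Rewrite src_path as UTF-8 at dst_path, line by line."""
    # newline="" keeps the original line endings untouched on write
    with open(src_path, "rb") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for raw in src:
            text, _ = decode_line(raw)
            dst.write(text)
```

A fallback decode can succeed and still produce the wrong characters if the guessed codec is not the real one, so it is worth spot-checking a few converted lines by eye.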
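You could first try removing these lines to see whether the garbled text disappears, and afterwards work out how to convert them to UTF-8 as well.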
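A minimal sketch of that "remove the lines first" experiment, under my own assumptions. It uses a strict UTF-8 decode instead of chardet to decide which lines to drop, so pure-ASCII lines such as a header row are kept, and the set of dropped lines may differ from the list above. The function names are hypothetical.

```python
def split_by_utf8_validity(raw_lines):
    """Separate raw byte lines into (valid_utf8_lines, invalid_lines).

    invalid_lines holds (line_number, raw_bytes) pairs so the dropped
    rows can be inspected or converted later.
    """
    good, bad = [], []
    for line_number, raw in enumerate(raw_lines, start=1):
        try:
            raw.decode("utf-8")  # strict: raises on invalid byte sequences
            good.append(raw)
        except UnicodeDecodeError:
            bad.append((line_number, raw))
    return good, bad


def drop_invalid_lines(src_path, dst_path):
    """Copy src_path to dst_path, skipping lines that are not valid UTF-8.

    Returns the 1-based line numbers that were skipped.
    """
    with open(src_path, "rb") as src:
        good, bad = split_by_utf8_validity(src)
    with open(dst_path, "wb") as dst:
        dst.writelines(good)
    return [line_number for line_number, _ in bad]
```

Calling `drop_invalid_lines('YOUR_FILE_PATH.csv', 'cleaned.csv')` (the output name is just a placeholder) writes the filtered copy and returns the skipped line numbers, which can then be compared against the chardet report.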