Aidenzich/HelloBERTopic

Error when using a Simplified Chinese corpus

Opened this issue · 4 comments

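Hello! Your project has helped me a lot. I cloned it, but when I switched to a Simplified Chinese corpus, some topics came out garbled. Could you offer some support? If I want to process a Simplified Chinese corpus with your project, which parts specifically would I need to change? I ask because I noticed that your project makes quite a few modifications relative to the original example.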

Hi! Thanks for raising this. Could you share the data you are using? My guess is that the data may be in an encoding other than UTF-8.

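Thank you very much for your reply. I am using UTF-8 encoding, and the error only ever appears in topic 3, which confuses me a lot. I am sharing the data I used through the following GitHub repository: https://github.com/cauzp/data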

OK, I'll look into it.

Hello, I checked your file with the code below, and some of its lines contain content that is not UTF-8 encoded:

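import chardet


def detect_line_encoding(line):
    """Return chardet's guess for one raw line (a dict including 'encoding' and 'confidence')."""
    return chardet.detect(line)


file_path = 'YOUR_FILE_PATH.csv'

# Open in binary mode so each line reaches chardet as undecoded bytes.
with open(file_path, 'rb') as f:
    for line_number, line in enumerate(f, start=1):
        encoding_info = detect_line_encoding(line)
        encoding = encoding_info['encoding']
        confidence = encoding_info['confidence']
        # Report every line whose guessed encoding is not exactly 'utf-8'.
        # Note: this also reports pure-ASCII lines, which are valid UTF-8.
        if encoding != 'utf-8':
            print(f"Line {line_number}: Detected encoding: {encoding}, Confidence: {confidence}")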

Results:

Line 1: Detected encoding: ascii, Confidence: 1.0
Line 6: Detected encoding: None, Confidence: 0.0
Line 93: Detected encoding: None, Confidence: 0.0
Line 104: Detected encoding: None, Confidence: 0.0
Line 109: Detected encoding: None, Confidence: 0.0
Line 113: Detected encoding: None, Confidence: 0.0
Line 144: Detected encoding: None, Confidence: 0.0
Line 155: Detected encoding: None, Confidence: 0.0
Line 162: Detected encoding: None, Confidence: 0.0
Line 177: Detected encoding: None, Confidence: 0.0
Line 366: Detected encoding: None, Confidence: 0.0
Line 369: Detected encoding: None, Confidence: 0.0
Line 373: Detected encoding: None, Confidence: 0.0
Line 404: Detected encoding: None, Confidence: 0.0
Line 448: Detected encoding: None, Confidence: 0.0
Line 452: Detected encoding: None, Confidence: 0.0
Line 474: Detected encoding: None, Confidence: 0.0
Line 497: Detected encoding: None, Confidence: 0.0
Line 508: Detected encoding: None, Confidence: 0.0
Line 598: Detected encoding: None, Confidence: 0.0
Line 647: Detected encoding: None, Confidence: 0.0
Line 701: Detected encoding: None, Confidence: 0.0
Line 710: Detected encoding: None, Confidence: 0.0
Line 746: Detected encoding: None, Confidence: 0.0
Line 759: Detected encoding: None, Confidence: 0.0
Line 770: Detected encoding: None, Confidence: 0.0
Line 819: Detected encoding: None, Confidence: 0.0
Line 827: Detected encoding: None, Confidence: 0.0
Line 835: Detected encoding: None, Confidence: 0.0
Line 889: Detected encoding: None, Confidence: 0.0
Line 892: Detected encoding: TIS-620, Confidence: 0.20892844569841748
Line 959: Detected encoding: None, Confidence: 0.0
Line 990: Detected encoding: None, Confidence: 0.0
Line 992: Detected encoding: None, Confidence: 0.0
Line 1043: Detected encoding: None, Confidence: 0.0
Line 1056: Detected encoding: None, Confidence: 0.0
Line 1105: Detected encoding: None, Confidence: 0.0
Line 1290: Detected encoding: None, Confidence: 0.0
Line 1365: Detected encoding: None, Confidence: 0.0
Line 1404: Detected encoding: None, Confidence: 0.0
Line 1426: Detected encoding: None, Confidence: 0.0
Line 1463: Detected encoding: None, Confidence: 0.0
Line 1490: Detected encoding: None, Confidence: 0.0
Line 1507: Detected encoding: Windows-1252, Confidence: 0.2509375
Line 1528: Detected encoding: None, Confidence: 0.0
Line 1545: Detected encoding: None, Confidence: 0.0
Line 1551: Detected encoding: None, Confidence: 0.0
Line 1554: Detected encoding: None, Confidence: 0.0
Line 1575: Detected encoding: None, Confidence: 0.0
Line 1581: Detected encoding: None, Confidence: 0.0
Line 1591: Detected encoding: None, Confidence: 0.0
Line 1597: Detected encoding: None, Confidence: 0.0
Line 1602: Detected encoding: None, Confidence: 0.0
Line 1626: Detected encoding: None, Confidence: 0.0
Line 1711: Detected encoding: None, Confidence: 0.0
Line 1712: Detected encoding: None, Confidence: 0.0
Line 1713: Detected encoding: TIS-620, Confidence: 0.30224981811037727
Line 1734: Detected encoding: None, Confidence: 0.0
Line 1758: Detected encoding: None, Confidence: 0.0
Line 1808: Detected encoding: None, Confidence: 0.0
Line 1817: Detected encoding: None, Confidence: 0.0
Line 1827: Detected encoding: None, Confidence: 0.0
Line 1845: Detected encoding: None, Confidence: 0.0
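
A per-line guess from chardet is heuristic, and an `ascii` result (as on line 1) is still valid UTF-8, since ASCII is a subset of it. A more direct test is to attempt a strict UTF-8 decode of each line and record the ones that fail. The sketch below shows that idea; the helper name `find_non_utf8_lines` and the sample bytes are made up for illustration and are not part of the project.

```python
from typing import List


def find_non_utf8_lines(raw: bytes) -> List[int]:
    """Return the 1-based numbers of lines that are not valid UTF-8."""
    bad_lines = []
    for line_number, line in enumerate(raw.split(b"\n"), start=1):
        try:
            line.decode("utf-8")  # strict error handling is the default
        except UnicodeDecodeError:
            bad_lines.append(line_number)
    return bad_lines


# Hypothetical sample: line 2 is GBK-encoded, the other lines are valid UTF-8.
sample = (
    "標題\n".encode("utf-8")
    + "简体中文\n".encode("gbk")
    + b"plain ascii\n"
)
print(find_non_utf8_lines(sample))  # prints [2]
```

For a real file, read it with `open(path, 'rb').read()` and pass the bytes in. Pure-ASCII lines are not reported by this check.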

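Data scraped directly from the web, or older Chinese data, often contains content in non-UTF-8 encodings. Since most language models are trained on UTF-8 data, you first need to make sure the data is decoded correctly and converted to UTF-8.
You could start by removing these lines to see whether the garbled text goes away, and then work out how to convert them to UTF-8 as well.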
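
Both steps can be sketched together: keep lines that decode as UTF-8, try a fallback encoding for the rest, and drop whatever still fails. The fallback `gb18030` is only an assumption (it is a common legacy encoding for Simplified Chinese text); the actual encoding of the problem lines in this dataset has not been confirmed, and a wrong fallback can decode without error yet produce wrong characters, so the recovered lines should be inspected by eye. The helper name `clean_lines` and the sample bytes are hypothetical.

```python
from typing import List, Tuple


def clean_lines(raw: bytes, fallback: str = "gb18030") -> Tuple[List[str], List[int]]:
    """Decode each line as UTF-8, then as `fallback`; drop lines that fail both.

    Returns the decoded lines and the 1-based numbers of the dropped lines.
    """
    kept = []
    dropped = []
    for line_number, line in enumerate(raw.splitlines(), start=1):
        try:
            kept.append(line.decode("utf-8"))
            continue
        except UnicodeDecodeError:
            pass
        try:
            # Assumption: the line uses the fallback encoding.
            kept.append(line.decode(fallback))
        except UnicodeDecodeError:
            dropped.append(line_number)
    return kept, dropped


# Hypothetical sample: a UTF-8 line, a GB18030 line, and an undecodable line.
sample = "主题\n".encode("utf-8") + "简体\n".encode("gb18030") + b"\xff\n"
kept, dropped = clean_lines(sample)
print(kept, dropped)
```

Writing the kept lines back with `open(out_path, 'w', encoding='utf-8')` gives a file that is UTF-8 throughout, and the dropped line numbers show which rows still need manual attention.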