mtkdl

Primary Language: Jupyter Notebook

Deep Learning

Word2Vec

Datasets

PTT small dataset

PTT large dataset

Punctuation removal

punct = set(u''':!),.:;?]}¢'"、。〉》」』】〕〗〞︰︱︳﹐、﹒﹔﹕﹖﹗﹚﹜﹞!),.:;?|}︴︶︸︺︼︾﹀﹂﹄﹏、~¢々‖•·ˇˉ―--′’”([{£¥'"‵〈《「『【〔〖([{£¥〝︵︷︹︻︽︿﹁﹃﹙﹛﹝({“‘-—_…~/ -*➜■─★☆=@<>◉é''')
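import jieba  # third-party Chinese word segmenter
# Segment the text and keep only the tokens that are not in the punctuation set
tokens = [tok for tok in jieba.cut(content) if tok not in punct]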
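
To see the filter in isolation, here is a minimal sketch that does not need jieba. The `tokens` list is a hand-written stand-in for what `jieba.cut` might yield, and `small_punct` is a shortened, hypothetical subset of the punctuation set above; neither comes from the repository.

```python
# Hypothetical, shortened punctuation set; the full set is defined above.
small_punct = {",", "。", "!", "?", "、"}

# Hand-written stand-in for the token stream produced by jieba.cut(content).
tokens = ["今天", "天氣", "很", "好", ",", "出去", "玩", "!"]

# Keep only the tokens that are not punctuation.
kept = [tok for tok in tokens if tok not in small_punct]
print(kept)
```

The membership test is applied per token, so a token made of several punctuation characters would pass through unless that exact string is itself in the set.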

URL regex

# Remove each URL together with the rest of its line and the trailing line break
content = re.sub(r'https?://.*[\r\n]*', '', content)
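
The following sketch shows what the URL pattern above actually removes, on a made-up sample string. The narrower `https?://\S+` variant is an assumption added for comparison, not a pattern from the repository.

```python
import re

# Made-up sample: a URL in the middle of a line, followed by a second line.
text = "推薦這篇 https://example.com/post 真的很好看\n下一行內容"

# Pattern from above ('\/' and '/' are equivalent in a Python regex).
# '.*' runs to the end of the line, so the text after the URL on the same
# line is removed too, along with the trailing line break.
greedy = re.sub(r'https?://.*[\r\n]*', '', text)

# Narrower alternative (an assumption): stop at the first whitespace
# so the rest of the line survives.
narrow = re.sub(r'https?://\S+', '', text)

print(repr(greedy))
print(repr(narrow))
```

Which behavior is preferable depends on the data: the narrower variant keeps more text, but it would also swallow any Chinese text that follows a URL with no space in between.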

Preprocessing

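import re
import jieba  # third-party Chinese word segmenter

def process(content):
    # Remove each URL together with the rest of its line
    content = re.sub(r'https?://.*[\r\n]*', '', content)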
    punct = set(u''':!),.:;?]}¢'"、。〉》」』】〕〗〞︰︱︳﹐、﹒﹔﹕﹖﹗﹚﹜﹞!),.:;?|}︴︶︸︺︼︾﹀﹂﹄﹏、~¢々‖•·ˇˉ―--′’”([{£¥'"‵〈《「『【〔〖([{£¥〝︵︷︹︻︽︿﹁﹃﹙﹛﹝({“‘-—_…~/ -*➜■─★☆=@<>◉é''')
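    # Segment with jieba and drop punctuation tokens
    cut = (tok for tok in jieba.cut(content) if tok not in punct)
    # Join with spaces so the result can be split on whitespace later
    return " ".join(cut)
# Assumes df is a pandas DataFrame with a "content" column
df["content"] = df["content"].apply(process)
df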
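
Because `process` returns space-joined tokens, splitting on whitespace recovers one token list per post. This is the shape that libraries such as gensim's `Word2Vec` accept as `sentences`. The sketch below uses only the standard library; `processed` is a hypothetical stand-in for `df["content"]` after preprocessing, and the frequency count is an illustrative extra step rather than part of the repository.

```python
from collections import Counter

# Hypothetical stand-in for df["content"] after process() has been applied.
processed = ["今天 天氣 很 好", "天氣 不 好 不 出門"]

# One list of tokens per post.
sentences = [line.split() for line in processed]

# Token frequencies, which can help when choosing a minimum-count cutoff.
freq = Counter(tok for sent in sentences for tok in sent)

print(sentences)
print(freq.most_common(3))
```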

Practice Colab (0610)

w2v

gan

Practice Colab (0615)

w2v

gan

Practice Colab (1215)

w2v

gan

Practice Colab (0419)

w2v

fasttext

gan

Practice Colab (0516)

w2v

fasttext

gan

Face(GPU)