PTT small dataset
PTT large dataset
punct = set(u''':!),.:;?]}¢'"、。〉》」』】〕〗〞︰︱︳﹐、﹒﹔﹕﹖﹗﹚﹜﹞!),.:;?|}︴︶︸︺︼︾﹀﹂﹄﹏、~¢々‖•·ˇˉ―--′’”([{£¥'"‵〈《「『【〔〖([{£¥〝︵︷︹︻︽︿﹁﹃﹙﹛﹝({“‘-—_…~/ -*➜■─★☆=@<>◉é''')
filter(lambda x: x not in punct, jieba.cut(content))
content = re.sub(r'https?:\/\/.*[\r\n]*', '', content)
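import re
import jieba  # third-party Chinese word segmenter used by process()
def process(content):
    # Drop each URL together with the rest of its line and any trailing line breaks.
    content = re.sub(r'https?:\/\/.*[\r\n]*', '', content)
    # Single characters treated as punctuation; a token equal to one of them is discarded.
    punct = set(u''':!),.:;?]}¢'"、。〉》」』】〕〗〞︰︱︳﹐、﹒﹔﹕﹖﹗﹚﹜﹞!),.:;?|}︴︶︸︺︼︾﹀﹂﹄﹏、~¢々‖•·ˇˉ―--′’”([{£¥'"‵〈《「『【〔〖([{£¥〝︵︷︹︻︽︿﹁﹃﹙﹛﹝({“‘-—_…~/ -*➜■─★☆=@<>◉é''')
    # Segment with jieba, keep the non-punctuation tokens, and join them with spaces.
    cut = filter(lambda x: x not in punct, jieba.cut(content))
    return " ".join(cut)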
df["content"] = df["content"].apply(process)
df
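The URL-stripping regex used in the preprocessing step removes more than the link itself, which may or may not be intended. The sketch below isolates that one substitution so its effect is easy to see. It uses only the standard library; the names `URL_TO_EOL` and `strip_urls` and the sample string are hypothetical and are not part of the notebook.

```python
import re

# Same pattern as in the preprocessing step: a URL, everything after it
# on the same line, and any line breaks that follow.
URL_TO_EOL = re.compile(r'https?:\/\/.*[\r\n]*')

def strip_urls(text):
    """Remove each URL together with the remainder of its line."""
    return URL_TO_EOL.sub('', text)

# Hypothetical sample: text follows the link on the same line.
sample = "see https://example.com/a for details\nnext line"
print(strip_urls(sample))
```

Because `.*` runs to the end of the line, the words after the link (`for details` here) are dropped along with the line break. If only the link should go, a narrower pattern such as `https?://\S+` is one possible alternative, at the cost of leaving the surrounding text and line break in place.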
w2v
gan
w2v
gan
w2v
gan
w2v
fasttext
gan
w2v
fasttext
gan
Face(GPU)