BIR-Course

Implement similar-word search with Zipf distribution, Porter's algorithm, edit distance, POS tagging, etc.

Project 3 for the Biomedical Information Retrieval course

tags: NCKU python biomedical informatics

Environment

  • macOS
  • python3
  • Flask 2.0.2
  • matplotlib 3.4.3
  • nltk 3.6.5
  • torch 1.10.0
  • sklearn

Requirement

  • Implement word2vec for a set of text documents from PubMed.
  • Choose one of the two basic neural network models to preprocess the text set from the document collection:
    • Continuous Bag of Words (CBOW): use a window of surrounding words to predict the middle word.
    • Skip-gram (SG): use a word to predict the surrounding words within the window. The window size is not limited, and any programming language may be used.

Overview

Flask & Bootstrap 5

Reference: https://hackmd.io/@yyyyuwen/BIR_Project2

Word to Vector

Word2Vec is a model that learns word semantics from large amounts of text in an unsupervised way and is widely used in NLP. It represents meaning with word vectors: words that are semantically similar end up close together in the vector space. An embedding maps words from their original space into a new multi-dimensional space, i.e., the original word space is embedded into a new one. In terms of $f(x) = y$, $f(\cdot)$ can be seen as a space, $x$ is the embedding (the representation), and $y$ is the expected result. The most common starting point is a one-hot encoding over a vocabulary table; my training documents contain roughly 14,000 unique words, so each word becomes a 14,000-dimensional vector of 0s and 1s.
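
As a toy illustration of such one-hot vectors (vocabulary shrunk to three words for clarity; the real table has about 14,000 entries):

# Toy one-hot encoding over a tiny vocabulary
vocab = {'covid-19': 0, 'vaccine': 1, 'infection': 2}
one_hot = [0] * len(vocab)
one_hot[vocab['vaccine']] = 1  # 'vaccine' -> [0, 1, 0]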

CBOW & Skip-gram

The CBOW and Skip-gram models are very similar; the main difference is that CBOW uses the surrounding words to predict the current word, while Skip-gram uses the current word to predict the surrounding words. The window size sets the extent of the context (e.g., window size = 1 means taking one word before and one after).

Model Architecture

SkipGram_Model(
  (embeddings): Embedding(14086, 600, max_norm=1)
  (linear): Linear(in_features=600, out_features=14086, bias=True)
)
# Input Layer : 1 x 14,086
# Hidden Layer : 14,086 x 600
# Output Layer : 600 x 14,086
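
The printout above corresponds to a module like the following minimal PyTorch sketch (sizes taken from the printout; the forward pass is my assumption, not necessarily the project's exact code):

import torch.nn as nn

class SkipGram_Model(nn.Module):
    def __init__(self, vocab_size=14086, embed_dim=600):
        super().__init__()
        # max_norm=1 keeps every embedding vector inside the unit ball
        self.embeddings = nn.Embedding(vocab_size, embed_dim, max_norm=1)
        self.linear = nn.Linear(embed_dim, vocab_size)

    def forward(self, center):
        # center: LongTensor of word indices -> logits over the whole vocabulary
        return self.linear(self.embeddings(center))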

Data Pre-processing

1. Read the files

Read 4,000 .xml files and extract the Title, Label, and AbstractText fields.
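
A rough sketch of the extraction (element names assumed from the standard PubMed XML layout; Label is read as an attribute of each AbstractText element):

import xml.etree.ElementTree as ET

def read_article(path):
    root = ET.parse(path).getroot()
    title = root.findtext('.//ArticleTitle')
    # Each AbstractText may carry a Label attribute such as "METHODS"
    abstract = [(e.get('Label'), e.text or '') for e in root.iter('AbstractText')]
    return title, abstract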

2. Split the articles into sentences and remove stopwords

import re
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize

stop_words = set(stopwords.words('english'))
sentences = sent_tokenize(text)
# Keep only lowercase letters, digits, and hyphens
sentences = [re.sub(r'[^a-z0-9-]', ' ', sent.lower()) for sent in sentences]
clean_words = []
for sent in sentences:
    # Drop purely numeric tokens (including hyphenated numbers)
    words = [word for word in sent.split() if not word.replace('-', '').isnumeric()]
    words = [word for word in words if word not in stop_words]  # stopword removal
    clean_words.append(' '.join(words))

3. Split the sentences into words

# Split each cleaned sentence into word tokens
tokens = [sent.split() for sent in clean_words]

4. Lemmatizer

Reference: Lemmatization in Python

First, POS-tag each word, then restore it to its base form.

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

lemmatizer = WordNetLemmatizer()
# Lemmatize every token, using its POS tag for an accurate base form
lemma_word = [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for sentence in tokens for w in sentence]

5. Build the vocabulary

Build a vocabulary table from the words, assigning each one its own index, as in the sample and sketch below.

{'map': 4314, 'html': 4315, 'interchange': 4316, 'vtm': 4317, 'restrictive': 4318, 'pre-analytic': 4319, 'disadvantageous': 4320, 'unidirectional': 4321, 'wiley': 4322, 'periodical': 4323, 'alternate': 4324, 'low-throughput': 4325}
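
A minimal sketch of how such a table can be built from the tokens of step 3 (the exact numbering scheme is an assumption):

# Assign each previously unseen word the next free index
vocab = {}
for sentence in tokens:
    for word in sentence:
        if word not in vocab:
            vocab[word] = len(vocab)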

6. Build training pairs

Turn the vocabulary indices into (center, context) pairs with window_size = 2.

[(0, 1), (0, 2), (1, 0), (1, 2), (1, 3), (2, 0), (2, 1), (2, 3), (2, 4), (3, 1), (3, 2), (3, 4), (3, 5), (4, 2), (4, 3), (4, 5), (4, 6), (5, 3), (5, 4), (5, 6)]
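
A sketch that reproduces pairs like those above for one sentence of consecutive indices (assuming every in-window neighbour becomes a context word):

def make_pairs(indices, window_size=2):
    # Pair each center word with every context word inside the window
    pairs = []
    for i, center in enumerate(indices):
        lo, hi = max(0, i - window_size), min(len(indices), i + window_size + 1)
        pairs.extend((center, indices[j]) for j in range(lo, hi) if j != i)
    return pairs

make_pairs(list(range(7)))  # yields the list shown above (truncated at (5, 6))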

Visualization

PCA (Principal Components Analysis)

sklearn.decomposition.PCA is used to reduce the word vectors to two dimensions; highly related words cluster together. Input word: covid-19
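
A minimal sketch of that projection (assuming the trained model and the vocab table from the steps above; the query words here are placeholders):

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

vectors = model.embeddings.weight.detach().cpu().numpy()
points = PCA(n_components=2).fit_transform(vectors)
for word in ('covid-19', 'vaccine', 'pandemic'):  # placeholder query words
    x, y = points[vocab[word]]
    plt.scatter(x, y)
    plt.annotate(word, (x, y))
plt.show()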

Demo

The front end follows: https://hackmd.io/@yyyyuwen/BIR_Project2

Click Skip Gram and enter a word; the app lists the 15 words most related to it. Input word: covid-19
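
The lookup presumably ranks words by similarity in the embedding space; a hypothetical sketch using cosine similarity (the app's actual scoring may differ):

import numpy as np

def most_similar(word, vectors, vocab, k=15):
    # Cosine similarity between the query vector and every word vector
    inv_vocab = {i: w for w, i in vocab.items()}
    v = vectors[vocab[word]]
    sims = vectors @ v / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(v) + 1e-9)
    ranked = np.argsort(-sims)
    return [inv_vocab[i] for i in ranked if inv_vocab[i] != word][:k]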

Reference