PubMed-Knowl-Graph

Overview

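This project takes roughly 200,000 records of medical literature, runs them through a QA model, and presents each question (Question) and its answers (Answer) as a Knowledge Graph.

This repo walks through data preprocessing with Spark, setting up the QA model, and presenting the final results as a Neo4j graph.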

Data Analysis and Preprocessing

Downloading the Data

For the dataset authors' introduction to the PubMed 200k RCT dataset, see: https://github.com/Franck-Dernoncourt/pubmed-rct
```
!wget https://raw.githubusercontent.com/Franck-Dernoncourt/pubmed-rct/master/PubMed_20k_RCT/train.txt
```
Inspecting the Data
```
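from pyspark import SparkContext
sc = SparkContext.getOrCreate()  # reuse the notebook's context if one already exists
pubmed = sc.textFile("./train.txt")
pubmed.count()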
```
The file contains roughly 210,000 lines in total.
```
!head -n 20 train.txt
```
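The output shows that a complete record consists of an Abstract and a Sentence.

The Abstract field is a section label: it says which of ['OBJECTIVE', 'METHODS', 'RESULTS', 'CONCLUSIONS', 'BACKGROUND'] the sentence after it belongs to.

So before the data goes to the QA model, we drop the lines that hold only a # followed by a number (the abstract IDs) and the empty lines, and split each remaining record into [Abstract, Sentence].

Cleaning the Data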

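def separate(content):
  # Data lines hold a label, a tab, then the sentence; ID lines and blank lines have no tab.
  try:
    label, sentence = content.split('\t')
    return label, sentence
  except ValueError:
    # Raised when the split does not yield exactly two parts.
    return "None"

Real_Content = pubmed.map(separate).filter(lambda x: x != "None")
Real_Content.count()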

That is about 30,000 fewer records than before cleaning.
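
To see what the cleaning step does without starting Spark, here is a minimal plain-Python sketch. The sample lines are made up to mimic the layout described above (they are not copied from train.txt), and this version of `separate` returns `None` instead of the string "None", which is an illustrative variation rather than the code used in this repo.

```python
# Hypothetical sample lines in the layout described above (not real data).
sample_lines = [
    "###00000001",                                 # abstract ID line
    "OBJECTIVE\tTo compare drug A with drug B .",  # label, tab, sentence
    "METHODS\tPatients were randomly assigned .",
    "",                                            # blank line between abstracts
]


def separate(line):
    """Return (label, sentence) for a data line, or None for ID and blank lines."""
    parts = line.split("\t")
    if len(parts) != 2:
        return None
    return parts[0], parts[1]


# Same map-then-filter shape as the Spark version, on a plain list.
records = [r for r in map(separate, sample_lines) if r is not None]
print(records)
```

Only the two lines that contain a tab survive, which is the same effect the Spark filter has on the full file.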

Next, inspect the cleaned data:

Real_Content.take(20)

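Each record now contains one Label and one Sentence. With the data prepared, we can move on to the QA model.

QA Model Setting

This project uses roberta-base-squad2, a reading-comprehension model.

The QA model takes a question plus a passage or sentence as input, and returns the answer to that question as output.

A simple example: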

Question : what is the better therapy for HIV?
Context : In patients with advanced HIV disease , zidovudine appears to be more effective than didanosine as initial therapy ; however , some patients with advanced HIV disease may benefit from a change to didanosine therapy after as little as 8 to 16 weeks of therapy with zidovudine

Model Output : zidovudine

The example shows that, given a passage or sentence relevant to the question, the model is able to answer the question correctly.
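
The call pattern can be sketched as follows, assuming the Hugging Face `transformers` library is installed. The Hub id `deepset/roberta-base-squad2` is an assumption on my part (the text above only names roberta-base-squad2), and `load_qa` and `ask` are hypothetical helper names introduced for this sketch.

```python
def load_qa(model_name="deepset/roberta-base-squad2"):
    """Build a question-answering pipeline (the model is downloaded on first use).

    The model id is an assumption; swap in whichever checkpoint you actually use.
    """
    # Imported lazily so the helpers below can be used without transformers installed.
    from transformers import pipeline
    return pipeline("question-answering", model=model_name)


def ask(qa, question, context):
    """Run one question against one context and return only the answer text."""
    result = qa(question=question, context=context)
    return result["answer"].strip()
```

Usage would look like `ask(load_qa(), question, context)` with the Question and Context from the example above. Because `ask` accepts any callable that returns a dict with an `answer` key, it can also be exercised offline with a stand-in function.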

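So our initial idea is to first select the passages that contain words related to the question, and pass them to the model together with the question.

The model's output (the answer) and the question are then turned into a Knowledge Graph.

Suppose the question is: what is the therapy for the HIV?

First, select the records that mention both HIV and therapy to serve as the input context.
Then feed each input (a selected record plus the question) to the model, and store the returned answer, the source passage, and the passage's Label in a DataFrame, which makes it easy to load into the Knowledge Graph database.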
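
The two steps above can be sketched roughly like this. The key names `tokens`, `label` and `sentence` are taken from the Answer-node query later in this README; everything else (the helper names, the sample records, and the `fake_qa` stand-in that replaces the real model so the sketch runs offline) is hypothetical.

```python
def select_contexts(records, keywords):
    """Keep (label, sentence) pairs whose sentence contains every keyword, ignoring case."""
    wanted = [k.lower() for k in keywords]
    return [(label, s) for label, s in records if all(k in s.lower() for k in wanted)]


def build_rows(qa, question, contexts):
    """Ask the question against each context and collect one dict per answer."""
    rows = []
    for label, sentence in contexts:
        result = qa(question=question, context=sentence)
        rows.append({"tokens": result["answer"], "label": label, "sentence": sentence})
    return rows


def fake_qa(question, context):
    """Stand-in for the real model: 'answers' with the first word of the context."""
    return {"answer": context.split()[0]}


records = [  # made-up sample, not real dataset rows
    ("CONCLUSIONS", "Zidovudine appears effective as initial therapy for HIV ."),
    ("METHODS", "Patients with HIV were randomly assigned to two groups ."),
    ("RESULTS", "Blood pressure fell in both groups ."),
]
contexts = select_contexts(records, ["HIV", "therapy"])
paper = build_rows(fake_qa, "what is the therapy for the HIV?", contexts)
print(paper)
```

Passing `paper` to `pandas.DataFrame(paper)` would give the tabular form described above.
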
Answers and Passage Labels

Note that some identical answers come from source passages with different Labels.
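
One way to keep that information is to group the rows by answer, so that each answer carries a list of Labels and a list of source sentences. This is only a sketch under assumptions: the row keys follow the Answer-node query later in this README, `group_by_answer` is a hypothetical helper, and whether the repo actually stores lists is not stated here (the later `UNWIND a.label` would work with either a single value or a list).

```python
from collections import defaultdict


def group_by_answer(rows):
    """Merge rows that share the same answer text, collecting labels and sentences."""
    grouped = defaultdict(lambda: {"label": [], "sentence": []})
    for row in rows:
        entry = grouped[row["tokens"]]
        if row["label"] not in entry["label"]:  # keep each label once per answer
            entry["label"].append(row["label"])
        entry["sentence"].append(row["sentence"])
    return [
        {"tokens": answer, "label": e["label"], "sentence": e["sentence"]}
        for answer, e in grouped.items()
    ]


rows = [  # made-up rows for illustration
    {"tokens": "zidovudine", "label": "CONCLUSIONS", "sentence": "s1"},
    {"tokens": "zidovudine", "label": "RESULTS", "sentence": "s2"},
    {"tokens": "didanosine", "label": "CONCLUSIONS", "sentence": "s3"},
]
paper = group_by_answer(rows)
print(paper)
```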

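Later on, source passages could be chosen according to how their Label relates to the question. A passage labelled CONCLUSIONS may sometimes be a better source than one labelled METHODS, because a METHODS passage may only describe the idea of the experiment and cannot serve as an answer.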

This repo also includes a BM25-based way of searching for passages.
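
For readers unfamiliar with BM25, here is a small self-contained sketch of Okapi BM25 scoring in plain Python. It is not the code shipped in this repo: the parameter values `k1=1.5` and `b=0.75`, the smoothed IDF variant, the whitespace tokenizer and the sample documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query (assumes a non-empty corpus)."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative even for very common terms.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalisation.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [  # made-up documents
    "zidovudine is an effective initial therapy for hiv",
    "patients with hiv were randomly assigned",
    "blood pressure fell in both groups",
]
tokenized = [d.split() for d in docs]
scores = bm25_scores("hiv therapy".split(), tokenized)
best = max(range(len(docs)), key=lambda i: scores[i])
print(docs[best])
```

Unlike the strict keyword filter, BM25 ranks every passage, so a passage that matches only some of the query terms still gets a (lower) score instead of being dropped.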

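Building the Neo4j Knowledge Graph

Introduction to Neo4j

Neo4j is currently the most popular graph database and has long ranked near the top of the DB-ENGINES ranking. In recent years, government agencies and companies in Taiwan have also gradually started adopting Neo4j.

If you have worked with SQL or a similar query language before, the tutorial below should be easy to pick up.

The following covers a simple installation and the basic query syntax.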

Setup environment

!pip install neo4j

Connect Cloud Server: connect to a Neo4j cloud database https://neo4j.com/cloud/aura/

from neo4j import GraphDatabase
import pandas as pd
uri = "your_uri"
user = "your_username"
password = "your_password"
driver = GraphDatabase.driver(uri,auth=(user, password))
def neo4j_query(query, params=None):
  with driver.session() as session:
    result = session.run(query, params)
    return pd.DataFrame([r.values() for r in result],columns=result.keys())

Create Node

Create the Question nodes

#create Question
neo4j_query("""
UNWIND $data as item
MERGE (a:Question {id:item})
SET a.text = item
RETURN count(a)
""",{"data":Q})

Create the Answer nodes

#create Answer
neo4j_query("""
UNWIND $data as item
MERGE (a:Answer {id:item["tokens"]})
SET a.label = item["label"]
SET a.sentence = item["sentence"]
RETURN count(a)
""",{"data":paper})

Create the Label nodes

#create types
types = ['OBJECTIVE', 'METHODS', 'RESULTS', 'CONCLUSIONS', 'BACKGROUND']
neo4j_query("""
UNWIND $data as item
MERGE (a:LABEL {id:item})
SET a.label = item
RETURN count(a)
""",{"data":types})

Create Relations Between Nodes

Connect each Answer node to the node for its Label (the relationship type 屬於 means "belongs to")

# Match sentence with their label
neo4j_query("""
MATCH (a:Answer)
WITH a
UNWIND a.label as type
MATCH (types:LABEL) where types.label = type
MERGE (a)-[:屬於]->(types)
""")

Create the relation between Question and Answer (the relationship type 答案為 means "the answer is")

# Match Question with Answer
neo4j_query("""
MATCH (q:Question)
WITH q
MATCH (ans:Answer)
MERGE (q)-[:答案為]->(ans)
""")

For more Neo4j syntax, see my Notion notes: https://alpine-friction-207.notion.site/Neo4j-983d4798e63d417bba635c089f81a0e1
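
One caveat about the Question-to-Answer query above: it matches every Question with every Answer, so once the graph holds more than one question, each question would be linked to all answers. A possible way around this is to link explicit pairs, sketched below. The `question` key on each row, the `LINK_QUERY` text and the `build_pairs` helper are all hypothetical; they assume each row remembers which question produced its answer.

```python
# Cypher that links only the listed (question, answer) pairs.
LINK_QUERY = """
UNWIND $data AS item
MATCH (q:Question {id: item.question})
MATCH (ans:Answer {id: item.answer})
MERGE (q)-[:答案為]->(ans)
"""


def build_pairs(rows):
    """Return unique question/answer pairs, keeping first-seen order."""
    seen = set()
    pairs = []
    for row in rows:
        key = (row["question"], row["tokens"])
        if key not in seen:
            seen.add(key)
            pairs.append({"question": key[0], "answer": key[1]})
    return pairs


rows = [  # made-up rows; the "question" key is an assumption
    {"question": "what is the therapy for the HIV?", "tokens": "zidovudine"},
    {"question": "what is the therapy for the HIV?", "tokens": "zidovudine"},
    {"question": "what is the therapy for the HIV?", "tokens": "didanosine"},
]
pairs = build_pairs(rows)
print(pairs)
```

With the connection helper defined earlier, the pairs could then be written with `neo4j_query(LINK_QUERY, {"data": pairs})`.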

Conclusion

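There are still quite a few parts of this project worth exploring further.

Point 1

When selecting the passages related to a question, NER or Word2Vec could be used to pick up more passages.

A passage that is relevant to the question does not have to contain the exact words of the question.

For example, what is the therapy for the HIV? could just as well be phrased as what is the treatment for the HIV?

Retrieving additional passages with treatment is worth trying.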
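
As a much simpler stand-in for NER or Word2Vec, the idea can be sketched with a hand-written synonym table. The `SYNONYMS` map and both helper names are hypothetical; a real system would derive related terms from a model rather than list them by hand.

```python
# Hypothetical hand-written synonym table (a stand-in for NER / Word2Vec).
SYNONYMS = {"therapy": ["treatment"]}


def expand_keywords(keywords, synonyms=SYNONYMS):
    """Return one set of acceptable terms per keyword: the keyword plus its synonyms."""
    groups = []
    for keyword in keywords:
        key = keyword.lower()
        groups.append({key, *(s.lower() for s in synonyms.get(key, []))})
    return groups


def matches(sentence, term_groups):
    """True if the sentence contains at least one term from every group."""
    text = sentence.lower()
    return all(any(term in text for term in group) for group in term_groups)


groups = expand_keywords(["HIV", "therapy"])
print(matches("Zidovudine is a treatment for HIV .", groups))
print(matches("Patients with HIV were enrolled .", groups))
```

The first sentence is accepted even though it never uses the word therapy, while the second is still rejected because it mentions neither therapy nor treatment.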

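Point 2

Passages can be filtered by Label.

A METHODS passage may not be the right source: it may only describe the method of a medical experiment without actually answering the question.

It can still suit research-type questions such as what are the possible treatments for the HIV?

Conversely, a CONCLUSIONS passage is better suited as the input context for general questions.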
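
A minimal sketch of such a Label filter, assuming records are (label, sentence) pairs as produced by the cleaning step. The helper name and the sample records are hypothetical, and which Labels to allow for which kind of question is a choice left to the reader.

```python
def filter_by_label(records, allowed_labels):
    """Keep only (label, sentence) pairs whose label is in allowed_labels."""
    allowed = set(allowed_labels)
    return [(label, sentence) for label, sentence in records if label in allowed]


records = [  # made-up records
    ("METHODS", "Patients were randomised to drug A or drug B ."),
    ("CONCLUSIONS", "Drug A was more effective than drug B ."),
]
general = filter_by_label(records, ["CONCLUSIONS"])  # for general questions
research = filter_by_label(records, ["METHODS"])     # for research-type questions
print(general)
print(research)
```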

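Point 3

The Properties of the nodes in the Knowledge Graph, and the Relations between them, could be defined in more detail.

An Answer node currently has three keys: the answer, the source passage, and the Label of the source passage.

A Question node has only one key: its text.

More keys, or more specific relations, could be added between the two kinds of node.

For example, questions could be graded: what is the best therapy for the HIV? versus what is the therapy for the HIV?

The relation between Answer and Question could likewise be split into 答案一定是 ("the answer definitely is") and 答案有可能是 ("the answer may be").

Alternatively, the source-passage value on an Answer node could be used to track down other useful passages to give the model as input context.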
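
The split between 答案一定是 and 答案有可能是 could, for instance, be driven by a confidence score. The sketch below assumes the QA model returns a score between 0 and 1 alongside each answer and uses an arbitrary threshold of 0.8; both the threshold and the helper names are hypothetical. The relationship type is written into the query text here, restricted to the two fixed constants.

```python
DEFINITE = "答案一定是"    # "the answer definitely is"
POSSIBLE = "答案有可能是"  # "the answer may be"


def relation_for(score, threshold=0.8):
    """Pick a relationship type from a confidence score in [0, 1]."""
    return DEFINITE if score >= threshold else POSSIBLE


def link_query(relation):
    """Build the Cypher for one of the two allowed relationship types."""
    if relation not in (DEFINITE, POSSIBLE):
        raise ValueError("unexpected relationship type: " + relation)
    return (
        "MATCH (q:Question {id: $question}) "
        "MATCH (ans:Answer {id: $answer}) "
        "MERGE (q)-[:" + relation + "]->(ans)"
    )


print(relation_for(0.95))
print(link_query(relation_for(0.30)))
```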

That concludes the introduction to this project. If you have any questions about it, or would like to discuss it further, feel free to email nchu_hank@smail.nchu.edu.tw

Credit

HankyStyle
Professor 范耀中
References: https://www.kdnuggets.com/2021/12/analyzing-scientific-articles-finetuned-scibert-ner-model-neo4j.html?fbclid=IwAR2bfOlJdOMPBdwTzWNh0s4Q2rCqvx3rcB2ni_ZQYgQqWDRwgrVRpSMQK0c , https://neo4j.com/
