Implements an NLP pipeline that automatically builds an entity-relation graph from unstructured text, together with an interactive graph visualization that can be used for search and recommendations.
- The entity-relation graph is built from free text extracted from Wikipedia articles and shows relations between players, awards, teams, and tournaments, extracted with Stanford’s CoreNLP library.
- D3.js is used to create the interactive graph visualization, which can be used in search and recommendation applications.
To view the output, open graph1.html or graph2.html.
python web_scrape.py https://en.wikipedia.org/wiki/Lists_of_National_Basketball_Association_players
# saves the player links to links.txt
python scrape_players.py    # or: python allstarscrape.py
# extracts the text for each link in links.txt and appends it, under a heading, to a new document
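The link-collection step can be illustrated with a short sketch. The source of web_scrape.py is not shown here, so this is only one plausible approach using the Python standard library: the class `WikiLinkParser` and the function `extract_wiki_links` are hypothetical names, and the rule for skipping namespaced pages (anything with a colon in the path) is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class WikiLinkParser(HTMLParser):
    """Collects absolute URLs of /wiki/ article links found in <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: keep article links only and skip namespaced pages
        # such as File: or Help:, which contain a colon in the path.
        if href and href.startswith("/wiki/") and ":" not in href:
            url = urljoin(self.base_url, href)
            if url not in self.links:  # keep first occurrence only
                self.links.append(url)


def extract_wiki_links(html, base_url="https://en.wikipedia.org"):
    """Return the de-duplicated article links found in an HTML string."""
    parser = WikiLinkParser(base_url)
    parser.feed(html)
    return parser.links
```

The returned list could then be written to links.txt one URL per line. Fetching the page itself is left out so that the sketch stays independent of any particular HTTP library.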
In the Stanford-OpenIE-Python directory, run:
python main.py -f input.txt -g
This generates a [GraphViz DOT](http://www.graphviz.org/) graph and its related PNG file in `/tmp/openie/`, along with an output.txt file containing all the extracted relations.
Open the out.dot file in a text editor and copy its contents into a new text file, relation.txt.
sed "s/;/;\n/g" relation.txt > foo_Out.txt
# puts each relation on its own line
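Where `sed` is not available (or its handling of `\n` in the replacement differs, as it can on non-GNU systems), the same split can be sketched in Python. The function names below are hypothetical; the file names are the ones used in the steps above.

```python
from pathlib import Path


def split_relations(text):
    """Insert a line break after every semicolon, like the sed command above."""
    return text.replace(";", ";\n")


def split_file(src="relation.txt", dst="foo_Out.txt"):
    """Read src, split it on semicolons, and write the result to dst."""
    Path(dst).write_text(split_relations(Path(src).read_text()))
```

Calling `split_file()` with no arguments reads relation.txt and writes foo_Out.txt in the current directory.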
Convert foo_Out.txt to the required .json file, using the same format as player.json,
OR
convert the tab-separated file (output.txt) to the required .json file, using the same format as player.json
(TSV format: entity[1] Label entity[2]).
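The TSV-to-JSON conversion could look roughly like the sketch below. The exact layout of player.json is not shown in this README, so the `nodes`/`links` structure with `name`, `source`, `target`, and `label` keys is an assumption modeled on the layout commonly used with D3 force-directed graphs; adjust the keys to match player.json. The function name `tsv_to_graph` is hypothetical.

```python
import csv
import io
import json


def tsv_to_graph(tsv_text):
    """Turn 'entity1<TAB>label<TAB>entity2' rows into a nodes/links dict.

    Assumed output layout (not confirmed against player.json):
    {"nodes": [{"name": ...}], "links": [{"source": i, "target": j, "label": ...}]}
    """
    node_index = {}  # entity name -> position in the nodes list
    links = []
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 3:
            continue  # skip blank or malformed rows
        source, label, target = (cell.strip() for cell in row)
        for name in (source, target):
            node_index.setdefault(name, len(node_index))
        links.append(
            {"source": node_index[source], "target": node_index[target], "label": label}
        )
    return {"nodes": [{"name": name} for name in node_index], "links": links}


def graph_to_json(graph):
    """Serialize the graph dict as indented JSON text."""
    return json.dumps(graph, indent=2)
```

The text of output.txt can be passed to `tsv_to_graph`, and the string returned by `graph_to_json` saved as the .json file loaded by the HTML pages.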
Add images to the images folder and add their paths to Entity_relation_graph.html, graph1.html, or graph2.html.
Run python duplicate_lines.py to remove duplicate relations from foo_Out.txt.
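The contents of duplicate_lines.py are not reproduced in this README. As a rough illustration of the idea, an order-preserving de-duplication of relation lines might look like the following; `unique_lines` is a hypothetical name, and dropping blank lines and comparing lines after trimming whitespace are both assumptions.

```python
def unique_lines(lines):
    """Return the lines with duplicates removed, keeping first occurrences in order.

    Lines are compared after stripping surrounding whitespace; blank lines are dropped.
    """
    seen = set()
    result = []
    for line in lines:
        key = line.strip()
        if not key or key in seen:
            continue
        seen.add(key)
        result.append(key)
    return result
```

For example, the lines of foo_Out.txt can be read with `open("foo_Out.txt").readlines()`, passed through `unique_lines`, and written back joined with newlines.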
- images
  - contains the images used as node icons
- allstarscrape.py
  - extracts the wikilinks from allstarlinks.txt and appends the data to allstarsdata.txt
- web_scrape.py
  - extracts links for all the NBA players (around 4500) and appends those links to links.txt
- scrape_players.py
  - takes the wikilinks from links.txt and appends the text for each player to data.txt
- Stanford-OpenIE-Python
  - directory containing the Python wrapper for relation extraction
    - input.txt: the input file containing the text data (copy and paste the text from data.txt here)
    - allstarsdata.txt: the input file containing the text data for the All-Star team players
    - output.txt: tab-separated output file
    - stanford-openie: contains the Java API for stanford-corenlp-openie
- relation.txt
  - contains the data from the out.dot file in text format (created in home/tmp/openie)
- foo_Out.txt
  - same as relation.txt, but with each relation on a separate line
- player.json
  - file with the entity relations in the format required by D3
- graph1.html
  - graph output with images as node icons
- Entity_relation_graph.html/graph2.html
  - graph output with node icons filled with a colour according to their type (place, player, title/award, or team); node images are displayed on mouseover