Primary language: Java. License: GNU General Public License v3.0 (GPL-3.0).

WikipediaAnchorExtractor

  • Download the latest Wikipedia dump. For example: wget http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

  • Use the WikiExtractor.py script from Giuseppe Attardi to extract the article text from the dump: python2.7 WikiExtractor.py -o extracted -ls enwiki-latest-pages-articles.xml.bz2

  • Set the path to the extracted Wikipedia articles in run.sh, then run: sh run.sh
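
To illustrate what the extraction step feeds into, here is a minimal sketch of pulling anchor text and link targets out of one line of extracted text. It is not this project's actual code. It assumes that running WikiExtractor.py with link preservation leaves links in the form <a href="Target">anchor text</a>. The class name AnchorSketch and the method extractAnchors are hypothetical and exist only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the code that run.sh executes.
public class AnchorSketch {

    // Assumed link format in the extracted text: <a href="Target">anchor text</a>
    private static final Pattern LINK =
            Pattern.compile("<a href=\"([^\"]*)\">(.*?)</a>");

    // Returns one {anchorText, target} pair per link found in the line.
    // The target is returned exactly as written in href (it may be URL-encoded).
    public static List<String[]> extractAnchors(String line) {
        List<String[]> pairs = new ArrayList<>();
        Matcher m = LINK.matcher(line);
        while (m.find()) {
            pairs.add(new String[] { m.group(2), m.group(1) });
        }
        return pairs;
    }

    public static void main(String[] args) {
        String sample = "The <a href=\"Eiffel%20Tower\">tower</a> stands in "
                + "<a href=\"Paris\">the French capital</a>.";
        // Print each pair as: anchor text, tab, target
        for (String[] pair : extractAnchors(sample)) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```

A regular expression is enough here because the assumed link markup is flat and never nested. Input with different or nested markup would need a real parser.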