Utilities for creating Word2Vec vectors for Dbpedia Entities via a Wikipedia Dump.
With the release of Word2Vec, the Google team also released vectors for Freebase entities trained on Wikipedia. These vectors are useful for a variety of tasks.
This tool lets you generate such vectors yourself. Instead of Freebase mids, entities are addressed via DbpediaIds, which correspond to Wikipedia article titles.
Vectors are generated for (i) words appearing in Wikipedia and (ii) topics, e.g. dbpedia/Barack_Obama.
- Requires Java and sbt.
- Run the following to create a fat jar at target/scala-2.10/wiki2vec-assembly-1.0.jar:
  sbt assembly
- Download the Wikipedia article dump (multistreaming-xml.bz2) and put it under the data folder.
- Use com.chunlianglyu.bliki2:bliki2 to create a Wikipedia TSV dump, with each line in the format id, title, redirect, text, separated by TABs.
- Run the following to process the generated file using Spark; this will stem and tokenize the article contents:
  ./spark-1.2.0-bin-hadoop2.4/bin/spark-submit --class "org.idio.wikipedia.word2vec.Word2VecCorpus" wiki2vec-assembly-1.0.jar /gds/cllu/workspace/wiki2vec/ReadableWiki fakeRedirectFile /gds/cllu/workspace/wiki2vec/corpus
- Run the following to generate a single corpus file (the generated file is around 21 GB):
  cat corpus/part* > enwiki-20150205.corpus
You can then feed the corpus file to a standard word2vec program.
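To make the TSV dump format from the steps above concrete, here is a minimal Python sketch that parses one line into its four fields. The names `WikiRow` and `parse_tsv_line` are hypothetical, and the sketch assumes that only the text field may contain embedded TABs and that an empty redirect field means the page is not a redirect; the actual output of bliki2 may differ.

```python
from typing import NamedTuple


class WikiRow(NamedTuple):
    """One line of the TSV dump; field names follow the README, types are assumed."""
    id: str
    title: str
    redirect: str
    text: str


def parse_tsv_line(line: str) -> WikiRow:
    # Split on the first three TABs only, so any TAB inside the article text is kept.
    # Assumption: id, title and redirect never contain a TAB themselves.
    fields = line.rstrip("\n").split("\t", 3)
    if len(fields) != 4:
        raise ValueError(f"expected 4 TAB-separated fields, got {len(fields)}")
    return WikiRow(*fields)


# Made-up example line: id, title, empty redirect, article text.
row = parse_tsv_line("12\tBarack Obama\t\t[[USA]] president\n")
print(row.title)
```

Limiting the split to three separators keeps the article text intact even if it happens to contain a TAB.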
The Word2VecCorpus job creates a tokenized corpus that can be fed into tools such as Gensim to create Word2Vec vectors for Dbpedia entities.
- Every Wikipedia link to an article within the wiki is replaced by DbpediaId/DbpediaIDToLink. For example, if an article's text contains:
  [[ Barack Obama | B.O ]] is the president of [[USA]]
  it is transformed into:
  DbpediaID/Barack_Obama B.O is the president of DbpediaID/USA
- Articles are tokenized (at the moment in a very naive way).
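The link replacement described above can be sketched with a regular expression. This is only an illustration in Python, not the project's Scala implementation: the pattern `WIKI_LINK` and the function `replace_links` are hypothetical names, the pattern handles only plain `[[Target]]` and `[[Target | anchor]]` links, and replacing spaces with underscores is an assumption based on the example.

```python
import re

# Matches [[Target]] and [[Target | anchor text]]; a simplification of real wiki markup.
WIKI_LINK = re.compile(r"\[\[\s*([^\]|]+?)\s*(?:\|\s*([^\]]*?)\s*)?\]\]")


def replace_links(text, prefix="DbpediaID/"):
    """Replace wiki links with an entity token, keeping any anchor text."""
    def _sub(match):
        # Assumption: the entity id is the link target with spaces as underscores.
        entity = prefix + match.group(1).replace(" ", "_")
        anchor = match.group(2)
        # Keep the anchor text after the entity token when the link has one.
        return f"{entity} {anchor}" if anchor else entity

    return WIKI_LINK.sub(_sub, text)


print(replace_links("[[ Barack Obama | B.O ]] is the president of [[USA]]"))
# prints: DbpediaID/Barack_Obama B.O is the president of DbpediaID/USA
```

Nested links, templates and file links are deliberately ignored here; a real dump contains all of them.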
- Make sure you have a Readable Wikipedia
- Download Spark: http://d3kbcqa49mib13.cloudfront.net/spark-1.2.0-bin-hadoop2.4.tgz
- In your Spark folder, run:
  bin/spark-submit --class "org.idio.wikipedia.word2vec.Word2VecCorpus" target/scala-2.10/wiki2vec-assembly-1.0.jar /PathToYourReadableWiki/readableWiki.lines /Path/To/RedirectsFile /PathToOut/Word2vecReadyWikipediaCorpus
- Feed your corpus to a word2vec tool
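As a rough illustration of the last step, the sketch below streams the corpus as lists of tokens, which is the input shape most word2vec tools expect. It assumes the corpus is a UTF-8 text file with whitespace-separated tokens, one sentence or article per line; the class name `CorpusSentences` is hypothetical and the actual corpus layout may differ.

```python
from pathlib import Path
from typing import Iterator, List


class CorpusSentences:
    """Re-iterable stream of token lists, one per non-empty corpus line."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)

    def __iter__(self) -> Iterator[List[str]]:
        # The file is reopened on every iteration, so training code can make
        # several passes without loading the whole corpus into memory.
        with self.path.open(encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()  # assumption: whitespace-separated tokens
                if tokens:
                    yield tokens
```

An object like this can be handed to any tool that takes an iterable of token lists; Gensim's `Word2Vec`, for instance, accepts one through its `sentences` argument. Entity tokens such as DbpediaID/Barack_Obama are treated like any other word, which is how they end up with their own vectors.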
By default, the word2vec corpus is always stemmed. If you don't want that to happen, pass None as an extra argument.
When using prepare.sh:
sudo sh prepare.sh es_ES /mnt/data/ None
This works on the Spanish Wikipedia and won't stem words.
When calling Spark directly:
bin/spark-submit --class "org.idio.wikipedia.word2vec.Word2VecCorpus" target/scala-2.10/wiki2vec-assembly-1.0.jar /PathToYourReadableWiki/readableWiki.lines /Path/To/RedirectsFile /PathToOut/Word2vecReadyWikipediaCorpus None
- Remove hard coded spark params
- Handle Wikipedia Redirections
- Intra Article co-reference resolution