Convert Wikipedia dumps to Neo4j loadable CSVs, efficiently transforming Wikipedia data for graph database usage.

Convert Wikipedia Dumps to Neo4j Loadable CSVs

This application downloads Wikipedia dumps and transforms them into Neo4j loadable CSVs. If you only need the results once, use the remote version of the InputFile, which automatically downloads the latest dump. For repeated runs, such as when testing or adding features to the code, download and unzip the files beforehand to reduce the runtime.

The CSV files are written to the data/{wiki} directory, where {wiki} is the output folder of the selected InputFile.

Options

InputFiles with a local URI expect the file to be located in the {user_home}/wikidata folder.

| InputFile | URI | Output Folder | FileLoader | Expected Runtime |
| --- | --- | --- | --- | --- |
| EN_WIKI | enwiki-latest-pages-articles.xml | en_wiki | PLAIN | 20min |
| EN_WIKI_BZ2 | enwiki-latest-pages-articles.xml.bz2 | en_wiki | BZ2 | 1h |
| EN_WIKI_BZ2_REMOTE | https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 | en_wiki | REMOTE_BZ2 | 1h 30min |
| SIMPLE_WIKI | simplewiki-latest-pages-articles.xml | simple_wiki | PLAIN | 20s |
| SIMPLE_WIKI_BZ2 | simplewiki-latest-pages-articles.xml.bz2 | simple_wiki | BZ2 | 50s |
| SIMPLE_WIKI_BZ2_REMOTE (Default) | https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles.xml.bz2 | simple_wiki | REMOTE_BZ2 | 4min |
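
The options above could be modeled roughly as follows. This is only an illustrative sketch, not the project's actual code: the class `InputFiles`, the field names, and the methods `resolvedLocation` and `fromArgs` are hypothetical. The values, the default option, and the {user_home}/wikidata lookup are taken from the table and the note above it.

```java
import java.nio.file.Paths;

/** Hypothetical model of the InputFile options; all names here are assumptions. */
class InputFiles {

    enum FileLoader { PLAIN, BZ2, REMOTE_BZ2 }

    enum InputFile {
        EN_WIKI("enwiki-latest-pages-articles.xml", "en_wiki", FileLoader.PLAIN),
        EN_WIKI_BZ2("enwiki-latest-pages-articles.xml.bz2", "en_wiki", FileLoader.BZ2),
        EN_WIKI_BZ2_REMOTE(
            "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2",
            "en_wiki", FileLoader.REMOTE_BZ2),
        SIMPLE_WIKI("simplewiki-latest-pages-articles.xml", "simple_wiki", FileLoader.PLAIN),
        SIMPLE_WIKI_BZ2("simplewiki-latest-pages-articles.xml.bz2", "simple_wiki", FileLoader.BZ2),
        SIMPLE_WIKI_BZ2_REMOTE(
            "https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles.xml.bz2",
            "simple_wiki", FileLoader.REMOTE_BZ2);

        final String uri;
        final String outputFolder;
        final FileLoader loader;

        InputFile(String uri, String outputFolder, FileLoader loader) {
            this.uri = uri;
            this.outputFolder = outputFolder;
            this.loader = loader;
        }

        /** Remote URIs are returned unchanged; local files are looked up in {user_home}/wikidata. */
        String resolvedLocation() {
            if (loader == FileLoader.REMOTE_BZ2) {
                return uri;
            }
            return Paths.get(System.getProperty("user.home"), "wikidata", uri).toString();
        }

        /** Falls back to the documented default when no argument is given. */
        static InputFile fromArgs(String[] args) {
            return args.length == 0 ? SIMPLE_WIKI_BZ2_REMOTE : valueOf(args[0]);
        }
    }

    public static void main(String[] args) {
        InputFile input = InputFile.fromArgs(args);
        System.out.println(input + " reads " + input.resolvedLocation()
            + " and writes to data/" + input.outputFolder);
    }
}
```

In this sketch an unknown option name makes `valueOf` throw an `IllegalArgumentException`; the real application may handle that case differently.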

To run the application, use the following command:

mvn compile exec:java -Dexec.args="{InputFile}"

Sample:

mvn compile exec:java -Dexec.args="SIMPLE_WIKI_BZ2_REMOTE"

Output Data

The output consists of two types of CSV files: "links" and "pages", both chunked at 256 megabytes. The format is consistent with their respective header files located under headers/.
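
Chunking of this kind can be sketched as a writer that rolls over to a new file once a size limit would be exceeded. The class `ChunkedCsvWriter` and its methods are hypothetical and not taken from the project. The file naming (`prefix-N.csv`, starting at 0) is an assumption chosen to match the `pages-\d+.csv` pattern used in the import commands, and whether "256 megabytes" means 256 * 1024 * 1024 bytes is also an assumption.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch: writes lines to prefix-0.csv, prefix-1.csv, ... with a byte limit per file. */
class ChunkedCsvWriter implements AutoCloseable {
    private final Path dir;
    private final String prefix;
    private final long maxBytes;
    private BufferedWriter out;
    private long written; // bytes written to the current chunk
    private int chunks;   // number of chunk files opened so far

    ChunkedCsvWriter(Path dir, String prefix, long maxBytes) throws IOException {
        this.dir = Files.createDirectories(dir);
        this.prefix = prefix;
        this.maxBytes = maxBytes;
    }

    void writeLine(String line) throws IOException {
        long size = line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for the newline
        // Open the first chunk lazily; start a new one if this line would exceed the limit.
        if (out == null || (written > 0 && written + size > maxBytes)) {
            roll();
        }
        out.write(line);
        out.write('\n');
        written += size;
    }

    private void roll() throws IOException {
        if (out != null) {
            out.close();
        }
        out = Files.newBufferedWriter(dir.resolve(prefix + "-" + chunks + ".csv"), StandardCharsets.UTF_8);
        chunks++;
        written = 0;
    }

    int chunkCount() {
        return chunks;
    }

    @Override
    public void close() throws IOException {
        if (out != null) {
            out.close();
        }
    }

    /** Demo helper: writes the lines to a temporary directory and returns the number of chunks. */
    static int demo(long maxBytes, String... lines) {
        try {
            Path dir = Files.createTempDirectory("chunk-demo");
            try (ChunkedCsvWriter writer = new ChunkedCsvWriter(dir, "pages", maxBytes)) {
                for (String line : lines) {
                    writer.writeLine(line);
                }
                return writer.chunkCount();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A tiny limit forces a rollover; a real run would use the 256 MB limit.
        System.out.println("chunks: " + demo(10, "aaaa", "aaaa", "aaaa"));
    }
}
```

Rolling over before a line is written, rather than after, keeps every row intact within a single file, which is what a line-based CSV importer needs.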

Pages

The format of the pages output CSVs is defined in headers/page_headers.csv, the header file used by the import commands.

| Field | Example | Description |
| --- | --- | --- |
| urlTitle:ID | Alan_Turing | This field is used as a unique ID in Neo4j and is the title formatted to work in the Wikipedia URL path. |
| title:string | Alan Turing | The title of the page. |
| pageId:long | 1261710 | Wikipedia's internal page ID, which can be used with the Wikipedia API. |
| isRedirect:boolean | false | Indicates if the page is a redirect to another page. |
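
As a rough illustration of how a title relates to its urlTitle, the sketch below replaces spaces with underscores and capitalizes the first character, which is how Wikipedia normalizes titles in URL paths. The class `UrlTitles` and the method `toUrlTitle` are hypothetical; the project's actual conversion may handle more cases, such as namespaces or special characters.

```java
/** Hypothetical sketch of deriving the urlTitle from a page title. */
class UrlTitles {

    static String toUrlTitle(String title) {
        String normalized = title.trim().replace(' ', '_');
        if (normalized.isEmpty()) {
            return normalized;
        }
        // Assumption: the first character is capitalized, as in Wikipedia URL paths.
        return Character.toUpperCase(normalized.charAt(0)) + normalized.substring(1);
    }

    public static void main(String[] args) {
        System.out.println(toUrlTitle("Alan Turing"));
    }
}
```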

Links

The format of the links output CSVs is defined in headers/link_headers.csv, the header file used by the import commands.

| Field | Example | Description |
| --- | --- | --- |
| :START_ID | Alan_Turing | The URL title of the page where the link is located. |
| :END_ID | Human_rights | The URL title of the target page. |
| text:string | human rights | The link text visible on the website. |
| isRedirect:boolean | false | Indicates if the :START_ID page is a redirect page. |
| index:long | 45 | The position of the link on the page. |
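
To show where the target, text, and index values can come from, here is a simplified sketch that extracts `[[target]]` and `[[target|text]]` links from wikitext with a regular expression. The class `WikiLinks` and its members are hypothetical and do not reflect the project's parser. The sketch treats index as the zero-based ordinal of the link on the page, which is an assumption; the project may count differently or use a character offset. Converting the target into the :END_ID URL title is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of extracting internal links from wikitext. */
class WikiLinks {

    /** One extracted link: raw target, visible text, and position on the page. */
    static final class Link {
        final String target;
        final String text;
        final long index;

        Link(String target, String text, long index) {
            this.target = target;
            this.text = text;
            this.index = index;
        }
    }

    // Matches [[target]] and [[target|text]]; nested brackets are ignored.
    private static final Pattern LINK =
        Pattern.compile("\\[\\[([^\\[\\]|]+)(?:\\|([^\\[\\]]*))?\\]\\]");

    static List<Link> extract(String wikitext) {
        List<Link> links = new ArrayList<>();
        Matcher matcher = LINK.matcher(wikitext);
        long index = 0; // assumption: ordinal of the link, starting at 0
        while (matcher.find()) {
            String target = matcher.group(1).trim();
            // Without a pipe, the visible text is the target itself.
            String text = matcher.group(2) != null ? matcher.group(2) : target;
            links.add(new Link(target, text, index++));
        }
        return links;
    }

    public static void main(String[] args) {
        for (Link link : extract("See [[human rights]] and [[Alan Turing|Turing]].")) {
            System.out.println(link.index + ": " + link.target + " (" + link.text + ")");
        }
    }
}
```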

Dumps

You can find dump files at https://dumps.wikimedia.org/.

| wiki | url |
| --- | --- |
| enwiki | https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 |
| simplewiki | https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles.xml.bz2 |

Expected Disk/Network Usage

| InputFile | Input File Size | Total CSV Size |
| --- | --- | --- |
| EN_WIKI | 87G | 20G |
| EN_WIKI_BZ2* | 19G | 20G |
| SIMPLE_WIKI | 1.1G | 333M |
| SIMPLE_WIKI_BZ2* | 236M | 333M |

Neo4j Import Usage

| wiki | Import Time | Node Count | Relationship Count | Data Volume Size |
| --- | --- | --- | --- | --- |
| en_wiki | 14min | 22781670 | 235766142 | 26G |
| simple_wiki | 40s | 430468 | 3353337 | 900M |

Sample: Importing CSVs in a Docker Container

The file docker-compose.yaml contains a minimal Docker Compose configuration for Neo4j. You can run the following commands to set up Neo4j and import the CSVs.

To use password authentication, create a file called ".env" in the same directory as the docker-compose.yaml file. The file should look like this:

NEO4J_AUTH=neo4j/YOUR_PASSWORD

The username must be "neo4j"; otherwise, Neo4j will not start.

Setup & Import Simple-Wiki

echo "Remove existing installation if exists"
docker compose down --volumes
docker compose rm --volumes --stop --force

echo "Start and Stop Neo4j to initialize the volumes"
docker compose up --detach --wait
docker compose down

echo "Import the CSV files"
header_dir="$(pwd)/headers"
data_dir="$(pwd)/data/simple_wiki"
report="$(pwd)/import.report"
echo -n "" >"${report}"
docker run --interactive --tty --rm \
  --volume=wiki-to-neo4j-csv-parser_data:/data \
  --volume="${data_dir}":/import \
  --volume="${header_dir}":/import-headers \
  --volume="${report}":/var/lib/neo4j/import.report \
  neo4j:5.19 \
  neo4j-admin database import full --overwrite-destination --bad-tolerance=10000000000 \
  --nodes=Page=/import-headers/page_headers.csv,/import/pages-\\d+.csv \
  --relationships=LINKS_TO=/import-headers/link_headers.csv,/import/links-\\d+.csv \
  --skip-bad-relationships

Setup & Import En-Wiki

echo "Remove existing installation if exists"
docker compose down --volumes
docker compose rm --volumes --stop --force

echo "Start and Stop Neo4j to initialize the volumes"
docker compose up --detach --wait
docker compose down

echo "Import the CSV files"
header_dir="$(pwd)/headers"
data_dir="$(pwd)/data/en_wiki"
report="$(pwd)/import.report"
echo -n "" >"${report}"
docker run --interactive --tty --rm \
  --volume=wiki-to-neo4j-csv-parser_data:/data \
  --volume="${data_dir}":/import \
  --volume="${header_dir}":/import-headers \
  --volume="${report}":/var/lib/neo4j/import.report \
  neo4j:5.19 \
  neo4j-admin database import full --overwrite-destination --bad-tolerance=10000000000 \
  --nodes=Page=/import-headers/page_headers.csv,/import/pages-\\d+.csv \
  --relationships=LINKS_TO=/import-headers/link_headers.csv,/import/links-\\d+.csv \
  --skip-bad-relationships

Start

The admin GUI is available at http://localhost:7474/browser/. Authenticate with the authentication type "no authentication" or "username / password", depending on whether you have set a password in the ".env" file. The database can be reached on port 7687.

docker compose up --detach --wait

Stop

docker compose down

Sample Cypher

Create Index

If you want to run path searches by "urlTitle", create a uniqueness constraint on it. Neo4j backs the constraint with an index, which speeds up lookups.

CREATE CONSTRAINT pages_urlTitle
FOR (p:Page) REQUIRE p.urlTitle IS UNIQUE

Find Page

MATCH (p:Page { urlTitle: 'Switzerland' })
RETURN p

Find Shortest Path

MATCH path=shortestPath((start:Page)-[:LINKS_TO*1..20]->(end:Page))
WHERE start.urlTitle = "Switzerland" AND end.urlTitle = "United_States"
RETURN path