importbench: A Java repository from dalaro

Tests behavior and performance when importing data into the Titan graph database (http://thinkaurelius.github.io/titan/).
The data comes from Java objects, not from external text files.

How to run

Adapt Config.java to your environment. Changing BERKELEY_PATH should be enough.
Then run App.java.

Components and Configuration

Titan 0.3.1
Embedded Berkeley DB as the storage backend
No external indexer active (such as ElasticSearch)
One simple vertex type with 4 fields (string, string, int, bool), one of them indexed
Single thread for importing
One transaction inserts 10k vertices
10 million vertices in total
Before each insert, I perform a string-based (dummy) lookup to see if a vertex with that name
already exists. It never does; the lookup is only there to stay closer to my real app.
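
The workload in the list above can be sketched roughly as follows. This is not the code from App.java and it does not use the Titan API: a plain HashMap stands in for the graph and its indexed name field, and the names ImportLoopSketch, VertexData and runImport are made up for illustration. It only shows the shape of the loop: lookup by name, insert, one timing per 10k-vertex transaction.

```java
import java.util.HashMap;
import java.util.Map;

/** Stand-in for the import loop; a HashMap replaces the graph and its index (assumption). */
class ImportLoopSketch {

    /** Hypothetical vertex with the four fields (string, string, int, bool). */
    static final class VertexData {
        final String name;
        final String text;
        final int number;
        final boolean flag;

        VertexData(String name, String text, int number, boolean flag) {
            this.name = name;
            this.text = text;
            this.number = number;
            this.flag = flag;
        }
    }

    /** Inserts totalVertices in batches and returns the elapsed ms of each batch. */
    static long[] runImport(Map<String, VertexData> store, int totalVertices, int batchSize) {
        int txCount = (totalVertices + batchSize - 1) / batchSize;
        long[] txMillis = new long[txCount];
        int created = 0;
        for (int tx = 0; tx < txCount; tx++) {
            long start = System.nanoTime();
            int end = Math.min(created + batchSize, totalVertices);
            for (; created < end; created++) {
                String name = "vertex-" + created;
                // Dummy lookup on the indexed field; names are unique, so it never hits.
                if (store.get(name) == null) {
                    store.put(name, new VertexData(name, "payload", created, created % 2 == 0));
                }
            }
            // A real importer would commit the graph transaction at this point.
            txMillis[tx] = (System.nanoTime() - start) / 1_000_000;
        }
        return txMillis;
    }

    public static void main(String[] args) {
        Map<String, VertexData> store = new HashMap<>();
        long[] txMillis = runImport(store, 100_000, 10_000);
        for (int i = 0; i < txMillis.length; i++) {
            System.out.println("tx " + i + ": " + txMillis[i] + " ms");
        }
    }
}
```

With a real graph, the lookup and the insert would go through the graph's transaction, and the per-transaction time would include the commit.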

The problem

I’ve noticed non-linear execution time when batch-importing data into Titan with Berkeley DB.
In my application the slowdown is so dramatic that the importer won’t finish
in reasonable time (it would take days).

This project tries to reproduce the problem and, I believe, succeeds on a small scale.

Here are the numbers for how it behaves “on my machine”.
My machine is a Windows 7 64-bit workstation; the data folder is on a secondary SSD that does no other work.
The project starts with an empty db (it creates it).

One transaction is one line and inserts 10k vertices without any edges. All numbers are in ms.

       begin  after 5 million  after 10 million  restart app  after 10 million  after 15 million
        1999             2411              2225                           2106              1738
        1801             1741              2041                           1999              2586
        1713             2002              2147                           1909              1876
        1767             2673              3070                           1794              1811
        1872             1773              2107                           1890              1798
        1588             1758              2359                           1611              2419
        1599             2426              2813                           1665              1804
        2123             1814              1869                           1819              1825
        1544             1887              1862                           1616              1819
        1633             2449              1790                           1593              2542
TOTAL  17639            20934             22283                          18002             20218

TOTAL is the sum of 10 transactions = 100k vertices.

With an empty db, the first 100k vertices took 17.639 seconds.
After 10 million vertices, the time for 100k vertices had increased to 22.283 seconds.
Stopping the app and then continuing to insert into the same db brings the
execution time back to about the initial value (18.002 seconds). So the longer the
importer runs, the slower it gets, while DB size seems irrelevant.
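
To make the comparison concrete, here is a small sketch that recomputes the totals from the per-transaction times in the table above and expresses the change relative to the first 100k vertices. The class and method names are made up for illustration. On these numbers, the run before the restart comes out roughly 26% slower than the start, while the first 100k vertices after the restart are only about 2% slower.

```java
import java.util.Locale;

/** Recomputes the 100k-vertex totals and the relative slowdown from the measured times. */
class SlowdownSketch {

    /** Sums the per-transaction times (ms) of one measurement window. */
    static long total(long[] txMillis) {
        long sum = 0;
        for (long ms : txMillis) {
            sum += ms;
        }
        return sum;
    }

    /** Relative change from baseline to current, in percent. */
    static double percentChange(long baseline, long current) {
        return 100.0 * (current - baseline) / baseline;
    }

    public static void main(String[] args) {
        // Per-transaction times in ms, copied from the table.
        long[] begin = {1999, 1801, 1713, 1767, 1872, 1588, 1599, 2123, 1544, 1633};
        long[] after10Million = {2225, 2041, 2147, 3070, 2107, 2359, 2813, 1869, 1862, 1790};
        long[] afterRestart = {2106, 1999, 1909, 1794, 1890, 1611, 1665, 1819, 1616, 1593};

        long base = total(begin);
        System.out.printf(Locale.ROOT, "begin: %d ms%n", base);
        System.out.printf(Locale.ROOT, "after 10 million: %d ms (%+.1f%%)%n",
                total(after10Million), percentChange(base, total(after10Million)));
        System.out.printf(Locale.ROOT, "after restart: %d ms (%+.1f%%)%n",
                total(afterRestart), percentChange(base, total(afterRestart)));
    }
}
```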

Why? What can I do about it?
