A high-performance Java Implementation of RDF2Vec

Primary language: Java. License: MIT.

jRDF2Vec

jRDF2Vec is a Java implementation of RDF2Vec. It supports multi-threaded, in-memory (or disk-access-based) walk generation and training. You can generate embeddings for any NT, NQ, OWL/XML, RDF HDT, or TTL file.

Found a bug? Don't hesitate to open an issue.

How to cite?

Portisch, Jan; Hladik, Michael; Paulheim, Heiko. RDF2Vec Light - A Lightweight Approach for Knowledge Graph Embeddings. Proceedings of the ISWC 2020 Posters & Demonstrations. 2020. [to appear]

An open-access version of the paper is available here.

How to use the jRDF2Vec Command-Line Interface?

Download this project and execute mvn clean install. Alternatively, you can download the packaged JAR of the latest successful commit here.

System Requirements

  • Java 8 or later.
  • Python 3 with the dependencies described in requirements.txt installed.
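
Before building, it can help to confirm that both prerequisites are reachable from the shell. The following is a minimal sketch, not part of jRDF2Vec itself: the helper name check_tool is hypothetical, and the command names java and python3 are assumptions (on some systems the Python 3 interpreter is called python instead).

```shell
# check_tool is a hypothetical helper: it succeeds if the given command is on the PATH.
check_tool() {
  command -v "$1" >/dev/null 2>&1
}

# java and python3 are assumed command names; adjust them for your system.
for tool in java python3; do
  if check_tool "$tool"; then
    echo "$tool: found"
  else
    echo "$tool: not found on PATH"
  fi
done
```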

Command-Line Interface (jRDF2Vec CLI) for Training and Walk Generation

Use the resulting jar from the target directory.

Minimal Example

java -jar jrdf2vec-1.1-SNAPSHOT.jar -graph ./kg_file.hdt

Required Parameters

  • -graph <graph_file>
    The file containing the knowledge graph for which you want to generate embeddings.

Optional Parameters

jRDF2Vec follows the convention-over-configuration design paradigm to increase usability. You can override the default values by setting one or more optional parameters.

Parameters for the Walk Configuration

  • -onlyWalks
    If added to the call, this switch will deactivate the training part so that only walks are generated. If training parameters are specified, they are ignored. The walk generation also works with the -light parameter.
  • -light <entity_file>
    If you intend to use RDF2VecLight, you have to use this switch followed by the path to the file describing the entities for which you require an embedding space. The file should contain one entity (full URI) per line.
  • -numberOfWalks <number> (default: 100)
    The number of walks to be performed per entity.
  • -depth <depth> (default: 4)
    This parameter controls the depth of each walk. Depth is defined as the number of hops. Hence, you can also set an odd number. A depth of 1 leads to a sentence in the form <s p o>.
  • -walkGenerationMode <MID_WALKS | MID_WALKS_DUPLICATE_FREE | RANDOM_WALKS | RANDOM_WALKS_DUPLICATE_FREE> (default for light: MID_WALKS, default for classic: RANDOM_WALKS_DUPLICATE_FREE)
    This parameter determines the mode for the walk generation (multiple walk generation algorithms are available).
  • -threads <number_of_threads> (default: (# of available processors) / 2)
    This parameter allows you to set the number of threads that shall be used for the walk generation as well as for the training.
  • -walkDirectory <directory where walk files shall be generated/reside>
    The directory into which the walks shall be generated. With -onlyTraining, the directory where the walks reside.
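
As an illustration of how the walk parameters combine, the sketch below writes a small entity file and then generates walks only for those entities. The file names entities.txt and ./walks, the two example URIs, and the chosen parameter values are assumptions made for this example; only the flags themselves come from the list above. The call is skipped if the JAR is not in the current directory.

```shell
# Hypothetical entity file: one full URI per line (example URIs only).
printf '%s\n' \
  'http://dbpedia.org/resource/Mannheim' \
  'http://dbpedia.org/resource/Germany' > entities.txt

# Assumed location of the JAR built with mvn clean install.
JAR=jrdf2vec-1.1-SNAPSHOT.jar

# Generate walks only (no training) for the listed entities.
if [ -f "$JAR" ]; then
  java -jar "$JAR" -graph ./kg_file.hdt -light ./entities.txt \
    -onlyWalks -numberOfWalks 50 -depth 3 -threads 4 \
    -walkDirectory ./walks
fi
```

Because -walkGenerationMode is not set here, the default for light (MID_WALKS) would apply.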

Parameters for the Training Configuration

  • -onlyTraining
    If added to the call, this switch will deactivate the walk generation part so that only the training is performed. The parameter -walkDirectory must be set. If walk generation parameters are specified, they are ignored.
  • -trainingMode <cbow | sg> (default: sg)
    This parameter controls the mode to be used for the word2vec training. Allowed values are cbow and sg.
  • -dimension <size_of_vector> (default: 200)
    This parameter allows you to control the size of the resulting vectors (e.g. 100 for 100-dimensional vectors).
  • -minCount <number> (default: 1)
    This parameter controls the minimum word count for the word2vec training. Unlike in the gensim defaults, this parameter is set to 1 by default because for knowledge graph embeddings, a vector for each node/arc is desired.
  • -noVectorTextFileGeneration | -vectorTextFileGeneration
    A switch which indicates whether a text file with the vectors shall be persisted on the disk. This is enabled by default. Use -noVectorTextFileGeneration to disable the file generation.
  • -sample <rate> (default: 0.0)
    The threshold for configuring which higher-frequency words are randomly downsampled. According to the gensim framework, a useful range is (0, 1e-5).
  • -window <window_size> (default: 5)
    The size of the window in the training process.
  • -epochs <number_of_epochs> (default: 5)
    The number of epochs to use in training.
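
The training parameters can be combined in the same way. The sketch below trains on walks that already exist in a directory. The directory name ./walks and the parameter values are assumptions for this example. It also assumes that -graph can be left out when -onlyTraining is set, which the text above does not state explicitly; add -graph if the CLI asks for it.

```shell
# Assumed location of the JAR and of previously generated walks.
JAR=jrdf2vec-1.1-SNAPSHOT.jar
WALK_DIR=./walks

# Example values only; every flag is taken from the parameter list above.
# Assumption: -graph is not needed together with -onlyTraining.
TRAIN_ARGS="-onlyTraining -walkDirectory $WALK_DIR -trainingMode cbow -dimension 100 -window 5 -epochs 10"

if [ -f "$JAR" ] && [ -d "$WALK_DIR" ]; then
  # TRAIN_ARGS is left unquoted on purpose so the shell splits it into arguments.
  java -jar "$JAR" $TRAIN_ARGS
else
  echo "Would run: java -jar $JAR $TRAIN_ARGS"
fi
```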

Command-Line Interface (jRDF2Vec CLI) - Additional Services

Besides generating walks and training embeddings, the CLI offers additional services which are described below.

Generating a Text Vector File

jRDF2Vec is compatible with the evaluation framework for KG embeddings (GEval). This framework requires the vectors to be present in a text file. If you have a gensim model or vector file, you can use the following command to generate this file:

java -jar jrdf2vec-1.1-SNAPSHOT.jar -generateTextVectorFile ./path-to-your-model-or-vector-file

Analyzing the Embedding Vocabulary

For RDF2Vec, it is not guaranteed that all concepts in the graph appear in the embedding space. For example, some concepts may only appear in the object position of statements and may never be reached by random walks. In addition, the word2vec configuration parameters may filter out infrequent words (see -minCount above, for example). To analyze such rare cases, you can use the -analyzeVocab function as follows:

java -jar jrdf2vec-1.1-SNAPSHOT.jar -analyzeVocab <model> <training_file|entity_file>

  • <model> refers to any model representation such as gensim model file, .kv file, or .txt file. Just make sure you use the correct file endings.
  • <training_file|entity_file> refers either to the NT/TTL etc. file that has been used to train the model or to a text file containing the concepts you want to check (one concept per line in the text file, make sure the file ending is .txt).

A report will be printed. For large models, you may want to redirect that into a file ([...] &> somefile.txt).
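
A possible end-to-end check is sketched below. It writes a concept file with the required .txt ending and stores the report in a file. The names concepts.txt, model.kv, and vocab_report.txt are hypothetical, and the URI is only an example. The portable form `> file 2>&1` is used in place of `&>` so that the snippet also runs in shells other than bash.

```shell
# Hypothetical concept file: one concept (full URI) per line, ending in .txt.
printf '%s\n' 'http://dbpedia.org/resource/Mannheim' > concepts.txt

# Assumed locations of the JAR and of a previously trained model.
JAR=jrdf2vec-1.1-SNAPSHOT.jar
MODEL=./model.kv

if [ -f "$JAR" ] && [ -f "$MODEL" ]; then
  # Redirect both standard output and standard error into the report file.
  java -jar "$JAR" -analyzeVocab "$MODEL" ./concepts.txt > vocab_report.txt 2>&1
fi
```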

How to use jRDF2Vec as a library in Java projects?

Stable releases are available through the Maven Central repository:

<dependency>
    <groupId>de.uni-mannheim.informatik.dws</groupId>
    <artifactId>jrdf2vec</artifactId>
    <version>1.0</version>
</dependency>

Run jRDF2Vec using Docker

Optionally, Docker can be used to run jRDF2Vec. This functionality has been added by Vincent Emonet.

Run

The image can be pulled from DockerHub 🐳

Test run to get help message:

docker run -it --rm vemonet/jrdf2vec

Mount volumes on /data in the container to provide input files and generate embeddings:

  • $(pwd) to use the current working directory on Linux and macOS
  • ${PWD} to use the current working directory on Windows (also put the command on a single line)
docker run -it --rm \
  -v $(pwd)/src/test/resources:/data \
  vemonet/jrdf2vec \
  -light /data/sample_dbpedia_entity_file.txt \
  -graph /data/sample_dbpedia_nt_file.nt

Embeddings will be generated in the shared volume (/data in the container).

Build

From source code:

docker build -t jrdf2vec .

Developer Documentation

The most recent Javadoc pages, generated from the latest commit, can be found here.

Frequently Asked Questions (FAQs)

I have Python installed, but it is not accessible via the command python. How to resolve this?
Create a file python_command.txt in the directory ./python_server (created when first running the jar). Write the command to call Python 3 in the first line of the file.
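
On a Unix-like shell, this could look as follows. The command name python3 is an assumption; replace it with whatever starts Python 3 on your system. Run the snippet from the directory in which you start the jar.

```shell
# Create the directory if the jar has not created it yet (harmless if it exists).
mkdir -p ./python_server

# The first line of the file holds the command used to call Python 3 (assumed: python3).
printf '%s\n' 'python3' > ./python_server/python_command.txt
```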

The program starts and immediately shuts down. Nothing seems to happen.
Make sure your system is set up correctly. In particular, check that Python 3 and the required dependencies are installed.