Knowledge Graph / Open Information Extraction

Motivation

Graphene is an information extraction pipeline that extracts Knowledge Graphs from text (n-ary relations and rhetorical structures extracted from complex factoid discourse). Given a sentence or a text, Graphene outputs a semantic representation of the text in the form of a labeled directed graph (a knowledge graph). This knowledge graph can later be used for different AI tasks, such as building Question Answering systems, extracting structured data from text, and supporting semantic inference. Unlike existing open relation extraction tools, which focus on the main relation expressed in a sentence, Graphene aims to maximize the extraction of contextual relations. For example:

Trump withdrew his sponsorship after the second Tour de Trump in 1990 because his business ventures were experiencing financial woes.

In order to capture all the contextual information, Graphene performs the following steps:

Resolves co-references.
Transforms complex sentences (for example, those containing subordinations, coordinations, or appositive phrases) into simple independent sentences (one clause per sentence).
Identifies rhetorical relations between those sentences.
Extracts binary relations (subject, predicate and object) from each sentence.
Merges all the extracted relations into a relation graph (knowledge graph).
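
The last step yields a labeled directed graph. As a rough illustration of that data structure (not Graphene's internal representation), the Python sketch below stores the rhetorical links from the Treasury example in the Example Extractions section as an adjacency list. The short labels p1 to p4 are hypothetical stand-ins for the hexadecimal IDs Graphene emits, and the function names are illustrative only.

```python
from collections import defaultdict

# Propositions from the Treasury example, keyed by hypothetical short labels.
PROPOSITIONS = {
    "p1": ("the Treasury", "will announce", "details of the November refunding"),
    "p2": ("the funding", "will be delayed", ""),
    "p3": ("Congress", "fail", "to increase the Treasury 's borrowing capacity"),
    "p4": ("president Bush", "fail", "to increase the Treasury 's borrowing capacity"),
}

# Rhetorical links as (source, relation, target), copied from the same example.
LINKS = [
    ("p1", "CONTRAST", "p2"),
    ("p2", "CONDITION", "p3"),
    ("p2", "CONDITION", "p4"),
    ("p2", "CONTRAST", "p1"),
    ("p3", "LIST", "p4"),
    ("p4", "LIST", "p3"),
]


def build_graph(links):
    """Group labeled edges by source node (adjacency list)."""
    graph = defaultdict(list)
    for source, relation, target in links:
        graph[source].append((relation, target))
    return dict(graph)


def targets(graph, node, relation):
    """Return the nodes reachable from `node` over edges labeled `relation`."""
    return [t for r, t in graph.get(node, []) if r == relation]


if __name__ == "__main__":
    graph = build_graph(LINKS)
    # Which propositions state the conditions under which p2 holds?
    for label in targets(graph, "p2", "CONDITION"):
        print(label, PROPOSITIONS[label])
```

Keeping the relation label on each edge is what lets a consumer ask for, say, only the conditions attached to a proposition.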

Graphene’s extracted graphs are represented in our RDFNL format, a simple format that facilitates the representation of complex contextual relations in a way that balances machine representation with human legibility. A description of the RDFNL format can be found here. To make the extracted relations easier to process further, Graphene can materialize them into a proper RDF graph serialized according to the N-Triples specification of the RDF standard. A description of the RDF format can be found here. Alternatively, developers can use the direct output class of the API, which is serializable and deserializable as a JSON object.

Example Extractions

Sentence Extraction

Although the Treasury will announce details of the November refunding on Monday, the funding will be delayed if Congress and President Bush fail to increase the Treasury's borrowing capacity.

The serialized class: JSON
The RDFNL format:

# Although the Treasury will announce details of the November refunding on Monday , the funding will be delayed if Congress and President Bush fail to increase the Treasury 's borrowing capacity .

bacf06771e0f4fc5a8e68c30fc77c9c4    0    the Treasury    will announce    details of the November refunding
    S:TEMPORAL    on Monday .
    L:CONTRAST    948eeebd73564adab7dee5c6f177b3b9

948eeebd73564adab7dee5c6f177b3b9    0    the funding    will be delayed        
    L:CONDITION    006a71e51295440fab7a8e8c697d2ba6
    L:CONDITION    e4d86228cff443b7a8e9f6d8a5c5987b
    L:CONTRAST    bacf06771e0f4fc5a8e68c30fc77c9c4

006a71e51295440fab7a8e8c697d2ba6    1    Congress    fail    to increase the Treasury 's borrowing capacity
    L:LIST    e4d86228cff443b7a8e9f6d8a5c5987b

e4d86228cff443b7a8e9f6d8a5c5987b    1    president Bush    fail    to increase the Treasury 's borrowing capacity
    L:LIST    006a71e51295440fab7a8e8c697d2ba6

The RDF N-Triples format: NT
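
The RDFNL listing above can be read back into simple records with a few lines of Python. This is a minimal sketch based only on the layout visible in the example, not on the official format description: it assumes fields are separated by tabs (or, as rendered here, runs of spaces), that the second column is a context layer, that indented S: lines carry textual context, and that indented L: lines point to another extraction's ID. The names Extraction and parse_rdfnl are hypothetical.

```python
import re
from dataclasses import dataclass, field

# Fields are separated by a tab or by a run of two or more spaces.
FIELD_SEP = re.compile(r"\t| {2,}")


@dataclass
class Extraction:
    id: str
    layer: int  # second column; assumed to be a context layer
    subject: str
    predicate: str
    object: str
    simple_contexts: list = field(default_factory=list)  # (relation, text)
    linked_contexts: list = field(default_factory=list)  # (relation, target id)


def parse_rdfnl(text):
    """Parse RDFNL text into a dict mapping extraction id to Extraction."""
    extractions = {}
    current = None
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank line, or the echoed source sentence
        if line[0] in " \t":
            # Indented line: a context attached to the preceding extraction.
            parts = line.split(None, 1)
            if current is None or len(parts) != 2:
                continue
            kind, _, relation = parts[0].partition(":")
            value = parts[1].strip()
            if kind == "S":
                current.simple_contexts.append((relation, value))
            elif kind == "L":
                current.linked_contexts.append((relation, value))
            continue
        cols = FIELD_SEP.split(line.rstrip())
        cols += [""] * (5 - len(cols))  # the object column may be empty
        current = Extraction(cols[0], int(cols[1]), cols[2], cols[3], cols[4])
        extractions[current.id] = current
    return extractions


# First two extractions from the listing above.
SAMPLE = "\n".join([
    "bacf06771e0f4fc5a8e68c30fc77c9c4    0    the Treasury    will announce    details of the November refunding",
    "    S:TEMPORAL    on Monday .",
    "    L:CONTRAST    948eeebd73564adab7dee5c6f177b3b9",
    "",
    "948eeebd73564adab7dee5c6f177b3b9    0    the funding    will be delayed",
    "    L:CONDITION 006a71e51295440fab7a8e8c697d2ba6",
    "    L:CONTRAST    bacf06771e0f4fc5a8e68c30fc77c9c4",
])

if __name__ == "__main__":
    for ext in parse_rdfnl(SAMPLE).values():
        print(ext.subject, "|", ext.predicate, "|", ext.object)
        for relation, target in ext.linked_contexts:
            print("   ", relation, "->", target)
```

For anything beyond quick inspection, the JSON or N-Triples output is likely the safer input, since both have well-defined parsers.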

Full text extraction of the Barack Obama Wikipedia Page (2017-11-06):

The serialized class: JSON
The RDFNL format: RDFNL
The RDF N-Triples format: RDF

Contributors (alphabetical order)

Andre Freitas
Bernhard Bermeitinger
Christina Niklaus
Leonardo Souza
Matthias Cetto
Siegfried Handschuh

Requirements

Java 8 (OpenJDK or Oracle)
Maven 3.3.9
Docker version 17.03+
docker-compose version 1.12+

Setup

Compiling and packaging requires two additional packages:

Sentence Simplification

cd /tmp
wget https://github.com/Lambda-3/SentenceSimplification/archive/v5.0.0.tar.gz -O SentenceSimplification.tar.gz
tar xfa SentenceSimplification.tar.gz
cd SentenceSimplification
mvn -DskipTests install

Discourse Simplification

cd /tmp
wget https://github.com/Lambda-3/DiscourseSimplification/archive/v8.0.0.tar.gz -O DiscourseSimplification.tar.gz
tar xfa DiscourseSimplification.tar.gz
cd DiscourseSimplification
mvn -DskipTests install

More dependencies (requires docker)

Prior to running Graphene, two additional dependencies must be met:

Both are provided with the docker images:

Setup of Graphene

To use coreference resolution, you must have a PyCobalt instance running. One is provided in docker-compose-core.yml; start it with docker-compose -f docker-compose-core.yml up. Then create a config file conf/graphene.conf pointing to the PyCobalt service:

graphene {
	coreference.url = "http://localhost:5128/resolve"
}
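
Before running Graphene with coreference resolution, it can help to confirm that something is listening at the configured address. The Python sketch below uses only the standard library; the function name service_reachable is hypothetical, and it checks only that the host and port accept a TCP connection, not that PyCobalt answers requests correctly.

```python
import socket
from urllib.parse import urlparse


def service_reachable(url, timeout=2.0):
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parsed = urlparse(url)
    host = parsed.hostname
    if host is None:
        return False
    # Fall back to the scheme's default port when the URL names none.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    url = "http://localhost:5128/resolve"  # same value as coreference.url
    print(url, "reachable" if service_reachable(url) else "not reachable")
```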

Graphene-Core is built with

mvn clean package -DskipTests

To include the server, specify the server profile:

mvn -P server clean package -DskipTests

To include the command-line interface, specify the cli profile:

mvn -P cli clean package -DskipTests

To build both interfaces, you can specify both profiles:

mvn -P cli -P server clean package -DskipTests

Docker-Compose

You can build and start the composed images by running:

docker-compose up

A short video tutorial on the Graphene setup for CLI usage (without coreference resolution) is provided here.

Usage

Graphene-Core

Graphene comes with a Java API which is described here.

In order to use the Graphene API within your own Java application, you can import it as a Maven dependency. For this task, install Graphene-Core into your local repository:

mvn clean install -DskipTests

and add the following lines to your project's pom.xml file:

<dependency>
    <groupId>org.lambda3.graphene</groupId>
    <artifactId>graphene-core</artifactId>
    <version>3.0.0-SNAPSHOT</version>
</dependency>

Graphene-Server

For simplified access, we wrapped the Graphene-Core library inside a REST-like web service, which can be started with:

docker-compose up

The usage of the Graphene-Server is described here.

Graphene-CLI

Another way of accessing our service is provided by a command-line interface, which is described here.

Citation

@InProceedings{cetto2018graphene,
  author    = {Matthias Cetto and Christina Niklaus and Andr\'{e} Freitas and Siegfried Handschuh},
  title     = {Graphene: Semantically-Linked Propositions in Open Information Extraction},
  booktitle = {Proceedings of COLING 2018. To appear.},
  year      = {2018}
}
