/protograph

Transform a stream of messages into a graph

Primary LanguageScala

protograph

Transform a stream of messages into a graph

PROTOGRAPH

what is protograph?

Protograph is a protocol for transforming messages from any given schema into a set of graph vertexes and edges.

To do this, you compose a protograph.yaml describing how to create vertexes and edges given a message of a variety of shapes (called labels in Protograph).

Given a well-constructed protograph.yaml, input for Protograph is a stream of messages described in a Protocol Buffers schema, and the output is a list of vertexes and edges, in a schema of their own.

protograph describes a property graph

To Protograph, vertexes and edges contain properties: ie, key/value pairs which are associated to a given vertex or edge. These properties are arbitrary structures containing one of these types:

  • string
  • number (integers or doubles)
  • list of any mixed values
  • map of strings to any value (string, number, list or map)

A vertex contains three keys:

  • label (a string declaring the type of vertex)
  • gid (a globally unique identifier constructed from the data contained in the message
  • data (containing all of the other data)

An edge has two terminals, a from and to, each with their own labels:

  • fromLabel (the label of the from vertex for the edge)
  • toLabel (the label of the to vertex for the edge)
  • label (the label of the edge itself).
  • from (the gid of the from vertex for the edge)
  • to (the gid of the to vertex for the edge)
  • data (once again, the rest of the data is here).

A basic example

input to Protograph representing a single Variant

{"sample": "biosample:CCLE:1321N1_CENTRAL_NERVOUS_SYSTEM",
 "referenceName": "1",
 "start": 10521380,
 "end": 10521380,
 "referenceBases": "A",
 "alternateBases": ["-"],
 "type": "call"}

protograph.yaml representing the transformation

- label: Variant
  match:
    type: call
  vertexes:
    - label: Variant
      gid: "variant:{{referenceName}}:{{start}}:{{end}}:{{referenceBases}}:{{alternateBases}}"
      merge: true
      filter:
        - sample
  edges:
    - fromLabel: Variant
      toLabel: Biosample
      label: variantInBiosample
      from: "variant:{{referenceName}}:{{start}}:{{end}}:{{referenceBases}}:{{alternateBases}}"
      to: "{{sample}}"

output from Protograph (both a vertex and an edge)

{"label": "Variant"
 "gid": "variant:1:10521380:10521380:A:-"
 "data": {
   "referenceName": "1",
   "start": 10521380,
   "end": 10521380,
   "referenceBases": "A",
   "alternateBases": ["-"]}}}

{"label": "variantInBiosample",
 "fromLabel": "Variant",
 "from": "variant:1:10521380:10521380:A:-"
 "toLabel": "Biosample",
 "to": "biosample:CCLE:1321N1_CENTRAL_NERVOUS_SYSTEM",
 "gid": "(variant:1:10521380:10521380:A:-)--variantInBiosample->(biosample:CCLE:1321N1_CENTRAL_NERVOUS_SYSTEM)",
 "data": {}}

To see a larger example, check out the protograph.yaml that comes with this repository.

protograph works with typed messages

Protograph directives are partitioned by type. When creating a protobuffer schema you declare a series of message types, and in protograph.yaml you refer to these type names when declaring how each message will be processed. This lives under the label key:

# a typed message
- label: Variant

each message type has a gid

Gids are one of the key concepts of Protograph. A gid (global identifier) refers to an identifier that can be entirely constructed from the message itself. Each message type declares a gid template that accepts the message as an argument and constructs the gid from values found within.

# this gid is composed of several properties
gid: "variant:{{referenceName}}:{{start}}:{{end}}:{{referenceBases}}:{{alternateBases}}"

messages reference one another through gids

Gids are used to link messages together. Typically a message will contain a gid for another message under some property in a string (for a single link) or list (for a multitude of links). Sometimes these references will be embedded inside an inner map, or list of maps. Protograph enables you to specify references anywhere they may live.

# this variant came from a sample
"sample": "biosample:CCLE:1321N1_CENTRAL_NERVOUS_SYSTEM"

protograph transformations describe the construction of vertexes and edges

In general, you specify a transformation for a given message type by describing what the output is going to look like in terms of the input map. This way you can transform messages of any shape or schema into graph elements.

To specify the transformations, you declare what vertexes and edges are generated from a given message label. Each message type can generate any number of vertexes and edges:

label: Variant
vertexes:
  - label: Variant
    gid: "variant:{{referenceName}}:{{start}}:{{end}}:{{referenceBases}}:{{alternateBases}}"
    merge: true
    filter:
      - sample
edges:
  - fromLabel: Variant
    toLabel: Biosample
    label: variantInBiosample
    from: "variant:{{referenceName}}:{{start}}:{{end}}:{{referenceBases}}:{{alternateBases}}"
    to: "{{sample}}"

protograph fields are constructed using selmer templates

Each field in protograph uses a template to construct its final value out of fields in the provided input message. These templates use the double curly brace paradigm to splice values into a larger string. In its simplest form this can literally be splicing a value from the input map directly in:

# this input
{value: 5}

# through this template
"{{value}}"

# creates this output
5

These templates can use dot notation to access into a nested structure:

# this input
{outer: {inner: {container: [{jewel: 88888}]}}}

# through this template
"extracting the {{outer.inner.container.0.jewel}}"

# creates this output
extracting the 88888

There are even simple filters you can trigger using the | operator:

# this input
{piles: [1, 2, 3, 5, 4, 5, 2, 3, 4, 1]}

# through this template
"{{piles|join:!}}"

# creates this output
1!2!3!5!4!5!2!3!4!1

There are a variety of filters available. For more information check out the Selmer documentation.

protograph has a protobuffer schema

There is a protobuffer schema for Protograph defined here: Protograph schema.

how to write protograph

The overall structure of a protograph.yaml is a list of transforms indexed by label:

- label: Variant
  ....

- label: Gene
  ....

- label: Biosample
  ....

determining the label of the message

When messages are processed, the first thing that happens is the label of the incoming message is matched to one of the protograph transforms. Once a label is chosen, each transform under that label is run on the given message.

Protograph has three ways of determining the label of the incoming message.

matching the label in the incoming message

The most flexible way is to determine the label from the incoming message. To do this, before the vertexes or edges entry you can add a match entry with a key and value (or multiple keys and values). If one of these matches the incoming message, this protograph entry will be used.

In the above example we had this section:

label: Variant
match:
  type: call
vertexes:
  ....

This will match any message that has the value "call" under the type key:

{...,
 type: call,
 start: 18232189,
 ...}

matching the file/topic name

In the absence of a match directive, Protograph will attempt to parse the filename or topic name. Here are some possible parsings:

  • from.somewhere.Variant.json --> Variant
  • a.topic.of.streaming.Biosample --> Biosample

using the --label flag

If all of these fail you can also supply the label Protograph will use to interpret the incoming messages with the --label flag on invocation. This will indiscriminately apply this label to all incoming messages, unless the messages match an existing match clause, in which case it will just use that directive.

specifying how vertexes and edges are generated from the message

Transforms are of two types: transforms that produce vertexes and transforms that produce edges. These live under the vertexes and edges keys respectively.

- label: Variant
  vertexes:
    ....
  edges:
    ....

There are many commonalities between creating vertexes and edges, but minor differences as well. As said in the beginning, a vertex has three keys:

  • label (a string declaring the type of vertex)
  • gid (a globally unique identifier constructed from the data contained in the message
  • data (containing all of the other data)

An edge has six keys: two terminals, a from and to, each with their own labels:

  • fromLabel (the label of the from vertex for the edge)
  • toLabel (the label of the to vertex for the edge)
  • label (the label of the edge itself).
  • from (the gid of the from vertex for the edge)
  • to (the gid of the to vertex for the edge)
  • data (once again, the rest of the data is here).

As you can see, both have a label and data, but the vertex also defines a unique gid while the edge specifies the vertexes it is connected to through from and to, and the labels of those vertexes with fromLabel and toLabel.

Each of these fields is constructed from a template as described in the section above protograph fields are constructed using selmer templates. Therefore, a vertex transform may look like this:

- label: Variant
  match:
    type: call
  vertexes:
    - label: Mutation
      gid: "variant:{{referenceName}}:{{start}}:{{end}}:{{referenceBases}}:{{alternateBases}}"
      data:
        alternateBases: "{{alternateBases|join:,}}"

merge/filter

Sometimes you want all (or most) of the fields present in the input message to appear in the output message, and you don't want to make an entry under data for each one (or maybe you don't even know what all of them are beforehand). This is where merge comes in:

- label: Variant
  gid: "variant:{{referenceName}}:{{start}}:{{end}}:{{referenceBases}}:{{alternateBases}}"
  merge: true

Saying merge: true will merge all fields from the input message into the output message. If you want all of them except for certain ones, you can add a filter entry under the merge:

- label: Variant
  gid: "variant:{{referenceName}}:{{start}}:{{end}}:{{referenceBases}}:{{alternateBases}}"
  merge: true
  filter:
    - sample

The filter is a list of fields to exclude from the merge.

splice

splice is similar to merge, but this time you are splicing in some nested object into the top level. During a splice there is no filter step, you just get the whole map at the top level. Like the filter directive, splice takes a list of paths:

- label: Variant
  gid: "variant:{{referenceName}}:{{start}}:{{end}}:{{referenceBases}}:{{alternateBases}}"
  splice:
    - info
    - center.source

index

Many times you have an array of things in the incoming message that entail an output of many edges, for instance. Take this example:

{"name": "azacitidine",
 "smiles": "Nc1ncn([C@@H]2O[C@H](CO)[C@@H](O)[C@H]2O)c(=O)n1",
 "targets": ["DNMT1", "BRAF"],
 ....}

We want to turn everything in the targets array into an edge. In cases like these, we can use _index!

vertexes:
  - label: Compound
    gid: "compound:{{name}}"
    merge: true
    filter:
      - targets
edges:
  - index: targets
    fromLabel: Compound
    toLabel: Gene
    label: targetsGene
    from: "compound:{{name}}"
    to: "gene:{{_index}}"

Notice for the edges, we declare the index to be the targets field, then later in the to field we can reference each item in the targets array using _index.

The index field can also use filters, so say you don't have an array but a comma-separated string:

{"name": "azacitidine",
 "smiles": "Nc1ncn([C@@H]2O[C@H](CO)[C@@H](O)[C@H]2O)c(=O)n1",
 "targets": "DNMT1,BRAF",
 ....}

Insidious! Yet, we can handle this as well using the split filter:

- index: targets|split:,

This makes the edges identical to the previous ones.

field types

Sometimes fields have a type beyond string. Currently supported are:

  • int
  • float

Otherwise the value is interpreted as a string.

In order to specify the type of a field, under the data key you can append .int or .float to the field name, and they will be interpreted as that type:

....
gid: "orb:{{name}}"
data:
  orb: "{{orb}}"
....

Input:

{....,
 name: glowing,
 orb: 99919,
 ....}

Output:

{....,
 gid: "orb:glowing",
 data: {
   orb: "99919",
   ....
 }}

Not what we wanted (orb is output as a string). We can edit the protograph.yaml to give the orb field its proper type:

....
gid: "orb:{{name}}"
data:
  orb.int: "{{orb}}"
....

Now our output becomes:

{....,
 gid: "orb:glowing",
 data: {
   orb: 99919,
   ....
 }}

Much better!

nested messages

Sometimes you have a big input message with submessages embedded inside, and you've already written some protograph for those and would rather not repeat yourself. You can trigger the processing of any subpart of a message as if it were the top level of a message with a different (or the same!) label.

Here is how this works. Alongside the other top-level protograph keys (label, match, vertexes, and edges) you can add an inner key of the form:

label: Container
inner:
  path: some.inner.key
  label: Inside

label: Inside
vertexes:
  ....

Now, whenever we process a Container message, whatever value is nested inside the keys some.inner.key will be interpreted as a message with the label Inside. This also works with index, so you can process a nested list of submessages:

label: Container
inner:
  index: some.inner.key
  path: _index.even.deeper
  label: DeeperInside

running protograph

You can run Protograph either by transforming a directory containing input messages into Vertex and Edge output files, or by consuming a Kafka topic and emitting to another pair of Kafka topics (one for Vertex and one for Edge).

Either way, start by downloading the latest release.

protograph transform with files

To run Protograph on a directory of input files, use the --input and --output options, along with the path to your protograph.yaml under --protograph:

java -jar protograph.jar --protograph path/to/protograph.yaml --input /path/to/input/messages.Label.json --output /path/to/output/with/file.prefix

Once processing is complete, it will output two files of the form:

/path/to/output/with/file.prefix.Vertex.json
/path/to/output/with/file.prefix.Edge.json

depending on what you passed to --output. If you know all messages will have a certain label and don't care about matching or parsing to find it, you can specify the label on the command line with --label. Note that this will still use the match directives to match labels if they have them, so you can use this to provide a default label for unmatched messages.

protograph transform using kafka

To run Protograph in Kafka mode you must have access to a Kafka node with some topics to import.

java -jar protograph.jar --protograph path/to/protograph.yaml --topic "topic1 topic2 topic3"

This will by default output to the Kafka topics protograph.Vertex and protograph.Edge. To change the prefix for these topics pass in something under the --prefix key:

# this will output to the topics inspired.project.Vertex and inspired.project.Edge
java -jar protograph.jar --protograph path/to/protograph.yaml --topic "topic1 topic2 topic3" --prefix inspired.project

If you need to change the kafka host, pass it in under --kafka:

java -jar protograph.jar --protograph path/to/protograph.yaml --kafka 10.96.11.82:9092 --topic "topic1 topic2 topic3"

generating dot files

You can also use protograph to generate a dot file representing the connections between all the node types. To do so, run the following command:

java -cp protograph.jar clojure.main -m protograph.dot --protograph path/to/protograph.yaml --output path/to/output.dot

Then you can generate a png representing the graph using the following command (assuming you have graphviz installed):

dot path/to/output.dot -Tpng -oprotograph.png

Here is an example using the protograph for BMEG:

DOT