This repository contains format converters developed in the Database Research Group for similarity search on tree-structured data.
The base format of our converters is the so-called bracket notation: every converter translates either to it or from it. Bracket notation uses nested curly braces to represent the tree structure (nodes and labels).
A grammar describing the bracket notation currently looks as follows (in ANTLR format):
node : '{' LABEL? node* '}';
LABEL : [a-zA-Z0-9#_.:+-]+ ;
Example tree in bracket notation:
{a{b{d{f}}{e}}{c}}
and its more natural view:
        a
       / \
      b   c
     / \
    d   e
    |
    f
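To make the grammar concrete, here is a minimal sketch of a hand-written recursive-descent parser for bracket notation. It is an illustration only and not part of the converters, which use ANTLR-generated parsers. The names `parse_bracket`, `to_bracket` and the `(label, children)` tuple representation are assumptions made for this example.

```python
import re

# LABEL characters taken from the grammar: [a-zA-Z0-9#_.:+-]+
LABEL_RE = re.compile(r"[a-zA-Z0-9#_.:+-]+")


def parse_bracket(text):
    """Parse a bracket-notation string into a (label, children) tuple."""
    node, pos = _parse_node(text, 0)
    if pos != len(text):
        raise ValueError(f"unexpected trailing input at position {pos}")
    return node


def _parse_node(text, pos):
    # node : '{' LABEL? node* '}'
    if pos >= len(text) or text[pos] != "{":
        raise ValueError(f"expected '{{' at position {pos}")
    pos += 1
    # Optional label directly after the opening brace.
    match = LABEL_RE.match(text, pos)
    label = match.group(0) if match else ""
    if match:
        pos = match.end()
    # Zero or more child nodes.
    children = []
    while pos < len(text) and text[pos] == "{":
        child, pos = _parse_node(text, pos)
        children.append(child)
    if pos >= len(text) or text[pos] != "}":
        raise ValueError(f"expected '}}' at position {pos}")
    return (label, children), pos + 1


def to_bracket(node):
    """Serialise a (label, children) tuple back to bracket notation."""
    label, children = node
    return "{" + label + "".join(to_bracket(c) for c in children) + "}"


tree = parse_bracket("{a{b{d{f}}{e}}{c}}")
print(tree)
print(to_bracket(tree))
```

Like the grammar above, this sketch does not skip whitespace, so input containing spaces or newlines between nodes is rejected.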
The following converters are currently available:
- XML to Bracket.
- Newick to Bracket: Newick format is used to represent phylogenetic trees in evolutionary biology.
- Bracket to Dot: Dot is a format to represent graphs used by Graphviz.
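As a rough illustration of what such a conversion involves, the following sketch maps XML element names to bracket-notation labels using only the Python standard library. It is not the XML converter from this repository: the function name `element_to_bracket` is hypothetical, and the sketch assumes that attributes and text content are ignored and that tag names carry no namespaces.

```python
import xml.etree.ElementTree as ET


def element_to_bracket(elem):
    """Render an element and its descendants in bracket notation.

    Only tag names are kept as labels; attributes and text are dropped
    (an assumption made for this sketch).
    """
    return "{" + elem.tag + "".join(element_to_bracket(child) for child in elem) + "}"


xml_source = "<a><b><d><f/></d><e/></b><c/></a>"
print(element_to_bracket(ET.fromstring(xml_source)))
```

With the sample input, the result is the example tree shown earlier in bracket notation.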
The source code is published under the MIT licence found in the root directory of the project and in the header of each source file.
TODO: The XML grammars have a BSD licence. Is that a problem?
Here we describe the general building process of our converters. Converter-specific execution details and examples are listed in the respective README files.
We implemented our converters in Python 3 and use ANTLR4 for grammars and parser generation.
Download the ANTLR jar file to the root directory of the format converters.
wget http://www.antlr.org/download/antlr-4.8-complete.jar
The easiest way to install the ANTLR4 runtime for Python 3 is to use pip. See the package website for details. On Debian, use pip3 instead of pip.
pip install antlr4-python3-runtime
Execute the make.sh script to compile all grammars.
./make.sh
Execute python format_converters --help for help.
Let's assume you have a Python script in the parent directory of format_converters and you want to use the bracket-to-dot converter. You can then import and use it as follows.
import format_converters.tree_formats.bracket.dot.converter as BDConverter
source = "{a{b{d{f}}{e}}{c}}"
print(BDConverter.convert(source))