CoNLL-U Parser parses a CoNLL-U formatted string into a nested Python dictionary. CoNLL-U is often the output of natural language processing tasks.
- It's simple. ~300 lines of code.
- Works with both Python 2 and Python 3
- It has no dependencies
- Nice set of tests with CI setup
- It has 100% test coverage (and has undergone mutation testing)
pip install conllu
Or, if you are using conda:
conda install -c conda-forge conllu
I don't like breaking backwards compatibility, but to be able to add new features I felt I had to. This means that updating from 0.1 to 1.0 might require code changes. Here's a guide on how to upgrade to 1.0.
At the top level, conllu provides two methods, parse and parse_tree. The first one parses sentences and returns a flat list. The other returns a nested tree structure. Let's go through them one by one.
>>> from conllu import parse
>>>
>>> data = """
# text = The quick brown fox jumps over the lazy dog.
1 The the DET DT Definite=Def|PronType=Art 4 det _ _
2 quick quick ADJ JJ Degree=Pos 4 amod _ _
3 brown brown ADJ JJ Degree=Pos 4 amod _ _
4 fox fox NOUN NN Number=Sing 5 nsubj _ _
5 jumps jump VERB VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root _ _
6 over over ADP IN _ 9 case _ _
7 the the DET DT Definite=Def|PronType=Art 9 det _ _
8 lazy lazy ADJ JJ Degree=Pos 9 amod _ _
9 dog dog NOUN NN Number=Sing 5 nmod _ SpaceAfter=No
10 . . PUNCT . _ 5 punct _ _
"""
Now you have the data in a variable called data. Let's parse it:
>>> sentences = parse(data)
>>> sentences
[TokenList<The, quick, brown, fox, jumps, over, the, lazy, dog, .>]
Advanced usage: If you have many sentences (say over a megabyte) to parse, you can avoid loading them all into memory at once by using parse_incr() instead of parse. It takes an opened file and returns a generator instead of a list, so you need to either iterate over it or call list() to get the TokenLists out. Here's how you would use it:
from io import open
from conllu import parse_incr

data_file = open("huge_file.conllu", "r", encoding="utf-8")
for tokenlist in parse_incr(data_file):
    print(tokenlist)
For most files, parse works fine.
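To see why a generator helps with large files, here is a minimal sketch that splits a CoNLL-U stream into sentence blocks lazily. It is not how conllu is implemented; iter_sentence_blocks is a hypothetical helper, and the only assumption is that sentences are separated by blank lines.

```python
import io


def iter_sentence_blocks(lines):
    """Yield one sentence block (a list of lines) at a time.

    Hypothetical helper, not part of conllu. Assumes sentences are
    separated by blank lines, as in the CoNLL-U format.
    """
    block = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            block.append(line)
        elif block:
            # A blank line ends the current sentence.
            yield block
            block = []
    if block:
        # The last sentence may not be followed by a blank line.
        yield block


stream = io.StringIO("# text = Hi\n1\tHi\n\n# text = Bye\n1\tBye\n")
blocks = list(iter_sentence_blocks(stream))
print(len(blocks))
```

Only one block is held in memory at a time while iterating, which is the same reason parse_incr() is preferable for very large files.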
Since one CoNLL-U file usually contains multiple sentences, parse() always returns a list of sentences. Each sentence is represented by a TokenList.
>>> sentence = sentences[0]
>>> sentence
TokenList<The, quick, brown, fox, jumps, over, the, lazy, dog, .>
The TokenList supports indexing, so you can get the first token, represented by an ordered dictionary, like this:
>>> token = sentence[0]
>>> token
OrderedDict([
('id', 1),
('form', 'The'),
('lemma', 'the'),
...
])
>>> token["form"]
'The'
>>> sentence = sentences[0]
>>> sentence
TokenList<The, quick, brown, fox, jumps, over, the, lazy, dog, .>
>>> sentence.filter(form="quick")
TokenList<quick>
By using filter(field1__field2=value) you can filter based on subelements further down in a parsed token.
>>> sentence.filter(feats__Degree="Pos")
TokenList<quick, brown, lazy>
Filters can also be chained (meaning you can do sentence.filter(...).filter(...)), and filtering on multiple properties at the same time (sentence.filter(field1=value1, field2=value2)) means that ALL properties must match.
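The double-underscore lookup and the "ALL properties must match" rule can be illustrated on plain dictionaries. This is only a sketch of the idea, not conllu's own code; lookup and matches are hypothetical helper names.

```python
_MISSING = object()  # marks a path that does not exist in the token


def lookup(token, path):
    """Follow a 'field1__field2' path through nested dicts."""
    value = token
    for key in path.split("__"):
        if not isinstance(value, dict) or key not in value:
            return _MISSING
        value = value[key]
    return value


def matches(token, **query):
    """Return True only if ALL key/value pairs in the query match."""
    return all(lookup(token, path) == expected
               for path, expected in query.items())


tokens = [
    {"form": "quick", "upostag": "ADJ", "feats": {"Degree": "Pos"}},
    {"form": "fox", "upostag": "NOUN", "feats": {"Number": "Sing"}},
    {"form": "over", "upostag": "ADP", "feats": None},
]
hits = [t["form"] for t in tokens
        if matches(t, upostag="ADJ", feats__Degree="Pos")]
print(hits)
```

A token whose feats value is missing or None simply fails the match instead of raising an error.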
Each sentence can also have metadata in the form of comments before the sentence starts. This is available in a property on the TokenList called metadata.
>>> sentence.metadata
OrderedDict([
('text', 'The quick brown fox jumps over the lazy dog.')
])
If you ever want to get your CoNLL-U formatted text back (maybe after changing something?), use the serialize() method:
>>> sentence.serialize()
# text = The quick brown fox jumps over the lazy dog.
1 The the DET DT Definite=Def|PronType=Art 4 det _ _
2 quick quick ADJ JJ Degree=Pos 4 amod _ _
3 brown brown ADJ JJ Degree=Pos 4 amod _ _
4 fox fox NOUN NN Number=Sing 5 nsubj _ _
5 jumps jump VERB VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root _ _
6 over over ADP IN _ 9 case _ _
7 the the DET DT Definite=Def|PronType=Art 9 det _ _
8 lazy lazy ADJ JJ Degree=Pos 9 amod _ _
9 dog dog NOUN NN Number=Sing 5 nmod _ SpaceAfter=No
10 . . PUNCT . _ 5 punct _ _
You can also convert a TokenList to a TokenTree by using to_tree:
>>> sentence.to_tree()
TokenTree<token={id=5, form=jumps}, children=[...]>
That's it!
Sometimes you're interested in the tree structure that hides in the head column of a CoNLL-U file. When this is the case, use parse_tree to get a nested structure representing the sentence.
>>> from conllu import parse_tree
>>> sentences = parse_tree(data)
>>> sentences
[TokenTree<...>]
Advanced usage: If you have many sentences (say over a megabyte) to parse, you can avoid loading them all into memory at once by using parse_tree_incr() instead of parse_tree. It takes an opened file and returns a generator instead of a list, so you need to either iterate over it or call list() to get the TokenTrees out. Here's how you would use it:
from io import open
from conllu import parse_tree_incr

data_file = open("huge_file.conllu", "r", encoding="utf-8")
for tokentree in parse_tree_incr(data_file):
    print(tokentree)
Since one CoNLL-U file usually contains multiple sentences, parse_tree() always returns a list of sentences. Each sentence is represented by a TokenTree.
>>> root = sentences[0]
>>> root
TokenTree<token={id=5, form=jumps}, children=[...]>
To quickly visualize the tree structure you can call print_tree on a TokenTree.
>>> root.print_tree()
(deprel:root) form:jumps lemma:jump upostag:VERB [5]
    (deprel:nsubj) form:fox lemma:fox upostag:NOUN [4]
        (deprel:det) form:The lemma:the upostag:DET [1]
        (deprel:amod) form:quick lemma:quick upostag:ADJ [2]
        (deprel:amod) form:brown lemma:brown upostag:ADJ [3]
    (deprel:nmod) form:dog lemma:dog upostag:NOUN [9]
        (deprel:case) form:over lemma:over upostag:ADP [6]
        (deprel:det) form:the lemma:the upostag:DET [7]
        (deprel:amod) form:lazy lemma:lazy upostag:ADJ [8]
    (deprel:punct) form:. lemma:. upostag:PUNCT [10]
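The nesting printed above follows directly from the head column: each token is a child of the token whose id equals its head, and head 0 marks the root. Here is a small sketch of that grouping in plain Python, using the ids, forms and heads from the example sentence. It is an illustration only, not conllu's implementation, and render is a hypothetical helper.

```python
from collections import defaultdict

# (id, form, head) triples copied from the example sentence.
rows = [
    (1, "The", 4), (2, "quick", 4), (3, "brown", 4), (4, "fox", 5),
    (5, "jumps", 0), (6, "over", 9), (7, "the", 9), (8, "lazy", 9),
    (9, "dog", 5), (10, ".", 5),
]

# Group token ids by the id of their head; head 0 marks the root.
children_of = defaultdict(list)
for token_id, _form, head in rows:
    children_of[head].append(token_id)

forms = {token_id: form for token_id, form, _head in rows}


def render(token_id, depth=0):
    """Return indented lines for a token and everything below it."""
    lines = ["    " * depth + forms[token_id]]
    for child in children_of[token_id]:
        lines.extend(render(child, depth + 1))
    return lines


root_id = children_of[0][0]
outline = render(root_id)
print("\n".join(outline))
```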
To access the token corresponding to the current node in the tree, use token:
>>> root.token
OrderedDict([
('id', 5),
('form', 'jumps'),
('lemma', 'jump'),
...
])
To start walking down the children of the current node, use the children attribute:
>>> children = root.children
>>> children
[
TokenTree<token={id=4, form=fox}, children=[...]>,
TokenTree<token={id=9, form=dog}, children=[...]>,
TokenTree<token={id=10, form=.}, children=None>
]
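Since every node exposes token and children, visiting a whole tree is a short recursive walk. The sketch below uses a namedtuple called Node as a stand-in for a TokenTree so that it runs on its own; Node and walk are hypothetical names, and only the two attributes described above are assumed.

```python
from collections import namedtuple

# Stand-in for a TokenTree: only the token and children attributes
# are modelled; this is not conllu's class.
Node = namedtuple("Node", ["token", "children"])


def walk(node):
    """Yield every token in the subtree, parent first (depth-first)."""
    yield node.token
    for child in node.children or []:  # a leaf may have no children
        for token in walk(child):
            yield token


tree = Node({"id": 5, "form": "jumps"}, [
    Node({"id": 4, "form": "fox"}, [Node({"id": 1, "form": "The"}, None)]),
    Node({"id": 10, "form": "."}, None),
])
forms = [token["form"] for token in walk(tree)]
print(forms)
```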
Just like with parse(), if a sentence has metadata it is available in a property on the TokenTree root called metadata.
>>> root.metadata
OrderedDict([
('text', 'The quick brown fox jumps over the lazy dog.')
])
If you ever want to get your CoNLL-U formatted text back (maybe after changing something?), use the serialize() method:
>>> root.serialize()
# text = The quick brown fox jumps over the lazy dog.
1 The the DET DT Definite=Def|PronType=Art 4 det _ _
2 quick quick ADJ JJ Degree=Pos 4 amod _ _
...
If you want to write it back to a file, you can use something like this:
>>> from conllu import parse_tree
>>> sentences = parse_tree(data)
>>>
>>> # Make some change to sentences here
>>>
>>> with open('file-to-write-to', 'w') as f:
...     f.writelines([sentence.serialize() + "\n" for sentence in sentences])
Far from all CoNLL-U files found in the wild follow the CoNLL-U format specification. conllu tries to parse even files that are malformed according to the specification, but sometimes that doesn't work. For those situations you can change how conllu parses your files.
A normal CoNLL-U file consists of a specific set of fields (id, form, lemma, and so on...). Let's walk through how to parse a custom format using the three options fields, field_parsers, and metadata_parsers. Here's the custom format we'll use.
>>> data = """
# tagset = TAG1|TAG2|TAG3|TAG4
# sentence-123
1 My TAG1|TAG2
2 custom TAG3
3 format TAG4
"""
Now, let's parse this with the default settings, and look specifically at the first token to see how it was parsed.
>>> sentences = parse(data)
>>> sentences[0][0]
OrderedDict([('id', 1), ('form', 'My'), ('lemma', 'TAG1|TAG2')])
The parser has assumed (incorrectly) that the third field must be the default lemma field and parsed it as such. Let's customize this so the parser gets the name right, by setting the fields parameter when calling parse.
>>> sentences = parse(data, fields=["id", "form", "tag"])
>>> sentences[0][0]
OrderedDict([('id', 1), ('form', 'My'), ('tag', 'TAG1|TAG2')])
The only difference is that you now get the correct field name back when parsing. Now let's say you want those two tags returned as a list, instead of as a string that you have to split yourself. This can be done using field_parsers.
>>> split_func = lambda line, i: line[i].split("|")
>>> sentences = parse(data, fields=["id", "form", "tag"], field_parsers={"tag": split_func})
>>> sentences[0][0]
OrderedDict([('id', 1), ('form', 'My'), ('tag', ['TAG1', 'TAG2'])])
That's much better! field_parsers specifies a mapping from a field name to a function that can parse that field. In our case, we specify that the field with custom logic is "tag" and that the function to handle it is split_func. Each field_parser gets sent two parameters:
- line: The whole list of values from this line, split on whitespace. The reason you get the full line is so you can merge several tokens into one using a field_parser if you wanted.
- i: The current location in the line where you currently are. Most often, you'll use line[i] to get the current value.
In our case, we return line[i].split("|"), which returns a list, just like we want.
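If the parsing logic grows beyond a one-liner, a named function with the same (line, i) signature may be easier to read and test than a lambda. The sketch below is one possible version; parse_tags is a hypothetical name, and treating "_" as "no tags" is an assumption made for illustration, not something the custom format above requires.

```python
def parse_tags(line, i):
    """Field parser: split the current column on "|".

    Follows the (line, i) signature described above. Treating "_" as
    "no tags" is an assumption made for this sketch.
    """
    value = line[i]
    if value == "_":
        return []
    return value.split("|")


columns = ["1", "My", "TAG1|TAG2"]
print(parse_tags(columns, 2))
```

Such a function would be passed in the same way as split_func, for example field_parsers={"tag": parse_tags}.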
Let's look at the metadata in this example.
"""
# tagset = TAG1|TAG2|TAG3|TAG4
# sentence-123
"""
None of these values are valid in CoNLL-U, but since the first line follows the key-value format of other (valid) fields, conllu will parse it anyway:
>>> sentences = parse(data)
>>> sentences[0].metadata
OrderedDict([('tagset', 'TAG1|TAG2|TAG3|TAG4')])
Let's return this as a list using the metadata_parsers parameter.
>>> sentences = parse(data, metadata_parsers={"tagset": lambda key, value: (key, value.split("|"))})
>>> sentences[0].metadata
OrderedDict([('tagset', ['TAG1', 'TAG2', 'TAG3', 'TAG4'])])
A metadata parser behaves similarly to a field parser, but since most comments you'll see will be of the form "key = value", these values will be parsed and cleaned first, and then sent to your custom metadata_parser. Here we just take the value, split it on "|", and return a list. And lo and behold, we get what we wanted!
Now, let's deal with the "sentence-123" comment. Specifying another metadata_parser won't work, because this is an ID that will be different for each sentence. Instead, let's use a special metadata parser, called __fallback__.
>>> sentences = parse(data, metadata_parsers={
... "tagset": lambda key, value: (key, value.split("|")),
... "__fallback__": lambda key, value: ("sentence-id", key)
... })
>>> sentences[0].metadata
OrderedDict([
('tagset', ['TAG1', 'TAG2', 'TAG3', 'TAG4']),
('sentence-id', 'sentence-123')
])
Just what we wanted! __fallback__ gets called any time none of the other metadata_parsers match, and just like the others, it gets sent the key and value of the current line. In our case, the line contains no "=" to split on, so key will be "sentence-123" and value will be empty. We can return whatever we want here, but let's just say we want to call this field "sentence-id" so we return that as the key, and "sentence-123" as our value.
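The same two parsers can also be written as named functions, which makes them easy to check on their own before handing them to parse. This is a sketch with hypothetical function names; passing None as the empty value in the last call is an assumption for illustration, since the text only says the value is empty.

```python
def parse_tagset(key, value):
    """Return the tagset comment as a (key, list of tags) pair."""
    return (key, value.split("|"))


def parse_sentence_id(key, value):
    """Fallback: store a bare comment under the key "sentence-id"."""
    # value is ignored because the comment had no "=" to split on.
    return ("sentence-id", key)


metadata_parsers = {
    "tagset": parse_tagset,
    "__fallback__": parse_sentence_id,
}

print(metadata_parsers["tagset"]("tagset", "TAG1|TAG2|TAG3|TAG4"))
print(metadata_parsers["__fallback__"]("sentence-123", None))
```

The resulting dictionary has the same shape as the one passed to parse(data, metadata_parsers=...) above.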
Finally, consider an even trickier case.
>>> data = """
# id=1-document_id=36:1047-span=1
1 My TAG1|TAG2
2 custom TAG3
3 format TAG4
"""
This is actually three different comments, but somehow they are separated by "-" instead of on their own lines. To handle this, we get to use the ability of a metadata_parser to return multiple matches from a single line.
>>> sentences = parse(data, metadata_parsers={
... "__fallback__": lambda key, value: [pair.split("=") for pair in (key + "=" + value).split("-")]
... })
>>> sentences[0].metadata
OrderedDict([
('id', '1'),
('document_id', '36:1047'),
('span', '1')
])
Our fallback parser returns a list of matches, one per pair of metadata comments we find. The key + "=" + value trick is needed since by default conllu assumes that this is a valid comment, so key is "id" and value is everything after the first "=", "1-document_id=36:1047-span=1" (note the missing "id=" in the beginning). We need to add it back before splitting on "-".
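The fallback parser above can be unpacked into a named function to make each step visible. This is a sketch rather than the library's recommended form; split_packed_comment is a hypothetical name. It uses partition so that a value containing a further "=" stays in one piece, and it returns tuples where the lambda returns two-element lists.

```python
def split_packed_comment(key, value):
    """Split 'id=1-document_id=36:1047-span=1' style comments.

    key and value arrive already split on the first "=", so the
    original line is rebuilt before splitting on "-".
    """
    full = key + "=" + value
    pairs = []
    for chunk in full.split("-"):
        # partition splits on the first "=" only.
        name, _, rest = chunk.partition("=")
        pairs.append((name, rest))
    return pairs


pairs = split_packed_comment("id", "1-document_id=36:1047-span=1")
print(pairs)
```

Note that this approach, like the lambda, breaks if a value itself contains "-", so it only suits files where that cannot happen.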
And that's it! Using these tricks you should be able to parse all the strange files you stumble into.
- Make a fork of the repository to your own GitHub account.
- Clone the repository locally on your computer:
    git clone git@github.com:YOURUSERNAME/conllu.git conllu
    cd conllu
- Install the library used for running the tests:
    pip install tox
- Now you can run the tests:
    tox
  This runs tox across all supported versions of Python, and also runs checks for code-coverage, syntax errors, and how imports are sorted.
- (Alternative) If you just have one version of Python installed, and don't want to go through the hassle of installing multiple versions of Python (hint: Install pyenv and pyenv-tox), it's fine to run tox with just one version of Python:
    tox -e py36
- Make a pull request. Here's a good guide on PRs from GitHub.
Thanks for helping conllu become a better library!