CoNLL-U Parser parses a CoNLL-U formatted string into a nested Python dictionary. CoNLL-U is often the output of natural language processing tasks.
- It's simple. ~300 lines of code.
- It has no dependencies
- Full typing support so your editor can do autocompletion
- Nice set of tests with CI setup:
- It has 100% test branch coverage (and has undergone mutation testing)
Note: As of conllu 4.0, Python 3.6 is required to install conllu. See Notes on updating from 3.0 to 4.0
pip install conllu
Or, if you are using conda:
conda install -c conda-forge conllu
Conllu version 4.0 drops support for Python 2 and all versions of Python earlier than 3.6. If you need support for older versions of Python, you can always pin your install to an old version of conllu. You can install it with pip install conllu==3.1.1.
The Universal Dependencies 2.0 release changed two of the field names, from xpostag -> xpos and upostag -> upos. Version 3.0 of conllu handles this by aliasing the previous names to the new names. This means you can use either xpos/upos or xpostag/upostag; both return the same thing. This does change the public API slightly, so I've upped the major version to 3.0, but I've taken care to ensure you most likely DO NOT have to update your code when you update to 3.0.
I don't like breaking backwards compatibility, but to be able to add new features I felt I had to. This means that updating from 0.1 to 1.0 might require code changes. Here's a guide on how to upgrade to 1.0.
At the top level, conllu provides two methods, parse and parse_tree. The first one parses sentences and returns a flat list. The other returns a nested tree structure. Let's go through them one by one.
>>> from conllu import parse
>>>
>>> data = """
# text = The quick brown fox jumps over the lazy dog.
1	The	the	DET	DT	Definite=Def|PronType=Art	4	det	_	_
2	quick	quick	ADJ	JJ	Degree=Pos	4	amod	_	_
3	brown	brown	ADJ	JJ	Degree=Pos	4	amod	_	_
4	fox	fox	NOUN	NN	Number=Sing	5	nsubj	_	_
5	jumps	jump	VERB	VBZ	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	0	root	_	_
6	over	over	ADP	IN	_	9	case	_	_
7	the	the	DET	DT	Definite=Def|PronType=Art	9	det	_	_
8	lazy	lazy	ADJ	JJ	Degree=Pos	9	amod	_	_
9	dog	dog	NOUN	NN	Number=Sing	5	nmod	_	SpaceAfter=No
10	.	.	PUNCT	.	_	5	punct	_	_
"""
Now you have the data in a variable called data. Let's parse it:
>>> sentences = parse(data)
>>> sentences
[TokenList<The, quick, brown, fox, jumps, over, the, lazy, dog, ., metadata={text: "The quick brown fox jumps over the lazy dog."}>]
Advanced usage: If you have many sentences (say over a megabyte) to parse at once, you can avoid loading them into memory at once by using parse_incr() instead of parse. It takes an opened file, and returns a generator instead of the list directly, so you need to either iterate over it, or call list() to get the TokenLists out. Here's how you would use it:
from io import open
from conllu import parse_incr

data_file = open("huge_file.conllu", "r", encoding="utf-8")
for tokenlist in parse_incr(data_file):
    print(tokenlist)
For most files, parse works fine.
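The idea of yielding one sentence at a time can be illustrated with the standard library alone. The sketch below is not conllu's implementation; iter_sentence_blocks is a hypothetical helper that assumes sentences are separated by blank lines, and it only shows why a generator keeps memory use low.

```python
import io
from typing import Iterator, TextIO


def iter_sentence_blocks(handle: TextIO) -> Iterator[str]:
    """Yield one blank-line-separated sentence block at a time (hypothetical helper)."""
    buffer = []
    for line in handle:
        if line.strip():
            # Non-empty line: part of the current sentence.
            buffer.append(line.rstrip("\n"))
        elif buffer:
            # Blank line: the current sentence is complete.
            yield "\n".join(buffer)
            buffer = []
    if buffer:
        # The last sentence may not be followed by a blank line.
        yield "\n".join(buffer)


# A tiny in-memory "file" with two sentences.
sample = io.StringIO("1\tHello\n2\tworld\n\n1\tBye\n")
blocks = list(iter_sentence_blocks(sample))
print(len(blocks))
```

Only one sentence block is held in memory at any time, which is the same reason parse_incr() is preferable for very large files.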
Since one CoNLL-U file usually contains multiple sentences, parse() always returns a list of sentences. Each sentence is represented by a TokenList.
>>> sentence = sentences[0]
>>> sentence
TokenList<The, quick, brown, fox, jumps, over, the, lazy, dog, ., metadata={text: "The quick brown fox jumps over the lazy dog."}>
The TokenList supports indexing, so you can get the first token, represented by an ordered dictionary, like this:
>>> token = sentence[0]
>>> token
{
'id': 1,
'form': 'The',
'lemma': 'the',
...
}
>>> token["form"]
'The'
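To make the token fields concrete, here is a minimal standard-library sketch of how one tab-separated CoNLL-U line maps onto the ten standard column names. It is not conllu's own parser: split_token_line is a hypothetical helper, it assumes a plain integer id and head, and it leaves the remaining columns as raw strings.

```python
# The ten standard CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def split_token_line(line):
    """Turn one tab-separated token line into a dict (simplified sketch)."""
    token = dict(zip(FIELDS, line.split("\t")))
    token["id"] = int(token["id"])
    token["head"] = int(token["head"])
    feats = token["feats"]
    # "_" means "no value"; otherwise feats is a |-separated list of Key=Value pairs.
    token["feats"] = None if feats == "_" else dict(
        pair.split("=", 1) for pair in feats.split("|")
    )
    return token


line = "1\tThe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t4\tdet\t_\t_"
token = split_token_line(line)
print(token["form"], token["feats"])
```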
>>> sentence = sentences[0]
>>> sentence
TokenList<The, quick, brown, fox, jumps, over, the, lazy, dog, ., metadata={text: "The quick brown fox jumps over the lazy dog."}>
>>> sentence.filter(form="quick")
TokenList<quick>
By using filter(field1__field2=value) you can filter based on subelements further down in a parsed token.
>>> sentence.filter(feats__Degree="Pos")
TokenList<quick, brown, lazy>
Filters can also be chained (meaning you can do sentence.filter(...).filter(...)), and filtering on multiple properties at the same time (sentence.filter(field1=value1, field2=value2)) means that ALL properties must match.
You can also filter using a lambda function as value. This is useful if you, for instance, would like to keep only tokens with integer IDs:
>>> from conllu.models import TokenList, Token
>>> sentence2 = TokenList([
... Token(id=(1, "-", 2), form="It's"),
... Token(id=1, form="It"),
... Token(id=2, form="is"),
... ])
>>> sentence2
TokenList<It's, It, is>
>>> sentence2.filter(id=lambda x: type(x) is int)
TokenList<It, is>
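The matching rules described above (double-underscore paths into nested values, callables as values, and all conditions having to match) can be sketched over plain dictionaries. This is only an illustration under those assumptions, not conllu's code; matches is a hypothetical helper.

```python
def matches(token, **conditions):
    """Return True if the token dict satisfies every condition (sketch)."""
    for path, expected in conditions.items():
        value = token
        # "feats__Degree" walks token["feats"]["Degree"].
        for key in path.split("__"):
            if not isinstance(value, dict) or key not in value:
                return False
            value = value[key]
        if callable(expected):
            # A callable acts as a predicate on the found value.
            if not expected(value):
                return False
        elif value != expected:
            return False
    return True


tokens = [
    {"id": 2, "form": "quick", "feats": {"Degree": "Pos"}},
    {"id": 4, "form": "fox", "feats": {"Number": "Sing"}},
    {"id": (1, "-", 2), "form": "It's", "feats": None},
]
pos_forms = [t["form"] for t in tokens if matches(t, feats__Degree="Pos")]
int_forms = [t["form"] for t in tokens if matches(t, id=lambda x: type(x) is int)]
print(pos_forms, int_forms)
```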
If you want to change your CoNLL-U file, there are a couple of convenience methods to know about.
You can add a new token by simply appending a dictionary with the fields you want to a TokenList:
>>> sentence3 = TokenList([
... {"id": 1, "form": "Lazy"},
... {"id": 2, "form": "fox"},
... ])
>>> sentence3
TokenList<Lazy, fox>
>>> sentence3.append({"id": 3, "form": "box"})
>>> sentence3
TokenList<Lazy, fox, box>
Changing a sentence just means indexing into it, and setting a value to what you want:
>>> sentence4 = TokenList([
... {"id": 1, "form": "Lazy"},
... {"id": 2, "form": "fox"},
... ])
>>> sentence4[1]["form"] = "crocodile"
>>> sentence4
TokenList<Lazy, crocodile>
>>> sentence4[1] = {"id": 2, "form": "elephant"}
>>> sentence4
TokenList<Lazy, elephant>
If you omit a field when passing in a dict, conllu will fill in a "_" for those values.
>>> sentences = parse("1\tThe")
>>> sentences[0].append({"id": 2})
>>> sentences[0]
TokenList<The, _>
Each sentence can also have metadata in the form of comments before the sentence starts. This is available in a property on the TokenList called metadata.
>>> sentence.metadata
{'text': 'The quick brown fox jumps over the lazy dog.'}
If you ever want to get your CoNLL-U formatted text back (maybe after changing something?), use the serialize() method:
>>> sentence.serialize()
# text = The quick brown fox jumps over the lazy dog.
1 The the DET DT Definite=Def|PronType=Art 4 det _ _
2 quick quick ADJ JJ Degree=Pos 4 amod _ _
3 brown brown ADJ JJ Degree=Pos 4 amod _ _
4 fox fox NOUN NN Number=Sing 5 nsubj _ _
5 jumps jump VERB VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root _ _
6 over over ADP IN _ 9 case _ _
7 the the DET DT Definite=Def|PronType=Art 9 det _ _
8 lazy lazy ADJ JJ Degree=Pos 9 amod _ _
9 dog dog NOUN NN Number=Sing 5 nmod _ SpaceAfter=No
10 . . PUNCT . _ 5 punct _ _
You can also convert a TokenList to a TokenTree by using to_tree:
>>> sentence.to_tree()
TokenTree<token={id=5, form=jumps}, children=[...]>
That's it!
Sometimes you're interested in the tree structure that hides in the head column of a CoNLL-U file. When this is the case, use parse_tree to get a nested structure representing the sentence.
>>> from conllu import parse_tree
>>> sentences = parse_tree(data)
>>> sentences
[TokenTree<...>]
Advanced usage: If you have many sentences (say over a megabyte) to parse at once, you can avoid loading them into memory at once by using parse_tree_incr() instead of parse_tree. It takes an opened file, and returns a generator instead of the list directly, so you need to either iterate over it, or call list() to get the TokenTrees out. Here's how you would use it:
from io import open
from conllu import parse_tree_incr

data_file = open("huge_file.conllu", "r", encoding="utf-8")
for tokentree in parse_tree_incr(data_file):
    print(tokentree)
Since one CoNLL-U file usually contains multiple sentences, parse_tree() always returns a list of sentences. Each sentence is represented by a TokenTree.
>>> root = sentences[0]
>>> root
TokenTree<token={id=5, form=jumps}, children=[...]>
To quickly visualize the tree structure you can call print_tree on a TokenTree.
>>> root.print_tree()
(deprel:root) form:jumps lemma:jump upos:VERB [5]
    (deprel:nsubj) form:fox lemma:fox upos:NOUN [4]
        (deprel:det) form:The lemma:the upos:DET [1]
        (deprel:amod) form:quick lemma:quick upos:ADJ [2]
        (deprel:amod) form:brown lemma:brown upos:ADJ [3]
    (deprel:nmod) form:dog lemma:dog upos:NOUN [9]
        (deprel:case) form:over lemma:over upos:ADP [6]
        (deprel:det) form:the lemma:the upos:DET [7]
        (deprel:amod) form:lazy lemma:lazy upos:ADJ [8]
    (deprel:punct) form:. lemma:. upos:PUNCT [10]
To access the token corresponding to the current node in the tree, use token:
>>> root.token
{
'id': 5,
'form': 'jumps',
'lemma': 'jump',
...
}
To start walking down the children of the current node, use the children attribute:
>>> children = root.children
>>> children
[
TokenTree<token={id=4, form=fox}, children=[...]>,
TokenTree<token={id=9, form=dog}, children=[...]>,
TokenTree<token={id=10, form=.}, children=None>
]
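To see where this nesting comes from, the sketch below rebuilds the same shape from nothing but the id and head columns of the example sentence, using plain dictionaries. It is an illustration only, not how conllu builds a TokenTree; the rows list and the build helper are hypothetical names.

```python
from collections import defaultdict

# (id, form, head) for each token of the example sentence.
rows = [
    (1, "The", 4), (2, "quick", 4), (3, "brown", 4), (4, "fox", 5),
    (5, "jumps", 0), (6, "over", 9), (7, "the", 9), (8, "lazy", 9),
    (9, "dog", 5), (10, ".", 5),
]

# Group tokens by the id of their head; head 0 marks the root.
children_of = defaultdict(list)
for token_id, form, head in rows:
    children_of[head].append((token_id, form))


def build(token_id, form):
    """Recursively attach every token whose head is token_id."""
    return {
        "id": token_id,
        "form": form,
        "children": [build(c_id, c_form) for c_id, c_form in children_of[token_id]],
    }


root_id, root_form = children_of[0][0]
tree = build(root_id, root_form)
child_forms = [child["form"] for child in tree["children"]]
print(tree["form"], child_forms)
```

The root's direct children come out as fox, dog and the final period, matching the children list shown above.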
Just like with parse(), if a sentence has metadata it is available in a property on the TokenTree root called metadata.
>>> root.metadata
{'text': 'The quick brown fox jumps over the lazy dog.'}
If you ever want to get your CoNLL-U formatted text back (maybe after changing something?), use the serialize() method:
>>> root.serialize()
# text = The quick brown fox jumps over the lazy dog.
1 The the DET DT Definite=Def|PronType=Art 4 det _ _
2 quick quick ADJ JJ Degree=Pos 4 amod _ _
...
If you want to write it back to a file, you can use something like this:
>>> from conllu import parse_tree
>>> sentences = parse_tree(data)
>>>
>>> # Make some change to sentences here
>>>
>>> with open('file-to-write-to', 'w') as f:
... f.writelines([sentence.serialize() + "\n" for sentence in sentences])
Far from all CoNLL-U files found in the wild follow the CoNLL-U format specification. conllu tries to parse even files that are malformed according to the specification, but sometimes that doesn't work. For those situations you can change how conllu parses your files.
A normal CoNLL-U file consists of a specific set of fields (id, form, lemma, and so on...). Let's walk through how to parse a custom format using the three options fields, field_parsers, and metadata_parsers. Here's the custom format we'll use.
>>> data = """
# tagset = TAG1|TAG2|TAG3|TAG4
# sentence-123
1	My	TAG1|TAG2
2	custom	TAG3
3	format	TAG4
"""
Now, let's parse this with the default settings, and look specifically at the first token to see how it was parsed.
>>> sentences = parse(data)
>>> sentences[0][0]
{'id': 1, 'form': 'My', 'lemma': 'TAG1|TAG2'}
The parser has assumed (incorrectly) that the third field must be the default lemma field and parsed it as such. Let's customize this so the parser gets the name right, by setting the fields parameter when calling parse.
>>> sentences = parse(data, fields=["id", "form", "tag"])
>>> sentences[0][0]
{'id': 1, 'form': 'My', 'tag': 'TAG1|TAG2'}
The only difference is that you now get the correct field name back when parsing. Now let's say you want those two tags returned as a list instead of as a string. This can be done using the field_parsers argument.
>>> split_func = lambda line, i: line[i].split("|")
>>> sentences = parse(data, fields=["id", "form", "tag"], field_parsers={"tag": split_func})
>>> sentences[0][0]
{'id': 1, 'form': 'My', 'tag': ['TAG1', 'TAG2']}
That's much better! field_parsers specifies a mapping from a field name to a function that can parse that field. In our case, we specify that the field with custom logic is "tag" and that the function to handle it is split_func. Each field_parser gets sent two parameters:
- line: The whole list of values from this line, split on whitespace. The reason you get the full line is so you can merge several tokens into one using a field_parser if you want.
- i: The current location in the line where you currently are. Most often, you'll use line[i] to get the current value.
In our case, we return line[i].split("|"), which returns a list like we want.
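Because a field parser is just a function of (line, i), it can be tried out on its own, without parsing anything. The sketch below calls split_func directly on a hand-written list of values. The second function, merge_rest, is a hypothetical example (not part of conllu) of why having the whole line can be useful.

```python
# The same field parser as above: split the current value on "|".
split_func = lambda line, i: line[i].split("|")

line = ["1", "My", "TAG1|TAG2"]
tags = split_func(line, 2)
print(tags)


def merge_rest(line, i):
    """Hypothetical parser: join the current value with everything after it."""
    return " ".join(line[i:])


merged = merge_rest(["1", "New", "York"], 1)
print(merged)
```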
Let's look at the metadata in this example.
"""
# tagset = TAG1|TAG2|TAG3|TAG4
# sentence-123
"""
None of these values are valid in CoNLL-U, but since the first line follows the key-value format of other (valid) fields, conllu will parse it anyway:
>>> sentences = parse(data)
>>> sentences[0].metadata
{'tagset': 'TAG1|TAG2|TAG3|TAG4'}
Let's return this as a list using the metadata_parsers parameter.
>>> sentences = parse(data, metadata_parsers={"tagset": lambda key, value: (key, value.split("|"))})
>>> sentences[0].metadata
{'tagset': ['TAG1', 'TAG2', 'TAG3', 'TAG4']}
A metadata parser behaves similarly to a field parser, but since most comments you'll see will be of the form "key = value" these values will be parsed and cleaned first, and then sent to your custom metadata_parser. Here we just take the value, split it on "|", and return a list back. And lo and behold, we get what we wanted!
Now, let's deal with the "sentence-123" comment. Specifying another metadata_parser won't work, because this is an ID that will be different for each sentence. Instead, let's use a special metadata parser, called __fallback__.
>>> sentences = parse(data, metadata_parsers={
... "tagset": lambda key, value: (key, value.split("|")),
... "__fallback__": lambda key, value: ("sentence-id", key)
... })
>>> sentences[0].metadata
{
'tagset': ['TAG1', 'TAG2', 'TAG3', 'TAG4'],
'sentence-id': 'sentence-123'
}
Just what we wanted! __fallback__ gets called any time none of the other metadata_parsers match, and just like the others, it gets sent the key and value of the current line. In our case, the line contains no "=" to split on, so key will be "sentence-123" and value will be empty. We can return whatever we want here, but let's just say we want to call this field "sentence-id" so we return that as the key, and "sentence-123" as our value.
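The key/value split described here can be approximated with str.partition. The sketch below is a rough model only, not conllu's implementation: split_comment is a hypothetical helper, and it represents an empty value as an empty string, which may differ from what the library actually passes to your parser.

```python
def split_comment(line):
    """Split a '# key = value' comment into (key, value); sketch only."""
    text = line.lstrip("#").strip()
    # partition splits on the first "=" only; with no "=", value is "".
    key, _, value = text.partition("=")
    return key.strip(), value.strip()


tagset = split_comment("# tagset = TAG1|TAG2|TAG3|TAG4")
sentence = split_comment("# sentence-123")

# The fallback from the example keeps the key and renames the field.
fallback = lambda key, value: ("sentence-id", key)
renamed = fallback(*sentence)
print(tagset, sentence, renamed)
```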
Finally, consider an even trickier case.
>>> data = """
# id=1-document_id=36:1047-span=1
1	My	TAG1|TAG2
2	custom	TAG3
3	format	TAG4
"""
This is actually three different comments, but somehow they are separated by "-" instead of on their own lines. To handle this, we get to use the ability of a metadata_parser to return multiple matches from a single line.
>>> sentences = parse(data, metadata_parsers={
... "__fallback__": lambda key, value: [pair.split("=") for pair in (key + "=" + value).split("-")]
... })
>>> sentences[0].metadata
{
'id': '1',
'document_id': '36:1047',
'span': '1'
}
Our fallback parser returns a list of matches, one per pair of metadata comments we find. The key + "=" + value trick is needed since by default conllu assumes that this is a valid comment, so key is "id" and value is everything after the first "=", 1-document_id=36:1047-span=1 (note the missing "id=" in the beginning). We need to add it back before splitting on "-".
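The fallback lambda can be checked in isolation by calling it with the key and value described above. This sketch does not involve conllu at all; it only shows what the expression returns for that input and how the pairs map onto the metadata shown earlier.

```python
# The same fallback parser as in the example.
fallback = lambda key, value: [
    pair.split("=") for pair in (key + "=" + value).split("-")
]

# What the parser receives: key is "id", value is the rest of the line.
key, value = "id", "1-document_id=36:1047-span=1"

pairs = fallback(key, value)
metadata = dict(pairs)  # each two-item list becomes one key/value entry
print(metadata)
```

Note that this relies on "-" never appearing inside a key or value; a comment such as span=1-3 would be split in the wrong place.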
And that's it! Using these tricks you should be able to parse all the strange files you stumble into.
- Make a fork of the repository to your own GitHub account.
- Clone the repository locally on your computer:
  git clone git@github.com:YOURUSERNAME/conllu.git conllu
  cd conllu
- Install the library used for running the tests:
  pip install tox
- Now you can run the tests:
  tox
  This runs tox across all supported versions of Python, and also runs checks for code coverage, syntax errors, and how imports are sorted.
- (Alternative) If you just have one version of Python installed, and don't want to go through the hassle of installing multiple versions of Python (hint: install pyenv and pyenv-tox), it's fine to run tox with just one version of Python:
  tox -e py36
- Make a pull request. Here's a good guide on PRs from GitHub.
Thanks for helping conllu become a better library!