/udpos

Universal Dependencies datasets preprocess and autodownloads.

Primary LanguagePythonCreative Commons Zero v1.0 UniversalCC0-1.0

udpos

Universal Dependencies (UD) datasets converter and autodownloads sources.

About

This repository contains UD preprocessed datasets for Part of Speech (POS) tasks. The datasets are converted from .conllu to .txt format. They can be downloaded using the torch.datasets.sequence_tagging.SequenceTaggingDataset class, just change the urls to the dataset of your choice.

Example

The datasets follows the orginal template from the Universal Dependencies English Treebank, available here. Create your custom UDPOS dataset using the language of your choice. Replace the urls and saving directory dirname to extract the downloaded files. That's it !

from torchtext.datasets import SequenceTaggingDataset

class UDPOSFR(SequenceTaggingDataset):
    # Universal Dependencies French Web Treebank.
    # Download original at http://universaldependencies.org/
    # License: http://creativecommons.org/licenses/by-sa/4.0/
    urls = ['https://github.com/arthurdjn/udpos/raw/master/data/fr-gsd-ud-15032020.zip'] # change to the dataset of your choice
    dirname = 'fr-gsd-ud'  # don't forget to change me too !
    name = 'udpos'         # not obligatory to change here

    @classmethod
    def splits(cls, fields, root=".data", 
               train="fr_gsd-ud-dev.txt",
               validation="fr_gsd-ud-dev.txt",
               test="fr_gsd-ud-dev.txt", **kwargs):
        """Downloads and loads the Universal Dependencies Version 2 POS Tagged
        data.
        """

        return super(UDPOSFR, cls).splits(
            fields=fields, root=root, train=train, validation=validation,
            test=test, **kwargs)

Then,

from torchtext import data
import UDPOSFR

TEXT = data.Field(lower = True)
LEMMATIZED = data.Field(unk_token = None)
UD_TAGS = data.Field(unk_token = None)
fields = (("text", TEXT), ("lemmatized", LEMMATIZED), ("udtags", UD_TAGS))

# Load the UD french dataset
train_data, eval_data, test_data = UDPOSFR.splits(fields)

Convert your own dataset

If you want to use a dataset that is not saved under the data/ folder, you can still use the conllu2txt(filename) function, in utils/. It will preprocess the original dataset into a ready-to-use one with PyTorch.

Source

Original datasets available at Universal Dependencies repository and on their official website. All rights reserved to Universal Dependencies.