PyTorch-IE: State-of-the-art Information Extraction in PyTorch

Read the documentation at https://pytorch-ie.readthedocs.io/

🤯 What's this about?

This is an experimental framework that aims to combine the lessons learned from five years of information extraction research.

  • Focus on the core task: The main goal is to develop information extraction methods, not dataset loading and evaluation logic. We use external, well-maintained libraries for non-core functionality: PyTorch-Lightning for training and logging, Huggingface datasets for dataset reading, and Huggingface evaluate for evaluation (coming soon).
  • Sharing is caring: Being able to share models quickly and easily is key to promoting your work and facilitating further research. All models developed in PyTorch-IE can be easily shared via the Huggingface model hub. This also makes it easy to build demos based on Huggingface Spaces, Gradio, or Streamlit.
  • Unified document format: A unified document format allows for quick experimentation on any dataset or task.
  • Beyond sentence level: Most information extraction frameworks assume text inputs at a sentence granularity. We do not make any assumption on the granularity but generally aim for document-level information extraction.
  • Beyond unstructured text: Unstructured text is only one possible area for information extraction. We developed the framework to also support information extraction from semi-structured text (e.g. HTML), two-dimensional text (e.g. OCR'd images), and images.
  • Character-level annotation and evaluation: Many information extraction frameworks annotate and evaluate on a token level. We believe that annotation and evaluation should be done on a character level as this also considers the suitability of the tokenizer for the task.
  • Make no assumptions on the structure of models: The last few years have seen many different and creative approaches to information extraction, and a framework that imposes a structure on them will almost certainly be too limiting. With PyTorch-IE you have full control over how a document is prepared for a model and how the model is structured. The logic is self-contained and thus can be easily shared and inspected by others. The only assumption we make is that the input is a document and the output is either targets (training) or annotations (inference).

🔭 Demos

Task Link
Named Entity Recognition (Span-based) Hugging Face Spaces
Joint Named Entity Recognition and Relation Classification Hugging Face Spaces

🚀️ Quickstart

$ pip install pytorch-ie

For even faster prototyping with pre-defined but fully configurable training pipelines and more useful tooling, have a look at the PyTorch-IE-Hydra-Template.

🥧 Concepts & Architecture

PyTorch-IE builds on three core concepts: the 📃 Document, the 🔤 ⇔ 🔢 Taskmodule, and the 🧮 Model. In a nutshell, the Document says how your data is structured, the Model defines your trainable logic, and the Taskmodule converts from one end to the other. All three concepts are represented as abstract classes that should be used to derive use-case specific versions. In the following, they are explained in detail.

📃 Document

The Document class is a special dataclass that defines the document model. Derivations can contain several elements:

  • Data fields like strings to represent one or multiple texts or arrays for image data. These elements can be arbitrary python objects.
  • Annotation fields like labeled spans for entities or labeled tuples of spans for relations. These elements have to be of a certain container type AnnotationList that is dynamically typed with the actual annotation type, e.g. entities: AnnotationList[LabeledSpan]. Furthermore, annotation elements define one or multiple annotation targets. An annotation target is either a data element or another annotation container. Internally, targets are used to construct the annotation graph, i.e. data elements and annotation containers are the nodes and targets define the edges. The annotation graph defines the (de-)serialization order and what is accessible from within an annotation. To facilitate the setup of annotation containers, there is the annotation_field() method.
  • Other fields to save metadata, ids, etc. They are not constrained in any way, but cannot be accessed from within annotations.
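
To illustrate how targets induce an ordering, the following library-free sketch models fields as graph nodes and targets as edges, then derives an order in which every field comes after the fields it targets. The `targets` mapping and the use of Python's `graphlib` module are assumptions made for illustration; PyTorch-IE's internal implementation may look quite different.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical field layout: each annotation field maps to the fields it targets.
targets = {
    "entities": {"text"},       # spans over the text
    "relations": {"entities"},  # relations between entity spans
    "label": set(),             # document-level label without a target
}

# TopologicalSorter expects node -> predecessors, so targeted fields come first.
order = list(TopologicalSorter(targets).static_order())
print(order)
```

Any order produced this way places text before entities and entities before relations, which is the property a (de-)serialization routine needs so that targets exist before the annotations that reference them.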

Example Document Model

from dataclasses import dataclass
from typing import Optional
from pytorch_ie.core import Document, AnnotationList, annotation_field
from pytorch_ie.annotations import LabeledSpan, BinaryRelation, Label

@dataclass
class MyDocument(Document):
    # data fields (any field that is targeted by an annotation field)
    text: str
    # annotation fields
    entities: AnnotationList[LabeledSpan] = annotation_field(target="text")
    relations: AnnotationList[BinaryRelation] = annotation_field(target="entities")
    label: AnnotationList[Label] = annotation_field()
    # other fields
    doc_id: Optional[str] = None

Note that the label is a special annotation field that does not define a target because it belongs to the whole document. You can also have more complex constructs, like annotation fields that target multiple other fields by using annotation_field(targets) or annotation_field(named_targets). The latter is useful if you want to access the targets by name from within the annotation, see below for an example.

Annotations

There are several predefined annotation types in pytorch_ie.annotations, but feel free to define your own. Annotations have to be dataclasses that subclass pytorch_ie.core.Annotation. They also need to be hashable and immutable. The following is a simple example:

from dataclasses import dataclass

from pytorch_ie.core import Annotation

@dataclass(eq=True, frozen=True)
class SimpleLabeledSpan(Annotation):
    start: int
    end: int
    label: str
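
What eq=True and frozen=True provide can be checked with a plain dataclass. The sketch below deliberately does not subclass Annotation, so it runs without PyTorch-IE installed; PlainSpan is a hypothetical stand-in, not a library class.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(eq=True, frozen=True)
class PlainSpan:
    start: int
    end: int
    label: str


a = PlainSpan(0, 2, "ORG")
b = PlainSpan(0, 2, "ORG")

# eq=True together with frozen=True generates __hash__, so equal spans collapse in a set.
unique = {a, b}

# frozen=True turns attribute assignment into an error.
try:
    a.label = "PER"
    mutated = True
except FrozenInstanceError:
    mutated = False

print(len(unique), mutated)
```

Hashability allows annotations to be used in sets and as dictionary keys, and immutability guarantees that the hash stays valid after the annotation has been stored.
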

Accessing Target Content

We can expand the above example a little to have a nice string representation:

@dataclass(eq=True, frozen=True)
class LabeledSpan(Annotation):
    start: int
    end: int
    label: str

    def __str__(self) -> str:
        if self.targets is None:
            return ""
        return str(self.target[self.start : self.end])

The content of self.target is lazily assigned as soon as the annotation is added to a document.
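
The idea of assigning the target when an annotation is added can be sketched without the library. SketchSpan and SketchAnnotationList below are hypothetical stand-ins, and writing to the frozen instance via object.__setattr__ is an assumption chosen for this sketch, not a description of how PyTorch-IE does it.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=True, frozen=True)
class SketchSpan:
    start: int
    end: int
    label: str
    # Not part of __init__, equality, or the hash; filled in when attached.
    _target: Optional[str] = field(default=None, init=False, repr=False, compare=False)

    def __str__(self) -> str:
        if self._target is None:
            return ""
        return self._target[self.start : self.end]


class SketchAnnotationList(list):
    def __init__(self, target: str) -> None:
        super().__init__()
        self._target = target

    def append(self, annotation: SketchSpan) -> None:
        # Bypass the frozen check once, at attachment time.
        object.__setattr__(annotation, "_target", self._target)
        super().append(annotation)


text = "EU rejects German call to boycott British lamb ."
span = SketchSpan(start=11, end=17, label="MISC")
before = str(span)  # no target yet, so the string is empty

entities = SketchAnnotationList(target=text)
entities.append(span)
after = str(span)  # now resolves against the attached text
print(repr(before), repr(after))
```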

Note that this now expects a single collections.abc.Sequence as target, e.g.:

my_spans: AnnotationList[Span] = annotation_field(target="<NAME_OF_THE_SEQUENCE_FIELD>")

If we have multiple targets, we need to define target names to access them. For this, we need to set the special field TARGET_NAMES:

@dataclass(eq=True, frozen=True)
class Alignment(Annotation):
    TARGET_NAMES = ("text1", "text2")
    start1: int
    end1: int
    start2: int
    end2: int

    def __str__(self) -> str:
        if self.targets is None:
            return ""
        # we can access the `named_targets` which has the keys defined in `TARGET_NAMES`
        span1 = self.named_targets["text1"][self.start1 : self.end1]
        span2 = self.named_targets["text2"][self.start2 : self.end2]
        return f'span1="{span1}" is aligned with span2="{span2}"'

This requires defining the annotation container as follows:

class MyDocumentWithAlignment(Document):
    text_a: str
    text_b: str
    # `named_targets` defines the mapping from `TARGET_NAMES` to data fields
    my_alignments: AnnotationList[Alignment] = annotation_field(named_targets={"text1": "text_a", "text2": "text_b"})

Note that text1 and text2 can also target the same field.
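
Conceptually, named_targets is a lookup from the names in TARGET_NAMES to fields of the document. The following library-free sketch resolves such a mapping by hand; ToyDocument and the getattr-based resolution are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class ToyDocument:
    text_a: str
    text_b: str


# Same shape as the `named_targets` argument shown above.
named_targets = {"text1": "text_a", "text2": "text_b"}

doc = ToyDocument(text_a="Hello world", text_b="Hallo Welt")

# Resolve each target name to the content of the referenced document field.
resolved = {name: getattr(doc, field_name) for name, field_name in named_targets.items()}

span1 = resolved["text1"][0:5]
span2 = resolved["text2"][0:5]
print(f'span1="{span1}" is aligned with span2="{span2}"')
```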

(De-)Serialization of Annotations

As usual for dataclasses, annotations can be converted to JSON-like objects with .asdict(). They can also be created from such objects with MyAnnotation.fromdict(dct, annotation_store). Both methods are required because documents and their annotations are created on the fly when working with PIE datasets (see below).
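
The round trip can be illustrated with a plain dataclass and the standard library. This sketch uses dataclasses.asdict and the constructor in place of the library's asdict() and fromdict(), and it ignores the annotation_store argument, for which a plain dataclass has no counterpart; PlainSpan is a hypothetical stand-in.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(eq=True, frozen=True)
class PlainSpan:
    start: int
    end: int
    label: str


span = PlainSpan(start=11, end=17, label="MISC")

# Annotation -> JSON-like dict -> JSON string.
dct = asdict(span)
payload = json.dumps(dct)

# JSON string -> dict -> annotation.
restored = PlainSpan(**json.loads(payload))
print(dct, restored == span)
```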

🔤 ⇔ 🔢 Taskmodule

The taskmodule is responsible for converting documents to model inputs and back. For that purpose, it requires the user to implement the following methods:

  • encode_input: Taking one document, create one or multiple TaskEncodings. A TaskEncoding represents an example that will be passed to the model later on. It is a container holding inputs, optional targets, the original document, and metadata. Note that encode_input should not assign a value to targets.
  • encode_target: This gets a single TaskEncoding and should produce a target encoding that will be assigned to targets later on. As such, it is called only during training / evaluation, but not for inference. Note that this is allowed to return None. In this case, the respective TaskEncoding will not be passed to the model at all.
  • collate: Taking a batch of TaskEncodings, this should produce a batch input for the model. Note that this has to work with available targets (training and evaluation) and without them (inference).
  • unbatch_output: This gets a batch output from the model and should rearrange that into a sequence of TaskOutputs. In that sense, it can be understood as the inverse of collate. The number of TaskOutputs should match the number of TaskEncodings that went into the batch because we align them later on for easy creation of new annotations.
  • create_annotations_from_output: This gets a single TaskEncoding with its corresponding TaskOutput and should yield tuples each consisting of an annotation field name and an annotation. The annotations will be added as predictions to the annotation field with the respective name.
  • prepare (OPTIONAL): This will get the train dataset, i.e. a Sequence or Iterable of Documents, and can be used to calculate additional parameters like the list of all available labels, etc.
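
The data flow through these methods can be sketched with a toy, library-free text classifier. Everything below (ToyDocument, ToyTaskEncoding, ToyTaskModule, toy_model, and the token-length features) is hypothetical and only mirrors the method names listed above; the actual base class has a richer interface.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ToyDocument:
    text: str
    label: Optional[str] = None  # gold label, only known for training data
    predictions: List[Tuple[str, str]] = field(default_factory=list)


@dataclass
class ToyTaskEncoding:
    document: ToyDocument
    inputs: List[int]
    targets: Optional[int] = None


class ToyTaskModule:
    def prepare(self, documents: List[ToyDocument]) -> None:
        # Collect the label inventory from the training documents.
        labels = sorted({d.label for d in documents if d.label is not None})
        self.label_to_id: Dict[str, int] = {l: i for i, l in enumerate(labels)}
        self.id_to_label = {i: l for l, i in self.label_to_id.items()}

    def encode_input(self, document: ToyDocument) -> ToyTaskEncoding:
        # Toy features: the length of each whitespace-separated token.
        return ToyTaskEncoding(document, [len(t) for t in document.text.split()])

    def encode_target(self, task_encoding: ToyTaskEncoding) -> Optional[int]:
        return self.label_to_id.get(task_encoding.document.label)

    def collate(self, task_encodings: List[ToyTaskEncoding]):
        # Pad inputs to equal length; include targets only if they are available.
        width = max(len(e.inputs) for e in task_encodings)
        inputs = [e.inputs + [0] * (width - len(e.inputs)) for e in task_encodings]
        has_targets = all(e.targets is not None for e in task_encodings)
        targets = [e.targets for e in task_encodings] if has_targets else None
        return inputs, targets

    def unbatch_output(self, batch_output: List[int]) -> List[Dict[str, str]]:
        return [{"label": self.id_to_label[class_id]} for class_id in batch_output]

    def create_annotations_from_output(self, task_encoding, task_output):
        yield "label", task_output["label"]


def toy_model(batch_inputs: List[List[int]]) -> List[int]:
    # Stand-in for a trained model: many characters -> class 0 ("long").
    return [0 if sum(row) > 15 else 1 for row in batch_inputs]


taskmodule = ToyTaskModule()
train_docs = [
    ToyDocument("short text", "short"),
    ToyDocument("a considerably longer piece of text", "long"),
]
taskmodule.prepare(train_docs)

# Training-time encoding: inputs plus targets.
train_encodings = [taskmodule.encode_input(d) for d in train_docs]
for enc in train_encodings:
    enc.targets = taskmodule.encode_target(enc)
train_batch = taskmodule.collate(train_encodings)

# Inference: no targets; predictions are written back to the documents.
new_docs = [ToyDocument("tiny"), ToyDocument("an extraordinarily verbose sentence")]
encodings = [taskmodule.encode_input(d) for d in new_docs]
batch_inputs, batch_targets = taskmodule.collate(encodings)
outputs = taskmodule.unbatch_output(toy_model(batch_inputs))
for enc, out in zip(encodings, outputs):
    for field_name, annotation in taskmodule.create_annotations_from_output(enc, out):
        enc.document.predictions.append((field_name, annotation))
```

Note how collate is called once with targets and once without, and how unbatch_output returns exactly one output per encoding so that the two sequences can be zipped.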

You can find some predefined taskmodules for text and token classification, text-classification-based relation extraction, joint entity and relation classification, and other use cases in the package pytorch_ie.taskmodules. In particular, have a look at the SimpleTransformerTextClassificationTaskModule, which is well documented and should provide a good starting point for implementing your own.

🧮 Model

PyTorch-IE models are meant to do the heavy lifting for training and inference. They are PyTorch-Lightning modules, enhanced with some functionality to ease persisting them; see Reusability and Sharing.

You can find some predefined models for transformer based text- and token classification, sequence generation, and other use cases in the package pytorch_ie.models.

Reusability and Sharing

Taskmodules and Models provide some functionality to ease reusability and reproducibility. In particular, they provide the methods save_pretrained() and from_pretrained(), which can be used to save their specification (i.e. their config) and available model weights to disk and to re-create them exactly from that data.

Huggingface Hub and Extended Configs

These methods come with integration into the Huggingface Hub. By passing push_to_hub=True to save_pretrained(), the taskmodule / model is directly pushed to the Hub and can be loaded again with the respective identifier (see the Examples for how to do so). However, to work properly, each taskmodule / model has to correctly implement the _config() getter method. By default, it returns all parameters passed to the __init__ method, provided that __init__ calls save_hyperparameters(), which is highly recommended. However, you may have created further parameters that should be persisted, for instance a label-to-id mapping. In this case, _config() should be overridden to take this into account:

def _config(self) -> Dict[str, Any]:
    # add the label-to-id mapping to the config
    config = super()._config()
    config["label_to_id"] = self.label_to_id
    return config

Furthermore, you can use the property is_from_pretrained to know if the taskmodule / model was just loaded or created from scratch. This may be useful, for instance, to avoid downloading a model from Huggingface Transformers when you in fact want to load your own trained model from disk via from_pretrained:

from transformers import AutoConfig, AutoModel

hf_config = AutoConfig.from_pretrained(model_name_or_path)
# If this is already trained, just create an empty transformer model. The weights are loaded afterwards
# via the pytorch_ie.Model.from_pretrained() logic.
if self.is_from_pretrained:
    self.model = AutoModel.from_config(config=hf_config)
# Otherwise, download the whole model from the Huggingface Hub.
else:
    self.model = AutoModel.from_pretrained(model_name_or_path, config=hf_config)

In short, each taskmodule / model implementation should:

  • call save_hyperparameters() in __init__ to save all constructor arguments,
  • pass remaining __init__ kwargs (keyword arguments) to its super to not break some other helpful functionality (e.g. is_from_pretrained), and
  • override _config() if additional parameters are calculated, e.g. from the dataset.
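
The three rules above can be seen working together in a library-free sketch. ToyBase, ToyTaskModule, the hparams dictionary (a stand-in for save_hyperparameters()), and the file name toy_config.json are all hypothetical; the actual on-disk layout used by PyTorch-IE is not shown here.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Optional


class ToyBase:
    CONFIG_FILE = "toy_config.json"  # hypothetical file name

    def __init__(self, is_from_pretrained: bool = False) -> None:
        self.is_from_pretrained = is_from_pretrained
        self.hparams: Dict[str, Any] = {}

    def _config(self) -> Dict[str, Any]:
        return dict(self.hparams)

    def save_pretrained(self, directory: str) -> None:
        path = Path(directory)
        path.mkdir(parents=True, exist_ok=True)
        (path / self.CONFIG_FILE).write_text(json.dumps(self._config()))

    @classmethod
    def from_pretrained(cls, directory: str) -> "ToyBase":
        config = json.loads((Path(directory) / cls.CONFIG_FILE).read_text())
        return cls(is_from_pretrained=True, **config)


class ToyTaskModule(ToyBase):
    def __init__(
        self,
        max_length: int = 128,
        label_to_id: Optional[Dict[str, int]] = None,
        **kwargs: Any,
    ) -> None:
        super().__init__(**kwargs)  # forwards is_from_pretrained to the base class
        self.hparams = {"max_length": max_length}  # stand-in for save_hyperparameters()
        self.label_to_id = label_to_id or {}

    def prepare(self, labels: List[str]) -> None:
        # Parameter that is calculated from data, not passed to __init__.
        self.label_to_id = {label: i for i, label in enumerate(sorted(labels))}

    def _config(self) -> Dict[str, Any]:
        config = super()._config()
        config["label_to_id"] = self.label_to_id
        return config


original = ToyTaskModule(max_length=64)
original.prepare(["PER", "ORG"])

with tempfile.TemporaryDirectory() as tmp_dir:
    original.save_pretrained(tmp_dir)
    restored = ToyTaskModule.from_pretrained(tmp_dir)

print(restored.label_to_id, restored.is_from_pretrained)
```

Without the _config() override, the label mapping computed in prepare() would be lost on reload, and without forwarding **kwargs the restored instance could not be marked as pretrained.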

⚡️ Examples: Prediction

The following examples work out of the box. No further setup like manually downloading a model is needed!

Note: Setting num_workers=0 in the pipeline is only necessary when running an example in an interactive python session. The reason is that multiprocessing doesn't play well with the interactive python interpreter, see here for details.

Span-classification-based Named Entity Recognition

from dataclasses import dataclass

from pytorch_ie.annotations import LabeledSpan
from pytorch_ie.auto import AutoPipeline
from pytorch_ie.core import AnnotationList, annotation_field
from pytorch_ie.documents import TextDocument

@dataclass
class ExampleDocument(TextDocument):
    entities: AnnotationList[LabeledSpan] = annotation_field(target="text")

document = ExampleDocument(
    "“Making a super tasty alt-chicken wing is only half of it,” said Po Bronson, general partner at SOSV and managing director of IndieBio."
)

# see below for the long version
ner_pipeline = AutoPipeline.from_pretrained("pie/example-ner-spanclf-conll03", device=-1, num_workers=0)

ner_pipeline(document)

for entity in document.entities.predictions:
    print(f"{entity} -> {entity.label}")

# Result:
# IndieBio -> ORG
# Po Bronson -> PER
# SOSV -> ORG

To create the same pipeline as above without AutoPipeline:

from pytorch_ie.auto import AutoTaskModule, AutoModel
from pytorch_ie.pipeline import Pipeline

model_name_or_path = "pie/example-ner-spanclf-conll03"
ner_taskmodule = AutoTaskModule.from_pretrained(model_name_or_path)
ner_model = AutoModel.from_pretrained(model_name_or_path)
ner_pipeline = Pipeline(model=ner_model, taskmodule=ner_taskmodule, device=-1, num_workers=0)

Or, without Auto classes at all:

from pytorch_ie.pipeline import Pipeline
from pytorch_ie.models import TransformerSpanClassificationModel
from pytorch_ie.taskmodules import TransformerSpanClassificationTaskModule

model_name_or_path = "pie/example-ner-spanclf-conll03"
ner_taskmodule = TransformerSpanClassificationTaskModule.from_pretrained(model_name_or_path)
ner_model = TransformerSpanClassificationModel.from_pretrained(model_name_or_path)
ner_pipeline = Pipeline(model=ner_model, taskmodule=ner_taskmodule, device=-1, num_workers=0)

Text-classification-based Relation Extraction

from dataclasses import dataclass

from pytorch_ie.annotations import BinaryRelation, LabeledSpan
from pytorch_ie.auto import AutoPipeline
from pytorch_ie.core import AnnotationList, annotation_field
from pytorch_ie.documents import TextDocument


@dataclass
class ExampleDocument(TextDocument):
    entities: AnnotationList[LabeledSpan] = annotation_field(target="text")
    relations: AnnotationList[BinaryRelation] = annotation_field(target="entities")

document = ExampleDocument(
    "“Making a super tasty alt-chicken wing is only half of it,” said Po Bronson, general partner at SOSV and managing director of IndieBio."
)

re_pipeline = AutoPipeline.from_pretrained("pie/example-re-textclf-tacred", device=-1, num_workers=0)

for start, end, label in [(65, 75, "PER"), (96, 100, "ORG"), (126, 134, "ORG")]:
    document.entities.append(LabeledSpan(start=start, end=end, label=label))

re_pipeline(document, batch_size=2)

for relation in document.relations.predictions:
    print(f"({relation.head} -> {relation.tail}) -> {relation.label}")

# Result:
# (Po Bronson -> SOSV) -> per:employee_of
# (Po Bronson -> IndieBio) -> per:employee_of
# (SOSV -> Po Bronson) -> org:top_members/employees
# (IndieBio -> Po Bronson) -> org:top_members/employees
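
The hard-coded character offsets in the example above are easy to get wrong by hand. A small sketch using only Python's built-in string methods shows how they can be derived from the entity surface forms; it assumes each surface form occurs exactly once in the text.

```python
text = (
    "“Making a super tasty alt-chicken wing is only half of it,” said Po Bronson, "
    "general partner at SOSV and managing director of IndieBio."
)

spans = []
for surface, label in [("Po Bronson", "PER"), ("SOSV", "ORG"), ("IndieBio", "ORG")]:
    start = text.index(surface)  # raises ValueError if the surface form is missing
    end = start + len(surface)   # end offset is exclusive, as in Python slicing
    spans.append((start, end, label))

print(spans)
# [(65, 75, 'PER'), (96, 100, 'ORG'), (126, 134, 'ORG')]
```

The resulting tuples match the offsets passed to LabeledSpan above, with each curly quotation mark counting as a single character.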

⚡️ Examples: Training

Span-classification-based Named Entity Recognition

import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from torch.utils.data import DataLoader

import datasets
from pytorch_ie.models.transformer_span_classification import TransformerSpanClassificationModel
from pytorch_ie.taskmodules.transformer_span_classification import (
    TransformerSpanClassificationTaskModule,
)

pl.seed_everything(42)

model_output_path = "./model_output/"
model_name = "bert-base-cased"
num_epochs = 10
batch_size = 32

# Get the PIE dataset consisting of PIE Documents that will be used for training (and evaluation).
dataset = datasets.load_dataset(
    path="pie/conll2003",
)
train_docs, val_docs = dataset["train"], dataset["validation"]

print("train docs: ", len(train_docs))
print("val docs: ", len(val_docs))

# Create a PIE taskmodule.
task_module = TransformerSpanClassificationTaskModule(
    tokenizer_name_or_path=model_name,
    max_length=128,
)

# Prepare the taskmodule with the training data. This may collect available labels etc.
# The result of this should affect the state of the taskmodule config which will be
# persisted (and can be loaded) later on.
task_module.prepare(train_docs)

# Persist the taskmodule. This writes the taskmodule config as a json file into the
# model_output_path directory. The config contains all constructor parameters to
# re-create the taskmodule at this state (via AutoTaskModule.from_pretrained(model_output_path)).
task_module.save_pretrained(model_output_path)

# Use the taskmodule to encode the train and dev sets. This may use the text and
# available annotations of the documents.
train_dataset = task_module.encode(train_docs, encode_target=True, as_dataset=True)
val_dataset = task_module.encode(val_docs, encode_target=True, as_dataset=True)

# Create the dataloaders. Note that the taskmodule provides the collate function!
train_dataloader = DataLoader(
    train_dataset,
    batch_size=batch_size,
    shuffle=True,
    collate_fn=task_module.collate,
)

val_dataloader = DataLoader(
    val_dataset,
    batch_size=batch_size,
    shuffle=False,
    collate_fn=task_module.collate,
)

# Create the PIE model. Note that we use the number of entries in the previously
# collected label_to_id mapping to set the number of classes to predict.
model = TransformerSpanClassificationModel(
    model_name_or_path=model_name,
    num_classes=len(task_module.label_to_id),
    t_total=len(train_dataloader) * num_epochs,
    learning_rate=1e-4,
)

# Optionally, set up a model checkpoint callback. See here for further information:
# https://pytorch-lightning.readthedocs.io/en/stable/api/pytorch_lightning.callbacks.ModelCheckpoint.html
# checkpoint_callback = ModelCheckpoint(
#     monitor="val/f1",
#     dirpath=model_output_path,
#     filename="zs-ner-{epoch:02d}-val_f1-{val/f1:.2f}",
#     save_top_k=1,
#     mode="max",
#     auto_insert_metric_name=False,
#     save_weights_only=True,
# )

# Create the pytorch-lightning trainer. See here for further information:
# https://pytorch-lightning.readthedocs.io/en/latest/api/pytorch_lightning.trainer.trainer.Trainer.html
trainer = pl.Trainer(
    fast_dev_run=False,
    max_epochs=num_epochs,
    gpus=0,
    enable_checkpointing=False,
    # callbacks=[checkpoint_callback],
    precision=32,
)
# Start the training.
trainer.fit(model, train_dataloader, val_dataloader)

# Persist the trained model. This will save the model weights and the model config that allows
# to re-create the model at this state (via AutoModel.from_pretrained(model_output_path)).
# model.save_pretrained(model_output_path)

📚 Datasets

We parse all datasets into a common format that can be loaded directly from the model hub via Huggingface datasets. The documents are cached in an arrow table and serialized / deserialized on the fly. Any changes or preprocessing applied to the documents will be cached as well.

import datasets

dataset = datasets.load_dataset("pie/conll2003")

print(dataset["train"][0])
# >>> CoNLL2003Document(text='EU rejects German call to boycott British lamb .', id='0', metadata={})

dataset["train"][0].entities
# >>> AnnotationList([LabeledSpan(start=0, end=2, label='ORG', score=1.0), LabeledSpan(start=11, end=17, label='MISC', score=1.0), LabeledSpan(start=34, end=41, label='MISC', score=1.0)])

entity = dataset["train"][0].entities[1]

print(f"[{entity.start}, {entity.end}] {entity}")
# >>> [11, 17] German

How to create your own PyTorch-IE dataset

PyTorch-IE datasets are built on top of Huggingface datasets. For instance, consider the conll2003 dataset from the Huggingface Hub and, in particular, its dataset loading script. To create a PyTorch-IE dataset from that, you have to implement:

  1. A Document class. This will be the type of individual dataset examples.

@dataclass
class CoNLL2003Document(TextDocument):
    entities: AnnotationList[LabeledSpan] = annotation_field(target="text")

Here we derive from TextDocument, which has a simple text string as the base annotation target. The CoNLL2003Document adds a single annotation list called entities that consists of LabeledSpans which reference the text field of the document. You can add further annotation types by adding AnnotationList fields that may also reference (i.e. target) other annotations as you like. See pytorch_ie.annotations for predefined annotation types.

  2. A dataset config. This is similar to creating a Huggingface dataset config.

class CoNLL2003Config(datasets.BuilderConfig):
    """BuilderConfig for CoNLL2003"""

    def __init__(self, **kwargs):
        """BuilderConfig for CoNLL2003.
        Args:
          **kwargs: keyword arguments forwarded to super.
        """
        super().__init__(**kwargs)

  3. A dataset builder class. This should inherit from pytorch_ie.data.builder.GeneratorBasedBuilder which is a wrapper around the Huggingface dataset builder class with some utility functionality to work with PyTorch-IE Documents. The key elements to implement are: DOCUMENT_TYPE, BASE_DATASET_PATH, and _generate_document.

class Conll2003(pytorch_ie.data.builder.GeneratorBasedBuilder):
    # Specify the document type. This will be the class of individual dataset examples.
    DOCUMENT_TYPE = CoNLL2003Document

    # The Huggingface identifier that points to the base dataset. This may be any string that works
    # as path with Huggingface `datasets.load_dataset`.
    BASE_DATASET_PATH = "conll2003"

    # The builder configs, see https://huggingface.co/docs/datasets/dataset_script for further information.
    BUILDER_CONFIGS = [
        CoNLL2003Config(
            name="conll2003", version=datasets.Version("1.0.0"), description="CoNLL2003 dataset"
        ),
    ]

    # [Optional] Define additional keyword arguments which will be passed to `_generate_document` below.
    def _generate_document_kwargs(self, dataset):
        return {"int_to_str": dataset.features["ner_tags"].feature.int2str}

    # Define how a PyTorch-IE Document will be created from a Huggingface dataset example.
    def _generate_document(self, example, int_to_str):
        doc_id = example["id"]
        tokens = example["tokens"]
        ner_tags = [int_to_str(tag) for tag in example["ner_tags"]]

        text, ner_spans = tokens_and_tags_to_text_and_labeled_spans(tokens=tokens, tags=ner_tags)

        document = CoNLL2003Document(text=text, id=doc_id)

        for span in sorted(ner_spans, key=lambda span: span.start):
            document.entities.append(span)

        return document

The full script can be found here: dataset_builders/conll2003/conll2003.py. Note that to load the dataset with datasets.load_dataset, the script has to be located in a directory with the same name (as is the case for standard Huggingface dataset loading scripts).
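
The builder relies on a helper that turns tokens and tags into a text with character-level spans. The following is a simplified, library-free sketch of that idea, not the helper's actual implementation: it assumes IOB2 tags, joins tokens with single spaces, and returns plain (start, end, label) tuples; the function name is hypothetical.

```python
from typing import List, Optional, Tuple


def iob2_to_char_spans(tokens: List[str], tags: List[str]) -> Tuple[str, List[Tuple[int, int, str]]]:
    spans: List[Tuple[int, int, str]] = []
    current: Optional[List] = None  # [start, end, label] of the span being built
    offset = 0
    for token, tag in zip(tokens, tags):
        start, end = offset, offset + len(token)
        offset = end + 1  # account for the joining space
        label = tag[2:] if tag != "O" else None
        continues = tag.startswith("I-") and current is not None and current[2] == label
        if continues:
            current[1] = end  # extend the open span to cover this token
            continue
        if current is not None:
            spans.append((current[0], current[1], current[2]))
            current = None
        if label is not None:
            current = [start, end, label]  # open a new span (B- tag or stray I- tag)
    if current is not None:
        spans.append((current[0], current[1], current[2]))
    return " ".join(tokens), spans


tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
tags = ["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]
text, spans = iob2_to_char_spans(tokens, tags)
print(text)
print(spans)
# EU rejects German call to boycott British lamb .
# [(0, 2, 'ORG'), (11, 17, 'MISC'), (34, 41, 'MISC')]
```

The offsets agree with the LabeledSpan entries shown for the first pie/conll2003 training document above.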

✨📚✨ Read the full documentation

🔧 Project Development

Setup

This project is built with Poetry. It is recommended to install Poetry via pipx.

  1. To install pipx, execute the following (taken from pipx installation instructions):

    # [IF PIP IS NOT AVAILABLE] install pip
    python -m ensurepip --upgrade
    # [OPTIONAL] update pip
    python -m pip install --upgrade pip
    
    # install pipx (requires pip 19.0 or later)
    python3 -m pip install --user pipx
    python3 -m pipx ensurepath

    NOTE: This installs pipx globally!

  2. Install Poetry via pipx (or see Poetry installation guide):

    pipx install poetry

    NOTE: This installs Poetry globally!

  3. [IF REQUIRED PYTHON VERSION IS NOT AVAILABLE] install required python:

    # for instance, to install python3.9 on Ubuntu:
    sudo add-apt-repository ppa:deadsnakes/ppa
    sudo apt update
    sudo apt install python3.9
    
    # or via conda:
    conda create -n python3.9 python=3.9 -y
    conda activate python3.9

  4. Finally, install the dependencies (including for development) for PyTorch-IE:

    poetry install --with dev

    NOTE: If the installation gets stuck, try disabling the experimental parallel installer (source): poetry config experimental.new-installer false

Testing and code quality checks

We use Nox to execute any tests and code quality tooling in a reproducible way.

To get a list of available toolchains, call:

poetry run nox -l

To run a specific command from that list, call:

poetry run nox -s <command>

Note: To run the nox commands in the same, reproducible setup that is specified by the lock file, we call them via poetry run <nox-command>.

For instance, to run static type checking with mypy, call:

poetry run nox -s mypy-3.9

To run all commands that also run on GitHub CI, call:

poetry run nox

To run more tests (also tests marked with @pytest.mark.slow, but without tests for all datasets which would take forever), call:

poetry run nox -s tests_no_local_datasets-3.9

Updating Dependencies

Call this to update individual packages:

poetry update <package>

Then, commit the modified lock file to persist the state.

Releasing

Since this project is based on the Cookiecutter template, we can follow their release steps:

  1. Create the release branch: git switch --create release main
  2. Increase the version: poetry version <PATCH|MINOR|MAJOR>, e.g. poetry version patch for a patch release
  3. Commit the changes: git commit --message="release <NEW VERSION>" pyproject.toml, e.g. git commit --message="release 0.13.0" pyproject.toml
  4. Push the changes to GitHub: git push origin release
  5. Create a PR for that release branch on GitHub.
  6. Wait until all checks have passed.
  7. Integrate the PR into the main branch (use rebase to have a linear history). This triggers the GitHub Action that creates all relevant release artefacts and also uploads them to PyPI.
  8. Cleanup: Delete the release branch.

🏅 Acknowledgements

📃 Citation

If you find the framework useful please consider citing it:

@misc{alt2022pytorchie,
    author = {Christoph Alt and Arne Binder},
    title = {PyTorch-IE: State-of-the-art Information Extraction in PyTorch},
    year = {2022},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/ChristophAlt/pytorch-ie}}
}