Token-free Language Modeling with ByGPT5 & Friends!

Primary language: Python. License: Apache License 2.0 (Apache-2.0).

Uniformers

Uniformers (short for Universal Coded Character Set Transformers) is a library for token-free language modeling. In particular, it contains the reference implementation of the paper "ByGPT5: End-to-End Style-conditioned Poetry Generation with Token-free Language Models". ByGPT5 is a token-free decoder-only transformer that excels at character-level tasks such as style-conditioned poetry generation.

  • 📜 Read our paper on ByGPT5 for details.
  • 🪶 An interactive demo for poetry generation is available.
  • 💡 If you make use of this library in your work, please cite it.
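
To make "token-free" concrete, the sketch below maps text to integer ids one UTF-8 byte at a time instead of looking up subwords in a learned vocabulary. It is a plain-Python illustration, not the library's tokenizer: the offset of three reserved ids follows the ByT5 convention and is an assumption here, and ByGPT5Tokenizer may differ in its details.

```python
# Illustrative byte-level encoding; NOT the actual ByGPT5Tokenizer.
# Assumption: ByT5-style layout where ids 0-2 are reserved for special
# tokens, so a byte value b is mapped to the id b + 3.
SPECIAL_OFFSET = 3


def encode(text: str) -> list[int]:
    """Map each UTF-8 byte of the text to one integer id."""
    return [byte + SPECIAL_OFFSET for byte in text.encode("utf-8")]


def decode(ids: list[int]) -> str:
    """Invert encode(), skipping ids that do not correspond to a byte."""
    raw = bytes(
        i - SPECIAL_OFFSET
        for i in ids
        if SPECIAL_OFFSET <= i < SPECIAL_OFFSET + 256
    )
    return raw.decode("utf-8", errors="ignore")


ids = encode("Rosé")
print(ids)          # one id per byte; "é" occupies two bytes
print(decode(ids))  # round-trips back to the original string
```

Because every byte is a valid input, no text is ever out of vocabulary, at the cost of longer sequences than with subword tokenization.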

Installation

If you want to use this project as a library, you can install it as a regular package using pip:

pip install 'git+https://github.com/potamides/uniformers.git#egg=uniformers'

If your goal is to run the included examples (e.g., to reproduce results), clone the repository and install it in editable mode:

git clone https://github.com/potamides/uniformers
pip install -e 'uniformers[examples]'
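
After either installation route, a quick way to confirm that the package is visible to the current Python environment is to look it up without importing it. The helper name `is_installed` is made up for this sketch; it uses only the standard library.

```python
from importlib.util import find_spec


def is_installed(package: str) -> bool:
    """Return True if a top-level package can be found on the import path."""
    return find_spec(package) is not None


# Prints True once the installation succeeded in this environment.
print("uniformers installed:", is_installed("uniformers"))
```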

Usage

Uniformers builds upon the transformers library and can be used very similarly.

from torch import device
from transformers.pipelines.text_generation import TextGenerationPipeline

from uniformers.models.bygpt5 import ByGPT5LMHeadModel, ByGPT5Tokenizer

prompt = "In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English."

pipeline = TextGenerationPipeline(
    model=ByGPT5LMHeadModel.from_pretrained("nllg/bygpt5-medium-en"),
    tokenizer=ByGPT5Tokenizer.from_pretrained("nllg/bygpt5-medium-en"),
    # Runs on the first GPU; pass device("cpu") if no CUDA device is available.
    device=device("cuda:0"),
)

completion = pipeline(
    prompt,
    max_length=1024,
    do_sample=True,
    top_k=40,
    temperature=1.0,
    top_p=0.9,
)

print(completion[0]["generated_text"])
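
The sampling arguments passed to the pipeline (temperature, top_k, top_p) control which candidates remain eligible at each generation step. The following is a simplified, pure-Python sketch of that idea on a toy distribution. It is not the code that transformers runs internally, and the function name `filter_distribution` is hypothetical.

```python
import math


def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {index: probability} after temperature, top-k and top-p filtering."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = {i: w / total for i, w in enumerate(weights)}

    # Rank candidates from most to least likely.
    ranked = sorted(probs, key=probs.get, reverse=True)

    # Top-k: keep only the k most likely candidates (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]
        total = sum(probs[i] for i in ranked)
        probs = {i: probs[i] / total for i in ranked}

    # Top-p (nucleus): keep the smallest prefix whose mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to one.
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


toy_logits = [math.log(p) for p in (0.6, 0.3, 0.05, 0.05)]
# With top_k=3 and top_p=0.9 only candidates 0 and 1 survive.
print(filter_distribution(toy_logits, temperature=1.0, top_k=3, top_p=0.9))
```

Lower temperatures sharpen the distribution before filtering, so fewer candidates are needed to reach the top_p mass.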

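Poetry can be generated in much the same way: load a poetry checkpoint and prompt it with style tokens for rhyme scheme, meter, and alliteration level. For more involved usage, take a look at the provided examples.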

from torch import device
from transformers.pipelines.text_generation import TextGenerationPipeline

from uniformers.models.bygpt5 import ByGPT5LMHeadModel, ByGPT5Tokenizer
from uniformers.utils import Poetry2Tokens

model = ByGPT5LMHeadModel.from_pretrained("nllg/poetry-bygpt5-base-en")
tokenizer = ByGPT5Tokenizer.from_pretrained("nllg/poetry-bygpt5-base-en")
p2t = Poetry2Tokens(tokenizer)

pipeline = TextGenerationPipeline(
    model=model,
    tokenizer=tokenizer,
    device=device("cuda:0"),
)

styles = (
    tokenizer.bos_token
    + p2t.rhymes2tokens["ABAB"]
    + p2t.meters2tokens["iambus"]
    + p2t.alliterations2tokens["medium"]
)

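quatrain = pipeline(
    styles,
    return_full_text=False,  # return only the generated text, without the prompt
    # Keep the model from generating the tokenizer's additional special tokens.
    bad_words_ids=[[id_] for id_ in tokenizer.additional_special_tokens_ids],
    do_sample=True,
    max_length=384,
    top_k=0,  # 0 disables top-k filtering
    temperature=0.7,
    top_p=0.9,
)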

print(quatrain[0]["generated_text"])
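
The style prompt in the example above is the concatenation of the BOS token and one token each for rhyme scheme, meter, and alliteration level. A small wrapper can make that step reusable and report unknown style names clearly. The function `build_style_prompt` is a hypothetical helper, not part of uniformers, and the token strings in the demo mappings are made-up placeholders rather than the real special tokens.

```python
def build_style_prompt(
    bos_token,
    rhymes2tokens,
    meters2tokens,
    alliterations2tokens,
    *,
    rhyme,
    meter,
    alliteration,
):
    """Concatenate the BOS token with one style token per category."""
    parts = [bos_token]
    lookups = (
        ("rhyme", rhymes2tokens, rhyme),
        ("meter", meters2tokens, meter),
        ("alliteration", alliterations2tokens, alliteration),
    )
    for kind, table, key in lookups:
        if key not in table:
            raise ValueError(
                f"unknown {kind} {key!r}; available: {sorted(table)}"
            )
        parts.append(table[key])
    return "".join(parts)


# Placeholder mappings for demonstration only (not the real tokens).
demo_rhymes = {"ABAB": "<rhyme_ABAB>", "AABB": "<rhyme_AABB>"}
demo_meters = {"iambus": "<meter_iambus>"}
demo_alliterations = {"medium": "<allit_medium>"}

print(
    build_style_prompt(
        "<s>",
        demo_rhymes,
        demo_meters,
        demo_alliterations,
        rhyme="ABAB",
        meter="iambus",
        alliteration="medium",
    )
)
```

With the objects from the example above, the call would take `tokenizer.bos_token`, `p2t.rhymes2tokens`, `p2t.meters2tokens`, and `p2t.alliterations2tokens` as the first four arguments.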

Released Model Checkpoints

We have released the following checkpoints for pre-trained ByGPT5 language models on the Hugging Face Model Hub:

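| ByGPT5 | Parameters | Language Modeling | Poetry Generation |
| ------ | ---------- | ----------------- | ----------------- |
| Small  | 73.5M      | English, German   | English, German   |
| Base   | 139.2M     | English, German   | English, German   |
| Medium | 289.1M     | English, German   | English, German   |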
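
The two identifiers used in the usage examples, nllg/bygpt5-medium-en and nllg/poetry-bygpt5-base-en, suggest a naming pattern for the checkpoints. The helper below is a hypothetical convenience and not part of uniformers. That the remaining checkpoints follow the same pattern, and that German checkpoints use the suffix "de", are assumptions to verify on the Model Hub before relying on them.

```python
SIZES = ("small", "base", "medium")
LANGUAGES = ("en", "de")  # "de" is assumed; only "en" appears in the examples


def checkpoint_name(size: str, language: str, poetry: bool = False) -> str:
    """Build a Model Hub identifier following the assumed naming pattern."""
    if size not in SIZES:
        raise ValueError(f"size must be one of {SIZES}, got {size!r}")
    if language not in LANGUAGES:
        raise ValueError(f"language must be one of {LANGUAGES}, got {language!r}")
    prefix = "poetry-" if poetry else ""
    return f"nllg/{prefix}bygpt5-{size}-{language}"


print(checkpoint_name("medium", "en"))             # nllg/bygpt5-medium-en
print(checkpoint_name("base", "en", poetry=True))  # nllg/poetry-bygpt5-base-en
```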

Released Datasets

By default, this library creates the QuaTrain dataset on the fly when it is needed, which can take some time. A preprocessed version (in both English and German) can be found under releases.

| Dataset  | Language | #Quatrains |
| -------- | -------- | ---------- |
| QuaTrain | English  | 2.7M       |
| QuaTrain | German   | 5.9M       |