Uniformers (short for *Universal Coded Character Set Transformers*) is a library for token-free language modeling. In particular, it contains the reference implementation of ByGPT5 from our paper *ByGPT5: End-to-End Style-conditioned Poetry Generation with Token-free Language Models*. ByGPT5 is a token-free, decoder-only transformer that excels at character-level tasks such as style-conditioned poetry generation.
- 📜 Read our paper on ByGPT5 for details.
- 🪶 An interactive demo for poetry generation is available.
- 💡 If you make use of this library in your work, please cite it.
If you want to use this project as a library, you can install it as a regular package with pip:

```sh
pip install 'git+https://github.com/potamides/uniformers.git#egg=uniformers'
```
If your goal is to run the included examples (e.g., to reproduce results), clone the repository and install it in editable mode:

```sh
git clone https://github.com/potamides/uniformers
pip install -e 'uniformers[examples]'
```
Uniformers builds on the transformers library and can be used in much the same way:

```python
from torch import device
from transformers.pipelines.text_generation import TextGenerationPipeline

from uniformers.models.bygpt5 import ByGPT5LMHeadModel, ByGPT5Tokenizer

prompt = "In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English."

# Load the pre-trained English medium-sized model and its tokenizer.
pipeline = TextGenerationPipeline(
    model=ByGPT5LMHeadModel.from_pretrained("nllg/bygpt5-medium-en"),
    tokenizer=ByGPT5Tokenizer.from_pretrained("nllg/bygpt5-medium-en"),
    device=device("cuda:0"),
)

# Sample a continuation of the prompt with top-k and nucleus sampling.
completion = pipeline(
    prompt,
    max_length=1024,
    do_sample=True,
    top_k=40,
    temperature=1.0,
    top_p=0.9,
)

print(completion[0]["generated_text"])
```
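The pipeline above is pinned to `cuda:0`. Since `TextGenerationPipeline` accepts any `torch.device`, a small fallback along the following lines (a sketch, not part of the original example) lets the snippet also run on machines without a GPU:

```python
from torch import cuda, device

# Pick the first CUDA device if one is available, otherwise fall back to the CPU.
pipeline_device = device("cuda:0" if cuda.is_available() else "cpu")
```

Passing `pipeline_device` as the `device` argument keeps the rest of the example unchanged.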
Poetry can also be generated easily; for more involved use cases, take a look at the provided examples.

```python
from torch import device
from transformers.pipelines.text_generation import TextGenerationPipeline

from uniformers.models.bygpt5 import ByGPT5LMHeadModel, ByGPT5Tokenizer
from uniformers.utils import Poetry2Tokens

model = ByGPT5LMHeadModel.from_pretrained("nllg/poetry-bygpt5-base-en")
tokenizer = ByGPT5Tokenizer.from_pretrained("nllg/poetry-bygpt5-base-en")
p2t = Poetry2Tokens(tokenizer)

pipeline = TextGenerationPipeline(
    model=model,
    tokenizer=tokenizer,
    device=device("cuda:0"),
)

# Condition generation on rhyme scheme, meter, and alliteration level by
# prefixing the corresponding style tokens.
styles = (
    tokenizer.bos_token
    + p2t.rhymes2tokens["ABAB"]
    + p2t.meters2tokens["iambus"]
    + p2t.alliterations2tokens["medium"]
)

quatrain = pipeline(
    styles,
    return_full_text=False,
    # Prevent the model from emitting further special style tokens.
    bad_words_ids=[[id_] for id_ in tokenizer.additional_special_tokens_ids],
    do_sample=True,
    max_length=384,
    top_k=0,
    temperature=0.7,
    top_p=0.9,
)

print(quatrain[0]["generated_text"])
```
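The available rhyme schemes, meters, and alliteration levels can be inspected directly. The sketch below assumes the `*2tokens` attributes used above are plain mappings from style names to special tokens, as their subscript usage suggests:

```python
from uniformers.models.bygpt5 import ByGPT5Tokenizer
from uniformers.utils import Poetry2Tokens

tokenizer = ByGPT5Tokenizer.from_pretrained("nllg/poetry-bygpt5-base-en")
p2t = Poetry2Tokens(tokenizer)

# Assuming these attributes are mappings (keys are style names, values are
# special tokens), iterating them lists the supported styles.
for name, mapping in (
    ("rhyme schemes", p2t.rhymes2tokens),
    ("meters", p2t.meters2tokens),
    ("alliteration levels", p2t.alliterations2tokens),
):
    print(f"{name}: {sorted(mapping)}")
```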
We have released the following checkpoints for pre-trained ByGPT5 language models on the Hugging Face Model Hub:
| ByGPT5 | Parameters | Language Modeling | Poetry Generation |
|--------|------------|-------------------|-------------------|
| Small  | 73.5M      | English, German   | English, German   |
| Base   | 139.2M     | English, German   | English, German   |
| Medium | 289.1M     | English, German   | English, German   |
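All sizes are loaded the same way as in the examples above. The checkpoint name below is an assumption extrapolated from `nllg/bygpt5-medium-en`; check the Model Hub for the exact identifiers:

```python
from uniformers.models.bygpt5 import ByGPT5LMHeadModel, ByGPT5Tokenizer

# Assumed naming scheme nllg/bygpt5-{size}-{lang}; verify on the Hub.
checkpoint = "nllg/bygpt5-small-de"
model = ByGPT5LMHeadModel.from_pretrained(checkpoint)
tokenizer = ByGPT5Tokenizer.from_pretrained(checkpoint)
print(f"{model.num_parameters() / 1e6:.1f}M parameters")
```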
By default, this library creates the QuaTrain dataset on the fly when needed (which can take some time). Preprocessed versions (in both English and German) can be found under releases.
| Dataset  | Language | #Quatrains |
|----------|----------|------------|
| QuaTrain | English  | 2.7M       |
| QuaTrain | German   | 5.9M       |
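A downloaded preprocessed release can be inspected with the Hugging Face datasets library. The file name below is a placeholder for whichever release asset you fetched, and the sketch assumes the data ships as JSON:

```python
from datasets import load_dataset

# "QuaTrain-en.json" is a placeholder file name; substitute the asset you
# downloaded from the releases page (and adjust the loader if it is not JSON).
quatrain = load_dataset("json", data_files="QuaTrain-en.json", split="train")
print(quatrain[0])
```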