Uniformers (short for *Universal Coded Character Set Transformers*) is a library for token-free language modeling. In particular, it contains the reference implementation of ByGPT5 from our paper *ByGPT5: End-to-End Style-conditioned Poetry Generation with Token-free Language Models*. ByGPT5 is a token-free, decoder-only transformer that excels at character-level tasks such as style-conditioned poetry generation.
- 📜 Read our paper on ByGPT5 for details.
- 🪶 An interactive demo for poetry generation is available.
- 💡 If you make use of this library in your work, please cite it.
If you want to use this project as a library, you can install it as a regular package with pip:

```sh
pip install 'git+https://github.com/potamides/uniformers.git#egg=uniformers'
```
If your goal is to run the included examples (e.g., to reproduce results), clone the repository and install it in editable mode:

```sh
git clone https://github.com/potamides/uniformers
pip install -e 'uniformers[examples]'
```
Uniformers builds on the transformers library and can be used in much the same way:

```python
from torch import device
from transformers.pipelines.text_generation import TextGenerationPipeline

from uniformers.models.bygpt5 import ByGPT5LMHeadModel, ByGPT5Tokenizer

prompt = "In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English."

# Load the pre-trained English medium-sized model and its tokenizer.
pipeline = TextGenerationPipeline(
    model=ByGPT5LMHeadModel.from_pretrained("nllg/bygpt5-medium-en"),
    tokenizer=ByGPT5Tokenizer.from_pretrained("nllg/bygpt5-medium-en"),
    device=device("cuda:0"),
)

# Sample a continuation of the prompt with top-k and nucleus sampling.
completion = pipeline(
    prompt,
    max_length=1024,
    do_sample=True,
    top_k=40,
    temperature=1.0,
    top_p=0.9,
)

print(completion[0]["generated_text"])
```
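The pipeline above is pinned to `cuda:0`. Since `TextGenerationPipeline` accepts any `torch.device`, a small fallback along the following lines (a sketch, not part of the original example) lets the snippet also run on machines without a GPU:

```python
from torch import cuda, device

# Pick the first CUDA device if one is available, otherwise fall back to the CPU.
pipeline_device = device("cuda:0" if cuda.is_available() else "cpu")
```

Passing `pipeline_device` as the `device` argument keeps the rest of the example unchanged.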
Poetry can also be generated easily; for more involved use cases, take a look at the provided examples.

```python
from torch import device
from transformers.pipelines.text_generation import TextGenerationPipeline

from uniformers.models.bygpt5 import ByGPT5LMHeadModel, ByGPT5Tokenizer
from uniformers.utils import Poetry2Tokens

model = ByGPT5LMHeadModel.from_pretrained("nllg/poetry-bygpt5-base-en")
tokenizer = ByGPT5Tokenizer.from_pretrained("nllg/poetry-bygpt5-base-en")
p2t = Poetry2Tokens(tokenizer)

pipeline = TextGenerationPipeline(
    model=model,
    tokenizer=tokenizer,
    device=device("cuda:0"),
)

# Condition generation on rhyme scheme, meter, and alliteration level by
# prefixing the corresponding style tokens.
styles = (
    tokenizer.bos_token
    + p2t.rhymes2tokens["ABAB"]
    + p2t.meters2tokens["iambus"]
    + p2t.alliterations2tokens["medium"]
)

quatrain = pipeline(
    styles,
    return_full_text=False,
    # Prevent the model from emitting further special style tokens.
    bad_words_ids=[[id_] for id_ in tokenizer.additional_special_tokens_ids],
    do_sample=True,
    max_length=384,
    top_k=0,
    temperature=0.7,
    top_p=0.9,
)

print(quatrain[0]["generated_text"])
```
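The available rhyme schemes, meters, and alliteration levels can be inspected directly. The sketch below assumes the `*2tokens` attributes used above are plain mappings from style names to special tokens, as their subscript usage suggests:

```python
from uniformers.models.bygpt5 import ByGPT5Tokenizer
from uniformers.utils import Poetry2Tokens

tokenizer = ByGPT5Tokenizer.from_pretrained("nllg/poetry-bygpt5-base-en")
p2t = Poetry2Tokens(tokenizer)

# Assuming these attributes are mappings (keys are style names, values are
# special tokens), iterating them lists the supported styles.
for name, mapping in (
    ("rhyme schemes", p2t.rhymes2tokens),
    ("meters", p2t.meters2tokens),
    ("alliteration levels", p2t.alliterations2tokens),
):
    print(f"{name}: {sorted(mapping)}")
```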
We have released the following checkpoints for pre-trained ByGPT5 language models on the Hugging Face Model Hub:
| ByGPT5 | Parameters | Language Modeling | Poetry Generation |
|--------|------------|-------------------|-------------------|
| Small  | 73.5M      | English, German   | English, German   |
| Base   | 139.2M     | English, German   | English, German   |
| Medium | 289.1M     | English, German   | English, German   |
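All sizes are loaded the same way as in the examples above. The checkpoint name below is an assumption extrapolated from `nllg/bygpt5-medium-en`; check the Model Hub for the exact identifiers:

```python
from uniformers.models.bygpt5 import ByGPT5LMHeadModel, ByGPT5Tokenizer

# Assumed naming scheme nllg/bygpt5-{size}-{lang}; verify on the Hub.
checkpoint = "nllg/bygpt5-small-de"
model = ByGPT5LMHeadModel.from_pretrained(checkpoint)
tokenizer = ByGPT5Tokenizer.from_pretrained(checkpoint)
print(f"{model.num_parameters() / 1e6:.1f}M parameters")
```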
By default, this library creates the QuaTrain dataset on the fly when needed (which can take some time). Preprocessed versions (in both English and German) can be found under releases.
| Dataset  | Language | #Quatrains |
|----------|----------|------------|
| QuaTrain | English  | 2.7M       |
| QuaTrain | German   | 5.9M       |
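A downloaded preprocessed release can be inspected with the Hugging Face datasets library. The file name below is a placeholder for whichever release asset you fetched, and the sketch assumes the data ships as JSON:

```python
from datasets import load_dataset

# "QuaTrain-en.json" is a placeholder file name; substitute the asset you
# downloaded from the releases page (and adjust the loader if it is not JSON).
quatrain = load_dataset("json", data_files="QuaTrain-en.json", split="train")
print(quatrain[0])
```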