Watermarking for language models

Description

Re-implementation of the watermarking technique proposed in A Watermark for Large Language Models by Kirchenbauer, Geiping et al. (Original repo).

Usage

Generating (soft) watermarked text with your language model is as easy as:

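import torch
from watermarking import generate

# Load the model (load_my_model, device, vocab_size and batch_size are placeholders you define)
model = load_my_model().eval().to(device)

# Create the prior text: one random start token per sequence, shape (batch_size, 1)
prior = torch.randint(0, vocab_size, (batch_size, 1)).to(device)

# Generate the watermarked text
watermarked = generate(model, prior, max_length=200, watermarked=True, gamma=0.5, delta=2)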
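
To give an intuition for what the gamma and delta arguments control, here is a minimal pure-Python sketch of one soft-watermark sampling step as described in the paper: the previous token seeds a pseudo-random split of the vocabulary, a fraction gamma of it becomes the "green list", and delta is added to the green logits before sampling. The names green_list and soft_watermark_sample are hypothetical, and seeding directly with the token id is an assumption made for simplicity. This is not the code behind generate.

```python
import math
import random


def green_list(prev_token, vocab_size, gamma):
    """Pseudo-randomly pick a fraction gamma of the vocabulary, seeded by the previous token."""
    rng = random.Random(prev_token)  # assumption: the seed is simply the token id
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])


def soft_watermark_sample(logits, prev_token, gamma=0.5, delta=2.0, rng=random):
    """Add delta to the green-list logits, then sample one token id from the softmax."""
    green = green_list(prev_token, len(logits), gamma)
    biased = [x + delta if i in green else x for i, x in enumerate(logits)]
    m = max(biased)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in biased]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]
```

A larger delta pushes sampling harder towards green tokens (stronger watermark, potentially lower text quality), while gamma sets how large the green list is.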

Verifying if a text was watermarked can be done as follows:

from watermarking import detect_watermark

# text is a (B, T) tensor of token indices
z_score = detect_watermark(text, vocabulary_size, gamma=0.5)

# threshold is a value you choose (the original paper uses z > 4 as an example)
if z_score >= threshold:
    print("Text has been AI-generated.")
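
For reference, the paper's detection statistic is a one-proportion z-test on the number of green tokens $|s|_G$ among $T$ scored tokens: $z = (|s|_G - \gamma T) / \sqrt{T \gamma (1 - \gamma)}$. The sketch below only evaluates that formula from a precomputed green-token count. The function name z_score_from_count is hypothetical, and detect_watermark may be organized differently internally.

```python
import math


def z_score_from_count(green_count, num_tokens, gamma):
    """One-proportion z-test: distance of the green-token count from its expected value."""
    expected = gamma * num_tokens
    std = math.sqrt(num_tokens * gamma * (1.0 - gamma))
    return (green_count - expected) / std


# 150 green tokens out of 200 with gamma = 0.5 is about 7 standard deviations above chance
print(round(z_score_from_count(150, 200, 0.5), 2))  # 7.07
```

Unwatermarked text is expected to land near z = 0, since roughly a fraction gamma of its tokens fall in the green list by chance.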

Optionally, you can check the model's perplexity on its own generated text as follows:

from watermarking import get_perplexities

n_perplexities = get_perplexities(model, normal_text)
w_perplexities = get_perplexities(model, watermarked_text)
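
As a reminder of what is being measured, perplexity is the exponential of the average negative log-likelihood the model assigns to each token. The sketch below computes it from a list of per-token log-probabilities. The function name perplexity is hypothetical, and how get_perplexities obtains the log-probabilities from the model is not shown here.

```python
import math


def perplexity(token_log_probs):
    """exp of the average negative log-likelihood per token."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)


# A model that assigns probability 0.5 to every token has a perplexity of about 2
print(perplexity([math.log(0.5)] * 4))
```

Lower perplexity means the model finds the text more likely, so comparing the two sets of values indicates how much the watermark degrades the generated text.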

For more information, refer to this example.

Plotting

With the plot.py script, you can plot the perplexity of the model against the Z-score for watermarked and non-watermarked sentences.

This image was generated by sampling 1,000 non-watermarked and watermarked sentences from HuggingFace's pre-trained GPT-2 model with multinomial sampling, using a sequence length of $seq_{len}=200$, and $\gamma = 0.5$ and $\delta = 2$ for watermarking. The hash function is Python's default hash function applied to the "stringified" tensor holding the word's index in the vocabulary.
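
The hashing step described above can be sketched as follows. One point that may be worth keeping in mind: Python's built-in hash of a string is, by default, randomized per interpreter process (unless PYTHONHASHSEED is set), so a seed obtained this way is only reproducible within the same process. The string "tensor(42)" is an assumed stringified form of a token index, and stable_seed is a hypothetical alternative shown purely for comparison, not something the repository is known to use.

```python
import hashlib


def builtin_seed(token_repr):
    """Seed from Python's built-in hash of the stringified token index."""
    return hash(token_repr)


def stable_seed(token_repr):
    """Hypothetical alternative: a digest-based seed that is identical in every process."""
    digest = hashlib.sha256(token_repr.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


token_repr = "tensor(42)"  # assumed stringified form of a single token index
print(builtin_seed(token_repr) == builtin_seed(token_repr))  # consistent within one process
print(stable_seed(token_repr) == stable_seed(token_repr))    # consistent across processes too
```

If generation and detection run in separate processes, fixing PYTHONHASHSEED or using a digest-based seed along these lines would keep the green lists aligned.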

From the image, we see that watermarked sentences have a much higher Z-score on average despite their relatively low perplexity.

License

The code is released under the MIT license.
