Wordsiv

Wordsiv is a Python package for generating text with a limited character set. It is designed for type proofing, but it may be useful for generating lipograms.

Let's say you have the letters HAMBURGERFONTSIVhamburgerfontsiv and the punctuation marks . and , in your font. Wordsiv might generate the following drivel:

True enough, for a fine enough to him at the thrones above, some time for at that first the business. She is that he hath set as a thought he even from its substratums It is the rest of it is not the things the savings of a movement that he measures about the matter, Ahab gives to the boats at noon, or from the Green Forest, this high as on Ahab.

Why Wordsiv?

While designing a typeface, it is useful to examine text with a partial character set. Wordsiv tries its best to generate realistic-looking text with whatever glyphs are available.

Wordsiv can do things like...

  • Determine available glyphs from a font file
  • Generate sorta-realistic-looking text with a variety of models
  • Filter words by number of characters and approximate rendered width

Wordsiv hopes to:

  • be an easy-to-use and easy-to-extend meaningless language generation framework
  • support many languages and scripts (please help me!)

Installation

First, install wordsiv with pip:

# we install straight from git (for now!)
$ pip install git+https://github.com/tallpauley/wordsiv # byexample: +pass

Next, install one or more source packages from the releases page of the source packages repo:

$ base=https://github.com/tallpauley/wordsiv-source-packages/releases/download
$ pkg=en_markov_gutenberg-0.1.0/en_markov_gutenberg-0.1.0-py3-none-any.whl
$ pip install $base/$pkg # byexample: +pass

Now you can make bogus sentences in Python!

>>> import wordsiv
>>> wsv = wordsiv.WordSiv(limit_glyphs='HAMBURGERFONTSIVhamburgerfontsiv')
>>> wsv.sentence(source='en_markov_gutenberg')
('I might go over the instant to the streets in the air of those the same be '
 'haunting')

Installing in DrawBot App

If you prefer to work in the DrawBot app, you can follow this procedure to install Wordsiv:

  1. Install the wordsiv package via Python->Install Python Packages:

    • Enter git+https://github.com/tallpauley/wordsiv and click Go!
    • Note: you'll probably see lots of red text but it should still work just fine

    Screenshot of DrawBot "Install Python Packages" Window

  2. Install the desired Source packages in the same window, but add --no-deps at the end:

    https://github.com/tallpauley/wordsiv-source-packages/releases/download/en_wordcount_web-0.1.0/en_wordcount_web-0.1.0-py3-none-any.whl --no-deps

  3. When you write your DrawBot script, you will add each source using add_source_module():

    import wordsiv
    import en_wordcount_web
    wsv = wordsiv.WordSiv()
    wsv.add_source_module(en_wordcount_web)
    print(wsv.sentence(source="en_wordcount_web"))

    Screenshot of DrawBot

Sources

Wordsiv first needs some words, which come in the form of Sources: objects which supply the raw word data.

These Sources are available via Source Packages, which are simply Python Packages. Let's install some:

$ base=https://github.com/tallpauley/wordsiv-source-packages/releases/download

# A markov model trained on public domain books
$ pkg=en_markov_gutenberg-0.1.0/en_markov_gutenberg-0.1.0-py3-none-any.whl
$ pip install $base/$pkg   # byexample: +pass

# Most common English words compiled by Peter Norvig with data from Google
$ pkg=en_wordcount_web-0.1.0/en_wordcount_web-0.1.0-py3-none-any.whl
$ pip install $base/$pkg   # byexample: +pass

# Most common English trigrams compiled by Peter Norvig with data from Google
$ pkg=en_wordcount_trigrams-0.1.0/en_wordcount_trigrams-0.1.0-py3-none-any.whl
$ pip install $base/$pkg   # byexample: +pass

Wordsiv auto-discovers these installed packages and can use their sources right away. Let's try a source with the most common words in modern English usage:

>>> from wordsiv import WordSiv
>>> wsv = WordSiv()
>>> wsv.sentence(source='en_wordcount_web')
('Maple canvas sporting pages transferred with superior government brand with '
 'women for key assign.')

Models

How does Wordsiv know how to arrange words from a Source into a sentence? This is where Models come into play.

The source en_wordcount_web uses the model rand by default. Here we select the rand model explicitly to achieve the same result as above:

>>> wsv = WordSiv()
>>> wsv.sentence(source='en_wordcount_web', model="rand")
('Maple canvas sporting pages transferred with superior government brand with '
 'women for key assign.')

Notice we get the same sentence when we initialize a new WordSiv() object. This is because Wordsiv is designed to be deterministic.

Markov Model

If we want text that is somewhat natural-looking, we might use the MarkovModel (model='mkv'):

>>> wsv.paragraph(source="en_markov_gutenberg", model="mkv") # byexample: +skip
"Why don't think so desirous of hugeness. Our pie is worship..."

A Markov model is trained on real text, and forecasts each word by looking at the preceding word(s). We keep the model as stupid as possible, though (a one-word state), to generate as many different sentences as possible.
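
For illustration, here's a minimal sketch of a one-word-state Markov generator. This is hypothetical code to show the idea, not Wordsiv's actual implementation:

import random

def train(text):
    """Map each word to the list of words observed right after it."""
    words = text.split()
    chain = {}
    for prev, nxt in zip(words, words[1:]):
        chain.setdefault(prev, []).append(nxt)
    return chain

def generate(chain, start, length, seed=1):
    """Walk the chain, picking each word among the followers of the last."""
    rand = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        followers = chain.get(out[-1])
        if not followers:
            break
        out.append(rand.choice(followers))
    return ' '.join(out)

chain = train('the cat sat on the mat and the cat ran off')
print(generate(chain, 'the', 8))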

WordCount Models

WordCount Sources and Models work with simple lists of words and occurrence counts to generate text.

The RandomModel (model='rand') uses occurrence counts to randomly choose words, favoring more popular words:

# Default: probability by occurrence count
>>> wsv.paragraph(source='en_wordcount_web', model='rand') # byexample: +skip
'Day music, commencement protection to threads who and dimension...'

The RandomModel can also be set to ignore occurrence counts and choose words completely randomly:

>>> wsv.sentence(source='en_wordcount_web', sent_len=5, prob=False) # byexample: +skip
'Conceivably championships consecration ects— anointed.'
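
Conceptually, the two modes boil down to weighted versus uniform random sampling. A minimal sketch of the idea (hypothetical code and stand-in data, not Wordsiv's internals):

import random

rand = random.Random(123)

# stand-in data: (word, occurrence count) pairs, like a WordCount source
word_counts = [('the', 23135851162), ('of', 13151942776), ('and', 12997637966)]
words = [w for w, _ in word_counts]
counts = [c for _, c in word_counts]

# prob=True (the default): popular words are chosen more often
print(rand.choices(words, weights=counts, k=5))

# prob=False: every word is equally likely
print([rand.choice(words) for _ in range(5)])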

The SequentialModel (model='seq') spits out words in the order they appear in the Source. We could use this Model to display the top 5 trigrams in the English language:

>>> wsv.words(source='en_wordcount_trigrams', num_words=5) # byexample: +skip
['the', 'ing', 'and', 'ion', 'tio']

Limiting Glyphs

Limiting Glyphs with a Font File

Wordsiv is built around the idea of selecting words which can be rendered with the glyphs in an incomplete font file. Wordsiv can automatically determine what glyphs are in a font file.
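
As a rough sketch of the general approach (not necessarily what Wordsiv does internally), a font's character set can be read with fontTools:

from fontTools.ttLib import TTFont

font = TTFont('tests/data/noto-sans-subset.ttf')
# the "best" cmap maps Unicode codepoints to glyph names
chars = {chr(codepoint) for codepoint in font.getBestCmap()}
print(''.join(sorted(chars)))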

Let's load a font containing only the characters HAMBURGERFONTSIVhamburgerfontsiv:

>>> wsv = WordSiv(font_file='tests/data/noto-sans-subset.ttf')
>>> wsv.sentence(source='en_markov_gutenberg', max_sent_len=10)
'Nor is fair to be in as these annuities'

Limiting Glyphs Manually

We can limit the glyphs in the same way, but manually, with limit_glyphs:

>>> wsv = WordSiv(limit_glyphs='HAMBURGERFONTSIVhamburgerfontsiv')
>>> wsv.sentence(source='en_wordcount_web')
'Manage miss ago are motor to rather at first to be of has forget'

Limiting Glyphs with font_file AND limit_glyphs

At times it can be useful to specify the character set we want to display, and use only those characters that are actually present in the font file. We can do this by specifying both font_file and limit_glyphs:

>>> wsv = WordSiv(
...    font_file='tests/data/noto-sans-subset.ttf',
...    limit_glyphs='abcdefghijklmnop'
... )
>>> wsv.sentence('en_wordcount_web', cap_sent=False, min_wl=3)
'eng gnome gene game egg one aim him again one game one image boom'

Customizing Text

There are a variety of ways text can be manipulated. Here are some examples:

Change Case

Both the MarkovModel and the WordCount models allow us to uppercase or lowercase text, whether or not the source words are capitalized:

>>> wsv = WordSiv()
>>> wsv.sentence('en_wordcount_web', uc=True, max_sent_len=8)
'MAPLE CANVAS SPORTING PAGES TRANSFERRED, WITH SUPERIOR GOVERNMENT.'

>>> wsv.sentence(
...    'en_markov_gutenberg', lc=True, min_sent_len=7, max_sent_len=10
... )
'i besought the bosom of the sun so'

The RandomModel capitalizes sentences by default, but we can turn this off:

>>> wsv.sentence('en_wordcount_web', cap_sent=False, sent_len=10)
'egcs very and mortgage expressed about and online truss controls.'

Punctuation

By default, the WordCount models insert punctuation with probabilities roughly derived from English usage.

We can override this by passing our own punctuation function:

>>> def only_period(words, *args): return ' '.join(words) + '.'
>>> wsv.paragraph(
...    source='en_wordcount_web', punc_func=only_period, sent_len=5, para_len=2
... )
'By schools sign I avoid. Or about fascism writers what.'

For more details on punc_func, see punctuation.py. This only applies to the WordCount models, as the MarkovModel uses the punctuation in its source data.

Sentence and Word Parameters

Models take care of generating words and sentences, so the related parameters are handled by the models. For now, please refer to the source code of these models to learn which parameters the word(), words(), and sentence() APIs accept.

Paragraph and Text Parameters

The WordSiv object itself handles sentences(), paragraph(), paragraphs(), and text() calls with their parameters. See the WordSiv class source code to learn how to customize the text output.

Technical Notes

Determinism

When proofing type, we probably want our proof to stay the same as long as we have the same character set. This helps us compare changes in the type.

For this reason, Wordsiv uses a single pseudo-random number generator, which is seeded when the WordSiv object is created. This means that a Python script using this library will produce the same output wherever it runs.
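
The general pattern looks something like this (a hypothetical sketch of the idea, not the actual internals):

import random

class TinyWordSiv:
    """Hypothetical stand-in showing the single seeded RNG idea."""

    def __init__(self, seed=1):
        # one Random instance, seeded once, used for all random decisions
        self.rand = random.Random(seed)

    def sentence(self, words, n=5):
        return ' '.join(self.rand.choice(words) for _ in range(n))

words = ['ham', 'burger', 'font', 'siv']
# same seed, same output, on any machine
assert TinyWordSiv(seed=6).sentence(words) == TinyWordSiv(seed=6).sentence(words)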

If you want your script to generate different text, you can pass a different seed to the WordSiv object:

>>> wsv = WordSiv(seed=6)
>>> wsv.sentence(source="en_markov_gutenberg", min_sent_len=7)
'even if i forgot the go in their'

Ethical Guidelines

After watching the documentary Coded Bias, I considered whether we should even generate text based on historical (or even current) data, because of the sexism, racism, colonialism, homophobia, etc., contained within the texts.

This section attempts to address some of the ethical questions that arose for me (Chris Pauley), and tries to steer this project away from generating offensive text.

Intentions

First off, this library was designed for the purposes of:

  • generating text that isn't intended to be read, just looked at
  • type design proofing, in which we examine how words, sentences, and paragraphs look

Of course, we naturally read words (duh), so it goes without saying that you should supervise text generated by this library.

Ethical Random Text Generation?

I considered whether there were more progressive texts I could train a Markov model on. However, we are scrambling the source text into meaninglessness anyway, to maximize the number of sentences we can make with a limited character set.

Even the most positive text can get dark really quick when scrambled. A Markov model of state size 1 (ideal for limited character sets) trained with the UN Universal Declaration of Human Rights made this sentence:

Everyone is entitled to torture
or other limitation of brotherhood.

The point is, semi-random word generation ruins the meaning of text, so why bother picking a thoughtful source? However, we should really try to stay away from offensive source material, because offensive patterns will show up if there is any probability involved.

Ethical Guidelines for Contributing Sources and Models

If you want to contribute a source and/or model to this project, here are some guidelines:

Generate sentences that are largely nonsensical

For example, we keep the MarkovModel from picking up too much context from the original text by keeping a state size of 1. A single-word state also increases the number of potential sentences, so it works out.

The sentences we generate make less sense, but since this is designed for dummy text for proofing, this is a good thing!

Try to filter sources of "generally offensive" words

It's tricky filtering out "offensive" words, since:

  • the offensiveness of words depends on context
  • offensiveness is largely subjective
  • "offensive" word lists could potentially be used to silence important discussions

Since we're generating nonsensical text for proofing, we should try our best to filter word lists against lists of offensive words. If you really need swears in your text, you can create sources for your own purposes.
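
In practice, this can be as simple as subtracting a blocklist when building a source. A hypothetical sketch (the data here is made up):

# filter a (word, count) list against a curated offensive-word list
word_counts = [('ham', 120), ('burger', 80)]  # stand-in source data
blocked = {'badword'}  # would come from an offensive-word list

filtered = [(w, c) for w, c in word_counts if w.lower() not in blocked]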

We can't prevent random words from forming offensive sentences, but we can at least restrict words that tend to form offensive sentences.

Stay away from offensive sources

Statistical models like those used in Wordsiv will pick up on patterns in text, especially MarkovModels. Try to pick source material that is fairly neutral (not that anything really is).

I trained en_markov_gutenberg with these public domain texts from nltk, which seemed safe enough for a dumb, single-word-state Markov model:

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt',
'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt',
'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt',
'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt',
'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt',
'shakespeare-macbeth.txt', 'whitman-leaves.txt']

If you notice a particular model generating offensive sentences more often than not, please file an issue at the wordsiv-source-packages repo.

Similar Tools

I'm definitely not the first to generate words for proofing. Check out these cool projects, and let me know if you know of more I should add!

  • word-o-mat: Nina Stössinger's RoboFont extension for making test words. Also ported to Glyphs and JavaScript.
  • adhesiontext: a web-based tool by Miguel Sousa for generating text from a limited character set.

Acknowledgements

I probably wouldn't have got very far without the inspiration of word-o-mat, and a nice DrawBot script that Rob Stenson shared with me. The latter is where I got the idea to seed the random number generator to make it deterministic.

I also borrowed heavily from spaCy in how I set up the Source packages.

I also want to thank my wife Pammy for kindly listening as I explained each esoteric challenge I tackled, and for lending me emotional support when I almost wiped out 4 hours of work with a careless Git mistake.