
Cheap Markovs, Traced Quick!

Cheap Markovs, Traced Quick! (CMTQ) is a Python module which analyses text to create a Markov text generator, and outputs the result as a Tracery grammar compatible with the Cheap Bots, Done Quick! (CBDQ) Twitterbot hosting service.

CMTQ is written using Transcrypt-compatible Python, so it can be run as a native Python script (cmtq-console.py), or compiled to a JavaScript library and run in a webpage.

You can use Cheap Markovs, Traced Quick! in your browser, with no installation required, at https://serin-delaunay.github.io/cheapmarkovstracedquick.

Tutorial

Choose your source texts

Your Markov chain grammar needs something to imitate. Paste some text in the input text box, or upload some plain text files. The output generated by the Tracery grammar will look like a mashed-up version of the input text.

Be careful with source texts containing words or phrases relating to sensitive topics. The resulting generator will be able to output every word in the input text, and it can juxtapose concepts in combinations that never appear in the source. To avoid producing offensive or otherwise unwanted output, edit out sections of the source text as necessary, and check a sample of the generator's output before publishing it.

You can use most characters in the input text, but { and } can cause Cheap Bots, Done Quick! to break, and Unicode characters might not work when added with the file uploader.
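If you're preparing source text programmatically, a simple precaution is to strip those characters first. This is a sketch, not part of CMTQ, and the helper name sanitise is hypothetical:

```python
def sanitise(text):
    # "{" and "}" are known to confuse Cheap Bots, Done Quick!,
    # so remove them before pasting the text into CMTQ.
    return text.replace("{", "").replace("}", "")
```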

Choose how to split up the source text

If your Markov chain grammar is going to generate tweets, it needs to know where to stop, and it needs to stop before it hits the character limit. To do this it needs the source text to be broken into "lines". These don't have to be literal lines in your source text; if you want a bot to generate multi-line tweets, you can do that. Rather, a "line" here is any slice of the input text with a good start and end point.

You can choose how the source text is broken up by setting line delimiter characters. Any time CMTQ finds one of these, it will stop the line and start a new one. Special characters can be added using escape sequences:

  • \r: carriage return
  • \n: new line
  • \t: tab

If you turn on the "include line delimiters" option, the delimiters will be included at the end of each line.
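As a rough picture of this behaviour, here is a minimal Python sketch of delimiter-based splitting. The function name split_lines is hypothetical; CMTQ's internals may differ:

```python
import re

def split_lines(text, delimiters, include_delimiters=False):
    # Split wherever any single delimiter character occurs.
    pattern = "[" + re.escape(delimiters) + "]"
    if include_delimiters:
        # Capture the delimiters too, then glue each one back onto
        # the end of the line it terminated.
        parts = re.split("(" + pattern + ")", text)
        lines = [a + b for a, b in zip(parts[::2], parts[1::2] + [""])]
    else:
        lines = re.split(pattern, text)
    return [line for line in lines if line]  # drop empty lines

# Example 3 below: ".?!" as delimiters, delimiters included
print(split_lines("here's line 1.this seems to be line 2?", ".?!", True))
# ["here's line 1.", 'this seems to be line 2?']
```

The examples that follow show the same behaviour on small inputs.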

Example 1

\r\n, delimiters not included, input text:

here's line 1!
here's line 2?

Output line 1: here's line 1!

Output line 2: here's line 2?

Example 2

|, delimiters not included, input text:

here are lines 1,
and one-point-five!|this seems to be line 2?

Output line 1:

here are lines 1,
and one-point-five!

Output line 2:

this seems to be line 2?

Example 3

.?!, delimiters included, input text:

here's line 1.this seems to be line 2? line 3 starts with a space.

Output line 1:

here's line 1.

Output line 2:

this seems to be line 2?

Output line 3:

 line 3 starts with a space.

Choose how lines are interpreted

Markov chains can generate text in more than one way. The simplest is to split each line into individual characters. This allows the generator to invent new words and strings of punctuation, but the results will generally be less comprehensible. CMTQ also allows you to split each line into an alternating sequence of punctuation+whitespace and words. This makes the output seem more grammatically correct, but prevents the bot from inventing new words. Both options split the input into tokens, a process called tokenisation.
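To make the two modes concrete, here is a small sketch of both tokenisations. It's illustrative only; the helper name tokenise is not from CMTQ's API:

```python
import re

def tokenise(line, by_words=True):
    if by_words:
        # \w+ matches a run of word characters; \W+ matches the run of
        # punctuation and whitespace between words.
        return re.findall(r"\w+|\W+", line)
    return list(line)  # character-based tokenisation

print(tokenise("the tokens are"))        # ['the', ' ', 'tokens', ' ', 'are']
print(tokenise("game", by_words=False))  # ['g', 'a', 'm', 'e']
```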

After the text is tokenised, the tokens are recombined into sequences called n-grams. These are small fragments of text, like the character sequence "g", "a", "m", "e" or the word-mode sequence "the", " ", "tokens", " ", "are". Your output grammar can only use the n-grams found in the source text, and it will try to make them appear with frequencies similar to those in the source text.
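Extracting n-grams is just a sliding window over the token sequence. A minimal sketch (the function name ngrams is hypothetical):

```python
def ngrams(tokens, n):
    # Every window of n consecutive tokens.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(list("game"), 2))
# [('g', 'a'), ('a', 'm'), ('m', 'e')]
print(ngrams(["the", " ", "tokens", " ", "are"], 3))
# [('the', ' ', 'tokens'), (' ', 'tokens', ' '), ('tokens', ' ', 'are')]
```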

It's important to choose an appropriate length for the n-grams. If the length is very long (say, the length of the entire source text), the only possible output will be the source text itself. If the length is 1 (the minimum), there will be no continuity in the output text. In general, the value of n is a trade-off between imagination and meaning.

Good values for n-gram length are 3-6 for character-based tokenisation, and 4-7 for word-based tokenisation. Here are some examples:

Input text:

The North Wind and the Sun were disputing which was the stronger,
when a traveler came along wrapped in a warm cloak.
They agreed that the one who first succeeded in making the traveler take his cloak off should be considered stronger than the other.
Then the North Wind blew as hard as he could,
but the more he blew the more closely did the traveler fold his cloak around him;
and at last the North Wind gave up the attempt.
Then the Sun shined out warmly,
and immediately the traveler took off his cloak.
And so the North Wind was obliged to confess that the Sun was the stronger of the two.

Output (characters, n=3): an wer of hisput st strappediat ward the morthey at whim;

Output (characters, n=4): Then a traveler the blew the Sun should blew ther.

Output (characters, n=5): when a warm cloak off should be could be confess the stronger of the traveler fold him;

Output (words, n=4): when a traveler fold his cloak off should be considered stronger of the North Wind gave up the Sun shined out warmly,

Output (words, n=5): And so the North Wind was obliged to confess that the Sun were disputing which was the stronger of the two.

Output (words, n=6): but the more closely did the traveler fold his cloak off should be considered stronger than the other.

Generate a Tracery grammar!

Click the update! button to make a Tracery Markov chain out of your source text and options, and be patient: if you've got a long source text, your computer will have a lot of work to do. When the output is ready, it will appear in the output textbox. Select it all and copy it to your clipboard. You can then save it to a file, or paste it into a Tracery tool like Cheap Bots, Done Quick!.
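For a sense of what such output can look like, here is one way to encode a Markov chain as a Tracery grammar in a few lines of Python. This is a sketch under assumptions: CMTQ's actual symbol-naming scheme and state encoding may differ, and tokens containing Tracery's special characters (such as #) would need escaping first.

```python
import json
from collections import defaultdict

def markov_to_tracery(token_lines, n):
    # Collect transitions: each (n-1)-token state maps to the tokens
    # that can follow it; None marks the end of a line. Duplicates are
    # kept so that output frequencies roughly match the source text.
    transitions = defaultdict(list)
    starts = []
    for tokens in token_lines:
        tokens = list(tokens) + [None]
        starts.append(tuple(tokens[:n - 1]))
        for i in range(len(tokens) - (n - 1)):
            state = tuple(tokens[i:i + n - 1])
            transitions[state].append(tokens[i + n - 1])

    # Give each state a short Tracery-safe symbol name.
    ids = {state: "s{}".format(k) for k, state in enumerate(transitions)}

    # "origin" emits a line's opening tokens and jumps to their state;
    # each state's rules emit one more token and recurse, or stop ("").
    grammar = {"origin": ["".join(s) + "#" + ids[s] + "#" for s in starts]}
    for state, nexts in transitions.items():
        grammar[ids[state]] = [
            "" if t is None else t + "#" + ids[state[1:] + (t,)] + "#"
            for t in nexts
        ]
    return json.dumps(grammar)

# Two one-word lines, character tokens, n=3; the resulting grammar
# can generate "cloak", "cloud", and nothing else.
print(markov_to_tracery([list("cloak"), list("cloud")], 3))
```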

Caveats

Cheap Bots, Done Quick! has (or has had) a maximum grammar size of 4 MB. With a long source text it's very easy to exceed that limit. Grammar compression options are planned for CMTQ; in the meantime, if your grammar fails to save on CBDQ, you can follow the contact instructions on Cheap Bots, Done Quick! to ask for the limit to be raised.

Pasting large grammars into CBDQ can also slow your computer down a lot. Be patient!

Why does this exist?

Markov text generators are a standard, almost stereotypical, way of generating humorous procedural text, and there are already a lot of tools available to make them. Some of these tools even run a Twitterbot for you. However, they all (as far as I know) require you to do at least one of these things:

  • Install a programming environment like Node.js, Python, or Ruby locally
  • Run a server
  • Use command-line tools
  • Program in JavaScript, Python, or Ruby

Requirements like these pose obstacles to the craft of botmaking for non-programmers, so in the spirit of tools like Tracery and CBDQ, CMTQ attempts to minimise the technical knowledge and resources needed to make a Markov Twitterbot.