one-million-monkeys

Different ways to Generate Text


Text Generation

I came across this article and decided to see if I could replicate it using a variety of techniques.

Prerequisites

The following packages need to be installed. I recommend using Chocolatey:

  • 7-zip
  • Python
  • VS Code along with the below plugins
    • Python by Microsoft. Set the option Python >> Data Science: Send Selection To Interactive Window
# Install Chocolatey (run from an elevated PowerShell prompt)
if('Unrestricted' -ne (Get-ExecutionPolicy)) { Set-ExecutionPolicy Bypass -Scope Process -Force }
iex ((New-Object System.Net.WebClient).DownloadString('https://chocolatey.org/install.ps1'))
refreshenv

# Install Python, 7-Zip, and VS Code
choco install python3 -y
choco install 7zip.install -y
choco install vscode -y
refreshenv

# Install the Python extension for VS Code
code --install-extension ms-python.python

Data

Data is sourced from here. The site's robots.txt file seems malformed, so I am assuming it is meant to disallow all automated access.
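As a rough illustration of how a robots.txt file can be checked before fetching anything automatically, here is a minimal sketch using Python's standard `urllib.robotparser`. The rule lines and the URL are hypothetical placeholders, not the real file from the data source, and a malformed file may be interpreted differently by different parsers.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, url, user_agent="*"):
    """Return True if the given robots.txt lines permit fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse from memory; no network request is made
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt contents (assumptions for illustration only).
block_all = ["User-agent: *", "Disallow: /"]
allow_all = ["User-agent: *", "Disallow:"]
url = "https://example.com/plays/hamlet.html"  # placeholder URL

print(is_allowed(block_all, url))  # False: everything is disallowed
print(is_allowed(allow_all, url))  # True: an empty Disallow permits everything
```

When in doubt about what a file intends, treating it as "disallow all" and saving pages by hand, as in the steps that follow, is the conservative choice.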

Steps

  1. Navigate to the main page.
  2. Navigate to the play you want to download.
  3. Use the "Entire play" view
  4. Save the play to ./data/raw using the right-click menu (or Ctrl+S). Save the HTML as "Webpage Only".
  5. Run the parsing script:
python preprocess_plays.py -in ../data/raw -out ../data/parsed
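The internals of preprocess_plays.py are not shown here, so the following is only a sketch of what such a parsing step might look like: it strips the markup from a saved HTML page and keeps the visible text, one entry per text node. The `TextExtractor` class and `html_to_lines` function are hypothetical names, and the real script may well use a different library or keep more structure (speakers, acts, scenes).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and self._skip_depth == 0:
            self.lines.append(text)


def html_to_lines(html):
    """Return the visible text of an HTML document as a list of strings."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.lines


sample = "<html><body><h3>ACT I</h3><p>To be,   or not\n to be</p></body></html>"
print(html_to_lines(sample))  # ['ACT I', 'To be, or not to be']
```

Using only the standard library keeps the sketch dependency-free; a dedicated HTML library would likely be more robust against messy real-world pages.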

Text Generation with Markov Chains (MC)

  1. Train the MC model on the sample text
  2. Test the MC model by generating sample lines
python markov_chain_train.py -in ../data/parsed -out ../data/model.json -o 2
python markov_chain_test.py -in ../data/model.json -out ../data/sample.txt -cnt 2
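To make the two steps above concrete, here is a minimal word-level Markov chain sketch. It assumes that `-o 2` selects a chain of order 2 (each state is two consecutive words) and that the model is stored as JSON; both are guesses about the scripts, and `train` and `generate` are hypothetical names rather than the repository's actual functions.

```python
import json
import random
from collections import defaultdict


def train(lines, order=2):
    """Map each run of `order` consecutive words to the words seen after it."""
    model = defaultdict(list)
    for line in lines:
        words = line.split()
        for i in range(len(words) - order):
            state = " ".join(words[i:i + order])  # string keys survive JSON
            model[state].append(words[i + order])
    return dict(model)


def generate(model, order=2, max_words=20, rng=None):
    """Random-walk the model, starting from a randomly chosen state."""
    rng = rng or random.Random()
    state = rng.choice(sorted(model))
    words = state.split()
    while len(words) < max_words:
        followers = model.get(" ".join(words[-order:]))
        if not followers:  # dead end: no word ever followed this state
            break
        words.append(rng.choice(followers))
    return " ".join(words)


corpus = [
    "to be or not to be that is the question",
    "to be or to do that is the other question",
]
# Round-trip through JSON, as writing and reading a model file would.
model = json.loads(json.dumps(train(corpus, order=2)))
for _ in range(2):  # two sample lines
    print(generate(model, order=2))
```

Repeated followers are kept in the lists on purpose, so that `rng.choice` picks each next word in proportion to how often it appeared after that state in the training text.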