I came across this article and decided to see if I could replicate it using a variety of techniques.
The following packages need to be installed. I recommend using Chocolatey.
- 7-zip
- Python
- VS Code along with the below plugins
- Python by Microsoft. Set the option Python >> Data Science: Send Selection To Interactive Window
if('Unrestricted' -ne (Get-ExecutionPolicy)) { Set-ExecutionPolicy Bypass -Scope Process -Force }
iex ((New-Object System.Net.WebClient).DownloadString('https://chocolatey.org/install.ps1'))
refreshenv
choco install python3 -y
choco install 7zip.install -y
choco install vscode -y
refreshenv
code --install-extension ms-python.python
Data is sourced from here. The robots.txt file seems malformed, so I am guessing it is intended to prevent all access. For that reason the plays are saved by hand rather than scraped.
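To check how a robots.txt file is interpreted, Python's standard-library `urllib.robotparser` can be used. This is a minimal sketch: the helper name `is_allowed`, the sample rules, and the `example.com` URL are all made up for illustration and are not taken from the actual site.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits fetching the URL."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules that block everything (not copied from the real site).
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

if __name__ == "__main__":
    print(is_allowed(BLOCK_ALL, "https://example.com/play.html"))
```

A parser may read a malformed file as more permissive than the site owner intended, so a result of `True` here is not proof that access is welcome.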
- Navigate to the main page.
- Navigate to the play
- Use the "Entire play" view
- Save the play in ./data/raw using the right mouse button (or Ctrl + S). Please save the HTML as "Webpage Only".
- Run the parsing script
python preprocess_plays.py -in ../data/raw -out ../data/parsed
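The contents of `preprocess_plays.py` are not shown here, so the following is only a sketch of one way a parsing step could strip a saved HTML page down to its visible text using the standard library. The names `TextExtractor` and `html_to_lines` are hypothetical, and the real script may well handle speakers, stage directions, and file I/O differently.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text chunks, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def html_to_lines(html: str) -> list[str]:
    """Return the non-empty text chunks found in an HTML document."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.chunks
```

Each returned chunk could then be written as one line of a plain-text file for the training step.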
- Train the Markov chain (MC) model on the sample text
- Test the MC model by generating sample lines
python markov_chain_train.py -in ../data/parsed -out ../data/model.json -o 2
python markov_chain_test.py -in ../data/model.json -out ../data/sample.txt -cnt 2
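For readers unfamiliar with Markov chain text models, the train and test steps above can be sketched as follows. This is not the code in `markov_chain_train.py` or `markov_chain_test.py`. It assumes that `-o 2` means a chain of order 2 (each state is two consecutive words) and that `-cnt` is the number of lines to generate. The function names `train` and `generate` are hypothetical.

```python
import random
from collections import defaultdict


def train(lines, order=2):
    """Map each tuple of `order` consecutive words to the words seen after it."""
    model = defaultdict(list)
    for line in lines:
        words = line.split()
        for i in range(len(words) - order):
            state = tuple(words[i:i + order])
            model[state].append(words[i + order])
    return dict(model)


def generate(model, max_words=20, rng=None):
    """Walk the chain from a random state until it dead-ends or hits max_words."""
    rng = rng or random.Random()
    state = rng.choice(list(model))
    output = list(state)
    while len(output) < max_words:
        followers = model.get(state)
        if not followers:
            break  # no known continuation for this state
        next_word = rng.choice(followers)
        output.append(next_word)
        state = state[1:] + (next_word,)
    return " ".join(output)


if __name__ == "__main__":
    model = train(["to be or not to be that is the question"], order=2)
    for _ in range(2):  # assumed equivalent of -cnt 2
        print(generate(model))
```

Storing repeated followers in a list keeps the sampling proportional to how often each word followed a state. To save such a model as JSON, the tuple keys would first need to be converted to strings, for example by joining the words with a space.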