Generating fun Stack Exchange questions using Markov chains
Requirements:
- Python 3.5+ (only tested with Python 3.6)
- 7z
  On Debian and similar distributions, install it with:

      sudo apt-get install p7zip-full

Setup steps:
- git clone with submodules:

      git clone https://github.com/Findus23/se-simulator
      cd se-simulator
      git submodule init
      git submodule update

- install the Python dependencies:

      pip install -r requirements.txt

- create a MySQL database called `se-simulator`
- rename `config.sample.py` to `config.py`, fill in the database details and create a `secret_key`
- run `create.py`, which creates the database and fetches the list of SE sites
- run `apply_colors.py` (which should run really quickly)
- create folders called `chains`, `download` and `raw` (or symlinks to somewhere with more free disk space)
- [download](https://archive.org/details/stackexchange) the `.7z` files for the sites you want to generate (it's recommended to start with a file <100MB)
  - If the `.7z` file has a different name than the site has now, rename it
- run `consume.py`
  - It should check the hash, move the file to `raw/`, unpack it and extract the needed content from the `.xml` files into new `.jsonl` files. It also records the file in the db, so it won't be imported again.
- now the most important step: run `todb.py`
  - this will generate the Markov chains and save them (or use existing ones on the next run)
  - afterwards 100 questions will be added to the db, with corresponding answers, titles and usernames
- run `shuffle.py`
  - I haven't found a performant way to get a random question without assigning every question an integer and saving the maximum to `count.txt`
- run `server.py`
  - this starts the Flask server on http://127.0.0.1:5000/
- if I didn't miss an important step, the site should be working fine now.
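
The idea behind the `shuffle.py` step above (give every question an integer and save the maximum to `count.txt`) can be sketched roughly as follows. This is only an illustration and not the project's actual code: the function names, the `random_id` key and the plain dictionaries standing in for database rows are all assumptions.

```python
import random
from pathlib import Path


def assign_random_ids(questions, count_file):
    """Give every question a unique integer 1..N in shuffled order and store N."""
    ids = list(range(1, len(questions) + 1))
    random.shuffle(ids)
    for question, random_id in zip(questions, ids):
        question["random_id"] = random_id  # hypothetical column name
    # Save the maximum so later lookups know the valid range.
    Path(count_file).write_text(str(len(questions)))


def pick_random_question(questions_by_id, count_file):
    """Pick a question with one lookup by integer instead of a random sort."""
    maximum = int(Path(count_file).read_text())
    return questions_by_id[random.randint(1, maximum)]
```

With this layout, fetching a random question is a single lookup by integer rather than a random ordering of the whole table, at the price of re-running the shuffle whenever new questions are added.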

Overview of the files:
- `app.py`: needed for Flask
- `basemodel.py` and `models.py`: peewee ORM
- `extra_data.py`: manually collected colors of every site with a custom theme
- `markov.py`: extending the great markovify library for my use case
- `parsexml.py`: reading in the Stack Exchange dump XML files with no more than 40MB RAM usage
- `text_generator.py`: everything that creates the content and handles the Markov chains
- `updater.py`: probably not working anymore, checks for newer dump files
- `utils.py`: everything else
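
Reading a large dump file with a small, constant memory footprint, as described for `parsexml.py`, is usually done by streaming the XML instead of loading it as a whole. The following is a minimal sketch of that technique using the standard library, not the code in this repository. It assumes the dump files consist of `<row ... />` elements with the data in attributes. The function name and the selected fields (`Id`, `Title`, `Body`) are chosen for illustration only.

```python
import json
import xml.etree.ElementTree as ET


def rows_to_jsonl(xml_path, jsonl_path, fields=("Id", "Title", "Body")):
    """Stream <row> elements from a dump file and write one JSON object per line."""
    written = 0
    with open(jsonl_path, "w", encoding="utf-8") as out:
        context = ET.iterparse(xml_path, events=("start", "end"))
        _, root = next(context)  # first event is the start of the root element
        for event, elem in context:
            if event == "end" and elem.tag == "row":
                # Keep only the requested attributes that are present on this row.
                record = {name: elem.attrib[name] for name in fields if name in elem.attrib}
                out.write(json.dumps(record) + "\n")
                written += 1
                root.clear()  # drop processed rows so memory use stays flat
    return written
```

Calling `rows_to_jsonl("Posts.xml", "Posts.jsonl")` would write one line per row. Clearing the root element after each row is what keeps memory usage independent of the file size.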