KryxExtractor

This tool extracts and exports Kryx's 5e homebrew website to a PDF for posterity.

The tool crawls the website using a depth-first search of the links on each page, and compiles the pages found in that search into a single PDF.
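The crawl order described above can be illustrated with a minimal sketch. This is not the tool's actual code: `crawl_depth_first`, `get_links`, and the toy `SITE` mapping are hypothetical names used only for illustration, and the real crawler reads links from rendered pages rather than from a dictionary.

```python
from typing import Callable, Iterable


def crawl_depth_first(
    start_url: str,
    get_links: Callable[[str], Iterable[str]],
    ignore_urls: Iterable[str] = (),
) -> list[str]:
    """Return URLs in the order a stack-based depth-first crawl visits them."""
    ignored = set(ignore_urls)
    visited = set()
    stack = [start_url]  # URLs still to crawl
    history = []         # URLs already crawled, in visit order
    while stack:
        url = stack.pop()
        if url in visited or url in ignored:
            continue
        visited.add(url)
        history.append(url)
        # Push links in reverse so the first link on the page is crawled next.
        for link in reversed(list(get_links(url))):
            if link not in visited:
                stack.append(link)
    return history


# Toy link graph standing in for the website (hypothetical data).
SITE = {
    "/": ["/classes", "/bestiary"],
    "/classes": ["/classes/mage", "/"],
    "/classes/mage": [],
    "/bestiary": ["/bestiary/dragon", "/classes"],
    "/bestiary/dragon": [],
}

order = crawl_depth_first("/", lambda url: SITE.get(url, []))
print(order)
```

Each branch is followed to its end before the crawl returns to sibling links, so in this toy graph every page under `/classes` is visited before `/bestiary`.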

Usage

Follow these steps to run the tool.

Install the requirements in requirements.txt (preferably in a virtual environment):

pip install -r requirements.txt

You will also need Firefox and the Firefox geckodriver. geckodriver releases can be found here:

https://github.com/mozilla/geckodriver/releases

Add the directory containing geckodriver to your $PATH. On Linux:

export PATH="/path/to/geckodriver:$PATH"

You will also need wkhtmltopdf to export the PDF. On Ubuntu, you can run:

sudo apt-get install wkhtmltopdf

or go to https://wkhtmltopdf.org/downloads.html to find the download for Windows or macOS.
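Before running the tool, it can help to confirm that the external programs are actually reachable. The following is a small sketch using only the standard library; `missing_executables` is a hypothetical helper and is not part of KryxExtractor.

```python
import shutil


def missing_executables(names):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


# geckodriver and wkhtmltopdf are the two command-line dependencies named above.
missing = missing_executables(["geckodriver", "wkhtmltopdf"])
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required executables were found.")
```

If either name is reported as missing, revisit the PATH step or the wkhtmltopdf installation step above.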

The tool can be run with its default parameters by running the script:

python KryxExtractor.py

or from the Python interpreter:

from KryxExtractor import KryxExtractor

extractor = KryxExtractor()
extractor.run()

Either approach creates a PDF file of the exported website. To clean up the intermediate PDF and HTML pages and keep only the final compiled PDF:

extractor = KryxExtractor(keep_pdf=False, keep_html=False)
extractor.run()

Known Limitations

The current version does not fix broken internal links from the source HTML
The current version does not allow other Selenium webdrivers
The current version does not support partial crawling (e.g. just the bestiary)
CSS multiclasses, etc. are not currently supported, so certain CSS styles do not work

TODO

Create a table of contents and title page
More beautification so the content fits an 8.5x11 page more evenly

NAME	TYPE	DESCRIPTION
start_url	str	URL to start crawling from
url_prefix	str	URL prefix to replace in target URLs
url_replace	str	String to replace URL prefix
changelog_url	str	URL of the changelog
url_sep_char	str	Separating character in URLs
js_wait_interval	int,float	Interval to wait for javascript actions to occur
page_wait_interval	int,float	Interval to wait between crawling pages
click_offset	int	Offset for clicking off of javascript elements
hit_buttons	list[str]	List of button IDs which have already been hit
button_seek_params	list[args]	Parameters for finding clickable buttons
selenium_driver	Firefox Webdriver	Selenium Webdriver to use
ignore_urls	list[str]	URLs which should not be exported or crawled further
stack	list[str]	Stack of URLs still to crawl
history	list[str]	List of URLs already crawled
html_remove_tags	list[str]	Tags to remove from HTML
export_dir	str	Base directory to export in (irrelevant if path is specified)
version	str	Version string to use
path	str	Path to export the intermediate PDFs and compiled PDFs
output_filename	str	Final filename to use for output
verbose	int	Verbose console output
css_file	str	Static URL of the CSS file to download
stored_css	dict[str:str]	Stored CSS for tags and classes
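
The table does not spell out how url_prefix, url_replace, and url_sep_char interact. One plausible reading, sketched below, is that the prefix of each crawled URL is swapped for the replacement string and the result is then split on the separator character to name the exported page. The function `url_to_page_name` and the example values are assumptions made for illustration and may not match the tool's actual behavior.

```python
def url_to_page_name(url, url_prefix, url_replace, url_sep_char="/"):
    """Swap the URL prefix for a replacement, then join the path parts.

    Assumed behavior for illustration only; not taken from KryxExtractor.
    """
    if url.startswith(url_prefix):
        url = url_replace + url[len(url_prefix):]
    # Drop empty pieces produced by leading, trailing, or doubled separators.
    parts = [part for part in url.split(url_sep_char) if part]
    return "_".join(parts)


# Hypothetical values; the real defaults are defined in KryxExtractor.py.
name = url_to_page_name(
    "https://example.com/classes/mage",
    url_prefix="https://example.com/",
    url_replace="kryx/",
)
print(name)
```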