The workflow is:

1. Gather Comics to pull down the webarchives
2. mv_random to divide the corpus
3. enfolder.py to create a deeper folder hierarchy
4. extract.js to extract rendered rects and HTML
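
As a rough illustration of what the mv_random and enfolder.py steps might look like, here is a minimal Python sketch. It is not the actual code of either tool. The function names `split_random` and `enfolder`, the `fraction`, `seed`, and `depth` parameters, and the choice of hash-based subfolder names are all assumptions made for the example.

```python
import hashlib
import random
import shutil
from pathlib import Path


def split_random(src: Path, dst: Path, fraction: float, seed: int = 0) -> list[Path]:
    """Move a random `fraction` of the files in `src` into `dst`.

    Hypothetical stand-in for the mv_random step; the real tool may differ.
    """
    files = sorted(p for p in src.iterdir() if p.is_file())
    rng = random.Random(seed)  # seeded so the split is reproducible
    chosen = rng.sample(files, k=round(len(files) * fraction))
    dst.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in chosen:
        target = dst / path.name
        shutil.move(str(path), str(target))
        moved.append(target)
    return moved


def enfolder(root: Path, depth: int = 2) -> None:
    """Move each file directly under `root` into nested subfolders.

    Hypothetical stand-in for enfolder.py. Subfolder names are the first
    `depth` hex digits of a SHA-256 hash of the file name (an assumption),
    so "a3f1..." with depth=2 ends up under root/a/3/.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    for path in sorted(p for p in root.iterdir() if p.is_file()):
        digest = hashlib.sha256(path.name.encode("utf-8")).hexdigest()
        subdir = root.joinpath(*digest[:depth])  # one folder level per hex digit
        subdir.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(subdir / path.name))
```

Under these assumptions, calling `split_random(Path("corpus"), Path("held_out"), 0.2)` would move a fifth of the files aside, and `enfolder(Path("corpus"))` would then spread the remaining files over a two-level folder tree so that no single directory holds the whole corpus. The directory names `corpus` and `held_out` are placeholders.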