XKCD Webscraping TUI

A tool that scrapes the title, ID, URL, image URL, and alternative text of comics from the xkcd website, presented as a TUI (text-based/terminal user interface)

Tools & Technologies

Features

  • scrape the comic title, ID, URL, image URL, and alternative text from xkcd
  • scrape a specific range of comic IDs from xkcd (e.g. 800-1000)
  • save the scraped comic data to a JSON or CSV file
  • CPU usage
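
As an illustration of the JSON/CSV export, here is a minimal sketch using only the Python standard library. The field names (id, title, url, image_url, alt), the function name dump_comics, and the sample record are assumptions made for this example and may not match what the project actually writes.

```python
import csv
import io
import json

# Hypothetical field names; the project's real keys may differ.
FIELDS = ["id", "title", "url", "image_url", "alt"]


def dump_comics(records, file_format):
    """Serialize a list of comic dicts to a JSON or CSV string."""
    if file_format == "json":
        return json.dumps(records, indent=2, ensure_ascii=False)
    if file_format == "csv":
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
        writer.writeheader()
        writer.writerows(records)
        return buffer.getvalue()
    raise ValueError(f"unsupported format: {file_format!r}")


# Placeholder record, not real scraped data.
comics = [
    {
        "id": 1000,
        "title": "Example title",
        "url": "https://xkcd.com/1000/",
        "image_url": "https://imgs.xkcd.com/comics/example.png",
        "alt": "placeholder alt text",
    }
]

print(dump_comics(comics, "csv"))
```

The returned string can then be written to a file in the output folder, for example with pathlib.Path.write_text.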

How does it work?

  • go to the Comic ID(s): field and enter the IDs of the comics you want to scrape
    • a single ID: 1000
    • several specific IDs: 1, 800, 1000
    • a range of IDs: 800-1000
    • a single ID combined with a range: 1, 800-1000
    • several ranges: 1-50, 800-1000
    • an open-ended range: 2488-* starts at comic 2488 and finishes at the latest comic, and 1-* scrapes all available comics
  • to select the output file format, click on the text after File Format:
    • it switches between JSON and CSV
  • click on START to start the scraping process
  • click on Show Image to display the image in the console as ASCII art (picture 2&3)
  • click on Open Folder to open the folder where the scraped comic is located
  • use the Back & Next buttons to switch between results
  • see the magic
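
The ID input formats listed above can be sketched as a small parser. This is not the project's actual code: the function name parse_comic_ids and the latest parameter (the number of the newest comic, needed to resolve the * wildcard) are assumptions for illustration.

```python
def parse_comic_ids(spec, latest):
    """Expand an ID string such as "1, 800-1000" or "2488-*" into a list of ints.

    `latest` is the number of the newest comic and is used to resolve "*".
    """
    ids = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start = int(start_text)
            end = latest if end_text.strip() == "*" else int(end_text)
            if start > end:
                raise ValueError(f"empty range: {part!r}")
            ids.extend(range(start, end + 1))
        else:
            ids.append(int(part))
    # Drop duplicates while keeping first-seen order.
    return list(dict.fromkeys(ids))


print(parse_comic_ids("1, 3-5", latest=3000))
```

Overlapping entries such as 1-5, 3 are deduplicated so that no comic is scraped twice.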

Theme Interpretation (Think Inside the Box)

Our interpretation of the theme was that "thinking inside the box" implies something that does not require broad thinking: something that is simple to understand, simple to execute, and simple to get results from. We decided to go for a basic TUI style which is self-contained in a single menu (a box, if you will). We also chose xkcd, out of all websites, because of its comic titled "AI-Box Experiment".

Other

Some xkcd comics are build-yourself comics, so some of the information to be scraped will not be available. In that case, the output will indicate that it is a build-yourself comic.

Installation

Clone this repository

  git clone git@github.com:aiyayayaya/canny-capybaras-collab-code-contest.git

Create a virtual environment (in this example we will be using pipenv)

  pipenv --python 3.9

Install the required packages

  pipenv install -r requirement.txt

  pipenv install -d -r dev-requirements.txt  # optional

Run the project

  pipenv run py __main__

Scraped comics are placed in a subfolder of the output folder. Each subfolder is named after the comic number.

Authors

Known Issues

  • Making the window too small results in a crash, and resizing it too fast may result in a weird-looking TUI
  • Scraping comic 404 results in a crash (this includes ranges that contain comic 404, such as 1-*, which would scrape all the comics)
  • If you try to scrape a comic that doesn't exist, the program will crash
  • Scraping choose-your-own-adventure comics (such as 1350) will result in incorrect and weird information being displayed
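
One possible way to avoid the crashes on comic 404 and on non-existent comics is to filter the ID list before scraping. The sketch below is only a suggestion and is not part of the project: the function name drop_unscrapable and the latest parameter (the number of the newest comic) are hypothetical.

```python
def drop_unscrapable(ids, latest):
    """Split IDs into (kept, skipped), skipping 404 and IDs outside 1..latest."""
    kept, skipped = [], []
    for comic_id in ids:
        if comic_id == 404 or not 1 <= comic_id <= latest:
            skipped.append(comic_id)
        else:
            kept.append(comic_id)
    return kept, skipped


kept, skipped = drop_unscrapable([0, 403, 404, 405, 9999], latest=3000)
print(kept)     # IDs that are safe to request
print(skipped)  # IDs to report back to the user instead of crashing
```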