TODO

  • Use a proper logging library instead of print(flush=True)
  • Crawler
    • Honor robots.txt
    • Limit maximum response size (for example, 2 MB)
    • Better filtering of unwanted URLs
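
A rough sketch of how the logging, robots.txt, and size-limit items could look using only the Python standard library. The helper names (is_allowed, read_capped), the logger name "crawler", and the user agent "mybot" are placeholders made up for this illustration, not existing code in this project; the 2 MB cap is the example figure from the list above.

```python
import io
import logging
from urllib.robotparser import RobotFileParser

# Example cap taken from the TODO item; tune as needed.
MAX_RESPONSE_BYTES = 2 * 1024 * 1024

# Hypothetical logger name; replaces ad-hoc print(flush=True) calls.
log = logging.getLogger("crawler")


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def read_capped(stream, max_bytes: int = MAX_RESPONSE_BYTES,
                chunk_size: int = 64 * 1024):
    """Read a binary stream in chunks; return None if it exceeds max_bytes."""
    chunks = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        if total > max_bytes:
            log.warning("response exceeded %d bytes, dropping it", max_bytes)
            return None
        chunks.append(chunk)
    return b"".join(chunks)


if __name__ == "__main__":
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    rules = "User-agent: *\nDisallow: /private/\n"
    log.info("allowed: %s",
             is_allowed(rules, "mybot", "https://example.com/private/x"))
    log.info("body: %r", read_capped(io.BytesIO(b"hello"), max_bytes=10))
```

read_capped accepts any object with a binary read(n) method, so it should also work on the response object returned by urllib.request.urlopen. Reading in chunks means an oversized body is abandoned shortly after the cap is crossed instead of being loaded fully into memory.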