/spiderman

Enhanced web scraping tool for handling embedded links, tables and lists

Primary Language: Python

Spiderman Web Scraper

After reading through the BeautifulSoup documentation, I realised that many common scraping operations are not built into the module. I therefore filled in as many of those gaps as I could, applying OOP principles to keep my web scraper extensible. Its features include the following:

  • Extracts all tables from a webpage and merges those that share the same column names
  • Extracts hrefs and inserts them into the text itself, set off by delimiters such as brackets (the same can be done for tables and lists)
  • Standardises all hrefs to be complete (absolute) links rather than relative ones
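
To make the features above concrete, here is a minimal sketch of how each could be built on top of BeautifulSoup. It is not the Spiderman module's actual code: the function names `absolutize_links`, `inline_links` and `merge_tables` are hypothetical, and so is the `example.com` base URL. The sketch also assumes that the first row of each table holds its column names and that tables are not nested.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def absolutize_links(soup, base_url):
    """Rewrite every href in place as a complete URL (hypothetical helper)."""
    for a in soup.find_all("a", href=True):
        # urljoin leaves already-absolute hrefs unchanged
        a["href"] = urljoin(base_url, a["href"])
    return soup


def inline_links(soup):
    """Replace each <a> tag with 'text [href]' so the link survives get_text()."""
    for a in soup.find_all("a", href=True):
        a.replace_with(f"{a.get_text(strip=True)} [{a['href']}]")
    return soup


def merge_tables(soup):
    """Collect rows from all tables, merging tables with identical column names.

    Returns a dict mapping each tuple of column names to a list of row dicts.
    Assumes the first row of every table is its header row.
    """
    merged = {}
    for table in soup.find_all("table"):
        rows = [
            [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
            for tr in table.find_all("tr")
        ]
        rows = [row for row in rows if row]  # skip rows without cells
        if not rows:
            continue
        header, body = tuple(rows[0]), rows[1:]
        merged.setdefault(header, []).extend(dict(zip(header, row)) for row in body)
    return merged


html = """
<p>See the <a href="/docs/intro">intro</a> first.</p>
<table><tr><th>name</th><th>qty</th></tr><tr><td>apple</td><td>3</td></tr></table>
<table><tr><th>name</th><th>qty</th></tr><tr><td>pear</td><td>5</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")
absolutize_links(soup, "https://example.com/")
inline_links(soup)

print(soup.p.get_text())
# See the intro [https://example.com/docs/intro] first.
print(merge_tables(soup))
# {('name', 'qty'): [{'name': 'apple', 'qty': '3'}, {'name': 'pear', 'qty': '5'}]}
```

The links are made absolute before they are inlined, so the bracketed text always carries the complete URL. Keying merged tables by their tuple of column names means only tables with exactly the same headers, in the same order, are combined.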

This is the module I use most frequently, so I consider it the most impactful of my earlier projects (to me, at least).