wbkraw

web crawler


A web crawler: given a URL, it outputs a simple textual sitemap.

The crawler is limited to one subdomain: when you start with https://example.com/about, it crawls all pages within example.com, but does not follow external links, for example links to other domains or to subdomain.example.com.
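As a minimal sketch of the same-host rule described above (the sameHost helper and the example URLs are illustrative assumptions, not code taken from this repository):

```go
package main

import (
	"fmt"
	"net/url"
)

// sameHost reports whether link has the same host as the start URL.
// Hypothetical helper for illustration; not this repository's code.
func sameHost(start, link string) bool {
	s, err := url.Parse(start)
	if err != nil {
		return false
	}
	l, err := url.Parse(link)
	if err != nil {
		return false
	}
	return s.Host == l.Host
}

func main() {
	start := "https://example.com/about"
	fmt.Println(sameHost(start, "https://example.com/pricing"))    // true
	fmt.Println(sameHost(start, "https://subdomain.example.com/")) // false: different subdomain
	fmt.Println(sameHost(start, "https://other.example/"))         // false: external domain
}
```

Comparing only the Host component means subdomain.example.com counts as external, matching the behaviour described above.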

Features

The list below shows the feature status of the project:

For the current version

  • Concurrent page crawling: multiple crawlers run simultaneously (see the sketch after this list)
  • A worker pool limits the number of crawlers
  • Arbitrary starting page
  • Collecting absolute and relative links on a page
  • Reports the list of unique URLs
  • Signal handling
  • In-memory URL storage
  • Unit testing
  • Flexible and extendable application design
  • Initial build environment - one only needs Docker and the make utility to build and test the project:
    • make tests runs the unit tests
    • make checks runs the linters
    • make build builds binaries under the ./bin/ directory:
      • Linux (i386, amd64, and armv7)
      • Windows (32- and 64-bit)
      • macOS
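As a rough illustration of how the concurrency features above could fit together - a worker pool capping the number of crawlers, an in-memory visited set, and resolution of both absolute and relative links - here is a sketch under stated assumptions; fetchLinks is a stub standing in for real page fetching, and none of these names come from the repository itself:

```go
package main

import (
	"fmt"
	"net/url"
	"sync"
)

// fetchLinks is a stub standing in for a real HTTP fetch plus HTML parse;
// it is illustrative only, not this repository's implementation.
func fetchLinks(page string) []string {
	stub := map[string][]string{
		"https://example.com/":      {"/about", "https://example.com/jobs"},
		"https://example.com/about": {"/", "https://subdomain.example.com/"},
	}
	return stub[page]
}

func main() {
	start, _ := url.Parse("https://example.com/")

	var (
		mu      sync.Mutex
		visited = map[string]bool{} // in-memory URL storage
		wg      sync.WaitGroup
	)
	jobs := make(chan string, 1024) // buffered so enqueue never blocks in this small sketch

	// enqueue resolves a link against the current page and schedules
	// unseen same-host URLs for crawling.
	enqueue := func(base *url.URL, raw string) {
		ref, err := url.Parse(raw)
		if err != nil {
			return
		}
		abs := base.ResolveReference(ref) // handles absolute and relative links
		if abs.Host != start.Host {
			return // skip external links and other subdomains
		}
		mu.Lock()
		defer mu.Unlock()
		if visited[abs.String()] {
			return // report each URL only once
		}
		visited[abs.String()] = true
		wg.Add(1)
		jobs <- abs.String()
	}

	// Worker pool: a fixed number of goroutines limits concurrent crawlers.
	const workers = 4
	for i := 0; i < workers; i++ {
		go func() {
			for page := range jobs {
				fmt.Println(page) // print each unique URL for the sitemap
				base, _ := url.Parse(page)
				for _, link := range fetchLinks(page) {
					enqueue(base, link)
				}
				wg.Done()
			}
		}()
	}

	enqueue(start, start.String())
	wg.Wait()   // all discovered pages have been processed
	close(jobs) // let the workers exit
}
```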