To run on Linux (Java installed):

chmod a+x ./run_crawler.sh
./run_crawler.sh URL_TO_CRAWL

A small default site is set in the script, so the URL argument can be omitted.
If you don't have Java or a Linux-based system, you can instead ssh to ubuntu@crawler.shaneconnolly.io with the private key shared in the email, then follow the terminal instructions.
The crawler works as follows:
- get all links from the given domain
- add the domain to the set of visited links
- while unvisited has links, take a link from unvisited and get all its links
- if a link follows the rules and is not in the unvisited or visited sets, add it to unvisited
- the loop is capped at MAX_PAGES_TO_LOAD = 2000
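The loop above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: an in-memory map from page to links stands in for real HTTP fetching, the "follows the rules" check is left as a comment, and the site contents in main are hypothetical.

```java
import java.util.*;

public class CrawlSketch {
    static final int MAX_PAGES_TO_LOAD = 2000;

    // Breadth-first crawl starting from startUrl; returns the visited set.
    static Set<String> crawl(String startUrl, Map<String, List<String>> links) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> unvisited = new ArrayDeque<>();
        unvisited.add(startUrl);
        // while unvisited has links (capped at MAX_PAGES_TO_LOAD pages)
        while (!unvisited.isEmpty() && visited.size() < MAX_PAGES_TO_LOAD) {
            String page = unvisited.poll();
            visited.add(page); // add the page to the set of visited links
            for (String link : links.getOrDefault(page, List.of())) {
                // a real crawler would also check its rules here
                // (same domain, not an asset, etc.)
                if (!visited.contains(link) && !unvisited.contains(link)) {
                    unvisited.add(link); // new link: queue it for crawling
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // Hypothetical site: each page maps to the links found on it.
        Map<String, List<String>> site = Map.of(
            "/", List.of("/about", "/blog"),
            "/blog", List.of("/blog/post-1", "/"),
            "/about", List.of("/")
        );
        System.out.println(crawl("/", site));
        // prints [/, /about, /blog, /blog/post-1]
    }
}
```

Keeping separate visited and unvisited sets, and checking both before queueing, is what prevents the loop from revisiting pages on sites with link cycles.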
Still to do:
- tests (not completed yet)
- an endpoint to execute the crawler over HTTP and return the sitemap