This project does not use external libraries for the actual crawling: the 'Spiders' and the work they perform are implemented in-house. The number of threads (Spiders) is pre-set in main.py (currently 8) but can easily be changed manually.
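As a rough illustration of how that thread count might be wired up, here is a minimal sketch; the names `NUMBER_OF_THREADS`, `job_queue`, `work` and `create_workers` are assumptions for this example, not necessarily what main.py actually uses.

```python
# Sketch: spawning a fixed number of Spider threads (hypothetical names).
import threading
from queue import Queue

NUMBER_OF_THREADS = 8        # change this value to run more or fewer Spiders
job_queue = Queue()          # thread-safe queue of links waiting to be crawled

def work():
    # Placeholder worker loop: take a link, "crawl" it, mark the job as done.
    while True:
        url = job_queue.get()
        print(f"crawling {url}")   # a real Spider would fetch and parse here
        job_queue.task_done()

def create_workers():
    # Daemon threads exit automatically when the main program terminates.
    for _ in range(NUMBER_OF_THREADS):
        t = threading.Thread(target=work)
        t.daemon = True
        t.start()
```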
After crawling everything, or upon termination, the app produces a new directory named [your project name], which will contain two files (a sketch of how they might be created follows this list):
- queue.txt, which will contain all uncrawled pages (empty if the crawler has completed the task)
- crawled.txt, which will contain all links preceded by their direct parent links
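A minimal sketch of how that directory and its two files could be set up, assuming the standard library only; the helper names `create_project_dir` and `create_data_files` are illustrative, not confirmed by the project.

```python
# Sketch: create the per-project output directory and its two data files.
import os

def create_project_dir(directory):
    # Create the directory named after the project if it does not exist yet.
    if not os.path.exists(directory):
        os.makedirs(directory)

def create_data_files(project_name, base_url):
    # queue.txt is seeded with the start URL; crawled.txt starts out empty.
    queue_file = os.path.join(project_name, 'queue.txt')
    crawled_file = os.path.join(project_name, 'crawled.txt')
    if not os.path.isfile(queue_file):
        with open(queue_file, 'w') as f:
            f.write(base_url + '\n')
    if not os.path.isfile(crawled_file):
        open(crawled_file, 'w').close()
```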
Secondly, the first Spider creates a single queue of links by pulling every link contained in the tag attributes of the page at the input URL.
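A sketch of what that first extraction step can look like using only the standard library; the `LinkCollector` class and `seed_queue` function are hypothetical names chosen for this example.

```python
# Sketch: pull every href found in the start page's anchor tags into a set.
from html.parser import HTMLParser
from urllib.request import urlopen
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        # Collect the href of every <a> tag, resolved against the base URL.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.add(urljoin(self.base_url, value))

def seed_queue(start_url):
    # The first Spider fills the queue with every link found on the start page.
    html = urlopen(start_url).read().decode('utf-8', errors='ignore')
    collector = LinkCollector(start_url)
    collector.feed(html)
    return collector.links
```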
Finally, all the Spiders check whether there are any links left to crawl in the queue and crawl them (visit the page, extract its links, and note them down).
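Conceptually, that cycle looks something like the sketch below. The in-memory sets stand in for the queue.txt and crawled.txt files the real project persists, and `extract_links` is a stand-in for the Spider's fetch-and-parse step.

```python
# Sketch of the "check the queue, crawl, repeat" cycle using in-memory sets.
def crawl(queue, crawled, extract_links):
    while queue:
        url = queue.pop()
        if url in crawled:
            continue
        for link in extract_links(url):   # go inside and extract
            if link not in crawled:
                queue.add(link)           # found but not yet crawled
        crawled.add(url)                  # note the page down as done

# Example: crawl a tiny fake site described by a dictionary.
site = {'a': {'b', 'c'}, 'b': {'c'}, 'c': set()}
queue, crawled = {'a'}, set()
crawl(queue, crawled, lambda url: site.get(url, set()))
print(crawled)   # {'a', 'b', 'c'}
```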
- queue.txt, where the found but not yet crawled links are stored as a set
- and crawled.txt, to which the crawled links and their origins are appended (see the helper sketch below)
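A hedged sketch of the kind of file helpers this implies; the exact names (`file_to_set`, `set_to_file`, `append_to_file`) and the "parent -> link" line format are assumptions for illustration, not taken from the project.

```python
# Hypothetical helpers for persisting the queue as a set and appending results.
def file_to_set(file_name):
    # Read a file into a set of stripped, non-empty lines.
    with open(file_name, 'r') as f:
        return {line.strip() for line in f if line.strip()}

def set_to_file(links, file_name):
    # Overwrite the file with the current contents of the set, one link per line.
    with open(file_name, 'w') as f:
        for link in sorted(links):
            f.write(link + '\n')

def append_to_file(file_name, parent, link):
    # Record a crawled link together with the page it was found on.
    with open(file_name, 'a') as f:
        f.write(f'{parent} -> {link}\n')
```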
- domain_finder.py --> extracts the domain from a URL; used when imported (a sketch follows this list)
- link_finder.py --> defines the LinkFinder class; locates all the links on a page and collects them into a set
- general.py --> 'housekeeping' functions that are imported and called in multiple places
- spider.py --> defines the class Spider; the entire functionality of an individual crawler-worker
- main.py --> receives input, creates workers, creates jobs, runs the crawl, and handles the live-progress output
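For domain_finder.py, a minimal sketch of what domain extraction can look like with the standard library; the function name and the naive "last two labels" rule are assumptions for this example.

```python
# Sketch: extract the registered domain from a URL using urllib.parse only.
from urllib.parse import urlparse

def get_domain_name(url):
    # 'https://sub.example.com/page' -> 'example.com' (naive last-two-labels rule)
    try:
        netloc = urlparse(url).netloc      # e.g. 'sub.example.com:8080'
        host = netloc.split(':')[0]        # drop any port
        parts = host.split('.')
        return '.'.join(parts[-2:]) if len(parts) >= 2 else host
    except Exception:
        return ''
```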