This project implements a web crawler and analyzes the data collected during the crawl. It is part of a master's course in cyber security. More details about this course can be found here.
- selenium-wire: pip3 install selenium-wire
- tld (used for extracting top-level domains): pip install tld
- ChromeDriver: the crawler runs on ChromeDriver. The required driver version depends on the version of Chrome installed on the system. Please refer here, download the matching driver, and place it in the project directory.
The crawler can be started from the command line. The basic command looks something like this:
python3 script.py -m desktop -i tranco-top-500-safe.csv
With the following arguments:
- -m: specifies the crawler mode, either mobile or desktop.
- -u: specifies a single URL to crawl; takes a string input.
- -i: takes a .csv file containing the URLs to be crawled.
- -head: specifies the browser's crawling mode; takes one of two options, either headfull or headless. Default is headless.
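The arguments above could be wired up roughly as in the following sketch. It is not the actual contents of script.py: the function names build_parser and collect_urls, the dest names, and the choice to leave -m, -u and -i optional are all assumptions made for illustration. The CSV handling also assumes that the URL or domain sits in the last column of each row, as in Tranco's "rank,domain" lists; the real input file may be laid out differently.

```python
import argparse
import csv


def build_parser():
    """Hypothetical parser mirroring the documented flags (script.py may differ)."""
    parser = argparse.ArgumentParser(description="Web crawler (illustrative sketch)")
    # -m: crawler mode, restricted to the two documented values.
    parser.add_argument("-m", dest="mode", choices=["mobile", "desktop"],
                        help="crawler mode")
    # -u: a single URL given as a string.
    parser.add_argument("-u", dest="url", help="single URL to crawl")
    # -i: path to a .csv file listing the URLs to crawl.
    parser.add_argument("-i", dest="input_file",
                        help=".csv file containing URLs to crawl")
    # -head: browser mode; headless is the documented default.
    parser.add_argument("-head", dest="head", choices=["headfull", "headless"],
                        default="headless", help="browser mode")
    return parser


def collect_urls(args):
    """Gather crawl targets from -u and/or -i (hypothetical helper)."""
    urls = []
    if args.url:
        urls.append(args.url)
    if args.input_file:
        with open(args.input_file, newline="") as handle:
            for row in csv.reader(handle):
                if row:
                    # Assumption: the target is the last column of each row.
                    urls.append(row[-1].strip())
    return urls
```

With this sketch, build_parser().parse_args(["-m", "desktop", "-i", "tranco-top-500-safe.csv"]) yields mode set to desktop and head left at its headless default, matching the basic command shown earlier.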
To get the analysis data, please run the attached Jupyter Notebook with the crawl data placed in the same folder.