TorBot (DEDS3C INSIDE)
The basic procedure executed by the web crawling algorithm takes a list of seed URLs as its input and repeatedly executes the following steps (a sketch of the loop follows the list):
- Remove a URL from the URL list.
- Check whether the page exists.
- Download the corresponding page.
- Check the relevancy of the page.
- Extract any links contained in it.
- Check the cache to see whether any of those links have already been visited.
- Add the unique links back to the URL list.
- After all URLs are processed, return the most relevant page.
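The loop can be sketched as follows. This is a minimal illustration of the procedure above, not TorBot's actual implementation; the relevancy check is left as a placeholder, and urlopen only reaches .onion addresses once traffic is routed through Tor (see the basic setup below).

```python
from collections import deque
from urllib.request import urlopen
from urllib.parse import urljoin

from bs4 import BeautifulSoup

def crawl(seed_urls):
    """Hypothetical sketch of the crawl loop described above."""
    queue = deque(seed_urls)
    cache = set(seed_urls)                   # links already seen
    while queue:
        url = queue.popleft()                # remove a URL from the URL list
        try:
            page = urlopen(url, timeout=10)  # check existence and download
        except Exception:
            continue                         # skip pages that do not exist
        soup = BeautifulSoup(page.read(), "html.parser")
        # ... relevancy check of the page would go here ...
        for anchor in soup.find_all("a", href=True):  # extract links
            link = urljoin(url, anchor["href"])
            if link not in cache:            # check the cache for duplicates
                cache.add(link)
                queue.append(link)           # add unique links back to the list
```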
- Crawls Tor links (.onion) only.
- Returns the page title and address (see the sketch after this list).
- Caches links so that there won't be duplicate links. ...(will be updated)
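As an illustration of the title-and-address feature, the lookup can be done with urllib and Beautiful Soup from the dependency list. The helper name below is hypothetical, and fetching a .onion page additionally requires the SOCKS routing described under basic setup.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup

def page_title(url):
    """Hypothetical helper: return the <title> of a page, with a fallback."""
    soup = BeautifulSoup(urlopen(url).read(), "html.parser")
    if soup.title and soup.title.string:
        return soup.title.string.strip()
    return "No title found"
```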
Contributions to this project are always welcome. To add a new feature, fork this repository and submit a pull request once your feature is tested and complete. If it is a new module, it should be placed inside the modules directory and imported into the main file. The branch name should be your new feature name in the format <Feature_featurename_version(optional)>, for example Feature_FasterCrawl_1.0. Your name will be added to the contributors list below. :D
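For instance, a feature branch following that convention could be created and pushed like this (the branch name comes from the example above; `origin` is assumed to point at your fork):

```
git checkout -b Feature_FasterCrawl_1.0
git push -u origin Feature_FasterCrawl_1.0
```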
- Tor
- Python 3.x (make sure pip3 is installed)
- Python Stem module
- urllib
- Beautiful Soup 4
- Socket
- Socks
- Argparse
- Git
Before you run TorBot, make sure the following things are done properly:
- Run the tor service:

```
sudo service tor start
```

- Set a password for tor:

```
tor --hash-password "my_password"
```

- Give the password hash inside torbot.py:

```python
from stem import Signal
from stem.control import Controller

with Controller.from_port(port=9051) as controller:
    controller.authenticate("your_password_hash")
    controller.signal(Signal.NEWNYM)
```
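For context, reaching .onion addresses requires routing traffic through Tor's SOCKS proxy, which is what the Socket and Socks dependencies are for. A minimal sketch is below; it assumes a default Tor install with the SOCKS proxy on port 9050 (the control port used above is 9051) and may differ from TorBot's actual wiring.

```python
import socket

import socks  # provided by the PySocks package

# Route all new sockets through Tor's SOCKS5 proxy on its default port.
socks.set_default_proxy(socks.SOCKS5, "127.0.0.1", 9050)
socket.socket = socks.socksocket

# From this point on, urllib can resolve and fetch .onion addresses via Tor.
```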
To run TorBot:

```
python3 torBot.py
```
```
usage: torBot.py [-h] [-q] [-u URL] [-m] [-e EXTENSION] [-l]

optional arguments:
  -h, --help            show this help message and exit
  -q, --quiet
  -u URL, --url URL     Specify a website link to crawl
  -m, --mail            Get e-mail addresses from the crawled sites
  -e EXTENSION, --extension EXTENSION
                        Specify additional website extensions to the list (.com or .org etc)
  -l, --live            Check if websites are live or not (slow)
```
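The options above map onto a standard argparse setup. The sketch below is a hypothetical reconstruction from the help text, not TorBot's actual source:

```python
import argparse

parser = argparse.ArgumentParser(prog="torBot.py")
parser.add_argument("-q", "--quiet", action="store_true")
parser.add_argument("-u", "--url",
                    help="Specify a website link to crawl")
parser.add_argument("-m", "--mail", action="store_true",
                    help="Get e-mail addresses from the crawled sites")
parser.add_argument("-e", "--extension",
                    help="Specify additional website extensions to the list (.com or .org etc)")
parser.add_argument("-l", "--live", action="store_true",
                    help="Check if websites are live or not (slow)")
args = parser.parse_args()
```

For example: `python3 torBot.py -u http://example.onion -m -l` (the .onion address is a placeholder).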
Read more about torrc here: Torrc
If you have a new idea that is worth implementing, mention it by opening a new issue with the title [FEATURE_REQUEST]. If the idea is worth implementing, congratulations, you are now a contributor.
GNU General Public License
- P5N4PPZ - Owner
- agrepravin - Contributor