TorBot
The basic procedure executed by the web crawling algorithm takes a list of seed URLs as its input and repeatedly executes the following steps:
- Remove a URL from the URL list.
- Check that the page exists.
- Download the corresponding page.
- Check the relevancy of the page.
- Extract any links contained in it.
- Check whether the links are already in the cache.
- Add the unique links back to the URL list.
- After all URLs are processed, return the most relevant page.
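The steps above can be sketched roughly as follows. This is a minimal illustration, not TorBot's actual implementation: the `fetch` callback, the keyword-count `relevance` score, and the `max_pages` limit are all assumptions introduced for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def relevance(page, keyword):
    # Hypothetical relevancy score: how often the keyword occurs in the page.
    return page.lower().count(keyword.lower())


def crawl(seeds, fetch, keyword, max_pages=100):
    """fetch(url) must return the page HTML, or None if the page does not exist."""
    frontier = deque(seeds)      # the URL list
    seen = set(seeds)            # cache of URLs that were already queued
    best_url, best_score = None, -1
    processed = 0
    while frontier and processed < max_pages:
        url = frontier.popleft()             # remove a URL from the list
        page = fetch(url)                    # existence check and download
        if page is None:
            continue
        processed += 1
        score = relevance(page, keyword)     # check relevancy
        if score > best_score:
            best_url, best_score = url, score
        parser = LinkParser()
        parser.feed(page)                    # extract links
        for href in parser.links:
            link = urljoin(url, href)
            if link not in seen:             # only unique links go back in
                seen.add(link)
                frontier.append(link)
    return best_url                          # most relevant page found
```

Passing `fetch` in as a parameter keeps the loop independent of how pages are downloaded, so the same sketch works with a Tor-proxied HTTP client or with an in-memory dictionary for testing.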
- Onion Crawler (.onion). (Completed)
- Return the page title and address with a short description of the site. (Partially Completed)
- Save links to database. (Not Started)
- Get emails from site. (Completed)
- Save crawl info to JSON file. (Completed)
- Crawl custom domains. (Completed)
- Check if the link is live. (Completed)
- Built-in Updater. (Completed)

... (will be updated)
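As an illustration of two of the features listed above (getting emails from a site and saving crawl info to a JSON file), here is a small sketch. The function names, the regular expression, and the JSON layout are assumptions made for this example and are not taken from TorBot's source.

```python
import json
import re

# A deliberately simple pattern; it will not match every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text):
    """Return the unique e-mail addresses found in the text, sorted."""
    return sorted(set(EMAIL_RE.findall(text)))


def save_crawl_info(path, url, links, emails):
    """Write one crawl record to a JSON file and return the record."""
    record = {"url": url, "links": list(links), "emails": list(emails)}
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(record, fh, indent=2)
    return record
```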
Contributions to this project are always welcome. To add a new feature, fork the dev branch and open a pull request once your feature is tested and complete. If it's a new module, put it inside the modules directory and import it into the main file. Name the branch after your feature, in the format <Feature_featurename_version(optional)>. For example, Feature_FasterCrawl_1.0. Your name will be added to the contributor list below. :D
- Tor
- Python 3.x (Make sure pip3 is installed)
- requests
- Beautiful Soup 4
- Socket
- Sock
- Argparse
- Git
- termcolor
- tldextract
Before you run torBot, make sure the following are done:
- Run the tor service: `sudo service tor start`
- Make sure that your torrc sets the SOCKS port to localhost:9050 (the `SocksPort` option).
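To show why the SOCKS port matters, here is a minimal sketch of routing a request through Tor on localhost:9050 with `requests`. The helper names are hypothetical, and the sketch assumes `requests` is installed with SOCKS support (PySocks). The `socks5h` scheme asks the proxy to resolve hostnames, which .onion addresses require.

```python
def tor_proxies(host="localhost", port=9050):
    """Build a requests-style proxy mapping that points at the Tor SOCKS port."""
    address = f"socks5h://{host}:{port}"
    return {"http": address, "https": address}


def fetch_via_tor(url, timeout=30):
    """Download a page through Tor (requires a running tor service)."""
    import requests  # imported here so tor_proxies() works without it

    response = requests.get(url, proxies=tor_proxies(), timeout=timeout)
    response.raise_for_status()
    return response.text
```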
Run `python3 torBot.py`, or pass the -h/--help argument to see the usage:
```
usage: torBot.py [-h] [-v] [--update] [-q] [-u URL] [-s] [-m] [-e EXTENSION]
                 [-l] [-i]

optional arguments:
  -h, --help            Show this help message and exit
  -v, --version         Show current version of TorBot.
  --update              Update TorBot to the latest stable version
  -q, --quiet           Prevent header from displaying
  -u URL, --url URL     Specifiy a website link to crawl, currently returns
                        links on that page
  -s, --save            Save results to a file in json format
  -m, --mail            Get e-mail addresses from the crawled sites
  -e EXTENSION, --extension EXTENSION
                        Specifiy additional website extensions to the
                        list(.com or .org etc)
  -l, --live            Check if websites are live or not (slow)
  -i, --info            Info displays basic info of the scanned site (very
                        slow)
```
- NOTE: All flags listed below -u URL, --url URL must be used together with the -u flag.
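The rule in the note can be expressed with `argparse` roughly as shown here. This is a sketch reconstructed from the usage text, not the contents of torBot.py; `build_parser`, `parse_args`, and `NEEDS_URL` are names invented for the example, and the help strings are shortened.

```python
import argparse

# Flags that only make sense together with -u/--url (per the note above).
NEEDS_URL = ("save", "mail", "extension", "live", "info")


def build_parser():
    parser = argparse.ArgumentParser(prog="torBot.py")
    parser.add_argument("-v", "--version", action="store_true",
                        help="Show current version of TorBot.")
    parser.add_argument("--update", action="store_true",
                        help="Update TorBot to the latest stable version")
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="Prevent header from displaying")
    parser.add_argument("-u", "--url", help="Website link to crawl")
    parser.add_argument("-s", "--save", action="store_true",
                        help="Save results to a file in json format")
    parser.add_argument("-m", "--mail", action="store_true",
                        help="Get e-mail addresses from the crawled sites")
    parser.add_argument("-e", "--extension",
                        help="Additional website extension (.com, .org, ...)")
    parser.add_argument("-l", "--live", action="store_true",
                        help="Check if websites are live or not (slow)")
    parser.add_argument("-i", "--info", action="store_true",
                        help="Display basic info of the scanned site")
    return parser


def parse_args(argv=None):
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.url is None:
        used = [name for name in NEEDS_URL if getattr(args, name)]
        if used:
            # parser.error prints the usage and exits with status 2.
            parser.error("these options require -u/--url: " + ", ".join(used))
    return args
```

For example, `parse_args(["-u", "http://example.onion", "-m"])` succeeds, while `parse_args(["-m"])` exits with a usage error.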
Read more about torrc here: Torrc
- Implement A* Search for webcrawler
- Multithreading
- Optimization
- Randomize Tor Connection (Random Header and Identity)
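For the first item in the list above, one possible shape of A* search over a link graph is sketched below. It is only an illustration under stated assumptions: the graph is an in-memory dictionary mapping each URL to the URLs it links to, every hop costs 1, and `heuristic` is a caller-supplied estimate of the remaining hops to the goal. How a real crawler would define the goal and the heuristic is left open.

```python
import heapq


def a_star(graph, start, goal, heuristic):
    """Return the list of URLs on a cheapest path from start to goal, or None."""
    # Heap entries: (estimated total cost, cost so far, url, path taken).
    frontier = [(heuristic(start), 0, start, [start])]
    best_cost = {start: 0}
    while frontier:
        _, cost, url, path = heapq.heappop(frontier)
        if url == goal:
            return path
        if cost > best_cost.get(url, float("inf")):
            continue  # a cheaper route to this URL was already found
        for link in graph.get(url, ()):
            new_cost = cost + 1  # assumption: every hop costs 1
            if new_cost < best_cost.get(link, float("inf")):
                best_cost[link] = new_cost
                estimate = new_cost + heuristic(link)
                heapq.heappush(frontier, (estimate, new_cost, link, path + [link]))
    return None
```

With a heuristic that always returns 0, the search degrades to uniform-cost search. The returned path is only guaranteed to be the shortest when the heuristic never overestimates the remaining number of hops.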
If you have a new idea that is worth implementing, open a new issue with the title [FEATURE_REQUEST]. If the idea is accepted, congrats, you are now a contributor.
GNU General Public License
- P5N4PPZ - Owner
- agrepravin - Contributor, Reviewer
- KingAkeem - Experienced Contributor, Reviewer
- y-mehta - Contributor
- Manfredi Martorana - Contributor
- Evan Sia Wai Suan - New Contributor
- Lean - New Contributor
- shivankar-madaan - New Contributor