This tool parses a webpage's HTML and finds all potential URLs in it, using either my own script or the urlscan API.
To use it, decide which method you prefer (my own script or the urlscan API) and go to the corresponding folder. My script comes in two versions: one parses all potential URL-bearing attributes the traditional way, and the other uses selenium to parse the tags that may store a link. The two are used in much the same way, but I strongly recommend the selenium version because the webdriver reliably returns the whole DOM. With the traditional version, the DOM fetched by the requests library is sometimes incomplete.
Put your targets in scanRootURLs.txt, or pass a single one with the -u argument. The script parses every a, link, img, script, iframe, source, and area tag and collects as many URLs as possible from their attributes. By default, the output is stored in ./Self Script/scan_output/.
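As a rough illustration of the tag parsing described above, here is a minimal sketch built on Python's standard `html.parser`. It is not the code in this repository: the names `URL_ATTRS`, `LinkCollector`, and `extract_urls` are hypothetical, and the choice of `href`/`src` per tag is an assumption (the real script may inspect more attributes).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tags named in this README, mapped to the attribute assumed to hold a URL.
# This mapping is an illustrative assumption, not the script's actual list.
URL_ATTRS = {
    "a": ("href",),
    "link": ("href",),
    "img": ("src",),
    "script": ("src",),
    "iframe": ("src",),
    "source": ("src",),
    "area": ("href",),
}


class LinkCollector(HTMLParser):
    """Collects absolute URLs from the tags listed in URL_ATTRS."""

    def __init__(self, base_url, skip_tags=()):
        super().__init__()
        self.base_url = base_url
        self.skip_tags = set(skip_tags)  # plays the role of --no_{tag name}_tag
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags or tag not in URL_ATTRS:
            return
        for name, value in attrs:
            if name in URL_ATTRS[tag] and value:
                # Resolve relative links against the page URL.
                self.urls.append(urljoin(self.base_url, value.strip()))


def extract_urls(html, base_url, skip_tags=()):
    collector = LinkCollector(base_url, skip_tags)
    collector.feed(html)
    return collector.urls
```

Calling `extract_urls(html, page_url, skip_tags=("a",))` would skip anchor tags, which mirrors what the `--no_a_tag` option is described as doing.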
- By default, it reads the target root URLs from ./scanRootURLs.txt and writes the output to ./scan_output:
  $ python scan.py
- You can also use the -u argument to scan a single URL:
  $ python parse_html_potential_url.py -u {single url}
- If you don't want to scan a specific tag, pass --no_{tag name}_tag:
  $ python parse_html_potential_url.py --no_a_tag
- All CLI arguments:
  $ python scan.py -h
  usage: scan.py [-h] [-f FILE] [-o OUTPUT] [--no_a_tag] [--no_link_tag]
                 [--no_img_tag] [--no_script_tag] [--no_iframe_tag]
                 [--no_source_tag] [--no_area_tag] [-u URL]

  Process some integers.

  options:
    -h, --help            show this help message and exit
    -f FILE, --file FILE  static html file path
    -o OUTPUT, --output OUTPUT
                          output file path
    --no_a_tag            scan a tag or not
    --no_link_tag         scan link tag or not
    --no_img_tag          scan img tag or not
    --no_script_tag       scan script tag or not
    --no_iframe_tag       scan iframe tag or not
    --no_source_tag       scan source tag or not
    --no_area_tag         scan area tag or not
    -u URL, --url URL     Single url to scan
Put your targets in scanRootURLs.txt. The script parses every a, link, img, script, iframe, source, and area tag and collects as many URLs as possible from their attributes. By default, the output is stored in ./Self Script/scan_output/.
- By default, it reads the target root URLs from ./scanRootURLs.txt and writes the output to ./scan_output:
  $ python parse_html_potential_url.py
- To use the static scan option, you must manually save the webpage's HTML in ./static_scan_input. The script then takes its targets from ./scanRootURLs.txt and loads the matching static files from ./static_scan_input:
  $ python parse_html_potential_url.py --static_scan
- You can also use the -u argument to scan a single URL:
  $ python parse_html_potential_url.py -u {single url}
  # or
  $ python parse_html_potential_url.py -u {single url} --static_scan
- If you don't want to scan a specific tag, pass --no_{tag name}_tag:
  $ python parse_html_potential_url.py --no_a_tag
- All CLI arguments:
  $ python Self\ Script/parse_html_potential_url.py -h
  usage: parse_html_potential_url.py [-h] [--static_scan] [-f FILE] [-o OUTPUT]
                                     [-i INPUT] [--no_a_tag] [--no_link_tag]
                                     [--no_img_tag] [--no_script_tag]
                                     [--no_iframe_tag] [--no_source_tag]
                                     [--no_area_tag] [-u URL]

  Process some integers.

  optional arguments:
    -h, --help            show this help message and exit
    --static_scan         use static html file to scan or not
    -f FILE, --file FILE  static html file path
    -o OUTPUT, --output OUTPUT
                          output file path
    -i INPUT, --input INPUT
                          input file path
    --no_a_tag            scan a tag or not
    --no_link_tag         scan link tag or not
    --no_img_tag          scan img tag or not
    --no_script_tag       scan script tag or not
    --no_iframe_tag       scan iframe tag or not
    --no_source_tag       scan source tag or not
    --no_area_tag         scan area tag or not
    -u URL, --url URL     Single url to scan
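For readers curious how per-tag switches like the ones in the help output can be wired up, the following is a minimal `argparse` sketch. It only mirrors the flag names shown in the help text; `TAGS`, `build_parser`, and `enabled_tags` are hypothetical names, and the actual script may define its options differently.

```python
import argparse

# Tag names taken from the help output; everything else is illustrative.
TAGS = ("a", "link", "img", "script", "iframe", "source", "area")


def build_parser():
    parser = argparse.ArgumentParser(description="Sketch of the documented options")
    parser.add_argument("-u", "--url", help="Single url to scan")
    parser.add_argument("-o", "--output", help="output file path")
    for tag in TAGS:
        # store_true: the flag is False unless it is given on the command line.
        parser.add_argument(
            f"--no_{tag}_tag",
            action="store_true",
            help=f"skip the {tag} tag",
        )
    return parser


def enabled_tags(args):
    """Return the tags that were not switched off with --no_{tag}_tag."""
    return [tag for tag in TAGS if not getattr(args, f"no_{tag}_tag")]
```

With this sketch, parsing `["-u", "https://example.com", "--no_a_tag"]` leaves every tag except `a` enabled.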
This script simply calls the urlscan API, parses the result, and stores it in ./scan_output. urlscan cannot scan some URLs, for example when a URL looks like spam or the submitted URL was blocked from scanning.
- By default, it reads the URLs from ./scanRootURLs.txt and stores the output in ./scan_output:
  $ python urlscan.py
- You can also use the -u argument to scan a single URL:
  $ python urlscan.py -u {single url}
- All CLI arguments:
  $ python URLScan/urlscan.py -h
  usage: urlscan.py [-h] [-f FILE] [-o OUTPUT] [-u URL]

  Process some integers.

  optional arguments:
    -h, --help            show this help message and exit
    -f FILE, --file FILE  static html file path
    -o OUTPUT, --output OUTPUT
                          output file path
    -u URL, --url URL     Single url to scan
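To give an idea of what calling urlscan involves, here is a minimal sketch of building a scan submission and reading URLs out of a result document. It is not the code in urlscan.py. The endpoint, the `API-Key` header, and the `lists.urls` field reflect my understanding of urlscan's public API and should be checked against its documentation; `build_submission` and `urls_from_result` are hypothetical helper names.

```python
import json
import urllib.request

# Assumed submission endpoint of the urlscan public API.
SUBMIT_ENDPOINT = "https://urlscan.io/api/v1/scan/"


def build_submission(target_url, api_key, visibility="public"):
    """Build (but do not send) the POST request that submits a URL for scanning."""
    headers = {"API-Key": api_key, "Content-Type": "application/json"}
    body = json.dumps({"url": target_url, "visibility": visibility}).encode("utf-8")
    return urllib.request.Request(
        SUBMIT_ENDPOINT, data=body, headers=headers, method="POST"
    )


def urls_from_result(result):
    """Pull the de-duplicated URL list out of a parsed result document.

    Assumes the result JSON keeps observed URLs under result["lists"]["urls"].
    """
    return sorted(set(result.get("lists", {}).get("urls", [])))
```

Sending the request with `urllib.request.urlopen` raises `urllib.error.HTTPError` when the server answers with an error status, which is one place a rejected submission could be caught and skipped.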