A general-purpose email scrapper with multithread capabilities that crawles domains to extract email addresses from the domain's websites.
The email scrapper is a two-step procedure:
-
Extract all links:
python scrap_emails.py url
where:
url
: The URL to extract links from
-
Proccess all e-mail addresses from the file with e-mails:
python clean_emails.py file
where:
file
: File.txt
containing e-mai's to clean.
Prefer https://
over http://
:
python scrap_emails.py "https://example.net" or "https://www.example.net"
python clean_emails.py "www.example.net.txt"
You can use a for
-loop in bash for massive scraping:
for i in $(cat domain_list.txt); do python scrap_emails.py "$i"; done
for i in $(ls *.txt); do python clean_emails.py $i; done
In the root folder of the repository run:
conda env create --file environment.yml
-
RegEx to clean url
http[s]?:\/\/(?:[w]{3}\.)?(.*)(?<!\/)
-
RegEx for Email
(?:[a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])