Email Address Harvester

Crawls a website and harvests email addresses. Email addresses are identified from <a href="mailto:..."> tags, or from text that matches the common email format user@domain.tld.
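
As a rough illustration of the two detection paths described above, the following sketch uses only the Python standard library. The names `EMAIL_RE`, `MailtoParser`, and `extract_emails` are hypothetical, and the regular expression is a deliberately simple assumption, not necessarily the pattern the script itself uses.

```python
import re
from html.parser import HTMLParser
from urllib.parse import unquote

# Simple user@domain.tld pattern. This is an assumption for illustration,
# not the script's actual regular expression.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class MailtoParser(HTMLParser):
    """Collects addresses from mailto: links and from plain page text."""

    def __init__(self):
        super().__init__()
        self.emails = []

    def _add(self, address):
        # Keep first-seen order and skip duplicates.
        if address and address not in self.emails:
            self.emails.append(address)

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.lower().startswith("mailto:"):
            # Drop the scheme and any ?subject=... query part.
            self._add(unquote(href[len("mailto:"):].split("?", 1)[0]).strip())

    def handle_data(self, data):
        # Text nodes: look for anything shaped like user@domain.tld.
        for match in EMAIL_RE.findall(data):
            self._add(match)


def extract_emails(html):
    parser = MailtoParser()
    parser.feed(html)
    parser.close()
    return parser.emails
```

Calling `extract_emails(page_html)` returns the addresses in the order they were first seen. A mailto: link that lists several comma-separated recipients is kept as a single string in this sketch.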

It crawls a site by following new internal links in a breadth-first search, and logs each URL it visits in a .log file. URLs that end with a common file extension such as .jpg, .mov, or .zip are ignored, because they are most likely not text/html content and will not contain any mailto: links. Results are saved to a .csv file with the columns:

  • Email - The email address found
  • Text - The text content of the mailto: link
  • Context - The nearest or most likely descriptive text or heading in relation to the email address
  • URL - The URL it was extracted from
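
The crawl order and output layout described above could be sketched roughly as follows. This is a minimal sketch under assumptions, not the script's actual code: `crawl_order`, `is_excluded`, `write_results`, and the `get_links` callback are hypothetical names, and `EXCLUDED_EXTENSIONS` contains only the three extensions named in this README.

```python
import csv
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse

# Hypothetical exclusion list; the script's full list is not shown here.
EXCLUDED_EXTENSIONS = (".jpg", ".mov", ".zip")
COLUMNS = ["Email", "Text", "Context", "URL"]


def is_excluded(url):
    """True if the URL path ends with an extension unlikely to be HTML."""
    return urlparse(url).path.lower().endswith(EXCLUDED_EXTENSIONS)


def crawl_order(start_url, get_links):
    """Return internal URLs in breadth-first visiting order.

    get_links(url) is supplied by the caller and returns the href
    values found on that page.
    """
    site = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        for href in get_links(url):
            # Resolve relative links and drop #fragments.
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc != site:
                continue  # external link, or a non-HTTP scheme like mailto:
            if is_excluded(link) or link in seen:
                continue
            seen.add(link)
            queue.append(link)
    return visited


def write_results(rows, fileobj):
    """Write result dictionaries using the four columns listed above."""
    writer = csv.DictWriter(fileobj, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
```

Using a FIFO queue is what makes the traversal breadth-first: every link found on the start page is visited before any link found one level deeper.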

Usage example: python .\src\emailharvester.py https://example.com/

An existing .log file for a domain will inform the script of pages it has already completely crawled, so you can restart a crawl in progress. To completely restart a crawl, delete or rename the .log and .csv files.
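
One way such a restart could work is sketched below. The log format is an assumption (one fully crawled URL per line); the real .log file may record more than that. `load_crawled` and `record_crawled` are hypothetical helper names.

```python
from pathlib import Path


def load_crawled(log_path):
    """Return the set of URLs recorded in the log, or an empty set if no log exists."""
    path = Path(log_path)
    if not path.exists():
        return set()
    with path.open(encoding="utf-8") as log:
        # Assumed format: one URL per line; blank lines are ignored.
        return {line.strip() for line in log if line.strip()}


def record_crawled(log_path, url):
    """Append a URL to the log once its page has been completely processed."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(url + "\n")
```

On startup, a crawler built this way would seed its set of already-visited URLs from `load_crawled(...)`, which is why deleting or renaming the log forces a fresh crawl.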

Progress is printed to the console when a page crawl begins and whenever an email is found. A summary table is displayed when a page crawl completes:

    Now crawling https://www.example.com/
    ...
    Found new email address: public-outreach@example.com
    Found new email address: Manager@example.com
    Completed crawling about/economicdevelopment.asp:
            146     total links.
            0       newly discovered.
            104     already crawled.
            42      excluded.
            1       new emails found.
            1       existing emails skipped.
    Completed crawling /about/contact:
            144     total links.
            0       newly discovered.
            104     already crawled.
            40      excluded.
            0       new emails found.
            0       existing emails skipped.
    ...
    Done. Visited 134 new URLs and skipped 322 existing.
    Found 92 total email addresses, 34 new.
    Duration: 0 days, 0 hrs, 2 mins and 30 secs
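
The duration line at the end of that output can be reproduced from an elapsed time in seconds with repeated `divmod` calls. `format_duration` is a hypothetical helper written for illustration, not a function taken from the script.

```python
def format_duration(total_seconds):
    """Format elapsed seconds in the style of the summary's Duration line."""
    total = int(total_seconds)
    days, rest = divmod(total, 86400)    # 86400 seconds per day
    hours, rest = divmod(rest, 3600)     # 3600 seconds per hour
    minutes, seconds = divmod(rest, 60)
    return f"{days} days, {hours} hrs, {minutes} mins and {seconds} secs"
```

For example, `format_duration(150)` yields `0 days, 0 hrs, 2 mins and 30 secs`, matching the sample run above.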

Batchfile Usage

You can run several crawls sequentially by calling the program with a different URL on each line of a single .bat file, and piping the Enter key to it so that it continues when the script pauses at the end of a run. For example, if you use the released .exe, create multisite.bat with the following contents:

    echo/|call emailaddressharvester.exe https://example.com
    echo/|call emailaddressharvester.exe https://example2.com
    emailaddressharvester.exe https://example3.com

Then double-click it. Results will be saved to separate files per site.
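
If you would rather drive the crawls from Python than from a batch file, a roughly equivalent sketch is shown below. It assumes the executable is on the PATH or in the current directory, and `run_crawls` is a hypothetical helper. Unlike the batch file above, it sends Enter to every run, including the last one.

```python
import subprocess


def run_crawls(urls, command=("emailaddressharvester.exe",)):
    """Run one crawl per URL, sending Enter so the end-of-run pause does not block."""
    return_codes = []
    for url in urls:
        # input="\n" plays the role of `echo/|` in the batch file.
        result = subprocess.run([*command, url], input="\n", text=True)
        return_codes.append(result.returncode)
    return return_codes
```

For instance, `run_crawls(["https://example.com", "https://example2.com"])` runs the two crawls one after the other and returns their exit codes.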