/webscreenshot

A simple script to screenshot a list of websites

Primary LanguagePythonGNU Lesser General Public License v3.0LGPL-3.0

webscreenshot

Description

A simple script to screenshot a list of websites, based on the url-to-image PhantomJS script.

Features

  • Integrating url-to-image 'lazy-rendering' for AJAX resources
  • Fully functional on Windows and Linux systems
  • Cookie and custom HTTP header definition support for the PhantomJS renderer
  • Multiprocessing and killing of unresponding processes after a user-definable timeout
  • Accepting several formats as input target
  • Customizing screenshot size (width, height), format and quality
  • Mapping useful options of PhantomJS such as ignoring ssl error, proxy definition and proxy authentication, HTTP Basic Authentication
  • Supports multiple renderers:
    • PhantomJS, which is legacy and abandoned but the one still producing the best results
    • Chrome and Chromium, which will replace PhantomJS but currently have some limitations: screenshoting an HTTPS website not having a valid certificate, for instance a self-signed one, will produce an empty screenshot.
      The reason is that the --ignore-certificate-errors option doesn't work and will never work anymore: the solution is to use a proper webdriver, but to date webscreenshot doesn't aim to support this rather complex method requiring some third-party tools.
    • Firefox can also be used as a renderer but has some serious limitations (so don't use it for the moment):
      • Impossibility to perform multiple screenshots at the time: no multi-instance of the firefox process
      • No incognito mode, using webscreenshot will pollute your browsing history

Usage

Put your targets in a text file and pass it with the -i option, or as a positional argument if you have just a single URL.
Screenshots will be available, by default, in your current ./screenshots/ directory.
Accepted input formats are the following:

http(s)://domain_or_ip:port(/ressource)
domain_or_ip:port(/ressource)
domain_or_ip(/ressource)

Options

webscreenshot.py version 2.5

usage: webscreenshot.py [-h] [-i INPUT_FILE] [-o OUTPUT_DIRECTORY]
                        [-w WORKERS] [-v]
                        [-r {phantomjs,chrome,chromium,firefox}]
                        [--renderer-binary RENDERER_BINARY] [--no-xserver]
                        [--window-size WINDOW_SIZE]
                        [-f {pdf,png,jpg,jpeg,bmp,ppm}] [-q [0-100]] [-p PORT]
                        [-s] [-m] [-c COOKIE] [-a HEADER] [-u HTTP_USERNAME]
                        [-b HTTP_PASSWORD] [-P PROXY] [-A PROXY_AUTH]
                        [-T PROXY_TYPE] [-t TIMEOUT]
                        [URL]

optional arguments:
  -h, --help            show this help message and exit

Main parameters:
  URL                   Single URL target given as a positional argument
  -i INPUT_FILE, --input-file INPUT_FILE
                        <INPUT_FILE> text file containing the target list. Ex:
                        list.txt
  -o OUTPUT_DIRECTORY, --output-directory OUTPUT_DIRECTORY
                        <OUTPUT_DIRECTORY> (optional): screenshots output
                        directory (default './screenshots/')
  -w WORKERS, --workers WORKERS
                        <WORKERS> (optional): number of parallel execution
                        workers (default 4)
  -v, --verbosity       <VERBOSITY> (optional): verbosity level, repeat it to
                        increase the level { -v INFO, -vv DEBUG } (default
                        verbosity ERROR)

Screenshot parameters:
  -r {phantomjs,chrome,chromium,firefox}, --renderer {phantomjs,chrome,chromium,firefox}
                        <RENDERER> (optional): renderer to use among
                        'phantomjs' (legacy but best results), 'chrome',
                        'chromium', 'firefox' (version > 57) (default
                        'phantomjs')
  --renderer-binary RENDERER_BINARY
                        <RENDERER_BINARY> (optional): path to the renderer
                        executable if it cannot be found in $PATH
  --no-xserver          <NO_X_SERVER> (optional): if you are running without
                        an X server, will use xvfb-run to execute the renderer
  --window-size WINDOW_SIZE
                        <WINDOW_SIZE> (optional): width and height of the
                        screen capture (default '1200,800')
  -f {pdf,png,jpg,jpeg,bmp,ppm}, --format {pdf,png,jpg,jpeg,bmp,ppm}
                        <FORMAT> (optional, phantomjs only): specify an output
                        image file format, "pdf", "png", "jpg", "jpeg", "bmp"
                        or "ppm" (default 'png')
  -q [0-100], --quality [0-100]
                        <QUALITY> (optional, phantomjs only): specify the
                        output image quality, an integer between 0 and 100
                        (default 75)

Input processing parameters:
  -p PORT, --port PORT  <PORT> (optional): use the specified port for each
                        target in the input list. Ex: -p 80
  -s, --ssl             <SSL> (optional): enforce ssl for every connection
  -m, --multiprotocol   <MULTIPROTOCOL> (optional): perform screenshots over
                        HTTP and HTTPS for each target

HTTP parameters:
  -c COOKIE, --cookie COOKIE
                        <COOKIE_STRING> (optional): cookie string to add. Ex:
                        -c "JSESSIONID=1234; YOLO=SWAG"
  -a HEADER, --header HEADER
                        <HEADER> (optional): custom or additional header.
                        Repeat this option for every header. Ex: -a "Host:
                        localhost" -a "Foo: bar"
  -u HTTP_USERNAME, --http-username HTTP_USERNAME
                        <HTTP_USERNAME> (optional): specify a username for
                        HTTP Basic Authentication.
  -b HTTP_PASSWORD, --http-password HTTP_PASSWORD
                        <HTTP_PASSWORD> (optional): specify a password for
                        HTTP Basic Authentication.

Connection parameters:
  -P PROXY, --proxy PROXY
                        <PROXY> (optional): specify a proxy. Ex: -P
                        http://proxy.company.com:8080
  -A PROXY_AUTH, --proxy-auth PROXY_AUTH
                        <PROXY_AUTH> (optional): provides authentication
                        information for the proxy. Ex: -A user:password
  -T PROXY_TYPE, --proxy-type PROXY_TYPE
                        <PROXY_TYPE> (optional): specifies the proxy type,
                        "http" (default), "none" (disable completely), or
                        "socks5". Ex: -T socks
  -t TIMEOUT, --timeout TIMEOUT
                        <TIMEOUT> (optional): renderer execution timeout in
                        seconds (default 30 sec)

Examples

list.txt
--------
http://google.fr
https://216.58.213.131
216.58.213.131
https://duckduckgo.com/robots.txt


Default execution with a list
-----------------------------
$ python webscreenshot.py version 2.3

[+] 4 URLs to be screenshot
[+] 4 actual URLs screenshot
[+] 0 error(s)


Default execution with a single URL
-----------------------------------
$ python webscreenshot.py -v google.fr
webscreenshot.py version 2.3

[INFO][General] 'google.fr' has been formatted as 'http://google.fr:80' with supplied overriding options
[+] 1 URLs to be screenshot
[INFO][http://google.fr:80] Screenshot OK

[+] 1 actual URLs screenshot
[+] 0 error(s)


Increasing verbosity level execution
-----------------------------------
$ python webscreenshot.py -i list.txt -v
webscreenshot.py version 2.3

[INFO][General] 'http://google.fr' has been formatted as 'http://google.fr:80' with supplied overriding options
[INFO][General] 'https://216.58.213.131' has been formatted as 'https://216.58.213.131:443' with supplied overriding options
[INFO][General] '216.58.213.131' has been formatted as 'http://216.58.213.131:80' with supplied overriding options
[INFO][General] 'https://duckduckgo.com/robots.txt' has been formatted as 'https://duckduckgo.com:443/robots.txt' with supplied overriding options
[+] 4 URLs to be screenshot
[INFO][https://duckduckgo.com:443/robots.txt] Screenshot OK

[INFO][http://216.58.213.131:80] Screenshot OK

[INFO][https://216.58.213.131:443] Screenshot OK

[INFO][http://google.fr:80] Screenshot OK

[+] 4 actual URLs screenshot
[+] 0 error(s)


Results
-------
$ ls -l screenshots/
total 187
-rwxrwxrwx 1 root root 53805 May 19 16:04 http_216.58.213.131_80.png
-rwxrwxrwx 1 root root 53805 May 19 16:05 http_google.fr_80.png
-rwxrwxrwx 1 root root 53805 May 19 16:04 https_216.58.213.131_443.png
-rwxrwxrwx 1 root root 27864 May 19 16:04 https_duckduckgo.com_443_robots.txt.png

Supported options by renderers

Options not listed here below are supported by every current renderer

Option category Option PhantomJS renderer Chrome / Chromium renderer Firefox renderer
Screenshot parameters
format (-f) Yes No No
quality (-q) Yes No No
HTTP parameters
cookie (-c) Yes No No
header (-a) Yes No No
http_username (-u) Yes No No
http_password (-b) Yes No No
Connection parameters
proxy (-P) Yes Yes No
proxy_auth (-A) Yes No No
proxy_type (-T) Yes No No
Ability to screenshot a HTTPS website with a non-publicly-signed certificate Yes No No

Requirements

  • A Python interpreter with version 2.7 or 3.X
  • The webscreenshot python script:
    • The easiest way to setup it: pip install webscreenshot and then directly use $ webscreenshot
    • Or git clone that repository and pip install -r requirements.txt and then python webscreenshot.py
  • The PhantomJS tool with at least version 2: follow the installation guide and check the FAQ if necessary
  • Chrome, Chromium or Firefox > 57 if you want to use one of these renderers
  • xvfb if you want to run webscreenshot in an headless OS: use the --no-xserver webscreenshot option to ease everything

Changelog

  • version 2.5 - 09/22/2019: Image quality and format options added, PhantomJS useragent updated, modern TLD support
  • version 2.4 - 05/30/2019: Few fixes for Windows support
  • version 2.3 - 05/19/2019: Python 3 compatibility, Firefox renderer added, no-xserver option added
  • version 2.2 - 08/13/2018: Chrome and Chromium renderers support and single URL support
  • version 2.1 - 01/14/2018: Multiprotocol option addition and PyPI packaging
  • version 2.0 - 03/08/2017: Adding proxy-type option
  • version 1.9 - 01/10/2017: Using ALL SSL/TLS ciphers
  • version 1.8 - 07/05/2015: Option groups definition
  • version 1.7 - 06/28/2015: HTTP basic authentication support + loglevel option changed to verbosity
  • version 1.6 - 04/23/2015: Transparent background fix
  • version 1.5 - 01/11/2015: Cookie and custom HTTP header support
  • version 1.4 - 10/12/2014: url-to-image PhantomJS script integration + few bugs corrected
  • version 1.3 - 08/05/2014: Windows support + few bugs corrected
  • version 1.2 - 04/27/2014: Few bugs corrected
  • version 1.1 - 04/21/2014: Changed the script to use PhantomJS instead of the buggy wkhtml binary
  • version 1.0 - 01/12/2014: Initial commit

Copyright and license

webscreenshot is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

webscreenshot is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with webscreenshot. If not, see http://www.gnu.org/licenses/.

Contact

  • Thomas Debize < tdebize at mail d0t com >