Primary language: Shell. License: ISC.

cloudflare-bypass-headless-web-scraper

Headless web-scraper template that bypasses Cloudflare's I'm Under Attack Mode (IUAM) protection. It runs on the X virtual framebuffer (Xvfb) and a modified version of the Perl module WWW::Mechanize::Chrome.

The modification of WWW::Mechanize::Chrome adds no new methods or functions. It consists of fixes and alterations, including the removal of functionality that this template does not need, such as Windows-specific requirements. The modified WWW::Mechanize::Chrome is provided here with all its dependencies; their total size, including WWW::Mechanize::Chrome itself, is less than 4.9 MB.

The main advantage of this modification is that the $mech->content_as_pdf(%options) rendering method works even when the browser is not running in its built-in headless mode; see metacpan.org/pod/WWW::Mechanize::Chrome#$mech->content_as_pdf(%options).

The only drawback of this modification is that the Imager module, which is used to process screenshots (e.g. in the $mech->content_as_png() method), is dropped. Instead, this template uses a combination of the xwd and convert utilities to achieve comparable results.
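The xwd and convert combination can be sketched as a small shell function. This is a minimal illustration, not the code in main.sh; the function name `capture_screen`, the default display `:99`, and the default output file name are assumptions.

```shell
# Capture the root window of an X display and write it as a PNG.
# capture_screen, the default display :99 and the default file name
# are illustrative assumptions, not names taken from main.sh.
capture_screen() {
    display="${1:-:99}"
    output="${2:-screenshot.png}"

    # xwd dumps the root window in XWD format to stdout;
    # convert (ImageMagick) reads that dump from stdin ("xwd:-").
    xwd -root -silent -display "$display" | convert xwd:- "$output"
}
```

With Xvfb already running on display `:99`, `capture_screen :99 page.png` would write the current screen contents to `page.png`.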

Note

If no screenshots of web pages are to be taken, xwd and convert (ImageMagick) do not need to be installed.
To prevent main.sh from installing them automatically, edit it accordingly.
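One way such an edit could look is to make the screenshot utilities optional when building the list of required commands. The sketch below is an assumption about how this might be structured, not an excerpt from main.sh; `required_tools`, `missing_tools`, and the yes/no argument are hypothetical names.

```shell
# List the utilities the template needs; screenshot tools are optional.
# required_tools, missing_tools and the yes/no argument are hypothetical.
required_tools() {
    printf '%s\n' perl Xvfb
    if [ "${1:-yes}" = "yes" ]; then
        printf '%s\n' xwd convert
    fi
}

# Print every given command name that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

Calling `missing_tools $(required_tools no)` would then report only the commands that still need installing when screenshots are not wanted.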

Unlike most other web scrapers, this template does not require a WebDriver, only a compatible Chromium or Google Chrome installation. If none is present on the user's operating system, one can be downloaded and installed automatically when main.sh is executed.
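Looking for an installed browser can be sketched with `command -v`. This is only an illustration of the idea; the function name `find_browser` and the default list of command names are assumptions, and main.sh may search differently and also checks version compatibility.

```shell
# Print the path of the first command from the given names found on PATH.
# The function name and the default candidate names are assumptions.
find_browser() {
    if [ "$#" -eq 0 ]; then
        set -- chromium chromium-browser google-chrome google-chrome-stable
    fi
    for name in "$@"; do
        if path=$(command -v "$name" 2>/dev/null); then
            printf '%s\n' "$path"
            return 0
        fi
    done
    # Nothing found: the caller can fall back to prompting or downloading.
    return 1
}
```

A caller could write `browser=$(find_browser) || echo "no browser found"` and continue with the prompts described in the execution phases.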

A WebDriver is not required because this template talks to Chromium or Google Chrome directly through the DevTools protocol over a local WebSocket connection. This reduces the number of dependencies.
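To illustrate the DevTools connection: a Chromium or Google Chrome instance started with `--remote-debugging-port` serves a JSON document at `/json/version` that contains a `webSocketDebuggerUrl` field. The sketch below extracts that URL; the function name and port 9222 are assumptions, and the bundled WWW::Mechanize::Chrome may establish its connection differently.

```shell
# Read the /json/version document on stdin and print the WebSocket URL.
# ws_url_from_json is a hypothetical helper, not part of this template.
ws_url_from_json() {
    sed -n 's/.*"webSocketDebuggerUrl"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}
```

With a browser listening on the assumed port, `curl -s http://127.0.0.1:9222/json/version | ws_url_from_json` would print a `ws://` address that a DevTools client can connect to.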

Perl is used because it is installed by default on most Linux systems and is efficient at processing text, which matters when a parser has to be written quickly. Finally, this template can be run in a Docker container with little to no modification.

Execution phases

  1. Any required utilities from the list perl5, Xvfb, xwd, convert that are not yet installed are installed. The external library archive is decompressed if it has not been decompressed yet.
  2. The current directory is searched for compatible browser executables. If any are found, a prompt is shown to either use the one with the latest version or proceed.
  3. If no browser executable was chosen in the previous phase, compatible browser executables that can be run as commands are searched for. If any are found, a prompt is shown to either use the one with the latest version or proceed.
  4. If no browser executable was chosen in the previous phase, a prompt is shown to either specify the absolute path (or command) of a compatible browser executable or proceed. The prompt is repeated for as long as the specified path or command is not compatible.
  5. If no browser executable was chosen in the previous phase, a prompt is shown to either download, install and use the latest Ungoogled Chromium automatically or interrupt execution.
  6. The compatible browser executable chosen in one of the previous phases is started on a virtual screen.
  7. scraper.pl is launched and connects to this browser over a local WebSocket connection.
  8. While scraper.pl runs, the https://cloudflare.com/ web page is loaded and saved as a .pdf file.
  9. The state of the virtual screen is captured with xwd and then converted to .png with convert.
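Phase 6 can be sketched as follows: start Xvfb, then start the browser with its DISPLAY pointing at the virtual screen. This is a rough outline under stated assumptions, not the actual main.sh logic; the function name, the display `:99`, the screen geometry, and the debugging port 9222 are illustrative.

```shell
# Start a virtual screen and a browser on it (sketch of phase 6).
# Function name, display number, geometry and port are assumptions.
start_browser_on_xvfb() {
    browser="$1"
    display="${2:-:99}"
    port="${3:-9222}"

    # Xvfb provides an in-memory X display, so no physical screen is needed.
    Xvfb "$display" -screen 0 1280x1024x24 &
    XVFB_PID=$!
    sleep 1  # crude wait for the display to come up

    # The browser runs in normal (non-headless) mode, but draws on Xvfb.
    # A throwaway profile directory keeps it apart from any user profile.
    DISPLAY="$display" "$browser" --remote-debugging-port="$port" \
        --user-data-dir="$(mktemp -d)" about:blank &
    BROWSER_PID=$!
}
```

After the scraper has finished, `kill "$BROWSER_PID" "$XVFB_PID"` would shut both processes down.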

Structure

[ 1.1M] cloudflare-bypass-headless-web-scraper
!! [  753] LICENSE.txt
~~ [ 4.5K] README.md
~~ [ 1.1M] extlib.tar.bz2
++ [ 6.0K] main.sh
~~ [ 1.0K] scraper.pl

5 files, 1 directory

Installation

git clone -q https://github.com/faraui/cloudflare-bypass-headless-web-scraper.git && \
cd cloudflare-bypass-headless-web-scraper && \
chmod ugo+x main.sh

Launch

./main.sh