Headless web-scraper template that bypasses Cloudflare IUAM (I'm Under Attack Mode) protection. It runs on the X virtual framebuffer (Xvfb) and a modified version of the Perl module WWW::Mechanize::Chrome.
The modification adds no new methods or functions, but it includes many fixes and alterations, among them the removal of functionality that this template does not need, such as Windows-specific requirements. The modified WWW::Mechanize::Chrome is provided here with all its dependencies; their total size, including WWW::Mechanize::Chrome itself, is less than 4.9 MB.
The main advantage of this modification is that the $mech->content_as_pdf(%options) content rendering method works successfully even if the browser is not run in its built-in headless mode; see metacpan.org/pod/WWW::Mechanize::Chrome#$mech->content_as_pdf(%options).
The one and only inconvenience of this modification is that the Imager module, which is used to process screenshots (e.g. in the $mech->content_as_png() method), is dropped. However, this template uses a combination of the xwd and convert utilities instead to achieve comparable results.
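The xwd and convert combination mentioned above could look roughly like the following. This is a minimal sketch, not the code from main.sh: the function name `capture_screen`, the display number `:99` and the default output name are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: dump the root window of an X display and save it as PNG.
# Assumes an X server (e.g. Xvfb) is already running on the given display.
capture_screen() {
    display="${1:-:99}"            # assumed default display number
    output="${2:-screenshot.png}"  # assumed default output file
    # xwd writes an X Window Dump to stdout; convert reads it from stdin
    # ("xwd:-") and picks the output format from the file extension.
    xwd -root -silent -display "$display" | convert xwd:- "$output"
}
```

With Xvfb running on display `:99`, calling `capture_screen :99 page.png` would write the current virtual screen contents to `page.png`.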
Note
If no web-page screenshots are to be taken, xwd and convert (ImageMagick) need not be installed.
To prevent their automatic installation, edit main.sh accordingly.
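One way such an optional dependency check could be organised is sketched below. It is an illustration only: `WITH_SCREENSHOTS`, `required_tools` and `missing_tools` are hypothetical names, and the actual main.sh may be structured differently.

```shell
#!/bin/sh
# Hypothetical sketch; none of these names are taken from main.sh.
# Prints the list of tools that must be present. The screenshot tools are
# included only when WITH_SCREENSHOTS is 1 (the assumed default).
required_tools() {
    tools="perl Xvfb"
    if [ "${WITH_SCREENSHOTS:-1}" = "1" ]; then
        tools="$tools xwd convert"
    fi
    printf '%s\n' "$tools"
}

# Prints, one per line, every required tool that is not found in PATH.
missing_tools() {
    for tool in $(required_tools); do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

Setting `WITH_SCREENSHOTS=0` before calling `missing_tools` would leave xwd and convert out of the check, so they would never be scheduled for installation.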
Unlike most other web scrapers, this template does not require the user to have a particular WebDriver, only a compatible Chromium or Google Chrome instance. If none is present on the user's operating system, one can be downloaded and installed automatically when main.sh is executed.
WebDriver is not required because this template works with Chromium or Google Chrome directly via DevTools over a local WebSocket connection. This reduces the number of dependencies.
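To illustrate the DevTools connection: when Chromium or Google Chrome is started with `--remote-debugging-port`, it serves a small JSON document at `/json/version` that contains the WebSocket endpoint. The sketch below extracts that endpoint. The port `9222` and the helper name `ws_url_from_json` are assumptions; how this template actually configures the port is not stated here.

```shell
#!/bin/sh
# Hypothetical helper: read the JSON served at /json/version on stdin and
# print the value of its "webSocketDebuggerUrl" field.
ws_url_from_json() {
    sed -n 's/.*"webSocketDebuggerUrl"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}
```

Assuming the browser was started with `--remote-debugging-port=9222`, the endpoint could be read with `curl -s http://127.0.0.1:9222/json/version | ws_url_from_json`.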
Perl is used because it is installed by default on most Linux systems and is efficient at processing text, which is essential for writing a parser quickly. Finally, this template can be successfully executed in a Docker container with little to no modification.
- Installation of any required utilities that are not yet installed, from the list: perl5, Xvfb, xwd, convert. Decompression of the external library archive if it has not been decompressed yet.
- Search for compatible browser executables in the current directory. If any are found, a prompt to use the one with the latest version or to proceed is shown.
- If no browser executable was selected in the previous phase, search for compatible browser executables that can be run as commands. If any are found, a prompt to use the one with the latest version or to proceed is shown.
- If no browser executable was selected in the previous phase, a prompt to either specify the absolute path (or command) of a compatible browser executable or to proceed is shown. The prompt is repeated for as long as the specified path or command is not compatible.
- If no browser executable was selected in the previous phase, a prompt to either download, install and use the latest Ungoogled Chromium automatically or to interrupt execution is shown.
- The compatible browser executable specified in one of the prior phases is executed on a virtual screen.
- scraper.pl is launched and connected to this browser via local WebSockets.
- During scraper.pl execution, the https://cloudflare.com/ web page is loaded and saved as .pdf.
- The virtual screen state is saved via xwd and then converted to .png via convert.
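The "latest version" choice in the browser-search phases could be made along the following lines. This is a minimal sketch rather than the logic of main.sh: the function name `latest_browser` and the `VERSION PATH` input format are hypothetical, and four-part dotted version numbers are assumed.

```shell
#!/bin/sh
# Hypothetical helper: read "VERSION PATH" lines on stdin (for example
# "120.0.6099.109 /usr/bin/chromium") and print the path of the entry
# with the highest version.
latest_browser() {
    # Sort numerically on each of the four dot-separated version parts,
    # keep the last (highest) line, then drop the leading version field.
    sort -t. -k1,1n -k2,2n -k3,3n -k4,4n | tail -n 1 | cut -d' ' -f2-
}
```

The version of each candidate can usually be obtained from the browser itself, since Chromium and Google Chrome print a line such as `Chromium 120.0.6099.109` when run with `--version`.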
[ 1.1M] cloudflare-bypass-headless-web-scraper
!! [ 753] LICENSE.txt
~~ [ 4.5K] README.md
~~ [ 1.1M] extlib.tar.bz2
++ [ 6.0K] main.sh
~~ [ 1.0K] scraper.pl
5 files, 1 directory
```shell
git clone -q https://github.com/faraui/cloudflare-bypass-headless-web-scraper.git && \
cd cloudflare-bypass-headless-web-scraper && \
chmod ugo+x main.sh
./main.sh
```