/nutch-test

Examples of running Nutch with Solr, Selenium Hub, and standalone web drivers.

Installing Nutch

Option 1: Nutch only

docker build --force-rm -t nutch .

Option 2: selenium hub + nutch + solr

Starts a Selenium hub with 10 Chrome nodes and 10 Firefox nodes, each in headless mode:

docker-compose -f docker-compose_selenium_nutch_solr.yaml up -d --scale chrome=10 --scale firefox=10

Option 3: nutch + solr

docker-compose -f docker-compose_nutch_solr.yaml up -d

Option 4: selenium hub + nutch + solr + tor instances

docker-compose -f docker-compose_selenium_nutch_solr_tor.yaml up -d --scale firefox=40

Installing Chrome Driver

Use this option when you are not using Selenium Hub.

  1. Install Chrome browser:
  • Edit sources.list
vi /etc/apt/sources.list
# add at the bottom of the file
deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main
  • Download the signing key
wget https://dl.google.com/linux/linux_signing_key.pub
apt-key add linux_signing_key.pub
  • Install the stable version of Google Chrome
apt update
apt install google-chrome-stable

NB You may need to upgrade and then update your packages:

apt upgrade
apt update
  2. Download ChromeDriver from the download page:
cd ~
wget https://chromedriver.storage.googleapis.com/2.44/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
rm chromedriver_linux64.zip
  3. If necessary, change the path to the ChromeDriver binary in nutch-default.xml or nutch-site.xml by setting the value of selenium.grid.binary.
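
Step 3 can be sketched as follows. This is a minimal sketch, assuming the driver was unzipped into /root (the home directory of the root user inside the container); nutch_property is a hypothetical helper, not part of Nutch, that prints a property element you can paste inside the configuration element of conf/nutch-site.xml.

```shell
# Hypothetical helper (not part of Nutch): print a <property> element
# for pasting inside <configuration> in conf/nutch-site.xml.
# Values are not XML-escaped, so avoid '&' and '<' in them.
nutch_property() {
  printf '<property>\n  <name>%s</name>\n  <value>%s</value>\n</property>\n' "$1" "$2"
}

# Assumed driver location: adjust to wherever chromedriver was unzipped.
nutch_property selenium.grid.binary /root/chromedriver
```

The same helper can be reused for the Firefox and Opera drivers by changing only the path.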

Installing Firefox Driver

Use this option when you are not using Selenium Hub.

  1. Install Firefox browser:
apt install firefox
  2. Download geckodriver from the download page:
cd ~
wget https://github.com/mozilla/geckodriver/releases/download/v0.23.0/geckodriver-v0.23.0-linux64.tar.gz
tar -zxvf geckodriver-v0.23.0-linux64.tar.gz
rm geckodriver-v0.23.0-linux64.tar.gz
  3. If necessary, change the path to the geckodriver binary in nutch-default.xml or nutch-site.xml by setting the value of selenium.grid.binary.

Installing Opera Driver

Use this option when you are not using Selenium Hub.

  1. Install the Opera browser by downloading the latest version:
wget http://download4.operacdn.com/ftp/pub/opera/desktop/56.0.3051.99/linux/opera-stable_56.0.3051.99_amd64.deb
dpkg -i opera-stable_56.0.3051.99_amd64.deb
apt install -f

NB Update to the appropriate Opera version.

  2. Download the Opera driver from the download page:
cd ~
wget https://github.com/operasoftware/operachromiumdriver/releases/download/v.2.40/operadriver_linux64.zip
unzip operadriver_linux64.zip
rm operadriver_linux64.zip
mv operadriver_linux64/operadriver /root
chmod +x /root/operadriver
  3. If necessary, change the path to the Opera driver binary in nutch-default.xml or nutch-site.xml by setting the value of selenium.grid.binary.

Run a test

  1. Set the value of selenium.driver in conf/nutch-site.xml to the Selenium driver you want to test.
  2. If no screen is attached to the server, set selenium.enable.headless to true.
  3. Run the crawl:
# connect to the nutch container
docker exec -it nutch bash

# execute the crawl
/root/nutch/bin/crawl -i -D solr.server.url=http://solr:8983/solr/mycore -s urls crawler 1
  4. Check the result:
  • Open Solr in your browser: localhost:8983/
  • Navigate to the created core mycore.
  • Execute the default query to fetch all documents:
*:*
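
The same check can be sketched from the command line. This is a minimal sketch, assuming Solr is published on localhost:8983 as in the step above and that the core is named mycore; solr_select_url is a hypothetical helper that only builds the URL of Solr's select handler.

```shell
# Hypothetical helper: build the select URL for a Solr core.
# $1 = base URL, $2 = core name, $3 = query string (must already be URL-safe)
solr_select_url() {
  printf '%s/solr/%s/select?q=%s\n' "${1%/}" "$2" "$3"
}

# Quote the query so the shell does not expand the asterisks.
url="$(solr_select_url http://localhost:8983 mycore '*:*')"
echo "$url"
```

With the stack running, curl -s "$url" should return the indexed documents; an empty result set suggests the crawl did not index anything.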

Hints

Regarding redirects: to have the fetcher follow redirects immediately, set http.redirect.max to a positive value (e.g., 3). For quick testing, you can also set the required parameters on the command line, e.g.:

% bin/nutch parsechecker -Dplugin.includes='protocol-selenium|parse-tika' \
   -Dselenium.grid.binary=.../geckodriver \
   -Dselenium.enable.headless=true \
   -followRedirects \
   -dumpText https://nutch.apache.org
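
The redirect limit can likewise be passed to the crawl script used in "Run a test". This is a minimal sketch, assuming the script accepts more than one -D option; crawl_cmd is a hypothetical helper that only prints the command (a dry run), so nothing is fetched until you run the printed line inside the nutch container.

```shell
# Hypothetical helper: print the crawl command with a redirect limit added.
# $1 = value for http.redirect.max
crawl_cmd() {
  printf '%s\n' "/root/nutch/bin/crawl -i -D solr.server.url=http://solr:8983/solr/mycore -D http.redirect.max=$1 -s urls crawler 1"
}

# Dry run: show the command that would follow up to 3 redirects per page.
crawl_cmd 3
```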

License

License: MIT