
Nutch-based website scraper

Primary language: Java. License: GNU Lesser General Public License v3.0 (LGPL-3.0).


Requirements

  1. Oracle Java 8
    • sudo add-apt-repository ppa:webupd8team/java
    • sudo apt-get update
    • sudo apt-get install oracle-java8-installer
    • sudo apt-get install oracle-java8-set-default
  2. Maven
    • sudo apt-get update
    • sudo apt-get install maven
  3. Node.js 6.x and npm
    • curl -sL https://deb.nodesource.com/setup_6.x | sudo -E bash -
    • sudo apt-get install -y nodejs
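
After the installs above, a quick check that the tools are on the PATH can save a failed build later. The helper below is a minimal sketch; the function name `missing_tools` is hypothetical and not part of this project.

```shell
# Report which of the given commands are missing from PATH.
# Prints one missing command per line; returns 0 only if all are present.
missing_tools() {
  local rc=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      rc=1
    fi
  done
  return "$rc"
}

# Usage, for the tools installed in the steps above:
#   missing_tools java mvn node npm || echo "install the tools listed above"
```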


Build

  1. Go to the project folder
  2. Run mvn clean install
  3. Find the new build at target/scraper.jar
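
The build steps above can be followed by a small check that the artifact actually exists. This is a sketch only; `verify_build` is a hypothetical helper name, and the jar path is the one given in step 3.

```shell
# Check that a Maven build produced the expected artifact.
# $1: project folder (defaults to the current directory).
verify_build() {
  local jar="${1:-.}/target/scraper.jar"
  if [ -f "$jar" ]; then
    echo "found $jar"
  else
    echo "missing $jar" >&2
    return 1
  fi
}

# Usage, from the project folder:
#   mvn clean install && verify_build .
```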

Build and deploy

  1. Install Ruby
    • sudo apt-get update
    • sudo apt-get install ruby
  2. Install Rake
    • sudo gem install rake
  3. Install Ant
    • sudo apt-get update
    • sudo apt-get install ant
  4. Clone repository framework_templates to the same parent folder
  5. Clone repository gexcloud as vagrant to the same parent folder
  6. Clone repository nutch-fork as nutch to the same parent folder
  7. Go to parent_folder/framework_templates/scraper
  8. Run rake build_nutch
  9. Run rake build_scraper
  10. Run sudo rake build
  11. Clone appstore-apps to the same parent folder
  12. Go to parent_folder/appstore-apps
  13. Increment version in /appstore-apps/apps/scraper/build_config.rb
  14. Run gex_env=main rake deploy:upload['scraper']
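
The rake tasks above rely on the repositories sitting side by side under one parent folder, using the folder names from steps 4 to 6 and 11. A minimal sketch of a layout check follows; `check_layout` is a hypothetical helper, and it only tests that the folders exist, not their contents.

```shell
# Verify that the sibling repositories sit next to each other.
# $1: parent folder (defaults to the parent of the current directory).
check_layout() {
  local parent="${1:-..}" rc=0 repo
  for repo in framework_templates vagrant nutch appstore-apps; do
    if [ ! -d "$parent/$repo" ]; then
      echo "missing: $parent/$repo" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Usage, from parent_folder/framework_templates/scraper:
#   check_layout ../.. && rake build_nutch
```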


Run

  1. Build the project

  2. The application needs Consul, Elasticsearch, and the Nutch REST API to be running.

  3. For logs, create the folder /usr/local/scraper and give all users write permission to it.

  4. Run the project from the main class (io.gex.scraper.api.Main) with two parameters: path_to_config and -dev

    • Config file example:

        {
          "appId": "1234",
          "webServerPort": 4567,
          "consulHost": "localhost",
          "consulPort": 8500,
          "nutchHost": "",
          "nutchPort": 8081,
          "defScrapArchJob": {
            "urls": null,
            "crawlIndexesHost": "http://index.commoncrawl.org",
            "warcFilesHost": "http://commoncrawl.s3.amazonaws.com/",
            "crawlLinksLimit": null,
            "fromYear": 2017,
            "toYear": 2017,
            "fetchThreadsNum": 32,
            "elastic": {
              "host": "",
              "port": 9300,
              "clusterName": null,
              "indexName": "scraper",
              "type": "scrap_old_data"
            }
          },
          "defScrapJob": {
            "urls": null,
            "depth": 2,
            "interval": 7200,
            "extractArticle": false,
            "elasticIndexName": "scraper"
          }
        }
  5. Open the application in a browser. By default, debug mode starts two web servers: a Java web server on port 4567 and a Node.js web server on port 3000, which proxies the Java web server so that assets can be added dynamically.
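
The run steps above could be scripted roughly as follows. This is a sketch under assumptions: both function names are hypothetical, and launching with `java -cp target/scraper.jar` assumes the jar bundles its dependencies, which is not confirmed here. The ports in the usage notes are the defaults named in step 5.

```shell
# Create the log folder and make it writable by all users.
# $1: folder path (the steps above use /usr/local/scraper, which needs sudo).
prepare_log_dir() {
  mkdir -p "$1" && chmod a+rwx "$1"
}

# Print the dev-mode launch command for a given config file.
# Assumption: target/scraper.jar contains all dependencies.
# $1: path to the config file (without spaces, since the result is re-parsed).
scraper_dev_cmd() {
  echo "java -cp target/scraper.jar io.gex.scraper.api.Main $1 -dev"
}

# Usage:
#   sudo sh -c 'mkdir -p /usr/local/scraper && chmod a+rwx /usr/local/scraper'
#   eval "$(scraper_dev_cmd path_to_config)"
#   curl -I http://localhost:4567   # Java web server
#   curl -I http://localhost:3000   # Node.js proxy
```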