cc-nutch-example

Apache Nutch example project to archive content in WARC files

Primary language: Shell. License: Apache License 2.0 (Apache-2.0).

Example Usage of Common Crawl's Fork of Apache Nutch to Crawl and Write WARC files

Requirements and installation

sudo apt install libcld2-0 libcld2-dev ant maven
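Before compiling, it can help to confirm that the build tools are actually on the PATH. The following is a minimal sketch, not part of the project: the helper name `missing_tools` is hypothetical, and the tool list (git, ant, mvn) is taken from the commands used in this README.

```shell
#!/bin/sh
# Hypothetical helper (not part of this project): print every
# given command name that cannot be found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# git, ant and mvn are the tools invoked by the steps in this README.
missing="$(missing_tools git ant mvn)"
if [ -n "$missing" ]; then
  # $missing is left unquoted on purpose so the names print on one line.
  echo "Missing tools:" $missing >&2
else
  echo "All build tools found."
fi
```

This only checks the executables. The libcld2 packages are native libraries and are not covered by it.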

Compile Nutch and required projects

git clone git@github.com:crawler-commons/crawler-commons.git
cd crawler-commons/
mvn install
cd ..

git clone git@github.com:commoncrawl/language-detection-cld2.git
cd language-detection-cld2/
mvn install
cd ..

git clone https://github.com/commoncrawl/nutch.git nutch-cc
cd nutch-cc/
ant runtime
cd ..
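After the build, a quick check that `ant runtime` produced the local runtime can save a failed crawl later. This is a sketch under the assumption that the checkout lives in `nutch-cc/` as above. `runtime/local/bin/nutch` is where a standard Nutch build places its launcher script, and the helper name `check_runtime` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given Nutch checkout contains
# an executable launcher script under runtime/local/bin/.
check_runtime() {
  [ -x "$1/runtime/local/bin/nutch" ]
}

if check_runtime nutch-cc; then
  echo "Nutch runtime found in nutch-cc/runtime/local"
else
  echo "No Nutch runtime in nutch-cc/; re-run 'ant runtime' there" >&2
fi
```

The two Maven projects are not checked here; `mvn install` places their artifacts in the local Maven repository (by default `~/.m2/repository`).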

Run crawl

echo -e "https://nutch.apache.org/\tnutch.score=1.0" >urls.txt
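The seed file holds one URL per line, followed by tab-separated metadata such as `nutch.score`. Because `echo -e` behaves differently across shells, the same file can be written with `printf`. The sketch below is an alternative to the command above, not the project's own method: the helper name `seed_line` is hypothetical, and the second URL is only a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: format one seed entry as
#   <url><TAB>nutch.score=<score>
seed_line() {
  printf '%s\tnutch.score=%s\n' "$1" "$2"
}

# Same content as the echo command above.
seed_line "https://nutch.apache.org/" "1.0" > urls.txt

# Further seeds can be appended; this URL is a placeholder.
seed_line "https://example.org/" "0.5" >> urls.txt
```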

./crawl.sh crawl 3 urls.txt
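Once the crawl has finished, the WARC files can be sanity-checked by counting their records. This is a rough sketch, not part of the project. It assumes the files are gzip-compressed and named `*.warc.gz`, and that they sit somewhere below the directory given in `WARC_DIR` (a hypothetical variable, since this README does not state the output location). Records are counted by matching the `WARC/1.` version line that starts each record, so a payload line that happens to begin with the same text would be counted too.

```shell
#!/bin/sh
# Hypothetical helper: count records in one gzip-compressed WARC file
# by counting lines that start with the WARC version marker.
count_warc_records() {
  gzip -dc "$1" | grep -a -c '^WARC/1\.'
}

# WARC_DIR is an assumption; point it at the crawl output directory.
WARC_DIR="${WARC_DIR:-.}"

# Print "<file><TAB><record count>" for every WARC file found.
find "$WARC_DIR" -type f -name '*.warc.gz' | while IFS= read -r f; do
  printf '%s\t%s\n' "$f" "$(count_warc_records "$f")"
done
```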