pace-commoncrawl-scanner

Scans CommonCrawl datasets for keywords. On an Amazon EC2 c5n.16xlarge instance, it scans a whole month of CommonCrawl data for hundreds of keywords in about 4 hours. Developed with support from the EU and the Populism & Civic Engagement H2020 project.

Setup steps for installing on an AWS Ubuntu 20.04 instance

wget -O- https://apt.corretto.aws/corretto.key | sudo apt-key add - 
sudo add-apt-repository 'deb https://apt.corretto.aws stable main'
sudo apt-get update; sudo apt-get install -y java-15-amazon-corretto-jdk

sudo apt install build-essential cmake libboost-all-dev ragel maven
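Before building, it can help to confirm that the tools installed above are on the PATH. The following is a minimal sketch, not part of this repository: `missing_tools` is a hypothetical helper name, and the list of commands in the usage comment is an assumption about which binaries the packages above provide.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every given command that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
    fi
  done
}

# Example usage (assumed command names for the packages installed above):
#   missing_tools java cmake ragel mvn g++ make git
```

If the function prints nothing, every listed command was found.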

git clone https://github.com/intel/hyperscan.git
cd hyperscan
cmake -DBUILD_SHARED_LIBS=YES .
make 
sudo make install
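After installing a shared library under /usr/local, the dynamic linker cache may need a refresh with `sudo ldconfig` before the library can be found. The sketch below only checks whether the Hyperscan library (`libhs`) appears in the output of `ldconfig -p`. `lib_listed` is a hypothetical helper name, and whether this project needs the refresh is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: succeed if lib<name>.so appears in the listing read from stdin.
lib_listed() {
  grep -q "lib$1\.so"
}

# Example usage (run after `sudo ldconfig`):
#   ldconfig -p | lib_listed hs && echo "libhs is visible to the dynamic linker"
```

This confirms only that the loader can see the library, not that the scanner itself loads it correctly.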

cd

git clone https://github.com/CitizensFoundation/pace-commoncrawl-scanner.git
cd pace-commoncrawl-scanner
mvn clean package

mkdir /home/ubuntu/pace-commoncrawl-scanner/results

cd /home
sudo ln -s ubuntu/ robert

cd
cd pace-commoncrawl-scanner

Download the page ranks file and convert it into the condensed format

processScripts/getLatestPageRanking.sh 2020 11 https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2020-jul-aug-sep/host/cc-main-2020-jul-aug-sep-host-ranks.txt.gz
processScripts/processHostRanksFile.sh 2020 11

Step 1 - Download the files list

processScripts/getLatestWetPathsAndDownloadAll.sh 2020 11 https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2020-50/wet.paths.gz 72000
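For orientation, the wet.paths.gz listing contains one relative path per line, and each path is turned into a download URL by prefixing the CommonCrawl host. The sketch below illustrates that mapping only. `wet_paths_to_urls` and `CC_BASE_URL` are hypothetical names, and how getLatestWetPathsAndDownloadAll.sh actually builds its URLs is not shown here.

```shell
#!/bin/sh
# Assumed base URL, taken from the host used in the commands above.
CC_BASE_URL="https://commoncrawl.s3.amazonaws.com"

# Hypothetical helper: read relative WET paths from stdin, print full URLs.
wet_paths_to_urls() {
  while IFS= read -r path || [ -n "$path" ]; do
    # Skip blank lines
    if [ -n "$path" ]; then
      printf '%s/%s\n' "$CC_BASE_URL" "$path"
    fi
  done
}

# Example usage: preview the first three URLs of the listing
#   curl -s "$CC_BASE_URL/crawl-data/CC-MAIN-2020-50/wet.paths.gz" | gunzip -c | head -n 3 | wet_paths_to_urls
```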

Step 2 - Download, gunzip and scan the files

processScripts/scan.sh 2020 11

Step 3 - Import into Elasticsearch (can be done in parallel with step 2)

processScripts/importToES.sh 2020 11
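Since steps 2 and 3 can run in parallel, one way to start both from a single shell is sketched below. `run` and `scan_and_import` are hypothetical helper names, and `DRY_RUN` is an illustrative variable that is not used by the project scripts. Whether importToES.sh picks up results written after it starts is not stated here, so treat the simultaneous start as an assumption.

```shell
#!/bin/sh
# Hypothetical helper: with DRY_RUN=1, print the command instead of running it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Start the scan and the import in the background, then wait for both.
# Returns non-zero if either step fails.
scan_and_import() {
  year="$1"
  month="$2"
  run processScripts/scan.sh "$year" "$month" &
  scan_pid=$!
  run processScripts/importToES.sh "$year" "$month" &
  import_pid=$!
  status=0
  wait "$scan_pid" || status=1
  wait "$import_pid" || status=1
  return "$status"
}

# Example usage, from the pace-commoncrawl-scanner directory:
#   scan_and_import 2020 11
```

Setting `DRY_RUN=1` before the call prints the two commands without running them, which is a cheap way to check the arguments.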

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 822337. Any dissemination of results here presented reflects only the consortium’s view. The Agency is not responsible for any use that may be made of the information it contains.