This repository contains the source code of the sign-of-life domain crawler.
├── .docker
│ └── data # Folder for input data of docker instances
├── app_domains # Source code of the crawler in particular:
│ ├── # General parameters of the crawler, including the input folder to scan "input_folder"
│ └── # Main script of the crawler
├── input # Input list of URLs + various necessary input files (registrar lists, trained ML models, etc.)
│ └── folder_X # The CSV files with the input lists of domains must be added to a folder in input
├── inter # Folder containing intermediary results
├── output # Folder containing output files with the crawler classification, 1 file/input file
├── docs # Documentation of the crawler
├── env_required # Specific library to install
└── logging # Logging folder if the option is activated in
System requirements:
- number of files opened/stored in one folder: 10k files (CONTROLLER_LIMIT parameter in
- memory while gathering website pages: ~8 GB RAM on average for 10k (CONTROLLER_LIMIT parameter in
- number of processes in parallel: must be less than the number of CPUs on your machine (MAX_PROCESSES and WORKERS_POST_PROCESSING in
- no antivirus, or antivirus deactivated (it may otherwise remove intermediary files that contain HTML pages)
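The process-count requirement above can be checked before starting a run. This is a minimal sketch, not part of the crawler: the helper name check_process_limits is hypothetical, the two values are passed in by hand rather than read from the crawler's configuration, and it assumes each setting is compared to the CPU count separately.

```python
import os


def check_process_limits(max_processes, workers_post_processing, cpu_count=None):
    """Return warnings for settings that reach or exceed the CPU count.

    Hypothetical helper: the arguments mirror MAX_PROCESSES and
    WORKERS_POST_PROCESSING, but this function is not part of the crawler.
    """
    if cpu_count is None:
        # os.cpu_count() may return None on some platforms; fall back to 1.
        cpu_count = os.cpu_count() or 1
    warnings = []
    for name, value in (
        ("MAX_PROCESSES", max_processes),
        ("WORKERS_POST_PROCESSING", workers_post_processing),
    ):
        # Assumption: each setting must individually stay below the CPU count.
        if value >= cpu_count:
            warnings.append(f"{name}={value} must be less than {cpu_count} CPUs")
    return warnings


if __name__ == "__main__":
    for warning in check_process_limits(max_processes=4, workers_post_processing=4):
        print(warning)
```

An empty list means both settings are below the number of CPUs detected on the machine.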
You can install all dependencies manually with Anaconda or use the containerised setup with Docker.
Create the .env file (cp .docker/.env.example .docker/.env) to configure the database connection.
The Docker setup includes a service for a PostgreSQL database, but the code also accepts remote connections.
After this, build the Docker images as usual (cd .docker && docker compose build).
Running the crawler with Docker allows you to deploy it as several instances (replicas).
By default, docker compose deploys 2 replicas, which need at least 16 GB of RAM.
To control the number of replicas, use --scale crawler=<number_of_instances>
In summary, to run the crawler with Docker:
docker compose up --scale crawler=<number_of_instances>
Every container scans the data/*.csv files for URLs (data is a directory shared with .docker/data on the host).
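The memory figure given for the default deployment (16 GB for 2 replicas) suggests a rough way to pick a value for --scale. The sketch below assumes about 8 GB per replica, derived from that figure, and ignores the memory used by the database service and the host; max_replicas is a hypothetical helper, not part of the repository.

```python
def max_replicas(total_ram_gb, ram_per_replica_gb=8):
    """Estimate how many crawler replicas fit in the available RAM.

    Assumption: each replica needs about 8 GB, derived from the stated
    requirement of at least 16 GB for the default 2 replicas.
    """
    if total_ram_gb < 0 or ram_per_replica_gb <= 0:
        raise ValueError("RAM figures must be non-negative and per-replica RAM positive")
    # Floor division: a partial replica's worth of RAM does not count.
    return int(total_ram_gb // ram_per_replica_gb)


if __name__ == "__main__":
    replicas = max_replicas(32)
    print(f"docker compose up --scale crawler={replicas}")
```

Treat the result as an upper bound and leave some headroom, since the 8 GB figure is an average.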
Download the latest Anaconda at
Install it and tick the box to add conda to the PATH.
At the end of this step, the command "conda" must be recognized by the command line.
In Windows:
- Download and install PostgreSQL at
In Linux, run:
sudo apt-get install -y libpq-dev
In Windows:
- If not already installed, download and install Chrome at
In Linux:
- for a system-wide Chrome, run the following commands:
sudo apt install ./google-chrome-stable_current_amd64.deb
Get the chromedriver link at
Make sure the version matches your version of Chrome (available by typing in Chrome address bar: chrome://settings/help)
In Windows:
- Download and unzip the driver in a location of your choice. Then add that location to the PATH variable.
In Linux:
for example with version 83:
cp ./chromedriver /usr/bin/chromedriver
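The version match between Chrome and chromedriver mentioned above can be verified with a small comparison. This sketch assumes that comparing the major version (the first number, such as 83) is sufficient; the function names and the version strings in the usage example are illustrative and not taken from the repository.

```python
def major_version(version):
    """Return the leading (major) component of a dotted version string."""
    return int(version.strip().split(".")[0])


def versions_match(chrome_version, chromedriver_version):
    """True when Chrome and chromedriver share the same major version.

    Assumption: matching on the major version only is enough for the
    driver to work with the installed browser.
    """
    return major_version(chrome_version) == major_version(chromedriver_version)


if __name__ == "__main__":
    # Illustrative version strings; replace them with the versions on your machine.
    print(versions_match("83.0.4103.116", "83.0.4103.39"))
```

The Chrome version can be read from chrome://settings/help, as described above.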
In Windows:
- Download and install Visual Studio Build Tools at
When installing, make sure the box "C++ build tools" is ticked.
In Linux:
sudo apt install build-essential
To install the required libraries, run the following commands:
conda create -n centr python=3.7
conda activate centr
python -m pip install -U pip
pip install -r requirements_before_torch.txt
pip install torch==1.3.1+cpu -f
pip install -r requirements_after_torch.txt
To resolve some remaining version conflicts:
pip install -r requirements_final.txt
For word forms library:
navigate to "sign_of_life/env_required/word_forms-master"
conda activate centr
(if not already activated)
python install
The domains are classified into the following categories:
Put the list of domains to scan in one or more CSV files. Each file must have a column named "url".
Move these files into a single folder (named "folder_X" in this example) in sign_of_life_crawler\input
In, update the parameter "input_folder" = join(MAIN_DIR, "input", "folder_X")
Run the main script:
At the end of the run, the output file is saved in output, with the same name as the input file, to which the prefix "final_" is added.
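The two conventions in the steps above (a "url" column in every input CSV, and the "final_" prefix on output files) can be sketched as follows. The helper names are hypothetical and not part of the crawler; the sketch assumes comma-separated files with a header row, and that the output keeps the input file's full name including its extension.

```python
import csv
import io
from pathlib import PurePath


def has_url_column(stream):
    """Check that the CSV header read from a text stream contains 'url'."""
    reader = csv.reader(stream)
    # An empty file has no header row; treat it as missing the column.
    header = next(reader, [])
    return "url" in [name.strip() for name in header]


def output_name(input_path):
    """Return the expected output file name: 'final_' + input file name."""
    return "final_" + PurePath(input_path).name


if __name__ == "__main__":
    # Illustrative in-memory CSV; real input files live in the input folder.
    sample = io.StringIO("url,registrar\nexample.org,acme\n")
    print(has_url_column(sample))
    print(output_name("input/folder_X/domains.csv"))
```

For a file on disk, open it with open(path, newline="") and pass the file object to has_url_column.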
Set the connection to the database in the .env file (cp .env.example .env).