.
├── config # Relevant configuration files
├── cookie_crawler # Crawler package
│ ├── commands # Commands to detect and explore cookie banners
│ ├── scripts # JavaScript code needed for the crawl
│ ├── utils # Crawl-related utility functions
│ ├── crawl_summary.py # Script to show a summary of the crawl results
│ └── run_crawler.py # Script to run the crawler
├── classifiers # Package for ML models
├── database # Database package
├── docker # Contains Docker-related configuration files
├── domains # Cache directory for CrUX and Tranco domains
├── experiments # Directory where the crawl data is dumped
├── shared_utils # Contains general utility functions
└── run.sh # Script to run the crawler, predictions and observatory with Docker
1. Make sure to have Docker and Docker Compose installed. To run Docker without root privileges, add your user to the `docker` group (`sudo usermod -a -G docker $USER`), then exit your current terminal session and start a new one for the changes to take effect.
2. The crawler uses ports 5000, 5001 and 5432. Please make sure these ports are not used by other processes, or adapt `docker/docker-compose.yaml` to use different ports (a quick check is sketched after this list).
3. Crawl parameters can be modified in `config/experiment_config.yaml`. Alternatively, any parameter `param` can be overridden with the command-line argument `--<param>`. For boolean parameters, use the flags `--<param>` or `--no_<param>`. For list parameters, provide each element of the list separately: `--<param> val1 --<param> val2`. For example, to set `explore_cookie_banner` to `False`, `num_browsers` to `5`, and `countries` under `domains_config` to `["de", "fr"]`, use the following arguments (see the combined example after this list):

   `--no_explore_cookie_banner --num_browsers 5 --countries de --countries fr`
4. By default, crawl results are saved to a local PostgreSQL database. Change the `DB_*` parameters in `.env` to use a different one (an illustrative `.env` sketch follows this list).
5. An API key is needed to download domains from the Chrome User Experience Report (CrUX). Follow this link to create one, and specify its path in `GOOGLE_APPLICATION_CREDENTIALS` under `.env`. Alternatively, if you run the crawl with the default `domains_config` parameters, it will use a cached list under the `domains/` directory. You can also use a custom crawling list with the arguments `--domains_source list` and `--domains_path domains/custom_domains.txt`. Please refer to `domains/custom_domains.txt` for an example of such a list.
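As a quick sanity check for steps 1 and 2, the commands below (a sketch, assuming a Linux host where `ss` is available; these helpers are not part of the repository) verify that Docker runs without root and that the required ports are free:

```bash
# Verify that Docker works without root privileges.
docker run --rm hello-world

# Verify that Docker Compose is available (v2 syntax; v1 uses `docker-compose --version`).
docker compose version

# Check that nothing is already listening on ports 5000, 5001 or 5432; no output means they are free.
ss -ltn | grep -E ':(5000|5001|5432)\b'
```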
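For step 3, the overrides are simply appended to the `./run.sh` invocation described below; for instance, combining the example arguments above with the crawler entry point:

```bash
# Crawl with cookie banner exploration disabled, 5 parallel browsers,
# and the CrUX country list restricted to Germany and France.
./run.sh --crawler --no_explore_cookie_banner --num_browsers 5 --countries de --countries fr
```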
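For steps 4 and 5, the relevant settings live in `.env`. The snippet below is only a sketch: apart from `GOOGLE_APPLICATION_CREDENTIALS`, the variable names and values are hypothetical placeholders, so check the `.env` file shipped with the repository for the exact `DB_*` keys and defaults.

```bash
# Illustrative .env entries. Except for GOOGLE_APPLICATION_CREDENTIALS, the
# variable names below are placeholders, not necessarily the repository's exact keys.
DB_HOST=localhost           # PostgreSQL host storing the crawl results (placeholder)
DB_PORT=5432                # should match the port exposed in docker/docker-compose.yaml
DB_USER=crawler             # database user (placeholder)
DB_PASSWORD=change_me       # database password (placeholder)
DB_NAME=crawl               # database name (placeholder)

# Path to the Chrome User Experience Report (CrUX) API credentials file.
GOOGLE_APPLICATION_CREDENTIALS=/path/to/crux_credentials.json
```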
Use the following command to run the crawl:
./run.sh --crawler
The default parameters correspond to those used in the main crawl that we present in the paper.
You can use the `--num_domains` argument to sample a subset of websites to crawl instead of crawling the entire list.
For example, the following command crawls 10,000 websites sampled at random from the CrUX list we used for our crawl.
./run.sh --crawler --num_domains 10000
You can also change the number of parallel browsers using `--num_browsers`, and crawl a custom list of domains using the parameters `--domains_source` and `--domains_path`.
For example, the following command crawls the websites included in `domains/custom_domains.txt` using two parallel browsers.
./run.sh --crawler --domains_source list --domains_path domains/custom_domains.txt --num_browsers 2
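To crawl your own list instead, a minimal sketch (the file `domains/my_domains.txt` is hypothetical, and the one-domain-per-line layout is an assumption; `domains/custom_domains.txt` remains the authoritative format reference):

```bash
# Create a hypothetical custom crawl list (one domain per line is assumed;
# see domains/custom_domains.txt for the authoritative format).
cat > domains/my_domains.txt <<'EOF'
example.com
example.org
example.net
EOF

# Crawl only the domains in that file, using two parallel browsers.
./run.sh --crawler --domains_source list --domains_path domains/my_domains.txt --num_browsers 2
```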
Please check `config/experiment_config.yaml` for more details about crawl parameters and their default values.
Once the crawl finishes, get the corresponding experiment ID by running:
./run.sh --ls
Then, use the following command to run predictions on the crawl results:
./run.sh --predictor --experiment_id <experiment_id>
Finally, run the following command to display a summary of the crawl results:
./run.sh --summary --experiment_id <experiment_id>
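Putting these steps together, a typical end-to-end run looks like this (replace `<experiment_id>` with the ID printed by `./run.sh --ls`):

```bash
# 1. Run the crawl with the default parameters.
./run.sh --crawler

# 2. Once the crawl finishes, list experiments and note the experiment ID.
./run.sh --ls

# 3. Run predictions, then display a summary of the crawl results.
./run.sh --predictor --experiment_id <experiment_id>
./run.sh --summary --experiment_id <experiment_id>
```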
- When running `run.sh`, please ignore the `WARN[0000] YOUR_PATH/docker-compose.yaml: version is obsolete` warning. It can be resolved by removing the first line of `docker/docker-compose.yaml`, but we keep that line for backward compatibility as it does not affect functionality.
- During the crawl, please ignore the following error message, which is frequently displayed: `psycopg2.OperationalError: server closed the connection unexpectedly. This probably means the server terminated abnormally before or while processing the request.`
- The crawler may run into other errors if it is run with limited resources.
Use the following command to run the ML evaluation:
./run.sh --ml-eval