UKCrawl is a Python library for retrieving and processing data from UK websites found in the Common Crawl archives. It retrieves raw data from AWS S3, then extracts named entities using Named Entity Recognition, extracts postcodes using DuckDB, and classifies webpages with a Hugging Face transformers model.
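The postcode-extraction step can be illustrated with a minimal sketch. UKCrawl itself does this with DuckDB, but the same idea in plain Python, using a simplified (illustrative, not fully validating) UK postcode pattern, looks like:

```python
import re

# Simplified UK postcode pattern: outward code (e.g. SW1A) + inward code
# (e.g. 2AA). Real postcode validation has more edge cases than this.
POSTCODE_RE = re.compile(r"\b[A-Z]{1,2}[0-9][A-Z0-9]? ?[0-9][A-Z]{2}\b")

def extract_postcodes(text: str) -> list[str]:
    """Return all postcode-like strings found in a block of page text."""
    return POSTCODE_RE.findall(text)

print(extract_postcodes("Contact us at 10 Downing Street, London SW1A 2AA."))
# → ['SW1A 2AA']
```

In the pipeline itself the equivalent pattern match runs as a DuckDB query over the crawled text rather than per-string in Python, which scales better across large archive files.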
This project uses Dagster to check daily for new Common Crawl archives. The list of years to retrieve is defined in `src/common/utils.py`; for any processed files that are missing, Dagster initiates a job to generate them. To run the orchestration pipeline:
- **Install Podman and Podman Compose**: ensure that Podman and Podman Compose are installed on your system. You can install them using your package manager or by following the instructions provided in their respective documentation.
- **Clone the Repository**: clone the UKCrawl repository from GitHub using the following command:

  ```sh
  git clone https://github.com/cjber/ukcrawl.git
  ```

- **Navigate to the Project Directory**: change your current directory to the root directory of the UKCrawl project:

  ```sh
  cd ukcrawl
  ```

- **Start the Containers**: use Podman Compose to start the containers defined in the `compose.yml` file:

  ```sh
  podman-compose up -d
  ```

- **Access the Dagster Web Interface**: once the containers are up and running, access the Dagster web interface by navigating to http://localhost:3000 in your web browser.