This project scrapes articles from a selected hub on habr.com. It automatically fetches articles, parses their content, and stores the results in a PostgreSQL database. The project runs in Docker containers for easy deployment.
- Automated Parsing: The parser runs every 10 minutes, fetching the main page of the chosen hub and extracting links to articles.
- Duplicate Prevention: Before saving articles, the parser checks for duplicates, ensuring each article is stored in the database only once.
- Database Storage: Parsed articles are stored in a PostgreSQL database using two related models (Author and Article).
- Django Admin Interface: The project includes a Django admin interface for managing and browsing the collected articles.
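The parsing and duplicate-prevention steps listed above can be illustrated with a minimal sketch. This is not the project's actual code: the names `ArticleLinkParser`, `extract_article_links`, and `filter_new` are hypothetical, the `/articles/<id>/` link pattern is an assumption about how hub pages link to articles, and a plain set stands in for the database lookup. It uses only the Python standard library so it can be run on its own.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: article links on a hub page end with "/articles/<numeric id>/".
ARTICLE_PATH = re.compile(r"/articles/\d+/?$")


class ArticleLinkParser(HTMLParser):
    """Collects absolute article URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and ARTICLE_PATH.search(href):
            url = urljoin(self.base_url, href)
            if url not in self.links:  # skip repeats within the same page
                self.links.append(url)


def extract_article_links(html, base_url="https://habr.com"):
    """Return article URLs found in the given HTML, in page order."""
    parser = ArticleLinkParser(base_url)
    parser.feed(html)
    return parser.links


def filter_new(links, known_urls):
    """Keep only links that are not already stored.

    known_urls is a stand-in for a database lookup (hypothetical).
    """
    known = set(known_urls)
    return [url for url in links if url not in known]


# Small inline page instead of a real HTTP request.
sample_html = """
<a href="/ru/articles/1001/">First</a>
<a href="/ru/articles/1002/">Second</a>
<a href="/ru/articles/1001/">First again</a>
<a href="/ru/hubs/python/">Hub link</a>
"""
links = extract_article_links(sample_html)
new_links = filter_new(links, known_urls={"https://habr.com/ru/articles/1001/"})
print(new_links)
```

In a Django setup, the set lookup would more likely be a query against the Article table or a unique constraint on the article URL; which of these the project uses is not stated here.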
Clone the repository from GitHub:
git clone https://github.com/purechromas/habr_project
Create a .env file based on the provided .env.example with your configurations.
Make sure Docker is installed on your machine.
Open a terminal and run the following command:
docker-compose -f docker-compose.yml up
To work with Django admin and manage articles, create an admin user with the following command:
docker-compose -f docker-compose.yml run django poetry run python manage.py createadminuser
Initiate the Habr parser manually using the command:
docker-compose -f docker-compose.yml run django poetry run python manage.py run_habr_parser