This project is a web scraper that fetches articles from the Jamaica Observer news website. It uses FastAPI for the web server and Selenium for the web scraping.
.
├── .dockerignore
├── .DS_Store
├── .envexample
├── .gitignore
├── api
│   └── app
│       ├── __pycache__
│       ├── db_utils.py
│       ├── main.py
│       └── robots.txt
├── database
│   ├── __init__.py
│   ├── __pycache__
│   └── database.py
├── docker-compose.yml
├── Dockerfile
├── main.py
├── requirements.txt
├── start.sh
└── storage
- Clone the repository.
- Install the dependencies by running:
  pip install -r requirements.txt
- Set up your environment variables in a .env file. You'll need to set CHROMEDRIVER_PATH, BASE_URL_NEWS, and BASE_URL.
- Build the Docker image by running:
  docker build -t jamaica-observer-scraper .
- Run the Docker container by running:
  docker run -p 1876:1876 jamaica-observer-scraper
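The three variables named in the setup steps could be read and validated at startup roughly as follows. This is a minimal sketch, not the project's actual code: the Settings class and the load_settings helper are hypothetical names, and the sketch assumes the variables are already present in the process environment (for example exported by the shell or passed in by Docker), since the steps above do not say how the .env file is loaded.

```python
import os
from dataclasses import dataclass

# Variable names taken from the setup steps; everything else is illustrative.
REQUIRED_VARS = ("CHROMEDRIVER_PATH", "BASE_URL_NEWS", "BASE_URL")


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the scraper's configuration."""
    chromedriver_path: str
    base_url_news: str
    base_url: str


def load_settings(env=None) -> Settings:
    """Read the required variables, failing early if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return Settings(
        chromedriver_path=env["CHROMEDRIVER_PATH"],
        base_url_news=env["BASE_URL_NEWS"],
        base_url=env["BASE_URL"],
    )
```

Failing at startup with the list of missing names tends to be easier to debug than a Selenium error later on caused by an empty driver path.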
The scraper is started by running main.py. By default, it only displays articles that have a date, body, and image. To display all articles, set the show_all parameter to True when calling scrape_jamaica_observer().
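The default filtering described above can be illustrated with a small sketch. The real signature and return type of scrape_jamaica_observer() are not shown here, so this example uses a hypothetical filter_articles helper and assumes each article is a dictionary with date, body, and image keys.

```python
from typing import Dict, List, Optional

# Assumed shape of a scraped article; the project's real structure may differ.
Article = Dict[str, Optional[str]]

REQUIRED_FIELDS = ("date", "body", "image")


def filter_articles(articles: List[Article], show_all: bool = False) -> List[Article]:
    """Keep only complete articles unless show_all is True."""
    if show_all:
        return list(articles)
    # An article counts as complete when every required field is non-empty.
    return [a for a in articles if all(a.get(field) for field in REQUIRED_FIELDS)]


sample = [
    {"title": "A", "date": "2024-01-01", "body": "Text", "image": "a.jpg"},
    {"title": "B", "date": None, "body": "Text", "image": "b.jpg"},
]
complete_only = filter_articles(sample)
everything = filter_articles(sample, show_all=True)
```

Here complete_only contains only article "A", while everything contains both entries.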
Web scraping should be done responsibly. As of the time of writing, the Jamaica Observer's robots.txt file does not disallow scraping. However, always check the website's robots.txt file and terms of service to ensure that you're allowed to scrape it. Be respectful of the website's resources and don't overload their servers with too many requests at once. This scraper is intended for educational purposes and should not be used to violate any laws or terms of service.
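One way to put this advice into practice is to consult robots.txt before each request and pause between requests. The sketch below uses Python's standard urllib.robotparser module. The helper names, the one-second default delay, and the sample rules are illustrative assumptions; the sample rules are not the Jamaica Observer's actual robots.txt.

```python
import time
from typing import Callable, Iterable, List
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    parser: RobotFileParser,
    user_agent: str = "*",
    delay_seconds: float = 1.0,
) -> List[str]:
    """Fetch only URLs that robots.txt allows, sleeping after each request."""
    results = []
    for url in urls:
        if not parser.can_fetch(user_agent, url):
            continue  # Skip anything the site asks crawlers to avoid.
        results.append(fetch(url))
        time.sleep(delay_seconds)
    return results


# Hypothetical rules, used only to demonstrate the check.
SAMPLE_ROBOTS = "User-agent: *\nDisallow: /private/\n"
sample_parser = build_parser(SAMPLE_ROBOTS)
```

With these sample rules, sample_parser.can_fetch("*", "https://example.com/news/story") is allowed and any path under /private/ is skipped. In real use, the fetch argument would be whatever function loads a page, for example a wrapper around the Selenium driver.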