The AI Web Scraper is a Python-based application that uses Streamlit for the frontend, Selenium for web scraping, and an LLM served through Ollama for natural language processing. It lets users scrape websites, extract their content, and parse it according to a plain-language description.
- Web Scraping: Leverages Selenium to scrape dynamic content from websites.
- Content Cleaning: Processes and cleans the scraped HTML content using BeautifulSoup.
- Natural Language Parsing: Uses Langchain with Ollama LLM to parse the content based on user input.
- Streamlit Interface: Provides a user-friendly interface for entering URLs, viewing content, and running parsing operations.
- CAPTCHA Handling: Uses Bright Data's service to handle CAPTCHA unblocking (Bright Data offers a free trial).
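The content-cleaning step above can be sketched with the standard library alone. The project itself uses BeautifulSoup, but the idea is the same: strip `<script>`/`<style>` blocks and keep only visible text (names here are illustrative, not the project's actual helpers):

```python
from html.parser import HTMLParser

class BodyTextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> blocks."""
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside skipped tags
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

def clean_html(html: str) -> str:
    parser = BodyTextExtractor()
    parser.feed(html)
    return "\n".join(parser._chunks)

html = "<body><script>var x=1;</script><h1>Title</h1><p>Some text.</p></body>"
print(clean_html(html))
# prints:
# Title
# Some text.
```

With BeautifulSoup the equivalent is roughly `soup.get_text(separator="\n", strip=True)` after decomposing script and style tags.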
To run this project locally, follow these steps:
- Clone the repository:
git clone https://github.com/Igorth/web-scraper-ai
cd web-scraper-ai
- Set up a virtual environment and activate it:
python -m venv .venv
source .venv/bin/activate # On Windows use: .venv\Scripts\activate
- Install the required dependencies:
pip install -r requirements.txt
- Set up environment variables by creating a `.env` file:
touch .env
- Add your Selenium WebDriver path (for Bright Data's Scraping Browser, this is the remote endpoint it provides):
SBR_WEBDRIVER=<path-to-your-webdriver>
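Projects like this typically load the `.env` file with python-dotenv, but what the loading amounts to can be sketched in pure Python (the file path and variable value below are demo assumptions):

```python
import os
from pathlib import Path

def load_env(path: str = ".env") -> None:
    """Minimal .env loader: KEY=VALUE lines; blanks and '#' comments ignored."""
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Do not override variables already set in the environment
        os.environ.setdefault(key.strip(), value.strip())

# Demo only: create a .env with a hypothetical driver path, then load it
Path(".env").write_text("SBR_WEBDRIVER=/usr/local/bin/chromedriver\n")
load_env()
print(os.environ["SBR_WEBDRIVER"])
```

With python-dotenv installed, the same effect is a single `load_dotenv()` call at startup.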
To start the application, run:
streamlit run main.py
- Input: Enter a website URL to scrape.
- Scraping: Click "Scrape Site" to fetch and display the website's content.
- Parsing: Provide a description of what you want to parse from the content.
- Result: The parsed data is displayed according to the provided description.
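Under the hood, the parsing step in scrapers of this kind usually splits the cleaned page text into chunks that fit the model's context window and prompts the LLM once per chunk. A minimal sketch of that split-and-prompt pattern (the function names and chunk size are assumptions, and the actual Langchain/Ollama call is omitted):

```python
def split_dom_content(content: str, max_len: int = 6000) -> list[str]:
    """Split cleaned page text into fixed-size chunks for the model."""
    return [content[i : i + max_len] for i in range(0, len(content), max_len)]

def build_prompt(chunk: str, description: str) -> str:
    """Combine one chunk with the user's parse description."""
    return (
        "Extract only the information matching this description "
        f"from the text below.\nDescription: {description}\n\nText:\n{chunk}"
    )

chunks = split_dom_content("x" * 13000, max_len=6000)
print(len(chunks))  # -> 3

prompt = build_prompt(chunks[0], "all product prices")
```

Each prompt would then be sent to the Ollama-served model via Langchain, and the per-chunk results concatenated into the final answer.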
The project uses pytest for testing. To run the tests:
pytest
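A test in this suite might look like the following. This is a hypothetical example: the function under test is inlined here so the file is self-contained, and its name is illustrative rather than the project's actual helper:

```python
# test_scraper.py -- hypothetical pytest example

def split_dom_content(content, max_len=6000):
    """Stand-in for the project's chunking helper."""
    return [content[i : i + max_len] for i in range(0, len(content), max_len)]

def test_split_dom_content_respects_max_len():
    chunks = split_dom_content("a" * 10000, max_len=6000)
    assert len(chunks) == 2
    assert all(len(c) <= 6000 for c in chunks)

def test_split_dom_content_empty():
    assert split_dom_content("") == []
```

pytest discovers any `test_*.py` file and runs every `test_*` function in it, so no extra registration is needed.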
This project is set up with GitHub Actions for CI/CD. The pipeline runs tests on every push to the `main` branch and ensures that all tests pass before deploying.
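The workflow that drives this pipeline looks roughly like the following. This is a sketch, not the repository's actual workflow file; the Python version and step names are assumptions:

```yaml
# .github/workflows/ci.yml (illustrative)
name: CI
on:
  push:
    branches: [main]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - run: pytest
```

A deploy job would be added after `test` with `needs: test`, so it only runs when the test job succeeds.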