The AI Web Scraper is a Python-based application that uses Streamlit for the frontend, Selenium for web scraping, and an LLM served through Ollama for natural language processing. It lets users scrape websites, extract their content, and parse it according to a plain-language description.
- Web Scraping: Leverages Selenium to scrape dynamic content from websites.
- Content Cleaning: Processes and cleans the scraped HTML content using BeautifulSoup.
- Natural Language Parsing: Uses Langchain with Ollama LLM to parse the content based on user input.
- Streamlit Interface: Provides a user-friendly interface for entering URLs, viewing content, and running parsing operations.
- CAPTCHA Handling: Uses Bright Data's service to handle CAPTCHA unblocking (Bright Data offers a free trial).
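The content-cleaning step above can be sketched with the standard library alone. The project itself uses BeautifulSoup, but the idea is the same: strip `<script>`/`<style>` blocks and keep only visible text (names here are illustrative, not the project's actual helpers):

```python
from html.parser import HTMLParser

class BodyTextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> blocks."""
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside skipped tags
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

def clean_html(html: str) -> str:
    parser = BodyTextExtractor()
    parser.feed(html)
    return "\n".join(parser._chunks)

html = "<body><script>var x=1;</script><h1>Title</h1><p>Some text.</p></body>"
print(clean_html(html))
# prints:
# Title
# Some text.
```

With BeautifulSoup the equivalent is roughly `soup.get_text(separator="\n", strip=True)` after decomposing script and style tags.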
To run this project locally, follow these steps:
- Clone the repository:
git clone https://github.com/Igorth/web-scraper-ai
cd web-scraper-ai
- Set up a virtual environment and activate it:
python -m venv .venv
source .venv/bin/activate # On Windows use: .venv\Scripts\activate
- Install the required dependencies:
pip install -r requirements.txt
- Set up environment variables by creating a `.env` file:
touch .env
- Add your Selenium WebDriver path (for Bright Data's Scraping Browser, this is the remote endpoint it provides):
SBR_WEBDRIVER=<path-to-your-webdriver>
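Projects like this typically load the `.env` file with python-dotenv, but what the loading amounts to can be sketched in pure Python (the file path and variable value below are demo assumptions):

```python
import os
from pathlib import Path

def load_env(path: str = ".env") -> None:
    """Minimal .env loader: KEY=VALUE lines; blanks and '#' comments ignored."""
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Do not override variables already set in the environment
        os.environ.setdefault(key.strip(), value.strip())

# Demo only: create a .env with a hypothetical driver path, then load it
Path(".env").write_text("SBR_WEBDRIVER=/usr/local/bin/chromedriver\n")
load_env()
print(os.environ["SBR_WEBDRIVER"])
```

With python-dotenv installed, the same effect is a single `load_dotenv()` call at startup.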
To start the application, run:
streamlit run main.py
- Input: Enter a website URL to scrape.
- Scraping: Click "Scrape Site" to fetch and display the website's content.
- Parsing: Provide a description of what you want to parse from the content.
- Result: The parsed data is displayed according to the provided description.
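Under the hood, the parsing step in scrapers of this kind usually splits the cleaned page text into chunks that fit the model's context window and prompts the LLM once per chunk. A minimal sketch of that split-and-prompt pattern (the function names and chunk size are assumptions, and the actual Langchain/Ollama call is omitted):

```python
def split_dom_content(content: str, max_len: int = 6000) -> list[str]:
    """Split cleaned page text into fixed-size chunks for the model."""
    return [content[i : i + max_len] for i in range(0, len(content), max_len)]

def build_prompt(chunk: str, description: str) -> str:
    """Combine one chunk with the user's parse description."""
    return (
        "Extract only the information matching this description "
        f"from the text below.\nDescription: {description}\n\nText:\n{chunk}"
    )

chunks = split_dom_content("x" * 13000, max_len=6000)
print(len(chunks))  # -> 3

prompt = build_prompt(chunks[0], "all product prices")
```

Each prompt would then be sent to the Ollama-served model via Langchain, and the per-chunk results concatenated into the final answer.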
The project uses pytest for testing. To run the tests:
pytest
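A test in this suite might look like the following. This is a hypothetical example: the function under test is inlined here so the file is self-contained, and its name is illustrative rather than the project's actual helper:

```python
# test_scraper.py -- hypothetical pytest example

def split_dom_content(content, max_len=6000):
    """Stand-in for the project's chunking helper."""
    return [content[i : i + max_len] for i in range(0, len(content), max_len)]

def test_split_dom_content_respects_max_len():
    chunks = split_dom_content("a" * 10000, max_len=6000)
    assert len(chunks) == 2
    assert all(len(c) <= 6000 for c in chunks)

def test_split_dom_content_empty():
    assert split_dom_content("") == []
```

pytest discovers any `test_*.py` file and runs every `test_*` function in it, so no extra registration is needed.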
This project is set up with GitHub Actions for CI/CD. The pipeline runs tests on every push to the `main` branch and ensures that all tests pass before deploying.
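The workflow that drives this pipeline looks roughly like the following. This is a sketch, not the repository's actual workflow file; the Python version and step names are assumptions:

```yaml
# .github/workflows/ci.yml (illustrative)
name: CI
on:
  push:
    branches: [main]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - run: pytest
```

A deploy job would be added after `test` with `needs: test`, so it only runs when the test job succeeds.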