Python Web Scraper with Streamlit

AI Web Scraper

Overview

The AI Web Scraper is a Python application that uses Streamlit for the frontend, Selenium for web scraping, and an Ollama LLM (through Langchain) for natural language processing. Users can scrape a website, extract and clean its content, and parse that content according to a plain-language description.

Features

  • Web Scraping: Leverages Selenium to scrape dynamic content from websites.
  • Content Cleaning: Processes and cleans the scraped HTML content using BeautifulSoup.
  • Natural Language Parsing: Uses Langchain with Ollama LLM to parse the content based on user input.
  • Streamlit Interface: Provides a user-friendly interface for entering URLs, viewing content, and running parsing operations.
  • CAPTCHA Handling: Uses the Bright Data service to help get past CAPTCHAs. Bright Data offers a free trial.
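
To illustrate the content cleaning step, here is a minimal sketch of turning raw HTML into readable text. The project itself uses BeautifulSoup for this; the sketch swaps in the standard library's html.parser so that it runs without extra dependencies. The names `clean_html` and `_TextExtractor` are hypothetical and not taken from the repository.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self):
        return "\n".join(self._parts)


def clean_html(html: str) -> str:
    """Hypothetical helper: return the visible text, one non-empty line each."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Trim each line and drop the empty ones.
    lines = (line.strip() for line in parser.text().splitlines())
    return "\n".join(line for line in lines if line)
```

Calling `clean_html(page_source)` on the HTML returned by Selenium would yield plain text that is easier to pass to the LLM than raw markup.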

Installation

To run this project locally, follow these steps:

  1. Clone the repository:
git clone https://github.com/Igorth/web-scraper-ai
cd web-scraper-ai
  2. Set up a virtual environment and activate it:
python -m venv .venv
source .venv/bin/activate  # On Windows use: .venv\Scripts\activate
  3. Install the required dependencies:
pip install -r requirements.txt
  4. Create a .env file for environment variables:
touch .env
  5. Add your Selenium WebDriver path to the .env file:
SBR_WEBDRIVER=<path-to-your-webdriver>
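
As a rough sketch of how the application might pick up this setting at startup, the snippet below reads `SBR_WEBDRIVER` from the process environment and falls back to the .env file. Projects like this often use the python-dotenv package instead; whether this repository does is an assumption, so the sketch uses only the standard library. The function names `load_env_file` and `get_webdriver_setting` are hypothetical.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines from a .env file into a dict."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything without a key/value pair.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def get_webdriver_setting(path=".env"):
    """Return SBR_WEBDRIVER from the environment, else from the .env file."""
    value = os.environ.get("SBR_WEBDRIVER") or load_env_file(path).get("SBR_WEBDRIVER")
    if not value:
        raise RuntimeError("SBR_WEBDRIVER is not set; add it to your .env file")
    return value
```

Failing early with a clear error when the variable is missing tends to be easier to debug than a Selenium connection error later on.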

Usage

To start the application, run:

streamlit run main.py

How It Works

  • Input: Enter a website URL to scrape.
  • Scraping: Click "Scrape Site" to fetch and display the website's content.
  • Parsing: Provide a description of what you want to parse from the content.
  • Result: The parsed data is displayed according to the provided description.
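
The parsing step above can be sketched roughly as follows. This is an illustration under assumptions, not the repository's code: the helper names, the prompt wording, and the 6000-character chunk size are all hypothetical. The `llm` argument is any callable that takes a prompt string and returns text; in the real application that role would be played by the Ollama model invoked through Langchain.

```python
# Hypothetical prompt; the project's actual wording may differ.
PROMPT_TEMPLATE = (
    "Extract only the information matching this description: {description}\n"
    "If nothing matches, return an empty string.\n\n"
    "Content:\n{content}"
)


def split_content(text, max_length=6000):
    """Split long text into chunks of at most max_length characters."""
    return [text[i:i + max_length] for i in range(0, len(text), max_length)]


def parse_content(chunks, description, llm):
    """Run the LLM callable on every chunk and join the non-empty answers."""
    results = []
    for chunk in chunks:
        prompt = PROMPT_TEMPLATE.format(description=description, content=chunk)
        results.append(llm(prompt).strip())
    return "\n".join(result for result in results if result)
```

Splitting the cleaned page into chunks keeps each prompt within the model's context limit, at the cost of the model seeing only one part of the page at a time.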

Testing

The project uses pytest for testing. To run the tests:

pytest
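
For readers new to pytest, a test file is a module whose name starts with `test_` and which contains functions starting with `test_`. The sketch below shows that shape using a hypothetical helper, `remove_blank_lines`, defined inline so the example is self-contained; the repository's real tests would import functions from the project instead.

```python
def remove_blank_lines(text):
    """Hypothetical helper: drop empty lines and trim surrounding whitespace."""
    return "\n".join(line.strip() for line in text.splitlines() if line.strip())


def test_remove_blank_lines_drops_empty_lines():
    assert remove_blank_lines("a\n\n  b  \n\n") == "a\nb"


def test_remove_blank_lines_handles_empty_input():
    assert remove_blank_lines("") == ""
```

Running `pytest` in the project root discovers and runs every such function automatically.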

Continuous Integration

This project is set up with GitHub Actions for CI/CD. The pipeline runs tests on every push to the main branch and ensures that all tests pass before deploying.