entities-extraction-web-scraper

A web scraper that uses OpenAI Functions to extract structured data from web pages.

Primary language: Python

Scrape the Web with Entity Extraction Using OpenAI Functions

What is this?

This codebase lets you scrape any website and extract the relevant data points. Define a schema in schemas.py, pick a URL, and pass both to scrape_with_playwright() in main.py to start scraping.

asyncio.run(scrape_with_playwright(
    url="https://www.bbc.com",
    schema_pydantic=SchemaNewsWebsites,
))
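For readers who have not written a schema yet, here is a minimal sketch of what an entry in schemas.py could look like. The parameter name schema_pydantic suggests a Pydantic model, so the sketch assumes one. The class name SchemaArticleExample and its fields are hypothetical and not taken from this repository.

```python
from typing import Optional

from pydantic import BaseModel, Field


class SchemaArticleExample(BaseModel):
    """Hypothetical schema: one news item to extract from a page."""

    # Required field: extraction output without a headline fails validation.
    headline: str = Field(description="Headline of the news article")
    # Optional field: defaults to None when the page offers no summary.
    summary: Optional[str] = Field(
        default=None, description="One-sentence summary of the article"
    )


# Validating a record by hand shows what the model accepts.
item = SchemaArticleExample(headline="Example headline")
print(item.headline, item.summary)
```

Descriptive field names and descriptions matter here, since they are presumably what the model sees when deciding which parts of the page to extract.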

Setup

Create a new Python virtual environment

python -m venv virtual-env or python3 -m venv virtual-env (Mac)

py -m venv virtual-env (Windows 11)

Activate virtual environment

.\virtual-env\Scripts\activate (Windows)

source virtual-env/bin/activate (Mac)

Install dependencies

Run poetry install --sync or poetry install

Install playwright (for SPAs or JS-heavy websites that require a browser to be opened)

playwright install

Create a new .env file containing your OpenAI API key

OPENAI_API_KEY=XXXXXX
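How the project reads this file is not shown here. The sketch below is a standard-library-only illustration of loading simple KEY=VALUE lines and failing early when the key is missing. The helper names load_env_file and require_openai_key are hypothetical. In practice a library such as python-dotenv (load_dotenv()) is the usual choice.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines and copy them into os.environ.

    Returns the parsed pairs. Variables that are already set in the
    environment are left untouched.
    """
    env_path = Path(path)
    if not env_path.exists():
        return {}
    loaded = {}
    for raw_line in env_path.read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        loaded[key.strip()] = value.strip().strip('"').strip("'")
    for key, value in loaded.items():
        os.environ.setdefault(key, value)
    return loaded


def require_openai_key():
    """Return OPENAI_API_KEY or raise a clear error if it is missing."""
    key = os.environ.get("OPENAI_API_KEY", "")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file")
    return key
```

Calling load_env_file() followed by require_openai_key() at the top of main.py would surface a missing key before any browser is launched.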

Usage

Run locally

python main.py

Additional Information

  • To make the scraper easier to use, add a FastAPI server on top of it and expose it as an API endpoint.

  • Use caution when scraping. Don't do anything I wouldn't do (that is, anything illegal).