Scrape the Web with Entity Extraction Using OpenAI Functions

What is this?

This codebase lets you scrape any website and extract the relevant data points. Define a schema in schemas.py, pick a URL, and pass both to scrape_with_playwright() in main.py to start scraping.

asyncio.run(scrape_with_playwright(
    url="https://www.bbc.com",
    schema_pydantic=SchemaNewsWebsites
))
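
The schema passed as schema_pydantic is a Pydantic model whose fields name the data points to extract. As a minimal sketch, assuming a news-site schema with two hypothetical fields (the actual fields defined in schemas.py may differ), it could look like this:

```python
from typing import Optional

from pydantic import BaseModel, Field


class SchemaNewsWebsites(BaseModel):
    """Illustrative schema; the field names are assumptions, not the repo's."""

    # Headline text of one article found on the page.
    news_headline: str = Field(description="Headline of the article")
    # Optional short summary, left empty when the page provides none.
    news_short_summary: Optional[str] = Field(
        default=None, description="Short summary of the article"
    )


# Build a sample record to check that the schema validates as expected.
example = SchemaNewsWebsites(news_headline="Example headline")
```

Giving each field a description is a design choice rather than a requirement: the description travels with the schema, which may help the model decide what belongs in each field.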

Setup

Create a new Python virtual environment

python -m venv virtual-env or python3 -m venv virtual-env (Mac)

py -m venv virtual-env (Windows 11)

Activate virtual environment

.\virtual-env\Scripts\activate (Windows)

source virtual-env/bin/activate (Mac)

Install dependencies

Run poetry install --sync or poetry install

Install the Playwright browsers (needed for SPAs and JS-heavy websites that must be rendered in a browser)

playwright install

Create a new `.env` file containing your OpenAI API key

OPENAI_API_KEY=XXXXXX
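
The key in `.env` has to reach the process environment before any OpenAI call is made. How this repo loads it is not shown here; projects like this often rely on the python-dotenv package, but that is an assumption. As a dependency-free sketch, the hypothetical helpers load_env_file and require_openai_key below parse KEY=VALUE lines and fail early when the key is missing:

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse KEY=VALUE lines from a .env-style file into a dict.

    Hypothetical helper, not part of this repo. Blank lines, comment
    lines starting with '#', and lines without '=' are skipped.
    """
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values may contain '='.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require_openai_key(path: str = ".env") -> str:
    """Return the key from the environment, falling back to the .env file."""
    key = os.environ.get("OPENAI_API_KEY") or load_env_file(path).get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file")
    # Export the key so libraries that read the environment can find it.
    os.environ["OPENAI_API_KEY"] = key
    return key
```

Calling require_openai_key() at the top of main.py would surface a missing key before the browser launches, rather than partway through a scrape.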

Usage

Run locally

python main.py

Additional Information

You can add a FastAPI server on top of this to expose the scraper as an API endpoint for ease of use.
Use caution when scraping. Don't do anything I wouldn't do (that is, anything illegal).

ricable/entities-extraction-web-scraper