This codebase allows you to scrape any website and extract relevant data points easily.
Create a schema in schemas.py, pick a URL, and pass both to scrape_with_playwright() in main.py to start scraping.
asyncio.run(scrape_with_playwright(
    url="https://www.bbc.com",
    schema_pydantic=SchemaNewsWebsites
))
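For orientation, here is a rough sketch of what a schema in schemas.py might look like. The class name SchemaNewsWebsites comes from the call above, but the field names and descriptions are illustrative assumptions, not the fields this repository actually defines.

```python
from typing import Optional

from pydantic import BaseModel, Field


class SchemaNewsWebsites(BaseModel):
    """Describes the data points to extract from a news page.

    The fields below are hypothetical examples; replace them with
    whatever data points you want extracted.
    """

    # Required field: extraction should always provide a headline.
    news_headline: str = Field(..., description="Headline of the article")
    # Optional field: defaults to None when no summary is found.
    news_short_summary: Optional[str] = Field(
        None, description="One or two sentence summary of the article"
    )


# Pydantic validates the values when an instance is created.
example = SchemaNewsWebsites(news_headline="Example headline")
print(example.news_headline)
```

Keeping each field small and clearly described is a reasonable default, since the descriptions are what tell the extraction step what to look for.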
Create a virtual environment:
python -m venv virtual-env (or python3 -m venv virtual-env on Mac)
py -m venv virtual-env (Windows 11)
Activate it:
.\virtual-env\Scripts\activate (Windows)
source virtual-env/bin/activate (Mac)
Install dependencies: poetry install --sync (or poetry install)
Install the Playwright browsers: playwright install
Set your OpenAI API key: OPENAI_API_KEY=XXXXXX
Run the scraper: python main.py
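If the key is missing, the failure may only show up once a request is made. Below is a minimal sketch of a fail-fast check, assuming the key is read from the environment variable OPENAI_API_KEY. The helper name require_openai_api_key is hypothetical and not part of this codebase.

```python
import os
from typing import Mapping, Optional


def require_openai_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the OpenAI API key, or raise a clear error if it is missing.

    Hypothetical helper: `env` defaults to os.environ and can be
    overridden with a plain dict when testing.
    """
    source = os.environ if env is None else env
    key = source.get("OPENAI_API_KEY", "").strip()
    # Treat the placeholder value from the setup steps as "not set".
    if not key or key == "XXXXXX":
        raise RuntimeError(
            "OPENAI_API_KEY is not set. Set it before running main.py."
        )
    return key
```

Calling require_openai_api_key() at the top of main.py would be one way to surface a missing key before the browser is launched.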
- Add a FastAPI server on top of this to serve it as an API endpoint for ease of use.
- Use caution when scraping. Don't do anything I wouldn't do (i.e., anything illegal).