
GPTSiteCrawler



Crawl content from a website and all its subpages and store it in a database. Then use GPT to create your own custom GPT and generate new content based on the crawled content.

Screenshots:

[screenshot]

Demo Link

https://chat.openai.com/g/g-RskOOlLFp-sumato-assistant

How to use

Prerequisites

  • Python 3.11

Setup

  • Clone this repo: git clone https://github.com/sonpython/GPTSiteCrawler
  • Create a virtual environment: python3 -m venv venv
  • Activate the virtual environment: source venv/bin/activate
  • Install dependencies: pip install -r requirements.txt
  • Run the crawler: python src/main.py https://example.com --selectors .main --annotate-size 2
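
You can combine the other flags documented in the usage output below. For example, a run that caps the crawl at 200 links and writes the data and statistics to custom files might look like this (the file names and the second selector are just placeholders):

python src/main.py https://example.com --selectors .main .footer --max-links 200 --output output.json --stats stats.json --annotate-size 2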

Usage

> python src/main.py -h
usage: main.py [-h] [--output OUTPUT] [--stats STATS] [--selectors SELECTORS [SELECTORS ...]] [--max-links MAX_LINKS] [--annotate-size ANNOTATE_SIZE] url

Asynchronous web crawler

positional arguments:
  url                   The starting URL for the crawl

options:
  -h, --help            show this help message and exit
  --output OUTPUT       Output file for crawled data
  --stats STATS         Output file for crawl statistics
  --selectors SELECTORS [SELECTORS ...]
                        List of CSS selectors to extract text
  --max-links MAX_LINKS
                        Maximum number of visited links to allow
  --annotate-size ANNOTATE_SIZE
                        Chunk data.json to this file size in MB
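
If you are curious what an asynchronous, selector-based crawler like this looks like under the hood, here is a minimal self-contained sketch of the same idea. It is not the code in src/main.py, and the choice of aiohttp and beautifulsoup4 is an assumption for illustration, not something this repo confirms:

import asyncio
import json
from urllib.parse import urljoin, urlparse

import aiohttp
from bs4 import BeautifulSoup

async def crawl(start_url, selectors, max_links=100):
    """Breadth-first crawl of start_url and its same-host subpages."""
    host = urlparse(start_url).netloc
    seen, queue, results = {start_url}, [start_url], {}
    async with aiohttp.ClientSession() as session:
        while queue and len(results) < max_links:
            url = queue.pop(0)
            try:
                async with session.get(url) as resp:
                    html = await resp.text()
            except aiohttp.ClientError:
                continue  # skip pages that fail to load
            soup = BeautifulSoup(html, "html.parser")
            # Keep only the text under the configured CSS selectors
            results[url] = [el.get_text(" ", strip=True)
                            for sel in selectors for el in soup.select(sel)]
            # Queue unseen links that stay on the same host
            for a in soup.find_all("a", href=True):
                link = urljoin(url, a["href"]).split("#")[0]
                if urlparse(link).netloc == host and link not in seen:
                    seen.add(link)
                    queue.append(link)
    return results

if __name__ == "__main__":
    data = asyncio.run(crawl("https://example.com", [".main"], max_links=10))
    with open("output.json", "w") as f:
        json.dump(data, f, indent=2)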

Chart

I've created a chart to help you understand how the crawler works. It's a bit of a simplification, but it should cover the basics. Run python src/chart.py in another terminal window to see a realtime chart of the crawl progress.

[chart screenshots]
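
The real chart lives in src/chart.py, but as a rough illustration of how a live progress chart can work, here is a sketch that polls a stats file once a second and replots it with matplotlib. The stats.json path and its "visited" field are assumptions for illustration, not the repo's actual format:

import json

import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

ticks, visited = [], []

def update(frame):
    # Re-read the stats file on every tick; tolerate partial writes
    try:
        with open("stats.json") as f:
            stats = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return
    ticks.append(frame)
    visited.append(stats.get("visited", 0))
    plt.cla()
    plt.plot(ticks, visited)
    plt.xlabel("ticks (one per second)")
    plt.ylabel("visited links")
    plt.title("Crawl progress")

ani = FuncAnimation(plt.gcf(), update, interval=1000, cache_frame_data=False)
plt.show()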

Docker

Environment variables

CRAWLER_URL=https://example.com 
CRAWLER_SELECTOR=.main 
CRAWLER_CHUNK_SIZE=2 # in MB

Build

docker build -t gpt-site-crawler .

Run

docker run -it --rm gpt-site-crawler
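
The container is presumably configured through the environment variables listed above; assuming so, you can pass them with -e flags:

docker run -it --rm \
  -e CRAWLER_URL=https://example.com \
  -e CRAWLER_SELECTOR=.main \
  -e CRAWLER_CHUNK_SIZE=2 \
  gpt-site-crawler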

(The docs below are borrowed from @BuilderIO.)

Upload your data to OpenAI

The crawl will generate a file called output.json at the root of this project. Upload that to OpenAI to create your custom assistant or custom GPT.

Create a custom GPT

Use this option for UI access to your generated knowledge that you can easily share with others.

Note: you may need a paid ChatGPT plan to create and use custom GPTs right now.

  1. Go to https://chat.openai.com/
  2. Click your name in the bottom left corner
  3. Choose "My GPTs" in the menu
  4. Choose "Create a GPT"
  5. Choose "Configure"
  6. Under "Knowledge" choose "Upload a file" and upload the file you generated

[GIF: how to upload to a custom GPT]

Create a custom assistant

Use this option for API access to your generated knowledge that you can integrate into your product.

  1. Go to https://platform.openai.com/assistants
  2. Click "+ Create"
  3. Choose "upload" and upload the file you generated

[GIF: how to upload to an assistant]
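
If you prefer to script this step instead of using the dashboard, the sketch below outlines the same flow with the OpenAI Python SDK's Assistants beta (pip install openai). The Assistants API has changed over time, so treat these calls as assumptions to verify against the current OpenAI docs; the assistant name and instructions are placeholders:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the crawl output so the assistant can search it
crawl_file = client.files.create(file=open("output.json", "rb"),
                                 purpose="assistants")

# Put the file in a vector store for the file_search tool
# (newer SDK versions expose this as client.vector_stores instead)
store = client.beta.vector_stores.create(name="crawled-site",
                                         file_ids=[crawl_file.id])

assistant = client.beta.assistants.create(
    name="Site Crawler Assistant",  # placeholder name
    instructions="Answer questions using the crawled site content.",
    model="gpt-4o",
    tools=[{"type": "file_search"}],
    tool_resources={"file_search": {"vector_store_ids": [store.id]}},
)
print("Created assistant:", assistant.id)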