A Python tool that converts your website content into GPT-friendly text files by scraping your sitemap. This tool is particularly useful for creating training data or knowledge bases for GPT models from your website content.
Website to GPT automatically scrapes all pages listed in your website's sitemap.xml and converts them into clean text format. It handles JavaScript-rendered content and offers two output options:
- Individual text files for each page
- A single merged file with clear page separators
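As a rough illustration of the first step described above (reading page URLs out of a sitemap.xml), here is a minimal sketch. It uses only the standard library so it stays self-contained; the tool itself lists requests and lxml as dependencies and may work differently. The function name `extract_urls` and the sample sitemap are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from typing import List

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(sitemap_xml: str) -> List[str]:
    """Return every non-empty <loc> URL found in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text and loc.text.strip()
    ]


# Hypothetical sitemap content, inlined so the sketch runs without network access.
example = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

urls = extract_urls(example)
print(urls)
```

In practice the XML would be downloaded from the sitemap URL first; that step is left out here.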
- Python 3.6 or higher
- Google Chrome browser
- ChromeDriver (compatible with your Chrome version)
pip install -r requirements.txt
Required packages:
- selenium
- beautifulsoup4
- requests
- lxml
- Clone the repository:
git clone https://github.com/upnorthmedia/websiteGPT.git
cd websiteGPT
- Create and activate a virtual environment:
# Create virtual environment
python3 -m venv venv
# Activate on macOS/Linux
source venv/bin/activate
# Activate on Windows
.\venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Run the script:
python websitegpt.py
- Choose your output preference:
- Option 1: Individual text files (one per page)
- Option 2: Single merged file with headers
- Enter your sitemap URL when prompted (e.g., https://example.com/sitemap.xml)
- Creates separate .txt files for each webpage
- Files are saved in the output directory
- Filenames are derived from URL paths
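The exact rule for deriving a filename from a URL path is not spelled out here, so the following is only one plausible scheme, not necessarily what websitegpt.py does. The function name `url_to_filename`, the `index.txt` fallback for the site root, and the underscore replacement are all assumptions.

```python
import re
from urllib.parse import urlparse


def url_to_filename(url: str) -> str:
    """Map a page URL to a filesystem-safe .txt filename (hypothetical scheme)."""
    path = urlparse(url).path.strip("/")
    # The site root has an empty path, so give it a fixed name.
    if not path:
        return "index.txt"
    # Collapse every run of characters other than letters, digits,
    # hyphens, and underscores into a single underscore.
    safe = re.sub(r"[^A-Za-z0-9_-]+", "_", path).strip("_")
    return f"{safe or 'index'}.txt"


print(url_to_filename("https://example.com/blog/my-post/"))
```

Flattening the path this way keeps all files in one directory, at the cost of possible name collisions between unusual URLs.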
- Creates a single merged_output.txt file
- Each page's content is separated by headers
- Headers include the original page filename
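To make the merged layout concrete, here is a small sketch of how pages might be joined with a header per page. The header format shown (a `FILE:` line between rules of `=` characters) and the function name `merge_pages` are assumptions; the actual separator used in merged_output.txt may look different.

```python
from typing import Iterable, Tuple


def merge_pages(pages: Iterable[Tuple[str, str]]) -> str:
    """Join (filename, text) pairs into one string with a header per page."""
    rule = "=" * 40
    sections = []
    for filename, text in pages:
        # Each section starts with a header naming the page's filename.
        header = f"{rule}\nFILE: {filename}\n{rule}"
        sections.append(f"{header}\n{text.strip()}\n")
    return "\n".join(sections)


merged = merge_pages([("index.txt", "Home"), ("about.txt", " About us \n")])
print(merged)
```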
- Handles JavaScript-rendered content
- Processes complete sitemaps
- Cleans and formats text content
- Supports both individual and merged output modes
- Headless browser operation
- Built-in rate limiting to prevent server overload
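The rate limiting mentioned above could be as simple as a fixed pause between page loads. The sketch below assumes that approach; the one-second default, the function name `fetch_all`, and the injectable `fetch` and `sleep` callables are illustrative choices, not details taken from the tool.

```python
import time
from typing import Callable, Dict, Iterable


def fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Fetch each URL in turn, pausing between consecutive requests."""
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # pause before every request after the first
        results[url] = fetch(url)
    return results


# Stand-ins for a real page loader and for time.sleep, so the example runs instantly.
pauses = []
pages = fetch_all(
    ["a", "b", "c"],
    fetch=lambda u: u.upper(),
    delay_seconds=0.5,
    sleep=pauses.append,
)
print(pages, pauses)
```

Passing `fetch` in as a parameter keeps the throttling logic independent of whether pages are loaded with a headless browser or a plain HTTP client.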
- Ensure your website has a valid sitemap.xml
- Respect robots.txt and website terms of service
- Consider rate limiting for large websites
- Some websites may block automated access
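One way to act on the robots.txt note is to check each URL against the site's rules before scraping it. This sketch uses the standard library's `urllib.robotparser`; whether the tool performs such a check itself is not stated here, and the helper name `build_parser` and the sample rules are assumptions.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a robots.txt parser from the file's text content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical rules, inlined so the sketch runs without network access.
rules = """User-agent: *
Disallow: /private/
"""

robots = build_parser(rules)
allowed = robots.can_fetch("*", "https://example.com/about")
blocked = robots.can_fetch("*", "https://example.com/private/data")
print(allowed, blocked)
```

URLs from the sitemap for which `can_fetch` returns `False` can then be skipped.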
Contributions are welcome! Please feel free to submit a Pull Request.