This project is designed to continuously refresh the dataset used by the NVIDIA RTX Chat WebUI application. It automatically scrapes visible text from a predefined list of websites and updates the dataset with the new information.
- Python 3.6 or later
- Google Chrome browser installed
- Clone this repository or download the source code.
- Install the required Python packages by running: `pip install selenium schedule`
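To confirm the packages installed correctly, a quick import check can help. The following is a minimal sketch, not part of the repository: the helper name `missing_packages` and the `REQUIRED_PACKAGES` list are assumptions made for illustration.

```python
import importlib.util

# Packages named in the install command above.
REQUIRED_PACKAGES = ["selenium", "schedule"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED_PACKAGES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

If anything is reported as missing, rerun the `pip install` command in the same Python environment that will run the scripts.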
- Open the `Scrape.py` file and modify the `websites` and `file_names` lists to include the URLs you want to scrape and the file names to save them under, respectively.
- Run the `app_launch.bat` file. This will set up the required environment, verify the installation, and start the scraping and refreshing processes.
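As an illustration of the first step, the two lists might look like the sketch below. The URLs and file names are placeholder values, and the `pair_targets` helper is an assumption added here to show that each URL is matched with the file name at the same position; the actual `Scrape.py` may pair them differently.

```python
# Placeholder values; replace them with the sites you want to scrape.
websites = [
    "https://example.com/news",
    "https://example.com/docs",
]
file_names = [
    "example_news.txt",
    "example_docs.txt",
]


def pair_targets(urls, names):
    """Pair each URL with its output file name, failing early on a mismatch."""
    if len(urls) != len(names):
        raise ValueError(
            f"{len(urls)} URLs but {len(names)} file names; the lists must match"
        )
    return list(zip(urls, names))


targets = pair_targets(websites, file_names)
for url, name in targets:
    print(f"{url} -> {name}")
```

Keeping the two lists the same length and in the same order avoids saving a page under the wrong file name.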
The script will initially scrape the visible text from the specified websites and save it to individual text files in the `AppData\Local\NVIDIA\ChatWithRTX\RAG\trt-llm-rag-windows-main\dataset` directory. After the initial scrape, the script will continue to refresh the dataset every hour by scraping the websites again and updating the corresponding text files.
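The "scrape once, then refresh every hour" behavior can be sketched as follows. This is a standard-library stand-in, not the project's code: the project installs the `schedule` package for timing and uses Selenium for fetching, while this sketch uses a plain loop and takes the page-fetching function as a parameter. The names `refresh_dataset`, `run_periodically`, `fetch_text`, and `max_runs` are all assumptions for illustration.

```python
import time
from pathlib import Path


def refresh_dataset(targets, fetch_text, dataset_dir):
    """Fetch each page's text and overwrite its file in dataset_dir.

    targets is a list of (url, file_name) pairs; fetch_text is any callable
    that takes a URL and returns that page's visible text.
    """
    dataset_dir = Path(dataset_dir)
    dataset_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for url, file_name in targets:
        path = dataset_dir / file_name
        path.write_text(fetch_text(url), encoding="utf-8")
        written.append(path)
    return written


def run_periodically(job, interval_seconds=3600, max_runs=None, sleep=time.sleep):
    """Run job immediately, then again after every interval.

    max_runs=None loops forever; a number stops after that many runs.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        job()
        runs += 1
        if max_runs is None or runs < max_runs:
            sleep(interval_seconds)
    return runs
```

Because each run overwrites the existing text files, the dataset directory always holds the most recent scrape of every site.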
- `Scrape.py`: This file contains the main scraping functionality. It uses Selenium to scrape visible text from a list of websites and saves it to individual text files.
- `run.py`: This file runs two separate processes: the main application (`app.py`) and the refresh script (`refresh_script_runner.py`).
- `refresh_script_runner.py`: This file runs the `Scrape.py` script periodically (every hour) to refresh the dataset.
- `app_launch.bat`: This batch file sets up the required environment, verifies the installation, and runs the `run.py` script.
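One way `run.py` could start the application and the refresh script side by side is sketched below. It assumes the standard-library `subprocess` module; the source only says that two separate processes are run, so the mechanism and the helper names `build_commands`, `launch_all`, and `wait_all` are assumptions.

```python
import subprocess
import sys


def build_commands(scripts):
    """Build one command per script, using the current Python interpreter."""
    return [[sys.executable, script] for script in scripts]


def launch_all(commands):
    """Start every command as its own process and return the Popen handles."""
    return [subprocess.Popen(command) for command in commands]


def wait_all(processes):
    """Block until every process exits and return their exit codes in order."""
    return [process.wait() for process in processes]


# Example usage (commented out so importing this sketch starts nothing):
# processes = launch_all(build_commands(["app.py", "refresh_script_runner.py"]))
# exit_codes = wait_all(processes)
```

Starting both scripts without waiting in between lets the refresh loop run in the background while the main application stays responsive.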