web-extract-with-chatgpt

A Python project that extracts data from websites using Selenium and BeautifulSoup4, then either prints the results to stdout or sends them via a POST request. Optionally, the extracted web data can be processed through OpenAI's ChatGPT API, with the API response included in the final output!

This project was intended for Linux, but may work on Windows with some adjustments.

Before making this project, I created a private project for my modding community that used the technologies above, which inspired me to make this open source project. I'm hoping the code in this repository helps other developers who want to integrate ChatGPT into their projects or want to see how to parse and extract data from websites.

Extractors

The specified URL is parsed through extractors, which are classes intended to extract specific data from the web page. As of right now, there are two extractors: default and discourse_topic.

The default extractor returns all text within the <body> tag when the web page is loaded. Keep in mind that this does not wait for data generated by JavaScript to load and only returns what the web server initially renders (I'll most likely implement better support for waiting for specific content to load later).

The second extractor is discourse_topic, which specifically extracts the text contents of a topic from a Discourse forum. I made this extractor because I used my modding forum to test this project.

Additional extractors may be created for those interested! Feel free to take a look at the source code in src/extract. You may also be interested in this lab/guide I made on how to extract data from web pages using Selenium and BeautifulSoup4, which includes examples that utilize JavaScript and wait for specific content to load.
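
For those curious, here is a minimal sketch of what a custom extractor could look like. The class name and method signature are hypothetical (the real classes in src/extract may be structured differently); the BeautifulSoup4 calls themselves are standard.

# Hypothetical sketch of a custom extractor; the real classes in src/extract may differ.
from bs4 import BeautifulSoup

class HeadingsExtractor:
    # Takes the page source fetched by Selenium and returns the extracted text.
    def extract(self, page_source: str) -> str:
        soup = BeautifulSoup(page_source, "html.parser")

        # Grab the text of every <h1> and <h2> heading on the page.
        headings = [tag.get_text(strip=True) for tag in soup.find_all(["h1", "h2"])]

        return "\n".join(headings)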

Requirements

  • An OpenAI API key, which may be retrieved after setting up billing here.
  • The URL of a web page from which data can be extracted using one of the extractors above.

Firefox Geckodriver

In order for Selenium to operate, you need the Firefox Geckodriver, which may be found here, along with the firefox-esr package.

# Download the Geckodriver 0.35.0 release.
wget https://github.com/mozilla/geckodriver/releases/download/v0.35.0/geckodriver-v0.35.0-linux64.tar.gz

# Extract Geckodriver binary.
tar -xzvf geckodriver-v0.35.0-linux64.tar.gz

# Move binary to /usr/bin.
sudo mv geckodriver /usr/bin

# Install the Firefox ESR package using apt (for Debian/Ubuntu-based systems).
sudo apt install -y firefox-esr
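
To give a rough idea of how Selenium uses the driver once it's installed, here is a minimal Selenium 4 sketch that launches headless Firefox through the Geckodriver binary. This is only an illustration; the project's own driver setup (and the options it passes) may differ, and the URL is a placeholder.

# Minimal Selenium 4 example of launching headless Firefox via Geckodriver.
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service

options = Options()
options.add_argument("--headless")

# Point Selenium at the Geckodriver binary installed in /usr/bin.
driver = webdriver.Firefox(service=Service("/usr/bin/geckodriver"), options=options)

driver.get("https://example.com")
print(driver.page_source[:200])

driver.quit()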

Python

Python (version 3 or higher) is required to run this project. Additionally, the following Python packages are required.

  • Selenium (4.27.1)
  • Beautifulsoup4 (4.12.3)
  • Requests (2.32.3)
  • Jinja2 (3.1.5)
  • OpenAI (1.57.4)

You may install these packages using the following command after cloning this repository.

pip3 install -r requirements.txt

Python Virtual Environment

I recommend running this project within a Python virtual environment. You can create one using the following command, assuming Python 3 and its venv module are available on your system.

python3 -m venv venv/

You can then activate the virtual environment and install the required packages listed above inside the environment.

# Activate virtual environment.
source venv/bin/activate

# Install the required packages in the new environment.
pip3 install -r requirements.txt

Command Line Usage & Running

The following command line arguments are supported when running this application.

  • -c, --cfg (default: ./conf.json): Path to the config file.
  • -u, --url (default: none): The URL to parse. If not specified, you will be prompted for it after the program starts.
  • -e, --extractor (default: none): The extractor to use. If not specified, you will be prompted for it after the program starts.
  • -s, --silent: When set, verbose output is disabled.
  • -l, --list: Lists the contents of the config settings and exits.
  • -h, --help: Prints the help menu and exits.
Example(s)

Run & Load Custom Config

python3 src/main.py -c /etc/mycustomconf.json

Run & List Config Settings

python3 src/main.py -l

Run & Print Help Menu

python3 src/main.py --help

Run & Load Forum Topic From My Modding Forum

python3 src/main.py -u https://forum.moddingcommunity.com/t/discord-login-integration-added/170 -e discourse_topic

Configuration

The default configuration file is located at ./conf.json, but this can be changed with the command line arguments mentioned above. New users should copy the conf.ex.json file to conf.json to get started.

Here is a list of configuration settings.

  • save_to_fs (bool, default false): Saves the config back to the file system after parsing. This is useful for automatic formatting and for writing all available settings to the config file.
  • templates_dir (string, default templates): The path to the templates/ directory without a trailing forward slash.
  • extract (Extract object, default {}): The extract object (read below).
  • chatgpt (ChatGPT object, default {}): The ChatGPT object (read below).
  • output (Output object, default {}): The output object (read below).
Example(s)

Save To Filesystem

{
    "save_to_fs": true,
    "extract": {},
    "chatgpt": {},
    "output": {}
}
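
To illustrate what save_to_fs means in practice, here is a rough Python sketch of loading the config with defaults and writing it back to disk when the setting is enabled. This is a simplified illustration, not the project's actual config loader.

# Simplified illustration of the save_to_fs behavior; not the project's actual loader.
import json

DEFAULTS = {
    "save_to_fs": False,
    "templates_dir": "templates",
    "extract": {},
    "chatgpt": {},
    "output": {},
}

with open("conf.json", "r") as f:
    cfg = {**DEFAULTS, **json.load(f)}

# With save_to_fs enabled, the parsed config (defaults included) is written back to the file.
if cfg["save_to_fs"]:
    with open("conf.json", "w") as f:
        json.dump(cfg, f, indent=4)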

Extract Object

The extract object contains settings related to extracting the web data.

  • drv_path (string, default /usr/bin/geckodriver): The path to the Geckodriver binary file.
  • agents (string array, default []): A list of user agents to randomly select from when sending the web request.
Example(s)

Use Custom User Agents

{
    "agents":
    [
        "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0"
    ]
}
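
For illustration, here is how a user agent could be picked at random from the agents list and applied to Firefox through Selenium. The general.useragent.override preference is a standard Firefox setting; how the project itself wires this up may differ.

# Pick a random user agent from the configured list and apply it to Firefox.
import random
from selenium.webdriver.firefox.options import Options

agents = [
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0"
]

options = Options()

# Only override the user agent when at least one agent is configured.
if agents:
    options.set_preference("general.useragent.override", random.choice(agents))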

ChatGPT Object

The ChatGPT object contains settings related to OpenAI and ChatGPT.

  • enabled (bool, default true): When enabled, sends the extracted web data through the ChatGPT API and includes the API response in the final output.
  • key (string, default null): The OpenAI API key (required).
  • model (string, default gpt-3.5-turbo): The ChatGPT model to use.
  • max_tokens (int, default 500): The maximum number of tokens to use with the request.
  • temperature (float, default 0.7): The temperature to use with the ChatGPT request. Read more about this here.
  • max_input (int, default 500): The maximum number of characters to send to the ChatGPT API (input).
  • role_template (string, default chatgpt_role): The template file name used for the system role in ChatGPT, from the templates/ directory and without the file extension (.tpl).
  • prompt_template (string, default chatgpt_prompt): The template file name used for the user prompt in ChatGPT, from the templates/ directory and without the file extension (.tpl).

Here is a list of model code names you can use with the model setting.

  • gpt-3.5-turbo
  • gpt-3.5-turbo-16k
  • gpt-4
  • gpt-4-32k
  • gpt-4o
  • gpt-4o-mini
  • o1-mini
  • o1
  • o1-pro

NOTE - There are multiple models you may use, including GPT-4o and GPT-4o mini. However, I've found the newer models are very expensive with the API compared to GPT-3.5 Turbo. Therefore, I recommend using the GPT-3.5 Turbo model when you can. For a list of models, check here! For information on pricing, check here!

Example(s)

Use GPT-4o (EXPENSIVE!)

{
    "key": "CHANGEME",
    "model": "gpt-4o",
    "max_tokens": 500,
    "temperature": 0.5,
    "max_input": 1000
}
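
As a rough guide to how these settings map onto the OpenAI Python SDK (1.x), here is a sketch of a chat completion request using the defaults above. The variable names (role_text, prompt_text, web_data) are placeholders, not the project's identifiers; the template rendering is covered in the Notes section.

# Sketch of a chat completion call using the ChatGPT settings above (OpenAI SDK 1.x).
from openai import OpenAI

client = OpenAI(api_key="CHANGEME")  # the "key" setting

web_data = "...extracted web data..."
role_text = "You summarize extracted web pages."  # rendered from role_template
prompt_text = web_data[:500]                      # trimmed to max_input characters

response = client.chat.completions.create(
    model="gpt-3.5-turbo",   # the "model" setting
    max_tokens=500,          # the "max_tokens" setting
    temperature=0.7,         # the "temperature" setting
    messages=[
        {"role": "system", "content": role_text},
        {"role": "user", "content": prompt_text},
    ],
)

print(response.choices[0].message.content)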

Output Object

The output object is used for sending the results (including the ChatGPT response) somewhere.

  • type (string, default stdout): The type of output to use (stdout and post are supported right now).
  • stdout (Stdout object, default {}): Settings for the stdout type.
  • post (POST object, default {}): Settings for the post type.

When a POST request is sent, the content type is application/json and an example of the request body may be found below.

{
    "url": "<URL Parsed>",
    "extractor": "<Extractor Used>",
    "web_data": "<Web Data Extracted>",
    "chatgpt_res": "<ChatGPT Response>"
}

Stdout Object

The Stdout object contains settings used with the stdout type, which outputs the results to standard output.

  • use_json (bool, default false): Outputs the results in JSON format.
  • file_path (string, default null): If set, writes the results to a file at this path.
  • file_append (bool, default false): If true, appends the results to the file.
Example(s)

Output With JSON & Append Results To File

{
    "use_json": true,
    "file_path": "./ewc.json",
    "file_append": true
}
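
Here is a short sketch of how the stdout settings above could behave; it is illustrative only and does not mirror the project's exact code.

# Illustrative handling of the stdout settings; not the project's exact code.
import json

results = {"url": "...", "extractor": "...", "web_data": "...", "chatgpt_res": "..."}
use_json, file_path, file_append = True, "./ewc.json", True

text = json.dumps(results, indent=4) if use_json else str(results)
print(text)

# file_append controls whether the file is appended to ("a") or overwritten ("w").
if file_path:
    with open(file_path, "a" if file_append else "w") as f:
        f.write(text + "\n")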

POST Object

The POST object contains settings when using the post type to send results to a web endpoint via a POST request.

  • url (string, default http://localhost): The URL to send the POST request to.
  • headers (Object<string, string>, default {}): Any headers to send with the request.
Example(s)

Send POST Request With Auth Token

{
    "type": "post",
    "post":
    {
        "url": "https://api.mydomain.com/web-extract",
        "headers":
        {
            "Authorization": "Bearer <MY TOKEN>"
        }
    }
}
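
For reference, sending the same request with the Requests package looks roughly like the sketch below; the endpoint, token, and body values are placeholders taken from the examples above.

# Sketch of the post output type using the Requests package.
import requests

body = {
    "url": "<URL Parsed>",
    "extractor": "<Extractor Used>",
    "web_data": "<Web Data Extracted>",
    "chatgpt_res": "<ChatGPT Response>",
}

headers = {"Authorization": "Bearer <MY TOKEN>"}

# Requests sets the Content-Type to application/json automatically when using the json argument.
resp = requests.post("https://api.mydomain.com/web-extract", json=body, headers=headers, timeout=30)
print(resp.status_code)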

Notes

Template System For ChatGPT Role & Prompt

A basic template system is used to format the role and prompt sent to ChatGPT. Templates may be found in the templates/ directory.

The variables url and content are passed to the templates, so feel free to use them (e.g. {{ content }}).
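
For example, rendering the prompt template with Jinja2 could look like the following sketch; the template file name follows the prompt_template default (chatgpt_prompt) plus the .tpl extension, and the values passed in are placeholders.

# Render the ChatGPT prompt template with Jinja2.
from jinja2 import Environment, FileSystemLoader

env = Environment(loader=FileSystemLoader("templates"))
template = env.get_template("chatgpt_prompt.tpl")

# url and content are the variables passed to the templates.
rendered = template.render(url="https://example.com", content="...extracted web data...")
print(rendered)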

Credits