A Python project that extracts data from websites using Selenium and BeautifulSoup4, then either prints the results to stdout
or sends them via a POST request. Optionally, the extracted web data can be processed through OpenAI's ChatGPT API, with the API response included in the final output!
This project was intended for Linux, but may work on Windows after adjustments.
Before making this project, I created a private project for my modding community that utilized the above technologies which inspired me to make this open source project. I'm hoping the code in this repository helps other developers who want to integrate ChatGPT into their projects or wants to see how to parse and extract data from websites.
The URL specified is parsed through extractors which are classes intended to extract specific data from the web page. As of right now, there are two extractors; default
and discourse_topic
.
The default
extractor returns all text from within the <body>
tag when the web page is loaded. Keep in mind that this does not wait for data generated by JavaScript to load and only returns what the web server initially renders normally (I'll most likely implement better support for waiting for specific content to load later).
The last extractor is discourse_topic
which specifically extracts the text contents of a topic from a Discourse forum. I made this extractor because I used my modding forum to test this project.
Additional extractors may be created for those interested! Feel free to take a look at the source code from src/extract
. You may also be interested in this lab/guide I made on how to extract data from web pages using Selenium and BeautifulSoup4 which includes examples that utilizes JavaScript and waits for specific content to load.
- An OpenAI key which may be retrieved after setting up billing from here.
- The URL to a web page that supports extracting data from one of the extractors here.
In order for Selenium to operate, you need the Firefox Geckodriver which may be found here along with the firefox-esr
package.
# Download Geckodriver 0.35.0 driver.
wget https://github.com/mozilla/geckodriver/releases/download/v0.35.0/geckodriver-v0.35.0-linux64.tar.gz
# Extract Geckodriver binary.
tar -xzvf geckodriver-v0.35.0-linux64.tar.gz
# Move binary to /usr/bin.
sudo mv geckodriver /usr/bin
# Install the Firefox ESR package using apt (for Debian/Ubuntu-based systems).
sudo apt install -y firefox-esr
Python (version 3 or higher) is required to run this project. Additionally, the following Python packages are required.
- Selenium (
4.27.1
) - Beautifulsoup4 (
4.12.3
) - Requests (
2.32.3
) - Jinja2 (
3.1.5
) - OpenAI (
1.57.4
)
You may install these packages using the following command after cloning this repository.
pip3 install -r requirements.txt
I recommend running this project within a virtual environment in Python. You can create a virtual environment using the following command assuming you have the required Python packages on your server.
python3 -m venv venv/
You can then activate the virtual environment and install the required packages listed above inside of the environment.
# Activate virtual environment.
source venv/bin/activate
# Download required packages in new environment.
pip3 install -r requirements.txt
The following command line arguments are supported when running this application.
Name | Default | Description |
---|---|---|
-c, --cfg | ./conf.json |
Path to config file. |
-u, --url | None |
The URL to parse. If not specified, you will need to input after starting the program. |
-e, --extractor | None |
The extractor to use. If not specified, you will need to input after starting the program. |
-s, --silent | N/A | When set, verbose output will be disabled. |
-l, --list | N/A | Lists contents of the config settings and exits. |
-h, --help | N/A | Prints the help menu and exits. |
Example(s)
python3 src/main.py -c /etc/mycustomconf.json
python3 src/main.py -l
python3 src/main.py --help
python3 src/main.py -u https://forum.moddingcommunity.com/t/discord-login-integration-added/170 -e discourse_topic
The default configuration file is located at ./conf.json
, but can be changed with the command line arguments mentioned above. I recommend copying the conf.ex.json
file to conf.json
for new users.
Here is a list of configuration settings.
Name | Type | Default | Description |
---|---|---|---|
save_to_fs | bool | false |
Saves config to file system after parsing. This is good for automatic formatting and saving all available settings to config file. |
templates_dir | string | templates |
The path to the templates/ directory without the last forward slash. |
extract | Extract Object | {} |
The extract object (read below). |
chatgpt | ChatGPT Object | {} |
The ChatGPT object (read below). |
output | Output Object | {} |
The output object (read below). |
The extract object contains settings related to extracting the web data.
Name | Type | Default | Description |
---|---|---|---|
drv_path | string | /usr/bin/geckodriver |
The path to the Geckodriver binary file. |
agents | string array | [] |
A list of user agents to randomly select from when sending web request. |
Example(s)
{
"agents":
[
"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0"
]
}
The ChatGPT object contains settings related to OpenAI and ChatGPT.
Name | Type | Default | Description |
---|---|---|---|
enabled | bool | true |
When enabled, send extracted web data through the ChatGPT API and includes API response in final output. |
key | string | null |
The OpenAI key (required). |
model | string | gpt-3.5-turbo |
The ChatGPT model to use. |
max_tokens | int | 500 |
The maximum tokens to use with the request. |
temperature | float | 0.7 |
The temperature to use with the ChatGPT request. Read more about this here. |
max_input | int | 500 |
The maximum characters to send to the ChatGPT API (input). |
role_template | string | chatgpt_role |
The template file name to use with the system's role in ChatGPT from the templates/ directory without the file extension (.tpl ). |
prompt_template | string | chatgpt_prompt |
The template file name to use with the user's prompt in ChatGPT from the templates/ directory without the file extension (.tpl ). |
Here is a list of model code names you can use with the model
setting.
- gpt-3.5-turbo
- gpt-3.5-turbo-16k
- gpt-4
- gpt-4-32k
- gpt-4o
- gpt-4o-mini
- o1-mini
- o1
- o1-pro
NOTE - There are multiple models you may use including GPT-4o and GPT-4o-mini. However, I've found the newer models are very expensive with the API compared to GPT-3. Therefore, I recommend using the GPT-3 model when you can. For a list of models, check here! For information on pricing, check here!
Example(s)
{
"key": "CHANGEME",
"model": "gpt-4o",
"max_tokens": 500,
"temperature": 0.5,
"max_input": 1000
}
The output object is used for sending the ChatGPT response somewhere.
Name | Type | Default | Description |
---|---|---|---|
type | string | stdout |
The type of output to use (stdout or post are supported right now). |
stdout | Stdout Object | {} |
Settings for the Stdout type. |
post | POST Object | {} |
Settings for the POST type. |
When a POST request is sent, the content type is application/json
and an example of the request body may be found below.
{
"url": "<URL Parsed>",
"extractor": "<Extractor Used>",
"web_data": "<Web Data Extracted>",
"chatgpt_res": "<ChatGPT Response>"
}
The Stdout object contains settings when using the stdout
type to output the results to the stdout
pipe.
Name | Type | Default | Description |
---|---|---|---|
use_json | bool | false |
Outputs results in JSON format. |
file_path | string | null |
If set, will write the results to a file at this path. |
file_append | bool | false |
If true, will append the results to the file. |
Example(s)
{
"use_json": true,
"file_path": "./ewc.json",
"file_append": true
}
The POST object contains settings when using the post
type to send results to a web endpoint via a POST request.
Name | Type | Default | Description |
---|---|---|---|
url | string | http://localhost |
The URL to send the POST request to. |
headers | Object <string, string> | {} |
Any headers to send with the request. |
Example(s)
{
"type": "post",
"post":
{
"url": "https://api.mydomain.com/web-extract",
"headers":
{
"Authorization": "Bearer <MY TOKEN>"
}
}
}
A basic template system is used when formatting what role and prompt to send ChatGPT. Templates may be found in the templates/
directory.
The variables url
and content
are passed to the templates, so feel free to use them (e.g. {{ content }}
).