This repository contains the code for an automated product searcher.
A single bot performs the following steps:
- Connects to the target site's search page
- Performs a search with specified keywords
- Retrieves the first HTML page of results, extracts the relevant data, and displays it in the console
- Stores the RAW HTML in a database for later processing
You can have multiple bots running at the same time, each performing searches against a different site.
- All code is written in Python 3
- Python libraries used:
  - Beautiful Soup
  - Requests
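For orientation, here is a minimal sketch of a single bot pass using those two libraries, assuming a sqlite3 database. The query parameter name, CSS class, table name, and schema are illustrative assumptions, not the repository's actual code.

```python
import sqlite3

import requests
from bs4 import BeautifulSoup

def run_search(search_url, keywords):
    """One bot pass: fetch results, display them, store the raw HTML."""
    # Fetch the first page of results (the 'q' parameter is an assumption)
    response = requests.get(search_url, params={'q': keywords}, timeout=10)
    response.raise_for_status()

    # Extract and display the relevant data (the 'product' class is illustrative)
    soup = BeautifulSoup(response.text, 'html.parser')
    for product in soup.find_all(class_='product'):
        print(product.get_text(strip=True))

    # Store the raw HTML in a database for later processing (assumed schema)
    con = sqlite3.connect('scraper.db')
    con.execute('CREATE TABLE IF NOT EXISTS raw_pages (url TEXT, html TEXT)')
    con.execute('INSERT INTO raw_pages VALUES (?, ?)', (response.url, response.text))
    con.commit()
    con.close()
```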
Perform the following steps to install:
- In a target directory, clone the repo:
```bash
git clone https://github.com/alittlebroken/scraperBot
```
- Create a Python virtual environment:
```bash
python -m venv venv
```
- Activate the virtual environment:
```bash
# Unix
source venv/bin/activate
# Windows
venv\Scripts\activate
```
- Install the required packages:
```bash
pip install beautifulsoup4 requests
```
- Open the setup.py file in your favourite text/code editor

Each customer is stored inside the customers list as a single tuple. To add more, just add another tuple.
Format:
```python
(
    id,          # Unique identifier for the customer
    name,        # Customer's name
    description  # Description of the customer
)
```
Example:
```python
customers = [
    (
        1,
        'techstore.com',
        'The premier PC part specialist'
    ),
]
```
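setup.py presumably writes this list to the database. Assuming a sqlite3-style connection named `dbc` (the same name used in the rules section below) and an assumed table layout, the insert might look like:

```python
# Assumed SQL; the real statement lives in setup.py
customers_sql = 'INSERT INTO customers (id, name, description) VALUES (?, ?, ?)'
dbc.executemany(customers_sql, customers)
```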
A scraper is added to the scrapers list as a single tuple. To add another scraper, just add another tuple to the list.
Format:
```python
(
    id,           # Unique id for the scraper
    name,         # Name for this scraper
    description,  # Description of the scraper
    enabled,      # 1 to enable and 0 to disable
    customerID,   # The id of the customer this scraper belongs to
    searchTerms,  # The terms to search on
    runFrequency, # Run the scraper once every X hours
)
```
Example:
```python
scrapers = [
    (
        1,
        'techstore-motherboards',
        'All motherboards on techstore.com',
        1,
        1,
        'motherboards',
        6
    ),
]
```
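To show how the `enabled` and `runFrequency` fields could drive a bot, here is a hedged sketch of a per-scraper loop. `run_scraper` is a hypothetical function standing in for the search/extract/store step; this is not main.py's actual code:

```python
import time

def bot_loop(scraper):
    """Run one scraper forever, once every runFrequency hours."""
    _id, name, _desc, enabled, customer_id, search_terms, run_frequency = scraper
    if not enabled:  # enabled == 0 disables the scraper entirely
        return
    while True:
        run_scraper(customer_id, search_terms)  # hypothetical search/extract/store step
        time.sleep(run_frequency * 3600)        # hours -> seconds
```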
Each customer should have their own list of rules, with each rule being its own tuple.
The rules we support currently are:
- url - The URL of the customer's search page
- searches - The CSS class that identifies the search results on a page
- products - The CSS class that identifies a single product in the search results
- title - The CSS class that contains the name of an individual product in the results
- price - The CSS class which contains the price of an item in the search results
Format:
```python
customer_scraper_rules = [
    (
        ruleName,   # One of the rule names above
        ruleValue,  # Value for this rule
        scraperID,  # The ID of the scraper associated with this rule
        customerID  # The ID of the customer associated with this rule
    ),
]
```
Example:
```python
techstore_mb_rules = [
    ('url', 'https://www.techstore.com/search', 1, 1),
    ('searches', 'class/search', 1, 1),
    ('products', 'class/ProductCardstyles__Wrapper-l8f8q8-1', 1, 1),
    ('title', 'class/ProductCardstyles__Title-l8f8q8-12', 1, 1),
    ('price', 'class/ProductCardstyles__PriceText-l8f8q8-14', 1, 1),
    ('stock', '', 1, 1)
]
```
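The `class/...` values suggest each rule names a CSS class. A minimal sketch of how a scraper might apply these rules with Beautiful Soup (the `class/` prefix handling and the rule lookup are assumptions about the repository's behaviour):

```python
from bs4 import BeautifulSoup

def css_class(rules, rule_name):
    """Look up a rule value and strip the assumed 'class/' prefix."""
    for name, value, _scraper_id, _customer_id in rules:
        if name == rule_name:
            return value.split('/', 1)[1]
    return None

def extract_products(html, rules):
    """Print the title and price of each product found in a results page."""
    soup = BeautifulSoup(html, 'html.parser')
    for product in soup.find_all(class_=css_class(rules, 'products')):
        title = product.find(class_=css_class(rules, 'title'))
        price = product.find(class_=css_class(rules, 'price'))
        if title and price:
            print(title.get_text(strip=True), price.get_text(strip=True))
```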
You need to add one line per set of customer rules in the `# Scraper rule inserts` section. Don't forget to change the second argument of the function below to the actual name you gave your customer rule list:
```python
dbc.execute(rules_sql, techstore_mb_rules)
```
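With a second set of rules you would add a matching line; `techstore_gpu_rules` below is a hypothetical second rule list:

```python
# Scraper rule inserts
dbc.execute(rules_sql, techstore_mb_rules)
dbc.execute(rules_sql, techstore_gpu_rules)  # hypothetical second rule list
```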
- Save and close the file
- Run the following command to create the customers, scrapers and rules:
```bash
python setup.py
```
- Run the following command to start the bot(s):
```bash
python main.py
```
- Press CTRL + C to end the program
Once you are happy that the bots are working as intended, you can delete the setup.py file.
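As noted above, multiple bots can run at the same time against different sites. One plausible way main.py could achieve this is one thread per enabled scraper; the sketch below reuses the hypothetical `bot_loop` from earlier and is an assumption about the design, not the repository's actual code:

```python
import threading

def start_bots(scrapers):
    """Launch one daemon thread per scraper; disabled ones exit immediately."""
    for scraper in scrapers:
        threading.Thread(target=bot_loop, args=(scraper,), daemon=True).start()
```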
TODO:
- Web Interface with Flask
- Add docker files for web, DB and scrapers
- Come up with a better name for the project
- Output startup info to console
- Use a scheduler library for the scrapers
- Use environment variables for common config values
- Use SQLAlchemy for scrapers
- Add command line functions to scrapers
- Send output to log instead of console