This repository contains the code for an automated product searcher.
A single bot performs the following steps:
- Connects to the target site's search page
- Performs a search with specified keywords
- Retrieves the first HTML page of results, extracts the relevant data, and displays it in the console
- Stores the RAW HTML in a database for later processing
You can have multiple bots running at the same time, each performing searches against a different site.
- All code is written in Python 3
- Python libraries used:
  - Beautiful Soup
  - Requests
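For orientation, here is a minimal sketch of a single bot pass using those two libraries, assuming a sqlite3 database. The query parameter name, CSS class, table name, and schema are illustrative assumptions, not the repository's actual code.

```python
import sqlite3

import requests
from bs4 import BeautifulSoup

def run_search(search_url, keywords):
    """One bot pass: fetch results, display them, store the raw HTML."""
    # Fetch the first page of results (the 'q' parameter is an assumption)
    response = requests.get(search_url, params={'q': keywords}, timeout=10)
    response.raise_for_status()

    # Extract and display the relevant data (the 'product' class is illustrative)
    soup = BeautifulSoup(response.text, 'html.parser')
    for product in soup.find_all(class_='product'):
        print(product.get_text(strip=True))

    # Store the raw HTML in a database for later processing (assumed schema)
    con = sqlite3.connect('scraper.db')
    con.execute('CREATE TABLE IF NOT EXISTS raw_pages (url TEXT, html TEXT)')
    con.execute('INSERT INTO raw_pages VALUES (?, ?)', (response.url, response.text))
    con.commit()
    con.close()
```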
Perform the following steps to install:
- In a target directory, clone the repo:
```bash
git clone https://github.com/alittlebroken/scraperBot
```
- Create a Python virtual environment:
```bash
python -m venv venv
```
- Activate the virtual environment:
```bash
# Unix
source venv/bin/activate
# Windows
venv\Scripts\activate
```
- Install the required packages:
```bash
pip install beautifulsoup4 requests
```
- Open the setup.py file in your favourite text/code editor

Each customer is stored inside the customers list as a single tuple. To add more, just add another tuple.
Format:
```python
(
    id,          # Unique identifier for the customer
    name,        # Customer's name
    description  # Description of the customer
)
```
Example:
```python
customers = [
    (
        1,
        'techstore.com',
        'The premier PC part specialist'
    ),
]
```
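setup.py presumably writes this list to the database. Assuming a sqlite3-style connection named `dbc` (the same name used in the rules section below) and an assumed table layout, the insert might look like:

```python
# Assumed SQL; the real statement lives in setup.py
customers_sql = 'INSERT INTO customers (id, name, description) VALUES (?, ?, ?)'
dbc.executemany(customers_sql, customers)
```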
A scraper is added to the scrapers list as a single tuple. To add another scraper, just add another tuple to the list.
Format:
```python
(
    id,           # Unique id for the scraper
    name,         # Name for this scraper
    description,  # Description of the scraper
    enabled,      # 1 to enable and 0 to disable
    customerID,   # The id of the customer this scraper belongs to
    searchTerms,  # The terms to search on
    runFrequency, # Run the scraper once every X hours
)
```
Example:
```python
scrapers = [
    (
        1,
        'techstore-motherboards',
        'All motherboards on techstore.com',
        1,
        1,
        'motherboards',
        6
    ),
]
```
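To show how the `enabled` and `runFrequency` fields could drive a bot, here is a hedged sketch of a per-scraper loop. `run_scraper` is a hypothetical function standing in for the search/extract/store step; this is not main.py's actual code:

```python
import time

def bot_loop(scraper):
    """Run one scraper forever, once every runFrequency hours."""
    _id, name, _desc, enabled, customer_id, search_terms, run_frequency = scraper
    if not enabled:  # enabled == 0 disables the scraper entirely
        return
    while True:
        run_scraper(customer_id, search_terms)  # hypothetical search/extract/store step
        time.sleep(run_frequency * 3600)        # hours -> seconds
```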
Each customer should have their own list of rules, with each rule being its own tuple.
The rules we support currently are:
- url - The URL of the customer's search page
- searches - The CSS class that identifies the search results on a page
- products - The CSS class that identifies a single product in the search results
- title - The CSS class that contains the name of an individual product in the results
- price - The CSS class which contains the price of an item in the search results
Format:
```python
customer_scraper_rules = [
    (
        ruleName,   # One of the rule names above
        ruleValue,  # Value for this rule
        scraperID,  # The ID of the scraper associated with this rule
        customerID  # The ID of the customer associated with this rule
    ),
]
```
Example:
```python
techstore_mb_rules = [
    ('url', 'https://www.techstore.com/search', 1, 1),
    ('searches', 'class/search', 1, 1),
    ('products', 'class/ProductCardstyles__Wrapper-l8f8q8-1', 1, 1),
    ('title', 'class/ProductCardstyles__Title-l8f8q8-12', 1, 1),
    ('price', 'class/ProductCardstyles__PriceText-l8f8q8-14', 1, 1),
    ('stock', '', 1, 1)
]
```
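The `class/...` values suggest each rule names a CSS class. A minimal sketch of how a scraper might apply these rules with Beautiful Soup (the `class/` prefix handling and the rule lookup are assumptions about the repository's behaviour):

```python
from bs4 import BeautifulSoup

def css_class(rules, rule_name):
    """Look up a rule value and strip the assumed 'class/' prefix."""
    for name, value, _scraper_id, _customer_id in rules:
        if name == rule_name:
            return value.split('/', 1)[1]
    return None

def extract_products(html, rules):
    """Print the title and price of each product found in a results page."""
    soup = BeautifulSoup(html, 'html.parser')
    for product in soup.find_all(class_=css_class(rules, 'products')):
        title = product.find(class_=css_class(rules, 'title'))
        price = product.find(class_=css_class(rules, 'price'))
        if title and price:
            print(title.get_text(strip=True), price.get_text(strip=True))
```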
You need to add one line per set of customer rules in the `# Scraper rule inserts` section. Don't forget to change the second argument of the function below to the actual name you gave your customer rule list:
```python
dbc.execute(rules_sql, techstore_mb_rules)
```
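With a second set of rules you would add a matching line; `techstore_gpu_rules` below is a hypothetical second rule list:

```python
# Scraper rule inserts
dbc.execute(rules_sql, techstore_mb_rules)
dbc.execute(rules_sql, techstore_gpu_rules)  # hypothetical second rule list
```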
- Save and close the file
- Run the following command to create the customers, scrapers and rules:
```bash
python setup.py
```
- Run the following command to start the bot(s):
```bash
python main.py
```
- Press CTRL + C to end the program
Once you are happy that the bots are working as intended, you can delete the setup.py file.
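As noted above, multiple bots can run at the same time against different sites. One plausible way main.py could achieve this is one thread per enabled scraper; the sketch below reuses the hypothetical `bot_loop` from earlier and is an assumption about the design, not the repository's actual code:

```python
import threading

def start_bots(scrapers):
    """Launch one daemon thread per scraper; disabled ones exit immediately."""
    for scraper in scrapers:
        threading.Thread(target=bot_loop, args=(scraper,), daemon=True).start()
```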
TODO:
- Web Interface with Flask
- Add docker files for web, DB and scrapers
- Come up with a better name for the project
- Output startup info to console
- Use a scheduler library for the scrapers
- Use environment variables for common config values
- Use SQLAlchemy for scrapers
- Add command line functions to scrapers
- Send output to log instead of console