Consulting Web Scraper

This project is a Python web scraper that collects data from competitor websites, news articles, and market research reports. The collected data provides market insights that support competitive analysis and strategic decision-making.

Project Description

The Web Scraper systematically gathers data to help understand market trends, competitor strategies, and overall market dynamics. Key benefits include:

  • Market Insights: Gaining a deeper understanding of current market trends to identify emerging patterns and industry shifts.
  • Competitive Analysis: Collecting data on competitors' offerings, pricing, marketing campaigns, and market positioning.
  • Data-Driven Decisions: Empowering the company to make informed and strategic business decisions based on the collected data.

Focus Areas

  1. Competitor Websites:

    • McKinsey & Company: mckinsey.com
    • Boston Consulting Group: bcg.com
    • Deloitte: deloitte.com
    • Data to Scrape: Service offerings, case studies, client testimonials, thought leadership articles, and market insights.
  2. Industry News and Reports:

    • Bloomberg: bloomberg.com
    • Reuters: reuters.com
    • Financial Times: ft.com
    • Data to Scrape: Latest news articles, market reports, financial analysis, and global economic trends.
  3. Market Research Portals:

Setup Instructions

Virtual Environment Setup

  1. Clone the Repository:

    git clone https://github.com/intel00000/web_scraper_Hasmo.git
    cd web_scraper_Hasmo
  2. Create a Virtual Environment:

    • Windows:
      python -m venv venv
    • Linux & MacOS:
      python3 -m venv venv
  3. Activate the Virtual Environment:

    • Windows:
      .\venv\Scripts\Activate.ps1
    • Linux & MacOS:
      source ./venv/bin/activate
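
After activation, it can be useful to confirm that Python is actually running from the virtual environment before installing packages. The following is a minimal sketch and not part of the repository; the function name in_virtualenv is hypothetical. It relies on the fact that, inside an environment created with python -m venv, sys.prefix differs from sys.base_prefix.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    status = "active" if in_virtualenv() else "NOT active"
    print(f"Virtual environment {status}: {sys.prefix}")
```

If the script reports that the environment is not active, repeat the activation step for your platform before running pip install.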

Installing Required Libraries

  1. Install the Required Packages:
    pip install -r requirements.txt

Usage

  1. Running the Spiders:

    • To enable summary generation and Google Sheets updates, obtain an OpenAI API key and a Google service account private key (JSON):

      • Create a .env file in the repository root with the following content:
      OPENAI_API_KEY={Your openai API key}
      • Download the Google service account private key as JSON, save it to the repository root, and rename it to credentials.json.
      • The file should have a format like this:
       {
       "type": "service_account",
       "project_id": "",
       "private_key_id": "",
       "private_key": "",
       "client_email": "xxx@developer.gserviceaccount.com",
       "client_id": "",
       "auth_uri": "https://accounts.google.com/o/oauth2/auth",
       "token_uri": "https://oauth2.googleapis.com/token",
       "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
       "client_x509_cert_url": "",
       "universe_domain": "googleapis.com"
       }
    • Navigate to the desired scraper directory:

      • For Deloitte: cd deloitte_scraper
      • For McKinsey: cd mckinsey_scraper
    • To list all available spiders, run:

      scrapy list
    • To run a specific spider, use:

      scrapy crawl spider_name

      Replace spider_name with the name of the spider you wish to run.

    • Alternatively, to run all spiders at once, use:

      python run_all_spiders.py
  2. Data Storage:

    • The scraped data will be saved in the data/raw/ directory as CSV or JSON files.
  3. Configuring the Scraper:

    • Adjust configuration settings, such as target URLs, data points to extract, and output formats, in the settings.py file within the respective scraper directory (deloitte_scraper or mckinsey_scraper).
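
Before running the spiders with summaries and Google Sheets updates enabled, it may help to check that both secrets described in step 1 are in place. The following is a minimal sketch under stated assumptions, not code from the repository: the names read_env_file and check_prerequisites are hypothetical, and the list of credential fields treated as required is an assumption based on the sample credentials.json shown above.

```python
import json
from pathlib import Path

# Assumption for this sketch: these fields from the sample credentials.json
# must be non-empty for the key file to be usable.
REQUIRED_CREDENTIAL_FIELDS = ("private_key", "client_email", "token_uri")


def read_env_file(path):
    """Parse simple KEY=VALUE lines from a .env file into a dict."""
    values = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blank lines, comments, and malformed lines
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def check_prerequisites(root="."):
    """Return a list of problems found; an empty list means everything is in place."""
    root = Path(root)
    problems = []

    env_path = root / ".env"
    if not env_path.is_file():
        problems.append(".env file not found")
    elif not read_env_file(env_path).get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is missing or empty in .env")

    cred_path = root / "credentials.json"
    if not cred_path.is_file():
        problems.append("credentials.json not found")
        return problems

    try:
        creds = json.loads(cred_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        problems.append("credentials.json is not valid JSON")
        return problems

    if not isinstance(creds, dict):
        problems.append("credentials.json must contain a JSON object")
    else:
        if creds.get("type") != "service_account":
            problems.append('credentials.json "type" is not "service_account"')
        for field in REQUIRED_CREDENTIAL_FIELDS:
            if not creds.get(field):
                problems.append(f"credentials.json field {field!r} is empty")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

Run the script from the repository root; it prints one warning per problem and nothing when both files look complete. It only inspects the files locally and does not contact OpenAI or Google.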

Project Structure

web_scraper_Hasmo/
├── .env                             # add your OPENAI_API_KEY here
├── .gitignore
├── credentials.json                 # Google Cloud service account private key
├── README.md                        # Documentation for the project
├── requirements.txt                 # List of required Python packages
│
├── data/                            # Scraped data
│   └── raw/                         # Subdirectory containing raw CSV and JSON data files
│
├── deloitte_scraper/                # Contains the Scrapy project for Deloitte data
│   ├── scrapy.cfg                   # Scrapy configuration file
│   └── deloitte_scraper/
│       ├── items.py                 # Scraped items structure
│       ├── middlewares.py           # Middlewares for Scrapy
│       ├── pipelines.py             # Pipeline for processing scraped data
│       ├── settings.py              # Scrapy settings
│       └── spiders/                 # Spiders directory
│
├── mckinsey_scraper/                # Contains the Scrapy project for McKinsey data
│   ├── scrapy.cfg                   # Scrapy configuration file
│   └── mckinsey_scraper/
│       ├── items.py                 # Scraped items structure
│       ├── middlewares.py           # Middlewares for Scrapy
│       ├── pipelines.py             # Pipeline for processing scraped data
│       ├── settings.py              # Scrapy settings
│       └── spiders/                 # Spiders directory
│
├── notebooks/
│   ├── bcg_capabilities.ipynb       # Notebook for scraping BCG capabilities
│   ├── bcg_industries.ipynb         # Notebook for scraping BCG industries
│   ├── bcg_search_results.ipynb     # Notebook for scraping BCG search
│   ├── helper_functions.py          # Helper functions adapted from Scrapy pipelines
│   └── scrapy.ipynb                 # Notebook for testing
│
└── scripts/                         # Directory containing testing scripts
    ├── google_sheet_testing.py      # Google Sheets pipeline
    ├── openai_testing.py            # OpenAI API pipeline
    ├── sample_input.json            # Sample input JSON file for testing
    └── sample_output_with_summaries.json  # Sample output JSON with generated summaries
  • data/: Directory where the scraped data is saved.
  • venv/: Virtual environment directory.
  • config.py: Configuration file for the scraper settings.
  • requirements.txt: List of required Python packages.
  • scraper.py: Main script for scraping data.
  • README.md: Project documentation.
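
As a rough illustration of working with the output in data/raw/, the sketch below loads every CSV and JSON file from that directory using only the standard library. It is not part of the repository: the function name load_raw_records is hypothetical, and the sketch assumes each JSON file holds either a list of scraped items or a single item object.

```python
import csv
import json
from pathlib import Path


def load_raw_records(raw_dir="data/raw"):
    """Load all CSV and JSON files in raw_dir into a dict keyed by file name."""
    records = {}
    for path in sorted(Path(raw_dir).glob("*")):
        suffix = path.suffix.lower()
        if suffix == ".csv":
            # Each CSV row becomes a dict keyed by the header row.
            with path.open(newline="", encoding="utf-8") as handle:
                records[path.name] = list(csv.DictReader(handle))
        elif suffix == ".json":
            with path.open(encoding="utf-8") as handle:
                data = json.load(handle)
            # Assumption: a file holds a list of items; wrap a lone object in a list.
            records[path.name] = data if isinstance(data, list) else [data]
    return records
```

Calling load_raw_records() from the repository root returns a mapping from file name to a list of records, which can then be inspected or passed to further analysis. Files with other extensions are ignored.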

Contact

For any issues, questions, or contributions, please open an issue or submit a pull request on GitHub.