/web-scraper

Tool to download information from web pages

Primary language: Python

Web scraper for the list of companies on OfferZen

  • Download company information so that I can customize how it is processed.
  • Use as a command-line client to process parts as necessary.

Using

  • Framework
    • ~~Scrapy - Framework for scraping a website~~
    • pip install --user Scrapy
  • Libraries
    • BeautifulSoup
    • Requests
    • click - create a command-line interface for the package
    • YAML - store config info in a YAML file
    • time - sleep between requests
    • random - shuffle the list of links
  • Database
    • MongoDB - using MongoDB (the mongod server) for data storage.
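
The time and random pieces above can be combined into a small polite download loop. This is a sketch, not the project's actual code; fetch_all and its defaults are hypothetical.

```python
import random
import time

def fetch_all(links, delay=2.0, session=None):
    """Download each link in random order, sleeping between requests.

    `session` defaults to a requests.Session; anything with a .get()
    returning an object with a .text attribute will also work.
    """
    if session is None:
        import requests  # third-party: pip install --user requests
        session = requests.Session()
    links = list(links)
    random.shuffle(links)  # shuffle so pages aren't hit in listing order
    pages = {}
    for url in links:
        pages[url] = session.get(url).text  # keep the raw HTML as-is
        time.sleep(delay)                   # be polite between requests
    return pages
```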

Environment

  • Virtualenv - using virtualenvwrapper.
    • Explicitly using python36
    • mkvirtualenv -a . -p python36 jobscraper
  • Starting out with Requests, BeautifulSoup4 and saving in MongoDb.
  • Install packages - pip install beautifulsoup4 requests pymongo lxml

Setup

  • Config settings
    • Use ruamel.yaml for reading the YAML file.
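
The notes don't spell out the config keys, so the file below is only a guess at what such a config.yaml might hold; it can be read with ruamel.yaml's YAML().load().

```yaml
# Hypothetical settings for the scraper - key names are illustrative only.
base_url: https://www.offerzen.com/companies
delay_seconds: 2
mongo:
  host: localhost
  port: 27017
  database: jobscraper
```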

Issues

  • UTF-8 errors when trying to save the result of a BeautifulSoup-parsed page.
    • Save the response.text instead; there is no need to transform with BeautifulSoup yet.
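
A minimal sketch of that fix, assuming a pymongo collection is already at hand; the document layout (url/html fields) is an assumption, not the project's actual schema.

```python
def save_page(collection, url, response_text):
    """Upsert the raw HTML string keyed by URL.

    Storing response.text (a plain str) avoids the UTF-8 errors seen
    when trying to persist a parsed BeautifulSoup object.
    """
    collection.update_one(
        {"url": url},                                 # match on the page URL
        {"$set": {"url": url, "html": response_text}},  # store the raw HTML
        upsert=True,                                  # insert if not present
    )
```

With pymongo this would be called as save_page(client.jobscraper.pages, url, response.text); the database and collection names here are hypothetical.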

Flow

  1. Download main page with links to all the company pages
    1. Save the raw response.text in MongoDB
    2. Parse the main page to get links for all the companies
  2. Using the links from the main page, download the individual company pages.
  3. Save the pages in MongoDB.
  4. Process company details
    1. From main page
      1. City option list
      2. Technology option list
      3. Individual company info:
        • Elevator pitch
        • Location
        • Company Size
        • Technologies
        • City category - data-cities
        • Technology stack - data-tech-services
        • Company Id - data-id
    2. Retrieve information
      1. Company name
      2. Company url
      3. Company stack
      4. Company address/location
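
Step 4 above could start from something like the sketch below. The tile markup is invented for illustration; OfferZen's real HTML will differ, but the data-id, data-cities, and data-tech-services attributes match the notes.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Invented sample of one company tile on the main page.
SAMPLE = """
<div class="company-tile" data-id="42" data-cities="cape-town"
     data-tech-services="python django">
  <a href="/companies/acme">Acme</a>
</div>
"""

def parse_companies(html):
    """Extract per-company details from the main page HTML."""
    soup = BeautifulSoup(html, "html.parser")
    companies = []
    for tile in soup.select("div[data-id]"):  # each tile carries a data-id
        link = tile.find("a")
        companies.append({
            "id": tile["data-id"],
            "cities": tile["data-cities"].split(),
            "stack": tile["data-tech-services"].split(),
            "name": link.get_text(strip=True) if link else None,
            "url": link["href"] if link else None,
        })
    return companies
```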