
scrape_linkedin

Introduction

scrape_linkedin is a python package to scrape all details from public LinkedIn profiles, turning the data into structured json. You can scrape Companies and user profiles with this package.

Warning: LinkedIn has strong anti-scraping measures and may blacklist IPs that make unauthenticated or unusual requests.

Table of Contents

  • Installation
  • Tests
  • Getting & Setting LI_AT
  • Usage
  • Scraping in Parallel
  • Issues

Installation

Install with pip

Run pip install git+https://github.com/austinoboyle/scrape-linkedin-selenium.git

Install from source

git clone https://github.com/austinoboyle/scrape-linkedin-selenium.git

Run python setup.py install

Tests

Tests are (so far) only run on static HTML files: one is a LinkedIn profile; the other is used to test some utility functions.

Getting & Setting LI_AT

Because of Linkedin's anti-scraping measures, you must make your selenium browser look like an actual user. To do this, you need to add the li_at cookie to the selenium session.

Getting LI_AT

  1. Navigate to www.linkedin.com and log in
  2. Open browser developer tools (Ctrl-Shift-I or right click -> inspect element)
  3. Select the appropriate tab for your browser (Application on Chrome, Storage on Firefox)
  4. Click the Cookies dropdown on the left-hand menu, and select the www.linkedin.com option
  5. Find and copy the li_at value

Setting LI_AT

There are two ways to set your li_at cookie:

  1. Set the LI_AT environment variable
    • $ export LI_AT=YOUR_LI_AT_VALUE
    • On Windows: set LI_AT=YOUR_LI_AT_VALUE
  2. Pass the cookie as a parameter to the Scraper object.

    >>> with ProfileScraper(cookie='YOUR_LI_AT_VALUE') as scraper:
    ...     profile = scraper.scrape(user='austinoboyle')

A cookie value passed directly to the Scraper will override your environment variable if both are set.
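The precedence rule can be sketched as a small helper (hypothetical — the package resolves this internally; `resolve_li_at` is not part of its API):

```python
import os

def resolve_li_at(explicit_cookie=None):
    """Return the li_at value to use: a cookie passed directly
    takes precedence over the LI_AT environment variable."""
    return explicit_cookie or os.environ.get("LI_AT")

os.environ["LI_AT"] = "env_cookie_value"
print(resolve_li_at())                         # env_cookie_value
print(resolve_li_at("explicit_cookie_value"))  # explicit_cookie_value
```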

Usage

Command Line

scrape_linkedin comes with a command-line interface, scrapeli, built with click.

Note: CLI only works with Personal Profiles as of now.

Options:

  • --url : Full Url of the profile you want to scrape
  • --user: the USER portion of www.linkedin.com/in/USER
  • -a --attribute : return only a specific attribute (default: return all attributes)
  • -i --input_file : Raw path to html file of the profile you want to scrape
  • -o --output_file: Raw path to output file for structured json profile (just prints results by default)
  • -h --help : Show this screen.

Examples:

  • Get Austin O'Boyle's profile info: $ scrapeli --user=austinoboyle
  • Get only the skills of Austin O'Boyle: $ scrapeli --user=austinoboyle -a skills
  • Parse stored html profile and save json output: $ scrapeli -i /path/file.html -o output.json

Python Package

Profiles

Use the ProfileScraper class to scrape profiles.

from scrape_linkedin import ProfileScraper

with ProfileScraper() as scraper:
    profile = scraper.scrape(user='austinoboyle')
print(profile.to_dict())

Profile - the class that exposes all information pulled from a profile as properties. It also has a to_dict() method that returns all of the data as a dict.

from scrape_linkedin import Profile

with open('profile.html', 'r') as profile_file:
    profile = Profile(profile_file.read())

print(profile.skills)
# [{...}, {...}, ...]
print(profile.experiences)
# {jobs: [...], volunteering: [...], ...}
print(profile.to_dict())
# {personal_info: {...}, experiences: {...}, ...}

Structure of the fields scraped

  • personal_info
    • name
    • company
    • school
    • headline
    • followers
    • summary
  • skills
  • experiences
    • volunteering
    • jobs
    • education
  • interests
  • accomplishments
    • publications
    • certifications
    • patents
    • courses
    • projects
    • honors
    • test scores
    • languages
    • organizations
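As a sketch of working with the scraped structure, the dict below mirrors the field layout above (the sample values and the exact keys inside each skill entry are hypothetical, not taken from the library):

```python
# Hypothetical sample shaped like the output of Profile.to_dict()
profile_data = {
    "personal_info": {"name": "Jane Doe", "headline": "Software Engineer",
                      "company": "ExampleCorp", "school": "Example University",
                      "followers": "500+", "summary": "..."},
    "skills": [{"name": "Python"}, {"name": "Selenium"}],
    "experiences": {"jobs": [], "volunteering": [], "education": []},
    "interests": [],
    "accomplishments": {"publications": [], "certifications": []},
}

# Pull a flat list of skill names out of the nested structure
skill_names = [skill["name"] for skill in profile_data["skills"]]
print(profile_data["personal_info"]["name"])  # Jane Doe
print(skill_names)                            # ['Python', 'Selenium']
```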

Companies

Use the CompanyScraper class to scrape companies.

from scrape_linkedin import CompanyScraper

with CompanyScraper() as scraper:
    company = scraper.scrape(company='facebook')
print(company.to_dict())

Company - the class that exposes all information pulled from a company page as properties. It has three properties: overview, jobs, and life. Overview is the only one currently implemented.

from scrape_linkedin import Company

with open('overview.html', 'r') as overview, \
        open('jobs.html', 'r') as jobs, \
        open('life.html', 'r') as life:
    company = Company(overview.read(), jobs.read(), life.read())

print(company.overview)
# {...}

Structure of the fields scraped

  • overview
    • name
    • industry
    • description
    • location
    • website
    • year_founded
    • company_type
    • company_size
    • num_employees
  • jobs NOT YET IMPLEMENTED
  • life NOT YET IMPLEMENTED

config

Pass these keyword arguments into the constructor of your Scraper to override default values. You may (for example) want to decrease/increase the timeout if your internet is very fast/slow.

  • cookie {str}: li_at cookie value (overrides env variable)
    • default: None
  • driver {selenium.webdriver}: driver type to use
    • default: selenium.webdriver.Chrome
  • driver_options {dict}: kwargs to pass to driver constructor
    • default: {}
  • scroll_pause {float}: time in seconds to pause between scroll increments
    • default: 0.1
  • scroll_increment {int}: number of pixels to scroll down each time
    • default: 300
  • timeout {float}: default time to wait for async content to load
    • default: 10
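For example, a scraper tuned for a slow connection might raise the timeout and slow the scrolling (a sketch only — running it requires a working driver and a valid li_at cookie):

```python
from scrape_linkedin import ProfileScraper

# Longer timeout and gentler scrolling for a slow connection;
# the cookie value here is a placeholder, not a real li_at.
with ProfileScraper(cookie='YOUR_LI_AT_VALUE',
                    timeout=30,
                    scroll_pause=0.5,
                    scroll_increment=200) as scraper:
    profile = scraper.scrape(user='austinoboyle')
```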

Scraping in Parallel

New in version 0.2: built-in parallel scraping functionality. Note that the up-front cost of starting a browser session is high, so for this to be beneficial you will want to be scraping many (> 15) profiles.

Example

from scrape_linkedin import scrape_in_parallel, CompanyScraper

companies = ['facebook', 'google', 'amazon', 'microsoft', ...]

# Scrape all companies, output to 'companies.json', using 4 browser instances
scrape_in_parallel(
    scraper_type=CompanyScraper,
    items=companies,
    output_file="companies.json",
    num_instances=4
)

Configuration

Parameters:

  • scraper_type {scrape_linkedin.Scraper}: Scraper to use
  • items {list}: List of items to be scraped
  • output_file {str}: path to output file
  • num_instances {int}: number of parallel instances of selenium to run
  • temp_dir {str}: name of temporary directory to use to store data from intermediate steps
    • default: 'tmp_data'
  • driver {selenium.webdriver}: driver to use for scraping
    • default: selenium.webdriver.Chrome
  • driver_options {dict}: dict of keyword arguments to pass to the driver function.
    • default: scrape_linkedin.utils.HEADLESS_OPTIONS
  • **kwargs {any}: extra keyword arguments to pass to the scraper_type constructor for each job

Issues

Report bugs and feature requests on the project's GitHub issue tracker.