/instacabra

Like a "chupacabra" instacabra sucks instagram entire data such as Profile, Movies and Photos directly from the web profiles

Primary LanguageRuby

N|Solid

It's a Ruby gem to rip Instagram data such as posts, images, videos or users profiles info.There is also a user interface to fetch this data. This article provides the context bellow:

  • Gem documantation, features and How-To
  • User Interface and Architecture
  • Examples and how-tos

Table of Contents

  1. Gem 1.1 Basics 1.2 Drivers 1.3 Setup 1.4 Ripping a profile 1.5 Filtering profiles to be ripped 1.6 API Reference and Documentation
  2. Ripper-UI - User Interface Application
  3. Support

1 Gem

The gem was developed to fetch data from Instagram webpage through web crawler using Capybara to manipulate the page and PhantomJs to execute in background and headless. The gem depends on the bellow gems:

Gem Purpose
byebug Debugging and tests
capybara Simulates users interaction with the webpage
nokogiri Fetch and parse HTML code from pages
open-uri Http requests from web
poltergeist A PhantomJS driver for Capybara
PhantomJS A external driver to run tests headless
  • Optionally is possible to use other drives like Chromedriver, IEDriver, etc.

1.1 Basics

It has the simple gem skeleton where the main code stay in lib diretory. Drivers folder has the drivers to manipulate webpage interactions. Instagram folder has the main classes and modules. Ripper folder has the gem version file. Ripper.rb: Main module responsible to setup and prepare gem how to fetch data. Instagram.json: The main output file to get as example when data is extracted. Location.rb: This module parse data from a location profile. Profile.rb: This module parse data from an users profile. Media.rb: Is the core module to get data from both user and location profile. It is responsible to build structure, navigate pages, extract data, parse html codes and generate outputs.

The gem can be used locally installing through the bundler or starting th console mode of gem.

Bundle Installation:
$ bundle install [PATH TO GEM]
$ irb
$ require 'instagram-ripper'
#code
Console Mode:
$ cd [PATH TO GEM]
$ bin/console
#code

1.2 Drivers

The gem allow to run processes into a browser like a Chrome, Internet Explorer or Firefox. But native it runs headless (without a browser, only commands) with the software PhantomJS. Initially the gem provides support to Mozillla Selenium webdriver and Poltergeist (headless) driver. If you want to run through other browsers like Chrome or internet Explorer, you need to implement and extend a new module. New modules are in the folder lib/drivers. Base.rb is the main module and it has the method initialize_driver(driver) where param is the symbol of your desired driver. To implement new driver, you can copy other modules and renames files and module name to your desired name so, on initialize methdos, define your driver setup.

1.3 Setup

The main configuration is the module Ripper (lib/instagram/ripper.rb). In this module, you can set up the bellow constants: INSTAGRAM_URI: The base Instagram URL. Default: https://instagram.com/ DEFAULT_DRIVER: Driver responsible to run automation scripts on the webpages. Default is poltergeist to run headless (without a browser) and to be more performatic. If you want to see what is happening during the execution, run with a browser driver. The gem has implemented Selenium Firefox browser to test scripts. Change this value to :selenium to see web navigation. If you desire to run another browsers, implement a new module and configure you driver. DEFAULT_MEDIA_LIMIT: Limit of media to be ripped. Default is 30 but actually has no limit, as client required. It is important to se a limit because with no limits the process can last long and kill all the process. USER_AGENT: User agent to be used and simulated when the gem navigates on web. Some web pages block IPs when it requires many requests on the page. Using an agent it simulates a real user and avoid to be blocked. Default vaue is: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36". Feel free to use any agent. READ_TIMEOUT: Limit time to wait page to respond, used for Nokogiri.Default is 5 seconds.

Methods

authenticate (username, password)

Method to authenticate an user to Instagram passing string of username and password. It is usefull to private profiles that requer authentication on Instagram. Caution: It does not work if the profile that your trying to fetch has not auhotized logged user on gem session.

configure_driver (:symbol)

Pass a symbol representative of a driver you want to configure. The module driver has to be implemented. This method can be utilized during a fetch process.

Setting up a Profile

Before start to rip an Instagram Profile, you need to initialize a new object. It can be an User or Location Profile. In case of Location profile the param may be an integer represanting ID identificator of the location.

user = Instagram::Profile.new "rpossan"
# Or Location Profile
location = Instagram::Location.new 012345678

This return a new object with all profile data. It has no media data like posts.

1.4 Ripping posts from a Profile

Having a variable with an user or location profile, you can fetch media like video, photo or both. To do this a profile has 3 main methods to fetch data: fetch_media,fetch_videos, fetch_photos. Fetching all type of post, fetching only videos posts or only photos posts. Example:

profile.fetch_media # Fetch all kind of media
profile.fetch_photos # Fetch all photos from profile
profile.fetch_videos # Fetch all videos from profile
Fetching with limit

You can limit the quantity of posts to be fetched passing a integer value as parameter.

profile.fetch_media 10
profile.fetch_photos 2000
profile.fetch_videos 9999
Fetching in intersection

You can fetch data in tersection from initial value until desired. But it no includes filtered posts. If you fetch videos, but in the intersection has only 2 posts of videos, your return array will have only 2 posts, not until fetch 10 videos. This feature is useful to paginate and fetch posts in batch. But it affects the performance of ripping process because, eg. to fetch 200..1000, the automated script will navigate until find the 200 post. It consumes the network and can be influented.

profile.fetch_media 0..10 # Fetch media from last post until 10 ago.
profile.fetch_videos 10..20 # Fetch videos from 10 until 20 posts.
profile.fetch_photos -20..50 # Fetch from 0 until 50.

1.5 Filtering posts to be ripped

The gem has parameters to be configured then it filters posts to be fetched. Actually has 2 valid parameters to be passed on method fetch (fetch_media, fetch_photos or fetch_videos). The parameters to be passed may be a hash of key and values. has_location: It filters if post has or not location tagged.The value can be a boolean value true | false. For default it is false, fetching all kind of posts. If you pass true, only store posts with location tagged to be extraced.When you use it in a location profile, it will fetchh all posts because all posts of a location profile has location tagged (own location). from: This parameter represents a unix timestamp to be fetched from. Default is 0. If you pass a integer as th parameter, gem will fetch data only posts after this timestamp. Examples of use:

# Fetch only posts with location tagged from last post until 200
profile.fetch_photos 0..200, { has_location: true }

# If is a Location profile, it is useless. All posts of location profile is tagged.
location_profile.fetch_media 0..200, { has_location: true }

# Fetch posts from 20/02 until now.
profile.fetch_photos 0..200, { from: 1487616668 }

# It is the same withou the parameter. Will navigate all posts.
profile.fetch_videos 0..200, { from: 0 }

# Fetch all posts from 20/02 until now that has location tagged.
profile.fetch_media 50..50, { has_location: true, from: 1487616668 }

1.5 API Reference and Documantation

Drivers::Base

initialize_driver([symbol])

Initialize a new session of driver for the main module Ripper

Drivers::Poltergeist

initialize

Set up the initialization session for Poltergeist driver to object. It requires PhantomJS installed on environment.

session

Return a new session object of Poltergeist.

Drivers::Selenium

initialize

Set up the initialization session for Selenium driver to object. It requires Selenium driver installed on environment.

session

Return a new session object of Selenium webdriver.

Instagram::Profile

initialize(String)

String value represanting username profile. It can be nil and it returns a Profile object.

get_class

Return kind of class of object. Profile or Location..

html_content

Return the html content (code) of profile fetched.

is_private?

Return a boolean represanting if profile is private or not.

fetch!

Rip html content of profile of instanced object. It is not the posts of a profile, it is profile data.

configure_driver(symbol)

Configure or reconfigure a new session driver.

Instagram::Location

initialize(integer)

Optional parameter integer representing the ID identificator of a location. It returns a new Location object.

close_connection!

Quit and close connection with webdriver.

media

Returns media object of profile.

compact

Return a compacted hash of location attributes.

html_content

Returns html content of Location profile.

fetch!

Fetch profile data for Location.

configure_driver(symbol)

Configure or reconfigure a new session with webdriver.

Instagram::Media

extract_media

Return an array only with media attributes of object.

extract

Return an array with the given json format. It has the header of json and media data. To convert the return in an explicit json, you canuse the method .to_json.

build!

Parse html fetched to object structure.

is_video?

Returns true if is a post of video or false.

is_profile?

Returns true if is an user profile or false.

is_location?

Returns true if is a location profile or false.

identifier

Returns main identificator of profile.If is user profile, returns the username.If is location, returns a number of ID.

fetch_videos

Fetch only posts of videos from an user or location. It can be parametrized.

fetch_photos

Fetch only posts of photos from an user or location. It can be parametrized.

fetch_media

Fetch all kind of posts both user or location. It can be parametrized.

is_valid?

Validates media from a profile.

fetch_usertags

Returns tags tagged on post

load_scrolling

Scrolls windows until final of loaded html.

fetch_media_by_index(index)

Get data from a post by an integer index. If it was not loaded, you have to scroll until the index of post shows.

load_more

It triggers the click on button "Load more". It is required on the first time because the segment is through the infinite scroll of page.

loaded_media

Returns the size of loaded media until the moment.

set_limit

Set logical limit for fetching posts.

set_offset

Set logical offset for fetching posts.

Instagram::Validation

validate_location(Location)

Validates if ID of location object is null or invalid value.

validate_user(string)

Validate if username is empty.

validate_html_content(Profile)

Validates if profile has content. If not fetched, will raise an validation error.

validate_permission(profile)

Validates if profiles is private or has not permission to view your posts.

2.0 Ripper-UI - User Interface Application

It is the user interface application to run most gem features as user friendly. It is a Ruby On Rails application that uses the gem locally. You can set the gem on Gemfile passing the path to gem directory. This application requires:

  • SQLite3: to store data
  • Sidekiq: for background jobs
  • Redis Server: for sidekiq jobs

3.0 Support

Both gem and ui application was developed by Ronaldo Possan. Feel free to contact me for support doubts, help, architecture or consult bugs: @rpossan or ronaldo.possan@gmail.com I will not mantain more gem and ui application.