Scrape Instagram's API with Puppeteer

Primary language: TypeScript. License: MIT.

Instamancer

Install | Usage | Comparison | Website | FAQ

Instamancer is a new type of scraping tool that leverages Puppeteer's ability to intercept requests made by a webpage to an API.

Read more about how Instamancer works here.
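
The interception idea can be sketched roughly as follows. This is not Instamancer's actual code: `PageLike` and `ResponseLike` are hypothetical minimal interfaces modelled on Puppeteer's `page.on("response", ...)`, `response.url()` and `response.json()`, and the `/graphql/query` URL fragment is an assumption about which requests carry post data.

```typescript
// Hypothetical minimal interfaces, modelled on the parts of Puppeteer's
// Page and HTTPResponse that this sketch relies on.
interface ResponseLike {
    url(): string;
    json(): Promise<unknown>;
}

interface PageLike {
    on(event: "response", handler: (response: ResponseLike) => void): unknown;
}

// Assumption: API calls can be recognised by a fragment of their URL.
const API_FRAGMENT = "/graphql/query";

// Listen to every response the page receives and hand the JSON body of
// matching API responses to `onData`. Other responses are ignored.
function interceptApi(page: PageLike, onData: (body: unknown) => void): void {
    page.on("response", (response) => {
        if (!response.url().includes(API_FRAGMENT)) {
            return;
        }
        response
            .json()
            .then(onData)
            .catch(() => {
                // The body was not valid JSON: skip this response.
            });
    });
}
```

The scraper never builds API requests itself in this sketch. It only observes what the page already fetches, which is the difference between request interception and API simulation.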

Features

  • Scrape hashtags, locations and users
  • Output JSON, CSV
  • Download images, albums, and videos
  • Batch scraping
  • API response validation

Data

Metadata that Instamancer is able to gather from posts:

  • Text
  • Timestamps
  • Tagged users
  • Accessibility captions
  • Like counts
  • Comment counts
  • Images (Thumbnails, Dimensions, URLs)
  • Videos (URL, View count, Duration)
  • Comments (Timestamp, Text, Like count, User)
  • User (Username, Full name, Profile picture, Profile privacy)
  • Location (Name, Street, Zip code, City, Region, Country)

Install

Linux

See Puppeteer troubleshooting

Enable user namespace cloning:

sysctl -w kernel.unprivileged_userns_clone=1

Or run without a sandbox:

# WARNING: unsafe
export NO_SANDBOX=true

Without downloading Chromium

To install Instamancer without downloading Chromium, set the PUPPETEER_SKIP_CHROMIUM_DOWNLOAD environment variable before installation:

export PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true

From this repository

Requires TypeScript

git clone https://github.com/ScriptSmith/instamancer.git
cd instamancer
npm install
npm install -g

From NPM

npm install -g instamancer

If you're installing globally as root, use the following command so that the Puppeteer dependency is installed:

sudo npm install -g instamancer --unsafe-perm=true

From NPX

npx instamancer

Usage

Command Line

$ instamancer
Usage: instamancer <command> [options]

Commands:
  instamancer hashtag [id]       Scrape a hashtag
  instamancer location [id]      Scrape a location
  instamancer user [id]          Scrape a user
  instamancer post [ids]         Scrape a comma-separated list of posts
  instamancer batch [batchfile]  Read newline-separated arguments from a file

Options:
  --help                  Show help                                    [boolean]
  --version               Show version number                          [boolean]
  --count, -c             Number of posts to download. 0 to download all
                                                                    [default: 0]
  --visible               Show browser on the screen            [default: false]
  --download, -d          Save images and videos from posts
                                                      [boolean] [default: false]
  --graft, -g             Enable grafting              [boolean] [default: true]
  --full                  Get the full list of posts and their details from the
                          API and web page            [boolean] [default: false]
  --video                 Download videos. Only works in full mode
                                                      [boolean] [default: false]
  --silent                Disable progress output     [boolean] [default: false]
  --strict                Throw an error if types from Instagram API have been
                          changed                     [boolean] [default: false]  
  --sync                  Synchronously download files between API requests
                                                      [boolean] [default: false]
  --threads, -k           The number of parallel download / upload threads
                                                           [number] [default: 4]
  --waitDownload, -w      When true, media will only download once scraping is
                          finished                    [boolean] [default: false]
  --filename, --file, -f  Name of the output file              [default: "[id]"]
  --filetype, --type, -t  Type of output file
                              [choices: "csv", "json", "both"] [default: "json"]
  --downdir               Directory / Container to save media
                                          [default: "downloads/[endpoint]/[id]"]
  --mediaPath, --mp       Store the paths of downloaded media in the
                          '_mediaPath' key            [boolean] [default: false]
  --logging               Level of logger
                   [choices: "error", "none", "info", "debug"] [default: "none"]
  --logfile               Name of the log file      [default: "instamancer.log"]
  --browser               Location of the browser. Defaults to the copy
                          downloaded at installation
  --swift                 Upload media to openstack's swift object storage
                          rather than saving to disk  [boolean] [default: false]

Examples:
  instamancer hashtag instagood -d          Download all the available posts,
                                            and their thumbnails from #instagood
  instamancer location 644269022 --count    Download 200 posts tagged as being
  200                                       at the Arc Du Triomphe
  instamancer user arianagrande             Download Ariana Grande's posts to a
  --filetype=csv --logging=info --visible   CSV file with a non-headless
                                            browser, and log all events

Source code available at https://github.com/ScriptSmith/instamancer
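
The defaults for --filename and --downdir contain bracketed placeholders such as [id] and [endpoint]. Below is a minimal sketch of how such a template could be expanded, assuming each placeholder is replaced by the matching command-line value. `expandTemplate` is a hypothetical helper, not part of Instamancer.

```typescript
// Hypothetical helper: replace every [name] placeholder in a template with
// the corresponding value. Unknown placeholders are left untouched.
function expandTemplate(template: string, values: Record<string, string>): string {
    return template.replace(/\[(\w+)\]/g, (match: string, name: string) => {
        const value: string | undefined = values[name];
        return value !== undefined ? value : match;
    });
}

// For the command `instamancer hashtag instagood -d`:
const placeholderValues = { endpoint: "hashtag", id: "instagood" };
const downdir = expandTemplate("downloads/[endpoint]/[id]", placeholderValues);
const filename = expandTemplate("[id]", placeholderValues);
// downdir is "downloads/hashtag/instagood", filename is "instagood"
```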

Module

ES2018 TypeScript example:

import * as Instamancer from "instamancer";

const options: Instamancer.IOptions = {
    total: 10
};

const hashtag = Instamancer.hashtag("beach", options);
(async () => {
    for await (const post of hashtag) {
        console.log(post);
    }
})();
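
Because the result is consumed with `for await`, any helper written against `AsyncIterable` should work with it. Below is a minimal sketch for gathering the first few posts into an array. `take` is a hypothetical helper, and `fakePosts` stands in for `Instamancer.hashtag(...)` so that the snippet runs without a browser.

```typescript
// Hypothetical helper: gather at most `limit` items from an async iterable.
async function take<T>(source: AsyncIterable<T>, limit: number): Promise<T[]> {
    const items: T[] = [];
    if (limit <= 0) {
        return items;
    }
    for await (const item of source) {
        items.push(item);
        if (items.length >= limit) {
            break; // Leaving the loop also tells the generator to finish.
        }
    }
    return items;
}

// Stand-in for a scraper generator; yields made-up post objects forever.
async function* fakePosts(): AsyncGenerator<{ id: number }> {
    for (let id = 1; ; id++) {
        yield { id };
    }
}

const firstThree: Promise<Array<{ id: number }>> = take(fakePosts(), 3);
```

With the real module, `take(Instamancer.hashtag("beach", options), 3)` would be the equivalent call, assuming the generator is passed in unchanged.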

Generator functions

Instamancer.hashtag(id, options);
Instamancer.location(id, options);
Instamancer.user(id, options);
Instamancer.post(ids, options);

Options

import { Type } from "io-ts";
import * as winston from "winston";

// Shape of Instamancer.IOptions. Every field is optional.
interface IOptions {
    // Total posts to download. 0 for unlimited
    total?: number;

    // Run Chrome in headless mode
    headless?: boolean;

    // Logging events
    logger?: winston.Logger;

    // Run without output to stdout
    silent?: boolean;

    // Time to sleep between interactions with the page
    sleepTime?: number;

    // Throw an error if type validation fails
    strict?: boolean;

    // Time to sleep when rate-limited
    hibernationTime?: number;

    // Enable the grafting process
    enableGrafting?: boolean;

    // Extract the full amount of information from the API
    fullAPI?: boolean;

    // Use a proxy in Chrome to connect to Instagram
    proxyURL?: string;

    // Location of the Chromium / Chrome binary executable
    executablePath?: string;

    // Custom io-ts validator
    validator?: Type<unknown>;
}
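
Several command-line flags appear to correspond to module options with matching descriptions. The mapping below is inferred from those descriptions rather than taken from Instamancer's source, and `CliFlags`, `ScrapeOptions` and `optionsFromFlags` are hypothetical names used only for this sketch.

```typescript
// Hypothetical subset of the flags listed under "Command Line".
interface CliFlags {
    count: number;     // --count
    visible: boolean;  // --visible
    graft: boolean;    // --graft
    full: boolean;     // --full
    silent: boolean;   // --silent
    strict: boolean;   // --strict
    browser?: string;  // --browser
}

// Local stand-in for the matching fields of Instamancer.IOptions.
interface ScrapeOptions {
    total: number;
    headless: boolean;
    enableGrafting: boolean;
    fullAPI: boolean;
    silent: boolean;
    strict: boolean;
    executablePath?: string;
}

// Defaults as printed in the command-line help text.
const defaultFlags: CliFlags = {
    count: 0,
    visible: false,
    graft: true,
    full: false,
    silent: false,
    strict: false,
};

function optionsFromFlags(flags: Partial<CliFlags>): ScrapeOptions {
    const merged: CliFlags = { ...defaultFlags, ...flags };
    const result: ScrapeOptions = {
        total: merged.count,
        headless: !merged.visible, // a visible browser is a non-headless one
        enableGrafting: merged.graft,
        fullAPI: merged.full,
        silent: merged.silent,
        strict: merged.strict,
    };
    if (merged.browser !== undefined) {
        result.executablePath = merged.browser;
    }
    return result;
}

// Roughly `instamancer hashtag beach --count 10 --visible`:
const exampleOptions = optionsFromFlags({ count: 10, visible: true });
```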

Comparison

A comparison of Instagram scraping tools. Please suggest more tools and criteria through a pull request.

To see a speed comparison, visit this page

The feature criteria compared are: Hashtags, Users, Locations, Posts, Login not required, Private feeds, Batch mode, Command-line, Library/Module, Download media, and Download metadata.

Tool | Feature criteria met (of 11) | Scraping method | Daily builds | Main language
Instamancer | 10 | Web API request interception | ✔️ | TypeScript
Instaphyte | 7 | Web API simulation | ✔️ | Python
Instaloader | 10 | Web API simulation | | Python
Instalooter | 10 | Web API simulation | | Python
Instagram crawler | 7 | Web DOM reading | | Python
Instagram Scraper | 8 | Web API simulation | | Python
Instagram Private API | 9 | App and Web API simulation | | Python
Instagram PHP Scraper | 9 | Web API simulation | | PHP