Viper Scraper

Scraping and ingesting multi-modal data from online social networks with object detection integration

Set-Up

Before using any script, run pipenv shell to enter the virtual environment.
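If you have not created the environment yet, install the dependencies first (this assumes the repository's Pipfile):

pipenv install
pipenv shell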

Using the Twitter scraper requires registering as a Twitter developer and providing authentication keys. Place your keys in either .my_keys (in .gitignore) or config/keys.json. See the Twitter Developer page.
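A keys file is a small JSON document holding your Twitter credentials. The field names below are illustrative (they follow Twitter's standard credential names); check config/keys.json in this repo if they differ:

{
    "consumer_key": "YOUR_CONSUMER_KEY",
    "consumer_secret": "YOUR_CONSUMER_SECRET",
    "access_token": "YOUR_ACCESS_TOKEN",
    "access_token_secret": "YOUR_ACCESS_TOKEN_SECRET"
}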

Scraping Twitter

viper_scraper.py twitter [-h] [-d Data Directory] [-t Tracking File] 
                         [-l Limit] [--photos_as_limit]

-d Data Directory : Directory to save results to

-t Tracking File : Path to a text file containing a list of phrases, one per line, to track (see the example below). See the Twitter documentation on filtering realtime tweets.

-l Limit : If --photos_as_limit is set, the approximate number of images to scrape; otherwise, the approximate number of tweets to scrape.

--photos_as_limit : If present, Limit refers to the number of images to scrape rather than the number of tweets.

The Twitter scraper filters realtime tweets using the Twitter API. Text, metadata, and references to downloaded images are stored in data.csv under the specified directory.
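A tracking file is plain text with one phrase per line, for example:

plane spotting
airport
boeing 747

To collect roughly 500 tweets matching those phrases (the directory and file names here are illustrative):

python viper_scraper.py twitter -d data_twitter -t config/tracking.txt -l 500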

YOLO integration with Twitter

python viper_scraper.py yolo ...

The VIPER scraper also integrates You Only Look Once (YOLO) real-time object detection.

For each tweet that passes the filter, the scraper will:

  1. Download the original image, if present.
  2. Save a version of the image with bounding boxes and predicted labels drawn.
  3. Save a .json file containing the confidence scores for each class.
  4. Save text and metadata, along with references to these files, in data.csv under the specified directory.

viper_scraper.py yolo [-h] [-d Data Directory] [-t Tracking File]
                      [-l Limit] [--photos_as_limit] --names NAMES
                      --config CONFIG --weights WEIGHTS [-c CONFIDENCE]
                      [-th THRESHOLD]

In addition to the arguments shared with the basic Twitter scraper, YOLO integration takes the following:

--names NAMES : A file containing the names, one per line, associated with the weights and config file for YOLO, e.g. coco.names.

--config CONFIG : Config file for YOLO, e.g. yolov3.cfg.

--weights WEIGHTS : Weights file for YOLO, e.g. yolov3.weights.

-c CONFIDENCE : Minimum confidence to filter weak detections, default 0.5.

-th THRESHOLD : Threshold when applying non-maxima suppression, default 0.3.
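For reference, a detection pass driven by these parameters commonly looks like the OpenCV sketch below. This is an illustration of the technique using OpenCV's DNN module, not necessarily the scraper's exact code, and the image path is a placeholder:

import cv2
import numpy as np

# Load the network from the files passed via --config and --weights
net = cv2.dnn.readNetFromDarknet("yolo/yolov3.cfg", "yolo/yolov3.weights")
names = open("yolo/coco.names").read().strip().split("\n")

image = cv2.imread("photo.jpg")  # placeholder path
h, w = image.shape[:2]

# YOLOv3 takes a 416x416 blob with pixel values scaled to [0, 1]
blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(net.getUnconnectedOutLayersNames())

boxes, confidences, class_ids = [], [], []
for output in outputs:
    for detection in output:
        scores = detection[5:]
        class_id = int(np.argmax(scores))
        confidence = float(scores[class_id])
        if confidence > 0.5:  # -c CONFIDENCE: drop weak detections
            cx, cy, bw, bh = detection[0:4] * np.array([w, h, w, h])
            boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
            confidences.append(confidence)
            class_ids.append(class_id)

# -th THRESHOLD: non-maxima suppression removes overlapping boxes
keep = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.3)
for i in np.array(keep).flatten():
    print(names[class_ids[i]], confidences[i], boxes[i])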

For example, to use the pretrained YOLO model (coco.names, yolov3.cfg, and yolov3.weights) with plane_tracking.txt, download the files and run:

python viper_scraper.py yolo -d data_yolo_planes -t config/plane_tracking.txt -l 1000 --names yolo/coco.names --config yolo/yolov3.cfg --weights yolo/yolov3.weights -c .5 -th .3

Scraping Instagram

python viper_scraper.py instagram ...

This script and the associated utility scripts are based on Antonie Lin's non-API Instagram scraper, released under the MIT license. Visit his (now-archived) repository at:

https://github.com/iammrhelo/InstagramCrawler

This is a non-API Instagram scraper using Selenium. As such, it is liable to break as Instagram changes its site. I will try to keep it working, but please feel free to contribute.

Scrape n images and associated captions from either a user or a hashtag.

Before use, run

bash utils/get_gecko.sh
bash utils/get_phantomjs.sh
source utils/set_path.sh

Usage

viper_scraper.py instagram [-h] [-d DIR_PREFIX] [-q QUERY] [-n NUMBER] [-c]
                           [-l] [-a AUTHENTICATION] [-f FIREFOX_PATH]

-d DIR_PREFIX : The directory to save data to.

-q QUERY : The target (user or hashtag) to crawl. Prefix hashtags with '#'.

-n NUMBER : The number of posts to download.

-c : Add this flag to download captions along with photos.

-l : If set, use the PhantomJS driver to run the script headless.

-a AUTHENTICATION : Path to an authentication JSON file; required when running headless.

-f FIREFOX_PATH : Path to the Firefox installation for Selenium.
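For headless runs, the authentication file holds your Instagram login. The field names below are an assumption based on the upstream InstagramCrawler project; check the repo if they differ:

{
    "username": "your_instagram_username",
    "password": "your_instagram_password"
}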

For example,

python viper_scraper.py instagram -d data_insta_test -q "#art" -c -n 100

will scrape the first 100 photos and captions from the #art hashtag.

Running with Docker Compose

With Docker and Docker Compose installed, use the following commands to run the system.

Start the network

docker-compose up --build

To run any of the commands above after building the container, use the following pattern:

docker-compose run viper pipenv run python viper_scraper.py twitter -h

Stop the network

docker-compose down

To work in the running environment, start bash:

docker-compose run viper bash

You can then run pipenv shell inside the container and use all of the commands above from that shell.

Remove containers and re-run

Stop the network

docker-compose down

To remove associated containers

docker ps -a | awk '{ print $1,$2 }' | grep viper_scraper_viper | awk '{print $1 }' | xargs -I {} docker rm {}
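
Alternatively, docker-compose rm viper removes the stopped containers for the viper service without the manual pipeline above.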

List all Docker images

docker images

Remove a specific image by ID (the ID below is an example; use docker images to find yours)

docker image rm 502285389a42

Once the old image is removed, you can rebuild with docker-compose up --build and run the program again.