Before using any script, run pipenv shell
to enter the virtual environment.
Using the Twitter scraper requires registering as a Twitter developer and providing authentication keys. Place your keys in either .my_keys (in .gitignore) or config/keys.json. See the Twitter Developer page.
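The keys file is a small JSON document holding your Twitter API credentials. The field names below are an assumption for illustration; match whatever the scraper's configuration actually expects:

{
    "consumer_key": "...",
    "consumer_secret": "...",
    "access_token": "...",
    "access_token_secret": "..."
}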
viper_scraper.py twitter [-h] [-d Data Directory] [-t Tracking File]
[-l Limit] [--photos_as_limit]
-d Data Directory
: Directory to save results to.
-t Tracking File
: Path to a text file containing a list of phrases, one per line, to track (see the example after this list). See the Twitter page on filtering realtime tweets.
-l Limit
: If --photos_as_limit is set, the approximate number of images to scrape; otherwise, the approximate number of tweets to scrape.
--photos_as_limit
: If present, Limit refers to the number of images to scrape rather than the number of tweets.
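A tracking file is plain text with one phrase per line. For example, a hypothetical plane-themed tracking file might contain:

plane
airplane
boeing
airbus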
The Twitter scraper filters realtime tweets using the Twitter API. Text, metadata, and references to downloaded images are stored in data.csv under the specified directory.
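For downstream processing, data.csv can be read with Python's standard csv module. This is a minimal sketch: the directory name is whatever you passed to -d, and the column names are the scraper's own, so treat the fields as placeholders.

import csv

# "data_twitter" is a hypothetical -d directory
with open("data_twitter/data.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        print(row)  # a dict mapping the CSV's column names to values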
python viper_scraper.py yolo ...
The VIPER scraper also integrates You Only Look Once (YOLO) real-time object detection.
For each tweet that passes the filter, the scraper will:
- Download the original image, if present.
- Save a version of the image with bounding boxes and predictions labelled.
- Save a .json file containing the confidences for each class.
- Save text and metadata, along with references to these files, in data.csv under the specified directory.
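For reference, YOLO inference of this kind is commonly done with OpenCV's dnn module. The sketch below illustrates the general technique, including the confidence filtering and non-maxima suppression that the -c and -th options below control; it is an illustration, not the scraper's exact code.

import cv2
import numpy as np

# Load the Darknet model (e.g. yolov3.cfg / yolov3.weights)
net = cv2.dnn.readNetFromDarknet("yolo/yolov3.cfg", "yolo/yolov3.weights")

image = cv2.imread("example.jpg")
h, w = image.shape[:2]

# Forward pass on a 416x416 blob of the image
blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(net.getUnconnectedOutLayersNames())

boxes, confidences, class_ids = [], [], []
for output in outputs:
    for det in output:
        scores = det[5:]
        class_id = int(np.argmax(scores))
        confidence = float(scores[class_id])
        if confidence > 0.5:  # the -c CONFIDENCE filter (default 0.5)
            cx, cy, bw, bh = det[:4] * np.array([w, h, w, h])
            boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
            confidences.append(confidence)
            class_ids.append(class_id)

# Non-maxima suppression (the -th THRESHOLD argument, default 0.3)
keep = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.3)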
viper_scraper.py yolo [-h] [-d Data Directory] [-t Tracking File]
[-l Limit] [--photos_as_limit] --names NAMES
--config CONFIG --weights WEIGHTS [-c CONFIDENCE]
[-th THRESHOLD]
In addition to the arguments shared with the basic Twitter scraper, YOLO integration takes the following arguments:
--names NAMES
: A file containing the class names, one per line, associated with the weights and config file for YOLO, e.g. coco.names (see the example after this list).
--config CONFIG
: Config file for YOLO, e.g. yolov3.cfg.
--weights WEIGHTS
: Weights file for YOLO, e.g. yolov3.weights.
-c CONFIDENCE
: Minimum confidence to filter weak detections, default 0.5.
-th THRESHOLD
: Threshold when applying non-maxima suppression, default 0.3.
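A names file is just one class label per line; for reference, the standard Darknet coco.names begins:

person
bicycle
car
motorbike
aeroplane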
For example, to use the pretrained YOLO model (coco.names, yolov3.cfg, and yolov3.weights) with plane_tracking.txt, download the files and run:
python viper_scraper.py yolo -d data_yolo_planes -t config/plane_tracking.txt -l 1000 --names yolo/coco.names --config yolo/yolov3.cfg --weights yolo/yolov3.weights -c .5 -th .3
python viper_scraper.py instagram ...
This script and the associated utility scripts are based on Antonie Lin's non-API Instagram scraper, released under the MIT license. Visit his (now-archived) repository at:
https://github.com/iammrhelo/InstagramCrawler
This is a non-API Instagram scraper using Selenium. As such, it is liable to break as Instagram changes their site. I will try to maintain its integrity but please feel free to contribute.
Scrape n images and associated captions from either a user or a hashtag.
Before use, run
bash utils/get_gecko.sh
bash utils/get_phantomjs.sh
source utils/set_path.sh
Usage
viper_scraper.py instagram [-h] [-d DIR_PREFIX] [-q QUERY] [-n NUMBER] [-c] [-l] [-a AUTHENTICATION] [-f FIREFOX_PATH]
-d Directory Prefix
: The directory to save data to.
-q QUERY
: The target (user or hashtag) to crawl. Prefix hashtags with '#'.
-n NUMBER
: The number of posts to download.
-c Caption
: Add this flag to download captions along with photos.
-l Headless
: If set, uses the PhantomJS driver to run the script headless.
-a AUTHENTICATION
: Path to an authentication JSON file; necessary for headless mode (see the example after this list).
-f FIREFOX_PATH
: Path to the Firefox installation for Selenium.
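The authentication file is a small JSON document holding login credentials. The field names below are an assumption based on the upstream InstagramCrawler project; match whatever the script actually reads:

{
    "username": "your_username",
    "password": "your_password"
}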
For example,
python viper_scraper.py instagram -d data_insta_test -q "#art" -c -n 100
will scrape the first 100 photos and captions from the #art hashtag.
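For a sense of how Selenium-based scraping works, here is a minimal, illustrative sketch that opens a hashtag page and collects image URLs. The page structure, timing, and scrolling behavior are assumptions that will break as Instagram changes its markup; the actual scraper's logic differs.

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Firefox()  # requires geckodriver on PATH (see get_gecko.sh)
driver.get("https://www.instagram.com/explore/tags/art/")
time.sleep(3)  # crude wait for the page to render

urls = set()
for _ in range(5):  # scroll a few times to load more posts
    for img in driver.find_elements(By.TAG_NAME, "img"):
        src = img.get_attribute("src")
        if src:
            urls.add(src)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)

driver.quit()
print("collected", len(urls), "image URLs")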
With Docker and docker-compose installed, use the following commands to set up the system.
Start the network:
docker-compose up --build
To run any of the code above, after building the container, use a line like the following:
docker-compose run viper pipenv run python viper_scraper.py twitter -h
Stop the network:
docker-compose down
To start an interactive environment, start bash inside the container:
docker-compose run viper bash
Inside the shell you can run pipenv shell and then use all the commands above.
Stop the network:
docker-compose down
To remove the associated containers:
docker ps -a | awk '{ print $1,$2 }' | grep viper_scraper_viper | awk '{print $1 }' | xargs -I {} docker rm {}
List all Docker images:
docker images
Remove a specific image by its ID, e.g.:
docker image rm 502285389a42
Once this is done, you can run docker-compose up --build again and use the program as before.