# Gnip Python Search API Utilities
This package includes two utilities:
- Gnip Search API interaction, including Search V2 and paging support
- Timeseries analysis and plotting
## Installation

Install from PyPI:

```
$ pip install gapi
```

Or, to use the full timeline capability:

```
$ pip install gapi[timeline]
```
## Search API

Usage:

```
$ gnip_search.py -h
usage: gnip_search.py [-h] [-a] [-c] [-b COUNT_BUCKET] [-e END] [-f FILTER]
                      [-l STREAM_URL] [-n MAX] [-N HARD_MAX] [-p PASSWORD]
                      [-q] [-s START] [-u USER] [-w OUTPUT_FILE_PATH] [-t]
                      USE_CASE

GnipSearch supports the following use cases: ['json', 'wordcount', 'users',
'rate', 'links', 'timeline', 'geo', 'audience']

positional arguments:
  USE_CASE              Use case for this search.

optional arguments:
  -h, --help            show this help message and exit
  -a, --paged           Paged access to ALL available results (Warning: this
                        makes many requests)
  -c, --csv             Return comma-separated 'date,counts' or geo data.
  -b COUNT_BUCKET, --bucket COUNT_BUCKET
                        Bucket size for counts query. Options are day, hour,
                        minute (default is 'day').
  -e END, --end-date END
                        End of datetime window, format 'YYYY-mm-DDTHH:MM'
                        (default: most recent activities)
  -f FILTER, --filter FILTER
                        PowerTrack filter rule (See:
                        http://support.gnip.com/customer/portal/articles/901152-powertrack-operators)
  -l STREAM_URL, --stream-url STREAM_URL
                        Url of search endpoint. (See your Gnip console.)
  -n MAX, --results-max MAX
                        Maximum results to return per page (default 100; max
                        500)
  -N HARD_MAX, --hard-max HARD_MAX
                        Maximum results to return for all pages; see -a
                        option
  -p PASSWORD, --password PASSWORD
                        Password
  -q, --query           View API query (no data)
  -s START, --start-date START
                        Start of datetime window, format 'YYYY-mm-DDTHH:MM'
                        (default: 30 days ago)
  -u USER, --user-name USER
                        User name
  -w OUTPUT_FILE_PATH, --output-file-path OUTPUT_FILE_PATH
                        Create files in ./OUTPUT-FILE-PATH. This path must
                        exist and will not be created. This option is
                        available only with the -a option. Default is no
                        output files.
  -t, --search-v2       Use the search API v2 endpoint. [This is deprecated
                        and is automatically set based on the endpoint.]
```
## Using a configuration file

To avoid entering the -u, -p, and -l options for every command, create a configuration file named ".gnip" in the directory where you will run the code. When this file contains the correct parameters, you can omit these command-line parameters.
Use this template:

```
# export GNIP_CONFIG_FILE=<location and name of this file>
#
[creds]
un = <email used for service>
pwd = <password>

[endpoint]
# replace with your endpoint
url = https://search.gnip.com/accounts/shendrickson/search/wayback.json
```
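If you script against the endpoint directly, the same file can be read with the standard library. A minimal sketch (illustrative only, not the package's internal config loader), assuming the `.gnip` file follows the template above:

```python
# Minimal sketch: read the ".gnip" config file shown above with the
# standard-library configparser. Illustrative only.
import os
from configparser import ConfigParser

config_path = os.environ.get("GNIP_CONFIG_FILE", "./.gnip")
config = ConfigParser()
config.read(config_path)

username = config.get("creds", "un")
password = config.get("creds", "pwd")
endpoint_url = config.get("endpoint", "url")
print(username, endpoint_url)
```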
## Use cases

### JSON

Return full, enriched, Activity Streams-format JSON payloads from the Search API endpoint.

Note: If you have a GNIP_CONFIG_FILE defined (try `echo $GNIP_CONFIG_FILE`; it should print the path to the config file you created), the -u and -p arguments are not necessary.

Run gnip_search.py from the Gnip-Python-Search-API-Utilities directory:
$ ./gnip_search.py -uXXX -pXXX -f"from:Gnip" json
{"body": "RT @bbi: The #BigBoulder bloggers have been busy. Head to http://t.co/Rwve0dVA82 for recaps of the Sina Weibo, Tumblr & Academic Research s\u2026", "retweetCount": 3, "generator": {"link": "http://twitter.com", "displayName": "Twitter Web Client"}, "twitter_filter_level": "medium", "gnip": {"klout_profile": {"link": "http://klout.com/user/id/651348", "topics": [{"link": "http://klout.com/topic/id/5144818194631006088", "displayName": "Software", "
...
```
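Because the output is one JSON activity per line, it pipes cleanly into small scripts. A minimal sketch of pulling fields out of the stream (field names follow the Activity Streams sample above; save it as, say, a hypothetical fields.py and pipe the utility's output into it):

```python
# Sketch: read gnip_search.py "json" output from stdin (one Activity
# Streams JSON document per line) and print selected fields.
# "body" and "retweetCount" appear in the sample above; adjust to taste.
import json
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    activity = json.loads(line)
    print(activity.get("retweetCount"), activity.get("body"))
```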
#### Notes

The -a option (paging) collects all results before printing to stdout or a file, and also forces -n 500 per request. Paging will collect up to 500,000 tweets, which may take hours and can be very costly.
### Wordcount

Return the top 1- and 2-grams, with counts and document frequency, from matching activities. You can modify the settings within the simple ngrams package (sngrams) to change the range of the output. In the table below, "mentions" counts every occurrence of a term and "activities" counts the tweets that contain it at least once; a short sketch of this distinction follows the table.
```
$ ./gnip_search.py -uXXX -pXXX -f"world cup" -n200 wordcount
------------------------------------------------------------
terms -- mentions activities (200)
------------------------------------------------------------
world -- 203 11.41% 198 99.00%
cup -- 203 11.41% 198 99.00%
ceremony -- 46 2.59% 45 22.50%
opening -- 45 2.53% 45 22.50%
fifa -- 25 1.41% 25 12.50%
2014 -- 22 1.24% 22 11.00%
brazil -- 20 1.12% 19 9.50%
watching -- 15 0.84% 12 6.00%
ready -- 14 0.79% 14 7.00%
tonight -- 11 0.62% 11 5.50%
game -- 11 0.62% 11 5.50%
wait -- 10 0.56% 10 5.00%
million -- 10 0.56% 8 4.00%
first -- 10 0.56% 10 5.00%
indonesia -- 10 0.56% 2 1.00%
time -- 10 0.56% 9 4.50%
niallofficial -- 9 0.51% 9 4.50%
here -- 9 0.51% 9 4.50%
majooooorr -- 9 0.51% 9 4.50%
braziiiilllll -- 9 0.51% 9 4.50%
world cup -- 198 12.54% 196 98.00%
opening ceremony -- 33 2.09% 33 16.50%
cup opening -- 23 1.46% 23 11.50%
fifa world -- 23 1.46% 23 11.50%
cup 2014 -- 13 0.82% 13 6.50%
ready world -- 12 0.76% 12 6.00%
cup tonight -- 11 0.70% 11 5.50%
niallofficial first -- 9 0.57% 9 4.50%
cima majooooorr -- 9 0.57% 9 4.50%
cmon braziiiilllll -- 9 0.57% 9 4.50%
tonight wait -- 9 0.57% 9 4.50%
wait pra -- 9 0.57% 9 4.50%
majooooorr cmon -- 9 0.57% 9 4.50%
game world -- 9 0.57% 9 4.50%
pra cima -- 9 0.57% 9 4.50%
watching world -- 9 0.57% 7 3.50%
first game -- 9 0.57% 9 4.50%
indonesia indonesia -- 8 0.51% 2 1.00%
watch world -- 8 0.51% 8 4.00%
ceremony world -- 7 0.44% 7 3.50%
------------------------------------------------------------
```
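A minimal sketch of the mentions/activities distinction (illustrative only; the utility itself computes these with the sngrams package):

```python
# Sketch: "mentions" counts every occurrence of a term across all
# tweets; "activities" counts the tweets containing the term at least
# once. The tweets below are made up for illustration.
from collections import Counter

tweets = [
    "world cup opening ceremony",
    "world cup world cup",
    "ready for the world cup tonight",
]

mentions = Counter()
activities = Counter()
for text in tweets:
    tokens = text.lower().split()
    mentions.update(tokens)          # every occurrence
    activities.update(set(tokens))   # once per tweet

for term, n in mentions.most_common(2):
    print("%s -- %d mentions, %d activities" % (term, n, activities[term]))
# -> world -- 4 mentions, 3 activities
#    cup -- 4 mentions, 3 activities
```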
### Users

Return the most common usernames occurring in matching activities:

```
$ ./gnip_search.py -uXXX -pXXX -f"obama" -n500 users
------------------------------------------------------------
terms -- mentions activities (500)
------------------------------------------------------------
tsalazar66 -- 5 1.00% 5 1.00%
sunnyherring1 -- 5 1.00% 5 1.00%
debwilliams57 -- 3 0.60% 3 0.60%
tattooq -- 2 0.40% 2 0.40%
carlanae -- 2 0.40% 2 0.40%
miisslys -- 2 0.40% 2 0.40%
celtic_norse -- 2 0.40% 2 0.40%
tvkoolturaldgoh -- 2 0.40% 2 0.40%
tarynmorman -- 2 0.40% 2 0.40%
__coleston_s__ -- 2 0.40% 2 0.40%
alinka2linka -- 2 0.40% 2 0.40%
falakhzafrieyl -- 2 0.40% 2 0.40%
coolstoryluk -- 2 0.40% 2 0.40%
law_colorado -- 2 0.40% 2 0.40%
genelingerfelt -- 2 0.40% 2 0.40%
annerkissed69 -- 2 0.40% 2 0.40%
shotoftheweek -- 2 0.40% 2 0.40%
matemary1 -- 2 0.40% 2 0.40%
orlando_ooh -- 2 0.40% 2 0.40%
c0nt0stavl0s__ -- 2 0.40% 2 0.40%
------------------------------------------------------------
```
### Rate

Calculate the approximate activity rate (tweets per hour) from matching activities:

```
$ ./gnip_search.py -uXXX -pXXX -f"from:jrmontag" -n500 rate
------------------------------------------------------------
PowerTrack Rule: "from:jrmontag"
Oldest Tweet (UTC): 2014-05-13 02:14:44
Newest Tweet (UTC): 2014-06-12 18:41:44.306984
Now (UTC): 2014-06-12 18:41:55
254 Tweets: 0.345 Tweets/Hour
------------------------------------------------------------
```
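The rate is straightforward arithmetic over the returned window. A minimal sketch reproducing the figure above (timestamps copied from the sample output):

```python
# Sketch: reproduce the Tweets/Hour figure from the sample output above.
from datetime import datetime

oldest = datetime(2014, 5, 13, 2, 14, 44)   # Oldest Tweet (UTC)
newest = datetime(2014, 6, 12, 18, 41, 44)  # Newest Tweet (UTC)
tweet_count = 254

hours = (newest - oldest).total_seconds() / 3600.0
print("%d Tweets: %.3f Tweets/Hour" % (tweet_count, tweet_count / hours))
# -> 254 Tweets: 0.345 Tweets/Hour
```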
### Links

Return the most frequently observed links, with counts and document frequency, in matching activities:

```
$ ./gnip_search.py -uXXX -pXXX -f"from:drskippy" -n500 links
---------------------------------------------------------------------------------------------------------------------------------
links -- mentions activities (31)
---------------------------------------------------------------------------------------------------------------------------------
nolinks -- 9 27.27% 9 26.47%
http://twitter.com/mutualmind/status/476460889147600896/photo/1 -- 1 3.03% 1 2.94%
http://thenewinquiry.com/essays/the-anxieties-of-big-data/ -- 1 3.03% 1 2.94%
http://www.nytimes.com/2014/05/30/opinion/krugman-cutting-back-on-carbon.html?hp&rref=opinion&_r=0 -- 1 3.03% 1 2.94%
http://twitter.com/mdcin303/status/474991971170131968/photo/1 -- 1 3.03% 1 2.94%
http://twitter.com/notfromshrek/status/475034884189085696/photo/1 -- 1 3.03% 1 2.94%
https://github.com/dlwh/epic -- 1 3.03% 1 2.94%
http://twitter.com/jrmontag/status/471762525449900032/photo/1 -- 1 3.03% 1 2.94%
http://pandas.pydata.org/pandas-docs/stable/whatsnew.html -- 1 3.03% 1 2.94%
http://www.economist.com/blogs/graphicdetail/2014/06/daily-chart-1 -- 1 3.03% 1 2.94%
http://www.zdnet.com/google-turns-to-machine-learning-to-build-a-better-datacentre-7000029930/ -- 1 3.03% 1 2.94%
https://groups.google.com/forum/#!topic/scalanlp-discuss/bd9jhmm2nxc -- 1 3.03% 1 2.94%
http://www.ladamic.com/wordpress/?p=681 -- 1 3.03% 1 2.94%
http://www.linkedin.com/today/post/article/20140407232811-442872-do-your-analysts-really-analyze -- 1 3.03% 1 2.94%
http://twitter.com/giorgiocaviglia/status/474319737761980417/photo/1 -- 1 3.03% 1 2.94%
http://faculty.washington.edu/kstarbi/starbird_iconference2014-final.pdf -- 1 3.03% 1 2.94%
http://twitter.com/drskippy/status/474903707407384576/photo/1 -- 1 3.03% 1 2.94%
http://en.wikipedia.org/wiki/lissajous_curve#logos_and_other_uses -- 1 3.03% 1 2.94%
http://datacolorado.com/knitr_test/ -- 1 3.03% 1 2.94%
http://opendata-hackday.de/?page_id=227 -- 1 3.03% 1 2.94%
---------------------------------------------------------------------------------------------------------------------------------
```
### Timeline

Return a count timeline of matching activities. Without further options, results are returned in JSON format:

```
$ ./gnip_search.py -uXXX -pXXX -f"@cia" timeline
{"results": [{"count": 32, "timePeriod": "201405130000"}, {"count": 31, "timePeriod": "201405140000"},
...
```
Results can be returned in comma-delimited format with the `-c` option:

```
$ ./gnip_search.py -uXXX -pXXX -f"@cia" timeline -c
2014-05-13T00:00:00,32
2014-05-14T00:00:00,31
2014-05-15T00:00:00,23
2014-05-16T00:00:00,81
...
```
And the bucket size can be adjusted with `-b`:

```
$ ./gnip_search.py -uXXX -pXXX -f"@cia" timeline -c -b hour
...
2014-06-06T11:00:00,0
2014-06-06T12:00:00,0
2014-06-06T13:00:00,0
2014-06-06T14:00:00,0
2014-06-06T15:00:00,1
2014-06-06T16:00:00,0
2014-06-06T17:00:00,7234
2014-06-06T18:00:00,77403
2014-06-06T19:00:00,44704
2014-06-06T20:00:00,38512
2014-06-06T21:00:00,23463
2014-06-06T22:00:00,17458
2014-06-06T23:00:00,13352
2014-06-07T00:00:00,12618
2014-06-07T01:00:00,11373
2014-06-07T02:00:00,10641
2014-06-07T03:00:00,9457
...
```
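The `-c` output is a simple reshaping of the JSON results. A minimal sketch of the equivalent conversion (assuming only the `results`/`timePeriod`/`count` fields shown in the JSON example above, not the utility's internal code):

```python
# Sketch: convert the timeline JSON shown above into the -c style
# "datetime,count" rows.
import json
from datetime import datetime

raw = '{"results": [{"count": 32, "timePeriod": "201405130000"}, {"count": 31, "timePeriod": "201405140000"}]}'

for bucket in json.loads(raw)["results"]:
    when = datetime.strptime(bucket["timePeriod"], "%Y%m%d%H%M")
    print("%s,%d" % (when.isoformat(), bucket["count"]))
# -> 2014-05-13T00:00:00,32
#    2014-05-14T00:00:00,31
```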
### Geo

Return JSON payloads with the latitude, longitude, timestamp, and activity id for matching activities:

```
$ ./gnip_search.py -uXXX -pXXX -f"vamos has:geo" geo
{"latitude": 4.6662819, "postedTime": "2014-06-12T18:52:48", "id": "477161613775351808", "longitude": -74.0557122}
{"latitude": null, "postedTime": "2014-06-12T18:52:48", "id": "477161614354165760", "longitude": null}
{"latitude": -24.4162955, "postedTime": "2014-06-12T18:52:47", "id": "477161609786568704", "longitude": -53.5296426}
{"latitude": 14.66637167, "postedTime": "2014-06-12T18:52:47", "id": "477161607299342336", "longitude": -90.52661}
{"latitude": -22.94064485, "postedTime": "2014-06-12T18:52:45", "id": "477161600429088769", "longitude": -43.05257938}
...
```
This can also be output in delimited format with `-c`:

```
$ ./gnip_search.py -uXXX -pXXX -f"vamos has:geo" geo -c
477161971364933632,2014-06-12T18:54:13,-6.350394,38.926667
477161943015636992,2014-06-12T18:54:07,-46.60175585,-23.63230955
477161939647623168,2014-06-12T18:54:06,-49.0363085,-26.6042339
477161938833907712,2014-06-12T18:54:06,-1.5364198,53.9949317
477161936938094592,2014-06-12T18:54:05,-76.06161259,1.84834405
477161932806692865,2014-06-12T18:54:04,None,None
477161928377516032,2014-06-12T18:54:03,-51.08593214,0.03778787
```
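A minimal sketch of the same flattening (field names from the JSON sample above; the delimited sample appears to order the columns as id, timestamp, longitude, latitude):

```python
# Sketch: flatten the geo JSON lines above into delimited rows.
# Missing coordinates (JSON null) print as "None", matching the sample.
import json
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    rec = json.loads(line)
    print("%s,%s,%s,%s" % (rec["id"], rec["postedTime"],
                           rec["longitude"], rec["latitude"]))
```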
### Audience

Return the list of all user IDs represented in matching activities:

```
$ ./gnip_search.py -n15 -f "call mom" audience
--------------------------------------------------------------------------------
229152598
458139782
1371311486
356605896
1214494260
2651237064
2468197068
1473613993
408876524
245142830
2158092706
119980244
2207663371
291388723
3106639108
```
## Simple Timeseries Analysis

Usage:

```
$ gnip_time_series.py -h
usage: gnip_time_series.py [-h] [-b COUNT_BUCKET] [-e END] [-f FILTER]
                           [-g SECOND_FILTER] [-l STREAM_URL] [-p PASSWORD]
                           [-s START] [-u USER] [-t] [-w OUTPUT_FILE_PATH]

GnipSearch timeline tools

optional arguments:
  -h, --help            show this help message and exit
  -b COUNT_BUCKET, --bucket COUNT_BUCKET
                        Bucket size for counts query. Options are day, hour,
                        minute (default is 'day').
  -e END, --end-date END
                        End of datetime window, format 'YYYY-mm-DDTHH:MM'
                        (default: most recent activities)
  -f FILTER, --filter FILTER
                        PowerTrack filter rule (See:
                        http://support.gnip.com/customer/portal/articles/901152-powertrack-operators)
  -g SECOND_FILTER, --second_filter SECOND_FILTER
                        Use a second filter to show correlation plots of -f
                        timeline vs -g timeline.
  -l STREAM_URL, --stream-url STREAM_URL
                        Url of search endpoint. (See your Gnip console.)
  -p PASSWORD, --password PASSWORD
                        Password
  -s START, --start-date START
                        Start of datetime window, format 'YYYY-mm-DDTHH:MM'
                        (default: 30 days ago)
  -u USER, --user-name USER
                        User name
  -t, --get-topics      Set flag to evaluate peak topics (this may take a few
                        minutes)
  -w OUTPUT_FILE_PATH, --output-file-path OUTPUT_FILE_PATH
                        Create files in ./OUTPUT-FILE-PATH. This path must
                        exist and will not be created.
```
#### Example Plots

Example output from the command:

```
$ gnip_time_series.py -f "earthquake" -s2015-10-01T00:00:00 -e2015-11-18T00:00:00 -t -bhour
```
#### Dependencies

Gnip's Search 2.0 API access is required.

In addition to the basic Gnip Search utility described above, this package depends on a number of other large packages:

- matplotlib
- numpy
- pandas
- statsmodels
- scipy
#### Notes

- You should create the path "plots" in the directory where you run the utility. It will contain the plots of the time series and analysis.
- This utility creates an extensive log file named time_series.log. It contains many details of parameter settings and intermediate outputs.
- On a remote machine or server, change your matplotlib backend by creating a local matplotlibrc file. Create Gnip-Python-Search-API-Utilities/matplotlibrc:

```
# Change the backend to Agg to avoid errors when matplotlib cannot display the plots
# More information on creating and editing a matplotlibrc file at:
# http://matplotlib.org/users/customizing.html
backend : Agg
```
## Filter Analysis

Usage:

```
$ ./gnip_filter_analysis.py -h
usage: gnip_filter_analysis.py [-h] [-j JOB_DESCRIPTION] [-b COUNT_BUCKET]
                               [-l STREAM_URL] [-p PASSWORD] [-r RANK_SAMPLE]
                               [-q] [-u USER] [-w OUTPUT_FILE_PATH]

Creates an aggregated filter statistics summary from filter rules and date
periods in the job description.

optional arguments:
  -h, --help            show this help message and exit
  -j JOB_DESCRIPTION, --job_description JOB_DESCRIPTION
                        JSON formatted job description file
  -b COUNT_BUCKET, --bucket COUNT_BUCKET
                        Bucket size for counts query. Options are day, hour,
                        minute (default is 'day').
  -l STREAM_URL, --stream-url STREAM_URL
                        Url of search endpoint. (See your Gnip console.)
  -p PASSWORD, --password PASSWORD
                        Password
  -r RANK_SAMPLE, --rank_sample RANK_SAMPLE
                        Rank inclusive sampling depth. Default is None. This
                        runs filter rule production for rank1, rank1 OR
                        rank2, rank1 OR rank2 OR rank3, etc. to the depth
                        specified.
  -q, --query           View API query (no data)
  -u USER, --user-name USER
                        User name
  -w OUTPUT_FILE_PATH, --output-file-path OUTPUT_FILE_PATH
                        Create files in ./OUTPUT-FILE-PATH. This path must
                        exist and will not be created. Default is ./data
```
Example output to compare 7 rules across 2 time periods.

job.json:

```
{
    "date_ranges": [
        {
            "end": "2015-06-01T00:00:00",
            "start": "2015-05-01T00:00:00"
        },
        {
            "end": "2015-12-01T00:00:00",
            "start": "2015-11-01T00:00:00"
        }
    ],
    "rules": [
        { "tag": "common pet", "value": "dog" },
        { "tag": "common pet", "value": "cat" },
        { "tag": "common pet", "value": "hamster" },
        { "tag": "abstract pet", "value": "pet" },
        { "tag": "pet owner destination", "value": "vet" },
        { "tag": "pet owner destination", "value": "kennel" },
        { "tag": "diminutives", "value": "puppy OR kitten" }
    ]
}
```
Output:

```
$ ./gnip_filter_analysis.py -r 3
...
start_date                                          2015-05-01T00:00:00  2015-11-01T00:00:00       All
filter
All                                                            42691589             46780243  89471832
dog OR cat OR hamster OR pet OR vet OR kennel O...             20864710             22831053  43695763
dog                                                             8096637              9218028  17314665
cat                                                             8378681              8705244  17083925
puppy OR kitten                                                 2392041              2659051   5051092
pet                                                             2101044              2345140   4446184
vet                                                              620178               749802   1369980
hamster                                                          199634               226864    426498
kennel                                                            38664                45061     83725

start_date                                          2015-05-01T00:00:00  2015-11-01T00:00:00        All
filter
All                                                            63640524             69822220  133462744
dog OR cat OR hamster OR pet OR vet OR kennel O...             20864710             22831053   43695763
dog OR cat OR puppy OR kitten                                  18410402             20096764   38507166
dog OR cat                                                     16268900             17662083   33930983
dog                                                             8096512              9232320   17328832
```

So for this rule set, the redundancy is 89471832/43695763 - 1 = 1.0476088722835666, and the 3-rule approximation for the corpus gives 38507166/43695763 = 0.8812562902265832, or about 88% of the tweets of the full rule set.

Additionally, CSV output of the raw counts and a CSV version of the pivot table are written to the specified data directory.

#### Dependencies

Gnip's Search 2.0 API access is required.

In addition to the basic Gnip Search utility described above, this package depends on a number of other large packages:

- numpy
- pandas

#### Notes

- Unlike the other utilities provided, the default file path is set to "./data" to provide full access to output results. Therefore, you should create the path "data" in the directory where you run the utility. It will contain the data outputs.

## License

Gnip-Python-Search-API-Utilities by Scott Hendrickson, Josh Montague, and Jeff Kolb is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/.