Pull instagram geolocation data from DB for area near Congressional District 39
Closed this issue · 5 comments
This is not directly connected to the website, but does require use of the backing data. Bill to the Caliparks maintenance job in QB
RLF wants us to make a map of connections between the 39th Congressional District and Instagram posts in public parks along the coast. @tsinn will make the final map (similar to a County>Coast thing we did for UCLA last year), but first we need to get some data.
Ultimately, we need all the IG posts in coastal parks where the user's home location is inside the 39th congressional district.
Pulled down current House districts to P (Mac path: P/proj_p_s/ResourcesLegacyFund/CALIPARKS/instagramMapping2018).
But for the county map, Stamen provided the data preprocessed to the best-guess county centroid of the IG user. So the first question is what we have in the database to support that. Maybe a user profile city name or similar?
If it's a string that needs geocoding, this could get time consuming/costly. In that case, we might want to restrict posts to only coastal parks first (with help from Tim to recall how we did that last time) and then only geocode those, then filter to 39th district.
Here's the county map we did last time for reference:
End product would likely be a single dot in 39th radiating out to many parks.
The coastal_photos table has these fields:
photo_id | character varying(40) |
metadata | json |
geom | geometry(Point,4326) |
superunit_id | integer |
The photo metadata does not seem to include much personal detail such as home location, just the poster's username and name:
{
"type":"image",
"id":"1176141650918717185_7070844",
"attribution":null,
"tags":[],
"location":{"latitude":39.392445,"name":"Jackson State Forest","longitude":-123.648923,"id":1024907777},
"comments":{"count":0,"data":[]},
"filter":"Normal","created_time":"1454427032","link":"https://www.instagram.com/p/BBSfpmcjTcB/",
"likes":{"count":7,"data":[
{"username":"nrcross_","profile_picture":"https://scontent.cdninstagram.com/t51.2885-19/s150x150/12224454_182667155409552_1826488624_a.jpg","id":"42351896","full_name":"Nathan Cross"},
{"username":"dgirard10","profile_picture":"https://scontent.cdninstagram.com/t51.2885-19/s150x150/11262620_422668554595655_1255753760_a.jpg","id":"12478249","full_name":"Dylan Girard"},
...etc...
]},
"images":{"low_resolution":{"url":"https://scontent.cdninstagram.com/t51.2885-15/s320x320/e35/12407205_1594417670778980_137398085_n.jpg","width":320,"height":320},"thumbnail":{"url":"https://scontent.cdninstagram.com/t51.2885-15/s150x150/e35/12407205_1594417670778980_137398085_n.jpg","width":150,"height":150},"standard_resolution":{"url":"https://scontent.cdninstagram.com/t51.2885-15/s640x640/sh0.08/e35/12407205_1594417670778980_137398085_n.jpg","width":640,"height":640}},
"users_in_photo":[
{"position":{"y":0.708,"x":0.749333333},"user":{"username":"sarahlizbeth","profile_picture":"https://scontent.cdninstagram.com/t51.2885-19/11055503_632051400260313_1875680088_a.jpg","id":"7067991","full_name":""}}
..etc...
],
"caption":{"created_time":"1454427032","text":"Hiking",
"from":{"username":"jeff_quinn","profile_picture":"https://scontent.cdninstagram.com/t51.2885-19/10950568_1565864193682783_157122195_a.jpg","id":"7070844","full_name":"Jeff Quinn"},"id":"1176141660691444784"},
"user":{"username":"jeff_quinn","profile_picture":"https://scontent.cdninstagram.com/t51.2885-19/10950568_1565864193682783_157122195_a.jpg","id":"7070844","full_name":"Jeff Quinn"}}
}
I show 1,826,552 photos at present. It would not be overly difficult to export the metadata in some format, then write a Python program to tease it apart and get at what we do have:
- photographer details: name, username, photo URL
- photo location lat/lon
- photo location name, if available
Still, this does not seem to include anything to further target the poster's home location.
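That export-and-parse step could be sketched roughly like this. A minimal sketch only: the newline-delimited-JSON export format and the function names are assumptions; the field names are taken from the sample metadata above.

```python
import json

def extract_photo_info(metadata):
    """Pull out the fields we actually have for one photo's metadata blob."""
    user = metadata.get("user") or {}
    location = metadata.get("location") or {}
    return {
        "username": user.get("username"),
        "full_name": user.get("full_name"),
        "latitude": location.get("latitude"),
        "longitude": location.get("longitude"),
        "location_name": location.get("name"),  # not always present
    }

def extract_from_export(path):
    """Assumes the metadata column is exported as newline-delimited JSON,
    one photo per line (the export format itself is hypothetical)."""
    with open(path) as handle:
        for line in handle:
            if line.strip():
                yield extract_photo_info(json.loads(line))
```

Using `.get()` throughout means photos with missing location blocks just come back with `None` fields rather than raising.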
Given the user ID, in theory one could brute-force the Instagram API to get at user details.
API output
https://www.instagram.com/developer/endpoints/users/#get_users
The API output seems not to include much personal info: just their name, username, and profile picture (which we already have in the photo metadata), plus the URL of their website and their "bio" blurb.
API limits
I don't yet know how many distinct users are represented in the 1.8M photos. Part of such a "user info scraper" would be a caching mechanism so as not to re-fetch the same username multiple times. Still, hitting the Instagram API 1 million times, or even 250,000 times, could be a violation of their TOU, as well as being time-consuming.
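The caching wrapper for such a scraper could look something like this. The actual API call is left as a placeholder (`fetch_user`), since we'd rather not commit to hammering the endpoint; the `min_interval` throttle is an assumption about being polite, not a documented Instagram limit.

```python
import time

def make_cached_fetcher(fetch_user, min_interval=1.0):
    """Wrap an API call so each user ID is fetched at most once,
    with a simple delay between live requests.
    `fetch_user` is a placeholder for whatever actually hits the API."""
    cache = {}
    last_call = [0.0]  # mutable cell so the closure can update it

    def cached(user_id):
        if user_id in cache:
            return cache[user_id]  # no network hit for repeat IDs
        wait = min_interval - (time.time() - last_call[0])
        if wait > 0:
            time.sleep(wait)
        cache[user_id] = fetch_user(user_id)
        last_call[0] = time.time()
        return cache[user_id]

    return cached
```

For 250K+ distinct IDs the cache would also want to be persisted to disk between runs, so an interrupted scrape doesn't start over.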
Per chat, Stamen had at some point connected users to coordinates or at least counties. Let's connect with them and see if they have any notes on that process, which could be helpful.
Here's the word from Stamen -- simpler and more impressionistic, but fine for this use case:
I don't know if I ever assigned "home" counties for users. Rather, we just looked at any users that showed up in parks within each county, and then looked at which coastal parks they also showed up in.
So, if we're drawing the connection between Riverside county and a park in, say, Ventura, we just look for any username that shows up in both places. It's possible the user's "home" location is actually in Ventura and they just happened to visit Riverside once. Or it's possible their home is in Sacramento and they just happened to visit both Riverside and Ventura. In all those cases, they'd show up as a link, but we can't distinguish which scenario is which. Also, we were only using the corpus of photos that we harvested in parks, so a user would have to have visited a park in Riverside to even show up in the database. If they live in Riverside, but only ever visited parks in Ventura, we wouldn't show them as a link, because we have no evidence of them being in Riverside.
So in this case, the query would be: users that appear in coastal_parks that also appear in parks falling inside the 39th District. That seems much more straightforward.
To rephrase for myself, the desired end result is:
- coastal CPAD superunits
- tagged with a new integer attribute howmanyinstagramusers
This would be the count of distinct instagram userids, found among photos for that park superunit_id, where that same instagram userid is seen at least once in a photo for a park within Congdist39.
Also noted that this is likely to be replicated for other congdists.
Steps:
- export distinct username, superunit_id from coastal_photos
- export distinct username from instagram_photos where the photo geom is within the Congdist 39 geom
- export coastal CPAD; that is, all CPAD superunits which have a presence in coastal_photos
- write something to filter coastal usernames against cong39 usernames, tally by park, and save as a new CSV
- join tallies & CPAD for visualization
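The filter-and-tally step could be sketched like this, working from the two CSV exports. The column names (`username`, `superunit_id`) are assumptions based on the exports described above.

```python
import csv
from io import StringIO

def tally_coastal_by_park(coastal_csv, cong39_csv, out_csv):
    """coastal_csv: file-like CSV with username,superunit_id columns.
    cong39_csv: file-like CSV with a username column.
    Writes superunit_id,howmanyinstagramusers rows to out_csv, counting
    distinct coastal-park users also seen in a Congdist 39 park.
    Parks with no overlapping users are omitted; they'd get 0 in the
    later join against CPAD."""
    cong39_users = {row["username"] for row in csv.DictReader(cong39_csv)}
    per_park = {}  # superunit_id -> set of matching usernames
    for row in csv.DictReader(coastal_csv):
        if row["username"] in cong39_users:
            per_park.setdefault(row["superunit_id"], set()).add(row["username"])
    writer = csv.writer(out_csv)
    writer.writerow(["superunit_id", "howmanyinstagramusers"])
    for superunit_id in sorted(per_park):
        writer.writerow([superunit_id, len(per_park[superunit_id])])
```

Using sets per park means a prolific user posting many photos in one park still counts only once, matching the "distinct instagram userids" definition above.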
I did the calculations as expected, and have placed them onto GreenInfo's internal file storage for reference. Let's discuss!