particle
is a tool for collecting data about links on Twitter, Facebook, web pages, and RSS feeds (for now). It joins this data together by a common URL and a bucketed timestamp, which enables analysis similar to what I discuss in this blog post. particle
is like pollster
but not as developed and nowhere near as sophisticated. I just didn't want to bother learning coffeescript so I built my own version :).
particle
runs off of a yaml config file which contains information about the data sources you want to collect, referred to here as particle.yml
. If you want to see an example of such a file, check out this one, though we'll discuss it in more detail below.
When you start a new particle
project, you'll want to tell particle
where this file is:
from particle import Particle
p = Particle('particle.yml')
This one call will read in your configuration file, set it as an environment variable (PARTICLE_CONFIG_PATH
), build a Twitter list of the handles you want to follow, and insert a stable Facebook API key (more on all this below).
Now you're all set to run particle
:
p.run(tasks=['twitter', 'facebook'])
particle
tries really hard to find specific links around the web. You can set the patterns of these links in particle.yml
:
global:
  content_regexes:
  - .*\.nytimes\.com/(.*/)?[0-9]+/[0-9]+/[0-9]+/.*
  short_regexes:
  - nyti\.ms/\w+
content_regexes
represents a list of regular expressions you want to use to detect links to content. In this case, I've written a regular expression that will find links to New York Times content.
This list can be as long or as short as you'd like. For instance:
content_regexes:
- .*\.nytimes\.com/(.*/)?[0-9]+/[0-9]+/[0-9]+/.*
- .*propublica\.org/[a-z]+/[a-z0-9/-]+
short_regexes
represents a list of regular expressions you want to use to detect shortened links to content - they usually show up on Facebook and Twitter. particle
will then unshorten these links so that they can be matched against regular links.
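As a rough illustration of how these two lists might be applied, here is a minimal sketch that classifies a URL using the example patterns above. The helper name `classify_url` is hypothetical and is not part of particle; it only shows the matching step, not the unshortening.

```python
import re

# Patterns copied from the particle.yml example above.
CONTENT_REGEXES = [r".*\.nytimes\.com/(.*/)?[0-9]+/[0-9]+/[0-9]+/.*"]
SHORT_REGEXES = [r"nyti\.ms/\w+"]


def classify_url(url, content_regexes=CONTENT_REGEXES, short_regexes=SHORT_REGEXES):
    """Return 'content', 'short', or None for a URL (hypothetical helper)."""
    if any(re.search(pattern, url) for pattern in content_regexes):
        return "content"
    if any(re.search(pattern, url) for pattern in short_regexes):
        return "short"
    return None
```

A URL classified as 'short' would still need to be unshortened (for example, by following its redirects) before it can be compared with content links.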
particle
aligns these URLs in time, so you don't have to join all the data together yourself. It does this by taking in a bucket
parameter representing the number of minutes you'd like to round timestamps down to. This value should correspond to how often you run particle
. Don't worry though, particle
will still retain the raw timestamp in case you want more granular events. In addition, particle
will also resolve all UTC timestamps to the local time zone where your content is published or promoted. In particle.yml
this looks like:
global:
  bucket: 10
  newsroom_timezone: America/New_York
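To make the bucketing concrete, here is a minimal sketch of rounding a timestamp down to a 10-minute bucket and converting it to the newsroom's time zone. It assumes timestamps are Unix seconds in UTC; the function names are hypothetical and this is not necessarily how particle implements it.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library in Python 3.9+


def bucket_timestamp(ts, bucket_minutes=10):
    """Round a Unix timestamp (seconds) down to the start of its bucket."""
    bucket_seconds = bucket_minutes * 60
    return ts - (ts % bucket_seconds)


def to_newsroom_time(ts, tz_name="America/New_York"):
    """Convert a UTC Unix timestamp to an aware datetime in the newsroom's zone."""
    utc_dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    return utc_dt.astimezone(ZoneInfo(tz_name))
```

With `bucket: 10`, every event observed within the same 10-minute window shares one bucketed timestamp, which is what lets events from different sources line up without a join.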
Putting it all together, your global
settings should look something like this:
global:
  bucket: 10
  content_regexes:
  - .*\.nytimes\.com/(.*/)?[0-9]+/[0-9]+/[0-9]+/.*
  newsroom_timezone: America/New_York
  short_regexes:
  - nyti\.ms/\w+
particle
currently supports collection from four data sources:
- Twitter
- Facebook
- RSS Feeds
- Web Pages (we'll call them 'Promo Pages' below)
Let's go through each of these and discuss how to customize your particle
project to your needs.
particle
follows an arbitrary list of Twitter accounts and detects when relevant links are shared. To set this up, you'll first have to register an app with Twitter and obtain a set of API keys. Once you have these keys, you can enter them into particle.yml
like so:
twitter:
  access_token: xxx
  access_token_secret: xxx
  consumer_key: xxx
  consumer_secret: xxx
To make it practical to follow many hundreds of Twitter accounts, particle
builds a custom Twitter list and then listens for updates. You can build this list by providing three parameters: the screen name associated with your API keys, the name of the list you want to build, and a list of the screen names you want to follow. This list is built when you run particle.init()
. In particle.yml
this looks like this:
list_owner: brianabelson
list_slug: pardata-test
list_screen_names:
- nytimes
- nytimesworld
You can also set the maximum number of posts to search through at any given time. Currently the maximum is 200. However, so long as you poll Twitter regularly, you can be relatively certain that you won't miss any tweets. You can set this value in particle.yml
like so:
limit: 200
Putting it all together, your twitter
settings should look something like this:
twitter:
  access_token: xxx
  access_token_secret: xxx
  consumer_key: xxx
  consumer_secret: xxx
  list_owner: brianabelson
  list_slug: pardata-test
  list_screen_names:
  - nytimes
  - nytimesworld
As with Twitter, you'll need to register an app with Facebook to access their data. You can do that here. Once you have your app's credentials, enter them into particle.yml like so:
facebook:
  app_id: xxx
  app_secret: xxx
  temp_access_token: xxx
One annoying thing about the Facebook API is that it requires you to hit a particular endpoint to register a stable access token, that is, an API key that lasts more than a couple of hours. particle
helps solve this by generating a stable access token for you. However, you'll first need a temporary access token - temp_access_token
above. You can obtain one by going to https://developers.facebook.com/tools and making sure you are using it with the App you've registered above. You can then press "Get Access Token" and copy that string into particle.yml
. When you run particle.init()
, a stable access token will be generated using your temporary access token and inserted into particle.yml
:
facebook:
  app_id: xxx
  app_secret: xxx
  temp_access_token: xxx
  stable_access_token: xxx
  stable_access_token_expires_at: 0123456789
Here, stable_access_token_expires_at
indicates the time at which this so-called "stable" access token will expire - usually about three months. However, if you simply remove stable_access_token
from particle.yml
, regenerate a temporary access token, and re-run particle.init()
, particle
will create a new stable access token for you.
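If you want to know ahead of time when it's time to repeat those steps, a small check like the following can help. This is a sketch only: it assumes stable_access_token_expires_at is a Unix timestamp in seconds, and `stable_token_needs_refresh` is a hypothetical helper, not part of particle.

```python
import time


def stable_token_needs_refresh(facebook_config, now=None, margin_seconds=0):
    """Return True if the stable token is missing or at/past its expiry.

    facebook_config is the 'facebook' section of particle.yml as a dict.
    Assumes stable_access_token_expires_at is a Unix timestamp in seconds.
    """
    token = facebook_config.get("stable_access_token")
    expires_at = facebook_config.get("stable_access_token_expires_at")
    if not token or expires_at is None:
        return True
    now = time.time() if now is None else now
    # int() accepts both integers and digit strings such as "0123456789".
    return now + margin_seconds >= int(expires_at)
```

Passing a margin_seconds of a few days gives you time to regenerate the temporary token before the stable one actually lapses.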
particle
collects information about posts on particular Facebook pages. As with Twitter, it does this by following a list of page slugs.
pages:
- nytimes
Once again, you can specify the number of posts to look through for each page. Here the setting is called page_limit
(you'll see why, below).
page_limit: 10
In addition, if you have access to the Insights accounts for a set of pages, particle
can grab highly detailed information about how people engaged with posts on these pages over time. You can specify this in particle.yml
as follows:
insights_pages:
- nytimes
In this case, you'll want to set the limit (x) according to the number of posts you'd like to track at any one time: when a post first goes live, it will be tracked until x more posts have been published. Usually I just leave this at its maximum, 200:
insights_limit: 200
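The tracking rule described above behaves like a sliding window over the most recent posts. Here is a minimal sketch of that idea, assuming post IDs arrive in publication order; `tracked_posts` is a hypothetical name and this is not particle's actual code.

```python
from collections import deque


def tracked_posts(post_ids, insights_limit=200):
    """Return the posts still being tracked, oldest first.

    post_ids must be in publication order (oldest first). A post drops out
    of the window once insights_limit newer posts have been published.
    """
    window = deque(maxlen=insights_limit)
    for post_id in post_ids:
        window.append(post_id)  # the oldest entry is evicted when full
    return list(window)
```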
Putting it all together, your facebook
settings should look something like this:
facebook:
  app_id: xxx
  app_secret: xxx
  temp_access_token: xxx
  stable_access_token: xxx
  stable_access_token_expires_at: 0123456789
  pages:
  - nytimes
  insights_pages:
  - nytimes
  insights_limit: 200
A powerful feature of particle
is its ability to pull in links from arbitrary pages and extract metadata about those links, like their position, font size, associated text, and image attributes. This feature works by harnessing PhantomJS and Selenium to render the pages in a headless browser and crawl the links to match patterns. You can use it to detect links on homepages or really any other site around the web. You configure this feature in particle.yml
:
promopages:
  nyt_homepage: http://www.nytimes.com/
Here the key of the promopage - nyt_homepage
- indicates how the datastore will refer to an event on this page (more on this below), and the value is the URL you want to track for links.
Finally, particle
also allows you to pull in content from arbitrary RSS feeds. Similar to promopages
, you set these in particle.yml
by listing key-value pairs that correspond to the name and the URL of the feed. In this case, however, the value is a pair of parameters: feed_url
- the URL of the feed - and full_text
, which indicates whether or not the feed publishes the entirety of an article's content. If full_text
is set to false, particle
will attempt to scrape these pages for the article text. Here's all that in particle.yml
:
nyt_timeswire:
  feed_url: http://www.nytimes.com/timeswire/feeds/
  full_text: false
The data is stored in redis
as a sorted set in which the keys are resolved article URLs, the score is the bucketed timestamp, and the value is a JSON object describing an event associated with that timestamp, with the data source as its key and the data for that event as its value. This means that by querying redis you can quickly get all the events for a URL at a particular time, within a time range, or across the entire span of the data without doing any joins.
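As an illustration of what such a query could look like, here is a minimal sketch that reads events for one URL from a sorted set. It assumes the layout described above (key is the resolved URL, score is the bucketed timestamp, member is a JSON string) and a client that exposes redis-py's `zrangebyscore`; the helper name `events_for_url` is hypothetical.

```python
import json


def events_for_url(client, url, start=None, end=None):
    """Return (bucketed_timestamp, event) pairs for a URL, oldest first.

    client is any object with a redis-py style zrangebyscore method.
    start and end are inclusive bucketed timestamps; None means unbounded.
    """
    lo = "-inf" if start is None else start
    hi = "+inf" if end is None else end
    rows = client.zrangebyscore(url, lo, hi, withscores=True)
    # Each row is (member, score); the member is assumed to be a JSON string.
    return [(int(score), json.loads(member)) for member, score in rows]


# Usage against a real server (assumes the redis-py package is installed):
#   import redis
#   client = redis.Redis()
#   events = events_for_url(client, resolved_url, start, end)
```

Passing the same value for start and end returns the events for a single bucket, and omitting both returns the full history for that URL.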
particle
has a built-in