Tools to analyze news feeds for equity in coverage.
This is a Node.js package that parses a news organization's RSS feeds and searches their text for arrays of key topics – whether they're towns, or politicians, or sports teams – to generate data on how equitably journalism institutions allocate their attention.
Results are saved in a PostgreSQL database described by the database.sql
file. This code can be run locally, or scheduled to run daily on a hosted Node.js server.
To run this app locally, you'll need to have Node.js and npm installed, plus a local PostgreSQL database.
Clone the repository to your hard drive by running these commands in your command shell or terminal:
$ git clone https://github.com/vigorousnorth/news-equity-analytics.git
$ cd news-equity-analytics
This software uses a PostgreSQL database to save records of when and where specific topics are mentioned in news feeds of interest. The database is described by the database.sql
file, which can be imported from the command line to set up all the necessary tables. Here's a quick, plain-English guide to the most important tables we're setting up:
The feeds
table includes information about the RSS feeds we're scraping, including URLs, a publisher ID that links to the publishers
table, and XML information.
The places
table describes the topics we're searching for in each news story. Generally these are names of cities, towns or neighborhoods, but this table could also be used to include lists of people, sports teams or other topics of interest.
Each entry includes a column for potential aliases and disambiguation columns (not_followed_by
and not_preceded_by
) to filter out unwanted matches. For instance, to find mentions of "Brooklyn" but not "Brooklyn Nets" or "Brooklyn Bridge," the comma-separated string "Nets,Bridge" would go in the not_followed_by
column for Brooklyn.
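As a rough illustration of how those two columns can drive matching, here's a minimal sketch that turns a places row into a regular expression with lookarounds. The buildPlaceRegex helper is hypothetical – the app's actual matching logic may differ – but it shows the intent of the columns:

```js
// Hypothetical helper (not part of this repo): build a regex for one row of
// the places table, using not_followed_by / not_preceded_by to rule out
// unwanted matches. Assumes a modern Node version with lookbehind support.
function buildPlaceRegex(place) {
  const notAfter = (place.not_followed_by || '').split(',').filter(Boolean).join('|');
  const notBefore = (place.not_preceded_by || '').split(',').filter(Boolean).join('|');
  let pattern = place.place_name;
  if (notBefore) pattern = `(?<!(?:${notBefore})\\s)${pattern}`;
  if (notAfter) pattern = `${pattern}(?!\\s(?:${notAfter}))`;
  return new RegExp(pattern); // add the 'g' flag if you need every match
}

const brooklyn = buildPlaceRegex({ place_name: 'Brooklyn', not_followed_by: 'Nets,Bridge' });
brooklyn.test('A fire in Brooklyn displaced six families'); // true
brooklyn.test('The Brooklyn Nets lost again');              // false
```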
These tables exist to relate RSS feeds to different publishers or media markets.
publishers
identifies parent news organizations and relates them to specific feeds in our feeds
table. We're using this table because you may wish to analyze multiple RSS feeds from a single corporate owner – for instance, to compare coverage in the New York Times real estate section vs. the New York Times arts section.
media_markets
is a two-column table that exists to group together multiple feeds from a single media market for comparison purposes. The id
column is referenced from the publishers
table.
The articles table contains individual news articles, including columns for the headline
, date
, byline
and url
. The feed_id
column references the corresponding id from the feeds
table.
The url
column serves as a unique key, so the same article at the same URL can't be recorded twice.
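Because of that constraint, a duplicate insert fails with PostgreSQL's unique_violation error (code 23505), which application code can catch and treat as "we've already seen this article." A minimal sketch, assuming the pg client and the column names described above (not a verbatim excerpt from this repo):

```js
// Sketch: insert an article and treat a duplicate url (Postgres error code
// 23505, unique_violation) as a no-op. Assumes `pool` is a connected Pool
// from the `pg` package, and an article object with the columns named above.
async function insertArticle(pool, article) {
  try {
    const res = await pool.query(
      `INSERT INTO articles (headline, date, byline, url, feed_id)
       VALUES ($1, $2, $3, $4, $5) RETURNING id`,
      [article.headline, article.date, article.byline, article.url, article.feedId]
    );
    return res.rows[0].id;                   // new article recorded
  } catch (err) {
    if (err.code === '23505') return null;   // duplicate url: already recorded
    throw err;                               // anything else is a real failure
  }
}
```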
The place_mentions table is what we're most interested in: it contains one row for every time a topic in the places
table is mentioned in an article from the articles
table – basically a table of search results.
It includes four columns besides the unique id
for each row: an article_id
column that references the corresponding article from the articles
table; a relevance_score
column, which roughly measures how prominent the mention is (a score of 100 means that the mention is included in the headline; a score of 1 means that the mention is in the very last sentence of the article); a context
column that includes the sentence where the mention was found; and a place_id
column that references the places
table.
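The exact scoring formula lives in the app's search code, but one plausible reading of the description above is a linear scale by sentence position. A hedged sketch – the function name and formula here are illustrative, not the repo's actual implementation:

```js
// Illustrative only: score a mention at 100 if it appears in the headline,
// otherwise scale linearly from 100 (first sentence) down to 1 (last sentence).
function relevanceScore(sentenceIndex, totalSentences, inHeadline) {
  if (inHeadline || totalSentences <= 1) return 100;
  const position = sentenceIndex / (totalSentences - 1); // 0 = first, 1 = last
  return Math.round(100 - position * 99);                // 100 .. 1
}

relevanceScore(0, 20, false);  // 100 (opening sentence)
relevanceScore(19, 20, false); // 1   (final sentence)
```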
Now that you know the rough structure of the database we're using, let's get it set up on your hard drive. First, test your local PostgreSQL database connection:
$ psql "postgresql://localhost/postgres?user=[ your user name ]&password=[ your password ]"
You should see something like this:
psql (10.3)
Type "help" for help.
... which specifies the version of your local PostgreSQL installation and opens an interactive prompt where you can run queries against your database. If it's working, type \q
to quit the shell.
If it's not working, make sure you've got PostgreSQL installed and running locally, and double-check your connection credentials.
Create an empty database named "postgres" (skip this step if your PostgreSQL installation already created a default postgres database), then import the database structure:
$ createdb -T template0 postgres
$ psql postgres < database.sql
You should see a bunch of CREATE TABLE
and ALTER TABLE
messages.
Let's connect to the postgres
database from our command line and see if it worked. We'll type psql postgres
to connect to the postgres
database using the psql
interactive terminal, then we'll use the \d
command to see a list of all the tables in the database:
$ psql postgres
postgres=# \d
List of relations
Schema | Name | Type | Owner
--------+-----------------------+----------+----------
public | articles | table | [your user name here]
public | articles_id_seq | sequence |
public | feeds | table |
public | feeds_id_seq | sequence |
public | media_markets | table |
public | media_markets_id_seq | sequence |
public | place_mentions | table |
public | place_mentions_id_seq | sequence |
public | places | table |
public | places_id_seq | sequence |
public | publishers | table |
public | publishers_id_seq | sequence |
public | region_relations | table |
(13 rows)
If that's what you're seeing, your database is all set up.
We're using the dotenv package to manage database connections and credentials. For a local installation, create a new text file inside the news-equity-analytics
directory like this one:
DB_HOST=127.0.0.1
DB=postgres
DATABASE_URL=postgresql://localhost/postgres?user=[your database user name]&password=[your database password]
NODE_ENV=local
and save the file as .env
(you'll use different environment variables if and when you install this software on a hosted server with another database).
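To check that the app can reach the database through those settings, a quick sanity test might look like the sketch below. This assumes the pg package is the Postgres client in use – check package.json for the actual driver:

```js
// Quick connection sanity check. Run with node from the project root so
// dotenv can find the .env file. Assumes the `pg` package is installed.
require('dotenv').config();
const { Pool } = require('pg');

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

pool.query('SELECT COUNT(*) AS n FROM places')
  .then(res => console.log(`places table reachable, ${res.rows[0].n} rows`))
  .catch(err => console.error('connection failed:', err.message))
  .then(() => pool.end());
```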
Our Node.js app (in /bin/app.js
) runs as a series of asynchronous functions in a chain of promises (a condensed sketch follows this list):
- First, the callback to the start function loads the lib/queryForTopics.js file, which connects to the database and returns an array of objects that describe the contents of the places table.
- Then, a chain of promises inside the start() function consumes the array of search topics and prepares them for searching by grouping them by media market and adding disambiguation properties (to distinguish East Flatbush from Flatbush, for instance);
- Then, a new queryForFeeds() promise described in lib/queryForFeeds.js requests the list of RSS feeds from our PostgreSQL database and resolves with an array of RSS feeds;
- Then, the getFeedsContent() promise described in lib/feedparser.js returns an array of items (articles) from each RSS feed;
- Then, with a nested array of feeds containing arrays of content to search, the app loops through each article item and calls the asynchronous searchAndAdd() function on each one:
- The searchAndAdd() function first attempts to insert the article into the articles table. If the article's URL is already there, it throws an error and continues on to the next article.
- The searchAndAdd() function then breaks the article text into sentences, loops through the list of search topics on each one, and returns an insertPlaceMention promise for each hit;
- Each hit generates its own promise in the insertPlaceMention(..) function, which is fulfilled when a result is written to the place_mentions table and returns its unique id. Once all insertPlaceMention promises are fulfilled, the searchAndAdd function returns the unique article id to the main promise chain.
- The final Promise.all() function resolves when all of the article ids are returned from the searchAndAdd() function, and the app is finished.
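Condensed into code, the chain looks roughly like this. The function names come from the list above, but the grouping step and the exact signatures are assumptions, so treat it as an outline of bin/app.js rather than an excerpt:

```js
// Outline of the promise chain described above. Signatures are assumed;
// see bin/app.js and the lib/ modules for the real implementations.
queryForTopics()                                               // places table -> topics
  .then(topics => groupAndDisambiguate(topics))                // hypothetical prep step
  .then(topics =>
    queryForFeeds().then(feeds => ({ topics, feeds })))        // feeds table -> feeds
  .then(({ topics, feeds }) =>
    getFeedsContent(feeds).then(items => ({ topics, items })))  // fetch RSS items
  .then(({ topics, items }) =>
    Promise.all(items.map(item => searchAndAdd(item, topics)))) // search + insert
  .then(articleIds => console.log(`finished ${articleIds.length} articles`))
  .catch(err => console.error(err));
```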
Before we can test our local installation, we need a basic list of search topics to put in our places
table.
Our repository includes a file called sampleData.sql
that populates our database with a very basic sample dataset for our queries.
It adds one row in the media_markets
table – the New York metropolitan area – and two rows each in the publishers
and feeds
tables, for the New York Post and the New York Daily News.
And, for the sake of simplicity, it only adds two rows in the places
table: Brooklyn and Staten Island. Both these rows refer back to the "New York metro area" row in the media_markets
table.
To run these queries and update your database with the test data, run this command:
$ psql postgres < sampleData.sql
To see if it worked, send your database a simple SQL query:
$ psql postgres -c 'SELECT * FROM places'
id | place_name | place_aliases | geocode | not_followed_by | not_preceded_by | market_id
----+---------------+---------------+---------+-----------------+-----------------+-----------
1 | Brooklyn | | | | | 1
2 | Staten Island | | | | | 1
(2 rows)
And now, the moment of truth...
Run the bin/app.js
file with node:
$ node bin/app
You'll hopefully see some console logging messages stream by as various JavaScript promises are fulfilled – from fetching the feed content, to adding search results to the place_mentions
table. You'll probably also see some SQL error messages (many of them the expected duplicate-URL errors from articles that are already in the database) – something to be handled better in future versions of this software.
Let's see what we got. Query your place_mentions
table to see the results:
$ psql postgres -c 'SELECT * FROM place_mentions'
If all goes well, you should see here a list of article excerpts from today's New York City tabloids that mention "Brooklyn" and "Staten Island."
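To compare how much attention each topic is getting – the point of this whole exercise – you can aggregate those mentions with a join (table and column names as described above):

$ psql postgres -c 'SELECT p.place_name, COUNT(*) AS mentions FROM place_mentions m JOIN places p ON p.id = m.place_id GROUP BY p.place_name ORDER BY mentions DESC'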
In practice, you're probably interested in more than two topics. If you're looking at New York City, you might want to analyze news mentions of all five boroughs, plus dozens of neighborhoods and maybe even some suburbs.
To upload longer lists of topics to your database, I recommend using pgAdmin, a user-friendly PostgreSQL administration platform. It will allow you to import your topic lists as a CSV file into the places
table (important note: because the market_id
column in the places
table references a unique key in the media_markets
table, you'll need to set up the media_markets
table first).
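If you'd rather stay on the command line, psql's \copy meta-command can handle the same import; the file name and column list below are placeholders for your own data, and any columns you omit will be left empty:

$ psql postgres -c "\copy places (place_name, market_id) FROM 'my_places.csv' WITH (FORMAT csv, HEADER)"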
This software has been running for several months in a hosted installation on Heroku, using the Heroku Scheduler to run once a day and catch new articles on each newspaper's RSS feed, and using embedded JavaScript templates to display results as JSON feeds and as interactive maps.
More detailed documentation on this option is coming soon!