WMATA Watcher

A study of tweets and their correlation to delays in public transportation.

In recent quarters, service issues on Washington, DC's MetroRail have been on the rise:

This project was motivated by a discovery I made while searching through a combination of Tweets, the Metro Daily Service Archives, and official alerts from @MetroRailInfo. The data is plotted in Fig. 2. In just one week's worth of data, I found that @MetroRailInfo gave riders no notice (in blue) of delays during two afternoon rush hours. Metro later acknowledged these delays (in red). In that same timeframe, Tweet activity (in black) spiked, suggesting that the Tweets themselves may contain valuable information.

From there, I gathered several months of Tweets and scraped the Service Archive. This allowed me to construct a training set:

As shown in the sketchnote, a "bag of words" approach was used initially to construct features. The word content of the Tweets (both 1-grams and 2-grams) was vectorized and weighted by term frequency-inverse document frequency (TF-IDF). Random Forest and SVM classifiers both gave up far too much overall accuracy in exchange for precision when predicting a delay in each time bin, so neither was useful for deployment in an app.
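The vectorization step can be sketched with only the standard library. This is a minimal illustration and not the project's actual code: the function names `ngrams` and `tfidf`, the tokenizer regex, the smoothed idf formula, and the sample Tweets are all assumptions. In practice a library vectorizer, such as scikit-learn's `TfidfVectorizer` with `ngram_range=(1, 2)`, would likely do this job.

```python
import math
import re
from collections import Counter


def ngrams(text, n_max=2):
    """Return lowercase word 1-grams through n_max-grams for one tweet."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    grams = []
    for n in range(1, n_max + 1):
        grams.extend(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return grams


def tfidf(docs):
    """Map each document to a dict of {n-gram: weight}, L2-normalized."""
    counts = [Counter(ngrams(d)) for d in docs]
    # Document frequency: number of documents containing each n-gram.
    doc_freq = Counter(g for c in counts for g in c)
    n_docs = len(docs)
    vectors = []
    for c in counts:
        # Smoothed idf (an assumed variant): terms found in every document
        # still keep a nonzero weight.
        vec = {
            g: tf * (math.log((1 + n_docs) / (1 + doc_freq[g])) + 1)
            for g, tf in c.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({g: w / norm for g, w in vec.items()})
    return vectors


# Made-up sample tweets, for illustration only.
tweets = [
    "Red line delay at Metro Center",
    "Orange line delay again",
    "Service has resumed on the red line",
]
vectors = tfidf(tweets)
```

Each entry of `vectors` is a sparse vector keyed by n-gram. N-grams that appear in fewer Tweets (here "metro center") receive a higher weight than common ones (here "delay").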

A coarser-grained model was then built to categorize Tweets by their word content:

Delay-related Tweets - containing words such as "delay" or "offloading"
Line-related Tweets - containing specific mention of the line colors of Metro
Service OK-related Tweets - containing specific mention of words related to service resuming
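The three categories above could be assigned with simple keyword matching, as in the sketch below. The keyword lists are illustrative assumptions: only "delay" and "offloading" are named in this document, and the rest (including the `CATEGORIES` and `categorize` names) are hypothetical stand-ins.

```python
import re

# Keyword lists are assumptions for illustration; only "delay" and
# "offloading" come from the description above.
CATEGORIES = {
    "delay": {"delay", "delays", "delayed", "offloading", "offloaded"},
    "line": {"red", "orange", "blue", "green", "yellow", "silver"},
    "service_ok": {"resume", "resumed", "resuming", "restored", "normal"},
}


def categorize(tweet):
    """Return the set of category names whose keywords appear in the tweet."""
    words = set(re.findall(r"[a-z]+", tweet.lower()))
    return {name for name, keywords in CATEGORIES.items() if words & keywords}
```

A Tweet can fall into more than one category: `categorize("Offloading on the Red line")` matches both the delay and line categories, while a Tweet with no keywords returns an empty set.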

Additional features were the Tweet volume, day of week, hour of day, month, and week of the year. Using a Random Forest, this model achieved a recall of 0.5.
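The volume and calendar features, and the recall metric, can be sketched as follows. This is an assumed reconstruction, not the project's code: the one-hour bin width, the field names, and the helper names `bin_features` and `recall` are all hypothetical.

```python
from collections import Counter
from datetime import datetime


def bin_features(timestamps, bin_minutes=60):
    """Count tweets per time bin and attach calendar features to each bin."""
    def floor(ts):
        # Round a timestamp down to the start of its bin.
        minutes = (ts.hour * 60 + ts.minute) // bin_minutes * bin_minutes
        return ts.replace(hour=minutes // 60, minute=minutes % 60,
                          second=0, microsecond=0)

    volume = Counter(floor(ts) for ts in timestamps)
    return [
        {
            "bin_start": start,
            "tweet_volume": count,
            "day_of_week": start.weekday(),  # Monday = 0
            "hour_of_day": start.hour,
            "month": start.month,
            "week_of_year": start.isocalendar()[1],  # ISO week number
        }
        for start, count in sorted(volume.items())
    ]


def recall(actual, predicted):
    """Fraction of actual delays (True) that were also predicted as delays."""
    hits = sum(1 for a, p in zip(actual, predicted) if a and p)
    positives = sum(1 for a in actual if a)
    return hits / positives if positives else 0.0
```

Under this definition, a recall of 0.5 means that half of the time bins with an actual delay were flagged by the model.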

andrewyue/WMATAWatcher