A peek inside the black box.
When I started developing “NextBus Delay Tracker” (NBDT), the idea of
exploring an entirely new set of tools intrigued me.
Prior to this project, I had never worked with databases (NBDT uses MySQL),
JavaScript libraries such as jQuery and Highcharts, PHP, Python,
HTML5/canvas, Typekit, or git/GitHub.
I have enjoyed the exploration process as much as I have
enjoyed dissecting NextBus’s predictions; truly, this exercise has been a
climb to stand on the shoulders of giants. To acknowledge all of the great
technologies that NBDT relies on, I break NBDT down into its
component parts below and explain each component’s role in the overall tracker.
This post is the second of four in a series about public transportation and buses.
Part 1,
“This is the 1 Bus in Boston”, explores bus trends
gathered from real-time prediction and arrival data. Part 3,
“Predictions from
Predictions”, compares the accuracy of predictions from NextBus Delay Tracker and
NextBus. Part 4, “The Case for Public Transportation”,
is an ode to public transportation in sunny and sprawled Los Angeles.
Broadly...
The cartoon block diagram below illustrates the general flow of information from
the transmission of real-time bus locations to the storage and analysis of NextBus’s
generated predictions. Although the client-side browser is responsible for
unpacking JSON data objects to render charts,
the actual day-to-day work of generating, storing, and analyzing predictions falls on
two servers: a server that collects real-time bus information and generates
predictions (operated free of charge by NextBus Inc.),
and a shared Unix environment on Dreamhost
that runs NBDT’s Python code, stores data, and hosts the MySQL database.
Polling for predictions
Although the link between NextBus’s real-time location
trackers and the NextBus database is proprietary, NextBus exposes a substantial
amount of data from its bus trackers in a publicly accessible XML-formatted feed.
Querying the NextBus feed simply requires opening a URL formatted according to
NextBus’s XML specification; NextBus then returns a text file with the
requested information. For example, opening the following address in a
requested information. For example, opening the following address in a
web browser returns southbound predictions for the 1 Bus’s Harvard/Holyoke and
Central Square stops:
http://webservices.nextbus.com/service/publicXMLFeed?command=predictionsForMultiStops&a=mbta&stops=1|110&stops=1|72.
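For instance, a few lines of Python can fetch and unpack this feed. The sketch below is minimal; the stopTitle and seconds attributes come from NextBus’s XML specification, but the loop structure is just one way to walk the response:

```python
# Minimal sketch: fetch the example feed above and print each stop's
# upcoming arrivals. Attribute names follow NextBus's XML specification.
from urllib.request import urlopen
import xml.etree.ElementTree as ET

URL = ("http://webservices.nextbus.com/service/publicXMLFeed"
       "?command=predictionsForMultiStops&a=mbta&stops=1|110&stops=1|72")

with urlopen(URL) as response:
    root = ET.parse(response).getroot()

for stop in root.iter("predictions"):      # one element per stop
    for p in stop.iter("prediction"):      # one element per upcoming bus
        print(stop.get("stopTitle"), p.get("seconds"), "seconds away")
```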
To poll for and store NextBus predictions, I wrote an open-source
Python scraper.
The scraper runs every other minute on the
Dreamhost server via crontab, a common Unix utility for scheduling tasks. At
the beginning of each day,
the scraper updates the list of stops for each bus route of interest. Although
uncommon, transit agencies occasionally modify stops for a particular route to
accommodate temporary road closures or permanent route changes. Typically,
a route only has two directions (one each way); however, routes can have
multiple branches. Using the updated
route data, the scraper requests predictions for every stop and then stores all
predictions as comma-separated values in a uniquely named text file
based on each bus’s vehicle ID, trip ID, and the current date. As it is not possible
to obtain historical predictions from NextBus, logging predictions frequently in real
time is essential to capturing different prediction dynamics, and prediction logs
remain on the server as text files for future use.
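The logging step itself is straightforward. Here is an illustrative sketch; the filename scheme, the CSV columns, and the crontab path are my assumptions rather than the scraper’s exact format:

```python
# Illustrative sketch of the logging step; the filename scheme and CSV
# columns are assumptions, not the scraper's exact format.
import csv
import time
from pathlib import Path

def log_predictions(rows, vehicle_id, trip_id, log_dir="logs"):
    """Append prediction rows to a file named by vehicle, trip, and date."""
    date = time.strftime("%Y%m%d")
    path = Path(log_dir) / f"{vehicle_id}_{trip_id}_{date}.csv"
    with open(path, "a", newline="") as f:
        csv.writer(f).writerows(rows)

# A crontab entry like this (hypothetical path) runs a scraper every
# other minute:
#   */2 * * * * python /home/nbdt/scraper.py
```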
The magical MySQL database
When I first started analyzing NextBus predictions, I didn’t know about
relational databases. Instead of sensibly importing all of the comma-separated-value
files into a database, I tried to automate
everything in Matlab. The approach had several drawbacks: for example,
filtering the dataset required more effort due to the lack of a SQL-like language,
and updating data in real time was difficult due to the lack of support for
defining primary keys. However, despite these difficulties, the initial Matlab
analysis convinced me that there were patterns worth pursuing in the data.
After learning the basics of relational databases, I moved the data import
and analysis from Matlab to a combination of Python and MySQL.
The database schema, which specifies how data is
stored inside the database, is fairly simple. Here is a sample of the schema for the
bus predictions table, with asterisks marking the fields in the primary key:
- epochTime* (bigint): time, in seconds since the Unix epoch (January 1, 1970),
that the scraper asked for a prediction
- year (smallint), month (tinyint), day (tinyint), dayofWeek (tinyint):
duplicate time fields derived from epochTime during record insertion to
speed up queries
- vehicleID* (int): the unique ID assigned to each physical bus
- tripID* (int): the ID assigned to a bus trip; this is not unique
- route (smallint), direction (tinyint), stop* (varchar):
bus trip information
- prediction (smallint): seconds until the next arrival at this stop relative
to when the scraper asked for a prediction
- error (smallint): error in seconds of this prediction, updated after the bus
arrives
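Rendered as SQL, the table above might be created with something like the following sketch; the table name, VARCHAR size, and connection details are hypothetical:

```python
# Sketch of the predictions table from the field list above; the table
# name, VARCHAR size, and connection details are hypothetical.
import pymysql

SCHEMA = """
CREATE TABLE IF NOT EXISTS predictions (
    epochTime  BIGINT      NOT NULL,  -- seconds since the Unix epoch
    year       SMALLINT,
    month      TINYINT,
    day        TINYINT,
    dayofWeek  TINYINT,
    vehicleID  INT         NOT NULL,
    tripID     INT         NOT NULL,
    route      SMALLINT,
    direction  TINYINT,
    stop       VARCHAR(16) NOT NULL,
    prediction SMALLINT,              -- seconds until the next arrival
    error      SMALLINT,              -- filled in after the bus arrives
    PRIMARY KEY (epochTime, vehicleID, tripID, stop)
)
"""

conn = pymysql.connect(host="localhost", user="nbdt",
                       password="...", database="nbdt")
with conn.cursor() as cur:
    cur.execute(SCHEMA)
conn.commit()
```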
After retrieving each new set of predictions from NextBus, a Python script
inserts bus trips, stop predictions, stop arrivals, and prediction errors into the
database. The script takes advantage of a series of useful libraries, such as
PyMySQL and MySQLdb
for database transactions and
pytz for timezone support.
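Below is a minimal sketch of what one insertion might look like with PyMySQL; treating a repeated primary key as an update, and the weekday encoding, are my assumptions about how the script behaves:

```python
# Sketch of one record insertion, assuming the table above. The upsert on
# the primary key and the weekday encoding (0 = Monday, from Python's
# weekday()) are assumptions about the script's behavior.
from datetime import datetime
import pytz

EASTERN = pytz.timezone("US/Eastern")

def insert_prediction(conn, epoch, vehicle_id, trip_id, route,
                      direction, stop, prediction):
    local = datetime.fromtimestamp(epoch, EASTERN)  # Boston local time
    with conn.cursor() as cur:
        cur.execute(
            """INSERT INTO predictions
                 (epochTime, year, month, day, dayofWeek, vehicleID,
                  tripID, route, direction, stop, prediction)
               VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
               ON DUPLICATE KEY UPDATE prediction = VALUES(prediction)""",
            (epoch, local.year, local.month, local.day, local.weekday(),
             vehicle_id, trip_id, route, direction, stop, prediction))
    conn.commit()
```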
Analysis
There are two sets of queries that provide information to NextBus Delay Tracker.
The first set, static queries, runs only
once for the analysis period; these queries return static data
for the NBDT charts. The second set, flipboard queries, runs
once every two minutes, after the scraper, to update inputs for the linear-regression
error estimator shown in the flipboard illustration. Both sets of queries are run by Python
scripts using the same MySQL support libraries (PyMySQL and MySQLdb), and the scripts
store output data from the queries as JSON files.
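A flipboard-style query script might look roughly like the following; the query, the two-hour window, and the output filename are illustrative:

```python
# Rough shape of a flipboard query script: pull recent rows, then write
# them out as JSON for the browser. Query and filename are illustrative.
import json
import pymysql

conn = pymysql.connect(host="localhost", user="nbdt",
                       password="...", database="nbdt")
with conn.cursor() as cur:
    cur.execute("""SELECT stop, prediction, error FROM predictions
                   WHERE epochTime > UNIX_TIMESTAMP() - 7200""")
    rows = [{"stop": s, "prediction": p, "error": e}
            for s, p, e in cur.fetchall()]

with open("flipboard.json", "w") as f:
    json.dump(rows, f)
```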
The prediction algorithm is a linear regression
based on each stop’s current prediction, as well as the error of similar predictions
for the past few hours. For example, if NextBus predicts that the next bus for
Central Square will arrive in 5 minutes, then the script queries the database
for all Central Square predictions and errors between 2.5 and 7.5 minutes over the
last two hours. To facilitate training and feature selection, NBDT uses
lintrain,
a custom linear regression library built by the talented
Nathan Perkins for MIT’s Big Data
Challenge. Although I also tried scikit-learn’s random forest
regression tools, running scikit-learn on Dreamhost to generate error estimates in real time
required compiling two Fortran libraries, and the task proved
too difficult given the limited privileges of the shared environment.
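Returning to the Central Square example, the training-window query might look roughly like this sketch; the stop tag is assumed from the example feed URL earlier in the post:

```python
# Sketch of the training-window query for the Central Square example: all
# predictions within +/-50% of the current 5-minute prediction over the
# last two hours. The stop tag "72" is assumed from the feed URL above.
import time

def training_window(conn, stop="72", current_prediction=300):
    lo, hi = 0.5 * current_prediction, 1.5 * current_prediction  # 150-450 s
    with conn.cursor() as cur:
        cur.execute(
            """SELECT epochTime, prediction, error FROM predictions
               WHERE stop = %s AND prediction BETWEEN %s AND %s
                 AND epochTime > %s AND error IS NOT NULL""",
            (stop, lo, hi, time.time() - 7200))
        return cur.fetchall()
```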
During testing with the linear regression, I learned that most features I intuitively
expected to matter turned out to be substantially less useful in predicting NextBus’s errors.
For example, it seemed natural to suspect that predictions would be less
accurate during rush hour, when traffic flows and accidents may create unexpected
delays. However, after trying a series of features, I found NextBus’s recent predictions
and their actual errors, as well as the derivatives of these pairs for each stop, to
provide the best (least mean squared error) performance. Consequently,
the current flipboard display estimates errors using a linear regression based on
current and past predictions from the last two hours, their errors, and their
derivatives.
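Because lintrain is a custom library, the sketch below approximates the idea in plain NumPy instead: fit the next error as a linear function of recent predictions, past errors, and their finite-difference derivatives. It illustrates the feature set, not NBDT’s actual estimator:

```python
# NumPy approximation of the idea behind the estimator: regress the next
# error on recent predictions, past errors, and their discrete derivatives.
# This illustrates the feature set, not NBDT's actual lintrain model.
import numpy as np

def fit_error_model(predictions, errors):
    """predictions, errors: 1-D arrays ordered in time for one stop."""
    p = np.asarray(predictions, dtype=float)
    e = np.asarray(errors, dtype=float)
    dp, de = np.gradient(p), np.gradient(e)  # finite-difference derivatives
    # Features at step t predict the error observed at step t + 1.
    X = np.column_stack([p, e, dp, de, np.ones_like(p)])[:-1]
    y = e[1:]
    weights, *_ = np.linalg.lstsq(X, y, rcond=None)
    return weights
```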
Presentation
The “pictures” shown for NBDT are actually dynamically drawn by the browser, either
via Highcharts or a home-grown flipboard illustration based on HTML5’s canvas
tag. Avoiding static picture formats such as JPEGs and PNGs makes parameterizing
the drawing process substantially easier. As a result, when data such as a
prediction changes, a simple text change can propagate through to the browser
as a “graphical” update. Additionally, the parameterization makes creating a new chart
a relatively simple process. Every chart on the NBDT page is drawn
with a single, generic chart command that lightly wraps around Highcharts.
To pass data from the Python-created JSON files into Highcharts and the flipboard,
the main JavaScript file uses two jQuery/AJAX calls to unpack the
previously mentioned static and flipboard JSON files and load them into
JavaScript arrays. jQuery is a JavaScript library that simplifies interactions with
objects in the browser’s document object model, and AJAX calls allow browsers
to send and receive data without refreshing a page. Because the static data does
not change, the first AJAX call only runs once. In contrast, every minute, the
second AJAX call refreshes the flipboard’s JSON file and then redraws the flipboard.
Because Apache and most browsers support automatic compression for text files, the
data required to draw the graphical elements of NBDT is automatically
compressed for supporting clients. Broadly speaking, the web pages that readers see
for NBDT start as a combination of dynamic and static text files that the browser
magically renders into a coherent document.
Stand on the shoulders of giants
I am amazed by the tools and technologies that have made NextBus Delay Tracker
possible. I find tremendous elegance in a design process that pieces
disparate parts together, block by block, paying attention not only to the
individual modules but also to their interactions, until the whole exceeds
the sum of its parts. Today, the component pieces for software and hardware
are more spectacular and more accessible than ever before; as an engineer,
I am inspired.
This is the second article in a series of four. Other posts in this series:
Part 1: “This is the 1 Bus in Boston.”
Part 3: “Predictions from Predictions.”
Part 4: “The Case for Public Transportation.”
Posted May 14, 2014; last substantially updated July 28, 2014