A peek inside the black box.
When I started developing “NextBus Delay Tracker” (NBDT), the idea of
exploring an entirely new set of tools intrigued me.
Prior to this project, I had never worked with databases (NBDT uses MySQL),
JavaScript libraries such as jQuery and Highcharts, PHP, Python,
HTML5/canvas, Typekit, or git/GitHub.
I have enjoyed the exploration process as much as I have
enjoyed dissecting NextBus’s predictions; truly, this exercise has been a
climb to stand on the shoulders of giants. To acknowledge all of the great
technologies that NBDT relies on, I break NBDT down into its
component parts below and explain each component’s role in the overall tracker.
This post is the second of four in a series about public transportation and buses.
Part 1,
“This is the 1 Bus in Boston”, explores bus trends
gathered from real-time prediction and arrival data. Part 3,
“Predictions from
Predictions”, compares the accuracy of predictions from NextBus Delay Tracker and
NextBus. Part 4, “The Case for Public Transportation”,
is an ode to public transportation in sunny and sprawled Los Angeles.
Broadly...
The cartoon block diagram below illustrates the general flow of information from
the transmission of real-time bus locations to the storage and analysis of NextBus’s
generated predictions. Although the client-side browser is responsible for
unpacking JSON data objects to render charts,
the actual day-to-day work of generating, storing, and analyzing predictions falls on
two servers: a server that collects real-time bus information and generates
predictions (operated free of charge by NextBus Inc.),
and a shared Unix environment on Dreamhost
that runs NBDT’s Python code, stores data, and hosts the MySQL database.
Polling for predictions
Although the link between NextBus’s real-time location
trackers and the NextBus database is proprietary, NextBus exposes a substantial
amount of data from its bus trackers in a publicly accessible XML-formatted feed.
Querying the NextBus feed simply requires opening a URL formatted according to
NextBus’s XML specification; NextBus then returns a text file with the
requested information. For example, opening the following address in a
requested information. For example, opening the following address in a
web browser returns southbound predictions for the 1 Bus’s Harvard/Holyoke and
Central Square stops:
http://webservices.nextbus.com/service/publicXMLFeed?command=predictionsForMultiStops&a=mbta&stops=1|110&stops=1|72.
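For instance, a few lines of Python can fetch and unpack this feed. The sketch below is minimal; the stopTitle and seconds attributes come from NextBus’s XML specification, but the loop structure is just one way to walk the response:

```python
# Minimal sketch: fetch the example feed above and print each stop's
# upcoming arrivals. Attribute names follow NextBus's XML specification.
from urllib.request import urlopen
import xml.etree.ElementTree as ET

URL = ("http://webservices.nextbus.com/service/publicXMLFeed"
       "?command=predictionsForMultiStops&a=mbta&stops=1|110&stops=1|72")

with urlopen(URL) as response:
    root = ET.parse(response).getroot()

for stop in root.iter("predictions"):      # one element per stop
    for p in stop.iter("prediction"):      # one element per upcoming bus
        print(stop.get("stopTitle"), p.get("seconds"), "seconds away")
```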
To poll for and store NextBus predictions, I wrote an open-source
Python scraper.
The scraper runs every other minute on the
Dreamhost server via crontab, a common Unix utility for scheduling tasks. At
the beginning of each day,
the scraper updates the list of stops for each bus route of interest. Although
uncommon, transit agencies occasionally modify stops for a particular route to
accommodate temporary road closures or permanent route changes. Typically,
a route only has two directions (one each way); however, routes can have
multiple branches. Using the updated
route data, the scraper requests predictions for every stop and then stores all
predictions as comma-separated values in a uniquely named text file
based on each bus’s vehicle ID, trip ID, and the current date. As it is not possible
to obtain historical predictions from NextBus, logging predictions frequently in real
time is essential to capturing different prediction dynamics, and prediction logs
remain on the server as text files for future use.
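The logging step itself is straightforward. Here is an illustrative sketch; the filename scheme, the CSV columns, and the crontab path are my assumptions rather than the scraper’s exact format:

```python
# Illustrative sketch of the logging step; the filename scheme and CSV
# columns are assumptions, not the scraper's exact format.
import csv
import time
from pathlib import Path

def log_predictions(rows, vehicle_id, trip_id, log_dir="logs"):
    """Append prediction rows to a file named by vehicle, trip, and date."""
    date = time.strftime("%Y%m%d")
    path = Path(log_dir) / f"{vehicle_id}_{trip_id}_{date}.csv"
    with open(path, "a", newline="") as f:
        csv.writer(f).writerows(rows)

# A crontab entry like this (hypothetical path) runs a scraper every
# other minute:
#   */2 * * * * python /home/nbdt/scraper.py
```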
The magical MySQL database
When I first started analyzing NextBus predictions, I didn’t know about
relational databases. Instead of sensibly importing all of the comma-separated-value
files into a database, I tried to automate
everything in Matlab. The approach had several drawbacks: for example,
filtering the dataset required more effort due to the lack of a SQL-like language,
and updating data in real time was difficult due to the lack of support for
defining primary keys. However, despite these difficulties, the initial Matlab
analysis convinced me that there were patterns worth pursuing in the data.
After learning the basics of relational databases, I moved the data import
and analysis from Matlab to a combination of Python and MySQL.
The database schema, which specifies how data is
stored inside the database, is fairly simple. Here is a sample of the schema for the
bus predictions table, with asterisks marking the fields in the primary key:
- epochTime* (bigint): time, in seconds since the Unix epoch (January 1, 1970),
that the scraper asked for a prediction
- year (smallint), month (tinyint), day (tinyint), dayofWeek (tinyint):
duplicate time fields derived from epochTime during record insertion to
speed up queries
- vehicleID* (int): the unique ID assigned to each physical bus
- tripID* (int): the ID assigned to a bus trip; this is not unique
- route (smallint), direction (tinyint), stop* (varchar):
bus trip information
- prediction (smallint): seconds until the next arrival at this stop relative
to when the scraper asked for a prediction
- error (smallint): error in seconds of this prediction, updated after the bus
arrives
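Rendered as SQL, the table above might be created with something like the following sketch; the table name, VARCHAR size, and connection details are hypothetical:

```python
# Sketch of the predictions table from the field list above; the table
# name, VARCHAR size, and connection details are hypothetical.
import pymysql

SCHEMA = """
CREATE TABLE IF NOT EXISTS predictions (
    epochTime  BIGINT      NOT NULL,  -- seconds since the Unix epoch
    year       SMALLINT,
    month      TINYINT,
    day        TINYINT,
    dayofWeek  TINYINT,
    vehicleID  INT         NOT NULL,
    tripID     INT         NOT NULL,
    route      SMALLINT,
    direction  TINYINT,
    stop       VARCHAR(16) NOT NULL,
    prediction SMALLINT,              -- seconds until the next arrival
    error      SMALLINT,              -- filled in after the bus arrives
    PRIMARY KEY (epochTime, vehicleID, tripID, stop)
)
"""

conn = pymysql.connect(host="localhost", user="nbdt",
                       password="...", database="nbdt")
with conn.cursor() as cur:
    cur.execute(SCHEMA)
conn.commit()
```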
After retrieving each new set of predictions from NextBus, a Python script
inserts bus trips, stop predictions, stop arrivals, and prediction errors into the
database. The script takes advantage of a series of useful libraries, such as
PyMySQL and MySQLdb
for database transactions and
pytz for timezone support.
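Below is a minimal sketch of what one insertion might look like with PyMySQL; treating a repeated primary key as an update, and the weekday encoding, are my assumptions about how the script behaves:

```python
# Sketch of one record insertion, assuming the table above. The upsert on
# the primary key and the weekday encoding (0 = Monday, from Python's
# weekday()) are assumptions about the script's behavior.
from datetime import datetime
import pytz

EASTERN = pytz.timezone("US/Eastern")

def insert_prediction(conn, epoch, vehicle_id, trip_id, route,
                      direction, stop, prediction):
    local = datetime.fromtimestamp(epoch, EASTERN)  # Boston local time
    with conn.cursor() as cur:
        cur.execute(
            """INSERT INTO predictions
                 (epochTime, year, month, day, dayofWeek, vehicleID,
                  tripID, route, direction, stop, prediction)
               VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
               ON DUPLICATE KEY UPDATE prediction = VALUES(prediction)""",
            (epoch, local.year, local.month, local.day, local.weekday(),
             vehicle_id, trip_id, route, direction, stop, prediction))
    conn.commit()
```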
Analysis
There are two sets of queries that provide information to NextBus Delay Tracker.
The first set, static queries, runs only
once for the analysis period; these queries return static data
for the NBDT charts. The second set, flipboard queries, runs
once every two minutes, after the scraper, to update inputs for the linear-regression
error estimator shown in the flipboard illustration. Both sets of queries are run by Python
scripts using the same MySQL support libraries (PyMySQL and MySQLdb), and the scripts
store output data from the queries as JSON files.
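A flipboard-style query script might look roughly like the following; the query, the two-hour window, and the output filename are illustrative:

```python
# Rough shape of a flipboard query script: pull recent rows, then write
# them out as JSON for the browser. Query and filename are illustrative.
import json
import pymysql

conn = pymysql.connect(host="localhost", user="nbdt",
                       password="...", database="nbdt")
with conn.cursor() as cur:
    cur.execute("""SELECT stop, prediction, error FROM predictions
                   WHERE epochTime > UNIX_TIMESTAMP() - 7200""")
    rows = [{"stop": s, "prediction": p, "error": e}
            for s, p, e in cur.fetchall()]

with open("flipboard.json", "w") as f:
    json.dump(rows, f)
```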
The prediction algorithm is a linear regression
based on each stop’s current prediction, as well as the error of similar predictions
for the past few hours. For example, if NextBus predicts that the next bus for
Central Square will arrive in 5 minutes, then the script queries the database
for all Central Square predictions and errors between 2.5 and 7.5 minutes over the
last two hours. To facilitate training and feature selection, NBDT uses
lintrain,
a custom linear regression library built by the talented
Nathan Perkins for MIT’s Big Data
Challenge. Although I also tried scikit-learn’s random forest
regression tools, running scikit-learn on Dreamhost to generate error estimates in real time
required compiling two Fortran libraries, and the task proved
too difficult given the limited privileges of the shared environment.
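Returning to the Central Square example, the training-window query might look roughly like this sketch; the stop tag is assumed from the example feed URL earlier in the post:

```python
# Sketch of the training-window query for the Central Square example: all
# predictions within +/-50% of the current 5-minute prediction over the
# last two hours. The stop tag "72" is assumed from the feed URL above.
import time

def training_window(conn, stop="72", current_prediction=300):
    lo, hi = 0.5 * current_prediction, 1.5 * current_prediction  # 150-450 s
    with conn.cursor() as cur:
        cur.execute(
            """SELECT epochTime, prediction, error FROM predictions
               WHERE stop = %s AND prediction BETWEEN %s AND %s
                 AND epochTime > %s AND error IS NOT NULL""",
            (stop, lo, hi, time.time() - 7200))
        return cur.fetchall()
```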
During testing with the linear regression, I learned that most features I intuitively
expected to matter turned out to be substantially less useful in predicting NextBus’s errors.
For example, it seemed natural to suspect that predictions would be less
accurate during rush hour, when traffic flows and accidents may create unexpected
delays. However, after trying a series of features, I found NextBus’s recent predictions
and their actual errors, as well as the derivatives of these pairs for each stop, to
provide the best (least mean squared error) performance. Consequently,
the current flipboard display estimates errors using a linear regression based on
current and past predictions from the last two hours, their errors, and their
derivatives.
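Because lintrain is a custom library, the sketch below approximates the idea in plain NumPy instead: fit the next error as a linear function of recent predictions, past errors, and their finite-difference derivatives. It illustrates the feature set, not NBDT’s actual estimator:

```python
# NumPy approximation of the idea behind the estimator: regress the next
# error on recent predictions, past errors, and their discrete derivatives.
# This illustrates the feature set, not NBDT's actual lintrain model.
import numpy as np

def fit_error_model(predictions, errors):
    """predictions, errors: 1-D arrays ordered in time for one stop."""
    p = np.asarray(predictions, dtype=float)
    e = np.asarray(errors, dtype=float)
    dp, de = np.gradient(p), np.gradient(e)  # finite-difference derivatives
    # Features at step t predict the error observed at step t + 1.
    X = np.column_stack([p, e, dp, de, np.ones_like(p)])[:-1]
    y = e[1:]
    weights, *_ = np.linalg.lstsq(X, y, rcond=None)
    return weights
```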
Presentation
The “pictures” shown for NBDT are actually dynamically drawn by the browser, either
via Highcharts or a home-grown flipboard illustration based on HTML5’s canvas
tag. Avoiding static picture formats such as JPEGs and PNGs makes parameterizing
the drawing process substantially easier. As a result, when data such as a
prediction changes, a simple text change can propagate through to the browser
as a “graphical” update. Additionally, the parameterization makes creating a new chart
a relatively simple process. Every chart on the NBDT page is drawn
with a single, generic chart command that lightly wraps around Highcharts.
To pass data from the Python-created JSON files into Highcharts and the flipboard,
the main JavaScript file uses two jQuery/AJAX calls to unpack the
previously mentioned static and flipboard JSON files and load them into
JavaScript arrays. jQuery is a JavaScript library that simplifies interactions with
objects in the browser’s document object model, and AJAX calls allow browsers
to send and receive data without refreshing a page. Because the static data does
not change, the first AJAX call only runs once. In contrast, every minute, the
second AJAX call refreshes the flipboard’s JSON file and then redraws the flipboard.
Because Apache and most browsers support automatic compression for text files, the
data required to draw the graphical elements of NBDT is automatically
compressed for supporting clients. Broadly speaking, the web pages that readers see
for NBDT start as a combination of dynamic and static text files that the browser
magically renders into a coherent document.
Stand on the shoulders of giants
I am amazed by the tools and technologies that have made NextBus Delay Tracker
possible. I find tremendous elegance in a design process that pieces
disparate parts together, block by block, paying attention not only to the
individual modules but also to their interactions, until the whole exceeds
the sum of its parts. Today, the component pieces for software and hardware
are more spectacular and more accessible than ever before; as an engineer,
I am inspired.
This is the second article in a series of four. Other posts in this series:
Part 1: “This is the 1 Bus in Boston.”
Part 3: “Predictions from Predictions.”
Part 4: “The Case for Public Transportation.”
Posted May 14, 2014; last substantially updated July 28, 2014