### What electronic bus boards don't tell you, filling in the gaps in TfL bus data with Kognitio SQL


We recently spoke at Big Data London, where our focus was performing BI on Hadoop using some bus data from Transport for London (TfL). After the talk there was some interest in the maths behind what we presented, and whilst it is not overly complicated (yet), a blog entry with more explanation seemed appropriate.

The bus data was loaded by reading JSON files into Kognitio, which you can read more about here.

**Poisson process**

When waiting at a bus stop you will see the expected time of arrival for the next bus on the electronic board. However, this expectation changes as time goes on: maybe the bus is late, or even early (this is bad too!). So rather than look at delay times, it’s better to look at waiting times, as these can’t be misinterpreted.

The Poisson process is basically a counting process. Think of it as a set of steps:

Where each “step” is the count increasing. Typically it’s used to model arrivals, for example bus arrivals at a stop, calls to a call centre, customers at a supermarket checkout etc. There are two types, depending on whether the rate of arrivals is constant or variable (homogeneous and in-homogeneous).

There are more formal definitions which you can find on sites like Wikipedia and various papers/lecture notes returned by a search engine.

You’d expect bus waiting times (if the buses were operating efficiently) to fit the constant rate model with some tolerance for earliness or lateness, i.e. within one standard deviation. This constant rate is typically based on the average.

Visually this would look like:

One consideration I would propose, particularly with bus arrival times, is that if you model this over, say, a month and some buses only run at certain times of the day, then the overall picture is:

Because the constant rate is an average, a long empty stretch between arrivals artificially lowers the arrival rate: the same arrivals are spread over a longer period that includes time when the buses were not operating.
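To put rough numbers on this effect, here is a minimal sketch. The arrival count and operating hours are hypothetical figures chosen for illustration, not values from the TfL data.

```python
# Hypothetical figures: a route with 60 arrivals per day at one stop,
# operating for 15 hours each day, observed over 30 days.
arrivals_per_day = 60
operating_hours_per_day = 15
days = 30

total_arrivals = arrivals_per_day * days

# Rate measured only over the hours the route actually runs.
rate_operating = total_arrivals / (operating_hours_per_day * days)

# Rate measured over the whole calendar month, idle hours included.
rate_calendar = total_arrivals / (24 * days)

print(f"operating-hours rate: {rate_operating:.2f} buses/hour")
print(f"calendar rate:        {rate_calendar:.2f} buses/hour")
```

The same arrivals give a noticeably lower rate once the idle hours are counted, which is the distortion described above.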

You could combine them without the gap in between which is what I chose to do (I wanted to model each bus route at each stop) or leave each day as its own model.

For example if you had two Poisson processes:

Adding these two together would yield the previous one and the arrival rate would just be the sum of their two individual arrival rates.
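The additivity of arrival rates can be checked with a small simulation. This is only a sketch: the two rates, the time horizon and the helper `simulate_arrivals` are made up for the example, and the processes are simulated rather than taken from the bus data.

```python
import random

def simulate_arrivals(rate, horizon, rng):
    """Arrival times of a constant-rate Poisson process on [0, horizon)."""
    times = []
    t = rng.expovariate(rate)  # gaps between arrivals are exponential
    while t < horizon:
        times.append(t)
        t += rng.expovariate(rate)
    return times

rng = random.Random(42)     # fixed seed so the run is repeatable
horizon = 10_000.0          # total observed time in hours (hypothetical)
rate_a, rate_b = 2.0, 3.0   # hypothetical buses per hour

route_a = simulate_arrivals(rate_a, horizon, rng)
route_b = simulate_arrivals(rate_b, horizon, rng)

# Superimpose the two processes by merging their arrival times.
combined = sorted(route_a + route_b)
combined_rate = len(combined) / horizon

print(f"estimated combined rate: {combined_rate:.3f}")
print(f"sum of individual rates: {rate_a + rate_b:.3f}")
```

Over a long enough horizon the estimated combined rate should sit close to the sum of the two individual rates.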

**Useful tips for modelling with Poisson processes**

The first useful tip is that you can query the data using the Poisson distribution to find out the probability of a particular event X occurring k times in time t:

$$P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$

Where λ is the arrival rate. This needs to be adjusted for the designated time window, e.g. if your arrival rate is per hour and you want to look at a window of ten minutes, you’d have to divide the arrival rate by six.

For example, what’s the probability that exactly two buses will arrive in ten minutes based on past arrivals? The “exactly” is quite important here because the case of two buses doesn’t include the case with one bus! To cover both, you’d have to work out the probability for each case and add them up, which gives you the probability that one or two buses (but not zero) will arrive in ten minutes.

But this involves calculating lots of probabilities, and some may not apply everywhere; for example, the probability of two buses arriving in ten minutes may not be relevant to a bus stop further out from the city. Obviously you can be selective based on the distance, but an overall picture would be more useful.
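As a worked version of that question, the sketch below uses the standard Poisson probability formula. It assumes a hypothetical rate of six buses per hour; `poisson_pmf` is a helper written for this example, not a Kognitio function.

```python
from math import exp, factorial

def poisson_pmf(k, expected_count):
    """P(X = k) when expected_count arrivals are expected in the window."""
    return expected_count ** k * exp(-expected_count) / factorial(k)

rate_per_hour = 6.0      # hypothetical: six buses per hour on average
window_minutes = 10

# Scale the hourly rate to the ten-minute window (divide by six).
lam = rate_per_hour * window_minutes / 60

p_exactly_one = poisson_pmf(1, lam)
p_exactly_two = poisson_pmf(2, lam)
p_one_or_two = p_exactly_one + p_exactly_two

print(f"P(exactly 1) = {p_exactly_one:.4f}")
print(f"P(exactly 2) = {p_exactly_two:.4f}")
print(f"P(1 or 2)    = {p_one_or_two:.4f}")
```

Note that `p_exactly_two` on its own is smaller than `p_one_or_two`, which is why the “exactly” matters.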

Take the example where you’re waiting ten minutes at a bus stop for the next bus. Imagine that, whilst waiting, you check every minute whether the bus has arrived. If it has arrived it’s a success, and if it hasn’t it’s a failure. In probability, you can model this with a Binomial distribution:

$$P(X = k) = \binom{n}{k} p^{k} (1 - p)^{n - k}$$

Which gives you the probability that you get k successes in n trials.

Where p, the probability of success, is the probability of an arrival within one waiting interval, which can be approximated by

$$p \approx \frac{\lambda t}{n}$$

for a total waiting time t split into n intervals.

For us, a success is a bus arriving (so k = 1) and we check every minute, so if we monitor this over ten minutes we’d have ten trials (n = 10). As you get more granular with time (minutes, seconds etc.) this actually converges to the cumulative distribution function of the exponential distribution:

$$F(t) = \begin{cases} 1 - e^{-\lambda t} & t \ge 0 \\ 0 & t < 0 \end{cases}$$

So the probability is zero if our time is before 0.

If you plot this, what you get is the probability that a bus will have arrived before some time, based on the past data. A good distribution would plateau quickly, meaning you can be highly certain that a bus will have arrived after a short wait.
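The convergence can be checked numerically. The sketch below rests on two assumptions made for illustration: each check succeeds with probability rate × wait / checks, and the quantity of interest is at least one bus arriving during the wait. The rate of one bus every ten minutes is hypothetical.

```python
from math import exp

def binomial_at_least_one(rate, wait, checks):
    """P(at least one success) when `wait` is split into `checks` equal trials,
    each succeeding with probability p = rate * wait / checks."""
    p = rate * wait / checks
    return 1.0 - (1.0 - p) ** checks

def exponential_cdf(rate, wait):
    """P(waiting time <= wait) for a constant arrival rate; zero before time 0."""
    return 1.0 - exp(-rate * wait) if wait >= 0 else 0.0

rate = 0.1    # hypothetical: one bus every ten minutes on average
wait = 10.0   # minutes spent at the stop

# Check every minute, every second, then every 10 milliseconds.
for checks in (10, 600, 60_000):
    print(checks, round(binomial_at_least_one(rate, wait, checks), 5))
print("limit", round(exponential_cdf(rate, wait), 5))
```

As the number of checks grows, the Binomial figure settles towards the exponential CDF value for the same wait.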

**What happened with my TfL buses?**

At a lot of the central city stops, there were many instances where the waiting time between buses was zero. If we look at Olympia bus station and plot the CDF across the five time periods of the day (morning, day, afternoon etc.) that we mentioned previously:

The probability that a bus has arrived once we have waited the average waiting time is quite low (essentially a coin flip!) but if we plot the waiting times themselves:

But this doesn’t support our expectations from earlier. It shows several buses on the same route arriving at the same time, and the peaks around the average are very weak! Basically, if the bus doesn’t come along within the first two minutes of you arriving, then you can expect to wait a long time! This is caused by both late and early buses:

The third and fourth buses may have been a little early (within some tolerance), but if you turn up at a bus stop at random then you’d likely arrive in the large gap between the fourth and fifth buses. Most people would shrug off the extra minute’s wait, but if every bus were allowed to adjust like this then you’d end up with a very inconsistent bus network.

A worse case is Hammersmith bus station, where many buses were all arriving at once and there isn’t much of an average, so that may look like this:

London is a busy city and certainly there will be traffic incidents, congestion etc. But the data shows that a constant rate model isn’t suitable for TfL buses. This is basically down to using the average. If arrivals were at minutes 2, 4 and 40 then the average would be around 15 minutes, which would be a poor estimate of the waiting time.
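As a quick check of that arithmetic, the snippet below treats the three figures as observed values in minutes (an assumption made for the example) and compares two common summary statistics.

```python
from statistics import mean, median

# The three figures from the text, treated here as observed values in minutes.
waits = [2, 4, 40]

print(f"mean   = {mean(waits):.1f} minutes")
print(f"median = {median(waits)} minutes")
```

Neither number describes what a passenger would typically experience: two of the values are far below the mean and one is far above it.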

But the median wouldn’t help here either. What this needs is a variable arrival rate modelled using an in-homogeneous Poisson process.
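One simple way to approximate a variable rate is to estimate a separate constant rate for each period of the day. The sketch below does that. The period boundaries and arrival timestamps are invented for illustration, and this is only one possible starting point rather than the method applied to the TfL data.

```python
from collections import Counter

# Hypothetical period boundaries (hour of day) and arrival timestamps in
# hours since midnight; neither comes from the TfL data set.
periods = {"morning": (6, 10), "day": (10, 16), "evening": (16, 21)}
arrival_hours = [6.1, 6.3, 6.4, 7.0, 7.2, 8.5, 9.1, 9.8,
                 10.5, 12.0, 13.5, 15.0,
                 16.1, 16.2, 16.9, 17.3, 17.4, 18.0, 19.5, 20.5]

def period_of(hour):
    """Name of the period containing `hour`, or None outside service hours."""
    for name, (start, end) in periods.items():
        if start <= hour < end:
            return name
    return None

counts = Counter(period_of(h) for h in arrival_hours)

# One constant rate per period: arrivals divided by the period's length.
rates = {name: counts[name] / (end - start)
         for name, (start, end) in periods.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.2f} buses/hour")
```

A rate that changes by period like this is a piecewise-constant version of the in-homogeneous model; a smoother rate function would need more data per time slot.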
