Blogs

What electronic bus boards don’t tell you, filling in the gaps in TfL bus data with Kognitio SQL

We previously talked about reading JSON files into Kognitio from Amazon S3 in this previous blog post and like with most data, it needed to be cleansed and formatted. In this post we will go through one that needed to be applied to the TfL bus data we collected. This is separated into lookups (information about a route) and real time (arrivals data).

The lookup data was readily usable save for a few date formatting exercises but the real time data needed work because it didn’t come with actual arrival times. What it does have is predicted arrival times for buses and once the bus has arrived at a bus stop then the predictions for that bus will disappear.

For example I grouped the route 1 arrivals for a particular like this by collection time (ordered by collect_ts, route_id, direction and expected_arrival):

Collect_ts Route_ID Direction Vehicle_ID Stop_sequence Expected_arrival
9:59:00 Route 1 Outbound VN63RKL Stop 3 09:59:47
9:59:00 Route 1 Outbound VN63RKL Stop 4 10:08:23
9:59:00 Route 1 Outbound VN63RKL Stop 5 10:13:39

But the next collection at 10:00:00 looks like this:

Collect_ts Route_ID Direction Vehicle_ID Stop_sequence Expected_arrival
10:00:00 Route 1 Outbound VN63RKL Stop 4 10:08:23
10:00:00 Route 1 Outbound VN63RKL Stop 5 10:13:39

As predictions for stop 3 have stopped, we can deduce that that bus must have arrived at that stop at some time between our two collection times; 9:59:00 and 10:00:00. The data being offered from the TfL API is basically the data you see at the electronic boards at bus stops i.e. it only offers data on upcoming buses but for the past, that’s something we needed to derive ourselves.

So how did we do this in Kognitio SQL? We used analytic functions.

Analytic functions are any functions that compute an aggregate based on a set of rows e.g. sum, average, count etc. For the buses we first needed the minimum stop number in each collection so that we can compare them:

Min(stop_sequence) over (partition by collect_ts, route_id, vehicle_id, direction) min_seq

Where the “partition by” sets the grouping for the sets of rows to compute the minimum over. Traditionally, other databases will iteratively process each partition and with a large cardinality of rows this becomes very long running. Kognitio does this by parallelising on the partitioned columns instead and will only take as long as the slowest partition.

If we go back to our earlier example, we now have:

Collect_ts Route_ID Direction Vehicle_ID Stop_sequence Expected_arrival Min_seq
9:59:00 Route 1 Outbound VN63RKL Stop 3 09:59:47 3
9:59:00 Route 1 Outbound VN63RKL Stop 4 10:08:23 3
9:59:00 Route 1 Outbound VN63RKL Stop 5 10:13:39 3
Collect_ts Route_ID Direction Vehicle_ID Stop_sequence Expected_arrival Min_seq
10:00:00 Route 1 Outbound VN63RKL Stop 4 10:08:23 4
10:00:00 Route 1 Outbound VN63RKL Stop 5 10:13:39 4

Where we have the extra column for the minimum sequence. So each collection now has the information we need to compare. To actually compare we’re going to use lead:

Lead(collect_ts) over (partition by route_id, direction order by collect_ts, route_id, direction and expected_arrival) next_collect_ts

The aim of this is to get the collection time from the next collection hence why we didn’t partition by as many columns like with the minimum sequence previously. We apply this to the minimum sequence too:

Lead(min_seq) over (partition by route_id, direction order by collect_ts, route_id, direction and expected_arrival) next_min_seq

Which gives us:

Collect_ts Route_ID Direction Vehicle_ID Stop_sequence Expected_arrival Min_seq Next_collect_ts Next_min_seq
9:59:00 Route 1 Outbound VN63RKL Stop 3 09:59:47 3 09:59:47 4
9:59:00 Route 1 Outbound VN63RKL Stop 4 10:08:23 3 09:59:47 5
9:59:00 Route 1 Outbound VN63RKL Stop 5 10:13:39 3 10:00:00 4
Collect_ts Route_ID Direction Vehicle_ID Stop_sequence Expected_arrival Min_seq Next_collect_ts Next_min_seq
10:00:00 Route 1 Outbound VN63RKL Stop 4 10:08:23 4 10:00:00 5
10:00:00 Route 1 Outbound VN63RKL Stop 5 10:13:39 4 (null) (null)

But only the last row in the first partition has the collection time from the next collection. As lead only applies to the next immediate row. However, we can use another analytic function called last as the last row has the information we need:


Last(next_collect_ts) over (partition by collect_ts, route_id, vehicle_id, direction)
Last(next_min_seq) over (partition by collect_ts, route_id, vehicle_id, direction)

Giving us:

Collect_ts Route_ID Direction Vehicle_ID Stop_sequence Expected_arrival Min_seq Next_collect_ts Next_min_seq Next_part_collect_ts Next__part_min_seq
9:59:00 Route 1 Outbound VN63RKL Stop 3 09:59:47 3 09:59:47 4 10:00:00 4
9:59:00 Route 1 Outbound VN63RKL Stop 4 10:08:23 3 09:59:47 5 10:00:00 4
9:59:00 Route 1 Outbound VN63RKL Stop 5 10:13:39 3 10:00:00 4 10:00:00 4
Collect_ts Route_ID Direction Vehicle_ID Stop_sequence Expected_arrival Min_seq Next_collect_ts Next_min_seq Next_part_collect_ts Next__part_min_seq
10:00:00 Route 1 Outbound VN63RKL Stop 4 10:08:23 4 09:59:47 5 (null) (null)
10:00:00 Route 1 Outbound VN63RKL Stop 5 10:13:39 4 (null) (null) (null) (null)

Now that we have those last two columns from the next collections, we can compare them with a case statement:

Case when stop_sequence < next_min_seq then next_collect_ts else null end arrived_by

So we now have (temporary columns used up until now have been omitted in the SQL):

Collect_ts Route_ID Direction Vehicle_ID Stop_sequence Expected_arrival Arrived_by
9:59:00 Route 1 Outbound VN63RKL Stop 3 09:59:47 10:00:00
9:59:00 Route 1 Outbound VN63RKL Stop 4 10:08:23 (null)
9:59:00 Route 1 Outbound VN63RKL Stop 5 10:13:39 (null)
Collect_ts Route_ID Direction Vehicle_ID Stop_sequence Expected_arrival Arrived_by
10:00:00 Route 1 Outbound VN63RKL Stop 4 10:08:23 (null)
10:00:00 Route 1 Outbound VN63RKL Stop 5 10:13:39 (null)

So the bus must have arrived at stop 3 between 9:59:00 and 10:00:00 where we can calculate the middle of this window for an approximated of the arrival time. We can’t say for certain whether it arrived before or after the expected time but with such small windows we can extrapolate it with confidence.

Now we can filter by that arrival time column to get all the arrival times for this bus on this route and call it a journey. It may also be worth keeping the null value rows to see what the predictions are. For me, I kept them to find the next expected arrivals, but keeping a track of how the predictions change over time would be interesting too.

Leave a Reply

Your email address will not be published nor used for any other purpose. Required fields are marked *