We previously talked about reading JSON files into Kognitio from Amazon S3 in an earlier blog post, and like most data, it needed to be cleansed and formatted. In this post we will go through the cleansing that needed to be applied to the TfL bus data we collected. The data is separated into lookups (information about a route) and real-time arrivals data.
The lookup data was readily usable save for a few date formatting exercises, but the real-time data needed work because it didn’t come with actual arrival times. What it does have is predicted arrival times for buses; once a bus has arrived at a stop, the predictions for that bus at that stop disappear.
For example, I grouped the route 1 arrivals for a particular bus by collection time like this (ordered by collect_ts, route_id, direction and expected_arrival):
Collect_ts | Route_ID | Direction | Vehicle_ID | Stop_sequence | Expected_arrival |
---|---|---|---|---|---|
09:59:00 | Route 1 | Outbound | VN63RKL | Stop 3 | 09:59:47 |
09:59:00 | Route 1 | Outbound | VN63RKL | Stop 4 | 10:08:23 |
09:59:00 | Route 1 | Outbound | VN63RKL | Stop 5 | 10:13:39 |
But the next collection at 10:00:00 looks like this:
Collect_ts | Route_ID | Direction | Vehicle_ID | Stop_sequence | Expected_arrival |
---|---|---|---|---|---|
10:00:00 | Route 1 | Outbound | VN63RKL | Stop 4 | 10:08:23 |
10:00:00 | Route 1 | Outbound | VN63RKL | Stop 5 | 10:13:39 |
As predictions for stop 3 have stopped appearing, we can deduce that the bus must have arrived at that stop at some time between our two collection times: 09:59:00 and 10:00:00. The data offered by the TfL API is essentially the data you see on the electronic boards at bus stops, i.e. it only covers upcoming buses; anything about the past we needed to derive ourselves.
So how did we do this in Kognitio SQL? We used analytic functions.
Analytic functions are functions that compute an aggregate over a set of rows, e.g. sum, average or count. For the buses, we first needed the minimum stop number in each collection so that we can compare collections:
Min(stop_sequence) over (partition by collect_ts, route_id, vehicle_id, direction) min_seq
Here “partition by” defines the sets of rows over which the minimum is computed. Traditionally, other databases process each partition iteratively, and with a large number of rows this becomes very long-running. Kognitio instead parallelises across the partitioning columns, so the computation only takes as long as the slowest partition.
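As a concrete sketch, that window function would sit in a query like the one below; `bus_realtime` is an assumed table name (not from the original post) and the columns follow the examples above:

```sql
-- Hypothetical sketch: compute the minimum stop in each collection.
-- "bus_realtime" is an assumed table name.
select collect_ts, route_id, direction, vehicle_id,
       stop_sequence, expected_arrival,
       min(stop_sequence) over (partition by collect_ts, route_id,
                                vehicle_id, direction) min_seq
from bus_realtime;
```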
If we go back to our earlier example, we now have:
Collect_ts | Route_ID | Direction | Vehicle_ID | Stop_sequence | Expected_arrival | Min_seq |
---|---|---|---|---|---|---|
09:59:00 | Route 1 | Outbound | VN63RKL | Stop 3 | 09:59:47 | 3 |
09:59:00 | Route 1 | Outbound | VN63RKL | Stop 4 | 10:08:23 | 3 |
09:59:00 | Route 1 | Outbound | VN63RKL | Stop 5 | 10:13:39 | 3 |
Collect_ts | Route_ID | Direction | Vehicle_ID | Stop_sequence | Expected_arrival | Min_seq |
---|---|---|---|---|---|---|
10:00:00 | Route 1 | Outbound | VN63RKL | Stop 4 | 10:08:23 | 4 |
10:00:00 | Route 1 | Outbound | VN63RKL | Stop 5 | 10:13:39 | 4 |
This gives us an extra column for the minimum sequence, so each collection now has the information we need for the comparison. To actually compare, we’re going to use lead:
Lead(collect_ts) over (partition by route_id, direction order by collect_ts, route_id, direction, expected_arrival) next_collect_ts
The aim of this is to get the collection time from the next collection, which is why we didn’t partition by as many columns as we did for the minimum sequence. We apply the same to the minimum sequence:
Lead(min_seq) over (partition by route_id, direction order by collect_ts, route_id, direction, expected_arrival) next_min_seq
Which gives us:
Collect_ts | Route_ID | Direction | Vehicle_ID | Stop_sequence | Expected_arrival | Min_seq | Next_collect_ts | Next_min_seq |
---|---|---|---|---|---|---|---|---|
09:59:00 | Route 1 | Outbound | VN63RKL | Stop 3 | 09:59:47 | 3 | 09:59:00 | 3 |
09:59:00 | Route 1 | Outbound | VN63RKL | Stop 4 | 10:08:23 | 3 | 09:59:00 | 3 |
09:59:00 | Route 1 | Outbound | VN63RKL | Stop 5 | 10:13:39 | 3 | 10:00:00 | 4 |
Collect_ts | Route_ID | Direction | Vehicle_ID | Stop_sequence | Expected_arrival | Min_seq | Next_collect_ts | Next_min_seq |
---|---|---|---|---|---|---|---|---|
10:00:00 | Route 1 | Outbound | VN63RKL | Stop 4 | 10:08:23 | 4 | 10:00:00 | 4 |
10:00:00 | Route 1 | Outbound | VN63RKL | Stop 5 | 10:13:39 | 4 | (null) | (null) |
But only the last row in the first collection has the collection time from the next collection, because lead only looks at the immediately following row. However, we can use another analytic function called last, since the last row has the information we need:
Last(next_collect_ts) over (partition by collect_ts, route_id, vehicle_id, direction) next_part_collect_ts
Last(next_min_seq) over (partition by collect_ts, route_id, vehicle_id, direction) next_part_min_seq
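Stacking the steps so far, the whole derivation can be sketched as one nested query; `bus_realtime` is an assumed table name and the exact syntax may need adapting to your Kognitio version:

```sql
-- Hypothetical sketch of the derivation so far: min_seq per collection,
-- then lead() across collections, then last() to spread the last row's
-- lead values over the whole collection.
select l.*,
       last(next_collect_ts) over (partition by collect_ts, route_id,
                                   vehicle_id, direction) next_part_collect_ts,
       last(next_min_seq) over (partition by collect_ts, route_id,
                                vehicle_id, direction) next_part_min_seq
from (select m.*,
             lead(collect_ts) over (partition by route_id, direction
                 order by collect_ts, route_id, direction, expected_arrival)
                 next_collect_ts,
             lead(min_seq) over (partition by route_id, direction
                 order by collect_ts, route_id, direction, expected_arrival)
                 next_min_seq
      from (select r.*,
                   min(stop_sequence) over (partition by collect_ts, route_id,
                                            vehicle_id, direction) min_seq
            from bus_realtime r) m) l;
```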
Giving us:
Collect_ts | Route_ID | Direction | Vehicle_ID | Stop_sequence | Expected_arrival | Min_seq | Next_collect_ts | Next_min_seq | Next_part_collect_ts | Next_part_min_seq |
---|---|---|---|---|---|---|---|---|---|---|
09:59:00 | Route 1 | Outbound | VN63RKL | Stop 3 | 09:59:47 | 3 | 09:59:00 | 3 | 10:00:00 | 4 |
09:59:00 | Route 1 | Outbound | VN63RKL | Stop 4 | 10:08:23 | 3 | 09:59:00 | 3 | 10:00:00 | 4 |
09:59:00 | Route 1 | Outbound | VN63RKL | Stop 5 | 10:13:39 | 3 | 10:00:00 | 4 | 10:00:00 | 4 |
Collect_ts | Route_ID | Direction | Vehicle_ID | Stop_sequence | Expected_arrival | Min_seq | Next_collect_ts | Next_min_seq | Next_part_collect_ts | Next_part_min_seq |
---|---|---|---|---|---|---|---|---|---|---|
10:00:00 | Route 1 | Outbound | VN63RKL | Stop 4 | 10:08:23 | 4 | 10:00:00 | 4 | (null) | (null) |
10:00:00 | Route 1 | Outbound | VN63RKL | Stop 5 | 10:13:39 | 4 | (null) | (null) | (null) | (null) |
Now that we have those last two columns from the next collection, we can compare them with a case expression:
Case when stop_sequence < next_part_min_seq then next_part_collect_ts else null end arrived_by
So we now have (the intermediate columns used up until now have been omitted from the SQL):
Collect_ts | Route_ID | Direction | Vehicle_ID | Stop_sequence | Expected_arrival | Arrived_by |
---|---|---|---|---|---|---|
09:59:00 | Route 1 | Outbound | VN63RKL | Stop 3 | 09:59:47 | 10:00:00 |
09:59:00 | Route 1 | Outbound | VN63RKL | Stop 4 | 10:08:23 | (null) |
09:59:00 | Route 1 | Outbound | VN63RKL | Stop 5 | 10:13:39 | (null) |
Collect_ts | Route_ID | Direction | Vehicle_ID | Stop_sequence | Expected_arrival | Arrived_by |
---|---|---|---|---|---|---|
10:00:00 | Route 1 | Outbound | VN63RKL | Stop 4 | 10:08:23 | (null) |
10:00:00 | Route 1 | Outbound | VN63RKL | Stop 5 | 10:13:39 | (null) |
So the bus must have arrived at stop 3 between 09:59:00 and 10:00:00, and we can take the middle of this window as an approximation of the arrival time. We can’t say for certain whether it arrived before or after the expected time, but with such small windows we can estimate it with confidence.
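A midpoint estimate could be added as one more derived column. Interval arithmetic syntax varies between SQL dialects, so treat the following as a sketch rather than Kognitio-specific syntax:

```sql
-- Hypothetical sketch: take the midpoint of the window between collect_ts
-- and next_part_collect_ts as the approximate arrival time; adapt the
-- timestamp arithmetic to your dialect.
case when stop_sequence < next_part_min_seq
     then collect_ts + (next_part_collect_ts - collect_ts) / 2
     else null
end approx_arrival_ts
```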
Now we can filter on that arrival time column to get all the arrival times for this bus on this route and call it a journey. It may also be worth keeping the rows with null values to see what the predictions are; I kept them to find the next expected arrivals, but tracking how the predictions change over time would be interesting too.