Data discovery in Kognitio with TFL and Tableau

(Powered by TFL Open Data. Contains OS data © Crown copyright and database rights 2016)

In previous blog posts I’ve gone over ways to push data to S3 and connect to it via Kognitio external tables (link). I’ll be using that method on some bus data provided by Transport for London (TFL) and building a simple (for now) Tableau dashboard backed by Kognitio.

For downloadable content in this post I’ll point to things to get your collection process started; I’ll post the relevant SQL scripts and Tableau dashboards once I’ve gone into them in more detail at a later date.

So what’s in the API?

Well, TFL provides data for many forms of transportation (bus, tube, DLR, cycles etc.), but for this post I’ll concentrate on bus data. To get at this data, you make an API call, through something like wget, to a URL describing the information you want. For example “” will return a JSON file with route info for line 1.
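To make that concrete, here’s a small sketch of building such a request URL. The endpoint path and the `app_id`/`app_key` query parameters are assumptions based on TFL’s unified API conventions; check the API documentation for your own account’s details.

```python
from urllib.parse import urlencode

# Base URL of TFL's unified API (an assumption; confirm against the docs).
BASE = "https://api.tfl.gov.uk"

def route_url(line_id, app_id="YOUR_APP_ID", app_key="YOUR_APP_KEY"):
    """Return the URL that fetches route info (as JSON) for one line."""
    query = urlencode({"app_id": app_id, "app_key": app_key})
    return f"{BASE}/Line/{line_id}/Route?{query}"

url = route_url("1")
# You could then fetch it with wget, curl, or urllib, e.g.:
#   wget -O line_1_route.json "<url>"
```

The same pattern works for the other endpoints (arrivals, stop points and so on); only the path segment changes.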

You can find more information about the variations of the URLs here. Keep in mind the usage terms, particularly the limit on API calls per minute; the complete terms and conditions are here.

A bit about bus stops: in the UK there’s a system called the “National Public Transport Access Node” (NaPTAN) for uniquely identifying points of public travel. This isn’t limited to buses, but what you’ll find is that a bus stop is usually part of a pair (on a street, where the other stop is on the opposite side of the road) or a cluster (one of many stops at a train/underground station). This can cause a bit of confusion when using the stop name; for example, Marylebone Station (link) has a cluster of four different NaPTAN IDs, so it’s advised to use the unique IDs in that case. You can find out more about NaPTAN here.
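The ambiguity is easy to demonstrate: group a stop list by name and the clusters fall out. The IDs below are made up for the example; real IDs come from the NaPTAN dataset.

```python
from collections import defaultdict

# Hypothetical (NaPTAN ID, stop name) pairs -- the IDs are invented
# purely to illustrate why stop names are not unique.
stops = [
    ("490000001A", "Marylebone Station"),
    ("490000001B", "Marylebone Station"),
    ("490000001C", "Marylebone Station"),
    ("490000001D", "Marylebone Station"),
    ("490000002N", "Baker Street"),
    ("490000002S", "Baker Street"),
]

by_name = defaultdict(list)
for naptan_id, name in stops:
    by_name[name].append(naptan_id)

# "Marylebone Station" maps to a cluster of four IDs, so filtering on
# the stop name alone would mix four distinct physical stops together.
```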

You can also get a complete list of NaPTANs from here.

Checklist (similar to S3 connector one but now with the TFL API included):

• TFL API account and credentials
• S3 bucket and credentials
• Linux instance to fetch and push from (I used an AWS t2.medium)
• Kognitio instance
• Tableau installed with either a trial or license

My collection scripts are here.

I’ve packaged some bash scripts to do the fetching; you may need to change the directory and/or bucket names to suit your environment. I’ve also included two CSV files: one with a list of line IDs and another with a list of bus stops. Remember to add your TFL API and S3 credentials for the fetching and pushing as well.
Set up a cron job (or an equivalent task scheduler) to execute this on a schedule (remembering the call limit imposed by TFL), go enjoy life a bit and come back to your haul.
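The fetch-and-push step can be sketched in Python as follows. This mirrors what the bash scripts do rather than reproducing them; the file names, bucket name and URL shape are placeholders to adapt to your own layout and credentials.

```python
import csv
import io

def build_commands(csv_text, bucket="my-tfl-bucket",
                   app_id="YOUR_APP_ID", app_key="YOUR_APP_KEY"):
    """For each line ID in the CSV, build one fetch command and one
    S3 push command. Returns the commands as strings so the logic is
    easy to inspect; a real script would execute them instead."""
    commands = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row or not row[0].strip():
            continue
        line_id = row[0].strip()
        out = f"line_{line_id}_route.json"
        url = (f"https://api.tfl.gov.uk/Line/{line_id}/Route"
               f"?app_id={app_id}&app_key={app_key}")
        commands.append(f'wget -q -O {out} "{url}"')
        commands.append(f"aws s3 cp {out} s3://{bucket}/routes/{out}")
    return commands

cmds = build_commands("1\n2\n3\n")
```

One fetch plus one push per line ID; the same loop structure applies to the bus-stop CSV with a different endpoint.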
Now that we have the data in S3, we can connect to it using the S3 connector for JSON files. I covered this in an earlier blog post here.

I’d advise collecting this for at least a week; a month would be ideal.
I’ll go over the design choices and details of the SQL setup in a later blog post, as it has enough intricacies to warrant its own dedicated post.

I made an external table for each type of file, and from there we can start transforming the data. Here’s what I ended up creating from the data as views:

• End points – lookup for where a line departs from and ends up.
• Route sequence – lookup for the stops on a line.
• Actual sequence – lookup for where the bus actually moves (the route sequence above would only give you a straight line between stops).
• Stop list – lookup with information on a stop, such as its name and latitude/longitude values.
• Stop distances – lookup for the distance between stops.
• Variation taken – lists which route variation a bus took.
• Arrivals – bus arrivals, with timestamps for TFL’s predictions and an arrival window.
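The stop distances lookup boils down to a great-circle calculation from each stop’s latitude and longitude. In Kognitio this lives in a SQL view, but the underlying formula (haversine, sketched here in Python) is the same; the Earth radius is the usual 6371 km approximation.

```python
from math import radians, sin, cos, asin, sqrt

def stop_distance_km(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in kilometres between two points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

# Two stops 0.01 degrees of latitude apart are roughly 1.11 km apart.
d = stop_distance_km(51.50, -0.10, 51.51, -0.10)
```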

By using an external table, we can just drop new JSON files into the S3 directories and they’ll be picked up automatically.
Now let’s connect it to Tableau using a sample; you can use more data if you wish, but my machine isn’t/wasn’t impressive spec-wise. My data source diagram looks like this:

Data source relationships

We have the real-time arrivals data on the left, and to that we join all the lookups that we may want to use. You may also have noticed we’re going to use a live connection for this, as the views have been placed into memory so we can leverage that faster access. Note that the relationships you define here are quite important, as we’re attaching lookup data (e.g. information on a stop) to real-time data where it can appear more than once. So it’s encouraged to join on several distinct fields together, such as NaPTAN ID, direction etc.
From here you can create visuals in new sheets and arrange them into a dashboard. Going back to the TFL data, my dashboard now looks like this:

TFL dashboard

Let’s go through what each of the sheets shows.

• Overview stats – some simple information about the network; it changes as you make choices in the options below, such as direction, weekday, hour of travel, bus route etc.
• Route network – a map of the bus route network. A fairly simple visualisation using the latitudes and longitudes of the stops, connecting them as a line via the NaPTAN sequence and colouring them by line. This acts as a filter driving the subsequent visuals; you can see I’ve chosen route 32 as an example.
• Stop list – a list of stops on a route; selecting one highlights it on the route area sheet along with a one-mile radius around it.
• No. vehicles – shows how the number of buses for the chosen line changes throughout the day.
• Route performance – shows performance (the difference between TFL’s prediction and the end of the arrival window) on a weekday view.
• Route area – a zoomed-in view of the route and the area around it, based on the selection in route network. The larger circles indicate the selected route and the smaller ones show stops on neighbouring routes within n miles, where n is defined by a parameter.
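That route performance measure reduces to a timestamp difference. Here’s a sketch of the calculation; the field names and the ISO timestamp format are placeholders for whatever your arrivals view actually exposes.

```python
from datetime import datetime

def lateness_seconds(predicted_iso, window_end_iso):
    """Seconds by which the predicted arrival falls after the end of
    the arrival window. Positive means the prediction overran the
    window; negative means it landed inside or before it."""
    predicted = datetime.fromisoformat(predicted_iso)
    window_end = datetime.fromisoformat(window_end_iso)
    return (predicted - window_end).total_seconds()

late = lateness_seconds("2016-06-01T08:31:30", "2016-06-01T08:30:00")
```

Aggregating this per route and weekday gives the performance view; small values (say under 30 seconds) are arguably noise, as the post notes later.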

Although they can look nice, dashboards do require some setup to allow interactions, which is done in the “actions” part of a dashboard. What’s an action? In our dashboard above, if you wanted a highlighted bus stop to also be reflected in the map, you’d need to define the action under Dashboard > Actions. Think of it as “if I do this here, then also apply it over there”, where one source sheet can apply to many others. Be aware that if your dashboard is very complex with many actions, you’ll need to take care with them; something I like to do is number my actions so I know what order they fire in.

You can find out more about this here; the ones I’ve used are:

Dashboard actions

A nice simple dashboard, not the most impressive aesthetically (I’m still picking that up), but it’s a start for conveying information about TFL’s bus network. In this case we can pick a route and identify where and when TFL’s predictions vary from the actual arrival by a significant amount (this is subjective though, as people probably don’t care about 30-second variations). It’s not perfect, for example choosing the route to view is not an informed choice, and you can probably point out some flaws, but we’ll build on this over the next few blogs.

Next time I’ll go into more detail about the raw data, SQL and transformations I did.

