This video is an example of how you can run interactive BI queries using a tool like Tableau with data held in Hadoop.
It’s a common misconception that it’s not possible to do Bi on Hadoop.
This stems from a belief that Hadoop cannot support interactive SQL querying of data generated/submitted by BI tools.
Using some Transport for London (TfL) bus data extracted between August and October 2017, we’re going to show that interactive querying is possible on Hadoop.
We’re using TfL bus data as our big data set, but you can use your own data seta and get the same performance.
During this demo, every time we drill into something in the Tableau dashboard and get an answer, we’re interactively querying data in Hadoop, using Tableau and Kognitio.
Let’s look at the dashboard that’s analyzing the data outlined in this blog: kognitio.com/blog/filling-in-the-gaps-tfl-bus-data/. (Also relevant is this blog referring to Poisson process)
For this demo, we are using Hortonworks HDP on Amazon EC2 instances.
Kognitio is running as a YARN application directly on the cluster and is providing the high performance SQL interface.
Tableau is connected directly to Kognitio.
The system is an 8 node cluster with 512 cores and 1.5TB RAM with 5.2 billion rows of transport data.
Here we can see the data in the Tableau dashboard.
At the top we can see some statistics about the bus network.
On the left we can see the activity by zones, also shown on the map as concentric circles.
Each radial bar represents a bus route. The longer the bar, busier that route was.
When we hover over these (known as action filters in Tableau), we’re interactively querying the Kognitio SQL layer which quickly returns both the number of records available and the route points so the route and the zones it travels through can be plotted by Tableau instantaneously.
The map is obviously incredibly busy, making it difficult to highlight a specific route.
So let’s filter the activity visual using the route_id parameter.
We’ve selected route 9 which passes through Olympia London.
When we click on route 9, Tableau will transition to the next dashboard and query the Kognitio database based on our selection.
The query took around 10 seconds to display the data of route 9 from all of the 5.2 billion rows of data we’ve accumulated.
This dashboard shows a general report about route 9 and its performance.
It shows some basic stats, the stop sequence, a map of the route, an hour-of-day by weekday report on the relays and the next buses to arrive on this route.
At the top, again we have some basic stats. We can also specify the direction and a cut-off date to look at.
We’ve selected 28th August 2017 at 18:00 so the rest of our analysis and reports will be up until this time.
The hexagons show stops on route 9 on the outbound route. Colours indicate delay where early is blue, grey is on time and red is late.
Hovering over them reveals more information like the Naptan identifier, zone and average delay in minutes.
In the 10 seconds we have also generated the table to the right showing us an hour-of-day by weekday averages of delays on route 9.
As it’s mostly grey it tells us that generally buses are on time.
The next buses tables shows us the expected buses on this route.
We’re going to look at a specific stop now: Olympia.
We’ve clicked to target Olympia and now Tableau is querying the Kognitio SQL layer to extract the Olympia bus stop from the other 17,000 stops and also selecting the relevant date window.
At the same time the dashboard has updated the hour-of-day by weekday and next bus reports for out stop of choice.
The speed of Kognitio allows this to happen almost instantaneously.
We can see that generally the punctuality is pretty good for Olympia – and often a little early with two instances of lateness.
Looking at the next buses to arrive at Olympia, we can generally expect to wait 7 minutes between buses.
But there’s another consideration to be made other than delays.
Until a bus arrives, predictions are returned by the TfL API and these change.
When told that buses are on time, which of these prior predictions are being used?
We could compared arrivals with the earliest prediction but a less deceivable angle is to look at waiting times between buses.
This is modelled using a Poisson process. So let’s look at that for bus stop 20, Olympia.
We’re querying the Kognitio SQL Layer for data about a specific stop and time period then generating the model across 5 different time periods throughout the day.
Again this returned very quickly while Tableau is transitioning from the previous dashboard.
The top half of this dashboard is the expectation of waiting times across five time periods at Olympia.
And the bottom two charts show what’s actually happening in the data.
For a particular time period of the day i.e. Night, morning rush, day etc., we can see the number of buses expected to arrive in an hour and expected waiting time between buses.
For the day time period, 7.9 buses are expected per hour at just over 7 minute intervals.
For the afternoon rush period, there would be 9.2 arrivals per hour which is approximately 6.2 minutes between buses.
However averages can be misleading.
The cumulative probability curves on the bottom left answers the question “How certain is a bus to arrive after waiting some amount of time?”
With the “minutes waited” along the bottom, as I wait longer I should be more certain that a bus will arrive as it gets closer to the average waiting times calculated earlier.
This isn’t the case though. In the afternoon rush period we’d expect to wait 6.2 minutes.
After I’ve waited an above-average 7 minutes the data is only 66% certain a bus will arrive which is quite low.
A steeper curve would indicate more confidence which is not what we’re seeing here.
So how are the waiting times between buses distributed? We can look at the bottom right graph for this.
We’re now counting the instances of wait times between buses e.g. number of times there was a 0-minute gap between two buses.
For the day period (yellow line), we see that there were 35 instances of a 0 –minute waiting gap between buses meaning multiple buses arrived at once 35 times.
This is not centered around the average like we expected and in fact the certainty in the left graph is only as high as it is because of these low waiting time gaps.
Now let’s look at these same graphs but for ALL the stops on route 9.
We are once again interactively querying the Kognitio layer to run the analysis across all of them.
We’re not just running our analysis for just one stop now but for every stop on the route for this we need ultra-high performance from the SQL layer.
Previously we thought that along a route, buses would come at pretty constant intervals but our analysis of the Olympia stop showed this isn’t the case.
The top visualisation shows us that actually the buses are not equally spread across the hour.
There is sometimes quite a large waiting gap e.g. between 16 and 28 mins in the hour.
Now the lines represent a stop on the route rather than a time period.
The different colours are the zones where blue is zone 1 and purple is zone 2.
It’s interesting that there’s quite a difference between the certainties of arrival.
For example up to 20% in the same zone and the same route that a bus will arrive by that time.
For the waiting time distribution we can see that Hammersmith bus station is an unusual stop where around 150 times buses had no waiting time between them.
So at Hammersmith bus station, if your bus doesn’t come in the first 2 minutes you’re likely to be waiting a long time.
For all the other stops, they peak at around 6 minutes of waiting time, so there is some pattern to their arrivals but the 0-minute waiting gap still persists.
This concludes this brief demo which used Tableau to interactively visualize TfL bus data stored in Hadoop using Kognitio’s ultra-high speed SQL layer.
This demonstrates that with Kognitio, interactive BI on Hadoop is possible using your preferred BI tools.