As part of our ongoing TPC-DS benchmarking of SQL on Hadoop platformsRead More
Lift with Kognitio and not your BI tool at the Big Data gym
BI tools are great for inspecting your data from a viewpoint that databases can’t provide. Some are able to perform complex calculations and even modelling such as clustering and regression.
With Big Data over a live connection this can be a challenge considering the amount of data that the BI tool must work through. Typically, as these BI tools are run on low powered machines, these aren’t likely to completed within a reasonable time frame. I’ll outline a problem I’ve had and its solution in Kognitio and Tableau.
The following scenario isn’t exclusive to Tableau (it’s just the one that I’m most familiar with) or Kognitio (applies to any type of live connection) and applies to any BI tool which uses live connections and performs calculations on top.
This is also quite a long blog post so I’ve added some TL; DR points to help summarise each section.
TL; DR – Calculations on large amounts of rows in Tableau are slow and not fit for purpose. Needed a new solution at the database level to do the heavy lifting.
So a while back, whilst working through my collected TFL data, I wanted a distance based radius around a single/multiple bus stops as a comparison. The motivation was to see if a particular stop was severely delayed, would the surround stops also be facing the same problem. In Tableau, there is a radial tool (link) but this is more for filtering the result set but I wanted to calculate aggregates at a lower level (e.g. arrival times at specific hours). I ended up finding this solution:
(Larger circle denotes the chosen bus stop and the center of the circle)
And it works great, albeit a bit long but at the time the result was worth it. Then I wanted to increase the number of bus stops (I had been working with a small sample size) and that’s when things went sour; it was too slow (left it for 20 minutes and didn’t finish, it actually ran out of memory with 8GB). This is because Tableau would pull in each stop and calculate them on the fly, with less data is a good idea but not for the large amount I wanted to include.
The dashboard I wanted to create was an interactive one where the user can discover by clicking on various objects to filter and inspect. Unfortunately, on the map portion of my dashboard, this meant that it had to calculate the distances each time between every combination of bus stops (about 32000 including postcodes served by TFL outside Greater London like TW and KT) every time something changed/got filtered. This resulted in an unusable and sluggish dashboard.
There’s two things I could see wrong with this:
• Tableau tries to pull all the bus stops in to calculate each time
• Every bus stop is compared to every other bus stop
So a solution was to get a beefier machine to run Tableau on and I could’ve done so but I didn’t want to have to list system requirements for a dashboard. But I thought this was a chance to make the process smarter. Passing such tremendous amounts of calculations to Tableau isn’t its purpose and given their recent outsourcing strategies to Python and R and a conversation I had with them at a recent event, they probably don’t expect it to do the heavy lifting. What should do the heavy lifting is where the data originates; in my case the Kognitio database.
TL; DR – Create a lookup table in memory of distances between bus stops. Filtering in Tableau still requires calculations to be made.
The solution is to create a lookup of the distances between the bus stops but not every bus stop. Certainly, we wouldn’t want to compare bus stops 10 miles (approx. 16 Km) apart unless they’re on the same bus route or even 5 miles (approx. 8 Km) given the size of London (approx. 28 miles/45 Km).
Why a lookup? Well the bus stop distances remain static and don’t change so it makes sense to have it calculated and readily available whenever needed.
I don’t want to change how my Tableau data source works though, couldn’t we just filter it out in Tableau instead? The problem with this is that it still need to do the comparison before it can filter them out i.e. pull the rows, compare them, if no good then throw it away.
Place this in memory and join it in the data source section of Tableau which reduces the amount of strain put on Tableau for calculations. This also enables area around a route which I couldn’t find a good way to do in the method I found which was based around a single point.
TL; DR – we have geometry functions based on PostGIS as of Kognitio 8.2.
If you’re already a user of Kognitio you may have used the earth_distance function to calculate the distance between two sets of lat/long coordinates. As of the 8.2 release, we’ve expanded on this now with a set of geometry plugins based on PostGIS (link) and implemented using the OGC standard (link).
A list of functions and descriptions in PostGIS linked here. Note that not all of these have been implemented yet but will be soon. You can check which are available in Kognitio console after you have activated the fullgeometry plugin in Kognitio, double clicking it from the systems menu and opening the functions tab:
TL; DR – we reduced the number of stops compared in one postcode region from 31,520 to just 718 in the neighbouring areas. You’ll probably want to read through this part.
So that’s quite an information dump but what functions can we use for this? The ones I’ve used for bus stop distances are:
• St_Point – defines a set of lat/long coordinates as geometrical point
• St_DistanceSpheroid – distance between two sets of lat/long across a designated sphere
• ST_LongLatToEN – converts long/lat to actual distances
So the first point is straightforward but what’s this about spheres? Well the earth isn’t a perfect sphere (I was shocked too) and has different definitions at different locations. The one used in the UK:
So when calculating the distance we need to say where we are and what sphere to use to get an accurate distance measurement.
The important note on the first two functions is that they are based on geometries and geographies. Therefore we need the third function to convert to actual distances.
But this still compares every bus stop with every other one, it would still be faster than doing it at the BI level but we can reduce the amount of comparisons significantly by introducing areas. Namely postcode regions, and assigning the bus stops to the correct area.
More levels of data? Is this useful/worth it?
Well, let’s look at the following postcode map from Wikipedia:
A bus stop in area N1 probably doesn’t need to be compared to one in E4 at all or any of the ones too far away. “Far away” would depend on the location your data is based on but for London and this example we’ll stick with just the neighbouring areas.
The functions useful for this:
• St_polygon – defines a polygon based on linestring (a sequence of lat/longs in a varchar column)
• St_contains – checks whether a point resides inside a polygon
• St_touches – checks whether two polygons touch
The first function is to create polygons based on our postcode regions, the second will put the bus stop lat/long points into a postcode region and the last one checks for neighbouring areas. The postcode linestrings I got from here in Geojson format:
So our N1 example postcode should only touch E8, E2, EC2, EC1, WC1, NW1, N7 and N5. Thereby significantly reducing the number of comparisons and excluding the ones that don’t make sense. We go from comparing this many stops to N1:
i.e. all of them, to this many in the surround regions:
This doesn’t only need to be used in Tableau, you can use it with your other analytics. Want to group your discovery metrics by distance? You can! Region to region comparisons by time of day? You can!
Performance is relative given your hardware and software environment so here’s an export of the polygons which I’ve formatted and Naptan IDs which you can load, try and see for yourself. If you have the data in S3 already then run the commented queries to create the views in Kognitio to retrieve the data. You should see a noticeable improvement like I did in the usability of Tableau dashboards using the lookup table.
My personal take from this: when the result of the calculation is static (in my case the distance between bus stops won’t change) then it’s better have an in memory table at the database level ready to be called on. The BI tool you employ can only do so much lifting before it becomes a bottleneck.