How damaging can an out-of-date phone number or a duplicated, misspelled name really be? AsRead More
BI tools like Tableau struggling with your big data? Boost performance with these tips
A few months ago I wrote about using your database to do the heavy lifting instead boost performance in BI tools like Tableau here and the premise is the same here, just a bit more general.
So I’ve compiled some tips that can get the most out of Tableau driven by a system using Kognitio. These don’t apply exclusively to Kognitio and Tableau, the concepts should be applicable to most in-memory databases and BI tools that connect to them.
Don’t let Tableau apply the filtering to large data sets
If you’ve done any sort of data drilling in Tableau then you’ve probably used filters before to select a subset of your data. For example we have our retail sale data in Tableau:
If we apply more than one sort of filter (even calculations like Top 10), these are calculated independently.
Sure if you pin your data into memory in Kognitio, it will be returned very quickly. But for larger data sets, Tableau still has to iteratively apply other filters like dimension and measure ones to the relevant values. Essentially bottle necking the in-memory read speeds from Kognitio.
By pinning it into the context with big data you’re lessening the workload on the BI tool. It does this by moving it up the priority list in terms of filter application where the order from highest to lowest priority is data source filters > context filters > dimension filters > measure filters > table calculation filters.
Generally filters which can significantly reduce the amount of data should be added to the context or if you want to apply one before the rest for some type (e.g. two dimension filters but you want to ensure one gets applied before the other).
To add a filter to the context, right click it in the filters pane and select “Add to context”:
Custom SQL with parameters
Don’t build a complex web of joins, be specific!
As you build out your BI dashboards, you’ve no doubt had to add more than one table/view to the mix. For example:
This sends an SQL join to your database. But this can become very complicated, for example I’ve had a situation where I wanted an aggregate for each lookup value so I had to add another join:
The sale data is a record of every sale and the product and store are lookups. The other sale1 is a duplicate of the first one. Why? The idea is that if I drilled down into a particular product and product category but wanted to compare with all the others in the same category, I’m going to need to aggregate it and join it back in.
Like with the previous tip, the key is to get the exact data you want to Tableau very quickly. With a data source like this, it’s very difficult as you can’t be specific about the SQL unless you use custom SQL instead:
But why don’t I just create an underlying view of this in Kognitio and pin it into memory? You certainly could and it would be very quick but you may not need this to be persistent in memory all the time. The access speed wouldn’t be significantly lower if the tables/views involved are in-memory and is how superviews are built. Additionally this lets us apply parameters at a much lower level:
This allows us to be specific about our data like before and by working with a smaller subset, we don’t need to worry about the time to process like with the web of joins if using the non-custom SQL data source. Not that it would take much time to process with the parallel SQL architecture, the main objective being to use the available memory efficiently and not to put too much of a load on your BI tool.
Only use the bits you need
You’ve probably performed some arduous but rigorous data cleansing and transforming to get your fact tables/views to look exactly as you want them. But you may not need to reference these all the time.
Briefly speaking, by assuming referential integrity Tableau doesn’t include any tables/views that aren’t referenced improving performance on both it and the database’s side. Refer to the explnantion on Tableau’s website here.
And a Wikipedia link for further reading here.
To assume referential integrity, right click on the data source and select it:
Note that this only applies to data sources where you’ve dragged the tables/views in and not custom SQL queries as you’re in more control of those.
Distributing your data in RAM
Organise your data so that everything knows where to find it
Whilst you can put your data into memory, there are different ways to do so. One that is beneficial to this is hashing. Rather than randomly distributing the data (default behaviour), it can be organised by a specified column as the unique ID. The result of this is that everything with that ID will end up on the same RAM store increasing read performance.
By placing them on the same RAM store, rows with the same unique ID would end up in the same location. For our retail data in Tableau, you may choose to drill down on a particular product ID which you can hash on and use with custom SQL detailed earlier for much faster access.
To hash a table or view when pinning it into memory, run the follow:
Create table image hashed on ();
Create view image hashed on ();
It’s also a good idea to update the statistics:
update statistics for full;
update statistics for ram;
The caveat with this is that you’ll need to be careful with the skew of your data. One ID may have much more data than another and the RAM store for that will fill up much quicker. To examine the skew, run the following query to look at the amount of rows per hash value:
Select hash_mpid(hash_value()) mpid, count(*)
group by 1
order by 2;
And the following to see how full each RAM store is:
select (ram_size *1.0)/1024/1024/1024 ram_size, (data_stored *1.0)/1024/1024/1024 data_stored, (ram_free *1.0)/1024/1024/1024 ram_free
where type = 'rs'
order by 2 desc;
For more information about images in Kognitio look here.