
Kognitio on Kubernetes on Azure

Following on from my blog about running Kognitio on a GCP Kubernetes cluster, I thought I'd try running Kognitio on Kubernetes on Microsoft Azure – the next largest cloud platform after AWS (which Andy covered). This wasn't quite as straightforward as AWS and GCP, but a little research soon identified a good solution for this evaluation implementation.

My first manual hack went OK and I got a working Kognitio cluster up fairly easily, but I had to mess around with the permissions for the Kognitio disks. The Kognitio server also fully formatted its disk files rather than making its usual sparse files, so it took a fairly long time to initialise the database. Digging around, it seems that the easy option for Kubernetes ReadWriteMany access uses Azure Files (which is a restricted SMB implementation) and that Microsoft recommends creating an NFS server if you want POSIX semantics and sparse files.

Azure Files is a reasonable choice for a production implementation, but the requirement here is to create a system quickly, so the install script (available in our Git repo) builds an NFS server instead. It's fairly basic but provides reasonable performance for Kognitio internal tables and metadata.
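
For reference, the ReadWriteMany storage the NFS server provides can be exposed to Kubernetes with a PersistentVolume along the lines of the minimal sketch below. The volume name, server IP and export path here are placeholders; the install script sets all of this up for you.

kubectl apply -f - <<'EOF'
# Minimal sketch of an NFS-backed ReadWriteMany volume (name, IP and path are assumptions)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: kognitio-nfs-pv
spec:
  capacity:
    storage: 800Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: 10.0.0.4        # internal IP of the NFS server VM
    path: /export/kognitio  # directory exported by the NFS server
EOF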

Generally, in a cloud environment, Kognitio will use data stored in the cloud platform's storage technology (we use an Azure blob store container in this example). In this case, internal storage is only used for system and temporary tables. However, I wanted to build a system with a reasonable amount of internal disk so that use cases that require internal tables can also be evaluated.

Deployment

I used the Azure cloud console to run the deployment script – it works well and is, if anything, a little better than the GCP equivalent. It has a reasonable built-in editor but does keep timing out, closing the session and deleting the output – 20 minutes (and I think it was sometimes shorter) is just not long enough. The script should run in any Linux environment with the Azure CLI (az) installed and internet access, so you don't actually need to use the cloud console.
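
If you'd rather avoid the cloud shell timeouts, something like the following gets a plain Linux machine ready to run the script. The install one-liner is Microsoft's documented route for Debian/Ubuntu; adjust for your distribution.

# Install the Azure CLI (Debian/Ubuntu) and sign in to your subscription
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
az login
# Confirm the CLI can see the subscription you want to deploy into
az account show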

I wanted the script to work without having to ask for a quota increase, so I had to reduce the size of the Kognitio cluster compared to the one created by the GCP script. Azure has a default limit of 10 cores on an account, so on Azure the Kubernetes cluster is 2 x Standard_E4s_v3 nodes instead of the 3 x Standard_E8s_v3 that would be equivalent to the GCP script. At this size it's quite a small system, but still big enough to show the advantages of the Kognitio in-memory architecture. If you already have an increased quota in place, it's straightforward to change the size and number of nodes in the script to create a larger system.
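
To see what vCPU quota you actually have in a region, and to get a feel for what a larger cluster looks like in az terms, something like the sketch below works. The resource group and cluster names are placeholders, and in practice you would simply edit the node size and count in the script rather than creating the cluster by hand.

# Check the regional vCPU quotas for your subscription
az vm list-usage --location westeurope --output table

# Rough hand-built equivalent of the larger (GCP-sized) cluster, with placeholder names
az aks create \
  --resource-group kognitio-rg \
  --name kognitio-aks \
  --node-count 3 \
  --node-vm-size Standard_E8s_v3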

The script creates a StorageV2 storage account and, within that, a container which is used in the examples. The details for the container are written out when the script finishes and can be used with the azure-ext-table-demo.kog script to read and write tables stored in the container.
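
If you would rather create the storage account and container yourself (for example, so the data outlives the cluster), the equivalent az commands look roughly like this; the account and container names are placeholders.

# Create a StorageV2 account and a blob container (placeholder names)
az storage account create \
  --name kognitiodemostorage \
  --resource-group kognitio-rg \
  --kind StorageV2 \
  --sku Standard_LRS
az storage container create \
  --name kognitio-data \
  --account-name kognitiodemostorage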

The script doesn't create an Azure Data Lake store, but the azure-ext-table-demo.kog script does include an example of how to connect to one using Kognitio external tables.

It's worth noting that Kognitio external tables use the Hadoop filesystem drivers (except for specific connectors optimised for S3, for example). This gives access to all the main cloud vendors' storage types. You can access any storage type from any cloud or on-premises deployment (S3 from GCP, or Azure storage from AWS, for example), giving you great flexibility in a hybrid cloud environment.

The Kubernetes part of the script is 95% the same as for AWS / GCP; the only difference is an annotation to increase the idle timeout on the Azure load balancer to its maximum of 30 minutes. The load balancer is used for convenience, and if the timeout is a problem there are many alternatives. Connecting to Kognitio using ODBC or JDBC just requires access to port 6550 (configurable) on any of the Kognitio containers.
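
The annotation is the standard AKS one for the load balancer's TCP idle timeout. If you were applying it by hand to an existing service it would look something like this; the service name is a placeholder, as the actual name depends on the cluster you create.

# Raise the Azure load balancer idle timeout to its 30 minute maximum (service name is a placeholder)
kubectl annotate service kognitio \
  service.beta.kubernetes.io/azure-load-balancer-tcp-idle-timeout="30"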

Architecture

The diagram below shows the architecture the script creates.

Running the script

All the files are in this GitHub repository.

Copy kubernitio-azure to your Azure cloud shell (or clone the repo) and run it with:

./kubernitio-azure.sh create <CIDR block>

The recommended CIDR block is <your external IP address>/32 – you can use 0.0.0.0/0, but the Kognitio server will then be accessible by anyone, which is not recommended. The whole process should take around ten minutes.
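
If you don't know your external IP address, you can look it up with any "what is my IP" service and pass it straight to the script, for example:

# Look up your external IP address and restrict access to it (ifconfig.me is one of many lookup services)
MYIP=$(curl -s https://ifconfig.me)
./kubernitio-azure.sh create ${MYIP}/32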

The script is semi-automated but you still have to enter some basic information about the system. The first step is Kognitio cluster initialisation:

  1. License agreement – enter "yes" to see the agreement or press return to skip.
  2. Accept license agreement – enter “yes” to continue.
  3. Cluster ID – enter up to 12 lower case, numeric or underscore characters, e.g. "mycluster".
  4. Number of storage volumes – enter “8”
  5. Storage volume size – enter “100”
  6. License key – enter "-" unless you have allocated more than 512GiB of container memory, in which case enter a license key.

The next phase is Kognitio server initialisation:

  1. SYS password – enter the password you want to use for the SYS (admin) account.
  2. System ID – enter the cluster ID you entered above.
  3. Enter to continue or ctrl-c to abort – press enter to continue.

When the Kognitio server has finished initialising, the script will output the IP address of the load balancer and the information required to create external tables backed by the storage created by the script.
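
You can also see the load balancer address with kubectl once Azure has allocated it; it appears in the EXTERNAL-IP column (the exact service name depends on the cluster you created).

# The EXTERNAL-IP column shows the load balancer address once it has been allocated
kubectl get services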

You can also get this information by running

./kubernitio-azure.sh info

To delete the server and all allocated resources you can run

./kubernitio-azure.sh delete

This will delete the resource group and hence the Kubernetes cluster, the NFS Server and the StorageV2 account (and therefore the database and any data you may have put into it or the container).
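
Under the hood this is essentially a resource group deletion, so if the script isn't to hand you can do the same thing directly; the resource group name here is a placeholder.

# Deleting the resource group removes everything the script created (placeholder name)
az group delete --name kognitio-rg --yes --no-wait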

You can connect to the cluster using any ODBC or JDBC compatible query tool – use the load balancer address to do this. 

Once you are connected to the Kognitio cluster, you need to get some data to query. The easiest way is to use an external table to connect to data stored in an Azure container; the azure-ext-table-demo.kog file in the GitHub repo contains SQL showing how to create external tables for accessing data in CSV, ORC or Parquet files. The block connector used for loading CSV files is very versatile and has many target string options for reading existing data in a variety of formats, including Avro.

Don’t forget to create memory images of your data before querying it.

If you have any comments or questions please leave them below or use our community forum.
