
Disk Space Connector

The Kognitio external table connector scripts are a powerful way of accessing data from a wide variety of sources. Any data that can be presented as a stream of CSV information can be brought into Kognitio in a massively parallel operation and presented as a table. There is no restriction on the programming language: as long as the code can run on Linux, read from stdin and write to stdout and stderr, it can be used. Typically BASH, Python or Perl is used, but for more advanced connectors we have used Java and C/C++.

This blog post presents a very basic connector that loads data from the Linux df command (which displays disk usage figures). It runs a single connector on each node to avoid retrieving duplicate information and prepends the hostname so we can see which node the figures relate to.

In this example we are using a BASH script which simply transforms the output of the df command into CSV, as shown below.

Standard df

$ df -P

Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/sda1 138840528 42271252 93702660 32% /
tmpfs 24705916 0 24705916 0% /dev/shm
cm_processes 24705916 187624 24518292 1% /var/run/cloudera-scm-agent/process


Processed into CSV

$ df -P | grep -v Filesystem | tr -s ' ' ',' | cut -d, -f1-4,6
/dev/sda1,138840528,42271252,93702660,/
tmpfs,24705916,0,24705916,/dev/shm
cm_processes,24705916,187624,24518292,/var/run/cloudera-scm-agent/process
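
Breaking the pipeline down stage by stage:

df -P |                  # POSIX output format: one record per line, predictable columns
  grep -v Filesystem |   # drop the header line
  tr -s ' ' ',' |        # turn each run of spaces into a single comma
  cut -d, -f1-4,6        # keep filesystem, size, used, available and mount point
                         # (field 5, the Capacity percentage, is discarded)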

To create the connector, we need to wrap this in some code that runs the df command when the connector's LOAD method is called.

#!/bin/bash -l

# The database sets WX2_METHOD to tell the script which connector method
# is being invoked; this connector only needs to act on LOAD.
if [ "$WX2_METHOD" = "LOAD" ] ; then
  hn=`hostname`
  # Convert the df output to CSV exactly as shown above.
  df=`df -P | grep -v Filesystem | tr -s ' ' ',' | cut -d, -f1-4,6`
  # Prepend the hostname to each record and write it to stdout.
  for o in $df ; do
    echo "${hn},${o}"
  done
fi
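
Before creating the connector, the script can be sanity-checked by hand, because the method to run is passed to it in the WX2_METHOD environment variable. For example (the file name and location are simply wherever you saved the script):

$ chmod +x Disk_Space_Connector
$ WX2_METHOD=LOAD ./Disk_Space_Connector

This should print one CSV record per filesystem, each prefixed with the local hostname.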

This code is placed on the AP of a standalone instance of Kognitio, or on the edge node of a Hadoop instance, and the following SQL is used to create the connector (you will need “CREATE CONNECTOR” privileges).

create connector Disk_Space_Connector
  command '/vol1/services/wxadmin/Disk_Space_Connector'
  target 'max_connectors_per_node 1';
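
At run time the script must be present, and executable, at the path given in the command attribute on the node(s) where the connector runs; a minimal sketch of putting a copy in place (the source location here is an assumption):

$ cp ~/Disk_Space_Connector /vol1/services/wxadmin/Disk_Space_Connector
$ chmod 755 /vol1/services/wxadmin/Disk_Space_Connector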

We can then build a table to access the connector. The table creation statement defines the column names and types and the database does the necessary conversions from CSV.

create external table disk_space (
hostname varchar,
filesystem varchar,
size_kb bigint,
used_kb bigint,
available_kb bigint,
mounted_on varchar
) from Disk_Space_Connector;

And finally select from the table.

select * from disk_space order by hostname, filesystem;

Which produces.

HOSTNAME FILESYSTEM SIZE_KB USED_KB AVAILABLE_KB MOUNTED_ON
hadoop02-rack3-enc2-10 /dev/sda1 30465020 11693060 18142960 /
hadoop02-rack3-enc2-10 /dev/sdb1 220248772 67738168 141322540 /hdfs
hadoop02-rack3-enc2-10 cm_processes 24705916 183780 24522136 /var/run/process
hadoop02-rack3-enc2-10 tmpfs 24705916 12935056 11770860 /dev/shm
hadoop02-rack3-enc2-11 /dev/sda1 30465020 12214992 17621028 /
hadoop02-rack3-enc2-11 /dev/sdb1 220248772 65651144 143409564 /hdfs
hadoop02-rack3-enc2-11 cm_processes 24705916 183196 24522720 /var/run/process
hadoop02-rack3-enc2-11 tmpfs 24705916 12935056 11770860 /dev/shm
hadoop02-rack3-enc2-12 /dev/sda1 30465020 11474152 18361868 /
hadoop02-rack3-enc2-12 /dev/sdb1 220248772 68018648 141042060 /hdfs
hadoop02-rack3-enc2-12 cm_processes 24705916 183216 24522700 /var/run/process
hadoop02-rack3-enc2-12 tmpfs 24705916 12935056 11770860 /dev/shm

If the table definition does not match the data, an error is returned. To show this, we have removed the hostname column from the table definition.

create external table disk_space_err (
filesystem varchar,
size_kb bigint,
used_kb bigint,
available_kb bigint,
mounted_on varchar
) from Disk_Space_Connector;

select * from disk_space_err order by filesystem;

HY000[Kognitio][WX2 Driver][hdev:6550] ET010F: Invalid external data record, see SYS.IPE_CONV_ERROR for details (tno 6074)

To get details of the error, query the SYS.IPE_CONV_ERROR table as directed in the error message (remember to substitute your tno into the query).

select column_no, record_char_pos, message, cast(record as varchar(10000)) as record from SYS.IPE_CONV_ERROR where tno = 6074;

Which returns (turned sideways for clarity).

COLUMN_NO 2
RECORD_CHAR_POS 23
MESSAGE IE112B: Invalid field value for column type or incorrect field terminator
RECORD hadoop02-rack3-enc2-11,/dev/sda1,30465020,12218452,17617568,/

This shows that the problem is with column 2 at character position 23 in the record. The connector still prepends the hostname, so each field is shifted one place relative to the new table definition: the second column (size_kb) expects a bigint, but the corresponding field in the record is “/dev/sda1”, which can’t be converted, hence the error.

If the data source is not comma separated, there are many options that can be used to transform the data: see the quick reference sheet "Target String Format Attributes.pdf" and chapter 8 of "kognitio-Guide-v80100.pdf" for more information (the latest version 8 documentation is available on the forum).
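
For simple cases the transformation can also be done inside the connector script itself before the records are written to stdout. As an illustrative sketch (the data source here is arbitrary), awk can rebuild whitespace-separated output as CSV:

# Sketch only: rebuild whitespace-separated records with a comma as the
# output field separator. vmstat is used purely as an example data source.
vmstat 1 1 | tail -n 1 | awk 'BEGIN { OFS="," } { $1 = $1; print }'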

This blog entry has only scratched the surface of what can be done with an external table connector. As well as the LOAD method we utilised above, there are methods that are called when the connector is created, and methods that allow a subset of columns to be returned, which can improve performance when only a small number of columns are required from a wide table.

Of particular note is a method called LOADCO, which can be used to coordinate the loading of multiple files or other objects by sending messages to the LOAD methods. I recently used it to create a connector that fetches a list of files from a server via SSH and then distributes the file names across multiple LOAD processes so the files are loaded in parallel. This can give very good performance: our testing settled on 12 concurrent load processes (6 on each of 2 nodes) as the most the file server could handle, but we tried 100 and the only problem was the file server struggling to serve the files.
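
A full LOADCO example is beyond the scope of this post, and the mechanism for passing messages to the LOAD methods is specific to the connector API (see the documentation), but the file-gathering side is ordinary shell scripting. A hypothetical fragment, with the server, user and path invented for illustration:

# Hypothetical: gather the list of files that the LOADCO method would then
# distribute to the LOAD processes as messages (the Kognitio message-passing
# calls themselves are not shown). Server, user and path are examples only.
files=`ssh loaduser@fileserver01 'ls -1 /data/exports/*.csv'`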

The BASH script and SQL used above are available on the forum under "Disk Space Connector script".

If you have any questions or ideas for connectors please post them on the forum or in the blog comments.

Mark Marsh – Senior Technical Consultant