One sneaky tip for querying MapR-DB tables using ANSI SQL

Introduction

This blog post presents a basic MapR-DB connector which allows data in a MapR-DB table to be queried using ANSI SQL in Kognitio. It demonstrates how simple writing a connector for a data source can be, and shows how standard Kognitio utilities can transform JSON data into tabular format.

The following assumes that you have Kognitio installed on your MapR cluster, are familiar with MapR-DB JSON tables, have Maven installed and are familiar with Java.

The code to go with this blog is available on GitHub.

A Kognitio external table is a table which gets its data from an external source such as a Hive table or files in S3. In this case we’re extracting JSON records from MapR-DB using the dbshell command and transforming them into tabular format with wxjson2csv (using a transformation specification auto-generated by wxjsonprofile, which is discussed in more detail in the blog post Kits, cats, sacks, wives: unfolding JSON data into tables).

Example Data

The JSON data for this example is taken from the nicely curated CORGIS dataset and is the airlines.json dataset which contains summary flight data for American carriers and airports.

To get the data into MapR-DB I adapted a MapR Java example to insert records from the file using an incrementing ID.

To insert the data, connect to a node on the system as the user used to install Kognitio and:

cd to the user directory on MapR-FS so that all nodes can see the files.

cd /mapr/<clustername>/user/<username>

git clone the repo containing the Java code to insert records.

git clone https://github.com/kognitio-ltd/blog-mapr-db
cd blog-mapr-db

Build the Java app.

mvn clean package

Create a MapR-DB JSON table called “airlines”.

maprcli table create -path airlines -tabletype json

Download the airlines.json dataset.

wget https://think.cs.vt.edu/corgis/json/airlines/airlines.json

Load the data into the airlines table.

java -cp `mapr classpath`:target/MapRDB-Airport-0.0.1-SNAPSHOT.jar com.kognitio.presales.blog.LoadAirports airlines airlines.json

Converting to CSV

Once the data has been loaded, we can query it with dbshell to get sample data for wxjsonprofile to work on:

mapr dbshell find airlines --fromid "0000" --toid "0010" | head -n -1 > airlines-table.json

The records returned are in the format below (pretty printed here):
{
  "_id": "0000",
  "airport": {
    "code": "ATL",
    "name": "Atlanta, GA: Hartsfield-Jackson Atlanta International"
  },
  "carrier": {
    "code": "AA",
    "name": "American Airlines Inc."
  },
  "statistics": {
    "# of delays": {
      "carrier": 34,
      "late aircraft": 18,
      "national aviation system": 105,
      "security": 2,
      "weather": 28
    },
    "flights": {
      "cancelled": 5,
      "delayed": 186,
      "diverted": 0,
      "on time": 561,
      "total": 752
    },
    "minutes delayed": {
      "carrier": 1367,
      "late aircraft": 1269,
      "national aviation system": 3817,
      "security": 139,
      "total": 8314,
      "weather": 1722
    }
  },
  "time": {
    "label": "2003/6",
    "month": 6,
    "year": 2003
  }
}

We need to convert this structure into a tabular format so that it can be read by the Kognitio external table. To do this we will use wxjsonprofile to build a conversion specification and wxjson2csv to do the conversion.
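Conceptually, the conversion amounts to walking each nested JSON record depth-first and emitting the leaf values as one CSV row. The following Python sketch illustrates the idea on a cut-down version of the record above; it is an illustration only, not the actual wxjson2csv implementation, and the `flatten` helper is hypothetical:

```python
import csv
import io
import json

def flatten(record, prefix=""):
    """Depth-first flatten of a nested dict into an ordered list of (path, value)."""
    items = []
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            items.extend(flatten(value, path))
        else:
            items.append((path, value))
    return items

# A cut-down airlines record (dict order is preserved in Python 3.7+)
record = json.loads('''{
  "_id": "0000",
  "airport": {"code": "ATL",
              "name": "Atlanta, GA: Hartsfield-Jackson Atlanta International"},
  "time": {"label": "2003/6", "month": 6, "year": 2003}}''')

row = [value for _, value in flatten(record)]
buf = io.StringIO()
csv.writer(buf).writerow(row)  # quotes the airport name because it contains a comma
print(buf.getvalue().strip())
# → 0000,ATL,"Atlanta, GA: Hartsfield-Jackson Atlanta International",2003/6,6,2003
```

The dotted paths produced by `flatten` (e.g. `airport.name`, `time.month`) correspond to the column comments wxjsonprofile writes into its generated table definition.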

wxjsonprofile airlines-table.json

This produces a summary of the JSON structure in the file and creates two files:

  • airlines_table.json2csv – this contains the conversion definition
  • airlines_table.sql – this contains the Kognitio table definition

We can see how wxjson2csv uses this file to create tabular data with the following command:

wxjson2csv -J airlines_table.json2csv airlines-table.json

Which produces:

0000,ATL,"Atlanta, GA: Hartsfield-Jackson Atlanta International",AA,American Airlines Inc.,34,18,105,2,28,5,186,0,561,752,1367,1269,3817,139,8314,1722,2003/6,6,2003
0001,BOS,"Boston, MA: Logan International",AA,American Airlines Inc.,69,46,84,2,24,7,225,0,1034,1266,4201,3043,3067,45,12139,1783,2003/6,6,2003
0002,BWI,"Baltimore, MD: Baltimore/Washington International Thurgood Marshall",AA,American Airlines Inc.,18,25,38,0,20,1,101,1,490,593,1058,1600,1708,0,5698,1332,2003/6,6,2003
0003,CLT,"Charlotte, NC: Charlotte Douglas International",AA,American Airlines Inc.,17,10,10,2,10,4,48,0,151,203,968,653,558,127,3048,742,2003/6,6,2003
0004,DCA,"Washington, DC: Ronald Reagan Washington National",AA,American Airlines Inc.,43,26,53,0,25,5,147,2,793,947,2048,1566,2767,3,7868,1484,2003/6,6,2003
0005,DEN,"Denver, CO: Denver International",AA,American Airlines Inc.,43,24,70,1,19,1,157,1,591,750,2098,1329,2780,46,7446,1193,2003/6,6,2003

Building the connector

We will now build a Kognitio connector based on the above commands. This will be a very simple connector that runs with no parallelism – options to add parallelism are described below.

Firstly, we need to create a connector – the git repo has a file called mapr-db-connector:

#!/bin/bash
# Basic Kognitio MapR-DB connectivity
#
case $WX2_METHOD in
INIT)
  # Tell the resource scheduler how much memory we will be using
  echo "WX2ATTR:RAM_MB_REQUIRED 300"
;;
DESTROY)
;;
CONNECT)
;;
PROJECT)
;;
LOAD)
  # The virtual memory limit set by the database isn't suitable for
  # Java, so disable it and limit the Java heap size
  ulimit -v unlimited
  export MAPR_OPTS="-Xmx150m"
  mapr dbshell find "${WX2_TABLE}" | head -n -1 | wxjson2csv -J "${WX2_SPECIFICATION}" -
  exit $?
;;
esac
exit 0

To create a connector using that file, run the following SQL as the SYS user (remember to substitute the command path with the appropriate one for your cluster and user):

create connector mapr_db_connector command '/mapr/maprdemo/user/kogmapr/blog-mapr-db/mapr-db-connector' target 'max_connectors 1';

We can now create a table based on the connector using the template file airlines_table.sql created by wxjsonprofile earlier.

create schema maprdb;
CREATE external TABLE maprdb.airlines (
id VARCHAR(10) char set utf8, -- _id
airport_code VARCHAR(10) char set utf8, -- airport.code
airport_name VARCHAR(100) char set utf8, -- airport.name
carrier_code VARCHAR(10) char set utf8, -- carrier.code
carrier_name VARCHAR(100) char set utf8, -- carrier.name
statistics_ofdelays_carrier INTEGER, -- statistics.'# of delays'.carrier
statistics_ofdelays_lateaircraft INTEGER, -- statistics.'# of delays'.'late aircraft'
statistics_ofdelays_nationalaviationsystem INTEGER, -- statistics.'# of delays'.'national aviation system'
statistics_ofdelays_security INTEGER, -- statistics.'# of delays'.security
statistics_ofdelays_weather INTEGER, -- statistics.'# of delays'.weather
statistics_flights_cancelled INTEGER, -- statistics.flights.cancelled
statistics_flights_delayed INTEGER, -- statistics.flights.delayed
statistics_flights_diverted INTEGER, -- statistics.flights.diverted
statistics_flights_ontime INTEGER, -- statistics.flights.'on time'
statistics_flights_total INTEGER, -- statistics.flights.total
statistics_minutesdelayed_carrier INTEGER, -- statistics.'minutes delayed'.carrier
statistics_minutesdelayed_lateaircraft INTEGER, -- statistics.'minutes delayed'.'late aircraft'
statistics_minutesdelayed_nationalaviationsystem INTEGER, -- statistics.'minutes delayed'.'national aviation system'
statistics_minutesdelayed_security INTEGER, -- statistics.'minutes delayed'.security
statistics_minutesdelayed_total INTEGER, -- statistics.'minutes delayed'.total
statistics_minutesdelayed_weather INTEGER, -- statistics.'minutes delayed'.weather
time_label VARCHAR(10) char set utf8, -- time.label
time_month INTEGER, -- time.month
time_year INTEGER -- time.year
) from mapr_db_connector target 'table "airlines", specification "/mapr/maprdemo/user/kogmapr/blog-mapr-db/airlines_table.json2csv"';

Querying the data

We can now select from the table:

select top 4 * from maprdb.airlines;
ID   | AIRPORT_CODE | AIRPORT_NAME                                                        | CARRIER_CODE | CARRIER_NAME           | … | TIME_LABEL | TIME_MONTH | TIME_YEAR
0000 | ATL          | Atlanta, GA: Hartsfield-Jackson Atlanta International               | AA           | American Airlines Inc. | … | 2003/6     | 6          | 2003
0001 | BOS          | Boston, MA: Logan International                                     | AA           | American Airlines Inc. | … | 2003/6     | 6          | 2003
0002 | BWI          | Baltimore, MD: Baltimore/Washington International Thurgood Marshall | AA           | American Airlines Inc. | … | 2003/6     | 6          | 2003
0003 | CLT          | Charlotte, NC: Charlotte Douglas International                      | AA           | American Airlines Inc. | … | 2003/6     | 6          | 2003

(the 17 statistics columns are elided here for width; their values match the CSV rows shown earlier)

And count the records:

select count(*) from maprdb.airlines;
> COUNT(*)
> 54013

These two operations take around 20 seconds each, which isn’t exactly fast. This is largely due to the lack of parallelism in this simple connector, but by using Kognitio’s in-memory capabilities we can bring query times back to the levels Kognitio users have come to expect. We will create a view of the table and image it:

create view image maprdb.v_airlines as select * from maprdb.airlines;

The create view image takes 24 seconds, but the count(*) from v_airlines now takes just 0.1 seconds – Kognitio works best when data is imaged in memory!

More complex queries are similarly performant:

select time_year, carrier_name,
cast((sum(statistics_flights_delayed) / cast(sum(statistics_flights_total) as float)) * 100 as decimal(7,2)) as percent_delayed
from maprdb.v_airlines
group by 1, 2
order by 1, 3 asc;
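As a sanity check on the arithmetic, here is the same percentage calculation done by hand in Python over the six sample CSV rows shown earlier (all American Airlines, 2003/6, so they collapse into a single group):

```python
# (delayed, total) flight counts taken from the six sample CSV rows above
samples = [(186, 752), (225, 1266), (101, 593), (48, 203), (147, 947), (157, 750)]

delayed = sum(d for d, _ in samples)   # 864
total = sum(t for _, t in samples)     # 4511
percent_delayed = round(delayed / total * 100, 2)
print(percent_delayed)
# → 19.15
```

So for these six airports roughly one flight in five was delayed; the SQL query above computes the same figure per (year, carrier) group across the whole table.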

This query takes 0.1 seconds to execute and a further 0.1 seconds to return its 244 results on my small test system. The data set we are working with here is tiny, and much of the execution time is taken up by compiling the query and network latency; the actual query time against 54,013 records is negligible.

Performance is obviously system dependent but raw scan speeds of several billion rows per second are the norm for Kognitio.

Enhancements

This is a very basic connector; generally Kognitio connectors run massively parallel, with tens or hundreds of connections to the source data system. There are several options for implementing parallelism with MapR-DB: loading parts of the table in parallel using --offset and --limit in each of the parallel connectors, or using a query to extract the data required from the MapR-DB table. Using a Java application for the connector would be more flexible and performant than using dbshell, and such applications aren’t difficult to write (e.g. the simple application used to insert the airlines data into MapR-DB above).
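The --offset/--limit approach boils down to giving each parallel connector instance a non-overlapping window of the table. A minimal sketch of the windowing arithmetic, assuming each instance knows its own fragment number and the total fragment count (the `fragment_window` helper and those inputs are hypothetical; Kognitio's actual parallel-connector interface is not shown here):

```python
def fragment_window(fragment, fragment_count, total_rows):
    """Compute a non-overlapping --offset/--limit window for one connector instance."""
    limit = -(-total_rows // fragment_count)  # ceiling division, so all rows are covered
    offset = fragment * limit
    return offset, limit

# e.g. fragment 2 of 4 over the 54,013 airline records
offset, limit = fragment_window(2, 4, 54013)
print(f"mapr dbshell find airlines --offset {offset} --limit {limit}")
# → mapr dbshell find airlines --offset 27008 --limit 13504
```

Each instance would then run its own dbshell command over its window and pipe the output through wxjson2csv, exactly as the single-instance LOAD step does above.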

If you have a requirement for running SQL queries against large amounts of data stored in MapR-DB tables (or any other data source) please feel free to contact us for advice on how best to go about it. We may already have a suitable connector or one that can be modified for your requirements. The Kognitio server is the same on all platforms so the principles used here are equally applicable to standalone, Hadoop and MapR on premise or in the cloud.

Contact can be through the community forums, via your support portal (existing support customers only) or, for partners, through your partner account manager.
