Each wife had seven sacks… Converting simple, flat JSON data into CSV or another tabularRead More
Kits, cats, sacks, wives: unfolding JSON data into tables
Each wife had seven sacks…
Converting simple, flat JSON data into CSV or another tabular format is a problem addressed by many freely-available JSON-to-CSV converters. However, these are usually only useful when the JSON is a series of flat objects without any nested arrays.
In practice, real-world JSON files often have nested arrays containing more objects, each of which may contain arrays of their own. They tend to look like the old riddle As I was going to St Ives:
As I was going to St. Ives,
I met a man with seven wives.
Each wife had seven sacks,
Each sack had seven cats,
Each cat had seven kits.
Kits, cats, sacks, wives,
How many were there going to St. Ives?
If you’ve heard the riddle before, you may know the answer is often given as 1. We won’t dwell on the varyingly pedantic interpretations of the question – that’s what Wikipedia is for – instead we’ll look at the more interesting part of the problem, using a JSON file which represents this situation.
This pre-prepared JSON file, st_ives_wives.json, contains all the people, objects and animals mentioned in the riddle, in their hierarchical order. The top-level object contains an array wives, each of which has a sacks array containing seven sack objects, each of which has a cats array containing seven cat objects, and each of those contains a kits array listing that cat’s kittens.
In a relational database, you’d have four tables for this – a WIFE table, a SACK table, a CAT table and a KITTEN table, each containing the attributes of that entity and having a foreign key referring to the entity containing it.
To convert this JSON file into four DDL statements and four CSV files to load into them requires special-case knowledge of the structure and the attributes of each entity, and an element of manual intervention, which we want to minimise.
From a nested JSON file to flat table definitions
Kognitio comes with tools to assist in analysing and loading JSON data into tables. The first one you’ll use is wxjsonprofile. Give it a JSON file, and it will make an educated guess on what counts as a “record”. It will give you a .sql file containing a suitable DDL statement for creating a table, and a .json2csv file containing a JSON format string, to tell Kognitio’s bulk loading tool, wxloader, how to load the data (more on that later).
If your JSON file is better represented as two or more tables, because the JSON records contain arrays (like our example above), you need to tell it how to uniquely identify a top-level JSON record. In our JSON example, the wifeid member does that – wxjsonprofile needs to know that no two top-level records contain the same wifeid value, so that can be used as a foreign key for other tables. wxjsonprofile will then give you many .sql files and many .json2csv files, one for each table.
So let’s say we want to load into database tables all the information we know about the people, things and animals in the riddle, and all we have is this heavily-nested JSON file. We give wxjsonprofile it our JSON file and use the -k option to tell it that the wifename field in each object is a unique key, which is what allows it to join the wives table to the sacks table. We then tell it that the sackid field in the sacks array is another key, and the catid field, and the kitid field.
$ wxjsonprofile -k "wifeid, sacks.sackid, sacks.cats.catid, \ sacks.cats.kits.kitid" \ st_ives_wives.json Kognitio WX2 JSON Profiling Tool v8.01.50-rel170105 (c)Copyright Kognitio Ltd 2015-2017. st_ives_wives.json: one JSON object. root 1 obj description 1 str(3) name 1 str(37) wives : 7 obj sacks : 49 obj cats : 343 obj catid 343 int2 catname 343 str(22) kits : 2.4k obj kitid 2.4k int2 kitname 2.4k str(21) sackid 49 int1 sackname 49 str(44) wifeid 7 int1 wifename 7 str(7) st_ives_wives.json: no target array specified... looks like it might be "wives" BASE TABLE st_ives_wives, 2 columns, target array wives DIMENSION TABLES st_ives_wives_sacks (sacks), 4 columns st_ives_wives_sacks_cats (sacks.cats), 5 columns st_ives_wives_sacks_cats_kits (sacks.cats.kits), 6 columns Writing st_ives_wives.sql... done. Writing st_ives_wives.json2csv... done. Writing st_ives_wives_sacks.sql... done. Writing st_ives_wives_sacks.json2csv... done. Writing st_ives_wives_sacks_cats.sql... done. Writing st_ives_wives_sacks_cats.json2csv... done. Writing st_ives_wives_sacks_cats_kits.sql... done. Writing st_ives_wives_sacks_cats_kits.json2csv... done.
wxjsonprofile shows us what the structure of the JSON object looks like and correctly guesses the target array, the array containing the records we want to load (in this case it’s wives – we could have made this explicit using the option -A wives, but wxjsonprofile has worked it out itself). It then identifies three nested dimensions under the wives array, and gives us SQL files for creating the tables we need and .json2csv files which define how a row is created.
We can now create our four tables and load into them.
Loading JSON data into Kognitio database tables
wxloader is Kognitio’s bulk data loading tool. We won’t go into detail about wxloader here – for a general-purpose introduction, see the wxloader quick reference sheet. To load JSON data into Kognitio using wxloader, you need to tell it what file you want it to load, what table you want to load it into, and which values to project out of the JSON file to insert a record into the table. This is where the .json2csv files come in.
Loading the main table
wxjsonprofile has given us some .sql files, each of which contain a table definition, in SQL, ready to be executed on Kognitio to create the tables. We’ll take it as read that we’ve already done that.
wxjsonprofile has also given us some .json2csv files, which contain a recipe for how to convert a JSON record into a table row. Let’s look at the st_ives_wives.json2csv file:
$ cat st_ives_wives.json2csv [wives] APPLY(firstvalid(?, null)), wifeid wifename
This is a JSON format string. More information about all the things the format string can do are in the wxjson2csv(1) man page, but we’ll go through the important points here:
The [wives] bit at the top tells wxloader that the target array (the main array containing the records we want) is the array called wives. wxloader assumes that each element of this array can be taken in isolation and used to create one or more records for the table (in this case, one). Behind the scenes, the file is chopped up into records and the individual records distributed around the Kognitio system for fast parallel conversion.
APPLY(firstvalid(?, null)) means that if a field requested by the rest of the format string doesn’t exist in the JSON object, the field should be NULL. Without this line, such a scenario would give you an error when you try to load the data. We don’t really need it for this example but wxjsonprofile puts it in anyway, as it doesn’t do any harm.
wifeid,wifename tells wxloader that for each JSON object in the target array, load two values into the table: the values of the wifeid and wifename elements of the object, in that order. The st_ives_wives.sql file defines a table with an integer and a string, in that order, so this looks sensible.
$ wxloader -s 172.30.11.10 -u stives -t st_ives_wives \ -J st_ives_wives.json2csv st_ives_wives.json Kognitio WX2 Data Loading Tool v8.01.50-rel170105 (c)Copyright Kognitio Ltd 2010-2017. Password: Connected to 172.30.11.10 as user stives Loading into st_ives_wives from st_ives_wives.json 7 objects read 524K 458KB/s 99% 536223 bytes read 7 records imported Load completed successfully.
-s and -u specify the server and username respectively (we’ve previously created the user stives on the Kognitio system for this example). The -t option to wxloader tells it what table to put the data in. The -J option tells it that the format string for transforming the JSON is in this file, and the st_ives_wives.json argument is the filename.
The ST_IVES_WIVES table now contains this:
WIFEID | WIFENAME --------+---------- 0 | Alice 1 | Barbara 2 | Cheryl 3 | Donna 4 | Edith 5 | Frances 6 | Grace
Loading the dimension tables
Let’s have a look at the table definition and format string for the wife-to-sack table…
$ cat st_ives_wives_sacks.sql CREATE TABLE st_ives_wives_sacks ( wifeid INTEGER, -- wifeid seq INTEGER, -- seq(sacks) sacks_sackid INTEGER, -- sacks.sackid sacks_sackname VARCHAR(100) char set utf8 -- sacks.sackname ); $ cat st_ives_wives_sacks.json2csv [wives] APPLY(firstvalid(?, null)), wifeid, seq(sacks), sacks.sackid, sacks.sackname
In this case, wxjsonprofile has generated something a bit more complicated.  is the dimension operator, and it means to iterate over the named array and generate a record for each element. So sacks.fieldname tells Kognitio “for every element of the sacks array, generate and load into the table a record containing the value of the fieldname member of that element”. In this case, for each element of the sacks array, we put a record in the table containing the sack’s name. Note that properties within a hierarchy have their name elements delimited by a dot.
The seq() function projects the current position in the given array, starting from 0. wxjsonprofile puts this in automatically, so the position of an object in its array is preserved in the table.
Let’s see what happens when we tell wxloader to load data from the same st_ives_wives.json file, transformed by the st_ives_wives_sacks.json2csv format string file, into the ST_IVES_WIVES_SACKS table…
$ wxloader -s 172.30.11.10 -u stives -t st_ives_wives_sacks \ -J st_ives_wives_sacks.json2csv st_ives_wives.json Kognitio WX2 Data Loading Tool v8.01.50-rel170105 (c)Copyright Kognitio Ltd 2010-2017. Password: Connected to 172.30.11.10 as user stives Loading into st_ives_wives_sacks from st_ives_wives.json 7 objects read 524K 264KB/s 99% 536223 bytes read 49 records imported Load completed successfully.
We read in seven JSON objects (one for each wife) and we’ve generated 49 records, because each of the seven wives has seven sacks.
Let’s see what we’ve loaded:
WIFEID | SEQ | SACKS_SACKID | SACKS_SACKNAME --------+-----+--------------+---------------------------------------------- 0 | 0 | 0 | Alice's Hold-e-eeze 4000 0 | 1 | 1 | Alice's Contain-o-matic Plus 0 | 2 | 2 | Alice's Bulk Animal Transporter (Deluxe) 0 | 3 | 3 | Alice's All-purpose Bag For Life 0 | 4 | 4 | Alice's HessianTech "Executive" 0 | 5 | 5 | Alice's Amazing Rubber Catholder 0 | 6 | 6 | Alice's Hipster CatTidy 1975 Retro Edition 1 | 0 | 7 | Barbara's Hold-e-eeze 4000 1 | 1 | 8 | Barbara's Contain-o-matic Plus 1 | 2 | 9 | Barbara's Bulk Animal Transporter (Deluxe) ... [snip the rest - 49 records in total]
We’ve now got all the information we know about the 49 sacks in our sacks table.
Loading into the ST_IVES_WIVES_SACKS_CATS table and the ST_IVES_WIVES_SACKS_CATS_KITS table is the same. For brevity, we won’t show the wxloader output for each one, just the table definitions and the .json2csv definitions. The wxloader commands are the same as before except for the table name and the name of the .json2csv file.
As you may expect if you’ve already worked out the powers of 7, there are 343 records in the cats table and 2,401 records loaded into the kits table.
$ cat st_ives_wives_sacks_cats.sql CREATE TABLE st_ives_wives_sacks_cats ( wifeid INTEGER, -- wifeid sacks_sackid INTEGER, -- sacks.sackid seq INTEGER, -- seq(sacks.cats) sacks_cats_catid INTEGER, -- sacks.cats.catid sacks_cats_catname VARCHAR(100) char set utf8 -- sacks.cats.catname ); $ cat st_ives_wives_sacks_cats.json2csv [wives] APPLY(firstvalid(?, null)), wifeid, sacks.sackid, seq(sacks.cats), sacks.cats.catid, sacks.cats.catname
Sampled contents of the cats table:
WIFEID | SACKS_SACKID | SEQ | SACKS_CATS_CATID | SACKS_CATS_CATNAME --------+--------------+-----+------------------+------------------------ 0 | 0 | 0 | 0 | Boniface Moriarty 0 | 0 | 1 | 1 | Bernard Partridge 0 | 0 | 2 | 2 | Hubert Snettersworth 0 | 0 | 3 | 3 | Norbert Spongeworthy 0 | 0 | 4 | 4 | Scrumpy Overton 0 | 0 | 5 | 5 | Galahad Meadowbrook 0 | 0 | 6 | 6 | Scrumpy Quinquangle 0 | 1 | 0 | 7 | Charles Partridge 0 | 1 | 1 | 8 | Galahad Snettersworth 0 | 1 | 2 | 9 | Xerxes Spongeworthy ... [snip - total 343 records]
$ cat st_ives_wives_sacks_cats_kits.sql CREATE TABLE st_ives_wives_sacks_cats_kits ( wifeid INTEGER, -- wifeid sacks_sackid INTEGER, -- sacks.sackid sacks_cats_catid INTEGER, -- sacks.cats.catid seq INTEGER, -- seq(sacks.cats.kits) sacks_cats_kits_kitid INTEGER, -- sacks.cats.kits.kitid sacks_cats_kits_kitname VARCHAR(100) char set utf8 -- sacks.cats.kits.kitname ); $ cat st_ives_wives_sacks_cats_kits.json2csv [wives] APPLY(firstvalid(?, null)), wifeid, sacks.sackid, sacks.cats.catid, seq(sacks.cats.kits), sacks.cats.kits.kitid, sacks.cats.kits.kitname
Sampled contents of the kits table:
WIFEID | SACKS_SACKID | SACKS_CATS_CATID | SEQ | SACKS_CATS_KITS_KITI | SACKS_CATS_KITS_KITNA --------+--------------+------------------+-----+----------------------+----------------------- 0 | 0 | 0 | 0 | 0 | Mewy Moriarty 0 | 0 | 0 | 1 | 1 | Sleepy Moriarty 0 | 0 | 0 | 2 | 2 | Squeaky Moriarty 0 | 0 | 0 | 3 | 3 | Dimple Moriarty 0 | 0 | 0 | 4 | 4 | Jumpy Moriarty 0 | 0 | 0 | 5 | 5 | Flumpy Moriarty 0 | 0 | 0 | 6 | 6 | Scrawny Moriarty 0 | 0 | 1 | 0 | 7 | Mewy Partridge 0 | 0 | 1 | 1 | 8 | Sleepy Partridge 0 | 0 | 1 | 2 | 9 | Squeaky Partridge ... [snip - total 2,401 rows]
Which wife has the most cats called Hubert?
Now we’re ready to run whatever queries we want. Our JSON file gives a name to each cat. As an example, let’s find out which wife has the most cats called Hubert.
> select w.wifeid, w.wifename, count(*) from st_ives_wives w, st_ives_wives_sacks_cats c where w.wifeid = c.wifeid and c.sacks_cats_catname like 'Hubert %' group by 1, 2 order by 3 desc; WIFEID | WIFENAME | COUNT(*) --------+----------+---------- 4 | Edith | 3 2 | Cheryl | 2 0 | Alice | 2 3 | Donna | 2 5 | Frances | 1 6 | Grace | 1 1 | Barbara | 1 > select c.sacks_cats_catname from st_ives_wives_sacks_cats c, st_ives_wives w where w.wifeid = c.wifeid and w.wifename = 'Edith' and c.sacks_cats_catname like 'Hubert %'; SACKS_CATS_CATNAME -------------------- Hubert Supernova Hubert Hatbox Hubert Hatbox
Edith has three Huberts, which happen to be called Hubert Supernova, Hubert Hatbox and Hubert Hatbox (not to be confused with the other one).
We’ve now got kits, cats, sacks and wives, each with their own entry in a database table. All we needed to tell wxjsonprofile was which JSON property to use to uniquely identify each entity, and it worked out the rest itself, giving us DDL statements for the tables and JSON definition files to pass to wxloader so we could load the required fields into each table.
For more information on the JSON format string syntax and its capabilities, see the wxjson2csv(1) man page. The wxjsonprofile(1) and wxloader(1) man pages provide details on those tools.
No cats were harmed in the making of this article.