JSON Loading

Some users have JSON data they would like to query using Kognitio along with other data sources. JSON needs to be converted into a tabular format so that it can be represented in a Kognitio table or external table. How this is done depends on the structure of the data within the JSON data that needs accessing.

The command line tool wxjson2csv is a program that produces a CSV file from:

  • JSON file(s)
  • some instructions on how to generate CSV records from it, called the column definition
  • name of the JSON array in which the relevant objects can be found (if applicable)

See the end of this section for details on how to use the Kognitio client tool wxloader to load data into Kognitio without converting into a CSV file as an interim step.

Kognitio JSON functionality is best illustrated with examples. Suppose we have a JSON file containing information about different types of fruit.

{
    "fruits" : [
        { "name" : "apple", "color" : "green", "tastiness" : 8 },
        { "name" : "orange", "color" : "orange", "tastiness" : 9 },
        { "name" : "banana", "color" : "yellow", "tastiness" : 4 },
        { "name" : "pear", "color" : "green", "tastiness" : 6 }
    ]
}

If you wanted to represent this information in a FRUIT table, it’s easy enough:

name color tastiness
apple green 8
orange orange 9
banana yellow 4
pear green 6

The column definition for wxjson2csv to convert this JSON to CSV would simply be name, color, tastiness.

The wxjson2csv command to execute this from the JSON input in “fruits.json” would be

$ wxjson2csv -j "name, color, tastiness" -a fruits fruits.json

apple,green,8
orange,orange,9
banana,yellow,4
pear,green,6

The -a fruits option means that the array of objects in the file we need to look at is called fruits.

Alternatively, you can specify the target array name in square brackets at the start of the column definition list, like this:

$ wxjson2csv -j "[fruits] name, color, tastiness" fruits.json

In the rest of this documentation will use this notation.

Nested objects

A JSON object representing a single entity is rarely a simple flat structure containing only named values. Objects may contain other objects. For example, here is a possible JSON representation of a list of towns with their locations.

{
    "towns": [
        {
            "name" : "Bracknell",
            "location" : {
                "latitude" : 51.408333,
                "longitude" : -0.756666,
                "osgridref" : {
                    "northing" : 168500,
                    "easting" : 486500
                }
            }
        },
        {
            "name" : "Reading",
            "location" : {
                "latitude": 51.455000,
                "longitude": -0.970000,
                "osgridref": {
                    "northing": 173500,
                    "easting": 471500
                }
            }
        },
        {
            "name" : "Bristol",
            "location" : {
                "latitude" : 51.476666,
                "longitude" : -2.568333,
                "osgridref" : {
                    "northing" : 175500,
                    "easting" : 360500
                }
            }
        }
    ]
}

To refer to names of members inside other named objects, use the “.” character to separate names in the hierarchy. For example, if we specified the column list [towns] name, location.latitude, location.longitude. We would get this output:

$ wxjson2csv -j "[towns] name, location.latitude, location.longitude" town.json

Bracknell,51.408333,-0.756666
Reading,51.455000,-0.970000
Bristol,51.476666,-2.568333

Or, we could include the grid references in the column definition as well [towns] name, location.latitude, location.longitude, location.osgridref.northing, location.osgridref.easting. To obtain the following results:

Bracknell,51.408333,-0.756666,168500,486500
Reading,51.455000,-0.970000,173500,471500
Bristol,51.476666,-2.568333,175500,360500

Dimensions

JSON files with nested objects, like the one above, are essentially no different from flat objects, except with some of the key values grouped in a certain way. These are easily dealt with. However, in some JSON files, an object representing a single entity might contain an array of other objects, each of which represent another entity, and you want one output record for each. For example, suppose you have a JSON file which contains an array, where each element of the array represents a person. Each person has many hobbies, which are listed in an array.

{
    "people" : [
        {
            "name" : "Alice",
            "hobbies" : ["skiing", "cycling"]
        },
        {
            "name" : "Bob",
            "hobbies" : ["birdwatching", "cycling", "bridge"]
        },
        {
            "name" : "Charlie",
            "hobbies" : ["skydiving", "boxing", "chess"]
        }
    ]
}

Our target array, “people”, is the array which will be split up and each element converted into a record. But the “hobbies” array means that it isn’t normalised. So how do we make a CSV file out of this? We don’t want one field to contain a list, and we don’t want a variable number of fields per record. Ideally we want one record per name-hobby pair.

The solution is to use the dimension operator “[]”.

To output one CSV record for every person’s hobby, the column definition would be [people] name, hobbies[]. The [] on the end of “hobbies” means to iterate over the “hobbies” array and put each value in turn into that column. The CSV output would look like this:

Alice,skiing
Alice,cycling
Bob,birdwatching
Bob,cycling
Bob,bridge
Charlie,skydiving
Charlie,boxing
Charlie,chess

Nested objects underneath dimensions

What if the array inside your entity object contains a series of JSON objects, rather than values? You can specify how to find the value inside each object of the sub-array using the dot-notation, as before. For example, suppose your JSON file is a list of houses and the people who live there:

{
    "houses": [
        {
            "address" : "221B Baker Street, London",
            "occupants" : [
                {
                    "name" : "Sherlock Holmes",
                    "occupation" : "Consulting Detective"
                },
                {
                    "name" : "John Watson",
                    "occupation" : "Doctor"
                }
            ]
        },
        {
            "address" : "62 West Wallaby Street",
            "occupants" : [
                {
                    "name" : "Wallace",
                    "occupation" : "Inventor"
                },
                {
                    "name" : "Gromit",
                    "occupation" : "Dog"
                }
            ]
        }
    ]
}

A suitable column definition would be [houses] address, occupants[].name, occupants[].occupation

This means produce a row containing the “address”, the “name” and “occupation” fields in each element of occupants from any “houses” elements.

wxjson2csv will recognise that the occupants[] in the second field and the occupants[] in the third field are the same expression, so it will only iterate through them once - you don’t need to worry about accidentally generating the cartesian product of all the names and occupations.

The resulting CSV would be:

"221B Baker Street, London",Sherlock Holmes,Consulting Detective
"221B Baker Street, London",John Watson,Doctor
"62 West Wallaby Street",Wallace,Inventor
"62 West Wallaby Street",Gromit,Dog

Sequence numbers and matrices

wxjson2csv can convert arrays of arrays, using dimensions of arbitrary depth as long as what’s specified in each column of the column definition corresponds to a scalar value in the object.

A list of two matrices might be specified like this in JSON:

{
    "matrices" : [
        {
            "matrixid" : 10001,
            "cells" : [ [ 0.1, -0.1,  0.8,  1.3],
                        [ 0.0,  0.4,  0.8,  1.2],
                        [-0.3, -0.9, -1.7, -2.5]
            ]
        },
        {
            "matrixid" : 10002,
            "cells" : [ [ 9.3,  8.7,  6.0],
                        [ 1.2,  3.4,  5.6]
            ]
        }
    ]
}

Your output records need to include the matrix ID, and not only the value of each cell but also its row and column number.

The seq() function takes a dimension expression and returns the current position in that array. So, specifying “matrices” as the target array, a suitable column definition would be [matrices] matrixid, seq(cells[]), seq(cells[][]), cells[][]

The last column, “cells[][]” means to iterate over the array cells, and for each (array) element in it, iterate over each element in that and make each one in turn the column’s value. The third column, “seq(cells[][])” means the position of the current cells[][] value in its array, and the second column, seq(cells[]) means the position of the current cells[] value in its array. In other words, the second column is the row number and the third column is the column number in the matrix.

The result would be this:

10001,0,0,0.1
10001,0,1,-0.1
10001,0,2,0.8
10001,0,3,1.3
10001,1,0,0.0
10001,1,1,0.4
10001,1,2,0.8
10001,1,3,1.2
10001,2,0,-0.3
10001,2,1,-0.9
10001,2,2,-1.7
10001,2,3,-2.5
10002,0,0,9.3
10002,0,1,8.7
10002,0,2,6.0
10002,1,0,1.2
10002,1,1,3.4
10002,1,2,5.6

The target array

According to the standard, a valid JSON document must be a single object, although that object can be arbitrarily complex, containing any number of arrays and nested objects.

Our JSON conversion code assumes there will be an array somewhere in the JSON object, which can be split up into each of its elements, and each element can be converted into one or more tabular records without reference to any other part of the JSON document.

In some cases, the input file does not contain a single JSON object, but several objects, one after the other. Alternatively, the document might contain only a JSON array, which might be our target array. In both of these cases, specifying no target array at all means that the conversion code will expect either an array of objects (in which case each object is converted individually, and the names in the column definition are in the namespace of that object), or a file containing many objects one after the other (again, each of these objects is given to the column definition to extract records).

Absent fields

Sometimes, not all the fields you want will be present in an object.

Let’s say this is your JSON input, containing details of some US Presidents:

{
    "presidents" : [
        {
            "firstname" : "Franklin",
            "middleinitial" : "D",
            "surname" : "Roosevelt",
            "start" : "1933-03-04",
            "end" : "1945-04-12"
        },
        {
            "firstname" : "John",
            "middleinitial" : "F",
            "surname" : "Kennedy",
            "start" : "1961-01-20",
            "end" : "1963-11-22"
        },
        {
            "firstname" : "George",
            "middleinitial" : "W",
            "surname" : "Bush",
            "start" : "2001-01-20",
            "end" : "2009-01-20"
        },
        {
            "firstname" : "Barack",
            "surname" : "Obama",
            "start" : "2009-01-20"
        }
    ]
}

You might use the column definition [presidents] firstname, middleinitial, surname, start, end to list the names of these presidents and the dates they entered and left office.

However, if you run wxjson2csv, giving the above column definition, you get this error message:

wxjson2csv: presidents.json: object 4, ending on line 28: value not present in
object: middleinitial
wxjson2csv: too many errors (more than 0)

Barack Obama doesn’t use his middle initial so it isn’t included in the object that represents him. In addition, his “end” field isn’t included because he was still the president when this data was created. By default, if you specify a JSON name in the column definition which does not exist in one of the objects in the target array, this is an error.

If you want a field to be optional, you can use the firstvalid() function in the column definition. This function takes an unlimited number of arguments, and returns the first argument whose evaluation does not cause a runtime error. In this case, you can use firstvalid(middleinitial, null), which means “the value of the middleinitial property, or null if that doesn’t exist”. Therefore the column definition [presidents] firstname, firstvalid(middleinitial, null), surname, start, firstvalid(end, null) will work in this case.

Note that an attribute setting comes after the column name it corresponds to, and consists of a series of name=value pairs separated by whitespace.

This will cause the following records to be generated:

Franklin,D,Roosevelt,1933-03-04,1945-04-12
John,F,Kennedy,1961-01-20,1963-11-22
George,W,Bush,2001-01-20,2009-01-20
Barack,,Obama,2009-01-20,

Further details on the expressions and functions you can use in the JSON format string are in the wxjson2csv man page available on the command line.

Handling JSON in wxloader and external table target strings

The Kognitio command line tool for loading data wxloader accepts the same JSON format string as wxjson2csv. Use the option “-j <format string>” or “-J <file containing format string>)” if you want to load the JSON data straight into Kognitio rather than convert it to CSV and then load it.

The options “-j” and “-J” also work in the target string of Kognitio external tables.