wxjsonprofile - JSON analysis and profiling tool

SYNOPSIS

wxjsonprofile [options] file.json

DESCRIPTION

wxjsonprofile is designed to help load JSON data into a tabular structure on Kognitio. When run it takes a JSON file, analyses it, and produces a table create statement and a format string, suitable for passing to wxloader or wxjson2csv, for each table.

By default, the SQL statement to create a table is written to a suitably-named .sql file, and the format string is to a .json2csv file, in the current working directory. A different directory can be specified with the -d option.

OPTIONS

-A arrayname

Use arrayname as the target array, that is, the array in the top-level JSON object which can be split into independent elements to be treated independently when loading records. If this is not given, wxjsonprofile guesses the target array; if the JSON file contains one root object, and a top-level array (that is, an array which is not inside another array) within that object contains more than half of all the scalar values in the file, it’s assumed to be the target array. If the file contains more than one root object, or contains one array of objects, there is no target array. This means that when loading with wxloader, the file will be split into these root objects, and each will be converted independently. JSON property names in the format string generated by wxjsonprofile are relative to each root object.

-b limit

Don’t sample more than limit bytes of the JSON file.

-d directory

Write the .sql and .json2csv output files in directory, not the current working directory.

--ext-file filenamepattern

If we are writing create-text for an external table (see -x) then set the file attribute in the target string to this value. If this is not specified, a placeholder value will be inserted into the target string which you must replace manually.

--force-write-format-file

Still write a separate .json2csv file containing the format string even if it will be written as part of an external table definition. See -x.

-h

Show help.

-k key1,key2,

Tell wxjsonprofile the name of one or more JSON properties (relative to an element of the target array, or to a root object if there is no target array) which uniquely identify an element of an array. This is useful when elements of the target array contain further dimensions (arrays) and these are to be loaded as dimension tables. If no key is given, wxjsonprofile does not create output files for dimension tables. The expression objectnumber() may be given to use the position of the object in the target array as a key.

-n

Show the structure of the JSON file, but don’t create any output files.

--no-geojson

Do not attempt to automatically detect and give special treatment to objects that look like GeoJSON geometry objects.

-t name

Specify the name of the table for the DDL. This will be the name of the main table, and any dimension tables will be called this name followed by a suitable suffix. The output filenames begin with the table name. If this is not given, the table name is assumed to be the name of the input file, minus any “.json” or other suffix.

-x connector, --ext-connector connector

Write create-text for an external table rather than an ordinary table, and use this connector name. The JSON format string will be included in the external table’s target string, so a separate .json2csv file will not be produced unless --force-write-format-file is supplied.

GeoJSON

GeoJSON is a standard for representing geospatial data such as points, lines and polygons in JSON. wxjsonprofile automatically detects GeoJSON objects by identifying if an object called geometry contains an attribute called type whose value is one of the strings Point, MultiPoint, LineString, MultiLineString, Polygon or MultiPolygon. These geometry types have a coordinates array, whose dimensionality differs depending on the type of the object.

GeoJSON type

coordinates property

Point

1-dimensional array of two or three numbers

MultiPoint

2-dimensional array of numbers

LineString

2-dimensional array of numbers

MultiLineString

3-dimensional array of numbers

Polygon

3-dimensional array of numbers

MultiPolygon

4-dimensional array of numbers

An array of geometries with their associated properties (a FeatureCollection) might contain many different types of geometries, so when we load the data each type needs to be treated separately. Unless the option --no-geojson is passed, wxjsonprofile will automatically detect GeoJSON and treat it as a special case in the following ways:

  • The bottom-level array of coordinates (the one containing the coordinates themselves) is not given a sequence number column and one row per element like other arrays. Each element (coordinate) in the array is given its own column, up to a maximum of three columns. This means all the coordinates for the same point are loaded into the same row. The columns are called X, Y and Z. Typically in GeoJSON these represent longitude, latitude and (if present) altitude respectively, but other coordinate systems may be used. If the array contains more than three elements, the rest are ignored.

  • The table which stores the coordinates is given as many sequence number columns as necessary for the most complex geometry object appearing in the target array. For example, if the features array contains MultiPolygons, Polygons and Points, the MultiPolygon is the most complex type so the table is given a polygon sequence number, a ring sequence number and a point sequence number, as well as one column for each coordinate in a point. For Polygon objects the first sequence number would be fixed as 0, and for Point objects all three sequence numbers would be 0.

  • If the target array contains more than one type of GeoJSON object, the wxjson2csv format string uses conditional expressions based on the value of geometry.type to decide how the object should be converted.

EXAMPLES

Analysing data, creating a table, and loading records

This example takes a simple JSON file containing an array of objects, each of which should be converted into one table row. This is the JSON file, cities.json, which contains a list of cities and their locations:

{
  "cities" : [
    {
      "name" : "Belfast", "lat" : 54.583330, "long" : -5.933330
    },
    {
      "name" : "Cardiff", "lat" : 51.480000, "long" : -3.180000
    },
    {
      "name" : "Edinburgh", "lat" : 55.952060, "long" : -3.196480
    },
    {
      "name" : "London", "lat" : 51.508530, "long" : -0.125740
    }
  ]
}

If you run wxjsonprofile on cities.json with no extra arguments, it prints a summary of the structure and generates a .sql file and a .json2csv file:

$ wxjsonprofile cities.json
Kognitio WX2 JSON Profiling Tool v8.01.01
(c)Copyright Kognitio Ltd 2015-2015.

cities.json: one JSON object.

root         1 obj
    cities[] [4]: 4 obj
        name 4 str(9)
        lat  4 dec(8,6)
        long 4 dec(7,6)

cities.json: no target array specified... looks like it might be "cities"

cities, 3 columns, target array cities

Writing cities.sql... done.
Writing cities.json2csv... done.

The summary tells us that the file contains one object, the root object, and that this root object contains an array called “cities” which has four object elements.

In these objects, the field “name”, “lat” and “long” were each seen four times. “name” is always a string, up to nine characters long; “lat” and “long” are always decimal fields.

No target array has been specified, but wxjsonprofile has correctly assumed that “cities” is the name of the array we’re interested in.

wxjsonprofile has generated two output files, cities.sql and cities.json2csv.

cities.sql is the SQL to run to create a table suitable for holding this data, and it looks like this:

CREATE TABLE cities (
    name VARCHAR(10) char set utf8,     -- name
    lat  DECIMAL(8,6),                  -- lat
    long DECIMAL(7,6)                   -- long
);

Note that VARCHAR maximum lengths are determined by the longest string wxjsonprofile saw for that field, rounded up to the nearest power of 10.

cities.json2csv contains a JSON format string which can be given to wxloader or wxjson2csv, and it describes how to create a record from each element. The APPLY directive applies the firstvalid() wxjson2csv function to each field, so that if any object does not contain the named field, the null value is used.

[cities]
APPLY(firstvalid(?, null)),
name,
lat,
long

To load this data into a Kognitio database table, first execute the SQL to create the table, using wxsubmit (for example):

wxsubmit -s <server> -u <user> cities.sql
(enter password)

Then use wxloader to load the JSON data, into the table using the row recipe detailed in cities.json2csv:

wxloader -s <server> -u <user> -J cities.json2csv -t cities cities.json
(enter password)

Dimensions

JSON objects are often not flat but have arrays in them, and these arrays need to be converted into dimension tables. Let’s extend our cities.json example, so that each city object now contains a list of some of its main railway stations. We’ll call this cities_stations.json:

{
  "cities" : [
    {
      "name" : "Belfast", "lat" : 54.583330, "long" : -5.933330,
      "railwaystations" : [
          { "name" : "Belfast Central" },
          { "name" : "Belfast Great Victoria Street" }
      ]
    },
    {
      "name" : "Cardiff", "lat" : 51.480000, "long" : -3.180000,
      "railwaystations" : [
          { "name" : "Cardiff Central", "code" : "CDF" },
          { "name" : "Cardiff Queen Street", "code" : "CDQ" }
      ]
    },
    {
      "name" : "Edinburgh", "lat" : 55.952060, "long" : -3.196480,
      "railwaystations" : [
          { "name" : "Edinburgh Waverley", "code" : "EDB" }
      ]
    },
    {
      "name" : "London", "lat" : 51.508530, "long" : -0.125740,
      "railwaystations" : [
          { "name" : "London Euston", "code" : "EUS" },
          { "name" : "London King's Cross", "code" : "KGX" },
          { "name" : "London Liverpool Street", "code" : "LST" },
          { "name" : "London Marylebone", "code" : "MYB" },
          { "name" : "London Paddington", "code" : "PAD" },
          { "name" : "London Victoria", "code" : "VIC" },
          { "name" : "London Waterloo", "code" : "WAT" }
      ]
    }
  ]
}

If you run wxjsonprofile on this file with no other arguments, you will find it generates .sql and .json2csv files only for the cities as before, and not for the railway stations. This is because wxjsonprofile does not know how each row in the railway stations table should reference its city.

For this reason, you need to tell wxjsonprofile a suitable key field, or set of fields, which uniquely identify a city object. This will effectively become the primary key of the main table, and the foreign key of the dimension table(s). These key fields are JSON property names relative to an element of the target array.

In our case, the city name uniquely identifies the city, so we would tell wxjsonprofile that the name field can be used for this purpose with -k name. If we had to specify a set of field names, which were unique when combined, we would separate them with commas, e.g. -k field1,field2. This might be useful if our JSON file had more than one city of the same name, because we could use the lat and long fields too. Field names which are members of other objects are referenced using a dot, e.g. outerobject.fieldname.

For example:

$ wxjsonprofile -k name cities_stations.json
Kognitio WX2 JSON Profiling Tool v8.01.01
(c)Copyright Kognitio Ltd 2015-2015.

cities_stations.json: one JSON object.

root                      1 obj
    cities[]              [4]: 4 obj
        name              4 str(9)
        lat               4 dec(8,6)
        long              4 dec(7,6)
        railwaystations[] [1-7, 3.0]: 12 obj
            name          12 str(29)
            code          10 str(3)

cities_stations.json: no target array specified... looks like it might be "cities"

BASE TABLE
        cities_stations, 3 columns, target array cities

DIMENSION TABLES
        cities_stations_railwaystations (railwaystations), 4 columns

Writing cities_stations.sql... done.
Writing cities_stations.json2csv... done.
Writing cities_stations_railwaystations.sql... done.
Writing cities_stations_railwaystations.json2csv... done.

Note that the structure summary now shows railwaystations[], which is an array. This array appears multiple times (once in each city object), and the numbers inside the square brackets indicate the length of the shortest example of this array (1), the longest (7) and the mean length of this array (3.0). It also tells us that the elements of this array were objects 12 out of 12 times.

SQL and format string files for two tables are created. The dimension table, CITIES_STATIONS_RAILWAYSTATIONS, has one column for the station name, one column for the position in the array of stations, and one column each for the station name and station code.

GeoJSON

The mythical world of Hoblong is a flat 10x10 square, whose (0,0) coordinate is in the south-west corner. Hoblong consists of four countries. Pythagoria is a triangular country covering the south-west diagonal half of the western half of the world. The north-eastern corner of Hoblong is a right-angled isosceles triangular country called Dogear, its two short sides having length 4. In the remaining middle part of Hoblong are the two countries North Quadrilania and South Quadrilania. These two countries are divided by the east-west line y=6, which at its eastern end meets the southern tip of Dogear.

In addition to this, there is a small triangular region bordering the south of the world and contained entirely within Pythagoria, which, much to the historical resentment of the Pythagorians, is an exclave belonging to South Quadrilania.

Happily, the Pythagorians and South Quadrilanians have for the time being put aside their differences for the benefit of their shared interest in digital cartography. They have collaborated to produce a GeoJSON representation of the countries of their world, as follows.

{
    "type": "FeatureCollection",
    "features": [
        {
            "properties": {
                "code": "DO",
                "name": "Dogear",
                "headofstate": "King Bottleneck IV"
            },
            "geometry": {
                "type": "Polygon",
                "coordinates": [
                        [ [6,10], [10,6], [10,10], [6,10] ]
                ]
            }
        },
        {
            "properties": {
                "code": "NQ",
                "name": "North Quadrilania",
                "headofstate": "Queen Cubehead XXVII"
            },
            "geometry": {
                "type": "Polygon",
                "coordinates": [
                        [ [0,10], [2,6], [10,6], [6,10], [0,10] ]
                ]
            }
        },
        {
            "properties": {
                "code": "SQ",
                "name": "South Quadrilania",
                "headofstate": "Kevin the Magnificent"
            },
            "geometry": {
                "type": "MultiPolygon",
                "coordinates": [
                        [ [ [2,6], [5,0], [10,0], [10,6], [2,6] ] ],
                        [ [ [1,2], [1,0], [2,0], [1,2] ] ]
                ]
            }
        },
        {
            "properties": {
                "code": "PY",
                "name": "Pythagoria",
                "headofstate": "Lord Hypotenuse"
            },
            "geometry": {
                "type": "Polygon",
                "coordinates": [
                        [ [0,10], [0,0], [5,0], [0,10] ],
                        [ [1,2], [2,0], [1,0], [1,2] ]
                ]
            }
        }
    ]
}

Note that all the countries can be represented by Polygon objects except South Quadrilania, which is represented as a MultiPolygon in order to include its exclave. The corresponding hole in the middle of Pythagoria is represented as a hole in the Polygon - in GeoJSON, a Polygon object is actually an array of polygons (each of which is an array of points forming a ring), the first one being the main polygon and subsequent rings being holes in the main polygon.

If we run wxjsonprofile on this JSON file, using the properties.code property of each object as the key, we get the following output:

$ wxjsonprofile -k properties.code hoblong.json
Kognitio WX2 JSON Profiling Tool v8.01.01
(c)Copyright Kognitio Ltd 2015-2015.

hoblong.json: one JSON object.

root                            1 obj; 1 FeatureCollection
    type                        1 str(17)
    features[]                  [4]: 4 obj
        properties              4 obj
            code                4 str(2)
            name                4 str(17)
            headofstate         4 str(23)
        geometry                4 obj; 3 Polygon, 1 MultiPolygon, coord X 26 int1, coord Y 26 int1
            type                4 str(12)
            coordinates[][][]   [1-2, 1.5][1-5, 3.2][2-5, 2.3]: 34 int1
            coordinates[][][][] [1-2, 1.5][1-5, 3.2][2-5, 2.3][2]: 18 int1

hoblong.json: no target array specified... looks like it might be "features"

BASE TABLE
        hoblong, 4 columns, target array features

DIMENSION TABLES
        hoblong_geometry_coordinates (geometry.coordinates), 6 columns

Writing hoblong.sql... done.
Writing hoblong.json2csv... done.
Writing hoblong_geometry_coordinates.sql... done.
Writing hoblong_geometry_coordinates.json2csv... done.

We won’t examine hoblong.sql and hoblong.json2csv in any detail, as these are just the code, name and head of state of the country. The interesting part is the SQL and format string for the coordinates.

$ cat hoblong_geometry_coordinates.sql
CREATE TABLE hoblong_geometry_coordinates (
    properties_code VARCHAR(10),     -- properties.code
    polygon_seq     INTEGER,         -- polygon sequence number
    ring_seq        INTEGER,         -- ring sequence number
    point_seq       INTEGER,         -- point sequence number
    X               INTEGER,         -- coord [0]
    Y               INTEGER          -- coord [1]
);

$ cat hoblong_geometry_coordinates.json2csv
[features]
if (valid(geometry.type) && geometry.type == "Polygon") {
        APPLY(firstvalid(?, null)),
        properties.code,
        0,
        seq(geometry.coordinates[]),
        seq(geometry.coordinates[][]),
        geometry.coordinates[][][0],
        geometry.coordinates[][][1]
}
if (valid(geometry.type) && geometry.type == "MultiPolygon") {
        APPLY(firstvalid(?, null)),
        properties.code,
        seq(geometry.coordinates[]),
        seq(geometry.coordinates[][]),
        seq(geometry.coordinates[][][]),
        geometry.coordinates[][][][0],
        geometry.coordinates[][][][1]
}

Note that we’ve generated two row recipes in the format string: one to use if the geometry type is a Polygon, and one for if it’s a MultiPolygon. For Polygon objects, the polygon sequence number is always 0 because there’s only one polygon.

Once loaded into the hoblong_geometry_coordinates table, the data looks like this:

CODE

POLYGON_SEQ

RING_SEQ

POINT_SEQ

X

Y

DO

0

0

0

6

10

DO

0

0

1

10

6

DO

0

0

2

10

10

DO

0

0

3

6

10

NQ

0

0

0

0

10

NQ

0

0

1

2

6

NQ

0

0

2

10

6

NQ

0

0

3

6

10

NQ

0

0

4

0

10

SQ

0

0

0

2

6

SQ

0

0

1

5

0

SQ

0

0

2

10

0

SQ

0

0

3

10

6

SQ

0

0

4

2

6

SQ

1

0

0

1

2

SQ

1

0

1

1

0

SQ

1

0

2

2

0

SQ

1

0

3

1

2

PY

0

0

0

0

10

PY

0

0

1

0

0

PY

0

0

2

5

0

PY

0

0

3

0

10

PY

0

1

0

1

2

PY

0

1

1

2

0

PY

0

1

2

1

0

PY

0

1

3

1

2

EXIT STATUS

wxjsonprofile returns 0 on success, and 1 on failure.

REFERENCES

GeoJSON: http://geojson.org/

AUTHOR

Kognitio Ltd.