wxjson2csv - Convert JSON to CSV

SYNOPSIS

wxjson2csv -j col1,col2, [options] file.json

DESCRIPTION

wxjson2csv takes a JSON file and a column definition list, and creates a CSV file from them according to column definitions supplied by the user. The column definition format is described in the COLUMN DEFINITION FORMAT section, and it is the same format given to wxloader when loading a a JSON file.

OPTIONS

-a array
Specify the name of the target JSON array, relative to the root object, whose objects will be used to generate CSV records. This may instead be given in square brackets at the start of the column definition list. Each object in the array generates one CSV record, or more than one if the object contains arrays which need to be iterated over (see the COLUMN DEFINITION FORMAT section). If this option is not specified, and the target array is not specified in the column definition format, the input is expected to be an unnamed JSON array or a series of JSON objects.
-E n
Stop and fail if we exceed n conversion errors. The default value is 0.
-f char
CSV field separator character. The default is a comma (,).
-h
Show help and exit.
-jcol1,col2,
This option tells wxjson2csv which JSON property of each element of the array specified with -a to use for each CSV column. For example, if each JSON object in the array has three elements, ID, Name and Department, and you want them to come out in that order, you would use -j “ID,Name,Department”. See the COLUMN DEFINITION FORMAT section for a full discussion of the features available. Exactly one of -j or -J is required.
-J file
Pass the column definitions to wxjson2csv in a file rather than directly on the command line. Exactly one of -j or -J is required.
-o file
Output CSV to file. The default is standard output.
-n string
Output string in place of any null value. The default is to leave a CSV field blank if it represents null.

COLUMN DEFINITION FORMAT

The column definition format for wxjson2csv, and for wxloader, is a comma-separated list of expressions, specifying what should be in each output column. This expression may, and usually does, contain JSON property names, which are relative to an object in the target array.

A simple example - one row per JSON object

You have a JSON file containing a series of JSON objects, each of which describes a country, like this:

{
  "name" : "United Kingdom",
  "population" : 64000000,
  "capital" : {
      "name": "London", "lat": 51.5072, "long": -0.1275
  }
}
{
  "name" : "United States of America",
  "population" : 316000000,
  "capital" : {
      "name": "Washington DC", "lat": 38.8951, "long": -77.0367
  }
}
{
  "name" : "Australia",
  "population" : 23000000,
  "capital" : {
      "name": "Canberra", "lat": -35.3075, "long": 149.1244
  }
}

The requirement is for one CSV record per country, containing the country name, capital city, latitude and longitude, and country population, in that order.

By default, if no target array is specified, wxjson2csv expects the input to be in the format above - a series of independent objects. Each object is treated independently, and the column definition list is applied to each object. The column definition list specifies what to put in which CSV column, and is given with -j.

The argument to the -j option should list for each column the JSON property which should be used for that column, as follows:

name, capital.name, capital.lat, capital.long, population

Note that values inside nested objects are addressed by putting a dot between the object name and the value name.

Array elements

To access a specific element of an array by number, append [n] to the array name, where n is the desired index. The first element of an array is element 0. For example, suppose your input JSON consists of places and co-ordinates, like this:

{
    "Name" : "Bracknell",
    "Coords" : [ -0.7566, 51.4083 ]
}
{
    "Name" : "Reading",
    "Coords" : [ -0.9700, 51.4550 ]
}
{
    "Name" : "Edinburgh",
    "Coords" : [ -3.1600, 55.9483 ]
}

To make a file where each record contains the name of the place, the longitude and the latitude, the column definition list would look like this:

Name, Coords[0], Coords[1]

Naming a target array

The previous two examples assume that the input is a simple series of JSON objects, each of which can be used independently and generates one or more records. This is not always the case. Often you will find the above example formatted like this, with the list of objects in a named array in a top-level object:

{
    "Settlements": [
        {
            "Name" : "Bracknell",
            "Coords" : [ -0.7566, 51.4083 ]
        },
        {
            "Name" : "Reading",
            "Coords" : [ -0.9700, 51.4550 ]
        },
        {
            "Name" : "Edinburgh",
            "Coords" : [ -3.1600, 55.9483 ]
        }
    ],
    "Description": "A list of some settlements in the UK."
}

By default, wxjson2csv would load the entire top-level object into memory at once, which is probably not what you want if there are a very large number of elements in the array. So you can name an array in the top-level object which contains the objects wxjson2csv should be looking at: the target array, which in this case is Settlements.

To name a target array, put it in square brackets before the column definition list, like this:

[Settlements] Name, Coords[0], Coords[1]

This tells wxjson2csv to find the array “Settlements” in the top-level object, and for each element of that array, generate three output columns containing the settlement name, longitude and latitude. Each element of the target array must be able to be converted independently from other objects and from any higher-level hierarchy. When loading JSON with wxloader, for example, the elements of the target array are shared out amongst the database nodes for independent conversion, so the column definition list could not refer to the Description element of the top-level object, for example.

Arrays - generating multiple rows from each object

Objects in JSON files are often denormalised, in that an object may contain lists of values as well as scalars. For example, consider the following JSON file describing students and modules:

{
    "Students": [
        {
            "StudentId" : 100214,
            "StudentName" : "John Smith",
            "Modules" : [
                { "Id": "CS203", "Name": "Language Design" },
                { "Id": "CS236", "Name": "Data Structures" }
            ]
        },
        {
            "StudentId" : 100215,
            "StudentName" : "Jill Watson",
            "Modules" : [
                { "Id": "CS301", "Name": "Algorithmic Complexity" },
                { "Id": "CS319", "Name": "Database Theory" },
                { "Id": "CS331", "Name": "Neural Computing" }
            ]
        }
    ]
}

Suppose you want a CSV file linking student IDs to module IDs. You don’t want many module IDs in one field, but a separate record for each student-module relationship.

[Students] StudentId, Modules[].Id

The dimension operator [] tells wxjson2csv to generate a row for every element in the JSON array “Modules”. The resulting CSV would look like this:

100214,CS203
100214,CS236
100215,CS301
100215,CS319
100215,CS331

If the module name is required as well, this can be added:

[Students] StudentId, Modules[].Id, Modules[].Name

To make:

100214,CS203,Language Design
100214,CS236,Data Structures
100215,CS301,Algorithmic Complexity
100215,CS319,Database Theory
100215,CS331,Neural Computing

wxjson2csv recognises that the Modules[] in the second column and the Modules[] in the third column are the same array, and only iterates over them once to pull out Id and Name. In other words, you don’t need to worry about accidentally generating the cartesian product of all module IDs and names.

Multiple dimensions

If the format string refers to two or more independent dimensions - that is, dimensions which are not children of other dimensions referred to in the format string - then wxjson2csv iterates over each independent dimension in turn, with the other dimensions returning null.

For example, this JSON file has one input record which contains two arrays, cats and dogs:

{
    "petowners" : [
        {
            "owner" : "Alice",
            "cats" : [ "Scrumpy", "Jess", "Pewter" ],
            "dogs" : [ "Max", "Bernard" ]
        }
    ]
}

Let’s use this format string:

[petowners] owner, cats[], dogs[]

The CSV output would contain five rows of three columns as follows:

Alice,Scrumpy,
Alice,Jess,
Alice,Pewter,
Alice,,Max
Alice,,Bernard

Sequence numbers and nested arrays

It is possible to project as a CSV field the current position in a particular dimension by using the seq function on the desired dimension.

As an example, consider this JSON file which represents a matrix of numbers.

{
    "matrixid" : 1001,
    "description" : "A 3x4 matrix",
    "matrix" : [
            [ 1.3, 0, -2.5, 4.1 ],
            [ 0.1, -0.4, 3.0, 0.9 ],
            [ -2.0, -1.0, 0, 1.0 ]
    ]
}

The requirement is to arrange this as CSV, where each CSV record represents a cell of the matrix and contains the matrix ID, the row number, the column number, and the cell’s value. An additional complication is that the matrix property is not a flat array but a two-dimensional one.

The seq function evaluates to the current position in the named dimension, rather than the value at that position. The sequence number starts at 0 for the first element of the array and increments by 1 each time, although this behaviour can be changed (see FUNCTION REFERENCE). A suitable column definition list would be as follows:

matrixid, seq(matrix[]), seq(matrix[][]), matrix[][]

For each cell in the matrix, the second column gives the row number of the cell (the position in the array of arrays), the third column gives the column number (the position in the next-level array), and the fourth column gives the value. Note that for this JSON file there is no target array - the root object is the only object we’re looking at and the property names in the column definitions are relative to that.

The two dimensions matrix[] and matrix[][] multiply together - that is, you get every column value for every row - rather than being treated independently like in the previous example, because one dimension is an ancestor of the other.

In this case the resulting rows would be:

1001,0,0,1.3
1001,0,1,0
1001,0,2,-2.5
1001,0,3,4.1
1001,1,0,0.1
1001,1,1,-0.4
1001,1,2,3.0
1001,1,3,0.9
1001,2,0,-2.0
1001,2,1,-1.0
1001,2,2,0
1001,2,3,1.0

Empty arrays

If a column definition refers to an array which is empty, that array will not generate any rows. Consider a JSON file which contains information about houses and their occupants:

{
    "Houses" : [
        {
            "Address" : "221B Baker Street",
            "Occupants" : [ "Sherlock Holmes", "John Watson" ]
        },
        {
            "Address" : "62 West Wallaby Street",
            "Occupants" : [ "Wallace", "Gromit" ]
        },
        {
            "Address" : "13 Deserted Lane",
            "Occupants" : []
        }
    ]
}

The column definition [Houses] Address, Occupants[] would generate the following CSV file:

221B Baker Street,Sherlock Holmes
221B Baker Street,John Watson
62 West Wallaby Street,Wallace
62 West Wallaby Street,Gromit

Note there are no rows for the house with no occupants. If you want an empty array to always generate one row, with the absent values treated as null, use [+] instead of [] for that array. So the column definition [Houses] Address, Occupants[+] would generate this:

221B Baker Street,Sherlock Holmes
221B Baker Street,John Watson
62 West Wallaby Street,Wallace
62 West Wallaby Street,Gromit
13 Deserted Lane,

FILTER CONDITIONS

The JSON input may contain many objects structured differently, and you might only want to convert those objects which satisfy a particular condition. To do this, put if (conditional expression) immediately before the column definition list, and after the target array specification if there is one. Objects which do not satisfy the condition are silently ignored.

The conditional expression can be any expression which can be converted to a boolean value. The syntax for the conditional expression is the same as for the column definitions, and is explained more fully in the next section, but here are some examples.

if (geometry.type == “Polygon”)
Only process those objects whose member geometry.type is the string “Polygon”.
if (valid(dateofbirth))
Convert only those objects for which dateofbirth is a valid expression - that is, where the object contains a member called dateofbirth.
if (valid(coords) && isarray(coords) && length(coords) > 0 && !isarray(coords[0]))
Convert only those objects which contain a value called coords, that is an array, that array is not empty, and the first element of it is not itself an array. Note that if coords is not valid at all, valid(coords) will return false, so the rest of the expression will not be evaluated, so it won’t cause a runtime error.

SYNTAX REFERENCE

The full column definition script consists of an optional target array enclosed in square brackets:

[arrayname]

followed by an optional filter condition:

if (conditional expression)

followed by a comma-separated list of expressions corresponding to the desired values of the output columns, which may optionally be enclosed in braces:

{ expression, expression, }

For example:

[features]
if (geometry != null && geometry.type == "Polygon") {
    geometry.properties.region_name,
    geometry.properties.area_hectares,
    seq(geometry.coordinates[]),
    seq(geometry.coordinates[][]),
    geometry.coordinates[][][0],
    geometry.coordinates[][][1]
}

Expression syntax

Expressions may contain constants, arithmetic expressions, identifiers and function calls. The syntax is similar to C or Java. Every value is a boolean, an integer, a floating-point number, a string, a null, an array of values or an object containing a list of name-value pairs. These are the types defined by JSON.

Operators

The following operators are supported. They are presented in groups in descending order of precedence (e.g. 3 + 4 * 5 is 23, not 35). Operators in the same group have the same precedence, and are evaluated left-to-right unless otherwise stated (e.g. 10 - 3 + 1 is 8, not 6).

OPERATOR     DESCRIPTION
--------------------------------------------------------------
()           parentheses (for overriding operator precedence)
func(...)    function call
array[i]     array index
name.name    object member access
array[]      dimension operator

array[+]     dimension operator (force non-empty)
--------------------------------------------------------------
!exp         logical not (right associative)
-exp         negation (right associative)
--------------------------------------------------------------
exp * exp    multiplication
exp / exp    division
exp % exp    modulo
--------------------------------------------------------------
exp + exp    addition or string concatenation
exp - exp    subtraction
--------------------------------------------------------------
exp < exp    less than
exp <= exp   less than or equal to
exp >= exp   greater than or equal to
exp > exp    greater than
--------------------------------------------------------------
exp == exp   is equal to
exp != exp   is not equal to
--------------------------------------------------------------
exp && exp   logical and
--------------------------------------------------------------
exp || exp   logical or

The conditional operators && and || accept boolean expressions on each side and evaluate to a boolean. The comparison operators <, <=, >=, >, == and != take expressions on each side and evaluate to a boolean, provided the types of each side are such that a comparison makes sense. If an operand that should be boolean isn’t boolean, it is converted to boolean as follows: numbers evaluate to true if they are nonzero and false if zero, strings evaluate to true if nonempty and false if empty, arrays and objects evaluate to true if they contain at least one element and false if not, and null values are taken as false.

The conditional operators && and || are evaluated left-to-right, and if the result can be determined from the left hand operand, the right hand operand is not evaluated. So it is safe to have an expression like if (valid(foo) && foo != 0): if foo is not valid, foo != 0 will not be evaluated so you won’t get a runtime error.

The division operator / performs integer division if both sides are integers, and floating point division otherwise. If either side is not a number, this is a runtime error.

The modulo operator % only accepts integers on either side.

The array index operator [n] takes an integer - it is an error to use any other type as the index, or to apply it to any type other than an array. The first element’s index is 0. If the index is negative, it counts from the end of the array, so array[-1] returns the last element in the array. If the index value is outside the bounds of the array, a runtime error is thrown and the input record is rejected.

Identifiers refer to names of JSON properties in the input object. Identifiers may contain letters, numbers and underscores. To refer to a JSON property which contains other characters, put it in single quotes, e.g. ‘my-json-property’.

The special identifier $ refers to the root, that is, the input object itself - an element of the target array. It is not possible to refer to any higher-level objects, such as properties of the object which contains the target array. Use of the $ identifier is not usually necessary, as $.foo.bar is more easily written as foo.bar, but the $ identifier can be useful to get the JSON representation of the object using the prettyprint function, i.e. prettyprint($).

FUNCTION REFERENCE

The following functions are defined. In the event of a runtime error, the input record is rejected, and by default this terminates the whole conversion.

abs(n)
Return the absolute value of n, that is, n if n >= 0 or -n otherwise. If n is not a number, this is a runtime error.
firstvalid(arg1, arg2, )
Return the value of the first argument whose evaluation does not throw a runtime error. If all the arguments throw a runtime error, you get the runtime error thrown by the last argument.
float(arg [, default ])
Convert the value arg to a floating-point number. arg may evaluate to an integer, floating-point number, string, or boolean (true becomes 1.0, false becomes 0.0). If it is any other type, or null, or if arg is a string and does not parse as a number, default is returned; or if default is not specified, a runtime error is thrown.
index(haystack, needle [, default ])
If haystack is a string, return the position of the start of the first occurrence of the string needle in haystack. The start of the string is position 0. If haystack is an array, return the position of the first element of haystack which compares equal to needle. If there is no match, default is returned if it was supplied, or a runtime error is thrown if not.
iindex(haystack, needle [, default ])
As index(), but string comparisons are case-insensitive. Note that if haystack is an array which contains objects, this applies throughout the comparison to all string values within the objects.
inputbasename()
Return the name of the JSON file being converted, with any leading directory components removed. For example, if the input file is given as /home/joe/json/myjsonfile.json, we return “myjsonfile.json”. This is equivalent to inputfilename(-1) (see below).
inputfilename()
Return the name of the JSON file being converted.
inputfilename(index) [integer argument]
If the argument is an integer, use it as an index to select a component from the input file’s path name, and return that component. The first component is 0. If index is negative, the component is selected by counting backwards from the end. For example, if the input file’s path name is /home/joe/mydata/json/foobar.json, inputfilename(0) would return “home”, inputfilename(3) would return “json”, inputfilename(-1) would return “foobar.json” and inputfilename(-4) would return “joe”. If index refers to a component which does not exist, a runtime error is thrown.
inputfilename(name) [string argument]
If the argument is a string, we scan the input filename for a directory component of the form name=value, where the name part matches the name argument to the function, and return the value. value is expected to be percent-encoded, so %xx, where xx are hex digits, will be translated to the appropriate character. For example, if the input file’s path name is /home/joe/json/country=UK/year=2017/data.json, then inputfilename(“country”) returns “UK” and inputfilename(“year”) returns “2017”. If there is no directory component with the requested name, a runtime error is thrown.
int(arg [, default [, base ] ])
Convert the value of arg to an integer. arg may evaluate to an integer, floating-point number, string, or boolean (true becomes 1, false becomes 0). If it is any other type, or if the value of arg is out of range for a signed 64-bit integer, or if arg is a string and does not parse as an integer, default is returned, or if default is not given, a runtime error is thrown. base is 10 by default, and it affects how strings are converted. For example, to convert from a hexadecimal string to an integer, set the base to 16. If base is 0, the string is assumed to represent an octal number if it begins with ‘0’, a hexadecimal number if it begins with ‘0x’, or a decimal number otherwise.
isarray(arg)
Return true if arg evaluates to an array, false otherwise.
isbool(arg)
Return true if arg evaluates to a boolean value, false otherwise.
isfloat(arg)
Return true if arg evaluates to a floating-point number, false otherwise.
isint(arg)
Return true if arg evaluates to an integer, false otherwise.
isobject(arg)
Return true if arg evaluates to an object, false otherwise.
isstr(arg)
Return true if arg evaluates to a string, false otherwise.
length(arg)
Return, as an integer, the number of elements in arg if arg is an array, the number of name-value pairs if arg is an object, or the length in bytes if arg is a string. If arg is any other type, a runtime error is thrown.
objectnumber()
Return the current input object number. The first object is object 1. This is effectively the current position in the target array, plus 1.
prettyprint(arg [, indent])
Return a human-readable JSON text representation of arg. If indent is given, it is the number of spaces to indent each nested level by. If indent is negative, it’s the number of tabs. The default is -1, so each line is indented by one tab per level of depth.
rindex(haystack, needle [, default ])
As index, but the position of the last occurrence of needle in haystack is returned, rather than the first.
riindex(haystack, needle [, default ])
As iindex, but the position of the last occurrence of needle in haystack is returned, rather than the first.
seq(dim[] [, start [, step ] ] )
Return the current position in the dimension dim[], which must be written with its dimension operator []. The position of the first element is 0, the next 1, and so on. This behaviour can be changed with the start and step arguments, which must be integers if supplied at all. These default to 0 and 1 respectively, so seq(dim[], start, step) is equivalent to seq(dim[]) * step + start.
str(arg)
Convert the value of arg to a string. If arg is a string it is returned unchanged. If arg is an integer, it is converted in the obvious way. If arg is a floating point number, it is formatted like 1.23456789012345e+012. If arg is a boolean, the string “true” or “false” is returned as appropriate. If arg is null, the string “null” is returned. If arg is an array or an object, a JSON representation of the array or object is returned. This JSON representation has unnecessary whitespace removed to save space. If you want a JSON object to be formatted in a human-readable way, use prettyprint().
strptime(input, format)
Read the string input as a date, time or timestamp, according to the string format, which must be a format acceptable to strptime(3). Return a string containing an SQL-format date (e.g. 2015-08-31), time (13:21:34) or timestamp (2015-08-31 13:21:34), depending on whether a date, time or both were converted by the format string.
substr(arg, start [, length ] )
Return a portion of the string arg. If arg is not a string, a runtime error is thrown. The returned portion starts with byte position start in arg, and is length bytes long, limited by the end of the string. The position of the first character is 0. If length is not given, or is negative, or reaches beyond the end of the string, the portion of arg between start and the end of the string, inclusive, is returned. If start is negative, it counts backwards from the end of the string, limited by the start of the string. If start is greater than or equal to arg’s length, you get an empty string.
substr examples
Call
valid(arg)
Return false if the evaluation of arg would throw a runtime error, and return true otherwise. This is useful for checking for the existence of property names in the filter condition: if the member foo does not exist in the object, attempting to use the value of foo would cause a runtime error, but valid(foo) would return false with no runtime error.

DIRECTIVES

Directives look like function calls with capitalised names. They may be used as expressions in the column list, but they do not evaluate to anything and they do not produce an output column. Their purpose is to alter how subsequent expressions are interpreted.

APPLY(exp, exp, …)
The APPLY directive applies the given expressions to every subsequent column expression, using the placeholder token ? to represent the replaced expression. For example, APPLY(? + 1) would add 1 to each column value. APPLY(firstvalid(?, null)) would set to null any column whose evaluation would have thrown a runtime error. If more than one argument is given, their effects are compounded, applied in order. For example, APPLY(f(?), g(?)) means every subsequent column expression exp becomes g(f(exp)).
MULTIPLY_DIMENSIONS()
If this directive is included, it must come before any column definitions. Its effect is to cause independent dimensions to be multiplied together so the cartesian product of their values is generated, rather than the default behaviour which is to iterate over each independent dimension in turn. For example, in the cats and dogs example above (see the Multiple dimensions section), this directive would cause a record to be generated for every cat with every dog.
ROOT(name)
The ROOT directive causes identifiers in subsequent expressions to be reckoned relative to the object name within the input object, rather than the input object itself. This means that if the object contains an object subobject, from which you want the values a, b and c, “ROOT(subobject), a, b, c” is short for “subobject.a, subobject.b, subobject.c”. The special root expression $ is affected by the ROOT directive.

Directives do not apply to later directives, only to later expressions. Also, if you use the same directive a second time in the list, its effect replaces the first, rather than compounding upon it, for subsequent expressions. So if you say ROOT(foo.bar), then list some expressions, identifiers in those expressions are relative to foo.bar, but if you then say ROOT(baz), identifiers in later expressions are relative to baz, not foo.bar.baz.

EXIT STATUS

0 on success, 1 on failure.

AUTHOR

Kognitio Ltd.

SEE ALSO

wxloader