They also live across the river and there’s no bridge…so we’ll just make our own!

Amazon’s S3 is a popular and convenient storage solution which many, especially those with big data, tend to utilise, and the challenge is often connecting to this large store that has been building up over days/weeks/months/years. There are many ways to do this, no doubt: you could pull the data down with curl or wget, but then it arrives in its raw form, and converting it for use with a database isn’t always simple.
With Kognitio external connectors and external tables you can connect to it and parse it on the fly.

Let’s see how to do this with JSON data and external tables in Kognitio, which let us use externally stored data as if it were a local table. We’ll also be utilising this in a later blog post where we’ll have some larger data.

Why would you want to use this though? What are the benefits?

•  Access data stored elsewhere without having to physically move it, streamlining the ETL process
•  Use data as if it were stored locally, with no need to rewrite proprietary scripts and processes
•  Easily update with the newest data and improve access times using memory images

Convinced? Great! Here’s what you’ll need to get started:

•  An S3 bucket with JSON data
•  Access and secret keys for the S3 bucket
•  Kognitio with support for external tables

You can read more about external tables in chapter 8 of the Kognitio guide here.

In a nutshell, the three typical components of this process are: stored data, a connector and an external table.

Let’s try it with our sample JSON file containing a customer’s form data.

{
    "firstName": "John",
    "lastName": "Smith",
    "age": 25,
    "address":
    {
        "streetAddress": "21 2nd Street",
        "city": "New York",
        "state": "NY",
        "postalCode": "10021"
    },
    "phoneNumber":
    [
        {
            "type": "home",
            "number": "212 555-1234"
        },
        {
            "type": "fax",
            "number": "646 555-4567"
        }
    ]
}

Source: https://www.sitepoint.com/customer-form-json-file-example/

The first step is to put this file into S3; you can use curl from Bash or something like the boto package in Python to do this. You can also use a tool such as S3 Browser to confirm that the file is in your bucket.

Now we can start building connections to that data with external tables in Kognitio. External tables don’t actually import the data from their source but bridge a connection to it. This lets us access the data without having to go through an arduous ETL process every time a new batch arrives.

So firstly we need the bridge, an S3 based connector:

create connector my_test_connector source s3
target 'bucket <your_bucket>,
accesskey <your_s3_access_key>,
secretkey <your_s3_secret_key>,
connectors_on_nodes "<node_name>",
proxy "<your_proxy_address>:<port>"';

The connectors_on_nodes option is only needed if your nodes have restricted external access and you need to tell the connector which ones can reach the outside world. The same goes for the proxy setting: use it if your environment requires one; a minimal connector without either option is sketched after the screenshot below. Once you’ve created your connector, you can see it in the systems pane:

Connector in schema
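
If all of your nodes can reach S3 directly and no proxy is needed, both of those options can simply be left out. A minimal sketch of such a connector, using the same placeholders as above:

-- minimal form: connectors_on_nodes and proxy omitted
create connector my_test_connector source s3
target 'bucket <your_bucket>,
accesskey <your_s3_access_key>,
secretkey <your_s3_secret_key>';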

Let’s see if our connector works, run:

external table from my_test_connector target 'list sample.json';

Files in S3

And we can see the file we placed in S3, along with other information about it. Note that you can use the asterisk “*” as in Bash to include all JSON files, e.g. “*.json”. If you have JSON files holding different information, you can tag them with a prefix, e.g. “prefix1*.json”. This will be useful when you need to build several external tables from different JSON sources.
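
For example, to list every JSON file in the bucket in one go (a sketch assuming the wildcard behaves the same way in the list target):

external table from my_test_connector target 'list *.json';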

Creating an external table uses the following syntax:

create external table <your_schema>.<your_table> (
    sourcefile varchar(500)
    ,object_id int
    ,first_name varchar(20)
    ,last_name varchar(20)
    ,c_age int
)
from my_test_connector
target 'file "sample.json"
    ,conv_error_name "TEST_S3", ignore_invalid_records
    ,fmt_json 1, format "APPLY(firstvalid(?,null))
    ,inputbasename()
    ,objectnumber()
    ,firstName
    ,lastName
    ,age
"';

The first part of the query, enclosed in parentheses, defines the columns just as in a typical SQL table creation. These definitions need to match, in order and type, the fields listed further down in the format string.

The next three lines specify the connector being used, the target file(s), the error name for use with ipe_conv_error and how invalid records are handled.

The last part tells the external table that we’ll be reading JSON. This is done by setting fmt_json to 1 and then supplying a format string describing what to look for. The APPLY() function takes a list of functions and applies them to every column, and firstvalid() returns the first argument that evaluates without error, so “APPLY(firstvalid(?,null))” nulls any entry that doesn’t exist instead of raising an error. After that we list the fields we want it to extract. The inputbasename() and objectnumber() functions put the filename and object number into the columns we defined above as sourcefile and object_id, and then come the JSON field names themselves. JSON files can of course be deeply nested, so if you need to reach deeper entries you can chain the names: for a JSON file containing student details, “Student[].year_group” would pull the year group out of the Student array, and if year_group were itself an array inside “Student[]” you could extend it in the same way, “Student[].year_group[].name”.
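
To make the nesting concrete, here’s a purely hypothetical fragment; students.json and its Student fields are invented for illustration and are not part of our sample data:

-- hypothetical file and field names, shown only to illustrate nested access
external table (
    student_name varchar(50)
    ,year_group_name varchar(20)
)
from my_test_connector
target 'file "students.json"
    ,fmt_json 1, format "APPLY(firstvalid(?,null))
    ,Student[].name
    ,Student[].year_group[].name
"';
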
Now we can access the data. External tables appear alongside regular tables in the schemas but have a red icon instead; hovering over one with the mouse cursor will identify it as an external table:

External table in schema
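
Because it behaves like a local table, you can query it with ordinary SQL. A minimal example, assuming the external table was created as <your_schema>.<your_table> with the definition above:

-- query the external table exactly as you would a local table
select first_name, last_name, c_age
from <your_schema>.<your_table>
where c_age > 21;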

What you might have noticed is that this depends on you knowing what’s inside your JSON file, i.e. the entry names, and with larger files searching through them can be quite daunting. A feature that helps with this discovery is prettyprint(), which returns a readable view of whatever you supply to it, e.g. prettyprint(address). It is defined alongside the rest of the column definitions after format. Let’s try it on the address in our sample JSON file using an inline external table (which returns results without saving them as an actual table):

external table (
    sourcefile varchar(500)
    ,object_id int
    ,first_name varchar(20)
    ,last_name varchar(20)
    ,c_age int
    ,addr_contents varchar(1000)
)
from my_test_connector
target 'file "sample.json"
,conv_error_name "TEST_S3", ignore_invalid_records
,fmt_json 1, format "APPLY(firstvalid(?,null))
    ,inputbasename()
    ,objectnumber()
    ,firstName
    ,lastName
    ,age
    ,prettyprint(address)
"';

Pretty printing with JSON files

Now we can see the contents of the address and can pull things from there into their own columns. Try it for the phone number; you will need to add “[]” as it’s an array.
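
One possible answer to that exercise, as a sketch mirroring the address example above (adjust the varchar sizes to suit your data):

-- phoneNumber[] uses the array notation described earlier
external table (
    sourcefile varchar(500)
    ,object_id int
    ,first_name varchar(20)
    ,phone_contents varchar(1000)
)
from my_test_connector
target 'file "sample.json"
    ,conv_error_name "TEST_S3", ignore_invalid_records
    ,fmt_json 1, format "APPLY(firstvalid(?,null))
    ,inputbasename()
    ,objectnumber()
    ,firstName
    ,prettyprint(phoneNumber[])
"';
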
A tip for using data from external tables: the connector still needs to connect to S3 and parse the JSON data on the fly, which can be slow given the number of variables involved, such as connections, proxies and the amount of data. This slow access certainly won’t be pleasant when we come to use the data, so what we can do is create a view image of the table.

create view <your_schema>.<your_view> as
    select * from <your_schema>.<your_table>;
create view image <your_schema>.<your_view>;

The view image is essentially a snapshot of the external table data in memory, which can be refreshed with newer data by simply recreating it. This can be done manually, but it’s recommended that you put the query in a Bash script, submit it via wxsubmit and schedule it to run hourly/daily/weekly with something like cron. It’s also a good idea to do any cleaning or transforming at this view-creation stage instead of a plain “select *”, so that the image is ready to use.
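
For instance, rather than a plain “select *”, the view itself can do a little cleaning before it is imaged; a sketch using the columns from our sample table (the filter is purely illustrative):

-- do the cleaning in the view definition so the image holds ready-to-use data
create view <your_schema>.<your_view> as
    select first_name, last_name, c_age
    from <your_schema>.<your_table>
    where c_age is not null;
create view image <your_schema>.<your_view>;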

Next time we’ll use this with a much larger data set including visuals with Tableau and insights with external scripts.