In my last blog post, I introduced Kognitio’s ability to flatten complex JSON objects for loading into a table. Today we’ll look at another example using real-world Ordnance Survey data. We’ll also look at what you can do if the JSON files you need to load are in HDFS. We’ll use these techniques to solve the following example problem…

Which railway station in mainland Britain is the furthest straight-line distance from its nearest neighbour? The fact that the answer is Berwick-upon-Tweed may surprise you!

The source data

Now that we’ve got the rubbish clickbait joke out of the way, we can look at Ordnance Survey’s OpenData scheme which provides freely-available datasets for mapping purposes. Let’s use one of these datasets, Open Map, which lists the positions of all Britain’s railway stations. We’re going to ignore minor discrepancies such as errors introduced by the curvature of the earth and the map projection, and imagine Britain as a flat surface with the OS grid laid on top of it.

The OpenMap dataset gives us, among other things, a series of .shp and other files containing the locations of railway stations in Britain. We convert these to GeoJSON using the ogr2ogr program (ogr2ogr -f GeoJSON /path/to/outputfile.json /path/to/inputfile.shp), and now we have a series of GeoJSON files…

data/NC/NC_RailwayStation.json
data/ND/ND_RailwayStation.json
data/NG/NG_RailwayStation.json
data/NH/NH_RailwayStation.json
data/NJ/NJ_RailwayStation.json
...

We won’t use wxjsonprofile for this like we did in the previous post. Instead, we’ll define our table and JSON format string manually. Each of these JSON files looks something like this…

{
"type": "FeatureCollection",
"crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:EPSG::27700" } },

"features": [
{ "type": "Feature", "properties": { "ID": "53697B31-32FB-4D9F-A8E3-8C05C9CB5DBB", "DISTNAME": "Chalfont and Latimer", "CLASSIFICA": "Railway Station And London Underground Station", "FEATCODE": 15424 }, "geometry": { "type": "Point", "coordinates": [ 499646.3, 197559.87, 0.0 ] } },
{ "type": "Feature", "properties": { "ID": "D2192CA2-4DED-48D6-9C46-3B127197B214", "DISTNAME": "Swindon", "CLASSIFICA": "Railway Station", "FEATCODE": 15422 }, "geometry": { "type": "Point", "coordinates": [ 414919.98, 185223.0, 0.0 ] } },

<...snip...>

]
}

Defining and populating our table

Each station has a name, an ID, a classification, a feature code, and an OS grid reference consisting of an easting and northing. For this example we won’t need any dimension tables – each station is associated with a pair of co-ordinates rather than a list of polygons or anything complex like that.

So we’ll define our table like this:

CREATE TABLE railway_station (
    id varchar(100),
    name varchar(100),
    easting float,
    northing float
);

And our JSON format string like this, in a file station.json2csv. Our target array is features, and for every object in that array, we project the four fields we want. So here is our station.json2csv file:

[features]
APPLY(firstvalid(?, null)),
properties.ID,
properties.DISTNAME,
geometry.coordinates[0],
geometry.coordinates[1]

Now let’s load the data…

wxloader -s 172.30.11.10 -u os -d -t railway_station \
        -J json/station.json2csv \
        json/data/*/*_RailwayStation.json

This loads each JSON file, one after the other, into the RAILWAY_STATION table using the given JSON format definition.

The Answer

Now let’s find the station whose closest station is furthest away. First, create a view to find the squared distance from every station to every other station. (Squared distance, because Pythagoras’ Theorem means it’s easier to work that out than the distance itself.)

create or replace view railway_station_squared_distances
(id1, id2, dist_squared) as
select s1.id, s2.id,
       (s1.easting - s2.easting) * (s1.easting - s2.easting) +
       (s1.northing - s2.northing) * (s1.northing - s2.northing)
from railway_station s1, railway_station s2;

Now we’ll create another view showing the distance from each station to its nearest station:

create or replace view railway_station_nearest
(id, dist_metres) as
select id1, sqrt(min(dist_squared))
from railway_station_squared_distances
where id1 != id2
group by id1;

Now let’s get the top 10 loneliest stations, name and distance to nearest station in kilometres.

select top 10 s.name, cast(n.dist_metres / 1000 as decimal(9, 3)) dist_km
from railway_station s, railway_station_nearest n
where s.id = n.id
order by 2 desc;

These are the results:

 NAME               | DIST_KM 
--------------------+---------
 Berwick-upon-Tweed |  32.645 
 Stranraer          |  26.064 
 Wick               |  22.181 
 Market Rasen       |  20.447 
 Spalding           |  20.223 
 Elgin              |  19.153 
 Dunkeld and Birnam |  18.852 
 Hereford           |  18.345 
 Perth              |  17.771 
 Gleneagles         |  17.601

Britain’s loneliest station is Berwick-upon-Tweed, being 32.6km in a straight line from its nearest other station, which happens to be Chathill.

JSON in HDFS

In the previous example, we had 38 JSON files, one for each 100km-by-100km OS grid region. Loading them into an ordinary table using wxloader took only a few seconds in this example, but sometimes this may be impractical. What if you’ve got hundreds or thousands of JSON files in HDFS? You don’t want to have to copy them out of HDFS just to load them into Kognitio. And if you’re running Kognitio on Hadoop, and the files are on the same cluster as Kognitio, it seems even more unnecessary to do that.

Let’s make the Kognitio system fetch the files from HDFS itself.

Provided the Kognitio system has access to the HDFS cluster (which, if you’re running Kognitio on Hadoop, it does, or it wouldn’t start), you can solve this problem by creating an external table. An external table like an ordinary table except that its data is not stored within Kognitio, it is stored elsewhere and fetched from there every time the table is queried.

An external table is associated with an external connector. Many external tables can use the same external connector. Kognitio allows users to create HDFS connectors, S3 connectors, and ODBC connectors for fetching data from third party databases. In this post we’ll just look at the HDFS connector.

Creating an HDFS connector

Create your HDFS connector with the following syntax. An external connector takes a target string which defines any attributes the connector needs. In the case of the HDFS connector, it needs to know how to connect to your HDFS cluster’s name node. In this case our name node happens to be at 172.30.21.11:8022 – substitute the address and port number of your HDFS cluster’s name node here.

create external connector HDFSCON
source HDFS
target 'namenode 172.30.21.11:8022';

There are other configuration options related to locations of libraries and other kinds of admin. They are summarised in the HDFS Connector Quick Reference Sheet. Kognitio on Hadoop (8.2 and later) has these libraries in the right place anyway so we won’t get bogged down in those details here. Kognitio on Hadoop also comes with an HDFS connector predefined.

The railway station example in HDFS

Let’s redo our railway station example, but this time using external tables. First we need to define the external table, which connector it uses, which external files to read and how to interpret them.

In our case, we’re reading from a series of JSON files, which we’ve put in the HDFS cluster under /user/kodoop/os/openmap.

$ hadoop fs -ls /user/kodoop/os/openmap/data/*/*_RailwayStation.json
-rw-r--r--   3 kodoop kodoop       2395 2017-02-09 10:19 /user/kodoop/os/openmap/data/NC/NC_RailwayStation.json
-rw-r--r--   3 kodoop kodoop       1674 2017-02-09 10:19 /user/kodoop/os/openmap/data/ND/ND_RailwayStation.json
-rw-r--r--   3 kodoop kodoop       1921 2017-02-09 10:19 /user/kodoop/os/openmap/data/NG/NG_RailwayStation.json
...

Creating an external table backed by JSON files in HDFS

So our external table creation statement looks like this:

create external table railway_station_ext (
    id varchar(100),
    name varchar(100),
    easting float,
    northing float
)
from hdfscon
target 'file /user/kodoop/os/openmap/data/*/*_RailwayStation.json,
        fmt_json 1,
        format "[features]
            APPLY(firstvalid(?, null)),
            properties.ID,
            properties.DISTNAME,
            geometry.coordinates[0],
            geometry.coordinates[1]"'
;

The external table definition consists of a list of column definitions (as with a normal table), an external connector name (hdfscon, the connector we created earlier), and a target string. As with the external connector, the target string contains a comma-separated list of name-value pairs whose meanings are specific to the connector. In this case we’ve had to tell the HDFS connector which files we want (the file attribute), that they’re JSON (fmt_json 1), and what JSON format string to use to project the data into the form defined by the table’s columns (the format attribute).

Querying the external table

Now we’ll set up the same views as before, but referencing this external table rather than an ordinary one:

create or replace view railway_station_squared_distances
(id1, id2, dist_squared) as 
select s1.id, s2.id, 
       (s1.easting - s2.easting) * (s1.easting - s2.easting) +
       (s1.northing - s2.northing) * (s1.northing - s2.northing)
from railway_station_ext s1, railway_station_ext s2;
 
create or replace view railway_station_nearest 
(id, dist_metres) as
select id1, sqrt(min(dist_squared))
from railway_station_squared_distances
where id1 != id2
group by id1;

select top 10 s.name, cast(n.dist_metres / 1000 as decimal(9, 3)) dist_km
from railway_station_ext s, railway_station_nearest n
where s.id = n.id
order by 2 desc;

Optimising repeated accesses of the external data by creating a view image

Note that an external table reloads its data from the source every time it’s accessed, and the query above accesses the RAILWAY_STATION_EXT table several times. To improve performance, you could create a view image of the external table, which here we have called RAILWAY_STATION_EXT_VI. Creating a view image loads the rows into Kognitio’s memory. You can then point the query to that instead.

First, define a view image containing the data from the external table. This defines the view image and loads it with the data.

create or replace view image railway_station_ext_vi as
    select * from railway_station_ext;

Now we’ll redefine RAILWAY_STATION_SQUARED_DISTANCES to use the view image  rather than using the external table directly:

create or replace view railway_station_squared_distances
(id1, id2, dist_squared) as 
select s1.id, s2.id, 
       (s1.easting - s2.easting) * (s1.easting - s2.easting) +
       (s1.northing - s2.northing) * (s1.northing - s2.northing)
from railway_station_ext_vi s1, railway_station_ext_vi s2;

Now redefine our other view to use this view…

create or replace view railway_station_nearest 
(id, dist_metres) as
select id1, sqrt(min(dist_squared))
from railway_station_squared_distances
where id1 != id2
group by id1;

Now select the top ten loneliest railway stations.

select top 10 s.name, cast(n.dist_metres / 1000 as decimal(9, 3)) dist_km
from railway_station_ext_vi s, railway_station_nearest n
where s.id = n.id
order by 2 desc;

Giving us the results:

NAME                | DIST_KM
--------------------+--------
Berwick-upon-Tweed  | 32.645
Stranraer           | 26.064
Wick                | 22.181
Market Rasen        | 20.447
Spalding            | 20.223
Elgin               | 19.153
Dunkeld and Birnam  | 18.852
Hereford            | 18.345
Perth               | 17.771
Gleneagles          | 17.601

Acknowledgements

The data used in these examples comes from OS OpenData. Contains OS data © Crown copyright and database right 2017.