Memory Image Distribution

As described in Using Kognitio Memory, the way the rows of a memory image are distributed across ramstores affects how they are processed in a query.

The basic goal is to distribute data so that operations can be completed without having to pass data between ramstores - this is known as a “ramstore local” operation and enables extremely fast join and group by operations.

There are three main types of image distribution.

  • random - (the default) rows are distributed evenly between ramstores.
  • hashed - rows are distributed so that all rows with the same combined hash value for the columns specified in the hashed on clause are placed on the same ramstore.
  • replicated - every ramstore contains all the rows in the image.
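The three placement strategies above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ramstore count, the `ramstore_for` hash (SHA-256 over the key columns) and the `place_*` helper names are hypothetical and are not Kognitio's actual implementation. The random placement here works row by row rather than in batches.

```python
import hashlib
import random

NUM_RAMSTORES = 4  # hypothetical ramstore count, for illustration only


def ramstore_for(key_values):
    """Map a tuple of column values to a ramstore index (stand-in hash)."""
    digest = hashlib.sha256(repr(tuple(key_values)).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_RAMSTORES


def place_random(rows, rng):
    """Random image: each row goes to a ramstore chosen at random."""
    stores = [[] for _ in range(NUM_RAMSTORES)]
    for row in rows:
        stores[rng.randrange(NUM_RAMSTORES)].append(row)
    return stores


def place_hashed(rows, columns):
    """Hashed image: the hash of the chosen columns decides the ramstore."""
    stores = [[] for _ in range(NUM_RAMSTORES)]
    for row in rows:
        stores[ramstore_for([row[c] for c in columns])].append(row)
    return stores


def place_replicated(rows):
    """Replicated image: every ramstore holds a full copy of the rows."""
    return [list(rows) for _ in range(NUM_RAMSTORES)]


rows = [{"customer_id": i % 10, "amount": i} for i in range(1000)]
random_stores = place_random(rows, random.Random(0))
hashed_stores = place_hashed(rows, ["customer_id"])
replicated_stores = place_replicated(rows)

for name, stores in [("random", random_stores),
                     ("hashed", hashed_stores),
                     ("replicated", replicated_stores)]:
    print(name, [len(s) for s in stores])
```

In the hashed case every `customer_id` value ends up on exactly one ramstore, which is the property the later sections rely on.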

The syntax for each distribution and further information is available in the reference pages.

Random images

By default, Kognitio distributes the rows of an image evenly across all the ramstores on all the nodes. It writes rows in batches to ramstores chosen at random.

This has the advantage that there will be no skew in the data.

Replicated images

For relatively small tables or views (lookup tables, for example) you can create a replicated image.

Kognitio writes every row of the image to every ramstore on every node. This means that, whatever the distribution of the table being joined to, the join will always be ramstore local. This gives you great flexibility and excellent join performance at the expense of memory usage.
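As a rough sketch of that trade-off, the Python below joins a randomly distributed table to a replicated lookup table one ramstore at a time and counts how many lookup rows are stored in total. The table names, row counts and ramstore count are assumptions made up for the example.

```python
import random

NUM_RAMSTORES = 4  # hypothetical ramstore count, for illustration only

regions = [{"region_id": i, "region_name": f"region{i}"} for i in range(5)]
sales = [{"sale_id": i, "region_id": i % 5} for i in range(100)]

# Replicated image: every ramstore holds a full copy of the lookup table.
region_copies = [list(regions) for _ in range(NUM_RAMSTORES)]

# The table being joined to is randomly distributed.
rng = random.Random(42)
sale_stores = [[] for _ in range(NUM_RAMSTORES)]
for sale in sales:
    sale_stores[rng.randrange(NUM_RAMSTORES)].append(sale)

# Each ramstore joins only the rows it holds; no rows move between ramstores.
local_join = [
    (sale["sale_id"], region["region_name"])
    for sale_part, region_part in zip(sale_stores, region_copies)
    for sale in sale_part
    for region in region_part
    if sale["region_id"] == region["region_id"]
]

# The memory cost: the lookup rows are stored once per ramstore.
stored_region_rows = sum(len(copy) for copy in region_copies)
print(len(local_join), stored_region_rows)
```

Every sale finds its region locally, but the 5 lookup rows occupy 20 row slots across the 4 ramstores, which is why replication suits small tables.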

Hashed images

Hashed images are distributed by calculating a hash value from one or more columns of each row and using that hash value to determine which ramstore the row is placed on.

This means that if two tables are hash distributed on the keys used to join them in a query, each ramstore will contain all the rows with the same join keys from both tables, so the join operation will be ramstore local.
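A minimal Python sketch of this co-location follows. It assumes a hypothetical `ramstore_for` stand-in hash and made-up `customers` and `calls` tables, both hash distributed on `customer_id`, and checks that joining each ramstore in isolation gives the same result as joining the full tables.

```python
import hashlib

NUM_RAMSTORES = 4  # hypothetical ramstore count, for illustration only


def ramstore_for(key_values):
    """Map a tuple of column values to a ramstore index (stand-in hash)."""
    digest = hashlib.sha256(repr(tuple(key_values)).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_RAMSTORES


def distribute(rows, columns):
    """Hash distribute rows on the given columns."""
    stores = [[] for _ in range(NUM_RAMSTORES)]
    for row in rows:
        stores[ramstore_for([row[c] for c in columns])].append(row)
    return stores


customers = [{"customer_id": i, "name": f"cust{i}"} for i in range(20)]
calls = [{"customer_id": i % 20, "duration": i} for i in range(200)]

customer_stores = distribute(customers, ["customer_id"])
call_stores = distribute(calls, ["customer_id"])

# Ramstore-local join: each ramstore only looks at its own rows.
local_join = [
    (cust["name"], call["duration"])
    for cust_part, call_part in zip(customer_stores, call_stores)
    for cust in cust_part
    for call in call_part
    if cust["customer_id"] == call["customer_id"]
]

# Reference join over the undistributed tables.
global_join = [
    (cust["name"], call["duration"])
    for cust in customers
    for call in calls
    if cust["customer_id"] == call["customer_id"]
]

print(sorted(local_join) == sorted(global_join))
```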

Similarly, if you hash distribute a table on a column and then group by that column or apply count(distinct ...) to it, all the rows with the same value for the column will be on a single ramstore, so the processing of that clause will be ramstore local.
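The same idea for grouping can be sketched as below, again with a hypothetical stand-in hash and made-up data. Each ramstore counts its own rows per `customer_id`, and because no key spans two ramstores the local results can simply be concatenated without any merge step.

```python
import hashlib
from collections import Counter

NUM_RAMSTORES = 4  # hypothetical ramstore count, for illustration only


def ramstore_for(key_values):
    """Map a tuple of column values to a ramstore index (stand-in hash)."""
    digest = hashlib.sha256(repr(tuple(key_values)).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_RAMSTORES


calls = [{"customer_id": i % 7, "duration": i} for i in range(70)]

# Hash distribute the calls on customer_id.
stores = [[] for _ in range(NUM_RAMSTORES)]
for call in calls:
    stores[ramstore_for([call["customer_id"]])].append(call)

# Each ramstore performs the group by on its own rows only.
local_counts = [Counter(call["customer_id"] for call in part) for part in stores]

# Concatenate the local results; a key seen twice would mean the group
# was split across ramstores and a merge step would be needed.
combined = {}
for counts in local_counts:
    for customer_id, row_count in counts.items():
        assert customer_id not in combined
        combined[customer_id] = row_count

print(combined)
```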

Note: if you hash one table by a single column and another table by that column plus another one, the hashing isn’t compatible and the tables can’t be joined locally.
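To illustrate why the two hashings are incompatible, the sketch below distributes a made-up `customers` table on `customer_id` and a `calls` table on `customer_id` plus `region`, using a hypothetical stand-in hash. It then counts how many call rows land on a different ramstore from their matching customer row; those rows cannot be joined without moving data.

```python
import hashlib

NUM_RAMSTORES = 4  # hypothetical ramstore count, for illustration only
REGIONS = ["north", "south", "east", "west"]


def ramstore_for(key_values):
    """Map a tuple of column values to a ramstore index (stand-in hash)."""
    digest = hashlib.sha256(repr(tuple(key_values)).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_RAMSTORES


customers = [{"customer_id": i} for i in range(20)]
calls = [
    {"customer_id": i % 20, "region": REGIONS[(i // 20) % 4]}
    for i in range(200)
]

# customers hashed on (customer_id); calls hashed on (customer_id, region).
customer_home = {c["customer_id"]: ramstore_for([c["customer_id"]]) for c in customers}

local_matches = 0
misplaced = 0
for call in calls:
    call_home = ramstore_for([call["customer_id"], call["region"]])
    if call_home == customer_home[call["customer_id"]]:
        local_matches += 1   # the customer row is on the same ramstore
    else:
        misplaced += 1       # the customer row lives elsewhere

print("joinable locally:", local_matches, "not co-located:", misplaced)
```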

Skew

One of the difficulties with hash distribution is skew. This is where more rows are distributed to some ramstores than others. A typical case is distribution by customer_id where most customers are small and a few are very large. For example, a phone company may have many SME customers and a small number of large call centers. The ramstores that hold the call rows for the call centers' customer_id values will fill up more quickly than the others and you may get an error: RS0001 Insufficient RAM for table / view.

This means that one or more ramstores are full - others may have plenty of space but the space can’t be used because the basic hash distribution requires the data to be deterministically placed for the optimiser to take advantage of the distribution.
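The effect can be sketched numerically. The Python below assumes 99 small customers with 10 call rows each, one call center with 10,000 rows, and a hypothetical capacity of 5,000 rows per ramstore; the hash function and all figures are made up for illustration.

```python
import hashlib
from collections import Counter

NUM_RAMSTORES = 4      # hypothetical ramstore count
CAPACITY_ROWS = 5000   # hypothetical capacity of each ramstore, in rows


def ramstore_for(key_values):
    """Map a tuple of column values to a ramstore index (stand-in hash)."""
    digest = hashlib.sha256(repr(tuple(key_values)).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_RAMSTORES


# Rows per customer_id: 99 small customers and one large call center (id 0).
row_counts = {customer_id: 10 for customer_id in range(1, 100)}
row_counts[0] = 10_000

# All rows for a customer_id must go to the ramstore its hash selects.
per_ramstore = Counter()
for customer_id, rows in row_counts.items():
    per_ramstore[ramstore_for([customer_id])] += rows

total_rows = sum(row_counts.values())
total_capacity = CAPACITY_ROWS * NUM_RAMSTORES
fullest = max(per_ramstore.values())
overfull = [rs for rs, rows in per_ramstore.items() if rows > CAPACITY_ROWS]

print("total rows:", total_rows, "total capacity:", total_capacity)
print("rows per ramstore:", dict(per_ramstore))
print("ramstores over capacity:", overfull)
```

The image needs 10,990 row slots and the system has 20,000, yet the ramstore that receives the call center's rows is over capacity while the others are nearly empty.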

In this case you need to consider using partial hashing.

References

Section 2.6 of the Kognitio Guide (PDF).