Anticipating and resolving hardware failure

Caution

This section applies to Kognitio Standalone only, because hardware in Kognitio on Hadoop is virtualized over the HDFS layer.

RAID

Kognitio provides a software RAID layer, described here. Software RAID is recommended for standalone systems, and is required if the system must be able to cope with disk or node failures.

Backups

Kognitio provides tools for backing up your data. Backups are described here. It is recommended to back up data regularly to prepare for disaster recovery.

Preparing for disk failure

If some disk resources in a system fail or become unavailable, and cannot be recovered quickly, Kognitio can be restarted without those disks. There are two ways to do this: standby disks and virtual diskstores. Both require software RAID to be implemented in advance.

  • Standby disks, described here, allow the system to be restarted with a full complement of working disks, by replacing failed disks with reserve disks. The downside is that some disks need to be held in reserve and not used, thereby reducing available disk space.

  • Virtual diskstores, described here, are software-virtualized disk resources. In this case, the system will be restarted without a full complement of working disks. This method doesn’t have the downside of reducing available disk space in the system, but it does mean performance will be degraded in the event of disk failure, since virtual diskstores must constantly recreate data from parity information.

Node failure

If one or more nodes fail, and they cannot be restored quickly, then Kognitio can be restarted with the nodes missing, provided that:

Kognitio cannot continue to run automatically in the event of node failure: a restart will be required, and images will need to be rebuilt from disk. If nodes are missing, there will be less RAM available for user images, and performance will be slightly worse because of the reduced CPU and RAM resources.

Once the node has been repaired, steps can be taken to reintegrate it into the system. See reintegrating nodes for details.

Network interface failure

A Kognitio system should ideally be commissioned on a set of nodes with redundant networking. This would mean that more than one NIC per node is available for inter-node communication. If this is the case, then Kognitio will be able to tolerate one or more NICs failing, provided that all nodes have at least one functional interface.

Kognitio keeps a record of the number of network frames that have been dropped/sent/resent over time for each interface, in the system tables ipe_mpk_stats, ipe_mpk_link_stats and ipe_mpk_link_peer_stats. When certain thresholds are reached, the database will mark an interface as ‘bad’, and will not use the interface thereafter for a period of time.

Periodically, the database will run checks to see whether the interface has returned to a functional state, and will resume use of the interface if it has. Therefore, problems with an interface can often be resolved by replacing or restarting the interface, with no need to restart the system.
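The mark-bad and periodic re-check behaviour described above can be pictured with a small simulation. This is only an illustrative sketch and not Kognitio's implementation: the threshold MAX_RESEND_RATIO, the interval RECHECK_INTERVAL, and the names InterfaceState, update_interface and usable are all hypothetical, and the real thresholds and checks are internal to the database.

```python
from dataclasses import dataclass

# Hypothetical values chosen for illustration only; the thresholds
# Kognitio actually applies are not given in this section.
MAX_RESEND_RATIO = 0.05   # resent/sent ratio above which a NIC is marked bad
RECHECK_INTERVAL = 60.0   # seconds to wait before re-testing a bad NIC


@dataclass
class InterfaceState:
    name: str
    bad: bool = False
    marked_bad_at: float = 0.0


def update_interface(state, sent, resent, now, probe_ok=False):
    """Apply one round of the mark-bad / periodic re-check policy."""
    if not state.bad:
        # A healthy interface is marked bad once the threshold is exceeded.
        if sent > 0 and resent / sent > MAX_RESEND_RATIO:
            state.bad = True
            state.marked_bad_at = now
    elif now - state.marked_bad_at >= RECHECK_INTERVAL:
        # A bad interface is re-tested only after the interval has elapsed.
        if probe_ok:
            state.bad = False
        else:
            state.marked_bad_at = now  # still failing: wait another interval
    return state


def usable(states):
    """Return the names of interfaces currently considered functional."""
    return [s.name for s in states if not s.bad]


if __name__ == "__main__":
    eth0 = InterfaceState("eth0")
    update_interface(eth0, sent=1000, resent=100, now=0.0)
    print(usable([eth0, InterfaceState("eth1")]))
```

In this model a node remains usable as long as usable() returns at least one interface, which mirrors the requirement that every node keeps at least one functional interface.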

If you need to manually stop the Kognitio database from using one or more network interfaces for internal traffic, update the local config file on the affected nodes to include an entry that defines which interfaces should be used on those nodes. The steps below describe this.

  1. ssh to the desired node, and open the local config file on the node with wxviconf -l

  2. Add the line default_net=<x> under [system], where <x> is a comma-separated list of the interfaces that the node should use. For example, to specify that eth2 and eth3 should be used for inter-node communication, use default_net=eth2,eth3

  3. Repeat steps 1 and 2 for other nodes if required

  4. Run wxprobe -i and check that the correct interfaces are used, and that the correct number of MPK links are shown.

  5. Restart the database with wxserver start [sysimage]

These changes will remain in effect until the config settings are removed and the system is restarted.
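As a rough aid for step 2, the sketch below builds the default_net entry from a list of interface names and reads it back from config text. It assumes the local config file is INI-like, as the [system] section header suggests, so that Python's configparser can read it; that is an assumption, and a real config file may contain syntax this parser does not accept. The helper names default_net_line and read_default_net are hypothetical and are not part of any Kognitio tool.

```python
import configparser


def default_net_line(interfaces):
    """Build the default_net entry to place under [system]."""
    names = [name.strip() for name in interfaces if name.strip()]
    if not names:
        raise ValueError("at least one interface is required")
    return "default_net=" + ",".join(names)


def read_default_net(config_text):
    """Return the interfaces listed in default_net under [system].

    Returns an empty list when the entry (or the section) is absent.
    Assumes the config text is INI-like (an assumption, see above).
    """
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    value = parser.get("system", "default_net", fallback="")
    return [name.strip() for name in value.split(",") if name.strip()]


if __name__ == "__main__":
    line = default_net_line(["eth2", "eth3"])
    print(line)
    print(read_default_net("[system]\n" + line + "\n"))
```

A check like read_default_net can be run against a copy of the edited text before restarting, but it does not replace the wxprobe -i verification in step 4.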

Disk corruption

Kognitio provides disk_check and disk_repair tools to check and repair any data structure inconsistencies on disk resources. Typically these would be used when evidence of disk corruption is found. The syntax is:

INVOKE DISK_REPAIR;
INVOKE DISK_CHECK;

The INVOKE DISK_REPAIR command invokes a disk resource repair process that makes a complete scan of all disk resources, and tries to repair any structural damage encountered in the Kognitio data structure. The Kognitio error log lists all actions undertaken by DISK_REPAIR. At the end of the DISK_REPAIR run, the disk resource structure should be consistent (although some data loss may have occurred, in which case it is reported in the error log). The INVOKE DISK_CHECK command performs the same structural checks, but makes no changes to the disk resource. These commands are not required during normal operation. They should only be run after consultation with Kognitio.

If disk corruption cannot be repaired in this way, the system will need to be recommissioned and restored from backup.