Reintegrating a repaired node

If a node in a KAP system fails, KAP resiliency features allow the system to be brought back up with the node missing, albeit with some impact on performance and capacity. This document describes the steps typically required to reintroduce the node once it has been repaired, and sets expectations for how long such an operation will take. The repaired node must have the same OS version, Kognitio software version, drivers and firmware as the other nodes. It must also have identically configured disks.

Note that the times shown below are estimates based on past incidents, not guaranteed minimum/maximum times for the operations.

The basic steps involved are:

  1. Run tests on the node in isolation: 2 days. This is to gain confidence that the node will not fail again quickly (e.g. due to components failing as the node is put under load). These tests should ideally run for a period of days; skipping this step increases the risk of the node failing again, with further disruption to the production system.

  2. Make the node visible to the rest of the KAP system: 15 minutes. Stop all KAP software on the repaired node, and change its configuration file to allow it to be seen by the other nodes in the KAP system.

  3. Wipe the disk resources on the node to be reintroduced: 2-6 hours. The disk resources on the node need to be reinitialised. The time taken for this is dependent upon disk size and disk subsystem performance.

  4. Restart the KAP software including the repaired node: 1-6 hours. Ideally an imaging script is used here so that only the required images are created in RAM, in an optimal order. Bear in mind that disk performance is degraded at this point, as the system is still relying on software RAID to reconstruct data for the missing disk resources.

  5. Recreate the disk resources on the repaired node: 2-8 hours. The time taken to recreate depends on disk resource size, disk performance, and concurrent disk activity.

More detail on each of these steps is given below.

Step 1: Testing the node in isolation

In this phase, the node’s system_id should be changed so that it differs from the system_id of the main KAP system. The node is then commissioned as a single-node system, with a short-cut method used to prevent the disk resource being zeroed. This confirms that the node is capable of running the KAP software at a basic level.

To gain more confidence in the repaired node, run some basic DB tests. These exercise CPU and RAM, although not networking (as the node is isolated) or disk. It is best practice to run these tests for a prolonged period to shake out any issues with the repaired hardware (e.g. failures that only appear as components heat up or are put under load). A sketch of such a test is shown below.
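
As a minimal illustration, a smoke test along the following lines could be run repeatedly from a SQL session on the isolated node (the table name smoke_test is hypothetical; any small create/insert/scan cycle serves the same purpose):

    -- hypothetical smoke test: create, populate, scan and drop a small table
    create table smoke_test (n integer);
    insert into smoke_test values (1);
    select count(*) from smoke_test;
    drop table smoke_test;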

Step 2: Make the node visible to the rest of the KAP system

  1. Stop the KAP software on all nodes by running wxserver stop.

  2. Stop the System Management Daemon (smd) on the repaired node with wxsvc stop.

  3. Edit the configuration file to match the main system. This can be done by copying /opt/kognitio/wx2/etc/config from another node to the repaired node, as root. Ensure the owner and group of this file are root on all nodes.

  4. Restart the smd on the repaired node with wxsvc start.

  5. Wait a few seconds, then run wxprobe -H to ensure all nodes are present, including the repaired node. The full sequence is sketched below.
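
A minimal sketch of this sequence, assuming node1 is the host name of an existing node (a hypothetical name used for illustration):

    # on any existing node: stop the KAP software across the system
    wxserver stop

    # on the repaired node: stop the smd, then (as root) copy the
    # configuration file from an existing node
    wxsvc stop
    scp root@node1:/opt/kognitio/wx2/etc/config /opt/kognitio/wx2/etc/config
    chown root:root /opt/kognitio/wx2/etc/config

    # restart the smd, wait a few seconds, then confirm all nodes are visible
    wxsvc start
    sleep 10
    wxprobe -H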

Step 3: Zero disk resources on the repaired node

Only execute these steps if the replacement node has been fitted with new disks, or if the UID on the original disks has changed, e.g. if the node was commissioned as a single-node system for testing.

  1. Zero the disk on the replacement node with wxtool -Z. You can monitor the progress of the disk zeroing by viewing the output file in the current startup directory on the repaired node and looking for lines containing the string “FORMAT” - these are emitted for every 1% of progress.

  2. Once zeroing has completed, run wxprobe -wD to confirm the status has changed to disk_is_zeroed 1 for the disk on the replacement node, and note the new UID (items 1 and 2 are sketched after this list).

  3. View the clustermap for the current boot on the main system and note down the UID of the failed drive, i.e. the drive marked as <virtual ds>.
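
A minimal sketch of items 1 and 2, run on the repaired node (<startup-output-file> is a placeholder for the output file in the current startup directory; its exact name varies by installation):

    # zero the disk resources on this node
    wxtool -Z

    # monitor progress - a FORMAT line is emitted for every 1% completed
    grep FORMAT <startup-output-file>

    # once zeroing has completed, confirm disk_is_zeroed 1 for the disk
    # and note the new UID
    wxprobe -wD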

Step 4: Restart the system to include the repaired node

Only execute step 1 below if the replacement node has been fitted with new disks, or if the UID on the original disks has changed, e.g. if the node was commissioned as a single-node system for testing.

  1. Having noted the UID of the replacement disk(s) in step 3.2 (new UID) and the UID of the virtual diskstore(s) in step 3.3 (original UID), run wxserver start [sysimage] without recovery replace uid <original UID> with <new UID>. Typically you would include the sysimage option if you have an imaging script you can run. If you need to restart with multiple disk replacements, use wxserver start [sysimage] without recovery replace uid <old-uid-1> with <new-uid-1> uid <old-uid-2> with <new-uid-2> ...

Only execute step 2 below if the replacement node is using the original disks with unaltered Kognitio partitions.

  2. Run wxserver start [sysimage] without recovery to restart the database. Both restart variants are sketched below.
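
For illustration, the two restart variants look as follows (<original-uid> and <new-uid> are placeholders for the values noted in steps 3.3 and 3.2; include the sysimage option only if you have an imaging script):

    # new disks, or disks whose UID has changed
    wxserver start sysimage without recovery replace uid <original-uid> with <new-uid>

    # original disks with unaltered Kognitio partitions
    wxserver start sysimage without recovery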

Step 5: Recreate disk resources on the repaired node

If reliability features are enabled, the KAP software will automatically recreate the disk resources that have just been added. If not, recreate disk commands must be run manually. Note that a recreate disk command always returns immediately, leaving the recreate to run in the background.

  1. Run select mpid,status from sys.ipe_xor_element, and make a note of the mpids of any disks with a status of 0. If there are no disks with a status of 0, but there are disks with a status of 4 or 7, then the disks are already recreating, and no further work is required.

  2. If there were disks with a status of 0, then for each disk run recreate disk <mpid>;. If you then run select mpid,status from sys.ipe_xor_element, you should see each disk with a status of 4 or 7, meaning the node reintegration is complete. A sketch of this check follows.
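
A minimal sketch of this check, run from a SQL session on the restarted system (the mpid value 42 is a hypothetical example; use the values returned by the first query):

    -- status 0 means a recreate is still needed; 4 or 7 means the disk
    -- is already recreating
    select mpid, status from sys.ipe_xor_element;

    -- run once per disk reported with status 0; the command returns
    -- immediately and the recreate runs in the background
    recreate disk 42;

    -- re-run the status query to confirm every disk now shows 4 or 7
    select mpid, status from sys.ipe_xor_element;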