Forum

Information and discussion related to the Kognitio on Hadoop product
Contributor
Posts: 15
Joined: Mon Jun 04, 2018 4:28 pm

Unable to get Kognitio Server to Start (command hangs)

by tajones » Tue Oct 30, 2018 8:12 pm

We experienced a Kognitio server crash, and the gateway was not running either. I attempted to restart the gateway, but the server was still not accessible. I then restarted the server; it printed many DEAD LINKS and Resends messages, followed by "SERVER INIT Failed. Retrying.", and then it just hung.

I noticed that 4 containers were new (we lost 4 of the original containers), so I decided to stop the cluster and start it again. The cluster stop command just hung, so I had to kill the application from the YARN scheduler. The cluster then started, but starting the server gave the same result (many DEAD LINKS and Resends messages, followed by "SERVER INIT Failed. Retrying."). I tried to start the server once more, and from that point on the command no longer runs at all (it hangs). I have also tried starting the cluster again, and that now hangs too. I am not sure what to try next.

While sifting through the logs for anything helpful, I saw that yesterday at 2018-10-29_18:40:56_GMT a command was issued to stop the gateway, following a series of messages pertaining to masterlock.

/home/kodoop2/kodoop/logs/logs-prod02/smd.T_2018-10-14_00:16:33_EDT/output.T_2018-10-14_00:16:33_EDT

T_2018-10-29_16:02:28_GMT: Stdout: Get masterlock on this node from container_e141_1539459413993_0219_01_000027 failed. I am greater.
T_2018-10-29_16:02:28_GMT: Stdout: Get masterlock on this node from container_e141_1539459413993_0219_01_000026 failed. I am greater.
T_2018-10-29_16:02:28_GMT: Stdout: Get masterlock on this node from container_e141_1539459413993_0219_01_000028 failed. I am greater.
T_2018-10-29_16:02:45_GMT: Stdout: pxxxxx045..../12942/270656747/00000000 606 CLK: CLOCK CHECK PACKET.
T_2018-10-29_16:02:45_GMT: Stdout: pxxxxx045..../12942/270656747/00000000 606 CLK: QUEUED CLOCK CHECK RESPONSE. min 0, max 0
T_2018-10-29_16:03:09_GMT: Stdout: pxxxxx045..../12942/270656753/00000000 607 CLK: CLOCK CHECK PACKET.
T_2018-10-29_16:03:09_GMT: Stdout: pxxxxx045..../12942/270656753/00000000 607 CLK: QUEUED CLOCK CHECK RESPONSE. min 0, max 0
T_2018-10-29_16:03:34_GMT: Stdout: pxxxxx045..../12942/270656758/00000000 605 CLK: CLOCK CHECK PACKET.
T_2018-10-29_16:03:34_GMT: Stdout: pxxxxx045..../12942/270656758/00000000 605 CLK: QUEUED CLOCK CHECK RESPONSE. min 0, max 0
T_2018-10-29_16:03:58_GMT: Stdout: pxxxxx045..../12942/270656765/00000000 606 CLK: CLOCK CHECK PACKET.
T_2018-10-29_16:03:58_GMT: Stdout: pxxxxx045..../12942/270656765/00000000 606 CLK: QUEUED CLOCK CHECK RESPONSE. min 0, max 0
T_2018-10-29_18:40:56_GMT: Kognitio WX2 Service Controller v8.02.00-rel170824 on prod02
T_2018-10-29_18:40:56_GMT: (c)Copyright Kognitio Ltd 2001-2017.

T_2018-10-29_18:40:56_GMT: Stopping ODBC Gateway:T_2018-10-29_18:40:56_GMT: OK.
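One thing that stands out in the excerpt above is the jump from 16:03:58 to 18:40:56 with nothing logged in between. A minimal Python sketch for spotting quiet periods like that in a log is shown here. It assumes every relevant line starts with the `T_<date>_<time>_GMT:` prefix seen in the excerpt; the `find_gaps` name and the 5-minute threshold are illustrative choices, not part of any Kognitio tool.

```python
import re
from datetime import datetime, timedelta

# Matches the "T_2018-10-29_16:02:28_GMT:" prefix seen in the excerpt.
STAMP = re.compile(r"^T_(\d{4}-\d{2}-\d{2}_\d{2}:\d{2}:\d{2})_GMT:")


def find_gaps(lines, threshold=timedelta(minutes=5)):
    """Return (previous, current) timestamp pairs more than `threshold` apart."""
    gaps = []
    previous = None
    for line in lines:
        match = STAMP.match(line)
        if not match:
            continue  # blank line or a line without a timestamp prefix
        current = datetime.strptime(match.group(1), "%Y-%m-%d_%H:%M:%S")
        if previous is not None and current - previous > threshold:
            gaps.append((previous, current))
        previous = current
    return gaps


# Three lines copied from the excerpt in this post.
sample = [
    "T_2018-10-29_16:03:58_GMT: Stdout: pxxxxx045..../12942/270656765/00000000 606 CLK: CLOCK CHECK PACKET.",
    "T_2018-10-29_16:03:58_GMT: Stdout: pxxxxx045..../12942/270656765/00000000 606 CLK: QUEUED CLOCK CHECK RESPONSE. min 0, max 0",
    "T_2018-10-29_18:40:56_GMT: Kognitio WX2 Service Controller v8.02.00-rel170824 on prod02",
]

for start, end in find_gaps(sample):
    print(f"no output between {start} and {end} ({end - start})")
```

To run it over a whole file, pass the open file object instead of `sample`, for example `find_gaps(open(path))`, where `path` points at the output file named above.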

Looking for any suggestions on what to try next. Thanks!

Re: Unable to get Kognitio Server to Start (command hangs)

by tajones » Wed Oct 31, 2018 1:25 pm

Since yesterday I have not been able to get past the cluster start command. It hangs at "If this takes more than 5 minutes you may have a problem."

kodoop cluster prod02 start
Kognitio Analytical Platform software for Hadoop ver80200rel170824.
(c)Copyright Kognitio Ltd 2001-2017.

Starting slider cluster for prod02
Waiting for cluster to start up
This may take a few minutes, please be patient.
If this takes more than 5 minutes you may have a problem.
Cluster started, starting local runtime
Waiting for containers to check in.
This may take a few minutes, please be patient.
If this takes more than 5 minutes you may have a problem.

The commands log stops at the following lines:

2018-10-31 08:20:36,665 [main] INFO  impl.YarnClientImpl - Submitted application application_1539459413993_99862
2018-10-31 08:20:36,667 [main] INFO  util.ExitUtil - Exiting with status 0
RUNNING hadoop fs -put -f - .kodoop-clusters/prod02/cluster-start-info
Running /home/kodoop2/kodoop/clusters/prod02/wx2/current/bin/wxsvc -s status
Service System management daemon is running, pid 7517.
Running /home/kodoop2/kodoop/clusters/prod02/wx2/current/bin/wxtool -R
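
Since several commands in this thread hang indefinitely, it may help to run diagnostic commands under a time limit so that a hang is reported instead of blocking the terminal. The following is a rough Python sketch of that idea using only the standard library. The `run_with_timeout` helper is hypothetical, and the demo uses a sleeping Python process as a stand-in for a hung command rather than any Kognitio tool.

```python
import subprocess
import sys


def run_with_timeout(cmd, seconds):
    """Run cmd; return (finished, output). finished is False on timeout."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=seconds,
        )
        return True, result.stdout
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child before raising; processes
        # that the child itself started are not covered by this.
        return False, ""


# Stand-in for a command that never returns.
hanging = [sys.executable, "-c", "import time; time.sleep(30)"]
finished, _ = run_with_timeout(hanging, seconds=1)
print("finished" if finished else "timed out")
```

For a read-only check such as the `wxsvc -s status` call in the log above, the command list would be the full path to `wxsvc` followed by `"-s", "status"`. Whether it is safe to interrupt a state-changing command such as a cluster start this way is not something this thread establishes, so treat that as an open question.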

Re: Unable to get Kognitio Server to Start (command hangs)

by tajones » Wed Oct 31, 2018 2:11 pm

I have resolved this. I attempted to stop the SMD (system management daemon) so that I could start it again, but that command hung:
wxsvc -s -V stop
I had to kill the pid. After this, I was able to start the cluster successfully and then the server. I'm still researching the root cause. I found the following in the logs; any comment on what it means would be great. Thanks!
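
For anyone who needs to find the pid to kill, the status output quoted earlier in this thread ("Service System management daemon is running, pid 7517.") contains it. Below is a small Python sketch that pulls the pid out of a line in that format. It assumes the status message always has that wording, which is only confirmed for the single line shown in this thread; `parse_running_pid` is an illustrative name. The sketch only parses text and does not signal any process.

```python
import re

# Pattern based on the one status line quoted in this thread; other
# wordings of the status message are not covered.
PID_LINE = re.compile(r"is running, pid (\d+)\.")


def parse_running_pid(status_output):
    """Return the pid from a '... is running, pid N.' line, or None."""
    match = PID_LINE.search(status_output)
    return int(match.group(1)) if match else None


status = "Service System management daemon is running, pid 7517."
print(parse_running_pid(status))
```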

T_2018-10-29_15:31:40_GMT: CG 0106: Bad r.p. node. at /kognitio/dev/releases/sys80200rel170824/src/wxdb/cg/cgrp.c:6028.
T_2018-10-29_15:31:40_GMT: CG 0106: codebuffer is full. at /kognitio/dev/releases/sys80200rel170824/src/wxdb/cg/cgcodeop.c:620.
T_2018-10-29_15:31:40_GMT: CG 0106: Bad r.p. node. at /kognitio/dev/releases/sys80200rel170824/src/wxdb/cg/cgrp.c:6028.
T_2018-10-29_15:31:40_GMT: CG 0106: codebuffer is full. at /kognitio/dev/releases/sys80200rel170824/src/wxdb/cg/cgcodeop.c:620.

These statements were followed by a bunch of warnings, and this set of records repeated until T_2018-10-29_16:05:12_GMT, at which point the aborts shown below began:

T_2018-10-29_16:05:12_GMT: AM id 0x310002c9: abort request received for this AM. session=34947, tno=1066105, session aborted=-1, tno aborted=1066105, abort type=1, commandrunning=1, amg->abortquery=0
T_2018-10-29_16:05:12_GMT: AM #310002c9 aborting now
T_2018-10-29_16:06:03_GMT: AM id 0x2f000234: abort request received for this AM. session=34951, tno=1066109, session aborted=-1, tno aborted=1066109, abort type=1, commandrunning=1, amg->abortquery=0
T_2018-10-29_16:06:03_GMT: AM #2f000234 aborting now
T_2018-10-29_16:06:03_GMT: AM id 0x2e00024f: abort request received for this AM. session=34949, tno=1066107, session aborted=-1, tno aborted=1066107, abort type=1, commandrunning=1, amg->abortquery=0
T_2018-10-29_16:06:03_GMT: AM #2e00024f aborting now
T_2018-10-29_16:06:03_GMT: AM id 0x310000b2: abort request received for this AM. session=34952, tno=1066110, session aborted=-1, tno aborted=1066110, abort type=1, commandrunning=1, amg->abortquery=0
T_2018-10-29_16:06:03_GMT: AM #310000b2 aborting now
T_2018-10-29_16:06:43_GMT: AM id 0x2900026f: abort request received for this AM. session=34948, tno=1066106, session aborted=-1, tno aborted=1066106, abort type=1, commandrunning=1, amg->abortquery=0
T_2018-10-29_16:06:43_GMT: AM #2900026f aborting now
T_2018-10-29_16:06:43_GMT: AM id 0x3100012c: abort request received for this AM. session=34950, tno=1066108, session aborted=-1, tno aborted=1066108, abort type=1, commandrunning=1, amg->abortquery=0
T_2018-10-29_16:06:43_GMT: AM #3100012c aborting now
T_2018-10-29_16:07:23_GMT: AM id 0x2800007b: abort request received for this AM. session=34946, tno=1066104, session aborted=-1, tno aborted=1066104, abort type=1, commandrunning=1, amg->abortquery=0
T_2018-10-29_16:07:23_GMT: AM #2800007b aborting now
T_2018-10-29_18:41:09_GMT: accept_client_connection(): poll returned 2 but no events on any listening socket?
T_2018-10-29_18:41:09_GMT: AM Listener: accept() failed (No such file or directory).
T_2018-10-29_18:41:09_GMT: accept() failure may be the result of a server crash, if one has taken place.
T_2018-10-29_18:41:09_GMT: accept() failed, and crash has already happened. Going into infinite sleep.
T_2018-10-29_18:41:09_GMT: accept_client_connection(): poll returned 2 but no events on any listening socket?
T_2018-10-29_18:41:09_GMT: AM Listener: accept() failed (No such file or directory).
T_2018-10-29_18:41:09_GMT: accept() failure may be the result of a server crash, if one has taken place.
T_2018-10-29_18:41:09_GMT: accept() failed, and crash has already happened. Going into infinite sleep.
T_2018-10-29_18:41:09_GMT: accept_client_connection(): poll returned 2 but no events on any listening socket?
T_2018-10-29_18:41:09_GMT: AM Listener: accept() failed (No such file or directory).
T_2018-10-29_18:41:09_GMT: accept() failure may be the result of a server crash, if one has taken place.
T_2018-10-29_18:41:09_GMT: accept() failed, and crash has already happened. Going into infinite sleep.
T_2018-10-29_18:41:09_GMT: accept_client_connection(): poll returned 2 but no events on any listening socket?
T_2018-10-29_18:41:09_GMT: AM Listener: accept() failed (No such file or directory).
T_2018-10-29_18:41:09_GMT: accept() failure may be the result of a server crash, if one has taken place.
T_2018-10-29_18:41:09_GMT: accept() failed, and crash has already happened. Going into infinite sleep.
T_2018-10-29_18:41:09_GMT: accept_client_connection(): poll returned 2 but no events on any listening socket?
T_2018-10-29_18:41:09_GMT: AM Listener: accept() failed (No such file or directory).
T_2018-10-29_18:41:09_GMT: accept() failure may be the result of a server crash, if one has taken place.
T_2018-10-29_18:41:09_GMT: accept() failed, and crash has already happened. Going into infinite sleep.
T_2018-10-29_18:41:09_GMT: accept_client_connection(): poll returned 2 but no events on any listening socket?
T_2018-10-29_18:41:09_GMT: AM Listener: accept() failed (No such file or directory).
T_2018-10-29_18:41:09_GMT: accept() failure may be the result of a server crash, if one has taken place.
T_2018-10-29_18:41:09_GMT: accept() failed, and crash has already happened. Going into infinite sleep.
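
To make the abort requests above easier to read, here is a short Python sketch that extracts the timestamp, AM id, session and tno from each "abort request received" line. In the excerpt this gives seven sessions, 34946 to 34952, with consecutive tno values 1066104 to 1066110. The regular expression is written against the line format shown in this post only, and `parse_aborts` is a name made up for illustration.

```python
import re

# Written against the "abort request received" lines quoted in this post.
ABORT = re.compile(
    r"^T_(?P<stamp>\S+?)_GMT: AM id (?P<am>0x[0-9a-f]+): abort request received"
    r".*?session=(?P<session>\d+), tno=(?P<tno>\d+)"
)


def parse_aborts(lines):
    """Return (timestamp, am_id, session, tno) for each abort request line."""
    rows = []
    for line in lines:
        match = ABORT.match(line)
        if match:
            rows.append((
                match.group("stamp"),
                match.group("am"),
                int(match.group("session")),
                int(match.group("tno")),
            ))
    return rows


# Lines copied from the excerpt in this post.
sample = [
    "T_2018-10-29_16:05:12_GMT: AM id 0x310002c9: abort request received for this AM. session=34947, tno=1066105, session aborted=-1, tno aborted=1066105, abort type=1, commandrunning=1, amg->abortquery=0",
    "T_2018-10-29_16:05:12_GMT: AM #310002c9 aborting now",
    "T_2018-10-29_16:07:23_GMT: AM id 0x2800007b: abort request received for this AM. session=34946, tno=1066104, session aborted=-1, tno aborted=1066104, abort type=1, commandrunning=1, amg->abortquery=0",
]

# Print one row per abort request, ordered by session number.
for stamp, am, session, tno in sorted(parse_aborts(sample), key=lambda r: r[2]):
    print(stamp, am, session, tno)
```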