Forum

Information and discussion related to the Kognitio on Hadoop product

YARN Containers getting killed due to "running beyond physical memory limits"

by tajones » Mon Jun 04, 2018 4:47 pm

We are running Kognitio on Hadoop across 19 data nodes with 1 container per node (110GB each). When we attempt to image some of our larger views, the cluster becomes unresponsive and we lose a container. The error in the Slider application master is "<container> is running beyond physical memory limits. Current usage: 110.1 GB of 110 GB physical memory used; 178.0 GB of 231.0 GB virtual memory used." It kills the container and starts a new one, but while this is happening the cluster is inaccessible. The most recent container loss came from attempting to image a view of approximately 65GB with about 820 million rows. Has anyone else experienced this, and can you suggest how to get around it?

Re: YARN Containers getting killed due to "running beyond physical memory limits"

by michaeld » Mon Jun 04, 2018 6:58 pm

Hi,

Container death can often be investigated by looking at the App master logs. First run 'kodoop list_clusters' from your edge node. This will give you output containing the URL of the application, similar to the example below:

>kodoop list_clusters
Kognitio Analytical Platform software for Hadoop ver80200rel170824.
(c)Copyright Kognitio Ltd 2001-2017.
...
kognitio-clus01 RUNNING application_1524749628700_0005 http://ip-10-5-3-38.eu-west-1.compute.i ... 8700_0005/
...

Take the URL for your cluster (in this case http://ip-10-5-3-38.eu-west-1.compute.i ... 28700_0005), and run 'wget <url>' on the edge node. This should produce a log file containing a dump of the process tree on the killed container. For example:

...
Container [pid=5687,containerID=container_e02_1524749628700_0005_03_000002] is running beyond physical memory limits. Current usage: 15.8 GB of 15 GB physical memory used; 32.0 GB of 31.5 GB virtual memory used. Killing container.
Dump of the process-tree for container_e02_1524749628700_0005_03_000002 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 6023 1 5687 5687 (wxsmd) 1 3 3883008 388 ./wx2/current/software/Linux/wxsmd -AlHcTt Yarn
....

The process lines begin with "|-". The memory usage columns and the FULL_CMD_LINE column can help you determine which (if any) unusual processes were running, and how much memory they were using, at the time of container death. In the past we have seen cases where many connectors were running (e.g. HDFS and related connectors) and together they exceeded the RAM allocated to the container. This can happen when no limit is set on the number of connectors per node; you can set this limit in the target string of the connector (for example 'max_connectors_per_node 5'). The problem is fixed in recent Kodoop releases.
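
If the dump contains a lot of processes, sorting the "|-" lines by the RSSMEM_USAGE column is a quick way to spot the heaviest ones. A rough sketch, assuming you saved the application master page to a file called appmaster.log (a name chosen here for illustration, e.g. with 'wget -O appmaster.log <url>') and that the columns match the header shown above:

# list the ten processes using the most resident memory (column 10, in pages)
grep '|-' appmaster.log | grep -v 'CMD_NAME' | sort -k10,10 -nr | head -10

Multiply the page counts by the system page size (usually 4096 bytes) to convert to bytes.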

If you don't see connectors in your log file, and are not able to determine which processes exceeded memory, please attach the log file here for us to investigate.

Regards,
Mike

Re: YARN Containers getting killed due to "running beyond physical memory limits"

by tajones » Tue Jun 05, 2018 3:05 pm

I'm attaching the output from our last crash. I also put the container dump into an Excel sheet so I could filter it, so I'm attaching that too. I'm not sure how to tell whether any of these relate to the connector issue you mention. Thanks.

Re: YARN Containers getting killed due to "running beyond physical memory limits"

by michaeld » Wed Jun 06, 2018 3:18 pm

The process tree shows many wxhdfscli processes were running at the time of container death:

Dump of the process-tree for container_e129_1520755789953_449013_01_000003 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 17761 25086 17761 47515 (wxhdfscli) 686 105 2495160320 62899 /data1/hadoop/yarn/local/usercache/kodoop2/appcache/application_1520755789953_449013/container_e129_1520755789953_449013_01_000003/app/wx2/ver80200rel170824/bin/../software/Linux/wxhdfscli -S 3 --nohup --libhdfs /usr/hdp/2.6.1.0-129/hadoop/../usr/lib/libhdfs.so --libjvm /usr/jdk64/jdk1.8.0_60/jre/lib/amd64/server/libjvm.so -s default
|- 17762 25091 17762 47515 (wxhdfscli) 710 123 2493964288 62768
....

These tend to result from using the HDFS connector (for example in an external table) to retrieve HDFS data. On your software version (ver80200rel170824), a single invocation of the HDFS connector can spawn too many wxhdfscli processes in Linux, exceeding YARN's limit on container memory usage. This is fixed from 8.2.0rel171101 onwards. The workaround in the meantime is to set max_connectors_per_node = 5 in the target string of the HDFS connector (or some other reasonably low number; different values can be tested for performance). Connectors and target strings are documented in the ‘External tables and connectors’ section of the Kognitio Guide at http://kognitio.com/forums/viewtopic.php?f=2&t=3. An easy way to change the target string of an existing connector is to connect to the system with Kognitio Console, browse to the connector in the system tree, enter the setting under ‘target’, and click ‘save target changes’.
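
As a quick check that connector processes account for the memory, you can count the wxhdfscli lines in the dump and total their resident memory. A rough sketch along the same lines as the earlier one (again assuming the dump is saved as appmaster.log and a 4096-byte page size):

# count wxhdfscli processes and total their RSS (column 10 is RSSMEM_USAGE in pages)
grep '|-' appmaster.log | awk '$6 == "(wxhdfscli)" { n++; rss += $10 } END { printf "%d wxhdfscli processes, ~%.1f GB resident\n", n, rss * 4096 / 1024 / 1024 / 1024 }'

If that total is a sizeable fraction of your container memory, the max_connectors_per_node workaround above should help.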

Using ORC connectors can sometimes result in the same problem. On versions prior to 8.2.0rel171101, the fix is the same: set max_connectors_per_node in the connector's target string. From 8.2.0rel171101, you need to set the 'max_mb_ram' setting to 600 in the target string; this tells the external Java processes not to exceed 600MB of RAM.
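
For illustration only, the relevant part of a target string with these settings might look something like the lines below. The max_connectors_per_node and max_mb_ram values come from this thread, but the exact attribute layout (and any other attributes your connector already has, such as its HDFS location settings) should be checked against the ‘External tables and connectors’ section of the Kognitio Guide linked above:

max_connectors_per_node 5
max_mb_ram 600

Leave the connector's existing attributes in place and just add the setting appropriate for your version.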

Regards,
Mike