Find which Java process is using most of the CPU:
$ ps u -C java
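To narrow it down to the thread that is actually burning CPU inside that process, a common follow-up (a minimal sketch, assuming a Linux top with a per-thread view; 26157 is just an illustrative thread id) is to list the threads by CPU and convert the hot thread id to hex, which is the format of the nid= field in the jstack output:
$ top -H -p PId              # per-thread CPU usage inside the Java process
$ printf '0x%x\n' 26157      # illustrative thread id; prints 0x662d, the nid= format used in the dump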
Generate the Java thread dump:
$ jstack -l PId > PId-threads.txt
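It is worth taking a few dumps some seconds apart, so a persistent blockage can be told apart from a momentary one (a sketch using the same PId placeholder):
$ for i in 1 2 3; do jstack -l PId > PId-threads-$i.txt; sleep 10; done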
From the Java thread dump we can count the thread states:
$ awk '/State: / { print }' < PId-threads.txt | sort | uniq -c
    450    java.lang.Thread.State: BLOCKED (on object monitor)
    240    java.lang.Thread.State: RUNNABLE
     47    java.lang.Thread.State: TIMED_WAITING (on object monitor)
    294    java.lang.Thread.State: TIMED_WAITING (parking)
     31    java.lang.Thread.State: TIMED_WAITING (sleeping)
     42    java.lang.Thread.State: WAITING (on object monitor)
     62    java.lang.Thread.State: WAITING (parking)
From this, we focus on the threads that are “waiting to lock”:
$ awk '/waiting to lock / { print }' < PId-threads.txt | sort | uniq -c
      1    - waiting to lock <0x0000000600a027d8> (a org.apache.log4j.spi.RootLogger)
    294    - waiting to lock <0x0000000600f2e770> (a java.util.Hashtable)
     19    - waiting to lock <0x0000000600f36fc8> (a java.lang.Object)
      1    - waiting to lock <0x000000072f6e6708> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
The main bottleneck is the lock <0x0000000600f2e770>, a java.util.Hashtable that 294 threads are waiting on.
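To see exactly which threads are queued on that Hashtable, the thread header of every waiter can be printed (a sketch that relies on the jstack layout, where the quoted thread-name line precedes its 'waiting to lock' line):
$ awk '/^"/ { header = $0 } /waiting to lock <0x0000000600f2e770>/ { print header }' < PId-threads.txt | head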
Among the BLOCKED threads, many are waiting in Client.getConnection:
"Thread-132985" prio=10 tid=0x00007fec40784800 nid=0x662d waiting for monitor entry [0x00007fec18cd4000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1449)
- waiting to lock <0x0000000600f2e770> (a java.util.Hashtable)
And many others are waiting in Client$Connection.close:
"IPC Client (738091550) connection to /10.66.2.38:9022 from hadoop" daemon prio=10 tid=0x00007fec41c1f800 nid=0x2dcc waiting for monitor entry [0x00007fec15da6000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.hadoop.ipc.Client$Connection.close(Client.java:1135)
- waiting to lock <0x0000000600f2e770> (a java.util.Hashtable)
All of them are blocked on the same monitor, <0x0000000600f2e770>.
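A quick way to locate the holder of that monitor is to grep for the matching 'locked' line together with the frames above it (a sketch; adjust the amount of context to your dump):
$ grep -B 30 -e '- locked <0x0000000600f2e770>' PId-threads.txt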
Searching the same thread dump for the thread that holds this lock (for example with the grep above), we can find the culprit. It will look like this:
"Thread-133346" prio=10 tid=0x00007fec40ac7800 nid=0x747e runnable [0x00007fec17cc4000]
java.lang.Thread.State: RUNNABLE
at java.lang.Thread.<init>(Thread.java:234)
at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:396)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1452)
- locked <0x0000000600f2e770> (a java.util.Hashtable)
at org.apache.hadoop.ipc.Client.call(Client.java:1381)
at org.apache.hadoop.ipc.Client.call(Client.java:1363)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:215)
at com.sun.proxy.$Proxy42.getApplicationReport(Unknown Source)
at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationReport(ApplicationClientProtocolPBClientImpl.java:163)
at sun.reflect.GeneratedMethodAccessor39.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
at com.sun.proxy.$Proxy43.getApplicationReport(Unknown Source)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationReport(YarnClientImpl.java:291)
at org.apache.hadoop.mapred.ResourceMgrDelegate.getApplicationReport(ResourceMgrDelegate.java:294)
at org.apache.hadoop.mapred.ClientServiceDelegate.getProxy(ClientServiceDelegate.java:152)
at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:319)
- locked <0x0000000733033970> (a org.apache.hadoop.mapred.ClientServiceDelegate)
at org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:419)
at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:532)
at org.apache.hadoop.mapreduce.Job$1.run(Job.java:314)
at org.apache.hadoop.mapreduce.Job$1.run(Job.java:311)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:311)
- locked <0x0000000733032e20> (a org.apache.hadoop.mapreduce.Job)
at org.apache.hadoop.mapreduce.Job.getJobState(Job.java:347)
at org.apache.hadoop.mapred.JobClient$NetworkedJob.getJobState(JobClient.java:295)
- locked <0x0000000733032e10> (a org.apache.hadoop.mapred.JobClient$NetworkedJob)
at org.apache.hadoop.hive.ql.exec.mr.HadoopJobExecHelper.progress(HadoopJobExecHelper.java:244)
at org.apache.hadoop.hive.ql.exec.mr.HadoopJobExecHelper.progress(HadoopJobExecHelper.java:547)
at org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:426)
at org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:136)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:72)
This kind of blocked state is not technically a deadlock; it is simply one thread holding the lock on a shared resource while many other threads queue up waiting for the same lock.
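As a cross-check, jstack runs its own deadlock detector and appends a "Found one Java-level deadlock" section when it detects a lock cycle, so the absence of any such section in the dump is consistent with plain lock contention rather than a deadlock (a sketch):
$ grep -i 'deadlock' PId-threads.txt     # empty output: jstack's detector found no lock cycle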