Debugging Java Threads


Which Java process is using most of the CPU:

 $ ps u -C java

Generate the Java thread dump:

$ jstack -l PId > PId-threads.txt

From the Java threads I can count:

$ awk '/State: / { print }' < PId-threads.txt  | sort | uniq -c
 450    java.lang.Thread.State: BLOCKED (on object monitor)
 240    java.lang.Thread.State: RUNNABLE
  47    java.lang.Thread.State: TIMED_WAITING (on object monitor)
 294    java.lang.Thread.State: TIMED_WAITING (parking)
  31    java.lang.Thread.State: TIMED_WAITING (sleeping)
  42    java.lang.Thread.State: WAITING (on object monitor)
  62    java.lang.Thread.State: WAITING (parking)

From this, we search on the ones that are “waiting to lock”:

$ awk '/waiting to lock / { print }' < PId-threads.txt  | sort | uniq -c
   1     - waiting to lock <0x0000000600a027d8> (a org.apache.log4j.spi.RootLogger)
 294     - waiting to lock <0x0000000600f2e770> (a java.util.Hashtable)
  19     - waiting to lock <0x0000000600f36fc8> (a java.lang.Object)
   1     - waiting to lock <0x000000072f6e6708> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)

Main thread blocking here is: <0x0000000600f2e770>.

From the BLOCKED threads, we have many processes waiting for getConnection:

"Thread-132985" prio=10 tid=0x00007fec40784800 nid=0x662d waiting for monitor entry [0x00007fec18cd4000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1449)
    - waiting to lock <0x0000000600f2e770> (a java.util.Hashtable)

And many others waiting for Connection.close:

"IPC Client (738091550) connection to /10.66.2.38:9022 from hadoop" daemon prio=10 tid=0x00007fec41c1f800 nid=0x2dcc waiting for monitor entry [0x00007fec15da6000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.apache.hadoop.ipc.Client$Connection.close(Client.java:1135)
    - waiting to lock <0x0000000600f2e770> (a java.util.Hashtable)

All of them are BLOCKED by <0x0000000600f2e770>.

Opening other PId’s we can find the Java thread that has the lock on this resource. The culprit will look like:

"Thread-133346" prio=10 tid=0x00007fec40ac7800 nid=0x747e runnable [0x00007fec17cc4000]
   java.lang.Thread.State: RUNNABLE
    at java.lang.Thread.<init>(Thread.java:234)
    at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:396)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1452)
    - locked <0x0000000600f2e770> (a java.util.Hashtable)
    at org.apache.hadoop.ipc.Client.call(Client.java:1381)
    at org.apache.hadoop.ipc.Client.call(Client.java:1363)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:215)
    at com.sun.proxy.$Proxy42.getApplicationReport(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationReport(ApplicationClientProtocolPBClientImpl.java:163)
    at sun.reflect.GeneratedMethodAccessor39.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
    at com.sun.proxy.$Proxy43.getApplicationReport(Unknown Source)
    at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationReport(YarnClientImpl.java:291)
    at org.apache.hadoop.mapred.ResourceMgrDelegate.getApplicationReport(ResourceMgrDelegate.java:294)
    at org.apache.hadoop.mapred.ClientServiceDelegate.getProxy(ClientServiceDelegate.java:152)
    at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:319)
    - locked <0x0000000733033970> (a org.apache.hadoop.mapred.ClientServiceDelegate)
    at org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:419)
    at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:532)
    at org.apache.hadoop.mapreduce.Job$1.run(Job.java:314)
    at org.apache.hadoop.mapreduce.Job$1.run(Job.java:311)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:311)
    - locked <0x0000000733032e20> (a org.apache.hadoop.mapreduce.Job)
    at org.apache.hadoop.mapreduce.Job.getJobState(Job.java:347)
    at org.apache.hadoop.mapred.JobClient$NetworkedJob.getJobState(JobClient.java:295)
    - locked <0x0000000733032e10> (a org.apache.hadoop.mapred.JobClient$NetworkedJob)
    at org.apache.hadoop.hive.ql.exec.mr.HadoopJobExecHelper.progress(HadoopJobExecHelper.java:244)
    at org.apache.hadoop.hive.ql.exec.mr.HadoopJobExecHelper.progress(HadoopJobExecHelper.java:547)
    at org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:426)
    at org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:136)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:72)

This kind of Blocked Status is not technically a Deadlock, but just a thread blocking other threads: locked status for a resource while many other objects start queuing waiting for the same resource.

Anuncios

Acerca de hvivani

sysadmin, developer, RHCSA
Esta entrada fue publicada en Uncategorized y etiquetada , . Guarda el enlace permanente.

Responder

Introduce tus datos o haz clic en un icono para iniciar sesión:

Logo de WordPress.com

Estás comentando usando tu cuenta de WordPress.com. Cerrar sesión / Cambiar )

Imagen de Twitter

Estás comentando usando tu cuenta de Twitter. Cerrar sesión / Cambiar )

Foto de Facebook

Estás comentando usando tu cuenta de Facebook. Cerrar sesión / Cambiar )

Google+ photo

Estás comentando usando tu cuenta de Google+. Cerrar sesión / Cambiar )

Conectando a %s