HBase and ZooKeeper debugging

I came across some scenarios where an application (e.g. MapReduce) communicating with HBase on YARN could fail silently with a timeout like the following:

2017-01-30 19:42:03,657 DEBUG [main] org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation: locateRegionInMeta parentTable=hbase:meta, metaLocation=, attempt=9 of 35 failed; retrying after sleep of 10095 because: Failed after attempts=36, exceptions:
Mon Jan 30 19:42:03 UTC 2017, null, java.net.SocketTimeoutException: callTimeout=60000, callDuration=68463: row 'test2,#cmrNo acctNo,99999999999999' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=ip-172-31-3-246.us-west-2.compute.internal,16000,1485539268192, seqNum=0

The root cause here was not a misconfiguration on the server or network side, but a library missing from the classpath.

When there is a ZooKeeper issue, the underlying exception may stay hidden, depending on the retry parameters: the client keeps retrying and only the timeouts show up in the log.
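
To get a feeling for how long the retries can mask the real error, here is a rough sketch of the client backoff. It assumes the multiplier table matches HConstants.RETRY_BACKOFF in HBase 1.x and that hbase.client.pause is at its default of 100 ms; both are assumptions, so check them against your version. The class and method names are made up for illustration. With those assumptions, attempt 9 sleeps for about 10 seconds, which is consistent with the "sleep of 10095" in the log above (HBase adds a small random jitter that the sketch ignores).

```java
/** Rough estimate of how long HBase client retries can keep an error hidden. */
final class RetryBackoffEstimate {

    // Multipliers assumed to match HConstants.RETRY_BACKOFF in HBase 1.x.
    private static final int[] RETRY_BACKOFF =
            {1, 2, 3, 5, 10, 20, 40, 100, 100, 100, 100, 200, 200};

    /** Sleep before the next try, ignoring the small random jitter HBase adds. */
    static long pauseMillis(long basePauseMillis, int attempt) {
        // Attempts past the end of the table reuse the last multiplier.
        int index = Math.max(0, Math.min(attempt, RETRY_BACKOFF.length - 1));
        return basePauseMillis * RETRY_BACKOFF[index];
    }

    /** Total time spent sleeping across the given number of retries. */
    static long totalSleepMillis(long basePauseMillis, int retries) {
        long total = 0;
        for (int attempt = 0; attempt < retries; attempt++) {
            total += pauseMillis(basePauseMillis, attempt);
        }
        return total;
    }

    public static void main(String[] args) {
        long basePause = 100; // assumed default of hbase.client.pause, in ms
        for (int retries : new int[] {1, 10, 35}) {
            System.out.println(retries + " retries -> "
                    + totalSleepMillis(basePause, retries) + " ms of sleep");
        }
    }
}
```

This only counts the sleeps between attempts. Each attempt can also wait for its own call timeout, so the real delay before an exception surfaces can be considerably longer.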

In this case, I added or modified the following parameters in the MapReduce Java application, which gave more visibility into the communication layer between ZooKeeper and HBase:

conf.set("hbase.client.retries.number", Integer.toString(1));
conf.set("zookeeper.session.timeout", Integer.toString(60000));
conf.set("zookeeper.recovery.retry", Integer.toString(1));

After this, the following exception was visible:

Exception: java.lang.NoClassDefFoundError: com/yammer/metrics/core/Gauge
 at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:157)
 at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:65)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: com.google.protobuf.ServiceException: java.lang.NoClassDefFoundError: com/yammer/metrics/core/Gauge
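
Since the underlying problem was a class missing from the classpath, a quick check at application startup can surface it before any HBase call is made. The following is a minimal sketch using only the JDK; the class name ClasspathCheck and the helper isOnClasspath are made up for illustration, and the list of required classes is just the one from the stack trace above.

```java
/** Checks whether classes can be loaded from the current classpath. */
final class ClasspathCheck {

    /** Returns true if the named class is visible to this class's loader. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class reported as missing in the stack trace above.
        String[] required = {"com.yammer.metrics.core.Gauge"};
        for (String name : required) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```

Note that this checks the classpath of the JVM where it runs. In a MapReduce job the tasks run in separate containers, so the check is only meaningful if it is executed where the HBase client code actually runs (for example, inside the mapper setup).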


Tuning these parameters makes the application exit quickly when there is a problem with the cluster. This can be desirable in a production environment.

Reducing the parameters to more conservative values could yield better recovery times. Setting zookeeper.recovery.retry to 0 will still result in up to two connection attempts to each ZooKeeper server in the quorum, and will cause an application failure in under a minute should ZooKeeper connectivity be lost during execution.


As an additional note, if you are receiving timeouts because the application is trying to contact localhost instead of the quorum servers, you can set the quorum parameters explicitly:

// HBase through MR on Yarn is trying to connect to localhost instead of quorum.
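
As a sketch of what such explicit settings can look like: hbase.zookeeper.quorum and hbase.zookeeper.property.clientPort are the standard HBase client properties for this, while the host names, the port and the helper class below are placeholders made up for the example. The helper only builds the key/value pairs with plain JDK types so it can be run on its own; in the job you would copy each entry into the Hadoop configuration with conf.set(key, value), as in the snippet earlier in this post.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Builds the client settings that point HBase at an explicit ZooKeeper quorum. */
final class QuorumOverrides {

    /**
     * @param quorumHosts comma-separated ZooKeeper hosts (placeholder values in main)
     * @param clientPort  ZooKeeper client port (2181 is the usual default)
     */
    static Map<String, String> quorumSettings(String quorumHosts, int clientPort) {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("hbase.zookeeper.quorum", quorumHosts);
        settings.put("hbase.zookeeper.property.clientPort", Integer.toString(clientPort));
        return settings;
    }

    public static void main(String[] args) {
        // Hypothetical host names; replace with the real quorum servers.
        Map<String, String> settings =
                quorumSettings("zk1.example.com,zk2.example.com,zk3.example.com", 2181);
        // In the MapReduce driver, apply each entry with conf.set(key, value).
        settings.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```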


I’ve added a couple of examples of MapReduce applications for HBase here: https://github.com/hvivani/bigdata/tree/master/hbase


Some additional notes on this behavior: https://discuss.pivotal.io/hc/en-us/articles/200933006-Hbase-application-hangs-indefinitely-connecting-to-zookeeper



About hvivani

sysadmin, developer, RHCSA