Tag Archives: MapReduce

HBase and ZooKeeper debugging


I came across some scenarios where an application (e.g. a MapReduce job) communicating with HBase through YARN could silently fail with a timeout like the following: 2017-01-30 19:42:03,657 DEBUG [main] org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation: locateRegionInMeta parentTable=hbase:meta, metaLocation=, attempt=9 of 35 failed; retrying after sleep of … Continue reading

Posted in Uncategorized

FileInputFormat vs. CombineFileInputFormat


When you put a file into HDFS, it is split into blocks of 128 MB (the default block size for HDFS on EMR). Consider a file big enough to consume 10 blocks. When you read that file from HDFS as an input … Continue reading
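The block arithmetic in this excerpt can be sketched in a few lines. This is only an illustration: the helper name `block_count` is hypothetical, and it assumes every block except possibly the last is filled completely.

```python
import math

BLOCK_SIZE_MB = 128  # block size quoted in the text (HDFS default on EMR)


def block_count(file_size_mb: float, block_size_mb: int = BLOCK_SIZE_MB) -> int:
    """Number of HDFS blocks needed to hold a file of the given size.

    Hypothetical helper for illustration; it is not part of any Hadoop API.
    """
    if file_size_mb <= 0:
        return 0
    # Every block but the last is full, so round the quotient up.
    return math.ceil(file_size_mb / block_size_mb)


if __name__ == "__main__":
    # A 1280 MB file fills exactly 10 blocks of 128 MB.
    print(block_count(1280))
    # One extra megabyte spills into an additional block.
    print(block_count(1281))
```

A 1280 MB file is the "10 blocks" case from the excerpt. How those blocks map to input splits depends on the InputFormat in use, which this sketch does not model.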

Posted in Uncategorized

Hadoop: Output Committer Notes


OutputCommitter describes the commit of task output for a MapReduce job. The MapReduce framework relies on the OutputCommitter of the job to: Set up the job during initialization; for example, create the temporary output directory for the job. Job setup … Continue reading

Posted in Uncategorized

YARN / MapReduce memory settings


On Hadoop 1, we used mapred.child.java.opts to set the Java heap size for the TaskTracker child processes. With YARN, that parameter has been deprecated in favor of: mapreduce.map.java.opts – This parameter is passed to the JVM for mappers. … Continue reading
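As a rough sketch of how a value for mapreduce.map.java.opts might be derived: the helper below sizes the JVM heap as a fraction of the YARN container size. The helper name, the 0.8 fraction (a common rule of thumb, not something stated in this post), and the pairing with mapreduce.map.memory.mb are assumptions made for illustration.

```python
def java_opts_for_container(container_mb: int, heap_fraction: float = 0.8) -> str:
    """Return a -Xmx JVM option sized as a fraction of the container memory.

    Hypothetical helper; heap_fraction=0.8 is an assumed rule of thumb that
    leaves headroom for non-heap JVM memory inside the container.
    """
    if container_mb <= 0:
        raise ValueError("container_mb must be positive")
    if not 0 < heap_fraction <= 1:
        raise ValueError("heap_fraction must be in (0, 1]")
    heap_mb = int(container_mb * heap_fraction)  # truncate to whole megabytes
    return f"-Xmx{heap_mb}m"


if __name__ == "__main__":
    container_mb = 2048  # assumed container size for a map task
    settings = {
        "mapreduce.map.memory.mb": str(container_mb),
        "mapreduce.map.java.opts": java_opts_for_container(container_mb),
    }
    for key, value in settings.items():
        print(f"{key}={value}")
```

Keeping the heap below the container size matters because YARN enforces the container limit on the whole process, not just on the Java heap.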

Posted in Uncategorized

Hadoop 1 vs Hadoop 2


There are a lot of articles about this, but I just needed a good summary of the concepts: Hadoop 1: A master process called the JobTracker is the central scheduler for all MapReduce jobs in the cluster. Nodes have a TaskTracker … Continue reading

Posted in Uncategorized

MapReduce: Compression and Input Splits


This is something that always raises doubts: When considering compressed data that will be processed by MapReduce, it is important to check whether the compression format supports splitting. If not, the number of map tasks may not be what you expect. … Continue reading
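The effect on the number of map tasks can be sketched as follows. This is a simplified model, not how Hadoop actually computes splits: it assumes one split per 128 MB block for splittable input and a single split for a whole non-splittable file. The function name and the small codec table are hypothetical. Only gzip (not splittable) and bzip2 (splittable) are listed here; check other codecs before relying on them.

```python
import math

# Splittability of a few formats when used on plain files (illustrative table).
SPLITTABLE = {
    "none": True,    # uncompressed text
    "bzip2": True,   # block-oriented format, supports splitting
    "gzip": False,   # a single stream, must be read from the start
}


def estimated_map_tasks(file_size_mb: float, codec: str, block_size_mb: int = 128) -> int:
    """Rough estimate of map tasks for one input file.

    Hypothetical helper: one task per block if the format is splittable,
    otherwise one task for the whole file.
    """
    if codec not in SPLITTABLE:
        raise ValueError(f"unknown codec: {codec}")
    if file_size_mb <= 0:
        return 0
    if not SPLITTABLE[codec]:
        return 1
    return math.ceil(file_size_mb / block_size_mb)


if __name__ == "__main__":
    for codec in ("none", "bzip2", "gzip"):
        print(codec, estimated_map_tasks(1280, codec))
```

Under this model a 1280 MB gzip file is handled by a single mapper, while the same amount of bzip2 or uncompressed data could be spread across ten.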

Posted in Uncategorized