Tag archive: Hadoop

MapReduce: Compression and Input Splits


This is something that always raises doubts: when compressed data will be processed by MapReduce, it is important to check whether the compression format supports splitting. If it does not, the number of map tasks may not be what you expect. … Continue reading
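As a rough sketch of why splittability matters (the file and block sizes below are hypothetical): with a splittable codec such as bzip2, the framework can create roughly one input split per HDFS block, while a non-splittable codec such as gzip forces the whole file into a single split, and therefore a single map task.

```shell
# Hypothetical input: a 1 GiB compressed file on HDFS with 128 MiB blocks.
FILE_MB=1024
BLOCK_MB=128

# Splittable codec (e.g. bzip2): roughly one map task per block.
echo "splittable: $(( (FILE_MB + BLOCK_MB - 1) / BLOCK_MB )) map tasks"

# Non-splittable codec (e.g. gzip): the entire file is one split.
echo "non-splittable: 1 map task"
```

With these numbers the splittable case yields 8 map tasks; the non-splittable case always yields 1, no matter how large the file is.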


skb rides the rocket


[21068723.434629] xen_netfront: xennet: skb rides the rocket: 19 slots

The "skb rides the rocket" bug often affects Hadoop clusters. Each time I face it, I remember this excellent blog post from Brendan Gregg: http://www.brendangregg.com/blog/2014-09-11/perf-kernel-line-tracing.html Enjoy
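One mitigation I have seen suggested for this bug is disabling scatter-gather on the guest's network interface, which makes the kernel linearize outgoing skbs so they stop exceeding the xen_netfront slot limit. A hedged sketch (the interface name eth0 is a placeholder, and the throughput trade-off depends on your workload):

```shell
# Assumption: the affected guest NIC is eth0 -- adjust to your interface.
# Turn off scatter-gather so large skbs are linearized before transmit.
sudo ethtool -K eth0 sg off

# Verify the current offload settings:
ethtool -k eth0 | grep scatter-gather
```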


yarn: change configuration and restart node manager on a live cluster


This procedure changes the YARN configuration on a live cluster, propagates the changes to all the nodes and restarts the YARN node manager. Both commands list all the nodes on the cluster and then filter the DNS name to … Continue reading
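The idea can be sketched as the loop below. This is only an illustration, assuming passwordless SSH to the workers, a Cloudera-style service name (hadoop-yarn-nodemanager) and a placeholder config path; `yarn node -list` prints one `host:port STATE …` line per running node manager, from which the hostname is extracted:

```shell
# Push the updated yarn-site.xml to every RUNNING node manager and
# restart the service there. All paths and service names are placeholders.
for node in $(yarn node -list 2>/dev/null | awk -F: '/RUNNING/ {print $1}'); do
  scp yarn-site.xml "$node:/etc/hadoop/conf/yarn-site.xml"
  ssh "$node" "sudo service hadoop-yarn-nodemanager restart"
done
```

Restarting node managers one at a time like this keeps the rest of the cluster serving containers while the change rolls out.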


Hadoop 1 vs Hadoop 2 – How many slots do I have per node?


This is a topic that always raises a discussion… In Hadoop 1, the number of tasks launched per node was specified via the settings mapred.map.tasks.maximum and mapred.reduce.tasks.maximum, but these are ignored when set on Hadoop 2. In Hadoop 2 with … Seguir leyendo
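In Hadoop 2 the per-node parallelism falls out of memory arithmetic rather than a fixed slot count: the scheduler fits as many containers as the node manager's memory allows. A minimal sketch with hypothetical values for the two relevant settings:

```shell
# Hypothetical node: 24 GiB reserved for YARN containers,
# 2 GiB requested per map task.
NODE_MB=24576   # yarn.nodemanager.resource.memory-mb
MAP_MB=2048     # mapreduce.map.memory.mb

echo "max concurrent map containers per node: $(( NODE_MB / MAP_MB ))"
```

With these numbers a node runs at most 12 map containers at once; change either setting and the effective "slot" count changes with it.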


Hadoop useful commands


– Copy fromLocal/ToLocal from/to S3:

$ bin/hadoop fs -copyToLocal s3://my-bucket/myfile.rb /home/hadoop/myfile.rb
$ bin/hadoop fs -copyFromLocal job5.avro s3://my-bucket/input

– Merge all the files from one folder into one single file:

$ hadoop jar ~/lib/emr-s3distcp-1.0.jar --src s3://my-bucket/my-folder/ --dest s3://my-bucket/logs/all-the-files-merged.log --groupBy '.*(*)' --outputCodec … Continue reading


Hadoop: HDFS find / recover corrupt blocks


1) Search for corrupt files: A command like 'hadoop fsck /' will show the status of the filesystem and any corrupt files. This command ignores lines containing nothing but dots and lines about replication: hadoop fsck … Continue reading
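The filtering described above can be sketched with a plain grep pipeline. The sample fsck output below is fabricated for illustration; the filter drops the progress lines made only of dots and the under-replication noise, leaving just the lines worth acting on:

```shell
# Simulated 'hadoop fsck /' output (fabricated lines), filtered as the
# post describes: drop dot-only progress lines and replication chatter.
printf '%s\n' \
  '............' \
  '/user/a/part-0001: CORRUPT blockpool BP-1 block blk_42' \
  '/user/b/ok.txt:  Under replicated BP-1:blk_7. Target Replicas is 3' \
| egrep -v '^\.+$' | grep -v -i replica
```

Only the CORRUPT line survives the filter; against a real cluster you would replace the `printf` with `hadoop fsck /`.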


HDFS: Cluster to cluster copy with distcp


This is the format of the distcp command to copy from HDFS to HDFS, with both the source and destination clusters on Amazon AWS: hadoop distcp "hdfs://ec2-54-86-202-252.compute-1.amazonaws.comec2-2:9000/tmp/test.txt" "hdfs://ec2-54-86-229-249.compute-1.amazonaws.comec2-2:9000/tmp/test1.txt" More information about distcp: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Installation-Guide/cdh4ig_topic_7_2.html http://hadoop.apache.org/docs/r1.2.1/distcp2.html
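For recurring cluster-to-cluster copies, distcp also has an incremental mode. A sketch with placeholder namenode hostnames (not the hosts from the example above):

```shell
# -update copies only files that are missing or changed on the destination,
# -p preserves file attributes, -i ignores individual failures instead of
# aborting the whole job. Hostnames and paths are placeholders.
hadoop distcp -update -p -i \
  "hdfs://source-nn:9000/data" \
  "hdfs://dest-nn:9000/data"
```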


HDFS Architecture


The design of the HDFS filesystem is based on the Google File System (GFS). – It can store a huge amount of data (terabytes or petabytes). – It is designed to store the data across a large … Continue reading
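A back-of-the-envelope sketch of the storage arithmetic behind that design, using the common example values of a 128 MiB block size and the default replication factor of 3 (the file size is hypothetical):

```shell
# Hypothetical file: 1 TiB, stored in 128 MiB blocks, replicated 3 times.
FILE_MB=1048576
BLOCK_MB=128
REPL=3

BLOCKS=$(( (FILE_MB + BLOCK_MB - 1) / BLOCK_MB ))
echo "HDFS blocks: $BLOCKS"
echo "raw storage used: $(( FILE_MB * REPL )) MiB"
```

The block count (8192 here) is what the NameNode must track in memory for this one file, and the raw footprint is three times the logical size, which is why capacity planning on HDFS always accounts for replication.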
