Tag archive: BigData

Consider boosting spark.yarn.executor.memoryOverhead


This is a very specific error related to how Spark executors coexist with YARN containers. You will typically see errors like this one in the application container logs: 15/03/12 18:53:46 WARN YarnAllocator: Container killed by YARN for exceeding memory … Continue reading
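As a quick illustration, the setting named in the title is usually raised at submit time; a minimal sketch, assuming Spark 1.x on YARN (the 1024 MB value and my_app.jar are illustrative assumptions, not recommendations):

# Reserve extra off-heap headroom per executor so YARN does not kill
# the container (value in MB; tune to your workload):
$ spark-submit --master yarn \
    --conf spark.yarn.executor.memoryOverhead=1024 \
    --conf spark.executor.memory=4g \
    my_app.jar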


Spark Notes


Apache Spark is an open-source cluster computing framework originally developed at the University of California, Berkeley, and later donated to the Apache Software Foundation, where it remains today. In contrast to Hadoop’s two-stage, disk-based MapReduce paradigm, Spark’s multi-stage in-memory … Continue reading


How Apache Tez works


Tez enables developers to build end-user applications with much better performance and flexibility. It generalizes the MapReduce paradigm to a more powerful framework based on expressing computations as a dataflow graph. Tez is designed to get around limitations imposed by … Continue reading
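A quick way to see Tez in action is through Hive, which can use it as its execution engine; a minimal sketch, assuming a Hive installation with Tez available (the table name is hypothetical):

# Run one query on Tez instead of classic MapReduce:
$ hive --hiveconf hive.execution.engine=tez -e 'SELECT COUNT(*) FROM mytable;'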


Create a really big file / Crear un archivo realmente grande


This is sometimes useful when playing with big data. Instead of running a dd command and waiting for the file to be created block by block, we can run: $ fallocate -l 200G /mnt/reallyBigFile.csv It essentially “allocates” all of the space you’re seeking, but … Continue reading
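For illustration, here is one way to verify what fallocate did, reusing the post's example path; both the apparent size (ls) and the allocated size (du) should report about 200G, since fallocate reserves real blocks rather than creating a sparse file:

$ fallocate -l 200G /mnt/reallyBigFile.csv
# Apparent file size:
$ ls -lh /mnt/reallyBigFile.csv
# Space actually allocated on disk:
$ du -h /mnt/reallyBigFile.csv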


Hadoop 1 vs Hadoop 2 – How many slots do I have per node ?


This is a topic that always raises discussion… In Hadoop 1, the number of tasks launched per node was specified via the settings mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum. But these settings are ignored on Hadoop 2. In Hadoop 2 with … Continue reading
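In Hadoop 2/YARN the per-node task count falls out of memory arithmetic rather than a fixed slot setting; a hedged sketch with illustrative values (the numbers are assumptions, the property names are the standard ones):

# yarn-site.xml: memory YARN may hand out on this node
yarn.nodemanager.resource.memory-mb = 24576
# mapred-site.xml: memory each map/reduce container requests
mapreduce.map.memory.mb = 2048
mapreduce.reduce.memory.mb = 4096
# So this node can run up to 24576/2048 = 12 concurrent maps,
# or 24576/4096 = 6 concurrent reduces.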


Hadoop: HDFS find / recover corrupt blocks


1) Search for corrupt files: A command like ‘hadoop fsck /’ will show the status of the filesystem and any corrupt files. This command will ignore lines with nothing but dots and lines talking about replication: hadoop fsck … Continue reading
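The truncated command likely pipes fsck through grep; a plausible sketch of that filtering, plus a follow-up to inspect one damaged file (the exact filter patterns and the path are assumptions):

# Show fsck output, skipping dot-progress lines and replication chatter:
$ hadoop fsck / | egrep -v '^\.+$' | grep -v replica
# Then list blocks and replica locations for a suspect file:
$ hadoop fsck /path/to/file -files -blocks -locations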


HDFS: Cluster to cluster copy with distcp


This is the format of the distcp command to copy from HDFS to HDFS, with source and destination clusters on Amazon AWS: hadoop distcp “hdfs://ec2-54-86-202-252.compute-1.amazonaws.com:9000/tmp/test.txt” “hdfs://ec2-54-86-229-249.compute-1.amazonaws.com:9000/tmp/test1.txt” More information about distcp: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Installation-Guide/cdh4ig_topic_7_2.html http://hadoop.apache.org/docs/r1.2.1/distcp2.html
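A common variant, sketched here with hypothetical NameNode hostnames: copy a whole directory between clusters, skipping files that are already up to date and capping the number of copy maps:

# -update copies only missing/changed files; -m limits parallel map tasks:
$ hadoop distcp -update -m 20 hdfs://src-namenode:9000/data hdfs://dst-namenode:9000/data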


Arquitectura HDFS


The design of the HDFS filesystem is based on the Google File System (GFS). – It is capable of storing a very large amount of data (terabytes or petabytes). – It is designed to store data across a large … Continue reading
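As a small illustration of that block-based design, HDFS can report how a stored file was split into blocks and where the replicas live; a sketch with a hypothetical file name:

$ hdfs dfs -put bigfile.csv /tmp/bigfile.csv
# Show the file's blocks, their sizes, and the datanodes holding replicas:
$ hdfs fsck /tmp/bigfile.csv -files -blocks -locations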


Hive logs to stdout


We often need to debug a Hive query that is returning an error. An easy way is to enable the console logger: hive.root.logger specifies the logging level as well as the log destination. Specifying console as the target sends the logs to … Continue reading
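A minimal sketch of that invocation (DEBUG is an illustrative choice; any standard log4j level works):

# Start the Hive CLI with logs sent to the console at DEBUG level:
$ hive --hiveconf hive.root.logger=DEBUG,console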


Hive query with JOIN, GROUP BY and SUM does not return results


On Hive 0.11 and lower versions, if we set: set hive.optimize.skewjoin=true; set hive.auto.convert.join=false; a query with JOIN, GROUP BY and SUM does not return results. But if we make the query a little simpler, using JOIN but not GROUP … Continue reading
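A hedged sketch of the failing query shape and the obvious workaround (table and column names are made up; the workaround simply reverts the skew-join optimization):

hive> set hive.optimize.skewjoin=false;
hive> SELECT a.k, SUM(b.v)
    > FROM t1 a JOIN t2 b ON (a.k = b.k)
    > GROUP BY a.k;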
