Tag Archives: BigData

Creating Bigtop patches


To contribute to the Bigtop project, we need to submit a patch. We should follow this process for managing our proposed contributions: create a Jira ticket with a description of the problem. (Note: the ticket should be Minor priority for most … Continue reading
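
As a rough sketch of the patch step itself (assuming the usual Apache git workflow; the ticket id BIGTOP-XXXX and the branch name are placeholders, not values from the post):

# Hypothetical sketch: produce a patch file to attach to the Jira ticket.
$ git clone https://github.com/apache/bigtop.git && cd bigtop
$ git checkout -b BIGTOP-XXXX
# ... make and commit the change ...
# Generate a patch against master, named after the ticket.
$ git format-patch origin/master --stdout > BIGTOP-XXXX.patch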


FileInputFormat vs. CombineFileInputFormat


When you put a file into HDFS, it is split into blocks of 128 MB (the default block size for HDFS on EMR). Consider a file big enough to consume 10 blocks. When you read that file from HDFS as an input … Continue reading
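
To see the block layout the excerpt refers to, one option is hdfs fsck; and on the Hive side, CombineHiveInputFormat can pack several of those blocks into a single split (a sketch; the path and the 256 MB max split size are illustrative assumptions):

# List the blocks that make up the file and check the configured block size.
$ hdfs fsck /data/bigfile.csv -files -blocks
$ hdfs getconf -confKey dfs.blocksize        # 134217728 bytes = 128 MB

hive> SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
hive> SET mapreduce.input.fileinputformat.split.maxsize=268435456;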


Start Hive in Debug Mode


Never go out without it: hive --hiveconf hive.root.logger=DEBUG,console
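
A usage sketch (assuming Hive's console appender writes to stderr, so the debug stream can be captured to a file; the query is just an example):

$ hive --hiveconf hive.root.logger=DEBUG,console -e 'SHOW DATABASES;' 2> /tmp/hive-debug.log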


Consider boosting spark.yarn.executor.memoryOverhead


This is a very specific error related to how the Spark executor coexists with its YARN container. You will typically see errors like this one in the application container logs: 15/03/12 18:53:46 WARN YarnAllocator: Container killed by YARN for exceeding memory … Continue reading
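
When this shows up, the usual fix is to reserve more off-heap room for the executor at submit time (a sketch; the 1024 MB value and my_job.py are illustrative placeholders):

# spark.yarn.executor.memoryOverhead is specified in MB.
$ spark-submit \
    --master yarn \
    --executor-memory 4g \
    --conf spark.yarn.executor.memoryOverhead=1024 \
    my_job.py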


Spark Notes


Apache Spark is an open-source cluster computing framework originally developed at the University of California, Berkeley, and later donated to the Apache Software Foundation, where it remains today. In contrast to Hadoop’s two-stage disk-based MapReduce paradigm, Spark’s multi-stage in-memory … Continue reading


How Apache Tez works


Tez enables developers to build end-user applications with much better performance and flexibility. It generalizes the MapReduce paradigm to a more powerful framework based on expressing computations as a dataflow graph. It is designed to get around limitations imposed by … Continue reading
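
The quickest way to watch Tez at work without coding against its DAG API is through Hive (a sketch; it assumes a Hive installation with Tez available, and some_table is a placeholder):

hive> SET hive.execution.engine=tez;
hive> EXPLAIN SELECT col, count(*) FROM some_table GROUP BY col;

With the engine switched, the EXPLAIN output describes a Tez DAG of vertices and edges instead of a chain of MapReduce stages.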


Create a really big file


This is sometimes useful when playing with big data. Instead of running a dd command and waiting for the file to be created block by block, we can run: $ fallocate -l 200G /mnt/reallyBigFile.csv It essentially “allocates” all of the space you’re seeking, but … Continue reading
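
A quick way to check what fallocate did (the sizes are illustrative; the behaviour assumes an extent-based filesystem such as ext4 or xfs):

$ fallocate -l 1G /tmp/test.img
$ ls -lh /tmp/test.img   # apparent size: 1.0G
$ du -h /tmp/test.img    # ~1.0G as well: the blocks are reserved up front, just never written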
