Category archive: Uncategorized

Running Spark with Oozie

Oozie 4.2 now supports spark-action. Example job.properties file (configuration tested on EMR 4.2.0):

nameNode=hdfs://172.31.25.17:8020
jobTracker=172.31.25.17:8032
master=local[*]
queueName=default
examplesRoot=examples
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/apps/spark

(Use the master node's internal IP instead of localhost in nameNode and jobTracker.)

Validate the Oozie workflow XML file: oozie … Continue reading

Hadoop: Output Committer Notes

OutputCommitter describes the commit of task output for a MapReduce job. The MapReduce framework relies on the OutputCommitter of the job to: set up the job during initialization; for example, create the temporary output directory for the job. Job setup … Continue reading

Human Resources and Rocket Science

I haven’t managed very large teams in my life. But, being on a team, I’ve learned a simple concept: Human Resources are not Rocket Science (action/reaction based). If you are not proactive while managing, you will lose the Resource. … Continue reading

Puppet: Syntax validation for Hiera YAML files

I need this handy:

ruby -e "require 'yaml'; YAML.load_file('common.yaml')"
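
The same check can be stretched to cover several Hiera files in one run. What follows is only a rough sketch: the script name validate_yaml.rb and the helper names yaml_syntax_error and validate_yaml_files are made up for illustration, and it relies on nothing beyond Ruby's standard yaml library.

```ruby
require 'yaml'

# Check one YAML file for syntax errors.
# Returns nil when the file parses, or the parser's error message otherwise.
# (Helper name is hypothetical, chosen for this sketch.)
def yaml_syntax_error(path)
  YAML.load_file(path)
  nil
rescue Psych::SyntaxError => e
  e.message
end

# Validate each path, print one OK/FAIL line per file,
# and return true only if every file parsed.
def validate_yaml_files(paths)
  results = paths.map do |path|
    error = yaml_syntax_error(path)
    puts(error ? "FAIL #{path}: #{error}" : "OK   #{path}")
    error.nil?
  end
  results.all?
end

# Assumed invocation: ruby validate_yaml.rb common.yaml other.yaml
# Exits non-zero when any file fails, so it can gate a commit hook or CI step.
exit(1) unless validate_yaml_files(ARGV)
```

A non-zero exit status makes it easy to chain this before a Puppet run; a missing file is not handled here and will still raise.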

Spark Notes

Apache Spark is an open source cluster computing framework originally developed at the University of California, Berkeley, and later donated to the Apache Software Foundation, where it remains today. In contrast to Hadoop’s two-stage disk-based MapReduce paradigm, Spark’s multi-stage in-memory … Continue reading

How Apache Tez works

Tez enables developers to build end-user applications with much better performance and flexibility. It generalizes the MapReduce paradigm to a more powerful framework based on expressing computations as a dataflow graph. It is designed to get around limitations imposed by … Continue reading

yarn: execute a script on all the nodes in the cluster

This is more of a Linux scripting tip, but sometimes we have a Hadoop (YARN) cluster running and need to run a post-install script or task on all the nodes in the cluster: for i in `yarn node … Continue reading
