Tag Archives: Hadoop

s3:// vs s3n:// vs s3a:// vs EMRFS


s3:// is the Apache Hadoop implementation of a block-based filesystem backed by S3; Apache Hadoop deprecated this filesystem as of May 2016. s3n:// is a native filesystem for reading and writing regular files on S3. S3N allows Hadoop to access … Continue reading
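
For illustration, this is how the URI schemes appear in practice; a minimal sketch, assuming a hypothetical bucket named my-bucket and credentials already configured:

hadoop fs -ls s3n://my-bucket/data/     # S3N: the older native filesystem client
hadoop fs -ls s3a://my-bucket/data/     # S3A: the successor to S3N in Apache Hadoop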


Kill’em All!


Use it at your own discretion:

for app in $(yarn application -list | awk '$6 == "ACCEPTED" { print $1 }'); do yarn application -kill "$app"; done
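
A variant of the same idea, assuming a YARN version whose yarn application -list supports the -appStates filter; this avoids matching the state column by position:

# the first two lines of the output are headers, hence NR>2
for app in $(yarn application -list -appStates ACCEPTED | awk 'NR>2 { print $1 }'); do yarn application -kill "$app"; done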


S3 and Parallel Processing – DirectFileOutputCommitter


The problem: while a Hadoop job is writing output, it writes to a temporary directory:

Task1 -> /unique/temp/directory/task1/file.tmp
Task2 -> /unique/temp/directory/task2/file.tmp

When a task finishes execution, the temporary file is moved (committed) to its final location. This scheme … Continue reading
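
As a sketch of the kind of workaround the post refers to: on recent Hadoop/Spark versions a similar effect for S3 output comes from committer algorithm version 2, which moves task output directly to the destination at task commit. The property name is real, but the job script is hypothetical and exact behavior depends on the version:

spark-submit --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 my_job.py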


Running Spark with Oozie


Oozie 4.2 now supports spark-action. Example job.properties file (configuration tested on EMR 4.2.0):

nameNode=hdfs://172.31.25.17:8020
jobTracker=172.31.25.17:8032
master=local[*]
queueName=default
examplesRoot=examples
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/apps/spark

(Use the master node's internal IP instead of localhost in nameNode and jobTracker.) Validate the Oozie workflow XML file: oozie … Continue reading
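
The excerpt cuts off at the validation command; a typical invocation of the real oozie validate subcommand, with a hypothetical workflow path, looks like:

oozie validate /home/hadoop/examples/apps/spark/workflow.xml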


Hadoop: OutputCommitter Notes


OutputCommitter describes the commit of task output for a MapReduce job. The MapReduce framework relies on the job's OutputCommitter to: set up the job during initialization, for example, creating the temporary output directory for the job. Job setup … Continue reading
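
Concretely, the staging directories the default FileOutputCommitter manages can be observed while a job runs; the output path below is hypothetical:

# pending task output sits under _temporary until commit
hadoop fs -ls /user/hadoop/output/_temporary/
# after job commit, only the final files and the _SUCCESS marker remain
hadoop fs -ls /user/hadoop/output/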


YARN: execute a script on all the nodes in the cluster


This is more Linux scripting than Hadoop, but sometimes we have a Hadoop (YARN) cluster running and need to run a post-install script or activity on all the nodes in the cluster: for i in `yarn node … Continue reading
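
The excerpt truncates the loop; a complete sketch, assuming passwordless ssh from the master and a hypothetical script.sh, could look like:

# yarn node -list prints two header lines, then one node per line as host:port
for i in $(yarn node -list | awk 'NR>2 { split($1, a, ":"); print a[1] }'); do
  ssh -o StrictHostKeyChecking=no "$i" 'bash -s' < script.sh
done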


HDFS: changing the replication factor


The replication factor is a property set in the HDFS configuration file that lets you adjust the global replication factor for the entire cluster. For each block stored in HDFS, there will be n – … Continue reading
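
For reference, the cluster-wide default is the dfs.replication property in hdfs-site.xml, and existing files can be changed with the real setrep command (the path here is hypothetical):

# change existing files to 2 replicas; -w waits for replication to complete
hadoop fs -setrep -w 2 /user/hadoop/data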
