Tag archive: Hadoop

Running Spark with Oozie


Oozie 4.2 now supports spark-action. Example job.properties file (configuration tested on EMR 4.2.0):

    nameNode=hdfs://172.31.25.17:8020
    jobTracker=172.31.25.17:8032
    master=local[*]
    queueName=default
    examplesRoot=examples
    oozie.use.system.libpath=true
    oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/apps/spark

(Use the master node's internal IP instead of localhost in nameNode and jobTracker.) Validate the Oozie workflow XML file: oozie … Continue reading
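
A minimal sketch of generating the job.properties above from the master node's internal IP, so that localhost never ends up in nameNode or jobTracker. The function name `write_job_properties` is hypothetical; the keys, ports, and values are the ones shown in the excerpt.

```shell
# Hypothetical helper: print a job.properties for the Oozie spark action,
# using the master node's internal IP (passed as $1) instead of localhost.
write_job_properties() {
  master_ip="$1"
  # These two lines need the IP substituted by the shell.
  printf 'nameNode=hdfs://%s:8020\n' "$master_ip"
  printf 'jobTracker=%s:8032\n' "$master_ip"
  # Quoted heredoc: the ${...} placeholders are left untouched for Oozie.
  cat <<'EOF'
master=local[*]
queueName=default
examplesRoot=examples
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/apps/spark
EOF
}
```

For example, `write_job_properties 172.31.25.17 > job.properties` would reproduce the file from the excerpt.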

Posted in Uncategorized

Hadoop: Output Committer Notes


OutputCommitter describes the commit of task output for a MapReduce job. The MapReduce framework relies on the OutputCommitter of the job to: Set up the job during initialization; for example, create the temporary output directory for the job. Job setup … Continue reading

Posted in Uncategorized

YARN: execute a script on all the nodes in the cluster


This is more of a Linux scripting topic, but sometimes we have a Hadoop (YARN) cluster running and need to run a post-install script or task on every node in the cluster: for i in `yarn node … Continue reading
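
The loop in the excerpt is cut off, so the following is not the author's script but a sketch of the same idea under stated assumptions: that `yarn node -list` prints one line per node with the node ID (`host:port`) in the first column and the state in the second, and that passwordless SSH to every node is already set up. The function names `running_hosts` and `run_on_all_nodes` are hypothetical.

```shell
# Read `yarn node -list` output on stdin and print the hostname of every
# node whose state column is RUNNING (node IDs are assumed to be host:port).
running_hosts() {
  awk '$2 == "RUNNING" { split($1, id, ":"); print id[1] }'
}

# Run a local script on every RUNNING node over SSH.
# Assumes passwordless SSH from this machine to each node.
run_on_all_nodes() {
  script="$1"
  for host in $(yarn node -list 2>/dev/null | running_hosts); do
    echo "== $host =="
    ssh "$host" 'bash -s' < "$script"
  done
}
```

With these defined, `run_on_all_nodes ./post-install.sh` would stream the script to each node in turn; the script name is only an example.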

Posted in Uncategorized

HDFS: changing the replication factor


The replication factor is a property in the HDFS configuration file that sets the default replication factor for the entire cluster. For each block stored in HDFS, there will be n – … Continue reading
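
As a sketch of how this looks from the command line: the cluster-wide default is the `dfs.replication` property in hdfs-site.xml, which applies to files written after the change, while `hdfs dfs -setrep` changes the factor of files that already exist. The wrapper names below are hypothetical and only wrap the standard `hdfs` CLI.

```shell
# Print the default replication factor from the client's HDFS configuration.
default_replication() {
  hdfs getconf -confKey dfs.replication
}

# Set the replication factor of an existing path (directories are handled
# recursively) and wait (-w) until the blocks reach the target.
set_replication() {
  factor="$1"
  path="$2"
  hdfs dfs -setrep -w "$factor" "$path"
}
```

For example, `set_replication 2 /user/hadoop/data` would bring every file under that (illustrative) directory to two replicas, which can take a while on large trees because of `-w`.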

Posted in Uncategorized

Indexing Common Crawl Metadata on Elasticsearch using Cascading


If you want to explore how to parallelize data ingestion into Elasticsearch, have a look at this post I wrote for Amazon AWS: http://blogs.aws.amazon.com/bigdata/post/TxC0CXZ3RPPK7O/Indexing-Common-Crawl-Metadata-on-Amazon-EMR-Using-Cascading-and-Elasticsearch It explains how to index Common Crawl metadata into Elasticsearch using Cascading connector … Continue reading

Posted in My Publications, Uncategorized

How Ganglia works


What is Ganglia? Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids. It is based on a hierarchical design targeted at federations of clusters. It leverages widely used technologies such as XML … Continue reading

Posted in Uncategorized

HBase useful commands


1) Connect to HBase. Connect to your running instance of HBase using the hbase shell command, located in the bin/ directory of your HBase install.

    $ ./bin/hbase shell
    hbase(main):001:0>

2) Create a table. Use the create command to create a … Continue reading

Posted in Uncategorized