Tag Archives: Hadoop

Indexing Common Crawl Metadata on Elasticsearch using Cascading


If you want to explore how to parallelize data ingestion into Elasticsearch, have a look at this post I wrote for AWS:
http://blogs.aws.amazon.com/bigdata/post/TxC0CXZ3RPPK7O/Indexing-Common-Crawl-Metadata-on-Amazon-EMR-Using-Cascading-and-Elasticsearch
It explains how to index Common Crawl metadata into Elasticsearch using the Cascading connector … Continue reading

Posted in My Publications, Uncategorized

How Ganglia works


What is Ganglia? Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids. It is based on a hierarchical design targeted at federations of clusters. It leverages widely used technologies such as XML … Continue reading

Posted in Uncategorized

HBase useful commands


1) Connect to HBase. Connect to your running instance of HBase using the hbase shell command, located in the bin/ directory of your HBase install.

$ ./bin/hbase shell
hbase(main):001:0>

2) Create a table. Use the create command to create a … Continue reading

Posted in Uncategorized

Hive: Extracting JSON fields


Handling JSON files with Hive is not always an easy task. If you need to extract specific fields from structured JSON, there are some alternatives. Two UDFs are usually helpful in these cases: ‘get_json_object’ … Continue reading

Posted in Uncategorized

Elasticsearch and Kibana on EMR Hadoop cluster


If you need to add Elasticsearch and Kibana to EMR, have a look at this post I wrote for AWS:
http://blogs.aws.amazon.com/bigdata/post/Tx1E8WC98K4TB7T/Getting-Started-with-Elasticsearch-and-Kibana-on-Amazon-EMR
It contains all the steps to launch a cluster and perform basic tests on both … Continue reading

Posted in My Publications, Uncategorized

YARN / Map Reduce memory settings


On Hadoop 1, we used mapred.child.java.opts to set the Java heap size for the TaskTracker child processes. With YARN, that parameter has been deprecated in favor of: mapreduce.map.java.opts – This parameter is passed to the JVM for mappers. … Continue reading
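As a rough sketch of how the mapper heap setting might be derived, the snippet below computes an -Xmx value as a fraction of the map container size. The property mapreduce.map.memory.mb is not mentioned in the excerpt above; it is included here as an assumption about how the container size is configured. The helper name heap_for_container, the 2048 MB container, and the 80% ratio are illustrative choices (a common rule of thumb), not Hadoop defaults.

```shell
#!/bin/sh
# heap_for_container is a hypothetical helper, not a Hadoop command.
# $1: container size in MB, $2: percentage of it to give to the JVM heap.
heap_for_container() {
  echo $(( $1 * $2 / 100 ))
}

# Illustrative values only; tune them for your own cluster.
MAP_CONTAINER_MB=2048
HEAP_PERCENT=80
MAP_HEAP_MB=$(heap_for_container "$MAP_CONTAINER_MB" "$HEAP_PERCENT")

# Print the two settings as -D options that could be passed to a job.
echo "-Dmapreduce.map.memory.mb=${MAP_CONTAINER_MB}"
echo "-Dmapreduce.map.java.opts=-Xmx${MAP_HEAP_MB}m"
```

Keeping the heap below the container size leaves headroom for non-heap JVM memory, which is why the two values are computed together rather than set independently.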

Posted in Uncategorized

Hadoop 1 vs Hadoop 2


There are a lot of articles about this, but I just needed a good summary of the concepts: Hadoop 1: A master process called the JobTracker is the central scheduler for all MapReduce jobs in the cluster. Nodes have a TaskTracker … Continue reading

Posted in Uncategorized

Adding a JAR path to Hadoop classpath


This is simple, but it is a frequent question: if we need to add a path pointing to a third-party library, we can run a command like the following:

$ export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/home/hadoop/.versions/Cascading-2.5-SDK/binary/cascading/*:/home/hadoop/.versions/Cascading-2.5-SDK/binary/cascading/lib/cascading-core/*

Here I am adding two directories to … Continue reading
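A minimal sketch of the same idea as a reusable function follows. It appends wildcard directory entries to HADOOP_CLASSPATH and avoids a leading colon when the variable starts out empty. The function name add_to_hadoop_classpath and the /opt/example paths are hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of Hadoop): append entries to HADOOP_CLASSPATH.
add_to_hadoop_classpath() {
  for entry in "$@"; do
    # ${VAR:+...} expands to "$VAR:" only when VAR is set and non-empty,
    # so an empty HADOOP_CLASSPATH does not produce a leading colon.
    HADOOP_CLASSPATH="${HADOOP_CLASSPATH:+$HADOOP_CLASSPATH:}$entry"
  done
  export HADOOP_CLASSPATH
}

# Placeholder paths. The trailing /* is quoted so the shell passes it through
# literally; the JVM expands it to the JAR files in that directory.
add_to_hadoop_classpath '/opt/example/lib/*' '/opt/example/extra/*'
echo "$HADOOP_CLASSPATH"
```

Quoting the wildcard matters when passing it as an argument: unquoted, the shell would expand it to individual file names before the function sees it.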

Posted in Uncategorized

Hive: dealing with Out of Memory and Garbage Collector errors


This is the common error:

java.lang.OutOfMemoryError: GC overhead limit exceeded

This error can occur in many Java environments, but with Hive in particular it is pretty common when big structures or several thousand objects are stored in memory. According to Sun, … Continue reading

Posted in Uncategorized

HBase Basics


NoSQL? HBase is a type of “NoSQL” database. “NoSQL” is a general term meaning that the database isn’t an RDBMS which supports SQL as its primary access language, but there are many types of NoSQL databases: BerkeleyDB is an example … Continue reading

Posted in Uncategorized