Installing Maven on an Amazon EC2 instance

Maven is a software tool for building and managing Java projects.

Download Maven:

$ wget http://apache.saix.net/maven/maven-3/3.2.3/binaries/apache-maven-3.2.3-bin.tar.gz

Extract it:

$ tar -xzvf apache-maven-3.2.3-bin.tar.gz

Move the folder to a permanent installation directory:

$ sudo mkdir -p /usr/local/maven
$ sudo mv /home/ec2-user/apache-maven-3.2.3 /usr/local/maven/apache-maven-3.2.3

Create a symbolic link named current pointing to this version (useful in case we install other versions later):

$ sudo ln -s /usr/local/maven/apache-maven-3.2.3 /usr/local/maven/current

Modify the user's login environment to make Maven available:

$ vi ~/.bashrc

Add the following entries:

export MAVEN_HOME=/usr/local/maven/current
export M2_HOME=/usr/local/maven/current
export M2=/usr/local/maven/current/bin
export PATH=/usr/local/maven/current/bin:$PATH

Reload the bash profile to apply the change:

$ source ~/.bashrc
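
Alternatively (a sketch, assuming we want Maven available to every user on the instance rather than only ec2-user), the same exports can live in a system-wide profile script:

$ sudo tee /etc/profile.d/maven.sh > /dev/null <<'EOF'
export MAVEN_HOME=/usr/local/maven/current
export M2_HOME=/usr/local/maven/current
export PATH=/usr/local/maven/current/bin:$PATH
EOF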

Check the installation:

$ mvn -version
Apache Maven 3.2.1 (ea8b2b07643dbb1b84b6d16e1f08391b666bc1e9;2014-02-14T17:37:52+00:00)
Maven home: /usr/local/maven/current
Java version: 1.7.0_55, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.55.x86_64/jre
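
As an optional smoke test (the groupId and artifactId below are arbitrary examples), we can generate and build a sample project from the quickstart archetype:

$ mvn archetype:generate -DgroupId=com.example -DartifactId=maven-test \
      -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
$ cd maven-test && mvn package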

Adding a JAR path to Hadoop classpath

This is simple, but it is a frequent question:

If we need to add a specific path pointing to a third-party library, we can run a command like the following:

$ export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/home/hadoop/.versions/Cascading-2.5-SDK/binary/cascading/*:/home/hadoop/.versions/Cascading-2.5-SDK/binary/cascading/lib/cascading-core/*

Here I am adding two directories to the hadoop classpath:

/home/hadoop/.versions/Cascading-2.5-SDK/binary/cascading/*
/home/hadoop/.versions/Cascading-2.5-SDK/binary/cascading/lib/cascading-core/*

We can check the hadoop classpath with the following command:

$ hadoop classpath
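
To make the addition persistent across sessions (a sketch; the location of hadoop-env.sh depends on the distribution – for example /home/hadoop/conf/hadoop-env.sh on older EMR AMIs, or $HADOOP_CONF_DIR/hadoop-env.sh elsewhere), the export line can be appended to hadoop-env.sh:

$ echo 'export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/home/hadoop/.versions/Cascading-2.5-SDK/binary/cascading/*:/home/hadoop/.versions/Cascading-2.5-SDK/binary/cascading/lib/cascading-core/*' >> /home/hadoop/conf/hadoop-env.sh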



Hive: dealing with Out of Memory and Garbage Collector errors.

This is a common error:

java.lang.OutOfMemoryError: GC overhead limit exceeded

This error can occur in any Java environment but, with Hive in particular, it is pretty common when big structures or many thousands of objects are held in memory.

According to Sun, the error is raised if too much time is being spent in garbage collection:

If more than 98% of the total time is spent in garbage collection and less than 2% of the heap is recovered, an OutOfMemoryError will be thrown.

This feature is designed to prevent applications from running for an extended period of time while making little or no progress because the heap is too small.

If necessary, this feature can be disabled by adding the option -XX:-UseGCOverheadLimit.

Also, we can increase the heap size via the -Xmx option (e.g., -Xmx1024m).

Another interesting option is the concurrent collector (-XX:+UseConcMarkSweepGC): it performs most of its work concurrently (i.e., while the application is still running) to keep garbage collection pauses short.

It is designed for applications with medium to large-sized data sets for which response time is more important than overall throughput, since the techniques used to minimize pauses can reduce application performance.

In Hive, an example command is:

SET mapred.child.java.opts="-server -Xmx1g -XX:+UseConcMarkSweepGC -XX:-UseGCOverheadLimit";

An alternative that avoids holding the whole structure in memory:

Write intermediate results to a temporary table in the database instead of in-memory hashmaps; a database table is not memory-bound, so using an indexed table is a solution in many cases.

When the intermediate table is complete, execute the SQL statement(s) against it instead of against the in-memory structure.
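
A minimal sketch of that approach (the table and column names are made up; the SET line reuses the GC options discussed above):

$ cat > intermediate.hql <<'EOF'
SET mapred.child.java.opts=-server -Xmx1g -XX:+UseConcMarkSweepGC -XX:-UseGCOverheadLimit;

-- materialize the intermediate result in a table instead of an in-memory structure
CREATE TABLE tmp_user_counts AS
SELECT user_id, COUNT(*) AS cnt
FROM raw_events
GROUP BY user_id;

-- further queries run against the intermediate table
SELECT user_id, cnt FROM tmp_user_counts WHERE cnt > 1000;
EOF

$ hive -f intermediate.hql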


HBase Basics

NoSQL?

HBase is a type of “NoSQL” database. “NoSQL” is a general term meaning that the database isn’t an RDBMS which supports SQL as its primary access language, but there are many types of NoSQL databases: BerkeleyDB is an example of a local NoSQL database, whereas HBase is very much a distributed database.

Technically speaking, HBase is really more a “Data Store” than “Data Base” because it lacks many of the features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and advanced query languages, etc.

However, HBase has many features which support both linear and modular scaling. HBase clusters expand by adding RegionServers that are hosted on commodity-class servers. If a cluster expands from 10 to 20 RegionServers, for example, it doubles both in storage and in processing capacity. An RDBMS can scale well, but only up to a point – specifically, the size of a single database server – and for the best performance it requires specialized hardware and storage devices. HBase features of note are:

  • Strongly consistent reads/writes: HBase is not an “eventually consistent” DataStore. This makes it very suitable for tasks such as high-speed counter aggregation.
  • Automatic sharding: HBase tables are distributed on the cluster via regions, and regions are automatically split and re-distributed as your data grows.
  • Automatic RegionServer failover
  • Hadoop/HDFS Integration: HBase supports HDFS out of the box as its distributed file system.
  • MapReduce: HBase supports massively parallelized processing via MapReduce for using HBase as both source and sink.
  • Java Client API: HBase supports an easy-to-use Java API for programmatic access.
  • Thrift/REST API: HBase also supports Thrift and REST for non-Java front-ends.
  • Block Cache and Bloom Filters: HBase supports a Block Cache and Bloom Filters for high volume query optimization.
  • Operational Management: HBase provides built-in web pages for operational insight as well as JMX metrics.

When Should I Use HBase?

HBase isn’t suitable for every problem.

1) Make sure you have enough data. If you have hundreds of millions or billions of rows, then HBase is a good candidate. If you only have a few thousand/million rows, then a traditional RDBMS might be a better choice, because all of your data might wind up on a single node (or two) while the rest of the cluster sits idle.

2) Make sure you can live without all the extra features that an RDBMS provides (e.g., typed columns, secondary indexes, transactions, advanced query languages, etc.)

An application built against an RDBMS cannot be “ported” to HBase by simply changing a JDBC driver, for example. Consider moving from an RDBMS to HBase as a complete redesign as opposed to a port.

3) Make sure you have enough hardware. Even HDFS doesn’t do well with anything less than 5 DataNodes (due to things such as HDFS block replication which has a default of 3), plus a NameNode.

HBase can run quite well stand-alone on a laptop – but this should be considered a development configuration only.
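
For such a development setup (assuming an unpacked HBase distribution with its default standalone configuration), getting started boils down to:

$ ./bin/start-hbase.sh
$ ./bin/hbase shell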

How does HBase distribute the data across the cluster?

HBase stores rows of data in tables. Tables are split into chunks of rows called “regions”. Those regions are distributed across the cluster, hosted and made available to client processes by the RegionServer process.

[Figure: HBase architecture]

A region is a contiguous range within the key space, meaning all rows in the table that sort between the region’s start key and end key are stored in the same region. Regions are non-overlapping, i.e. a single row key belongs to exactly one region at any point in time. A region is only served by a single RegionServer at any point in time, which is how HBase guarantees strong consistency within a single row. Together with the -ROOT- and .META. regions, a table’s regions effectively form a 3-level B-Tree for the purposes of locating a row within a table.
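
For example (a sketch from the HBase shell; the table name, column family and split keys are made up), a table can be created pre-split into four regions on chosen row-key boundaries:

$ hbase shell <<'EOF'
create 'events', 'd', SPLITS => ['g', 'm', 't']
EOF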

HBase depends on HDFS for data storage. RegionServers collocate with the HDFS DataNodes. This enables data locality for the data served by the RegionServers, at least in the common case. Region assignment, DDL operations, and other book-keeping facilities are handled by the HBase Master process.

[Figure: HBase physical architecture]

HBase uses ZooKeeper to maintain live cluster state. When accessing data, clients communicate with HBase RegionServers directly. That way, ZooKeeper and the Master process don’t bottleneck data throughput. No persistent state lives in ZooKeeper or the Master. HBase is designed to recover from complete failure entirely from data persisted durably to HDFS.

What Is The Difference Between HBase and Hadoop/HDFS?

HDFS is a distributed file system that is well suited for the storage of large files. Its documentation states that it is not, however, a general purpose file system, and does not provide fast individual record lookups in files. HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables. This can sometimes be a point of conceptual confusion. HBase internally puts your data in indexed “StoreFiles” that exist on HDFS for high-speed lookups.
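
A minimal sketch of such record-level access from the HBase shell (table, column family and values are made up):

$ hbase shell <<'EOF'
create 'users', 'info'
put 'users', 'row1', 'info:name', 'alice'
get 'users', 'row1'
scan 'users'
EOF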

More info:

http://hbase.apache.org/book/book.html

http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/


Mandus Momberg’s Blog! – the beauty of BASH

I would like to share with you an awesome new blog from an awesome professional:

http://blog.mandusmomberg.com/

And… as a first post, a nice one about the beauty of BASH:

http://blog.mandusmomberg.com/blog/2014/12/01/o-what-a-beautiful-bashing/

Enjoy!


vi: saving changes with root permissions (sudo)

Just:

:w !sudo tee %

% refers to the current file.

!sudo tee calls tee with administrator privileges and writes the buffer contents to the current file on disk. Vim itself, however, does not perform the write, so the buffer is not marked as saved.

That’s why you will see a warning like this when using the command:

W12: Warning: File "/etc/myfile.txt" has changed and the buffer was changed in Vim as well
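
As an optional convenience (a sketch; the w!! mapping name is arbitrary), the trick can be turned into a command-line mapping in ~/.vimrc, redirecting tee’s output so the file contents are not echoed back to the screen:

$ echo 'cnoremap w!! w !sudo tee % > /dev/null' >> ~/.vimrc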

Thanks, Mandus, for this! I feel better now!


MapReduce: Compression and Input Splits

This is something that always raises doubts:

When considering compressed data that will be processed by MapReduce, it is important to check whether the compression format supports splitting. If it does not, the number of map tasks may not be what you expect.

Let’s suppose an uncompressed file stored in HDFS whose size is 1 GB: with an HDFS block size of 64 MB, the file will be stored as 16 blocks, and a MapReduce job using this file as input will create 16 input splits, each processed independently as input to a separate map task.

Now if the file is a gzip-compressed file whose compressed size is 1 GB: As before, HDFS will store the file as 16 blocks. But, creating a split for each block will not work since it is impossible to start reading at an arbitrary point in the gzip stream, and therefore impossible for a map task to read its split independently of the others.

In this case, MapReduce will not try to split the gzipped file, since it knows that the input is gzip-compressed (by looking at the filename extension) and that gzip does not support splitting.

In this scenario a single map task will process the 16 HDFS blocks, most of which will not be local to the map (so there is an additional data-locality cost).

This job will not parallelize as expected; it will be less granular and so may take longer to run.

The gzip format uses DEFLATE to store the compressed data, and DEFLATE stores data as a series of compressed blocks. The problem is that the start of each block is not distinguished in any way that would allow a reader positioned at an arbitrary point in the stream to advance to the beginning of the next block, thereby synchronizing itself with the stream. For this reason, gzip does not support splitting.
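
If parallelism matters more than keeping the original format, one workaround (a sketch; file names are placeholders) is to recompress the input with a splittable codec such as bzip2 before loading it into HDFS:

$ gunzip -c big-input.gz | bzip2 > big-input.bz2
$ hadoop fs -put big-input.bz2 /data/input/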

Here we have a summary of compression formats:

[Table: compression formats and their splittability – of the common formats (DEFLATE, gzip, bzip2, LZO, LZ4, Snappy), only bzip2 is splittable out of the box, and LZO is splittable only if it has been indexed.]

Note: DEFLATE is a compression algorithm whose standard implementation is zlib. There is no commonly available command-line tool for producing files in DEFLATE format, as gzip is normally used. (The gzip file format is DEFLATE with extra headers and a footer.) The .deflate filename extension is a Hadoop convention.

Source: Hadoop: The Definitive Guide.
