AWS EMR – Big Data in Strata New York

Will you be in New York next week (Sept 25th – Sept 28th)?


Come meet the AWS Big Data team at Strata Data Conference, where we’ll be happy to answer your questions, hear about your requirements, and help you with your big data initiatives.

See you there!





Posted in Uncategorized

s3:// vs s3n:// vs s3a:// vs EMRFS


s3://

The Apache Hadoop implementation of a block-based filesystem backed by S3. Apache Hadoop has deprecated use of this filesystem as of May 2016.


s3n://

A native filesystem for reading and writing regular files on S3. S3N allows Hadoop to access files on S3 that were written with other tools, and conversely, other tools can access files written to S3N using Hadoop. S3N is stable and widely used, but it is not being updated with any new features. S3N requires a suitable version of the jets3t JAR on the classpath.

  • Uses jets3t


s3a://

Hadoop’s successor to the S3N filesystem. S3A uses Amazon’s libraries to interact with S3. It supports accessing files larger than 5 GB, and it provides performance enhancements and other improvements. S3A is backward compatible with S3N: with Apache Hadoop, all objects accessible from s3n:// URLs should also be accessible from S3A by replacing the URL scheme.

  • Uses AWS SDK.
  • Amazon EMR does not currently support use of the Apache Hadoop S3A file system.
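Replacing the URL scheme is a plain string operation. Here is a minimal sketch; the function name to_s3a is made up for illustration, and it only rewrites the string (it does not check that the object exists or that S3A is available on your cluster):

```shell
# Rewrite an s3n:// URL to the equivalent s3a:// URL.
# Any other scheme is passed through unchanged.
to_s3a() {
  case "$1" in
    s3n://*) printf 's3a://%s\n' "${1#s3n://}" ;;  # strip the old scheme, prepend the new one
    *)       printf '%s\n' "$1" ;;
  esac
}

to_s3a "s3n://mybucket/hive/csv/"   # prints s3a://mybucket/hive/csv/
```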


EMRFS

On Amazon EMR, both the s3:// and s3n:// URIs are associated with the EMR filesystem and are functionally interchangeable. For consistency's sake, however, it is recommended to use the s3:// URI on Amazon EMR.

EMRFS can be used by invoking the prefix s3n:// or s3:// or s3a:// depending on the client application implementation.


Kill’em All!

This kills every YARN application that is still in the ACCEPTED state. Use it at your own discretion:

for app in $(yarn application -list -appStates ACCEPTED | awk '$1 ~ /^application_/ { print $1 }'); do yarn application -kill "$app"; done
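If you would rather see what is about to die first, here is a sketch of the same loop with a dry-run switch. The function names and the DRY_RUN variable are my own invention, not part of YARN; only `yarn application -list -appStates` and `yarn application -kill` are real CLI calls.

```shell
# Print the IDs of all applications currently in the ACCEPTED state.
list_accepted_apps() {
  yarn application -list -appStates ACCEPTED | awk '$1 ~ /^application_/ { print $1 }'
}

# Kill them, unless DRY_RUN is 1 (the default), in which case only report.
kill_accepted_apps() {
  for app in $(list_accepted_apps); do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "would kill $app"
    else
      yarn application -kill "$app"
    fi
  done
}

# Usage:
#   kill_accepted_apps             # dry run, only prints the IDs
#   DRY_RUN=0 kill_accepted_apps   # really kills them
```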








S3 and Parallel Processing – DirectFileOutputCommitter

The problem:

While a Hadoop Job is writing output, it will write to a temporary directory:
Task1 –> /unique/temp/directory/task1/file.tmp
Task2 –> /unique/temp/directory/task2/file.tmp

When a task finishes, it moves (commits) its temporary file to the final location.

This scheme is what makes Hadoop's speculative execution feature possible.

Moving the task output to its final destination (the commit) involves a rename operation. On a normal filesystem, a rename is just a pointer change in the filesystem metadata.

S3, however, is not a filesystem, so renames are more costly: each one involves a copy (Put) followed by a Delete.
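To make the cost concrete, here is a rough sketch of what a "rename" amounts to on S3, written with the AWS CLI. The function name s3_rename and the example keys are hypothetical; this is an illustration, not what Hadoop runs internally.

```shell
# A "rename" on S3: copy the object to its new key, then delete the
# original only if the copy succeeded. The copy cost grows with object size.
s3_rename() {
  aws s3 cp "$1" "$2" && aws s3 rm "$1"
}

# Usage (needs the AWS CLI and credentials):
#   s3_rename s3://mybucket/tmp/task1/file.tmp s3://mybucket/output/file
```

The AWS CLI's own `aws s3 mv` wraps the same two steps.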

The solution:

In MapReduce (this behavior may differ for other applications), we can avoid these expensive operations by setting the “mapred.output.committer.class” property in mapred-site.xml to “org.apache.hadoop.mapred.DirectFileOutputCommitter”, so that each task writes its output directly to its final destination.
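As a sketch, the property element would look like the one printed below. The property name and class come from the text above; the helper function name is made up, and where exactly mapred-site.xml lives depends on your installation.

```shell
# Print the <property> element to paste inside <configuration> in mapred-site.xml.
direct_committer_property() {
  cat <<'EOF'
<property>
  <name>mapred.output.committer.class</name>
  <value>org.apache.hadoop.mapred.DirectFileOutputCommitter</value>
</property>
EOF
}

direct_committer_property
```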


For this and other useful parallel processing S3 considerations, please have a look here:


God does not cast dice / Dios no juega a los dados


Niels Bohr (left) and Albert Einstein (right) discussing quantum mechanics.


Copy Data with Hive and Spark / Copiar Datos con Hive y Spark

These are two examples of how to copy data from one S3 location to another. The same operation can be done from S3 to HDFS and vice versa.

I’m assuming that you are able to launch the Hive client or the spark-shell client.


Using the MapReduce engine or the Tez engine (pick one):

set hive.execution.engine=mr;
-- or
set hive.execution.engine=tez;

-- Assumes comma-delimited text files, as the csv/ paths suggest.
CREATE EXTERNAL TABLE source_table(a_col string, b_col string, c_col string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://mybucket/hive/csv/';

CREATE TABLE destination_table(a_col string, b_col string, c_col string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://mybucket/output-hive/csv_1/';

INSERT OVERWRITE TABLE destination_table SELECT * FROM source_table;




If you want to copy data to HDFS, you can also explore s3-dist-cp:


s3-dist-cp --src s3://mybucket/hive/csv/ --dest hdfs:///output-hive/csv_10/



Valencian Buñuelos (Pumpkin Fritters)

Ingredients

  • 1 medium pumpkin (approx. 800 g)
  • 500 g flour
  • 100 g fresh yeast
  • 1/2 glass of gaseosa (soda water)
  • Water
  • Oil for frying (sunflower/corn/olive)

Preparation

Peel the pumpkin, remove the seeds, and boil it until it can be mashed into a fine purée. Reserve half of the water the pumpkin was boiled in.

Mix the flour with the yeast (if you are using dried yeast, first dissolve it in lukewarm water with a tablespoon of sugar and let it ferment for about 10 minutes), then add the pumpkin purée and the reserved cooking water. About 3 cups should be enough to reach the right consistency. Add the soda. Knead by hand until the dough is soft and smooth.

Let the dough rest for about 20 minutes so that it doubles in size. The buñuelos should have a small hole in the middle, which you can make simply by pressing your thumb into the center of the dough.
Drop the buñuelos a few at a time into a pan of hot oil and fry them until golden. Keep the oil at a moderate temperature so they don't end up raw inside!

Posted in Cooking, Uncategorized