S3 and Parallel Processing – DirectFileOutputCommitter


The problem:

While a Hadoop Job is writing output, it will write to a temporary directory:
Task1 –> /unique/temp/directory/task1/file.tmp
Task2 –> /unique/temp/directory/task2/file.tmp

When the tasks finish the execution, will move (commit) the temporary file to a final location.

This schema makes possible the support speculative execution feature on Hadoop.

Moving the task output to its final destination (commit), involves a Rename operation. This rename operation, on a normal filesystem is just a change of pointer in the FS metadata.

Now, as S3 is not a filesystem, rename operations are more costly: it will involve a copy (Put) + Delete operation.

The solution:

In Mapreduce (this behavior can be different for other applications), to avoid these expensive operations, we can change the mapred-site.xml file, “mapred.output.committer.class” property to “org.apache.hadoop.mapred.DirectFileOutputCommitter”, so the the task output directly to it’s final destination.

<property>
  <name>mapred.output.committer.class</name>
  <value>org.apache.hadoop.mapred.DirectFileOutputCommitter</value>
</property>

For this and other useful parallel processing S3 considerations, please have a look here:

http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-troubleshoot-errors-io.html#emr-troubleshoot-errors-io-1
https://aws.amazon.com/blogs/aws/amazon-s3-performance-tips-tricks-seattle-hiring-event/
http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html

Anuncios

Acerca de hvivani

sysadmin, developer, RHCSA
Esta entrada fue publicada en Uncategorized y etiquetada , , . Guarda el enlace permanente.

Responder

Introduce tus datos o haz clic en un icono para iniciar sesión:

Logo de WordPress.com

Estás comentando usando tu cuenta de WordPress.com. Cerrar sesión / Cambiar )

Imagen de Twitter

Estás comentando usando tu cuenta de Twitter. Cerrar sesión / Cambiar )

Foto de Facebook

Estás comentando usando tu cuenta de Facebook. Cerrar sesión / Cambiar )

Google+ photo

Estás comentando usando tu cuenta de Google+. Cerrar sesión / Cambiar )

Conectando a %s