s3:// vs s3n:// vs s3a:// vs EMRFS


s3://

Apache Hadoop implementation of a block-based filesystem backed by S3. Apache Hadoop has deprecated use of this filesystem as of May 2016.

s3n://

A native filesystem for reading and writing regular files on S3. S3N allows Hadoop to access files on S3 that were written with other tools, and conversely, other tools can access files written to S3N using Hadoop. S3N is stable and widely used, but it is not being updated with any new features. S3N requires a suitable version of the jets3t JAR on the classpath.

  • Uses jets3t

s3a://

Hadoop’s successor to the S3N filesystem. S3A uses Amazon’s libraries to interact with S3. S3A supports accessing files larger than 5 GB, and it provides performance enhancements and other improvements. For Apache Hadoop, S3A is the successor to S3N and is backward compatible with S3N. Using Apache Hadoop, all objects accessible from s3n:// URLs should also be accessible from S3A by replacing the URL scheme.

  • Uses AWS SDK.
  • Amazon EMR does not currently support use of the Apache Hadoop S3A file system.

EMRFS:

On Amazon EMR, both the s3:// and s3n:// URIs are associated with the EMR filesystem and are functionally interchangeable in the context of Amazon EMR. For consistency sake, however, it is recommended to use the s3:// URI in the context of Amazon EMR.

EMRFS can be used by invoking the prefix s3n:// or s3:// or s3a:// depending on the client application implementation.

Source: https://aws.amazon.com/premiumsupport/knowledge-center/emr-file-system-s3/
Anuncios
Publicado en Uncategorized | Etiquetado , , , , | Deja un comentario

S3 and Parallel Processing – DirectFileOutputCommitter


The problem:

While a Hadoop Job is writing output, it will write to a temporary directory:
Task1 –> /unique/temp/directory/task1/file.tmp
Task2 –> /unique/temp/directory/task2/file.tmp

When the tasks finish the execution, will move (commit) the temporary file to a final location.

This schema makes possible the support speculative execution feature on Hadoop.

Moving the task output to its final destination (commit), involves a Rename operation. This rename operation, on a normal filesystem is just a change of pointer in the FS metadata.

Now, as S3 is not a filesystem, rename operations are more costly: it will involve a copy (Put) + Delete operation.

The solution:

In Mapreduce (this behavior can be different for other applications), to avoid these expensive operations, we can change the mapred-site.xml file, “mapred.output.committer.class” property to “org.apache.hadoop.mapred.DirectFileOutputCommitter”, so the the task output directly to it’s final destination.

<property>
  <name>mapred.output.committer.class</name>
  <value>org.apache.hadoop.mapred.DirectFileOutputCommitter</value>
</property>

For this and other useful parallel processing S3 considerations, please have a look here:

http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-troubleshoot-errors-io.html#emr-troubleshoot-errors-io-1
https://aws.amazon.com/blogs/aws/amazon-s3-performance-tips-tricks-seattle-hiring-event/
http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html

Publicado en Uncategorized | Etiquetado , , | Deja un comentario

God does not cast dice / Dios no juega a los dados


einstein_bohr

Niels Bohr (left) and Albert Einstein (right) discussing quantum mechanics.

Publicado en Uncategorized | Etiquetado | Deja un comentario

Copy Data with Hive and Spark / Copiar Datos con Hive y Spark


These are two examples of how to copy data from one S3 location to other S3 location. Same operation can be done from S3 to HDFS and vice-versa.

I’m considering that you are able to launch the Hive client or spark-shell client.

Hive:

Using Mapreduce engine or Tez engine:

set hive.execution.engine=mr; 

or

set hive.execution.engine=tez; 
CREATE EXTERNAL TABLE source_table(a_col string, b_col string, c_col string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://mybucket/hive/csv/';

CREATE TABLE destination_table(a_col string, b_col string, c_col string) LOCATION 's3://mybucket/output-hive/csv_1/';

INSERT OVERWRITE TABLE destination_table SELECT * FROM source_table;

Spark:

sc.textFile("s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-48/segments/1387346051826/warc/").saveAsTextFile("s3://mybucket/spark/bigfiles")

 

If you want to copy data to HDFS, you can also explore s3-dist-cp:

s3DistCP:

s3-dist-cp --src s3://mybucket/hive/csv/ --dest=hdfs:///output-hive/csv_10/

 

Publicado en Uncategorized | Etiquetado , , , | Deja un comentario

Buñuelos Valencianos (de calabaza)


Ingredientes:

  • 1 calabaza mediana (aprox. 800g)
  • 500 gr harina
  • 100 g levadura fresca
  • 1/2 vaso de gaseosa (soda)
  • Agua
  • Aceite para freir (Girasol/Maiz/Oliva)

Pasos:

Pelar, sacar las semillas y hervir la calabaza para obtener un puré fino. Se reserva la mitad del agua donde se ha hervido la calabaza.

Mezclar la harina con la levadura (si la levadura es deshidratada, disolverla antes en agua tibia mas una cucharada de azucar dejandola fermentar unos 10 minutos), agregar el puré de calabaza que hemos hecho y el agua de hervir la calabaza. Unas 3 tazas deberia ser suficiente para lograr el punto. Agregar la soda. Se amasa a mano hasta conseguir una masa blanda y suave.

valencianos_punto_masa

 

Dejar reposar la masa unos 20 minutos para que duplique su tamano. Los buñuelos deben tener un agujerito en el medio que se le puede hacer, sencillamente apretando el pulgar en le centro de la masa.
En una sartén con aceite caliente se van echando poco a poco los buñuelos hasta que se doran. Moderar la temperatura del aceite para que no queden crudos adentro !

valencianos_probando

Publicado en Cooking, Uncategorized | Deja un comentario

HBase and Zookeeper debugging


I came across some scenarios where an application (i.e. Mapreduce) communicating to HBase through YARN could silently fail with a timeout like the following:

2017-01-30 19:42:03,657 DEBUG [main] org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation: locateRegionInMeta parentTable=hbase:meta, metaLocation=, attempt=9 of 35 failed; retrying after sleep of 10095 because: Failed after attempts=36, exceptions:
Mon Jan 30 19:42:03 UTC 2017, null, java.net.SocketTimeoutException: callTimeout=60000, callDuration=68463: row 'test2,#cmrNo acctNo,99999999999999' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=ip-172-31-3-246.us-west-2.compute.internal,16000,1485539268192, seqNum=0

The root cause for this behavior here wasn’t related to any missconfiguration at server/networking side, but a library missing in the class path.

When there is a zookeeper issue, depending on the retry parameters the exceptions are not visible.

On this case, In the Mapreduce Java application I’ve added/modified the following parameters that lead into more visibility in the communication layer between Zookeeper and HBase:

conf.set("hbase.client.retries.number", Integer.toString(1));
conf.set("zookeeper.session.timeout", Integer.toString(60000));
conf.set("zookeeper.recovery.retry", Integer.toString(1));


After this, the following exception was visible:

Exception: java.lang.NoClassDefFoundError: com/yammer/metrics/core/Gauge
 at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:157)
 at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:65)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: com.google.protobuf.ServiceException: java.lang.NoClassDefFoundError: com/yammer/metrics/core/Gauge

 

Playing around these parameters will cause the application exit quickly when there is a problem with the cluster. This can be desirable in a production environment.

Reducing the parameters to a more conservative value could yield better recovery times.  Setting zookeeper.recovery.retry to 0 will still result in up to two connection attempts made to all zk servers in the quorum and cause and application failure to happen in under a minute should there be a loss of zookeeper connectivity during execution.

 

As an additional note, if you are receiving timeouts because the application is trying to contact localhost instead of the quorum server, you can set the explicit parameters:

// HBase through MR on Yarn is trying to connect to localhost instead of quorum.
        conf.set("hbase.zookeeper.quorum","172.31.3.246");
        conf.set("hbase.zookeeper.property.clientPort","2181");

 

I’ve added a couple of examples of Mapreduce applications for HBase here: https://github.com/hvivani/bigdata/tree/master/hbase

 

Some additional notes on this behavior: https://discuss.pivotal.io/hc/en-us/articles/200933006-Hbase-application-hangs-indefinitely-connecting-to-zookeeper

 

Publicado en Uncategorized | Etiquetado , , , , , | Deja un comentario

Checking Yarn child execution environment


Never go out without this:

$ sudo -u yarn jps
27343 YarnChild
4156 NodeManager
27292 Jps

$ sudo strings -f /proc/27343/environ
/proc/27343/environ: STDERR_LOGFILE_ENV=/var/log/hadoop-yarn/containers/application_1485807340469_0019/container_1485807340469_0019_01_000003/stderr
/proc/27343/environ: SHELL=/bin/bash
/proc/27343/environ: TERM=linux
/proc/27343/environ: HADOOP_HOME=/usr/lib/hadoop
/proc/27343/environ: YARN_PID_DIR=/var/run/hadoop-yarn
/proc/27343/environ: NM_HOST=ip-172-31-5-156.us-west-2.compute.internal
/proc/27343/environ: HADOOP_PREFIX=/usr/lib/hadoop
/proc/27343/environ: YARN_OPTS= -XX:OnOutOfMemoryError='kill -9 %p' -XX:OnOutOfMemoryError='kill -9 %p' -server  -Dhadoop.log.dir=/var/log/hadoop-yarn -Dyarn.log.dir=/var/log/hadoop-yarn -Dhadoop.log.file=yarn-yarn-nodemanager-ip-172-31-5-156.log -Dyarn.log.file=yarn-yarn-nodemanager-ip-172-31-5-156.log -Dyarn.home.dir=/usr/lib/hadoop-yarn -Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.root.logger=INFO,DRFA -Dyarn.root.logger=INFO,DRFA -Dsun.net.inetaddr.ttl=30 -Djava.library.path=:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native
/proc/27343/environ: NM_AUX_SERVICE_mapreduce_shuffle=AAA0+gAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=
/proc/27343/environ: YARN_NICENESS=0
/proc/27343/environ: NM_HTTP_PORT=8042
/proc/27343/environ: LOCAL_DIRS=/mnt/yarn/usercache/hadoop/appcache/application_1485807340469_0019,/mnt1/yarn/usercache/hadoop/appcache/application_1485807340469_0019
/proc/27343/environ: USER=hadoop
/proc/27343/environ: JAVA_LIBRARY_PATH=:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native
/proc/27343/environ: LD_LIBRARY_PATH=/mnt/yarn/usercache/hadoop/appcache/application_1485807340469_0019/container_1485807340469_0019_01_000003:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
/proc/27343/environ: JSVC_HOME=/usr/lib/bigtop-utils
/proc/27343/environ: HADOOP_LIBEXEC_DIR=/usr/lib/hadoop/libexec
/proc/27343/environ: HADOOP_TOKEN_FILE_LOCATION=/mnt/yarn/usercache/hadoop/appcache/application_1485807340469_0019/container_1485807340469_0019_01_000003/container_tokens
/proc/27343/environ: SVC_USER=yarn
/proc/27343/environ: LOG_DIRS=/var/log/hadoop-yarn/containers/application_1485807340469_0019/container_1485807340469_0019_01_000003
/proc/27343/environ: MALLOC_ARENA_MAX=4
/proc/27343/environ: HADOOP_JOB_HISTORYSERVER_HEAPSIZE=2396
/proc/27343/environ: YARN_ROOT_LOGGER=INFO,DRFA
/proc/27343/environ: NLSPATH=/usr/dt/lib/nls/msg/%L/%N.cat
/proc/27343/environ: PATH=/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin
/proc/27343/environ: CONF_DIR=/etc/hadoop/conf
/proc/27343/environ: YARN_IDENT_STRING=yarn
/proc/27343/environ: HADOOP_HDFS_HOME=/usr/lib/hadoop-hdfs
/proc/27343/environ: DAEMON_FLAGS=nodemanager
/proc/27343/environ: HADOOP_CLIENT_OPTS=
/proc/27343/environ: PWD=/mnt/yarn/usercache/hadoop/appcache/application_1485807340469_0019/container_1485807340469_0019_01_000003
/proc/27343/environ: HADOOP_COMMON_HOME=/usr/lib/hadoop
/proc/27343/environ: HADOOP_YARN_HOME=/usr/lib/hadoop-yarn
/proc/27343/environ: JAVA_HOME=/usr/lib/jvm/java-openjdk
/proc/27343/environ: HADOOP_CLASSPATH=/mnt/yarn/usercache/hadoop/appcache/application_1485807340469_0019/container_1485807340469_0019_01_000003:job.jar/job.jar:job.jar/classes/:job.jar/lib/*:/mnt/yarn/usercache/hadoop/appcache/application_1485807340469_0019/container_1485807340469_0019_01_000003/*:/mnt/yarn/usercache/hadoop/appcache/application_1485807340469_0019/container_1485807340469_0019_01_000001:job.jar/job.jar:job.jar/classes/:job.jar/lib/*:/mnt/yarn/usercache/hadoop/appcache/application_1485807340469_0019/container_1485807340469_0019_01_000001/*:/usr/lib/hbase/*:/usr/lib/hbase/lib/*:/etc/tez/conf:/usr/lib/tez/*:/usr/lib/tez/lib/*:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar:/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar:/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar:/usr/share/aws/emr/cloudwatch-sink/lib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*
/proc/27343/environ: HADOOP_CONF_DIR=/etc/hadoop/conf
/proc/27343/environ: DAEMON=hadoop-yarn-nodemanager
/proc/27343/environ: STDOUT_LOGFILE_ENV=/var/log/hadoop-yarn/containers/application_1485807340469_0019/container_1485807340469_0019_01_000003/stdout
/proc/27343/environ: LANG=en_US.UTF-8
/proc/27343/environ: SLEEP_TIME=10
/proc/27343/environ: XFILESEARCHPATH=/usr/dt/app-defaults/%L/Dt
/proc/27343/environ: HADOOP_OPTS= -server -XX:OnOutOfMemoryError='kill -9 %p' -Dhadoop.log.dir=/usr/lib/hadoop/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str= -Dhadoop.root.logger=INFO,console -Djava.library.path=:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -server -XX:OnOutOfMemoryError='kill -9 %p' -Dhadoop.log.dir=/usr/lib/hadoop/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str= -Dhadoop.root.logger=INFO,console -Djava.library.path=:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true
/proc/27343/environ: PIDFILE=/var/run/hadoop-yarn/yarn-yarn-nodemanager.pid
/proc/27343/environ: YARN_LOG_DIR=/var/log/hadoop-yarn
/proc/27343/environ: DESC=Hadoop nodemanager
/proc/27343/environ: EXEC_PATH=/usr/lib/hadoop-yarn/sbin/yarn-daemon.sh
/proc/27343/environ: SHLVL=5
/proc/27343/environ: HOME=/home/
/proc/27343/environ: JVM_PID=27333
/proc/27343/environ: YARN_CONF_DIR=/etc/hadoop/conf
/proc/27343/environ: YARN_LOGFILE=yarn-yarn-nodemanager-ip-172-31-5-156.log
/proc/27343/environ: YARN_NODEMANAGER_HEAPSIZE=2048
/proc/27343/environ: UPSTART_INSTANCE=
/proc/27343/environ: HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
/proc/27343/environ: LOGNAME=hadoop
/proc/27343/environ: NM_PORT=8041
/proc/27343/environ: HADOOP_HOME_WARN_SUPPRESS=true
/proc/27343/environ: CLASSPATH=/mnt/yarn/usercache/hadoop/appcache/application_1485807340469_0019/container_1485807340469_0019_01_000003:/etc/hadoop/conf:/usr/lib/hadoop/*:/usr/lib/hadoop/lib/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-yarn/*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar:/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar:/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar:/usr/share/aws/emr/cloudwatch-sink/lib/*:/usr/lib/hadoop-mapreduce/share/hadoop/mapreduce/*:/usr/lib/hadoop-mapreduce/share/hadoop/mapreduce/lib/*:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar:/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar:/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar:/usr/share/aws/emr/cloudwatch-sink/lib/*:job.jar/job.jar:job.jar/classes/:job.jar/lib/*:/mnt/yarn/usercache/hadoop/appcache/application_1485807340469_0019/container_1485807340469_0019_01_000003/*
/proc/27343/environ: CONTAINER_ID=container_1485807340469_0019_01_000003
/proc/27343/environ: YARN_PROXYSERVER_HEAPSIZE=2396
/proc/27343/environ: HADOOP_ROOT_LOGGER=DEBUG,console
/proc/27343/environ: WORKING_DIR=/var/lib/hadoop-yarn
/proc/27343/environ: UPSTART_JOB=hadoop-yarn-nodemanager
/proc/27343/environ: HADOOP_NAMENODE_HEAPSIZE=1740
/proc/27343/environ: HADOOP_DATANODE_HEAPSIZE=757
/proc/27343/environ: YARN_RESOURCEMANAGER_HEAPSIZE=2396
/proc/27343/environ: BASH_FUNC_run_prestart()=() {  su -s /bin/bash $SVC_USER -c "cd $WORKING_DIR && $EXEC_PATH --config '$CONF_DIR' start $DAEMON_FLAGS"
/proc/27343/environ: _=/usr/lib/jvm/java-openjdk/bin/java
Publicado en Uncategorized | Etiquetado , , | Deja un comentario