Tag archive: Spark

Copy Data with Hive and Spark


These are two examples of how to copy data from one S3 location to another S3 location. The same operation can be done from S3 to HDFS and vice versa. I'm assuming that you are able to launch the Hive client or … Continue reading
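The post is truncated here, but the Spark side of such a copy can be sketched as below. This is a minimal illustration, not the post's actual code; the bucket names, paths, and object name are hypothetical placeholders, and the same calls accept hdfs:// paths for the S3-to-HDFS case.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: copy text data from one S3 prefix to another with Spark.
// Source and destination are placeholders; swap in hdfs:// URIs to
// copy between S3 and HDFS in either direction.
object S3Copy {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("S3Copy"))
    sc.textFile("s3://source-bucket/input/")
      .saveAsTextFile("s3://dest-bucket/output/")
    sc.stop()
  }
}
```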

Posted in Uncategorized | Leave a comment

Get the driver's IP in Spark yarn-cluster mode


In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, … Continue reading
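Since the driver lands on an arbitrary cluster node in yarn-cluster mode, its address is only known at runtime. One way to recover it, shown here as a sketch rather than the post's own method, is to read the `spark.driver.host` property from the running job's configuration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: in yarn-cluster mode the driver's host is assigned at
// runtime; Spark records it in the "spark.driver.host" property.
object DriverHost {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DriverHost"))
    val driverHost = sc.getConf.get("spark.driver.host")
    println(s"Driver is running on: $driverHost")
    sc.stop()
  }
}
```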


Consider boosting spark.yarn.executor.memoryOverhead


This is a very specific error related to how the Spark executor and its YARN container coexist. You will typically see errors like this one in the application container logs:

  15/03/12 18:53:46 WARN YarnAllocator: Container killed by YARN for exceeding memory … Continue reading
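The warning's suggested fix is to reserve more off-heap headroom per executor. In the Spark 1.x releases of that era, the default overhead was roughly 10% of executor memory with a 384 MB floor; a hedged sketch of raising it explicitly (the values here are illustrative, not recommendations):

```scala
import org.apache.spark.SparkConf

// Sketch: give each executor more off-heap headroom so YARN does not
// kill the container. "1024" is megabytes and purely illustrative.
val conf = new SparkConf()
  .setAppName("OverheadDemo")
  .set("spark.executor.memory", "4g")
  .set("spark.yarn.executor.memoryOverhead", "1024")
```

The same setting can be passed on the command line with `--conf spark.yarn.executor.memoryOverhead=1024`.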


Compile Scala program with sbt


Install sbt:

  curl https://bintray.com/sbt/rpm/rpm | sudo tee /etc/yum.repos.d/bintray-sbt-rpm.repo
  sudo yum install sbt

Compile & build: place build.sbt and the .scala program in the same directory and run:

  sbt package
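The post does not show the build.sbt itself; a minimal sketch for a Spark project might look like the following, where the project name and all version numbers are illustrative assumptions:

```scala
// Minimal build.sbt sketch; name and versions are placeholders.
name := "my-spark-app"

version := "0.1"

scalaVersion := "2.11.8"

// "provided" keeps Spark out of the packaged jar, since the cluster
// supplies it at runtime.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.2" % "provided"
```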


Running Spark with oozie


Oozie 4.2 now supports spark-action. Example job.properties file (configuration tested on EMR 4.2.0):

  nameNode=hdfs://172.31.25.17:8020
  jobTracker=172.31.25.17:8032
  master=local[*]
  queueName=default
  examplesRoot=examples
  oozie.use.system.libpath=true
  oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/apps/spark

(Use the master node's internal IP instead of localhost in nameNode and jobTracker.) Validate the Oozie workflow XML file: oozie … Continue reading
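The job.properties above is only half of the setup; the spark-action itself lives in workflow.xml. A hypothetical sketch of such a workflow, reusing the property names above (the workflow name, class, and jar path are assumptions, not the post's actual values):

```xml
<workflow-app xmlns="uri:oozie:workflow:0.5" name="spark-wf">
  <start to="spark-node"/>
  <action name="spark-node">
    <spark xmlns="uri:oozie:spark-action:0.1">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <master>${master}</master>
      <name>SparkPi</name>
      <class>org.apache.spark.examples.SparkPi</class>
      <jar>${nameNode}/user/${user.name}/${examplesRoot}/apps/spark/lib/spark-examples.jar</jar>
    </spark>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Spark action failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```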


Spark Notes


Apache Spark is an open-source cluster computing framework originally developed at the University of California, Berkeley, and later donated to the Apache Software Foundation, where it remains today. In contrast to Hadoop's two-stage, disk-based MapReduce paradigm, Spark's multi-stage in-memory … Continue reading
