Hadoop 1 vs Hadoop 2 – How many slots do I have per node?

This is a topic that always sparks a discussion…

In Hadoop 1, the number of tasks launched per node was specified via the settings mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum.

But these settings are ignored when set on Hadoop 2.

In Hadoop 2 with YARN, the number of concurrent tasks launched per node is determined by dividing the resources allocated to YARN on that node by the resources requested by each MapReduce task, and taking the minimum across the two resource types (memory and CPU).

This approach is an improvement over that of Hadoop 1, because the administrator no longer has to bundle CPU and memory into a Hadoop-specific concept of a “slot”.

The number of tasks that will be spawned per node:

min(
    yarn.nodemanager.resource.memory-mb / mapreduce.[map|reduce].memory.mb
    ,
    yarn.nodemanager.resource.cpu-vcores / mapreduce.[map|reduce].cpu.vcores
    )

The obtained value can be set in the ‘mapreduce.job.maps‘ property of the ‘mapred-site.xml‘ file.
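
For example, with a hypothetical NodeManager offering 8192 MB and 8 vcores to YARN, and map tasks requesting 1024 MB and 1 vcore each (illustrative values, not defaults):

min(
    8192 MB / 1024 MB = 8
    ,
    8 vcores / 1 vcore = 8
    ) = 8 concurrent map tasks on that node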

Of course, YARN is more dynamic than that, and each job can have unique resource requirements — so in a multitenant cluster with different types of jobs running, the calculation isn’t as straightforward.

More information:
http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/


Hadoop useful commands

- Copy files from/to S3 with copyToLocal/copyFromLocal:

$ bin/hadoop fs -copyToLocal s3://my-bucket/myfile.rb /home/hadoop/myfile.rb
$ bin/hadoop fs -copyFromLocal job5.avro s3://my-bucket/input

- Merge all the files from one folder into one single file (note that --groupBy takes a Java regex with a capturing group; files whose captured text is identical are concatenated):

$ hadoop jar ~/lib/emr-s3distcp-1.0.jar --src s3://my-bucket/my-folder/ --dest s3://my-bucket/logs/all-the-files-merged.log --groupBy '.*(my-folder).*' --outputCodec none
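
For files that are already on HDFS, the built-in getmerge command is a simpler way to concatenate a directory into one local file (a sketch; the paths are illustrative):

$ bin/hadoop fs -getmerge /my-folder/ /home/hadoop/all-the-files-merged.log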

- Create directory on HDFS:

$ bin/hadoop fs -mkdir -p /user/ubuntu

- List HDFS directory:

$ bin/hadoop fs -ls /

- Put a file in HDFS:

$ bin/hadoop dfs -put localfile.txt /user/hadoop/hadoopfile

- Check HDFS filesystem utilization:

$ bin/hadoop dfsadmin -report

- Cat of file on HDFS:

$ bin/hadoop dfs -cat /user/ubuntu/RESULTS/part-00000

More commands:

http://hadoop.apache.org/docs/r0.18.3/hdfs_shell.html


Generate a public key from a private key

I need to keep this handy:

ssh-keygen -y -f ~/.ssh/test-key.pem > ~/.ssh/test-key.pem.pub

Check beforehand that the permissions on test-key.pem are 600.
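
If they are not, fix them first:

$ chmod 600 ~/.ssh/test-key.pem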


Hadoop: HDFS find / recover corrupt blocks

1) Search for corrupt files:

A command like ‘hadoop fsck /’ will show the status of the filesystem and any corrupt files. The following pipeline ignores the lines with nothing but dots and the lines about replication:

hadoop fsck / | egrep -v '^\.+$' | grep -v eplica
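
Newer Hadoop versions can also list the affected files directly (assuming your version supports the -list-corruptfileblocks option):

hadoop fsck / -list-corruptfileblocks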

2) Determine the corrupt blocks:

hadoop fsck /path/to/corrupt/file -locations -blocks -files

(Use that output to determine where the blocks live. If the file is larger than your block size, it will span multiple blocks.)

3) Try to copy the files to S3 with s3distcp or s3cmd. If that fails, you will have the option to run:

hadoop fsck / -move

which will move what is left of the corrupt files into /lost+found on HDFS.

4) Delete the file:

hadoop fs -rm /path/to/file/with/permanently/missing/blocks

Check the filesystem state again as in step 1.

A more drastic command is:

hadoop fsck / -delete

which will search for and delete all corrupted files.

Hadoop should not use corrupt blocks again unless the replication factor is low and there are not enough healthy replicas.
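
If low replication is the underlying problem, the replication factor of a file can be raised with setrep (a sketch; the target factor of 3 is illustrative, and -w waits until replication completes):

hadoop fs -setrep -w 3 /path/to/file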

References:

http://hadoop.apache.org/docs/r0.19.0/commands_manual.html#fsck


Simple Java Telnet Port Scanner

It can be improved in many ways, but here it is:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.Scanner;

// Minimal TCP "telnet-style" port check: it tries to open a socket
// to the given host and port and reports whether it succeeded.
class Connectivity
{
    public static void main(String[] args)
    {
        Scanner sc = new Scanner(System.in);
        System.out.println("Please enter ip address");
        String ip = sc.nextLine().trim();
        System.out.println("Please enter port number");
        int port = sc.nextInt();

        // Connect with a 5-second timeout; an open port accepts the
        // connection, a closed or filtered one throws an IOException.
        try (Socket s1 = new Socket())
        {
            s1.connect(new InetSocketAddress(ip, port), 5000);
            System.out.println("Connected with ip " + ip + " and port " + port);
        }
        catch (IOException e)
        {
            System.out.println("Not connected: " + e.getMessage());
        }
    }
}
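
To compile and run it (assuming the file is named Connectivity.java):

$ javac Connectivity.java
$ java Connectivity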

Testing that the Java Cryptography Extension (JCE) is installed

If JCE is already installed, you should see that the jar files ‘local_policy.jar’ and ‘US_export_policy.jar’ are in $JAVA_HOME/jre/lib/security/.

But we can also test it programmatically. A reliable check is Cipher.getMaxAllowedKeyLength("AES"), which the limited policy caps at 128 bits (generating a 256-bit key with KeyGenerator can succeed even with the limited policy in place):

import javax.crypto.Cipher;
import java.security.NoSuchAlgorithmException;

class TestJCE {
    public static void main(String[] args) {
        boolean JCESupported = false;
        try {
            // With the unlimited-strength policy files installed this returns
            // Integer.MAX_VALUE; with the default limited policy it returns 128.
            JCESupported = Cipher.getMaxAllowedKeyLength("AES") > 128;
        } catch (NoSuchAlgorithmException e) {
            JCESupported = false;
        }
        System.out.println("JCE Supported=" + JCESupported);
    }
}

To compile (assuming the file name is TestJCE.java):

$ javac TestJCE.java

The previous command will create the TestJCE.class output file.

To run the program:

$ java TestJCE
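
With the unlimited-strength policy files installed, the program prints ‘JCE Supported=true’; otherwise it prints ‘JCE Supported=false’.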



HDFS: Cluster to cluster copy with distcp

This is the format of the distcp command to copy from HDFS to HDFS, with the source and destination clusters on Amazon AWS:

hadoop distcp "hdfs://ec2-54-86-202-252.compute-1.amazonaws.com:9000/tmp/test.txt" "hdfs://ec2-54-86-229-249.compute-1.amazonaws.com:9000/tmp/test1.txt"
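
For recurring copies, distcp also accepts flags such as -update, which skips files that already exist at the destination with the same size and checksum (a sketch reusing the same hosts):

hadoop distcp -update "hdfs://ec2-54-86-202-252.compute-1.amazonaws.com:9000/tmp" "hdfs://ec2-54-86-229-249.compute-1.amazonaws.com:9000/tmp"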

More information about distcp:

http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Installation-Guide/cdh4ig_topic_7_2.html
http://hadoop.apache.org/docs/r1.2.1/distcp2.html

