AWS S3 API – Throughput Notes


Some notes on settings that maximize throughput and increase parallelism when using the S3 API through the AWS CLI:

aws configure set default.s3.max_concurrent_requests 20
aws configure set default.s3.max_queue_size 10000
aws configure set default.s3.multipart_threshold 64MB
aws configure set default.s3.multipart_chunksize 16MB
aws configure set default.s3.max_bandwidth 50MB/s
aws configure set default.s3.use_accelerate_endpoint true
aws configure set default.s3.addressing_style path

max_concurrent_requests

Default – 10

The aws s3 transfer commands are multithreaded. At any given time, multiple requests to Amazon S3 are in flight. For example, if you are uploading a directory via aws s3 cp localdir s3://bucket/ --recursive, the AWS CLI could be uploading the local files localdir/file1, localdir/file2, and localdir/file3 in parallel. The max_concurrent_requests setting specifies the maximum number of requests that are allowed to be in flight at any given time.

You may need to change this value for a few reasons:

  • Decreasing this value – In some environments, the default of 10 concurrent requests can overwhelm a system. This may cause connection timeouts or slow the responsiveness of the system. Lowering this value will make the S3 transfer commands less resource intensive. The tradeoff is that S3 transfers may take longer to complete. Lowering this value may be necessary if using a tool such as trickle to limit bandwidth.
  • Increasing this value – In some scenarios, you may want the S3 transfers to complete as quickly as possible, using as much network bandwidth as necessary. In this scenario, the default number of concurrent requests may not be sufficient to utilize all the network bandwidth available. Increasing this value may improve the time it takes to complete an S3 transfer.

max_queue_size

Default – 1000

The AWS CLI internally uses a producer-consumer model, where S3 tasks are queued up and then executed by consumers, which in this case use a bounded thread pool controlled by max_concurrent_requests. A task generally maps to a single S3 operation. For example, a task could be a PutObjectTask, a GetObjectTask, or an UploadPartTask. The enqueuing rate can be much faster than the rate at which consumers execute tasks. To avoid unbounded growth, the task queue is capped at a specific size. This configuration value changes that maximum.

You generally will not need to change this value. It also corresponds to the number of tasks the CLI is aware of that still need to be executed, which means that by default it can only see 1000 tasks ahead. Until the S3 command knows the total number of tasks to be executed, the progress line will show a total of .... Increasing this value means the CLI can learn the total number of tasks more quickly, assuming that the enqueuing rate is faster than the rate of task consumption. The tradeoff is that a larger max queue size requires more memory.

multipart_threshold

Default – 8MB

When uploading, downloading, or copying a file, the S3 commands will switch to multipart operations if the file reaches a given size threshold. The multipart_threshold controls this value. You can specify this value in one of two ways:

  • The file size in bytes. For example, 1048576.
  • The file size with a size suffix. You can use KB, MB, GB, TB. For example: 10MB, 1GB. Note that S3 imposes constraints on valid values that can be used for multipart operations.
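The two notations above can be related with a small helper. This is only a sketch: to_bytes is a hypothetical function (not part of the AWS CLI), and it assumes the suffixes are binary multiples (1KB = 1024 bytes), which matches the "1048576 = 1 megabyte" example used further down in these notes.

```shell
# Convert a size such as "64MB" or "1048576" to a plain number of bytes.
# Assumption: suffixes are binary multiples (1KB = 1024 bytes).
to_bytes() {
  value=$1
  case $value in
    *KB) echo $(( ${value%KB} * 1024 )) ;;
    *MB) echo $(( ${value%MB} * 1024 * 1024 )) ;;
    *GB) echo $(( ${value%GB} * 1024 * 1024 * 1024 )) ;;
    *TB) echo $(( ${value%TB} * 1024 * 1024 * 1024 * 1024 )) ;;
    *)   echo "$value" ;;   # no suffix: already a byte count
  esac
}

to_bytes 64MB   # the multipart_threshold used in these notes → 67108864
to_bytes 16MB   # the multipart_chunksize used in these notes → 16777216
```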

multipart_chunksize

Default – 8MB

Minimum For Uploads – 5MB

Once the S3 commands have decided to use multipart operations, the file is divided into chunks. This configuration option specifies what the chunk size (also referred to as the part size) should be. This value can be specified using the same semantics as multipart_threshold, that is, either as an integer number of bytes or using a size suffix.
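Since S3 allows at most 10,000 parts per multipart upload, it can help to estimate how many parts a given chunk size produces. A minimal sketch of that arithmetic, where part_count is a hypothetical helper and both arguments are plain byte counts:

```shell
# Number of parts in a multipart upload: ceil(file_size / chunk_size).
part_count() {
  file_size=$1    # bytes
  chunk_size=$2   # bytes
  echo $(( (file_size + chunk_size - 1) / chunk_size ))
}

# A 100 GiB file with the 16MB chunk size used in these notes:
part_count $(( 100 * 1024 * 1024 * 1024 )) $(( 16 * 1024 * 1024 ))   # → 6400
```

With the default 8MB chunk size the same file would need 12800 parts, which is over the limit, so larger files call for a larger chunk size.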

max_bandwidth

Default – None

This controls the maximum bandwidth that the S3 commands will use when streaming content data to and from S3. Thus, this value only applies to uploads and downloads. It does not apply to copies or deletes, because those data transfers take place server side. The value is expressed in bytes per second and can be specified as:

  • An integer. For example, 1048576 would set the maximum bandwidth usage to 1 megabyte per second.
  • A rate suffix. You can specify rate suffixes using: KB/s, MB/s, GB/s, etc. For example: 300KB/s, 10MB/s.

In general, it is recommended to first use max_concurrent_requests to lower transfers to the desired bandwidth consumption. The max_bandwidth setting should then be used to further limit bandwidth consumption if max_concurrent_requests alone cannot lower it to the desired rate. This is recommended because max_concurrent_requests controls how many threads are running at once. If a high max_concurrent_requests value is combined with a low max_bandwidth value, threads may have to wait unnecessarily, which can lead to excess resource consumption and connection timeouts.
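To get a feel for what a bandwidth cap means in practice, the best-case duration of a transfer can be estimated with simple arithmetic. This is a rough sketch only: min_transfer_seconds is a hypothetical helper, it ignores request overhead and retries, and it assumes 1MB/s = 1024 * 1024 bytes per second, as in the integer example above.

```shell
# Lower bound, in whole seconds, for moving size_bytes at rate_bytes_per_s.
min_transfer_seconds() {
  size_bytes=$1
  rate_bytes_per_s=$2
  echo $(( (size_bytes + rate_bytes_per_s - 1) / rate_bytes_per_s ))
}

# 10 GiB at the 50MB/s cap set in these notes:
min_transfer_seconds $(( 10 * 1024 * 1024 * 1024 )) $(( 50 * 1024 * 1024 ))   # → 205
```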

use_accelerate_endpoint

Default – false

If set to true, the CLI will direct all Amazon S3 requests to the S3 Accelerate endpoint: s3-accelerate.amazonaws.com. To use this endpoint, your bucket must be enabled to use S3 Accelerate. All requests will be sent using the virtual style of bucket addressing: my-bucket.s3-accelerate.amazonaws.com. ListBuckets, CreateBucket, and DeleteBucket requests will not be sent to the Accelerate endpoint, because the endpoint does not support those operations. This behavior can also be enabled by setting the --endpoint-url parameter to https://s3-accelerate.amazonaws.com or http://s3-accelerate.amazonaws.com for any s3 or s3api command. This option is mutually exclusive with the use_dualstack_endpoint option.

use_dualstack_endpoint

Default – false

If set to true, will direct all Amazon S3 requests to the dual IPv4 / IPv6 endpoint for the configured region. This option is mutually exclusive with the use_accelerate_endpoint option.

addressing_style

Default – auto

There are two styles of constructing an S3 endpoint. The first is with the bucket included as part of the hostname. This corresponds to the addressing style of virtual. The second is with the bucket included as part of the path of the URI, corresponding to the addressing style of path. The default value in the CLI is auto, which will attempt to use virtual where possible, but will fall back to path style if necessary. For example, if your bucket name is not DNS compatible, the bucket name cannot be part of the hostname and must be in the path. With auto, the CLI will detect this condition and automatically switch to path style for you. If you set the addressing style to path, you must ensure that the AWS region you configured in the AWS CLI matches the region of your bucket.
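The difference between the two styles is easiest to see side by side. A small sketch, where s3_url is a hypothetical helper and the object key is made up for illustration; the global s3.amazonaws.com endpoint is used for brevity, and regional endpoints follow the same two shapes:

```shell
# Build an S3 object URL in the given addressing style.
s3_url() {
  style=$1
  bucket=$2
  key=$3
  case $style in
    virtual) echo "https://${bucket}.s3.amazonaws.com/${key}" ;;   # bucket in the hostname
    path)    echo "https://s3.amazonaws.com/${bucket}/${key}" ;;   # bucket in the path
    *)       echo "unknown style: $style" >&2; return 1 ;;
  esac
}

s3_url virtual my-bucket data/file1   # → https://my-bucket.s3.amazonaws.com/data/file1
s3_url path    my-bucket data/file1   # → https://s3.amazonaws.com/my-bucket/data/file1
```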

payload_signing_enabled

If set to true, s3 payloads will receive additional content validation in the form of a SHA256 checksum which will be calculated for you and included in the request signature. If set to false, the checksum will not be calculated. Disabling this can be useful to save the performance overhead that the checksum calculation would otherwise cause.

By default, this is disabled for streaming uploads (UploadPart and PutObject), but only if a ContentMD5 is present (it is generated by default) and the endpoint uses HTTPS.

Source: https://docs.aws.amazon.com/cli/latest/topic/s3-config.html


Secondary NameNode in Hadoop 2


This is a frequently asked question.

In Hadoop 2, whether you need a Secondary NameNode depends on how the cluster is set up:

1. With HA (High Availability cluster): if you are setting up an HA cluster, you do not need a Secondary NameNode, because the Standby NameNode keeps its state synchronized with the Active NameNode.

The HDFS NameNode High Availability feature lets you run redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby. Both NameNodes require the same type of hardware configuration. In an HA Hadoop cluster, the Active NameNode writes its metadata changes (the edit log) to a separate set of JournalNodes, and the Standby NameNode reads them from there.

In the event of a failover, the Standby NameNode ensures that its namespace is completely up to date with the edit logs before it changes to the active state. So there is no need for a Secondary NameNode in this cluster setup.

2. Without HA: you can have a Hadoop setup without a standby node. In that case the Secondary NameNode plays the same role as in Hadoop 1.x.

Source: https://stackoverflow.com/questions/37830777/use-of-secondary-namenode-in-hadoop-in-2-x


Adding a mount point to HDFS


Before proceeding:

This procedure assumes that you do not have any useful data on HDFS. All data will be lost after adding mount points with this method.

Apply this procedure to every DataNode in the cluster. No intervention on the master node is needed if the framework is configured properly.

#checking available block devices:
[ec2-user@ip-10-0-15-76 media]$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
nvme2n1 259:4 0 2.5T 0 disk
nvme1n1 259:3 0 2.5T 0 disk /media/ebs0
nvme4n1 259:6 0 2.5T 0 disk
nvme0n1 259:0 0 2G 0 disk
├─nvme0n1p1 259:1 0 2G 0 part /
└─nvme0n1p128 259:2 0 1M 0 part
nvme3n1 259:5 0 2.5T 0 disk

#checking formatted filesystem:
[ec2-user@ip-10-0-15-76 media]$ sudo file -s /dev/nvme2n1
/dev/nvme2n1: data

(an output of "data" means the device does not have a filesystem yet)

#formatting to ext4:
[ec2-user@ip-10-0-15-76 media]$ sudo mkfs -t ext4 /dev/nvme2n1
mke2fs 1.42.12 (29-Aug-2014)
Creating filesystem with 655360000 4k blocks and 163840000 inodes
Filesystem UUID: 6d9c997f-d47b-4529-85c8-e56e8ef47a1d
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000, 214990848, 512000000, 550731776, 644972544

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

#mounting
[ec2-user@ip-10-0-15-76 media]$ sudo mkdir /media/ebs1
[ec2-user@ip-10-0-15-76 media]$ sudo mount /dev/nvme2n1 /media/ebs1
[ec2-user@ip-10-0-15-76 media]$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
nvme2n1 259:4 0 2.5T 0 disk /media/ebs1
nvme1n1 259:3 0 2.5T 0 disk /media/ebs0
nvme4n1 259:6 0 2.5T 0 disk
nvme0n1 259:0 0 2G 0 disk
├─nvme0n1p1 259:1 0 2G 0 part /
└─nvme0n1p128 259:2 0 1M 0 part
nvme3n1 259:5 0 2.5T 0 disk

#final mount result
[ec2-user@ip-10-0-60-46 ~]$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
nvme2n1 259:4 0 2.5T 0 disk /media/ebs1
nvme1n1 259:3 0 2.5T 0 disk /media/ebs0
nvme4n1 259:6 0 2.5T 0 disk /media/ebs3
nvme0n1 259:0 0 2G 0 disk
├─nvme0n1p1 259:1 0 2G 0 part /
└─nvme0n1p128 259:2 0 1M 0 part
nvme3n1 259:5 0 2.5T 0 disk /media/ebs2

#checking mount points in hdfs-site.xml
[ec2-user@ip-10-0-60-46 media]$ grep -A1 dfs.datanode.data.dir /opt/hadoop-2.7.3/etc/hadoop/hdfs-site.xml
<name>dfs.datanode.data.dir</name>
<value>/media/ebs0/hadoop/datanodes,/media/ebs1/hadoop/datanodes,/media/ebs2/hadoop/datanodes,/media/ebs3/hadoop/datanodes</value>

# create defined directory structure on mount point (for each mount point):
sudo mkdir -p /media/ebs1/hadoop/datanodes

# modify owner to the user that will start DFS (for each mount point):
sudo chown -R ec2-user:ec2-user /media/ebs1/hadoop/datanodes

#format namenode (this wipes the existing HDFS metadata, see the warning above):
hdfs namenode -format

# stop/start DFS:
/opt/hadoop-2.7.3/sbin/stop-dfs.sh
/opt/hadoop-2.7.3/sbin/start-dfs.sh

# check service start status
tail -f /var/log/hadoop/hadoop-ec2-user-datanode-ip-10-0-15-76.log
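The "for each mount point" steps above can be generated from the dfs.datanode.data.dir value instead of being typed by hand. The following is a sketch only: prepare_cmds is a hypothetical helper that prints the mkdir and chown commands (a dry run, nothing is executed), and it assumes the paths contain no spaces or glob characters.

```shell
# Print the directory-preparation commands for every entry in a
# comma-separated dfs.datanode.data.dir value (dry run: nothing is executed).
prepare_cmds() {
  data_dirs=$1   # e.g. "/media/ebs0/hadoop/datanodes,/media/ebs1/hadoop/datanodes"
  owner=$2       # e.g. "ec2-user:ec2-user"
  old_ifs=$IFS
  IFS=,          # split the list on commas
  for dir in $data_dirs; do
    echo "sudo mkdir -p $dir"
    echo "sudo chown -R $owner $dir"
  done
  IFS=$old_ifs
}

prepare_cmds "/media/ebs0/hadoop/datanodes,/media/ebs1/hadoop/datanodes" "ec2-user:ec2-user"
```

After reviewing the printed commands, the output can be piped to sh to actually run them.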

Some environment variables I usually set in these environments:

export HADOOP_SSH_OPTS="-i /home/ec2-user/.ssh/mykey -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null"
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.151.x86_64/jre

Banana Muffins


Ingredients:

  • 100 g flour
  • 1 tablespoon baking powder
  • 100 g ripe banana
  • 3 eggs
  • 50 g white sugar
  • 1 tablespoon vanilla
  • 60 ml milk

Preparation:

Preheat the oven to 170°C.

In a small bowl, mix the eggs and the sugar. Then add the mashed banana, the vanilla, and the milk. Mix everything well until combined.

In another, larger bowl, mix the flour and the baking powder.

Add the wet ingredients to this bowl and mix until well combined. Do not overmix.

Fill the muffin liners and bake for about 25 minutes, or until a knife inserted in the center comes out clean and the tops are lightly golden.


AWS EMR – Big Data in Strata New York


Will you be in New York next week (Sept 25th – Sept 28th)?

Come meet the AWS Big Data team at Strata Data Conference, where we’ll be happy to answer your questions, hear about your requirements, and help you with your big data initiatives.

See you there!


s3:// vs s3n:// vs s3a:// vs EMRFS


s3://

Apache Hadoop implementation of a block-based filesystem backed by S3. Apache Hadoop has deprecated use of this filesystem as of May 2016.

s3n://

A native filesystem for reading and writing regular files on S3. S3N allows Hadoop to access files on S3 that were written with other tools, and conversely, other tools can access files written to S3N using Hadoop. S3N is stable and widely used, but it is not being updated with any new features. S3N requires a suitable version of the jets3t JAR on the classpath.

  • Uses jets3t

s3a://

Hadoop’s successor to the S3N filesystem. S3A uses Amazon’s libraries to interact with S3. S3A supports accessing files larger than 5 GB, and it provides performance enhancements and other improvements. For Apache Hadoop, S3A is the successor to S3N and is backward compatible with S3N. Using Apache Hadoop, all objects accessible from s3n:// URLs should also be accessible from S3A by replacing the URL scheme.

  • Uses AWS SDK.
  • Amazon EMR does not currently support use of the Apache Hadoop S3A file system.
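For Apache Hadoop outside of EMR, the scheme replacement mentioned above is a plain string rewrite. A minimal sketch, where to_s3a is a hypothetical helper and the bucket and key are made up for illustration:

```shell
# Rewrite an s3n:// URL to the s3a:// scheme; other URLs pass through unchanged.
to_s3a() {
  case $1 in
    s3n://*) echo "s3a://${1#s3n://}" ;;
    *)       echo "$1" ;;
  esac
}

to_s3a s3n://my-bucket/logs/part-00000   # → s3a://my-bucket/logs/part-00000
```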

EMRFS:

On Amazon EMR, both the s3:// and s3n:// URIs are associated with the EMR filesystem and are functionally interchangeable. For consistency's sake, however, it is recommended to use the s3:// URI on Amazon EMR.

EMRFS can be used by invoking the prefix s3n:// or s3:// or s3a:// depending on the client application implementation.

Source: https://aws.amazon.com/premiumsupport/knowledge-center/emr-file-system-s3/

Kill’em All!


This kills every YARN application that is still in the ACCEPTED state. Use it at your own discretion:

for app in $(yarn application -list -appStates ACCEPTED | awk '$1 ~ /^application_/ { print $1 }'); do yarn application -kill "$app"; done
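Before killing anything, it may be worth checking which application IDs a filter actually selects. The sketch below does only the selection step. accepted_app_ids is a hypothetical helper, the sample listing is made up, and the code assumes the listing is tab-separated with space-padded columns and the state in the sixth column; verify that against your Hadoop version. Splitting on tabs rather than on any whitespace keeps application names containing spaces from shifting the columns.

```shell
# Read a "yarn application -list" style listing on stdin and print the IDs
# of applications whose State column (6th, tab-separated) is ACCEPTED.
accepted_app_ids() {
  awk -F '\t' '
    {
      id = $1; state = $6
      gsub(/^ +| +$/, "", id)      # trim column padding
      gsub(/^ +| +$/, "", state)
      if (id ~ /^application_/ && state == "ACCEPTED") print id
    }'
}

# Made-up sample listing (hypothetical IDs and names):
printf 'application_1500000000000_0001\t my job\tSPARK\thadoop\tdefault\t ACCEPTED\tUNDEFINED\t0%%\tN/A\napplication_1500000000000_0002\tother\tSPARK\thadoop\tdefault\t RUNNING\tUNDEFINED\t10%%\tN/A\n' | accepted_app_ids
```

On a real cluster the listing would come from yarn application -list piped into accepted_app_ids, and each printed ID could then be passed to yarn application -kill.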
