
Getting Started with AWS

Analyzing Big Data


Getting Started with AWS: Analyzing Big Data
Overview ................................................................................................................................................. 1
Getting Started ....................................................................................................................................... 3
Step 1: Sign Up for the Service .............................................................................................................. 4
Step 2: Create a Key Pair ....................................................................................................................... 4
Step 3: Create an Interactive Job Flow Using the Console .................................................................... 5
Step 4: SSH into the Master Node .......................................................................................................... 8
Step 5: Start and Configure Hive .......................................................................................................... 11
Step 6: Create the Hive Table and Load Data into HDFS ..................................................................... 11
Step 7: Query Hive ............................................................................................................................... 12
Step 8: Clean Up .................................................................................................................................. 13
Variations .............................................................................................................................................. 15
Calculating Pricing ................................................................................................................................ 17
Related Resources ............................................................................................................................... 21
Document History ................................................................................................................................. 23
Overview
Big data refers to data sets that are too large to be hosted in traditional relational databases and are
inefficient to analyze using nondistributed applications. As the amount of data that businesses generate
and store continues to grow rapidly, tools and techniques to process this large-scale data become
increasingly important.
This guide explains how Amazon Web Services helps you manage these large data sets. As an example,
we'll look at a common source of large-scale data: web server logs.
Web server logs can contain a wealth of information about visitors to your site: where their interests lie,
how they use your website, and how they found it. However, web server logs grow rapidly, and their
format is not immediately compatible with relational data stores.
A popular way to analyze big data sets is with clusters of commodity computers running in parallel. Each
computer processes a portion of the data, and then the results are aggregated. One technique that uses
this strategy is called map-reduce.
In the map phase, the problem set is apportioned among a set of computers in atomic chunks. For
example, if the question were "How many people used the search keyword 'chocolate' to find this page?",
the set of web server logs would be divided among the set of computers, each counting instances of that
keyword in its partial data set.
The reduce phase aggregates the results from all the computers to a final result. To continue our example,
as each computer finished processing its set of data, it would report its results to a single node that tallies
the total to produce the final answer.
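As a rough sketch of this pattern on a single machine (the file names here are hypothetical; Hadoop automates the same steps across many computers), the map phase could count the keyword in each chunk of the logs, and the reduce phase could sum the partial counts:

# Map phase: each worker counts the lines containing "chocolate" in its own chunk of the logs
grep -c 'chocolate' access_log.part1 > count1.txt
grep -c 'chocolate' access_log.part2 > count2.txt

# Reduce phase: a single node tallies the partial counts into the final answer
cat count1.txt count2.txt | paste -sd+ - | bc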
Apache Hadoop is an open-source implementation of MapReduce that supports distributed processing
of large data sets across clusters of computers. You can configure the size of your Hadoop cluster based
on the number of physical machines you would like to use; however, purchasing and maintaining a set
of physical servers can be an expensive proposition. Further, if your processing needs fluctuate, you
could find yourself paying for the upkeep of idle machines.
How does AWS help?
Amazon Web Services provides several services to help you process large-scale data. You pay for only
the resources that you use, so you don't need to maintain a cluster of physical servers and storage
devices. Further, AWS makes it easy to provision, configure, and monitor a Hadoop cluster.
The following table shows the Amazon Web Services that can help you manage big data.
Amazon Simple Storage Service (Amazon S3)
Challenge: Web server logs can be very large. They need to be stored at low cost while protecting against corruption or loss.
Benefit: Amazon S3 can store large amounts of data, and its capacity can grow to meet your needs. It is highly redundant, protecting against data loss.

Amazon Elastic Compute Cloud (Amazon EC2)
Challenge: Maintaining a cluster of physical servers to process data is expensive and time-consuming.
Benefit: By running map-reduce applications on virtual Amazon EC2 servers, you pay for the servers only while the application is running, and you can expand the number of servers to match the processing needs of your application.

Amazon EMR
Challenge: Hadoop and other open-source big-data tools can be challenging to configure, monitor, and operate.
Benefit: Amazon EMR handles Hadoop cluster configuration, monitoring, and management while integrating seamlessly with other AWS services to simplify large-scale data processing in the cloud.
Data Processing Architecture
In the following diagram, the data to be processed is stored in an Amazon S3 bucket. Amazon EMR
streams the data from Amazon S3 and launches Amazon EC2 instances to process the data in parallel.
When Amazon EMR launches the EC2 instances, it initializes them with an Amazon Machine Image (AMI)
that can be preloaded with open-source tools, such as Hadoop, Hive, and Pig, which are optimized to
work with other AWS services.
Let's see how we can use this architecture to analyze big data.
Getting Started
Topics
Step 1: Sign Up for the Service (p. 4)
Step 2: Create a Key Pair (p. 4)
Step 3: Create an Interactive Job Flow Using the Console (p. 5)
Step 4: SSH into the Master Node (p. 8)
Step 5: Start and Configure Hive (p. 11)
Step 6: Create the Hive Table and Load Data into HDFS (p. 11)
Step 7: Query Hive (p. 12)
Step 8: Clean Up (p. 13)
Suppose you host a popular e-commerce website. In order to understand your customers better, you
want to analyze your Apache web logs to discover how people are finding your site. You'd especially like
to determine which of your online ad campaigns are most successful in driving traffic to your online store.
The web server logs, however, are too large to import into a MySQL database, and they are not in a
relational format. You need another way to analyze them.
Amazon EMR integrates open-source applications such as Hadoop and Hive with Amazon Web Services
to provide a scalable and efficient architecture to analyze large-scale data, such as your Apache web
logs.
In the following tutorial, we'll import data from Amazon S3 and launch an Amazon EMR job flow from the
AWS Management Console. Then we'll use Secure Shell (SSH) to connect to the master node of the job
flow, where we'll run Hive to query the Apache logs using a simplified SQL syntax.
Working through this tutorial will cost you 29 cents for each hour or partial hour that your job flow is running. The
tutorial typically takes less than an hour to complete. The input data we will use is hosted in an Amazon
S3 bucket owned by the Amazon Elastic MapReduce team, so you pay only for the Amazon Elastic
MapReduce processing on three m1.small EC2 instances (29 cents per hour) to launch and manage the
job flow in the US-East region.
Let's begin!
Step 1: Sign Up for the Service
If you don't already have an AWS account, you'll need to get one. Your AWS account gives you access
to all services, but you will be charged only for the resources that you use. For this example walkthrough,
the charges will be minimal.
To sign up for AWS
1. Go to http://aws.amazon.com and click Sign Up.
2. Follow the on-screen instructions.
AWS notifies you by email when your account is active and available for you to use.
You use your AWS account credentials to deploy and manage resources within AWS. If you give other
people access to your resources, you will probably want to control who has access and what they can
do. AWS Identity and Access Management (IAM) is a web service that controls access to your resources
by other people. In IAM, you create users, which other people can use to obtain access and permissions
that you define. For more information about IAM, go to Using IAM.
Step 2: Create a Key Pair
You can create a key pair to connect to the Amazon EC2 instances that Amazon EMR launches. For
security reasons, EC2 instances use a public/private key pair, rather than a user name and password,
to authenticate connection requests. The public key half of this pair is embedded in the instance, so you
can use the private key to log in securely without a password. In this step we will use the AWS Management
Console to create a key pair. Later, we'll use this key pair to connect to the master node of the Amazon
EMR job flow in order to run Hive.
To generate a key pair
1. Open the Amazon EC2 console at https://console.aws.amazon.com/ec2/.
2. In the top navigation bar, in the region selector, click US East (N. Virginia).
3. In the left navigation pane, under Network and Security, click Key Pairs.
4. Click Create Key Pair.
5. Type mykeypair in the new Key Pair Name box and then click Create.
6. Download the private key file, which is named mykeypair.pem, and keep it in a safe place. You will
need it to access any instances that you launch with this key pair.
Important
If you lose the key pair, you cannot connect to your Amazon EC2 instances.
For more information about key pairs, see Getting an SSH Key Pair in the Amazon Elastic Compute
Cloud User Guide.
Step 3: Create an Interactive Job Flow Using the
Console
To create a job flow using the console
1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at
https://console.aws.amazon.com/elasticmapreduce/.
2. In the Region box, click US East, and then click Create New Job Flow.
3. On the Define Job Flow page, click Run your own application. In the Choose a Job Type box,
click Hive Program, and then click Continue. You can leave Job Flow Name as My Job Flow, or
you can rename it.
Select which version of Hadoop to run on your job flow in Hadoop Version. You can choose to run
the Amazon distribution of Hadoop or one of two MapR distributions. For more information about
MapR distributions for Hadoop, see Using the MapR Distribution for Hadoop in the Amazon Elastic
MapReduce Developer Guide.
4. On the Specify Parameters page, click Start an Interactive Hive Session, and then click Continue.
Hive is an open-source tool that runs on top of Hadoop. With Hive, you can query job flows by using
a simplified SQL syntax. We are selecting an interactive session because we'll be issuing commands
from a terminal window.
Note
You can select the Execute a Hive Script check box to run Hive commands that you store
in a text file in an Amazon S3 bucket. This option is useful for automating Hive queries you
want to perform on an recurring basis. Because of the requirements of Hadoop, Amazon
S3 bucket names used with Amazon EMR must contain only lowercase letters, numbers,
periods (.), and hyphens (-).
5. On the Configure EC2 Instances page, you can set the number and type of instances used to
process the big data set in parallel. To keep the cost of this tutorial low, we will accept the default
instance types, an m1.small master node and two m1.small core nodes. Click Continue.
Note
When you are analyzing data in a real application, you may want to increase the size or
number of these nodes to increase processing power and reduce computational time. In
addition, you can specify some or all of your job flow to run as Spot Instances, a way of
purchasing unused Amazon EC2 capacity at reduced cost. For more information about spot
instances, go to Lowering Costs with Spot Instances in the Amazon Elastic MapReduce
Developer Guide.
6. On the Advanced Options page, specify the key pair that you created earlier.
Leave the rest of the settings on this page at the default values. For example, Amazon VPC Subnet
Id should remain set to No preference.
Note
In a production environment, debugging can be a useful tool to analyze errors or inefficiencies
in a job flow. For more information on how to use debugging in Amazon EMR, go to
Troubleshooting in the Amazon Elastic MapReduce Developer Guide.
7. On the Bootstrap Actions page, click Proceed with no Bootstrap Actions and then click Continue.
Bootstrap actions load custom software onto the AMIs that Amazon EMR launches. For this tutorial,
we will be using Hive, which is already installed on the AMI, so no bootstrap action is needed.
8. On the Review page, review the settings for your job flow. If everything looks correct, click Create
Job Flow. When the confirmation window closes, your new job flow will appear in the list of job flows
in the Amazon EMR console with the status STARTING. It will take a few minutes for Amazon EMR
to provision the Amazon EC2 instances for your job flow.
Step 4: SSH into the Master Node
When the job flow's Status in the Amazon EMR console is WAITING, the master node is ready for you
to connect to it. Before you can do that, however, you must acquire the DNS name of the master node
and configure your connection tools and credentials.
To locate the DNS name of the master node
Locate the DNS name of the master node in the Amazon EMR console by selecting the job flow
from the list of running job flows. This causes details about the job flow to appear in the lower pane.
The DNS name you will use to connect to the instance is listed on the Description tab as Master
Public DNS Name. Make a note of the DNS name; you'll need it in the next step.
Next we'll use a Secure Shell (SSH) application to open a terminal connection to the master node. An
SSH application is installed by default on most Linux, Unix, and Mac OS installations. Windows users
can use an application called PuTTY to connect to the master node. Platform-specific instructions for
configuring a Windows application to open an SSH connection are described later in this topic.
You must first configure your credentials, or SSH will return an error message saying that your private
key file is unprotected, and it will reject the key. You need to do this step only the first time you use the
private key to connect.
To configure your credentials on Linux/Unix/Mac OS X
1. Open a terminal window. This is found at Applications/Utilities/Terminal on Mac OS X and at
Applications/Accessories/Terminal on many Linux distributions.
2. Set the permissions on the PEM file for your Amazon EC2 key pair so that only the key owner has
permissions to access the key. For example, if you saved the file as mykeypair.pem in the user's
home directory, the command is:
chmod og-rwx ~/mykeypair.pem
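Equivalently, you can set the file mode explicitly so that only the owner can read and write the key:

chmod 600 ~/mykeypair.pem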
To connect to the master node using Linux/Unix/Mac OS X
1. From the terminal window, enter the following command line, where ssh is the command, hadoop
is the user name you are using to connect, the at symbol (@) joins the username and the DNS of the
machine you are connecting to, and the -i parameter indicates the location of the private key file
you saved in step 6 of Step 2: Create a Key Pair (p. 4). In this example, we're assuming it's been
saved to your home directory.
ssh hadoop@master-public-dns-name \
-i ~/mykeypair.pem
2. You will receive a warning that the authenticity of the host you are connecting to can't be verified.
Type yes to continue connecting.
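For example, with a Master Public DNS Name such as ec2-184-72-128-177.compute-1.amazonaws.com (yours will differ), the full command looks similar to the following.

ssh hadoop@ec2-184-72-128-177.compute-1.amazonaws.com -i ~/mykeypair.pem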
If you are using a Windows-based computer, you will need to install an SSH program in order to connect
to the master node. In this tutorial, we will use PuTTY. If you have already installed PuTTY and configured
your key pair, you can skip this procedure.
To install and configure PuTTY on Windows
1. Download PuTTYgen.exe and PuTTY.exe to your computer from
http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html.
2. Launch PuTTYgen.
3. Click Load. Select the PEM file you created earlier. You may have to change the search parameters
from file of type PuTTY Private Key Files (*.ppk) to All Files (*.*).
4. Click Open.
5. On the PuTTYgen Notice telling you the key was successfully imported, click OK.
6. To save the key in the PPK format, click Save private key.
7. When PuTTYgen prompts you to save the key without a pass phrase, click Yes.
8. Enter a name for your PuTTY private key, such as mykeypair.ppk.
To connect to the master node using Windows/PuTTY
1. Start PuTTY.
2. In the Category list, click Session. In the Host Name box, type hadoop@DNS. The input looks
similar to hadoop@ec2-184-72-128-177.compute-1.amazonaws.com.
3. In the Category list, expand Connection, expand SSH, and then click Auth.
4. In the Options controlling SSH authentication pane, click Browse for Private key file for
authentication, and then select the private key file that you generated earlier. If you are following
this guide, the file name is mykeypair.ppk.
5. Click Open.
6. To connect to the master node, click Open.
7. In the PuTTY Security Alert, click Yes.
Note
For more information about how to install PuTTY and use it to connect to an EC2 instance, such
as the master node, go to Appendix D: Connecting to a Linux/UNIX Instance from Windows
using PuTTY in the Amazon Elastic Compute Cloud User Guide.
When you are successfully connected to the master node via SSH, you will see a welcome screen and
prompt like the following.
-----------------------------------------------------------------------------
Welcome to Amazon EMR running Hadoop and Debian/Lenny.
Hadoop is installed in /home/hadoop. Log files are in /mnt/var/log/hadoop. Check
/mnt/var/log/hadoop/steps for diagnosing step failures.
The Hadoop UI can be accessed via the following commands:
JobTracker lynx http://localhost:9100/
NameNode lynx http://localhost:9101/
-----------------------------------------------------------------------------
hadoop@ip-10-245-190-34:~$
Step 5: Start and Configure Hive
Apache Hive is a data warehouse application you can use to query data contained in Amazon EMR job
flows by using a SQL-like language. Because we launched the job flow as a Hive application, Amazon
EMR will install Hive on the Amazon EC2 instances it launches to process the job flow.
We will use Hive interactively to query the web server log data. To begin, we will start Hive, and then we'll
load some additional libraries that add functionality, such as the ability to easily access Amazon S3. The
additional libraries are contained in a Java archive file named hive_contrib.jar on the master node. When
you load these libraries, Hive bundles them with the map-reduce job that it launches to process your
queries.
To learn more about Hive, go to http://hive.apache.org/.
To start and configure Hive on the master node
1. From the command line of the master node, type hive, and then press Enter.
2. At the hive> prompt, type the following command, and then press Enter.
hive> add jar /home/hadoop/hive/lib/hive_contrib.jar;
Step 6: Create the Hive Table and Load Data into
HDFS
In order for Hive to interact with data, it must translate the data from its current format (in the case of
Apache web logs, a text file) into a format that can be represented as a database table. Hive does this
translation using a serializer/deserializer (SerDe). SerDes exist for all kinds of data formats. For information
about how to write a custom SerDe, go to the Apache Hive Developer Guide.
The SerDe we'll use in this example uses regular expressions to parse the log file data. It comes from
the Hive open-source community, and the code for this SerDe can be found online at
https://github.com/apache/hive/blob/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java.
Using this SerDe, we can define the log files as a table, which we'll query using SQL-like statements later
in this tutorial.
To translate the Apache log file data into a Hive table
Copy the following multiline command. At the hive command prompt, paste the command, and then
press Enter.
CREATE TABLE serde_regex(
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
LOCATION 's3://elasticmapreduce/samples/pig-apache/input/';
In the command, the LOCATION parameter specifies the location of the Apache log files in Amazon S3.
For this tutorial we are using a set of sample log files. To analyze your own Apache web server log files,
you would replace the Amazon S3 URL above with the location of your own log files in Amazon S3 (see
the example following the confirmation output below). Because of the requirements of Hadoop, Amazon
S3 bucket names used with Amazon EMR must contain only lowercase letters, numbers, periods (.), and
hyphens (-).
After you run the command above, you should receive a confirmation like this one:
Found class for org.apache.hadoop.hive.contrib.serde2.RegexSerDe
OK
Time taken: 12.56 seconds
hive>
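To analyze your own log files, only the LOCATION clause of the CREATE TABLE statement needs to change. For example, with a hypothetical bucket named my-log-bucket that stores your logs under a logs/ prefix:

LOCATION 's3://my-log-bucket/logs/';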
Once Hive has loaded the data, the data will persist in HDFS storage as long as the Amazon EMR job
flow is running, even if you shut down your Hive session and close the SSH terminal.
Step 7: Query Hive
You are now ready to start querying the Apache log file data. Below are some queries to run.
Count the number of rows in the Apache webserver log files.
select count(1) from serde_regex;
Return all fields from one row of log file data.
select * from serde_regex limit 1;
Count the number of requests from the host with an IP address of 192.168.1.198.
select count(1) from serde_regex where host="192.168.1.198";
To return query results, Hive translates your query into a MapReduce application that is run on the Amazon
EMR job flow.
You can invent your own queries for the web server log files. Hive SQL is a subset of SQL. For more
information about the available query syntax, go to the Hive Language Manual.
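For example, the following query (a sketch using the serde_regex table defined earlier) returns the ten hosts that made the most requests.

select host, count(1) as request_count
from serde_regex
group by host
order by request_count desc
limit 10;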
Step 8: Clean Up
To prevent your account from accruing additional charges, you should terminate the job flow when you
are done with this tutorial. Because you launched the job flow as interactive, it has to be manually
terminated.
To disconnect from Hive and SSH
1. In the SSH terminal window, press CTRL+C to exit Hive.
2. At the SSH command prompt, type exit, and then press Enter. Afterwards you can close the terminal
or PuTTY window.
exit
To terminate a job flow
In the Amazon Elastic MapReduce console, click the job flow, and then click Terminate.
The next step is optional. It deletes the key pair you created in Step 2. You are not charged for key pairs.
If you are planning to explore Amazon EMR further, you should retain the key pair.
To delete a key pair
1. From the Amazon EC2 console, select Key Pairs from the left-hand pane.
2. In the right pane, select the key pair you created in Step 2 and click Delete.
The next step is optional. It deletes two security groups created for you by Amazon EMR when you
launched the job flow. You are not charged for security groups. If you are planning to explore Amazon
EMR further, you should retain them.
To delete Amazon EMR security groups
1. From the Amazon EC2 console, in the Navigation pane, click Security Groups.
2. In the Security Groups pane, click the ElasticMapReduce-slave security group.
3. In the details pane for the ElasticMapReduce-slave security group, delete all rules that reference
ElasticMapReduce. Click Apply Rule Changes.
4. In the right pane, select the ElasticMapReduce-master security group.
5. In the details pane for the ElasticMapReduce-master security group, delete all rules that reference
ElasticMapReduce. Click Apply Rule Changes.
6. With ElasticMapReduce-master still selected in the Security Groups pane, click Delete. Click Yes
to confirm.
7. In the Security Groups pane, click ElasticMapReduce-slave, and then click Delete. Click Yes to
confirm.
Variations
Topics
Script Your Hive Queries (p. 15)
Use Pig Instead of Hive to Analyze Your Data (p. 15)
Create Custom Applications to Analyze Your Data (p. 15)
Now that you know how to launch and run an interactive Hive job using Amazon Elastic MapReduce to
process large quantities of data, let's consider some alternatives.
Script Your Hive Queries
Interactively querying data is the most direct way to get results, and is useful when you are exploring the
data and developing a set of queries that will best answer your business questions. Once you've created
a set of queries that you want to run on a regular basis, you can automate the process by saving your
Hive commands as a script and uploading them to Amazon S3. Then you can launch a Hive job flow
using the Execute a Hive Script option that we bypassed in Step 4 of Create an Interactive Job Flow
Using the Console. For more information on how to launch a job flow using a Hive script, go to How to
Create a Job Flow Using Hive in the Amazon Elastic MapReduce Developer Guide.
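As a minimal sketch (the file name and its contents here are hypothetical), the script is simply the Hive statements you would otherwise type interactively, stored in a text file in your own Amazon S3 bucket:

-- weekly-report.q: statements run in order by the Hive job flow
add jar /home/hadoop/hive/lib/hive_contrib.jar;
-- ... the CREATE TABLE serde_regex statement from Step 6 goes here ...
select count(1) from serde_regex;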
Use Pig Instead of Hive to Analyze Your Data
Amazon Elastic MapReduce provides access to many open source tools that you can use to analyze
your data. Another of these is Pig, which uses a language called Pig Latin to abstract map-reduce jobs.
For an example of how to process the same log files used in this tutorial using Pig, go to Parsing Logs
with Apache Pig and Amazon Elastic MapReduce.
Create Custom Applications to Analyze Your
Data
If the functionality you need is not available as an open-source project or if your data has special analysis
requirements, you can write a custom Hadoop map-reduce application and then use Amazon Elastic
MapReduce to run it on Amazon Web Services. For more information, go to Run a Hadoop Application
to Process Data in the Amazon Elastic MapReduce Developer Guide.
Alternatively, you can create a streaming job flow that reads data from standard input and then runs a
script or executable (a mapper) against the input. Once all the data is processed by the mapper, a second
script or executable (a reducer) processes the mapper results. The results from the reducer are sent to
standard output. The mapper and the reducer can each be referenced as a file or you can supply a Java
class. You can implement the mapper and reducer in any of the supported languages, including Ruby,
Perl, Python, PHP, or Bash. For more information, go to Launch a Streaming Cluster in the Amazon
Elastic MapReduce Developer Guide.
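As a rough sketch of the streaming model, the mapper and reducer are just programs that read from standard input and write to standard output. A classic smoke test uses standard shell utilities as the mapper and reducer, which together count the total number of input lines:

# Mapper: pass each input line through unchanged
/bin/cat
# Reducer: count the lines received from the mappers
/usr/bin/wc -l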
Calculating Pricing
Topics
Amazon S3 Cost Breakdown (p. 18)
Amazon Elastic MapReduce Cost Breakdown (p. 19)
The AWS Simple Monthly Calculator estimates your monthly bill. It provides a per service cost breakdown,
as well as an aggregate monthly estimate. You can also use the calculator to see an estimation and
breakdown of costs for common solutions.
This topic walks you through an example of using the AWS Simple Monthly Calculator to estimate your
monthly bill for processing 1 GB of weekly web server log files. For additional information, download the
whitepaper How AWS Pricing Works.
To estimate costs using the AWS Simple Monthly Calculator
1. Go to http://calculator.s3.amazonaws.com/calc5.html.
2. In the navigation pane, select a web service that your application requires. Enter your estimated
monthly usage for that web service. Click Add To Bill to add the cost for that service to your total.
Repeat this step for each web service your application uses.
3. To see the total estimated monthly charges, click the Estimate of your Monthly Bill tab.
Suppose you have a website that produces web logs that you upload each day to an S3 bucket. This
transfer averages about 1 GB of web-server data a week. You want to analyze the data once a week for
an hour, and then delete the web logs. To make processing the log files efficient, we'll use three m1.large
Amazon EC2 instances in the Amazon Elastic MapReduce job flow.
Note
AWS pricing you see in this documentation is current at the time of publication.
Amazon S3 Cost Breakdown
The following table shows the characteristics for Amazon S3 we have identified for this web application
hosting architecture. For the latest pricing information, go to AWS Service Pricing Overview.
Storage: 1 GB/month
The website generates 1 GB of web server logs a week, and we store only one week's worth of data at a time in Amazon S3.

Requests: 30 PUT requests/month; 30 GET requests/month
Each day a file containing the daily web logs is uploaded to Amazon S3, resulting in 30 PUT requests per month. Each week, Amazon Elastic MapReduce loads seven daily web log files from Amazon S3 to HDFS, resulting in an average of 30 GET requests per month.

Data Transfer: 4 GB/month in; 4 GB/month out
If the average amount of data generated by the web server logs each week is 1 GB, the total amount of data transferred to Amazon S3 each month is 4 GB. Data transfer into AWS is free. In addition, there is no charge for transferring data between services within AWS in the same Region, so the transfer of data from Amazon S3 to HDFS on an Amazon EC2 instance is also free.
The following image shows the cost breakdown for Amazon S3 in the AWS Simple Monthly Calculator
for the US-East region. The total monthly cost is the sum of the cost for uploading the daily web logs,
storing a week's worth of logs in Amazon S3, and then transferring them to Amazon Elastic MapReduce
for processing.
Provisioned Storage
Formula: Storage Rate ($0.125/GB) x Storage Amount (1 GB*)
Calculation: $0.13

PUT requests
Formula: Request Rate ($0.01 / 1000 requests) x Number of requests (30**)
Calculation: $0.01

GET requests
Formula: Request Rate ($0.01 / 1000 requests) x Number of requests (30**)
Calculation: $0.01

Total Amazon S3 Charges: $0.15

(*) The charge of $0.125 is rounded up to $0.13.
(**) Any number of requests fewer than 1000 is charged the minimum, which is $0.01.
A way to reduce your Amazon S3 storage charges is to use Reduced Redundancy Storage (RRS), a
lower cost option for storing non-critical, reproducible data. For more information about RRS, go to
http://aws.amazon.com/s3/faqs/#What_is_RRS.
Amazon Elastic MapReduce Cost Breakdown
The following table shows the characteristics for Amazon Elastic MapReduce we'll use for our example.
For more information, go to Amazon Elastic MapReduce Pricing.
Machine Characteristics: m1.large (on-demand)
To make processing 1 GB of data efficient, we'll use m1.large instances to boost computational power.

Instance Scale: 3
One master node and two slave nodes.

Uptime: 4 hours/month
Four hours each month to process the log files.
The following image shows the cost breakdown for Amazon Elastic MapReduce in the AWS Simple
Monthly Calculator.
The total monthly cost is the sum of the cost of running and managing the EC2 Instances in the job flow.
Instance Management Cost
Formula: Compute cost per hour x Number of instances x Uptime in hours
Calculation: $0.38 x 3 x 4 = $4.56

Total Amazon Elastic MapReduce Charges: $4.56
Related Resources
The following table lists related resources that you'll find useful as you work with AWS services.
AWS Products and Services
A comprehensive list of products and services AWS offers.

Documentation
Official documentation for each AWS product, including service introductions, service features, API references, and other useful information.

AWS Architecture Center
Provides the necessary guidance and best practices to build highly scalable and reliable applications in the AWS cloud. These resources help you understand the AWS platform, its services, and its features, and they provide architectural guidance for the design and implementation of systems that run on the AWS infrastructure.

AWS Economics Center
Provides access to information, tools, and resources to compare the costs of Amazon Web Services with IT infrastructure alternatives.

AWS Cloud Computing Whitepapers
Features a comprehensive list of technical AWS whitepapers covering topics such as architecture, security, and economics. These whitepapers have been authored either by the Amazon team or by AWS customers or solution providers.

Videos and Webinars
Previously recorded webinars and videos about products, architecture, security, and more.

Discussion Forums
A community-based forum for developers to discuss technical questions related to Amazon Web Services.

AWS Support Center
The home page for AWS Technical Support, including access to our Developer Forums, Technical FAQs, Service Status page, and AWS Premium Support (subscription required).

AWS Premium Support Information
The primary web page for information about AWS Premium Support, a one-on-one, fast-response support channel to help you build and run applications on AWS Infrastructure Services.

Contact Us (form for questions related to your AWS account)
This form is only for account questions. For technical questions, use the Discussion Forums.

Conditions of Use
Detailed information about the copyright and trademark usage at Amazon.com and other topics.
Document History
This document history is associated with the 2011-12-12 release of Getting Started with AWS: Analyzing
Big Data.
Change: Updated Amazon EC2 and Amazon EMR Pricing
Description: Updated to reflect reductions in Amazon EC2 and Amazon EMR service fees.
Release Date: 6 March 2012

Change: Updated Amazon S3 Pricing
Description: Updated to reflect reduction in Amazon S3 storage fees.
Release Date: 9 February 2012

Change: New content
Description: Created new document.
Release Date: 12 December 2011