Getting Started with AWS: Analyzing Big Data

Overview .................................................................. 1
Getting Started ........................................................... 3
Step 1: Sign Up for the Service ........................................... 4
Step 2: Create a Key Pair ................................................. 4
Step 3: Create an Interactive Job Flow Using the Console .................. 5
Step 4: SSH into the Master Node .......................................... 8
Step 5: Start and Configure Hive .......................................... 11
Step 6: Create the Hive Table and Load Data into HDFS ..................... 11
Step 7: Query Hive ........................................................ 12
Step 8: Clean Up .......................................................... 13
Variations ................................................................ 15
Calculating Pricing ....................................................... 17
Related Resources ......................................................... 21
Document History .......................................................... 23

Overview

Big data refers to data sets that are too large to be hosted in traditional relational databases and are inefficient to analyze using nondistributed applications. As the amount of data that businesses generate and store continues to grow rapidly, tools and techniques to process this large-scale data become increasingly important.

This guide explains how Amazon Web Services helps you manage these large data sets. As an example, we'll look at a common source of large-scale data: web server logs. Web server logs can contain a wealth of information about visitors to your site: where their interests lie, how they use your website, and how they found it. However, web server logs grow rapidly, and their format is not immediately compatible with relational data stores.

A popular way to analyze big data sets is with clusters of commodity computers running in parallel. Each computer processes a portion of the data, and then the results are aggregated. One technique that uses this strategy is called map-reduce. In the map phase, the problem set is apportioned among a set of computers in atomic chunks. For example, if the question was, "How many people used the search keyword 'chocolate' to find this page?", the set of web server logs would be divided among the set of computers, each counting instances of that keyword in its partial data set. The reduce phase aggregates the results from all the computers to a final result. To continue our example, as each computer finished processing its set of data, it would report its results to a single node that tallies the total to produce the final answer.

Apache Hadoop is an open-source implementation of MapReduce that supports distributed processing of large data sets across clusters of computers.
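As a sketch of the idea (not of how Hadoop itself is implemented), the chocolate example above might look like this in Python; the log fragments are invented for illustration:

```python
# Hypothetical web-log fragments, one list per machine in the cluster.
partitions = [
    ["GET /?q=chocolate", "GET /?q=vanilla", "GET /?q=chocolate"],
    ["GET /?q=chocolate", "GET /?q=caramel"],
]

def map_phase(lines, keyword):
    # Map: each machine counts keyword hits in its own chunk of the logs.
    return sum(1 for line in lines if keyword in line)

def reduce_phase(partial_counts):
    # Reduce: a single node tallies the partial results into the final answer.
    return sum(partial_counts)

partials = [map_phase(p, "chocolate") for p in partitions]
print(reduce_phase(partials))  # 3
```

Hadoop distributes the map calls across the cluster and handles the shuffling of partial results to the reducer; the structure of the computation, however, is just this split-count-tally pattern.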
You can configure the size of your Hadoop cluster based on the number of physical machines you would like to use; however, purchasing and maintaining a set of physical servers can be an expensive proposition. Further, if your processing needs fluctuate, you could find yourself paying for the upkeep of idle machines.

How does AWS help?

Amazon Web Services provides several services to help you process large-scale data. You pay for only the resources that you use, so you don't need to maintain a cluster of physical servers and storage devices. Further, AWS makes it easy to provision, configure, and monitor a Hadoop cluster.

The following table shows the Amazon Web Services that can help you manage big data.

Challenge: Web server logs can be very large. They need to be stored at low cost while protecting against corruption or loss.
Service: Amazon Simple Storage Service (Amazon S3)
Benefit: Amazon S3 can store large amounts of data, and its capacity can grow to meet your needs. It is highly redundant, protecting against data loss.

Challenge: Maintaining a cluster of physical servers to process data is expensive and time-consuming.
Service: Amazon Elastic Compute Cloud (Amazon EC2)
Benefit: By running map-reduce applications on virtual Amazon EC2 servers, you pay for the servers only while the application is running, and you can expand the number of servers to match the processing needs of your application.

Challenge: Hadoop and other open-source big-data tools can be challenging to configure, monitor, and operate.
Service: Amazon EMR
Benefit: Amazon EMR handles Hadoop cluster configuration, monitoring, and management while integrating seamlessly with other AWS services to simplify large-scale data processing in the cloud.

Data Processing Architecture

In the following diagram, the data to be processed is stored in an Amazon S3 bucket. Amazon EMR streams the data from Amazon S3 and launches Amazon EC2 instances to process the data in parallel.
When Amazon EMR launches the EC2 instances, it initializes them with an Amazon Machine Image (AMI) that can be preloaded with open-source tools, such as Hadoop, Hive, and Pig, which are optimized to work with other AWS services. Let's see how we can use this architecture to analyze big data.

Getting Started

Topics
Step 1: Sign Up for the Service (p. 4)
Step 2: Create a Key Pair (p. 4)
Step 3: Create an Interactive Job Flow Using the Console (p. 5)
Step 4: SSH into the Master Node (p. 8)
Step 5: Start and Configure Hive (p. 11)
Step 6: Create the Hive Table and Load Data into HDFS (p. 11)
Step 7: Query Hive (p. 12)
Step 8: Clean Up (p. 13)

Suppose you host a popular e-commerce website. In order to understand your customers better, you want to analyze your Apache web logs to discover how people are finding your site. You'd especially like to determine which of your online ad campaigns are most successful in driving traffic to your online store. The web server logs, however, are too large to import into a MySQL database, and they are not in a relational format. You need another way to analyze them.

Amazon EMR integrates open-source applications such as Hadoop and Hive with Amazon Web Services to provide a scalable and efficient architecture to analyze large-scale data, such as your Apache web logs.

In the following tutorial, we'll import data from Amazon S3 and launch an Amazon EMR job flow from the AWS Management Console. Then we'll use Secure Shell (SSH) to connect to the master node of the job flow, where we'll run Hive to query the Apache logs using a simplified SQL syntax.

Working through this tutorial will cost you 29 cents for each hour or partial hour your job flow is running. The tutorial typically takes less than an hour to complete.
The input data we will use is hosted in an Amazon S3 bucket owned by the Amazon Elastic MapReduce team, so you pay only for the Amazon Elastic MapReduce processing on three m1.small EC2 instances (29 cents per hour) to launch and manage the job flow in the US-East region. Let's begin!

Step 1: Sign Up for the Service

If you don't already have an AWS account, you'll need to get one. Your AWS account gives you access to all services, but you will be charged only for the resources that you use. For this example walkthrough, the charges will be minimal.

To sign up for AWS
1. Go to http://aws.amazon.com and click Sign Up.
2. Follow the on-screen instructions.

AWS notifies you by email when your account is active and available for you to use.

You use your AWS account credentials to deploy and manage resources within AWS. If you give other people access to your resources, you will probably want to control who has access and what they can do. AWS Identity and Access Management (IAM) is a web service that controls access to your resources by other people. In IAM, you create users, which other people can use to obtain access and permissions that you define. For more information about IAM, go to Using IAM.

Step 2: Create a Key Pair

You can create a key pair to connect to the Amazon EC2 instances that Amazon EMR launches. For security reasons, EC2 instances use a public/private key pair, rather than a user name and password, to authenticate connection requests. The public key half of this pair is embedded in the instance, so you can use the private key to log in securely without a password. In this step we will use the AWS Management Console to create a key pair. Later, we'll use this key pair to connect to the master node of the Amazon EMR job flow in order to run Hive.

To generate a key pair
1. Open the Amazon EC2 console at https://console.aws.amazon.com/ec2/.
2. In the top navigation bar, in the region selector, click US East (N. Virginia).
3. In the left navigation pane, under Network and Security, click Key Pairs.
4. Click Create Key Pair.
5. Type mykeypair in the new Key Pair Name box and then click Create.
6. Download the private key file, which is named mykeypair.pem, and keep it in a safe place. You will need it to access any instances that you launch with this key pair.

Important: If you lose the key pair, you cannot connect to your Amazon EC2 instances.

For more information about key pairs, see Getting an SSH Key Pair in the Amazon Elastic Compute Cloud User Guide.

Step 3: Create an Interactive Job Flow Using the Console

To create a job flow using the console
1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
2. In the Region box, click US East, and then click Create New Job Flow.
3. On the Define Job Flow page, click Run your own application. In the Choose a Job Type box, click Hive Program, and then click Continue. You can leave Job Flow Name as My Job Flow, or you can rename it. Select which version of Hadoop to run on your job flow in Hadoop Version. You can choose to run the Amazon distribution of Hadoop or one of two MapR distributions. For more information about MapR distributions for Hadoop, see Using the MapR Distribution for Hadoop in the Amazon Elastic MapReduce Developer Guide.
4. On the Specify Parameters page, click Start an Interactive Hive Session, and then click Continue. Hive is an open-source tool that runs on top of Hadoop. With Hive, you can query job flows by using a simplified SQL syntax. We are selecting an interactive session because we'll be issuing commands from a terminal window.
Note: You can select the Execute a Hive Script check box to run Hive commands that you store in a text file in an Amazon S3 bucket. This option is useful for automating Hive queries you want to perform on a recurring basis. Because of the requirements of Hadoop, Amazon S3 bucket names used with Amazon EMR must contain only lowercase letters, numbers, periods (.), and hyphens (-).

5. On the Configure EC2 Instances page, you can set the number and type of instances used to process the big data set in parallel. To keep the cost of this tutorial low, we will accept the default instance types, an m1.small master node and two m1.small core nodes. Click Continue.

Note: When you are analyzing data in a real application, you may want to increase the size or number of these nodes to increase processing power and reduce computational time. In addition, you can specify some or all of your job flow to run as Spot Instances, a way of purchasing unused Amazon EC2 capacity at reduced cost. For more information about Spot Instances, go to Lowering Costs with Spot Instances in the Amazon Elastic MapReduce Developer Guide.

6. On the Advanced Options page, specify the key pair that you created earlier. Leave the rest of the settings on this page at the default values. For example, Amazon VPC Subnet Id should remain set to No preference.

Note: In a production environment, debugging can be a useful tool to analyze errors or inefficiencies in a job flow. For more information on how to use debugging in Amazon EMR, go to Troubleshooting in the Amazon Elastic MapReduce Developer Guide.

7. On the Bootstrap Actions page, click Proceed with no Bootstrap Actions and then click Continue. Bootstrap actions load custom software onto the AMIs that Amazon EMR launches.
For this tutorial, we will be using Hive, which is already installed on the AMI, so no bootstrap action is needed.

8. On the Review page, review the settings for your job flow. If everything looks correct, click Create Job Flow.

When the confirmation window closes, your new job flow will appear in the list of job flows in the Amazon EMR console with the status STARTING. It will take a few minutes for Amazon EMR to provision the Amazon EC2 instances for your job flow.

Step 4: SSH into the Master Node

When the job flow's status in the Amazon EMR console is WAITING, the master node is ready for you to connect to it. Before you can do that, however, you must acquire the DNS name of the master node and configure your connection tools and credentials.

To locate the DNS name of the master node
Locate the DNS name of the master node in the Amazon EMR console by selecting the job flow from the list of running job flows. This causes details about the job flow to appear in the lower pane. The DNS name you will use to connect to the instance is listed on the Description tab as Master Public DNS Name. Make a note of the DNS name; you'll need it in the next step.

Next we'll use a Secure Shell (SSH) application to open a terminal connection to the master node. An SSH application is installed by default on most Linux, Unix, and Mac OS installations. Windows users can use an application called PuTTY to connect to the master node. Platform-specific instructions for configuring a Windows application to open an SSH connection are described later in this topic.

You must first configure your credentials, or SSH will return an error message saying that your private key file is unprotected, and it will reject the key. You need to do this step only the first time you use the private key to connect.

To configure your credentials on Linux/Unix/Mac OS X
1. Open a terminal window.
This is found at Applications/Utilities/Terminal on Mac OS X and at Applications/Accessories/Terminal on many Linux distributions.
2. Set the permissions on the PEM file for your Amazon EC2 key pair so that only the key owner has permission to access the key. For example, if you saved the file as mykeypair.pem in the user's home directory, the command is:

chmod og-rwx ~/mykeypair.pem

To connect to the master node using Linux/Unix/Mac OS X
1. From the terminal window, enter the following command line, where ssh is the command, hadoop is the user name you are using to connect, the at symbol (@) joins the user name and the DNS name of the machine you are connecting to, and the -i parameter indicates the location of the private key file you saved in step 6 of Step 2: Create a Key Pair (p. 4). In this example, we're assuming it's been saved to your home directory.

ssh hadoop@master-public-dns-name \
    -i ~/mykeypair.pem

2. You will receive a warning that the authenticity of the host you are connecting to can't be verified. Type yes to continue connecting.

If you are using a Windows-based computer, you will need to install an SSH program in order to connect to the master node. In this tutorial, we will use PuTTY. If you have already installed PuTTY and configured your key pair, you can skip this procedure.

To install and configure PuTTY on Windows
1. Download PuTTYgen.exe and PuTTY.exe to your computer from http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html.
2. Launch PuTTYgen.
3. Click Load. Select the PEM file you created earlier. You may have to change the search parameters from file of type PuTTY Private Key Files (*.ppk) to All Files (*.*).
4. Click Open.
5. On the PuTTYgen Notice telling you the key was successfully imported, click OK.
6. To save the key in the PPK format, click Save private key.
7. When PuTTYgen prompts you to save the key without a pass phrase, click Yes.
8. Enter a name for your PuTTY private key, such as mykeypair.ppk.

To connect to the master node using Windows/PuTTY
1. Start PuTTY.
2. In the Category list, click Session. In the Host Name box, type hadoop@DNS. The input looks similar to hadoop@ec2-184-72-128-177.compute-1.amazonaws.com.
3. In the Category list, expand Connection, expand SSH, and then click Auth.
4. In the Options controlling SSH authentication pane, click Browse for Private key file for authentication, and then select the private key file that you generated earlier. If you are following this guide, the file name is mykeypair.ppk.
5. Click Open.
6. To connect to the master node, click Open.
7. In the PuTTY Security Alert, click Yes.

Note: For more information about how to install PuTTY and use it to connect to an EC2 instance, such as the master node, go to Appendix D: Connecting to a Linux/UNIX Instance from Windows using PuTTY in the Amazon Elastic Compute Cloud User Guide.

When you are successfully connected to the master node via SSH, you will see a welcome screen and prompt like the following.

-----------------------------------------------------------------------------
Welcome to Amazon EMR running Hadoop and Debian/Lenny.
Hadoop is installed in /home/hadoop. Log files are in /mnt/var/log/hadoop.
Check /mnt/var/log/hadoop/steps for diagnosing step failures.
The Hadoop UI can be accessed via the following commands:
  JobTracker  lynx http://localhost:9100/
  NameNode    lynx http://localhost:9101/
-----------------------------------------------------------------------------
hadoop@ip-10-245-190-34:~$

Step 5: Start and Configure Hive

Apache Hive is a data warehouse application you can use to query data contained in Amazon EMR job flows by using a SQL-like language. Because we launched the job flow as a Hive application, Amazon EMR will install Hive on the Amazon EC2 instances it launches to process the job flow.
We will use Hive interactively to query the web server log data. To begin, we will start Hive, and then we'll load some additional libraries that add functionality such as the ability to easily access Amazon S3. The additional libraries are contained in a Java archive file named hive_contrib.jar on the master node. When you load these libraries, Hive bundles them with the map-reduce job that it launches to process your queries. To learn more about Hive, go to http://hive.apache.org/.

To start and configure Hive on the master node
1. From the command line of the master node, type hive, and then press Enter.
2. At the hive> prompt, type the following command, and then press Enter.

hive> add jar /home/hadoop/hive/lib/hive_contrib.jar;

Step 6: Create the Hive Table and Load Data into HDFS

In order for Hive to interact with data, it must translate the data from its current format (in the case of Apache web logs, a text file) into a format that can be represented as a database table. Hive does this translation using a serializer/deserializer (SerDe). SerDes exist for all kinds of data formats. For information about how to write a custom SerDe, go to the Apache Hive Developer Guide.

The SerDe we'll use in this example uses regular expressions to parse the log file data. It comes from the Hive open-source community, and the code for this SerDe can be found online at https://github.com/apache/hive/blob/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java. Using this SerDe, we can define the log files as a table, which we'll query using SQL-like statements later in this tutorial.

To translate the Apache log file data into a Hive table
Copy the following multiline command. At the hive command prompt, paste the command, and then press Enter.
CREATE TABLE serde_regex(
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
LOCATION 's3://elasticmapreduce/samples/pig-apache/input/';

In the command, the LOCATION parameter specifies the location of the Apache log files in Amazon S3. For this tutorial we are using a set of sample log files. To analyze your own Apache web server log files, you would replace the Amazon S3 URL above with the location of your own log files in Amazon S3. Because of the requirements of Hadoop, Amazon S3 bucket names used with Amazon EMR must contain only lowercase letters, numbers, periods (.), and hyphens (-).

After you run the command above, you should receive a confirmation like this one:

Found class for org.apache.hadoop.hive.contrib.serde2.RegexSerDe
OK
Time taken: 12.56 seconds
hive>

Once Hive has loaded the data, the data will persist in HDFS storage as long as the Amazon EMR job flow is running, even if you shut down your Hive session and close the SSH terminal.

Step 7: Query Hive

You are now ready to start querying the Apache log file data. Below are some queries to run.

Count the number of rows in the Apache web server log files:

select count(1) from serde_regex;

Return all fields from one row of log file data:

select * from serde_regex limit 1;

Count the number of requests from the host with an IP address of 192.168.1.198:

select count(1) from serde_regex where host="192.168.1.198";

To return query results, Hive translates your query into a MapReduce application that is run on the Amazon EMR job flow.
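To see concretely what the SerDe's regular expression extracts from each log line, here is a sketch in Python (this pattern happens to be valid in both Java and Python regex syntax; the sample log line is invented for illustration):

```python
import re

# The same regular expression passed to the RegexSerDe above.
LOG_RE = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ "]*|"[^"]*") '
    r'(-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|"[^"]*") ([^ "]*|"[^"]*"))?'
)

# The nine columns of the serde_regex table, in order.
FIELDS = ("host", "identity", "user", "time", "request",
          "status", "size", "referer", "agent")

def parse_log_line(line):
    """Return a dict mapping column names to values, or None on no match."""
    m = LOG_RE.match(line)
    return dict(zip(FIELDS, m.groups())) if m else None

# A made-up line in Apache common log format.
sample = ('192.168.1.198 - - [21/Jul/2009:13:14:17 -0700] '
          '"GET /index.html HTTP/1.1" 200 873')
row = parse_log_line(sample)
print(row["host"], row["status"])  # 192.168.1.198 200
```

Each capture group becomes one column of the table, which is why the query `select count(1) from serde_regex where host="192.168.1.198"` can filter on the host field directly.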
You can invent your own queries for the web server log files. Hive SQL is a subset of SQL. For more information about the available query syntax, go to the Hive Language Manual.

Step 8: Clean Up

To prevent your account from accruing additional charges, you should terminate the job flow when you are done with this tutorial. Because you launched the job flow as interactive, it has to be terminated manually.

To disconnect from Hive and SSH
1. In your SSH session, press CTRL+C to exit Hive.
2. At the SSH command prompt, type exit, and then press Enter. Afterwards you can close the terminal or PuTTY window.

exit

To terminate a job flow
In the Amazon Elastic MapReduce console, click the job flow, and then click Terminate.

The next step is optional. It deletes the key pair you created in Step 2. You are not charged for key pairs. If you are planning to explore Amazon EMR further, you should retain the key pair.

To delete a key pair
1. From the Amazon EC2 console, select Key Pairs from the left-hand pane.
2. In the right pane, select the key pair you created in Step 2 and click Delete.

The next step is optional. It deletes two security groups created for you by Amazon EMR when you launched the job flow. You are not charged for security groups. If you are planning to explore Amazon EMR further, you should retain them.

To delete Amazon EMR security groups
1. From the Amazon EC2 console, in the Navigation pane, click Security Groups.
2. In the Security Groups pane, click the ElasticMapReduce-slave security group.
3. In the details pane for the ElasticMapReduce-slave security group, delete all rules that reference ElasticMapReduce. Click Apply Rule Changes.
4. In the right pane, select the ElasticMapReduce-master security group.
5. In the details pane for the ElasticMapReduce-master security group, delete all rules that reference ElasticMapReduce. Click Apply Rule Changes.
6. With ElasticMapReduce-master still selected in the Security Groups pane, click Delete. Click Yes to confirm.
7. In the Security Groups pane, click ElasticMapReduce-slave, and then click Delete. Click Yes to confirm.

Variations

Topics
Script Your Hive Queries (p. 15)
Use Pig Instead of Hive to Analyze Your Data (p. 15)
Create Custom Applications to Analyze Your Data (p. 15)

Now that you know how to launch and run an interactive Hive job using Amazon Elastic MapReduce to process large quantities of data, let's consider some alternatives.

Script Your Hive Queries

Interactively querying data is the most direct way to get results, and is useful when you are exploring the data and developing a set of queries that will best answer your business questions. Once you've created a set of queries that you want to run on a regular basis, you can automate the process by saving your Hive commands as a script and uploading them to Amazon S3. Then you can launch a Hive job flow using the Execute a Hive Script option that we bypassed in Step 4 of Create an Interactive Job Flow Using the Console. For more information on how to launch a job flow using a Hive script, go to How to Create a Job Flow Using Hive in the Amazon Elastic MapReduce Developer Guide.

Use Pig Instead of Hive to Analyze Your Data

Amazon Elastic MapReduce provides access to many open-source tools that you can use to analyze your data. Another of these is Pig, which uses a language called Pig Latin to abstract map-reduce jobs. For an example of how to process the same log files used in this tutorial using Pig, go to Parsing Logs with Apache Pig and Amazon Elastic MapReduce.
Create Custom Applications to Analyze Your Data

If the functionality you need is not available as an open-source project, or if your data has special analysis requirements, you can write a custom Hadoop map-reduce application and then use Amazon Elastic MapReduce to run it on Amazon Web Services. For more information, go to Run a Hadoop Application to Process Data in the Amazon Elastic MapReduce Developer Guide.

Alternatively, you can create a streaming job flow that reads data from standard input and then runs a script or executable (a mapper) against the input. Once all the data is processed by the mapper, a second script or executable (a reducer) processes the mapper results. The results from the reducer are sent to standard output. The mapper and the reducer can each be referenced as a file, or you can supply a Java class. You can implement the mapper and reducer in any of the supported languages, including Ruby, Perl, Python, PHP, or Bash. For more information, go to Launch a Streaming Cluster in the Amazon Elastic MapReduce Developer Guide.

Calculating Pricing

Topics
Amazon S3 Cost Breakdown (p. 18)
Amazon Elastic MapReduce Cost Breakdown (p. 19)

The AWS Simple Monthly Calculator estimates your monthly bill. It provides a per-service cost breakdown, as well as an aggregate monthly estimate. You can also use the calculator to see an estimation and breakdown of costs for common solutions. This topic walks you through an example of using the AWS Simple Monthly Calculator to estimate your monthly bill for processing 1 GB of weekly web server log files. For additional information, download the whitepaper How AWS Pricing Works.

To estimate costs using the AWS Simple Monthly Calculator
1. Go to http://calculator.s3.amazonaws.com/calc5.html.
2.
In the navigation pane, select a web service that your application requires. Enter your estimated monthly usage for that web service. Click Add To Bill to add the cost for that service to your total. Repeat this step for each web service your application uses.
3. To see the total estimated monthly charges, click the Estimate of your Monthly Bill tab.

Suppose you have a website that produces web logs that you upload each day to an S3 bucket. This transfer averages about 1 GB of web server data a week. You want to analyze the data once a week for an hour, and then delete the web logs. To make processing the log files efficient, we'll use three m1.large Amazon EC2 instances in the Amazon Elastic MapReduce job flow.

Note: AWS pricing you see in this documentation is current at the time of publication.

Amazon S3 Cost Breakdown

The following table shows the characteristics for Amazon S3 we have identified for this web application hosting architecture. For the latest pricing information, go to AWS Service Pricing Overview.

Storage: 1 GB/month. The website generates 1 GB of web server logs a week, and we store only one week's worth of data at a time in Amazon S3.

Requests: PUT requests: 30/month; GET requests: 30/month. Each day a file containing the daily web logs is uploaded to Amazon S3, resulting in 30 PUT requests per month. Each week, Amazon Elastic MapReduce loads seven daily web log files from Amazon S3 to HDFS, resulting in an average of 30 GET requests per month.

Data Transfer: If the average amount of data generated by the web server logs each week is 1 GB, the total amount of data transferred to Amazon S3 each month is 4 GB. Data transfer into AWS is free. In addition, there is no charge for transferring data between services within AWS in the same Region, so the transfer of data from Amazon S3 to HDFS on an Amazon EC2 instance is also free.
Data in: 4 GB/month; data out: 4 GB/month.

The following image shows the cost breakdown for Amazon S3 in the AWS Simple Monthly Calculator for region US-East. The total monthly cost is the sum of the cost for uploading the daily web logs, storing a week's worth of logs in Amazon S3, and then transferring them to Amazon Elastic MapReduce for processing.

Provisioned Storage: Storage Rate ($0.125/GB) x Storage Amount (1 GB*) = $0.13
PUT requests: Request Rate ($0.01 per 1,000 requests) x Number of requests (30**) = $0.01
GET requests: Request Rate ($0.01 per 1,000 requests) x Number of requests (30**) = $0.01
Total Amazon S3 Charges: $0.15

(*) The charge of $0.125 is rounded up to $0.13.
(**) Any number of requests fewer than 1,000 is charged the minimum, which is $0.01.

A way to reduce your Amazon S3 storage charges is to use Reduced Redundancy Storage (RRS), a lower-cost option for storing non-critical, reproducible data. For more information about RRS, go to http://aws.amazon.com/s3/faqs/#What_is_RRS.

Amazon Elastic MapReduce Cost Breakdown

The following table shows the characteristics for Amazon Elastic MapReduce we'll use for our example. For more information, go to Amazon Elastic MapReduce Pricing.

Machine Characteristics: m1.large (on-demand). To make processing 1 GB of data efficient, we'll use m1.large instances to boost computational power.
Instance Scale: 3. One master node and two slave nodes.
Uptime: 4 hours/month. Four hours each month to process the log files.

The following image shows the cost breakdown for Amazon Elastic MapReduce in the AWS Simple Monthly Calculator.
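As a cross-check, the two monthly estimates in this section can be reproduced with a short script. This is a sketch using the rates quoted here at publication time ($0.125 per GB-month of storage rounded up to the next cent, $0.01 per 1,000 requests with a $0.01 minimum per request type, and $0.38 per m1.large instance-hour for the job flow); verify the rates against the current AWS pricing pages before relying on the numbers.

```python
import math

def s3_monthly_cost(storage_gb, put_requests, get_requests):
    # Storage at $0.125/GB-month, rounded up to the next cent.
    storage = math.ceil(0.125 * storage_gb * 100) / 100
    # Requests at $0.01 per 1,000, with a $0.01 minimum per request type.
    puts = max(0.01, 0.01 * put_requests / 1000)
    gets = max(0.01, 0.01 * get_requests / 1000)
    return round(storage + puts + gets, 2)

def emr_monthly_cost(rate_per_hour, instances, hours):
    # Compute cost per hour x number of instances x uptime in hours.
    return round(rate_per_hour * instances * hours, 2)

print(s3_monthly_cost(1, 30, 30))    # 0.15
print(emr_monthly_cost(0.38, 3, 4))  # 4.56
```

Changing the arguments lets you estimate other scenarios, for example more instances or longer uptime, before entering them in the AWS Simple Monthly Calculator.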
The total monthly cost is the sum of the cost of running and managing the EC2 instances in the job flow.

Instance Management Cost: Compute cost per hour ($0.38) x Number of instances (3) x Uptime in hours (4) = $4.56
Total Amazon Elastic MapReduce Charges: $4.56

Related Resources

The following table lists related resources that you'll find useful as you work with AWS services.

AWS Products and Services — A comprehensive list of products and services AWS offers.

Documentation — Official documentation for each AWS product, including service introductions, service features, API references, and other useful information.

AWS Architecture Center — Provides the necessary guidance and best practices to build highly scalable and reliable applications in the AWS cloud. These resources help you understand the AWS platform, its services and features. They also provide architectural guidance for design and implementation of systems that run on the AWS infrastructure.

AWS Economics Center — Provides access to information, tools, and resources to compare the costs of Amazon Web Services with IT infrastructure alternatives.

AWS Cloud Computing Whitepapers — Features a comprehensive list of technical AWS whitepapers covering topics such as architecture, security, and economics. These whitepapers have been authored either by the Amazon team or by AWS customers or solution providers.

Videos and Webinars — Previously recorded webinars and videos about products, architecture, security, and more.

Discussion Forums — A community-based forum for developers to discuss technical questions related to Amazon Web Services.

AWS Support Center — The home page for AWS Technical Support, including access to our Developer Forums, Technical FAQs, Service Status page, and AWS Premium Support (subscription required).

AWS Premium Support Information — The primary web page for information about AWS Premium Support, a one-on-one, fast-response support channel to help you build and run applications on AWS Infrastructure Services.

Contact Us — Form for questions related to your AWS account. This form is only for account questions; for technical questions, use the Discussion Forums.

Conditions of Use — Detailed information about the copyright and trademark usage at Amazon.com and other topics.

Document History

This document history is associated with the 2011-12-12 release of Getting Started with AWS: Analyzing Big Data.

6 March 2012 — Updated Amazon EC2 and Amazon EMR Pricing: Updated to reflect reductions in Amazon EC2 and Amazon EMR service fees.
9 February 2012 — Updated Amazon S3 Pricing: Updated to reflect reduction in Amazon S3 storage fees.
12 December 2011 — New content: Created new document.