Using Amazon Elastic MapReduce, you can instantly provision as much or as little capacity as you like to perform data-intensive tasks for applications such as web indexing, data mining, log file analysis, machine learning, financial analysis, scientific simulation, and bioinformatics research. Amazon Elastic MapReduce lets you focus on crunching or analyzing your data without having to worry about time-consuming set-up, management or tuning of Hadoop clusters or the compute capacity upon which they sit.
The company’s press release quotes VP for Product Management & Developer Relations, Adam Selipsky, who notes:
Some researchers and developers already run Hadoop on Amazon EC2, and many of them have asked for even simpler tools for large-scale data analysis. Amazon Elastic MapReduce makes crunching in the cloud much easier as it dramatically reduces the time, effort, complexity and cost of performing data-intensive tasks.
MapReduce was brought to prominence by Google, and is one of the principal techniques at that company’s disposal for breaking massive data sets into manageable chunks suitable for cost-effective processing on the commodity hardware for which it is known. The abstract for a Google research paper on the topic outlines the value proposition reasonably succinctly:
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.
Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google’s clusters every day.
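To make the map and reduce roles described in the abstract a little more concrete, here is a minimal single-machine sketch of the classic word-count example. It is an illustration only, not Google's or Hadoop's implementation: the names map_fn, reduce_fn and run_mapreduce are hypothetical, the two sample documents are made up, and the grouping step stands in for the partitioning and inter-machine communication that a real run-time system would handle.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: take a (document name, text) pair and emit (word, 1) pairs."""
    for word in value.lower().split():
        yield word, 1


def reduce_fn(key, values):
    """Reduce: merge all intermediate counts that share the same word."""
    return key, sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical in-process driver: map, group by key, then reduce."""
    # Group intermediate values by their intermediate key. On a cluster
    # this is the step the run-time system distributes across machines.
    groups = defaultdict(list)
    for key, value in records:
        for intermediate_key, intermediate_value in map_fn(key, value):
            groups[intermediate_key].append(intermediate_value)
    # Apply the reduce function once per intermediate key.
    return dict(reduce_fn(k, v) for k, v in sorted(groups.items()))


# Made-up input: (document name, document text) pairs.
documents = [
    ("doc1", "the quick brown fox"),
    ("doc2", "the lazy dog the end"),
]

word_counts = run_mapreduce(documents, map_fn, reduce_fn)
print(word_counts)
```

The point of the model is that only map_fn and reduce_fn are specific to the task; everything inside run_mapreduce is what a system like Hadoop takes off the programmer's hands.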
Hadoop is a Yahoo!-nurtured Open Source equivalent to Google’s MapReduce, managed as a project of the Apache Software Foundation, and reputedly scalable to handle many petabytes of data distributed across thousands of CPUs.
As Adam noted in the press release, customers (such as the New York Times and Netflix) are already using Hadoop on Amazon’s Web Services. Today’s announcement makes it easier to cost-effectively and transparently commission (and decommission) the required compute resources. This is the ‘elasticity’ referred to in the new service’s name, and is an increasingly important aspect of the current generation of Cloud-based compute services; much of the economic value proposition lies in only using (and therefore paying for) the resources you actually need to complete a task. If demand increases, the number of (virtual) machines available should rapidly increase to cope, and they should shut back down just as rapidly when the demand passes:
Amazon Elastic MapReduce enables you to use as many or as few compute instances running Hadoop as you want. You can commission one, hundreds, or even thousands of instances to process gigabytes, terabytes, or even petabytes of data. And, you can run as many job flows concurrently as you wish. You can instantly spin up large Hadoop job flows which will start processing within minutes, not hours or days. When your job flow completes, unless you specify otherwise, the service automatically tears down your instances.
Elastic MapReduce is currently available only in data centres in Amazon’s US region (non-US customers can still use the service; they just have to be able and willing to transfer their data beyond their borders). It is priced as a surcharge on top of existing EC2 instance rates: running Elastic MapReduce on a $US0.10 per hour ‘small’ instance costs a further $US0.015 per hour (yes, one and a half cents per hour), and running it on a $US0.80 per hour ‘extra large’ instance costs a further $US0.12 per hour.
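As a rough worked example of what those rates add up to, the sketch below combines the EC2 rate and the Elastic MapReduce surcharge quoted above. The job size (20 instances for 3 hours) is made up for illustration, the function name job_cost is hypothetical, and the calculation assumes simple billing per instance-hour while ignoring storage and data transfer charges.

```python
from decimal import Decimal

# Hourly rates in US dollars, as quoted in this post:
# (EC2 instance rate, additional Elastic MapReduce surcharge).
RATES = {
    "small": (Decimal("0.10"), Decimal("0.015")),
    "extra_large": (Decimal("0.80"), Decimal("0.12")),
}


def job_cost(instance_type, instances, hours):
    """Estimate compute cost, assuming billing per instance-hour."""
    ec2_rate, emr_surcharge = RATES[instance_type]
    return (ec2_rate + emr_surcharge) * instances * hours


# Hypothetical job: 20 instances running for 3 hours.
small_cost = job_cost("small", 20, 3)
xl_cost = job_cost("extra_large", 20, 3)

print(f"20 small instances for 3 hours: $US{small_cost:.2f}")
print(f"20 extra large instances for 3 hours: $US{xl_cost:.2f}")
```

On these figures the surcharge adds 15 per cent to the underlying EC2 rate for both instance types.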
Elastic MapReduce is another nice example of slow, incremental improvement to Amazon’s core Web Services offer.
It remains to be seen, as developers get down to using it for real, whether it’s pitched as a low-end disruptor that simply rounds out another piece of the emerging AWS whole, or whether it’s a viable competitor in its own right to the recently announced Cloudera, which sees taking Hadoop to mainstream enterprise customers as its raison d’être:
Cloudera can help you install, configure and run Hadoop for large-scale data processing and analysis. Get Cloudera’s Distribution for Hadoop and start working with Big Data today.
Update: Amazon’s Jeff Barr provides a lot more detail in a post to the AWS Blog.
Content cross-posted from a blog post on The Cloud of Data.