What the reader will learn:
• Essential web technology for the cloud
• How virtualisation supports the cloud
• About distributed programming and the MapReduce algorithm
• How to create a desktop virtual machine
• How to create a simple MapReduce job
4.1 Introduction
If we are asked to sum up cloud computing in four key words, we might arguably choose 'web', 'elasticity', 'utility' and 'scalability'. In this chapter, we are going to look at the technology underlying the cloud. Cloud applications are accessed via the web, and web technology is integral to the cloud, so we will begin with a brief review of the current state of web technology. We will then move on to virtualisation, a key cloud technology which has many benefits including improved use of resources. Virtualisation can be used to provide the elasticity required to offer cloud computing as a utility. We then turn our attention to the MapReduce programming model, originally developed by the founders of Google and now used to provide scalability to many of the distributed applications which are typical of the cloud and simply too big to be handled in a user-friendly time frame by traditional systems.
At the end of this chapter, you will be guided
through the process of creating your own virtual machine (VM) using VMWare and
the open source Ubuntu (Linux) operating system. You will then run a simple
MapReduce job using your virtual machine. We will use this VM again in later
chapters.
4.2 Web Technology
A feature of web and cloud applications is that a number of separate technologies are required to work together. A minimum web application is likely to use HTTP, XHTML, CSS, JavaScript, XML, server programming (e.g. PHP) and some mechanism to persist data (to be examined in detail in a later chapter). We will introduce these technologies briefly here; if you are already familiar with web technology you can safely skip this section. On the other hand, if you are completely new to web technology, you may prefer to begin by working through the basic tutorials for HTML, CSS, JavaScript and XML at the excellent W3Schools site: http://www.w3schools.com/.
4.2.1 HTTP
Cloud systems are built on the World Wide Web, and the web is built on the underlying network communication system known as HTTP or Hypertext Transfer Protocol. HTTP is the key to building cloud systems, and at a low level, each interaction in a cloud application uses HTTP (Nixon 2009).
Berners-Lee (who originally proposed the World Wide Web) and his team are credited with inventing the original HTTP. The first version had just one method, namely, GET, which was used when a user made a request for a web page from a server. A number of other methods have since been added including:
• HEAD, which asks the server for information about a resource
• PUT, which stores data in a resource
• POST, which sends data to a program to be processed on the server
• DELETE, which deletes the specified resource
HTTP uses a simple request/response cycle which allows two parties, often referred to as the client and the server, to communicate. The client drives the communication by sending requests to the server. The server processes the request and sends responses, which may contain content such as HTML. A response may contain status information and the content requested in the message body.
HTTP requests are centred on a particular
resource. We can think of a resource as anything on the network that has a
name. Every HTTP request is either asking to retrieve data from a resource or
sending data to a resource.
Each HTTP request that you generate using a browser or application has three parts:
1. A request line consisting of the HTTP method, the URL of the resource and the protocol version (usually HTTP 1.1)
2. A series of header lines containing metadata about the resource
3. The body consisting of a stream of data

A typical HTTP request might look like:
GET /cloud/book/chapter4 HTTP/1.1
User-Agent: Mozilla/5.001 (windows; U; NT4.0;
en-US; rv:1.0)
Host: cloud-principles-practices.appspot.com
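As a brief illustrative sketch, the same request could be issued programmatically using Java's standard HttpURLConnection class (the URL below simply reuses the example host above as a placeholder):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class SimpleGet {
    public static void main(String[] args) throws Exception {
        // Placeholder URL reusing the example request above
        URL url = new URL("http://cloud-principles-practices.appspot.com/cloud/book/chapter4");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET"); // the HTTP method from the request line
        System.out.println("Status: " + conn.getResponseCode()); // e.g. 200
        // Read and print the message body of the response
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}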
Table 4.1 Common HTTP status codes

Status code    Description
200            OK
301            Moved permanently
401            Unauthorised
403            Forbidden
404            Not found
500            Internal server error
When a server responds, the first line includes a 3-digit status code and a message indicating whether or not the request succeeded, for example:

HTTP/1.1 200 OK

Other codes, which you have probably encountered, are summarised in Table 4.1.
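Putting the pieces together, a complete response to the earlier request might look like the following (the header values shown are purely illustrative):

HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Content-Length: 96

<html>
  <head><title>Chapter 4</title></head>
  <body><h1>Cloud Technology</h1></body>
</html>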
An important aspect of HTTP is that it is stateless: you are unable to store mutable state; that is, we cannot assign values to variables or store anything to be shared between different calls to the same function. Everything that is needed by a server process must be sent explicitly as a parameter.
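For example, any state the server needs, such as a search term or a page number, must travel in the request itself, often as a query string (a hypothetical example):

GET /search?q=cloud&page=2 HTTP/1.1
Host: www.example.com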
4.2.2 HTML (HyperText Markup Language) and CSS (Cascading Style Sheets)
Web pages are written using HTML, also originally proposed by Berners-Lee. HTML is used to describe the structure of the page and includes structural elements such as paragraphs (e.g. <p></p>), headings (e.g. <h1></h1>) and lists (e.g. <li></li>).
CSS is a technology which allows us to create styles which define how HTML elements should look. The combination of HTML and CSS allows web page developers to separate structure from appearance. The 'separation of concerns' is an important principle in many areas of software engineering and is central to the development of cloud systems.
All pages on a site are usually styled in the same way, and styles are generally not specific to a single web page. CSS allows us to define our style information once and then apply it to many pages across a site. So, for example, should we wish to present our pages in a different way to distinct user groups, the required changes may be limited to a small number of CSS files rather than updating all the pages on our site. The separation of structure from appearance can be translated into the separation of the logic of our cloud application from the rendering of the pages to be presented to our users.
Separating style information lets you write it once and use it on all pages instead of coding it again and again. This also relates to another important principle of software engineering, namely, 'Don't Repeat Yourself' or DRY. When the DRY principle is applied successfully, modifications are localised and should not occur in unrelated elements. DRY aims to reduce repetition of all kinds of information but is particularly important to multi-tier web or cloud architectures where the improvements in maintainability become critical to success.
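As a minimal sketch of this separation (the file names and the class name highlight are our own inventions), the HTML carries only structure while the shared stylesheet carries appearance:

<!-- page.html: structure only -->
<link rel="stylesheet" href="site.css">
<h1>Cloud Computing</h1>
<p class="highlight">All styling lives in site.css, shared by every page.</p>

/* site.css: appearance only */
h1 { font-family: sans-serif; }
p.highlight { color: navy; background-color: #eef; }

Changing site.css restyles every page that links to it, which is the DRY principle at work.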
4.2.3 XML (eXtensible Markup Language)
XML is a technology introduced by the W3C to allow data to be encoded in a textual format which is readable to both humans and machines. XML simplifies and standardises the exchange and storage of data and has become a standard way of representing data. In XML, the tags relate to the meaning of the enclosed text, making XML self-explanatory. XML allows for a hierarchical approach, is portable and interoperable and has been used to create a wide variety of markup languages including XHTML, which is simply HTML in XML format.
The Document Object Model (DOM) allows for machine navigation of an XML document as if it were a tree of node objects representing the document contents, and there are many tools and libraries freely available in many programming languages which will parse XML. Unsurprisingly, XML is widely used in cloud systems. XML does, however, have some disadvantages and in particular is criticised for being 'tag heavy'; indeed, a significant fraction of an XML file can be taken up by the tags.
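As a short illustration (the element names are invented for this example), note how the tags below convey the meaning of the data they enclose and how the elements nest hierarchically:

<chapter number="4">
  <title>Cloud Technology</title>
  <topics>
    <topic>Web technology</topic>
    <topic>Virtualisation</topic>
    <topic>MapReduce</topic>
  </topics>
</chapter>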
4.2.4 JSON (JavaScript Object Notation)
JSON is an increasingly popular lightweight alternative to XML and is directly based on data structures found in the JavaScript language (although JSON is language independent). JSON is built on two human-readable structures, namely, a collection of name/value pairs or an ordered list of values.
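For comparison, the same chapter data from the XML example above might be written in JSON as follows, using both structures: an object of name/value pairs and an ordered list (array):

{
  "chapter": 4,
  "title": "Cloud Technology",
  "topics": ["Web technology", "Virtualisation", "MapReduce"]
}

Note how much of the XML tag overhead disappears.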
4.2.5 JavaScript and AJAX (Asynchronous JavaScript and XML)
JavaScript is a
programming language which normally runs in the client’s browser and allows the
web page to be updated by traversing and updating the DOM tree. This means that
the programmer is able to dynamically update an XHTML document by navigating the
tree and then applying insert, delete or update operations to elements,
attributes (including style information) or contents. JavaScript can also add
important functionality such as the ability to validate user input before an
HTTP request is sent to the server.
XMLHttpRequest is a JavaScript object which manages HTTP requests and has the useful capability of working outside the normal browser request/response cycle. The technology surrounding the XMLHttpRequest is known as AJAX and allows for partial page updates which in turn improve the user experience and allow web applications to work in a comparable manner to desktop applications. Such web applications are sometimes referred to as Rich Internet Applications (RIA). However, the asynchronous nature of the HTTP calls adds complexity to web application programming, and as pages may be updated without changing the URL in the user's browser, special consideration needs to be given to handling the user's back button and bookmarking capability. Furthermore, older browsers do not support the technology so developers must check the browser type and version when using AJAX.
Fig. 4.1 MVC
4.2.6 Model-View-Controller (MVC)
MVC is a standard architecture for separating components in an interactive application. The architecture was first described by Trygve Reenskaug in 1979, well before the World Wide Web, but in recent years has proved increasingly popular in the development of web applications. MVC is interpreted and applied in slightly different ways, but here we use it to refer to the separation of application data (contained in the model) from graphical representation (the view) and input-processing logic (the controller). A simple summary of MVC is given in Figure 4.1.
The controller implements logic for processing the input, the model contains the application data and the view presents the data from the model. An MVC application consists of a set of model/view/controller triples responsible for different elements of the user interface. MVC is a natural way to write cloud applications, and so it is worth briefly describing the three components in more detail.
4.2.6.1 Model
We mentioned earlier that HTTP does not allow us to assign values or store information between requests. In a cloud application based on MVC, the model is responsible for maintaining the state of the application, whether this is just for a small number of request/response cycles or if we need to persist data in a permanent way, for example, using a server database. The model is, however, more than just data as the model is responsible for the enforcement of the business rules that apply to the data. The model acts as a 'gatekeeper' and a data store. Note that the model is totally decoupled from the interface and works instead with the underlying data, often stored in a database. As our application is based on HTTP, the server must wait for requests from the client before it can send its response.
4.2.6.2 View
This is the
visible part of the user interface and will normally be based on data in the
model written out in XHTML. The view will render the XHTML when instructed to
do so by the controller, but the view’s work is done once the data is
displayed. The view itself never handles incoming data. There may be many
views, for example, for different devices with different screen sizes and
capability, which access the same model data.
4.2.6.3 Controller
The controller (sometimes called the 'input controller') is the bridge between the client and the server and in a sense orchestrates the whole application. Interface actions are accepted and may be transformed into operations which can be performed on the model. The controller takes content data produced by the model and translates that data into a form the view can render. In short, a controller accepts input from the user and instructs the model and a view to perform actions based on that input.
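To make the three roles concrete, here is a minimal sketch in Java (all class and method names are our own invention; a real web MVC framework would route HTTP requests rather than plain method calls):

import java.util.ArrayList;
import java.util.List;

// Model: holds application data and enforces the business rules.
class TaskModel {
    private final List<String> tasks = new ArrayList<>();
    void addTask(String task) {
        // Gatekeeper role: reject input that breaks the rules
        if (task == null || task.trim().isEmpty()) {
            throw new IllegalArgumentException("A task must have a description");
        }
        tasks.add(task.trim());
    }
    List<String> getTasks() { return new ArrayList<>(tasks); }
}

// View: renders model data (here as XHTML-style output); never handles input.
class TaskView {
    void render(List<String> tasks) {
        System.out.println("<ul>");
        for (String t : tasks) System.out.println("  <li>" + t + "</li>");
        System.out.println("</ul>");
    }
}

// Controller: accepts input, updates the model, then instructs the view.
class TaskController {
    private final TaskModel model;
    private final TaskView view;
    TaskController(TaskModel model, TaskView view) { this.model = model; this.view = view; }
    void handleAdd(String input) {
        model.addTask(input);           // interface action becomes a model operation
        view.render(model.getTasks()); // view renders the new state
    }
}

public class MvcDemo {
    public static void main(String[] args) {
        TaskController c = new TaskController(new TaskModel(), new TaskView());
        c.handleAdd("Provision a virtual machine");
        c.handleAdd("Run a MapReduce job");
    }
}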
4.3 Autonomic Computing
Autonomic computing seeks to improve computer systems by decreasing human involvement in their operation with the ultimate goal of producing systems which manage themselves and adapt to unpredictable changes. The advantages become significant in a large data centre of hundreds or thousands of nodes. In this environment, machine failures are not rare events but are regular occurrences which must be managed carefully.
Autonomous or semi-autonomous machines rely on three elements: monitoring probes and gauges, an adaptation engine and effectors to apply any required changes to the system. Decisions made by an autonomic system are usually based on a set of predefined policies. IBM's Autonomic Computing Initiative defines four properties of autonomic systems, namely, self-configuration, self-optimisation, self-healing and self-protection.
4.4 Virtualisation
Virtualisation refers to the process where abstract or virtual resources are used to simulate physical resources. Virtualisation can be of great benefit to cloud systems as it can improve resource pooling and enable rapid and elastic resource provisioning. These benefits make for agile, flexible networks leading to significant cost reductions. In typical cloud computing applications, servers, storage and network devices may all be virtualised (Golden 2008).
Virtualisation has a long history in the software industry, and the term has been used in a variety of ways. The earliest type of virtual machine dates back to IBM in the 1960s, where the aim was to maximise use of expensive mainframe systems through multitasking. We will briefly introduce the main types of virtualisation in the following sections.
4.4.1 Application Virtualisation
Application
virtualisation, sometimes called a ‘process virtual machine’, provides a
platform-independent programming environment which hides the details of the
underlying hardware and operating system. The Java Virtual Machine (JVM) is a
commonly cited and popular example of an application virtual machine. The JVM
provides an environment in which Java bytecode can be executed. The JVM can run
on many hardware platforms, and the use of the same bytecode allows for the
‘write once run anywhere’ description which is highly appealing to application
developers. Java bytecode is an intermediate language which is typically
compiled from Java, but a number of other languages such as Python (Jython) or
Ruby (JRuby) can also be used.
4.4.2 Virtual Machine
In the cloud context, 'virtual machine' is usually taken to mean a software implementation of a machine with a complete independent operating system, which executes programs like a physical machine. This type of virtualisation began with work in the Linux community but has led to a variety of commercial and open source libraries.
The virtual machine may have a number of virtual components including:
• 'Virtual processors', which share time with other virtual machines on the physical processor
• 'Virtual memory', normally implemented as a slice of physical RAM on the host machine (not to be confused with the common use of 'virtual memory' to refer to the combination of various types of storage media)
• A 'virtual hard disk', which is typically one or more large files on the physical disk
• A 'virtual network', based on a network interface controller (NIC)
In the next chapter, we will see how vendors
such as Amazon use the power of virtualisation to offer users access to
cloud-based resources such as compute power (e.g. Amazon EC2) or storage (e.g.
Amazon S3) which they can rent from Amazon at an hourly rate.
4.4.3 Desktop Virtualisation
In desktop virtualisation, the user is provided with access to a virtual machine over which they may have complete control. This can be very useful, for example, in an organisation where users have very limited rights in terms of program installation and configuration on local machines; such users can be given complete freedom to install their own programs on a virtual machine. Furthermore, the virtual machine can run an entirely different operating system to that of the host, so users can choose and configure their preferred environment. Virtual machines can be copied, moved or destroyed as required, and a 'snapshot' can be taken of a machine at any time so that the user can save and return to a particular environment. This capability also provides a useful and safe way of testing new upgrades and environments. Examples of the technology include VMWare Workstation, Microsoft VirtualPC and Oracle VirtualBox. At the end of this chapter, you will create your own desktop virtual machine using VMWare.
4.4.4 Server Virtualisation
Server virtualisation applies the same principle as desktop virtualisation but to the server environment. Server virtualisation allows entire operating systems, applications and accessories to be packaged into a set of files that can be moved or copied and then run on common hardware in any compatible virtualised environment. Virtual machines allow us to decouple the computing infrastructure from the physical infrastructure, potentially leading to a number of important benefits in the running of a data centre.
4.4.4.1 Efficiency
It has been estimated that most servers are highly underutilised and run at about 15% of their total capacity. Virtualisation allows for a more efficient use of machine resources which can lead to a reduction in power consumption, cooling and space requirements (see also Chap. 3). This improved use of resources can make a big difference to the overall energy consumption of a large data centre, and it has been suggested that virtualisation can be an important component in achieving reduced carbon emissions from the IT industry.
4.4.4.2 Isolation
Virtualisation means that secure, encapsulated clones of your entire work environment can be created for testing and backup purposes, leading to improved maintenance, flexibility and high availability. The failure of one virtual machine should not affect any of the other machines running on the same host. This has important and beneficial implications for vendors offering cloud-based resources which are available to multiple tenants. Thus, for example, if a user of an Amazon EC2 machine accidentally corrupts their operating system or runs a program with an infinite loop, it should not have a detrimental effect on other users running virtual machines in the same data centre.
4.4.4.3 Mobility
Virtual machine mobility enables applications to migrate from one server to another even whilst they are still running. For example, with systems such as vSphere, virtual machines can be configured to automatically restart on a different physical server in a cluster if a host fails. This capability can be a critical component in the provision of autonomic computing as well as leading to the following advantageous outcomes:
• A reduction in the number of servers required
• Data centre maintenance without downtime
• Improved availability, disaster avoidance and recovery
• Simplification of data centre migration or consolidation
• Data centre expansion
• Workload balancing across multiple sites

There are, of course, some disadvantages:
• Virtual servers require a hypervisor to keep track of virtual machine identification (see below), local policies and security, which can be a complex task.
• The integration of the hypervisor with existing network management software can be problematic and may require changes to the network hardware.
4.4.5 Storage Virtualisation
Storage virtualisation provides different servers with access to the same storage space. Data centre storage comprises storage area networks (SAN) and network-attached storage (NAS). In a cloud computing environment, a network can be enabled to address large virtual pools of storage resources. Storage virtualisation allows multiple disk arrays, often made by different vendors and scattered over the network, to be used as if they were a single storage device which can be centrally managed.
4.4.6 Implementing Virtualisation
As mentioned above, multiple virtual machines can run simultaneously on the same physical machine, and each virtual machine can have a different operating system. A virtual machine monitor (VMM) is used to control and manage the virtual machines on a single node. A VMM is often referred to as a hypervisor (see below). At a higher level, virtual machine infrastructure managers (VIMs) are used to manage, deploy and monitor virtual machines on a distributed pool of resources such as a data centre. Cloud infrastructure managers (CIM) are web-based management solutions specifically for cloud systems.
Virtualisation management is the technology that abstracts the coupling between the hardware and the operating system, allowing computing environments to be dynamically created, expanded, shrunk, archived or removed. It is therefore extremely well suited to the dynamic cloud infrastructure. Load balancing is of course an important part of this management and soon becomes critical in the prevention of system bottlenecks due to unbalanced loads.
4.4.7 Hypervisor
Generally, virtualisation is achieved by the use of a hypervisor. The hypervisor is software that allows multiple virtual images to share a single physical machine and to logically assign and separate physical resources. The hypervisor allows the computer hardware to run multiple guest operating systems concurrently. As mentioned above, each of the guest operating systems is isolated and protected from any others and is unaffected by problems or instability occurring on other virtual machines running on the same physical machine. The hypervisor presents the guest operating systems with an abstraction of the underlying computer system and controls their execution.

Fig. 4.2 Hypervisor (bare metal)
Many powerful hypervisors including KVM (Kernel-based Virtual Machine), Xen and QEMU are open source. VMWare is currently the market leader in the field of virtualisation, and many of its products are based on open source. Amazon uses a modified version of Xen.
Figure 4.2 shows a high-level view of a hypervisor where the machine resources of the host are shared between a number of guests, each of which may be running applications and each of which has direct access to the underlying physical resources.
4.4.8 Types of Virtualisation
Virtualisation can be classified into two categories depending on whether or not the guest operating system kernel needs to be modified. In full virtualisation, a guest operating system runs unmodified on a hypervisor. In the case of para-virtualisation, there is no need to emulate the entire hardware environment. Para-virtualisation requires modifications to the guest operating system kernel so that it 'becomes aware' of the hypervisor and can communicate with it. Because less emulation is required, guest operating systems using para-virtualisation generally run faster than fully virtualised guests.
Figure 4.3 shows the situation where the hypervisor is running on a host operating system. This is the type of hypervisor we will be using in the exercise at the end of this chapter.
Fig. 4.3 Hosted hypervisor
4.5 MapReduce
MapReduce was originally introduced by Google as a distributed programming model using large server clusters to process massive (multi-terabyte) data sets. The model can be applied to many large-scale computing problems and offers a number of attractive features such as automatic parallelisation, load balancing, network and disk transfer optimisation and robust handling of machine failure. MapReduce works by breaking a large problem into smaller parts, solving each part in parallel and then combining results to produce the final answer (White 2009).
MapReduce runs on top of a specialised file system such as the Google File System (GFS) or the Hadoop File System (HDFS). Data is loaded and partitioned into chunks (commonly 64 MB), and each chunk is replicated. A key feature of MapReduce is that data processing is collocated with data storage; compared with the traditional approach of moving the data to the computation, this distributed programming paradigm offers a number of advantages including:
1. Scalability
2. Reliability
3. Fault tolerance
4. Simplicity
5. Efficiency
All these advantages are obviously very relevant to the cloud, where processing large amounts of data using distributed resources is a central task. Unsurprisingly, MapReduce is a programming model used widely in cloud computing environments for processing large data sets in a highly parallel way.
Fig. 4.4 MapReduce overview
MapReduce was inspired by the map and reduce functions which are commonly found in functional programming languages like LISP. In LISP, a map takes an input function and a sequence of values and then applies the function to each value in the sequence. Reduce combines all the elements in the sequence using an operator such as * or +. In MapReduce, the functions are not so rigidly defined, but programmers still specify the computation in terms of the two functions map and reduce, which can be carried out in parallel on subsets of the total data under analysis. The map function is used to generate a list of intermediate key/value pairs, and the reduce function merges all the intermediate values associated with the same intermediate key.
When all the tasks have been completed, the result is returned to the user. In MapReduce, input data is partitioned across multiple worker machines executing in parallel; intermediate values are output from the map worker machines and fed to a set of 'reduce' machines (there may be some intermediate steps such as sorting). It is, perhaps, useful to think of MapReduce as representing a data flow, as shown in Figure 4.4, rather than a procedure. Jobs are submitted by a user to a master node that selects idle workers and assigns each one a MapReduce task to be performed in parallel. The process of moving map outputs to the reducers is known as 'shuffling'.
4.5.1 MapReduce Example
The canonical MapReduce example takes a document or a set of documents and outputs a listing of unique words and the number of occurrences throughout the text data. The map function takes as its key/value pair the document name and document contents. It then reads through the text of the document and outputs an intermediate key/value listing for each word encountered together with a value of 1. The reduce phase then counts up each of these individual 1s and outputs the total value for each word. The pseudocode for the functions is shown below; the map function takes the name of the document as the key and the contents as the value.
map(document_name, document_contents) {
  for each word w in document_contents
    emit_intermediate(w, 1)
}

reduce(a_word, intermediate_vals) {
  result = 0
  for each value v in intermediate_vals
    result += v
  emit result
}

Fig. 4.5 MapReduce word count example
Figure 4.5 shows a simple example with the three input files shown on the left and the output word counts shown on the right.
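For comparison with the pseudocode, the following is a sketch of the same word count written against Hadoop's org.apache.hadoop.mapreduce API (Hadoop is introduced in Sect. 4.5.5); job setup details vary between Hadoop versions, so treat this as illustrative rather than definitive:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit an intermediate (word, 1) pair for every word in the input
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the 1s emitted for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}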
The same kind of process can be applied to many other problems and is particularly useful where we need to process a huge amount of raw data, for example, from documents which have been returned by a web crawl. To create an inverted index (see Chap. 7) after a web crawl, we can create a map function to read each document and output a sequence of <word, documentID> pairs. The reduce function accepts all pairs for a given word and outputs pairs of <word, documentIDList> such that for any word, we can quickly identify all the documents in which the word occurs. The amount of data may be simply too big for traditional systems and needs to be distributed across hundreds or thousands of machines in order to be processed in a reasonable time frame. Other problems which are well suited to the MapReduce approach include but are not limited to:
1. Searching
2. Classification and clustering
3. Machine learning (Apache Mahout is an open source library for solving machine learning problems with MapReduce)
4. tf-idf
We will discuss the above tasks together with web crawling and inverted indexes in Chap. 7.
4.5.2 Scaling with MapReduce
To scale vertically (scale up) refers to the process of adding resources to a single computing resource. For example, we might add more memory, processing power or network throughput which can then be used by virtual machines on that node. MapReduce is designed to work on a distributed file system and uses horizontal scaling (scale out), which refers to the process of achieving scaling by the addition of computing nodes.
The overarching philosophy is often summarised in the adage 'don't move data to workers – move workers to data'; in other words, store the data on the local disks of nodes in the cluster and then use the worker on the node to process its local data. For the kind of problem typical of MapReduce, it is simply not possible to hold all the data in memory. However, throughput can remain reasonable through use of multiple nodes, and reliability is achieved through redundancy. Rather than use expensive 'high end' machines, the approach generally benefits from standard commodity hardware, a philosophy sometimes summarised as 'scale out not up'. In any MapReduce task, coordination is needed, and a 'master' is required to create chunks, balance and replicate and communicate with the nodes.
4.5.3 Server Failure
Where processing occurs on one powerful and expensive machine, we might reasonably expect to run the machine for several years without experiencing hardware failure. However, in a distributed environment using hundreds or thousands of low-cost machines, failures are an expected and frequent occurrence. The creators of MapReduce realised they could combat machine failure simply by replicating jobs across machines. Server failure of a worker is managed by re-executing the task on another worker. Alternatively, several workers can be assigned the same task, and the result is taken from the first one to complete, thus also improving execution time.
4.5.4 Programming Model
The MapReduce model targets a distributed parallel platform with a large number of machines communicating on a network without any explicit shared memory. The programming model is simple and has the advantage that programmers without expertise in distributed systems are able to create MapReduce tasks. The programmer only needs to supply the two functions, map and reduce, both of which work with key/value pairs (often written as <k, v>). Common problems can be solved by writing two or more MapReduce steps which feed into each other.
4.5.5 Apache Hadoop
Hadoop is a popular open-source Java implementation of MapReduce and is used to build cloud environments in a highly fault-tolerant manner. Hadoop will process web-scale data of the order of terabytes or petabytes by connecting many commodity computers together to work in parallel. Hadoop includes a complete distributed batch processing infrastructure capable of scaling to hundreds or thousands of computing nodes, with advanced scheduling and monitoring capability. Hadoop is designed to have a 'very flat scalability curve', meaning that once a program is created and tested on a small number of nodes, the same program can then be run on a huge cluster of machines with minimal or no further programming required. Reliable performance growth should then be in proportion to the number of machines available.
The Hadoop File System (HDFS) splits large data files into chunks which are managed by different nodes on the cluster so that each node is operating on a subset of the data. This means that most data is read from the local disk directly into the CPU, thus reducing the need to transfer data across the network and therefore improving performance. Each chunk is also replicated across the cluster so that a single machine failure will not result in data becoming inaccessible.
Fault tolerance is achieved mainly through active monitoring and restarting tasks when necessary. Individual nodes communicate with a master node known as a 'JobTracker'. If a node fails to communicate with the job tracker for a period of time (typically 1 min), the task may be restarted. A system of speculative execution is often employed such that once most tasks have been completed, the remaining tasks are copied across a number of nodes. Once a task has completed, the job tracker is informed, and any other nodes working on the same tasks can be terminated.
4.5.6 A Brief History of Hadoop
Hadoop was created by
Doug Cutting (also responsible for Lucene) who named it after his son’s toy
elephant. Hadoop was originally developed to support distribution of tasks
associated with the Nutch web crawler project. In Chap. 7,
we will discuss Lucene and Nutch in more detail and will use Nutch to
perform a web crawl and create a searchable Lucene index in the end of chapter
exercise.
Yahoo was one of the primary developers of Hadoop, but the system is now used by many companies including Facebook, Twitter, Amazon and most recently Microsoft. Hadoop is now a top-level Apache project and benefits from a global community of contributors.
4.5.7 Amazon Elastic MapReduce
Elastic MapReduce runs a hosted Hadoop instance on an EC2 (see Chap. 5) instance master which is able to provision other pre-configured EC2 instances to distribute the MapReduce tasks. Amazon currently allows you to specify up to 20 EC2 instances for data-intensive processing.
4.5.8 Mapreduce.NET
Mapreduce.NET is an implementation of MapReduce for the .NET platform which aims to provide support for a wide variety of compute-intensive applications. Mapreduce.NET is designed for the Windows platform and is able to reuse many existing Windows components. An example of Mapreduce.NET in action is found in MRPGA (MapReduce for Parallel GAs), an extension of MapReduce specifically for parallelising genetic algorithms, which are widely used in the machine learning community.
4.5.9 Pig and Hive
Pig, originally developed at Yahoo research, is a high-level platform for creating MapReduce programs using a language called 'Pig Latin', which compiles into physical plans which are executed on Hadoop. Pig aims to dramatically reduce the time required for the development of data analysis jobs when compared to writing raw Hadoop MapReduce programs. The creators of Pig describe the language as hitting 'a sweet spot between the declarative style of SQL and the low-level, procedural style of MapReduce'. You will create a small Pig Latin program in the tutorial section.
Apache Hive, which runs on Hadoop, offers data warehouse services.
4.6 Chapter Summary
In this chapter, we
have investigated some of the key technology underlying computing clouds. In
particular, we have focused on web application technology, virtualisation and
the MapReduce model.
4.7 End of Chapter Exercises
Exercise 1: Create your own VMWare virtual machine

In this exercise, you will create your own virtual machine and install the Ubuntu version of the Linux operating system. In later chapters, we will use this same virtual machine to install more software and complete further exercises. Note that the virtual machine is quite large and ideally you will have at least 20 GB of free disk space and at least 2 GB of memory. The machine may run with less memory or disk space, but you may find that performance is rather slow.
4.8 A Note on the Technical Exercises (Chaps. 4, 5, 6, 7)
As you will already be aware, the state of the cloud is highly dynamic and rapidly evolving. In these exercises, you are going to download a number of tools and libraries, and we are going to recommend that you choose the latest stable version so that you will always be working at the 'cutting edge'. The downside of this is that we cannot guarantee that following the exercise notes will always work exactly as described, as you may well be working with a later version than the one we used for testing. You may need to adapt the notes, refer to information on the source web pages or use web searches to find solutions to any problems or inconsistencies you encounter. The advantage of this is that you will be developing critical skills as you go through the process, skills which would be an essential part of your everyday work should you be involved in building or working with cloud systems. Anyone who has developed software systems will know that it can be very frustrating, and at times you may feel completely stuck. However, you are often closer to a solution than you might think, and persisting through difficulties to solve a problem can be very rewarding and a huge learning opportunity.
4.9 Create Your Ubuntu VM
(1) Install VMWare Player on your machine: http://www.vmware.com/products/player/. You may need to register with VMWare, but VMWare Player is free.
(a) Click on download and follow the instructions until VM Player is installed.
(2) Go to www.ubuntu.com
(a) Go to the Ubuntu download page.
(b) Download the latest version of Ubuntu (e.g. ubuntu-11.10-desktop-i386.iso).
(c) Save the .iso file to a folder on your machine.
(3) Open VMplayer. Be aware that this process can be quite lengthy.
(a) Select 'Create New Virtual Machine' (Fig. 4.6).
(b) Select 'Installer disc image file (iso)' and browse to the folder where you saved your Ubuntu .iso file. Select next.
(c) Enter a name for your VM and a username and password for the main user on your Linux VM.
(d) Select a location where your VM will be stored: you will need plenty of disk space (~5 GB).
(e) You now need to select additional disk space for the VM to use. If you have plenty of space you can go with the recommendation (e.g. 20 GB). It is probably best to select at least 8 GB. Select 'Split virtual disk into 2 GB files'. Select next.
Fig. 4.6 VMware Player
(f) Check the options you have selected and select Finish. If you want to run the VM as soon as it is ready, leave 'Power on this virtual machine after creation' selected.
(4) Your machine will now be created. This will take some time depending on the configuration of your underlying hardware.
(5) You may be asked to select your country so that the keyboard layout can be set appropriately. You can test your keyboard to check it works as expected.
4.10 Getting Started
Your machine is now ready to use.
1. Once your Ubuntu VM has loaded up inside VMWare, log on using the username and password you gave when creating the VM. VMWare tools should be automatically installed, and you may need to restart your machine a number of times. Once complete, you should be able to maximise the size of your VM screen.
2. In the Virtual Machine menu you should also see 'power options' where you can 'power off', 'reset' or 'suspend' your virtual machine. 'Suspend' is often a useful option as it will maintain the particular state that your virtual machine is in and restore it to that state next time you 'resume' the virtual machine.
3. If you wish to copy your virtual machine, perhaps to a USB memory stick, you should first suspend or power off the virtual machine. You then need to locate the folder where you created the virtual machine and copy the entire folder to the required destination. You can then start the virtual machine on another computer, provided that VMware Player is installed, by locating the file ending with .vmx and double clicking on that file. Note that if something goes wrong later on, you can switch to using this copy. You can of course take a copy of the virtual machine at any point during your development.
4.11 Learn How to Use Ubuntu
In this tutorial and others that follow, we will be using the Ubuntu virtual machine. If you are not familiar with Ubuntu, it will be well worth getting familiar with the basics. There are many guides available, and a good starting point is the Ubuntu home page (http://www.ubuntu.com/ubuntu) where you can take a 'tour' of the important features. In your virtual machine, there is also help available: just click on the 'dash home' icon on the top left, type help and select the help icon (Fig. 4.7).
So take some time to explore the environment and get used to Ubuntu. Actually most of the features are quite intuitive, and you can learn by just 'trying'. In terms of the tutorials in this book, we will only be using a small subset of the available commands and applications. In particular you will need to:
• Be able to browse, copy, move, delete and edit files.
• Use the archive manager to extract zipped files.
• Use the terminal to send basic Unix commands including:
• ls to list the contents of a folder (traditionally referred to as a 'directory' in Unix)
• cd to change directory
• pwd to identify your location in the directory structure
Fig. 4.7 Finding assistance from the Ubuntu home screen
4.12 Install Java
We will be using Java for a number of the tutorial exercises. Java is a freely available programming language. We are going to install the Java JDK from Oracle, which is suitable for all the tutorial material in this book. These notes are partly based on information from the Ubuntu community at https://help.ubuntu.com/community/Java. If you encounter problems, you may want to check the site for any updates.
1. Download the latest Oracle JDK 7 for Linux from http://www.oracle.com/technetwork/java/javase/downloads/java-se-jdk-7-download-432154.html (e.g. jdk7-linux-i586.tar.gz) in your Ubuntu VM.
2. You will have to agree to the Oracle Binary Code License to download this.
3. Locate the file by clicking on the home folder icon and then selecting the 'Downloads' folder (Fig. 4.8).
Fig. 4.8 Downloads folder illustrating the Oracle Java
Development Kit archive
4. Right click on the file and select 'open with archive manager'.
5. In the archive manager select 'extract' and save. Select 'show the files' once the extraction is complete.
6. A new folder called 'jdk1.7.0' or similar should have been created. Rename this folder to 'java-7-oracle' by right clicking and selecting 'rename'.
7. In your Ubuntu virtual machine open a terminal session. To get a terminal session just click the Ubuntu logo in the side bar, type 'terminal' and then click on the terminal icon (Fig. 4.9). Move to the directory where you extracted the files.
8. Now type the following commands one at a time. Enter your password if prompted and follow on-screen instructions where required. Make sure you type these exactly, including the case.
user@ubuntu:~$ sudo mkdir -p /usr/lib/jvm/
user@ubuntu:~$ sudo mv java-7-oracle/ /usr/lib/jvm/
user@ubuntu:~$ sudo add-apt-repository ppa:nilarimogard/webupd8
user@ubuntu:~$ sudo apt-get update
user@ubuntu:~$ sudo apt-get install update-java
user@ubuntu:~$ sudo update-java
Fig. 4.9 Invoking a
terminal session in Ubuntu
9. If prompted, select the Java version you just updated (/usr/lib/jvm/jdk1.7.0).
10. Answer Y if prompted to continue with the Java install.
11. Type the following commands to set the JAVA_HOME:
user@ubuntu:~$ JAVA_HOME=/usr/lib/jvm/java-7-oracle
user@ubuntu:~$ export JAVA_HOME
user@ubuntu:~$ PATH=$PATH:$JAVA_HOME/bin
user@ubuntu:~$ export PATH
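As a quick check that the JDK is now on your path (the exact version string printed will vary with the release you installed):

user@ubuntu:~$ java -version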
4.13 MapReduce with Pig
In this tutorial we are going to implement the classic word count program using Pig Latin.
1. Open your home folder and create a new folder called 'Pig'.
2. In your virtual machine, use a browser such as Firefox to go to http://www.apache.org/dyn/closer.cgi/pig, select a suitable mirror and then download the latest stable release of pig, e.g. pig-0.9.2.tar.gz.
Fig. 4.10 Ubuntu download
3. Open the home folder and then the download folder to see the downloaded pig file (Fig. 4.10).
4. Right click on the file (e.g. pig-0.9.2.tar.gz) and select 'open with archive manager'.
5. In the archive manager select 'extract' and allow the current folder as the destination folder. Select 'show the files' once the extraction is complete.
6. The archive manager will extract the file to a folder called pig-0.9.2 or similar. The only file we need for this tutorial is the .jar file. Locate this file (e.g. pig-0.9.2.jar), right click and select 'copy' and paste the file into the pig folder you created earlier.
7. We need a text file which we can use to count the words. You can use any text file you like, but in this example we will copy over the NOTICE.txt file from the extracted folder to the pig folder for testing purposes.
8. We are now going to create our first Pig Latin program. To do this, we first need a text editor, so select the 'dash home' icon, type 'text' and select the text editor.
9. Paste in the following code (originally from http://en.wikipedia.org/wiki/Pig_(programming_language)) and save the file as wordCount.pig
A = load 'NOTICE.txt';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = filter B by word matches '\\w+';
D = group C by word;
E = foreach D generate COUNT(C) as count, group as word;
F = order E by count desc;
store F into 'wordcount.txt';
10. Open a terminal and navigate to the pig folder.
11. Type in the following command to run your word count program using pig (adjust the jar name to match the version you downloaded):

java -Xmx512M -cp pig-0.9.2.jar org.apache.pig.Main -x local wordCount.pig
12. Once the program has finished running, you should see a new folder called wordcount.txt. Navigate into the folder and open the contents of the file. Compare your output with the text file you used, and hopefully you will see that the program has been successful and the frequency of the words has been counted and placed in order.
13. Try and run the program with another text file of your choice.
14. Alter the program so that it orders the output by word rather than frequency.
15. Alter the program so that only words starting with a letter above 't' are output.
4.14 Discussion
You might be thinking that the program actually seemed to take quite some time to complete this simple task on a small text file. However, notice that we are running the Pig Latin program using the 'local' mode. This is useful for testing our programs, but if we were really going to use pig for a serious problem, such as the results of a web crawl, we would switch to 'hadoop' mode. We could then run the same Pig Latin program, and the task would be automatically parallelised and distributed on a cluster of possibly thousands of nodes. This is actually much simpler than you might expect, especially when vendors such as Amazon offer a set of nodes preconfigured for MapReduce tasks using pig.
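For illustration, the switch is just a change to the execution type flag; the command below is a sketch and assumes a properly configured Hadoop cluster is reachable from your machine:

java -Xmx512M -cp pig-0.9.2.jar org.apache.pig.Main -x mapreduce wordCount.pig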
4.15 MapReduce with Cloudera
Should you wish to experiment further with MapReduce, a nice way to start is to visit the Cloudera site (http://www.cloudera.com/) where you can download a virtual machine with Hadoop and Java pre-installed. Cloudera also offer training and support for a range of MapReduce-related tasks.
References
Golden, B.: Virtualization for Dummies. Wiley, Chichester (2008)
Nixon, R.: Learning PHP, MySQL, and JavaScript. O'Reilly Media, Sebastopol (2009). This is a useful introduction to developing web applications with the popular PHP language
White, T.: Hadoop: The Definitive Guide, 2nd edn. O'Reilly Media, Sebastopol (2009)