
Cloud Technology



What the reader will learn:
      Essential web technology for the cloud
      How virtualisation supports the cloud
      About distributed programming and the MapReduce algorithm
      How to create a desktop virtual machine
      How to create a simple MapReduce job

4.1            Introduction




If we are asked to sum up cloud computing in four key words, we might arguably choose ‘web’, ‘elasticity’, ‘utility’ and ‘scalability’. In this chapter, we are going to look at the technology underlying the cloud. Cloud applications are accessed via the web, and web technology is integral to the cloud, so we will begin with a brief review of the current state of web technology. We will then move on to virtualisation, a key cloud technology which has many benefits including improved use of resources. Virtualisation can be used to provide the elasticity required to offer cloud computing as a utility. We then turn our attention to the MapReduce programming model, originally developed by the founders of Google and now used to provide scalability to many of the distributed applications which are typical of the cloud and simply too big to be handled in a user-friendly time frame by traditional systems.
 At the end of this chapter, you will be guided through the process of creating your own virtual machine (VM) using VMWare and the open source Ubuntu (Linux) operating system. You will then run a simple MapReduce job using your virtual machine. We will use this VM again in later chapters.

4.2           Web Technology

A feature of web and cloud applications is that a number of separate technologies are required to work together. A minimum web application is likely to use HTTP, XHTML, CSS, JavaScript, XML, server programming (e.g. PHP) and some mechanism to persist data (to be examined in detail in a later chapter). We will introduce these technologies briefly here; if you are already familiar with web technology, you can safely skip this section. On the other hand, if you are completely new to web technology, you may prefer to begin by working through the basic tutorials for HTML, CSS, JavaScript and XML at the excellent W3Schools site: http://www.w3schools.com/.

4.2.1       HTTP

 Cloud systems are built on the World Wide Web, and the web is built on the underlying network communication system known as HTTP or Hypertext Transfer Protocol. HTTP is the key to building cloud systems, and at a low level, each interaction in a cloud application uses HTTP (Nixon  2009 ).
Berners-Lee (who originally proposed the World Wide Web) and his team are credited with inventing the original HTTP. The first version had just one method, namely, GET, which was used when a user made a request for a web page from a server. A number of other methods have since been added including:
      HEAD which asks the server for information about a resource
      PUT which stores data in a resource
      POST which sends data to a program to be processed on the server
      DELETE which deletes the specified resource
HTTP uses a simple request/response cycle which allows two parties, often referred to as the client and the server, to communicate. The client drives the communication by sending requests to the server. The server processes the request and sends responses, which may contain content such as HTML. A response may contain status information and the content requested in the message body.
 HTTP requests are centred on a particular resource. We can think of a resource as anything on the network that has a name. Every HTTP request is either asking to retrieve data from a resource or sending data to a resource.
Each HTTP request that you generate using a browser or application has three parts:
1.   A request line consisting of the HTTP method, the URL of the resource and the protocol version (usually HTTP 1.1)
2.   A series of header lines containing metadata about the resource
3.   The body consisting of a stream of data
A typical HTTP request might look like:
GET /cloud/book/chapter4 HTTP/1.1
 User-Agent: Mozilla/5.001 (windows; U; NT4.0; en-US; rv:1.0)
 Host: cloud-principles-practices.appspot.com
Table 4.1 Common HTTP status codes

Status code   Description
200           OK
301           Moved permanently
401           Unauthorised
403           Forbidden
404           Not found
500           Internal server error

When a server responds, the first line includes a 3-digit status code and a message indicating whether or not the request succeeded, for example,
 HTTP/1.1 200 OK
 Other codes, which you have probably encountered, are shown in Table 4.1.
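 A complete response combines the status line, a series of headers and the message body. A hedged sketch of a possible response to the earlier request (the header values and body are illustrative, not taken from a real server):

HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Content-Length: 62

<html><body><h1>Chapter 4: Cloud Technology</h1></body></html>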
An important aspect of HTTP is that it is unable to store mutable state; that is, we cannot assign values to variables or store anything to be shared between different calls to the same function. Everything that is needed by a server process must be sent explicitly as a parameter.

4.2.2       HTML (HyperText Markup Language) and CSS (Cascading Style Sheets)

Web pages are written using HTML, which was also originally proposed by Berners-Lee. HTML is used to describe the structure of the page and includes structural elements such as paragraphs (e.g. <p></p>), headings (e.g. <h1></h1>) and lists (e.g. <li></li>).
CSS is a technology which allows us to create styles which define how HTML elements should look. The combination of HTML and CSS allows web page developers to separate structure from appearance. The ‘separation of concerns’ is an important principle in many areas of software engineering and is central to the development of cloud systems.
All pages on a site are usually styled in the same way, and styles are generally not specific to a single web page. CSS allows us to define our style information and then apply it to many pages across a site. So, for example, should we wish to present our pages in a different way to distinct user groups, the required changes may be limited to a small number of CSS pages rather than updating all the pages on our site. The separation of structure from appearance can be translated into the separation of the logic of our cloud application from the rendering of the pages to be presented to our users.
Separating style information lets you write it once and use it on all pages instead of coding it again and again. This also relates to another important principle of software engineering, namely, ‘Don’t Repeat Yourself’ or DRY. When the DRY principle is applied successfully, modifications are localised and should not occur in unrelated elements. DRY aims to reduce repetition of all kinds of information but is particularly important to multi-tier web or cloud architectures where the improvements in maintainability become critical to success.
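As a minimal, hedged sketch of this separation (the class name and page content are invented for illustration), the structure lives in the HTML while the appearance is defined once in CSS:

<html>
  <head>
    <!-- In practice these rules would live in a shared .css file
         referenced by every page on the site (DRY) -->
    <style>
      p.note { color: grey; font-style: italic; }
    </style>
  </head>
  <body>
    <!-- Structure only: headings, paragraphs, lists -->
    <h1>Chapter 4</h1>
    <p class="note">Cloud applications are accessed via the web.</p>
  </body>
</html>

Changing the single rule for p.note restyles every page that uses the shared stylesheet, without touching the structure.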

4.2.3         XML (eXtensible Markup Language)

XML is a technology introduced by the W3C to allow data to be encoded in a textual format which is readable to both humans and machines. XML simplifies and standardises the exchange and storage of data and has become a standard way of representing data. In XML, the tags relate to the meaning of the enclosed text, making XML self-explanatory. XML allows for a hierarchical approach, is portable and interoperable and has been used to create a wide variety of markup languages including XHTML, which is simply HTML in XML format.
The Document Object Model (DOM) allows for machine navigation of an XML document as if it were a tree of node objects representing the document contents, and there are many tools and libraries freely available in many programming languages which will parse XML. Unsurprisingly, XML is widely used in cloud systems. XML does, however, have some disadvantages and in particular is criticised for being ‘tag heavy’; indeed, a significant fraction of an XML file can be taken up by the tags.
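A small, hedged XML sketch (the element names are invented) illustrates both the self-describing tags and the ‘tag heavy’ criticism, since much of the file is markup rather than data:

<book>
  <title>Guide to Cloud Computing</title>
  <chapter number="4">
    <topic>Virtualisation</topic>
    <topic>MapReduce</topic>
  </chapter>
</book>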

4.2.4         JSON (JavaScript Object Notation)

JSON is an increasingly popular lightweight alternative to XML and is directly based on data structures found in the JavaScript language (although JSON is language independent). JSON is built on two human-readable structures, namely, a collection of name/value pairs and an ordered list of values.
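For comparison, here is a hedged sketch of the same data as the XML example above, expressed in JSON using exactly those two structures (an object of name/value pairs and an ordered list):

{
  "title": "Guide to Cloud Computing",
  "chapter": {
    "number": 4,
    "topics": ["Virtualisation", "MapReduce"]
  }
}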

4.2.5           JavaScript and AJAX (Asynchronous JavaScript and XML)

 JavaScript is a programming language which normally runs in the client’s browser and allows the web page to be updated by traversing and updating the DOM tree. This means that the programmer is able to dynamically update an XHTML document by navigating the tree and then applying insert, delete or update operations to elements, attributes (including style information) or contents. JavaScript can also add important functionality such as the ability to validate user input before an HTTP request is sent to the server.
XMLHttpRequest is a JavaScript object which manages HTTP requests and has the useful capability of working outside the normal browser request/response cycle. The technology surrounding the XMLHttpRequest is known as AJAX and allows for partial page updates which in turn improve the user experience and allow web applications to work in a comparable manner to desktop applications. Such web applications are sometimes referred to as Rich Internet Applications (RIA). However, the asynchronous nature of the HTTP calls adds complexity to web application programming, and as pages may be updated without changing the URL in the user’s browser, special consideration needs to be given to handling the user’s back button and bookmarking capability. Furthermore, older browsers do not support the technology, so developers must check the browser type and version when using AJAX.
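As a hedged illustration of the ideas above (the ‘/items’ URL and element id are hypothetical), the following JavaScript issues an asynchronous request and performs a partial page update when the response arrives:

// Create the request object and describe the call
var xhr = new XMLHttpRequest();
xhr.open('GET', '/items', true);              // true = asynchronous
xhr.onreadystatechange = function () {
  // readyState 4 = response complete; status 200 = OK
  if (xhr.readyState === 4 && xhr.status === 200) {
    // Partial page update: only this element of the DOM changes
    document.getElementById('items').innerHTML = xhr.responseText;
  }
};
xhr.send();                                   // the page itself is never reloaded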
Fig. 4.1 MVC

4.2.6       Model-View-Controller (MVC)

MVC is a standard architecture for separating components in an interactive application. The architecture was first described well before the World Wide Web, in 1979, by Trygve Reenskaug but in recent years has proved increasingly popular in the development of web applications. MVC is interpreted and applied in slightly different ways, but here we use it to refer to the separation of application data (contained in the model) from graphical representation (the view) and input-processing logic (the controller). A simple summary of MVC is given in Figure 4.1.
The controller implements logic for processing the input, the model contains the application data and the view presents the data from the model. An MVC application consists of a set of model/view/controller triples responsible for different elements of the user interface. MVC is a natural way to write cloud applications, and so it is worth briefly describing the three components in more detail.

4.2.6.1 Model

We mentioned earlier that HTTP does not allow us to assign values or store information between requests. In a cloud application based on MVC, the model is responsible for maintaining the state of the application, whether this is just for a small number of request/response cycles or if we need to persist data in a permanent way, for example, using a server database. The model is, however, more than just data, as the model is responsible for the enforcement of the business rules that apply to the data. The model acts as a ‘gatekeeper’ and a data store. Note that the model is totally decoupled from the interface and works instead with the underlying data, often stored in a database. As our application is based on HTTP, the server must wait for requests from the client before it can send its response.

4.2.6.2 View

 This is the visible part of the user interface and will normally be based on data in the model written out in XHTML. The view will render the XHTML when instructed to do so by the controller, but the view’s work is done once the data is displayed. The view itself never handles incoming data. There may be many views, for example, for different devices with different screen sizes and capability, which access the same model data.

4.2.6.3 Controller

The controller (sometimes called the ‘input controller’) is the bridge between the client and the server and in a sense orchestrates the whole application. Interface actions are accepted and may be transformed into operations which can be performed on the model. The controller takes content data produced by the model and translates that data into a form the view can render. In short, a controller accepts input from the user and instructs the model and a view to perform actions based on that input.
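To make the three roles concrete, here is a minimal, hedged sketch in Java (all class and method names are invented; in a real web application a framework would route HTTP requests to the controller):

import java.util.ArrayList;
import java.util.List;

class TaskModel {                          // Model: state plus business rules
    private final List<String> tasks = new ArrayList<String>();
    void add(String task) {
        if (task == null || task.trim().isEmpty())
            throw new IllegalArgumentException("empty task");  // 'gatekeeper' rule
        tasks.add(task);
    }
    List<String> all() { return new ArrayList<String>(tasks); }
}

class TaskView {                           // View: rendering only, holds no data
    String render(List<String> tasks) {
        StringBuilder html = new StringBuilder("<ul>");
        for (String t : tasks) html.append("<li>").append(t).append("</li>");
        return html.append("</ul>").toString();
    }
}

class TaskController {                     // Controller: input -> model -> view
    private final TaskModel model = new TaskModel();
    private final TaskView view = new TaskView();
    String handleAdd(String input) {
        model.add(input);                  // transform input into a model operation
        return view.render(model.all());   // instruct the view to render the data
    }
}

Note how the view never handles incoming data and the model knows nothing about XHTML, matching the division of responsibilities described above.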

4.3             Autonomic Computing

Autonomic computing seeks to improve computer systems by decreasing human involvement in their operation with the ultimate goal of producing systems which manage themselves and adapt to unpredictable changes. The advantages become significant in a large data centre of hundreds or thousands of nodes. In this environment, machine failures are not rare events but are regular occurrences which must be managed carefully.
Autonomous or semi-autonomous machines rely on three elements: monitoring probes and gauges, an adaptation engine and effectors to apply any required changes to the system. Decisions made by an autonomic system are usually based on a set of predefined policies. IBM’s Autonomic Computing Initiative defines four properties of autonomic systems, namely, self-configuration, self-optimisation, self-healing and self-protection.

4.4            Virtualisation

Virtualisation refers to the process where abstract or virtual resources are used to simulate physical resources. Virtualisation can be of great benefit to cloud systems as it can improve resource pooling and enable rapid and elastic resource provisioning. These benefits make for agile, flexible networks leading to significant cost reductions. In typical cloud computing applications, servers, storage and network devices may all be virtualised (Golden 2008).
Virtualisation has a long history in the software industry, and the term has been used in a variety of ways. The earliest type of virtual machine dates back to IBM in the 1960s, where virtualisation was used to maximise use of expensive mainframe systems through multitasking. We will briefly introduce the main types of virtualisation in the following sections.

4.4.1        Application Virtualisation

 Application virtualisation, sometimes called a ‘process virtual machine’, provides a platform-independent programming environment which hides the details of the underlying hardware and operating system. The Java Virtual Machine (JVM) is a commonly cited and popular example of an application virtual machine. The JVM provides an environment in which Java bytecode can be executed. The JVM can run on many hardware platforms, and the use of the same bytecode allows for the ‘write once run anywhere’ description which is highly appealing to application developers. Java bytecode is an intermediate language which is typically compiled from Java, but a number of other languages such as Python (Jython) or Ruby (JRuby) can also be used.

4.4.2       Virtual Machine

In the cloud context, ‘virtual machine’ is usually taken to mean a software implementation of a machine with a complete independent operating system, which executes programs like a physical machine. This type of virtualisation began with work in the Linux community but has led to a variety of commercial and open source libraries.
 The virtual machine may have a number of virtual components including:
      ‘Virtual processors’ which share time with other virtual machines on the physical processor
      ‘Virtual memory’ which is normally implemented as a slice of physical RAM on the host machine (not to be confused with the common use of ‘virtual memory’ referring to the combination of various types of storage media)
      ‘Virtual hard disk’ which is typically one or more large files on the physical disk
      ‘Virtual network’ based on a network interface controller (NIC)
 In the next chapter, we will see how vendors such as Amazon use the power of virtualisation to offer users access to cloud-based resources such as compute power (e.g. Amazon EC2) or storage (e.g. Amazon S3) which they can rent from Amazon at an hourly rate.

4.4.3        Desktop Virtualisation

In desktop virtualisation, the user is provided with access to a virtual machine over which they may have complete control. This can be very useful, for example, in an organisation where users have very limited rights in terms of program installation and configuration on local machines: such users can be given complete freedom to install their own programs on a virtual machine. Furthermore, the virtual machine can run an entirely different operating system to that of the host, so users can choose and configure their preferred environment. Virtual machines can be copied, moved or destroyed as required, and a ‘snapshot’ can be taken of a machine at any time so that the user can save and return to a particular environment. This capability also provides for a useful and safe way of testing new upgrades and environments. Examples of the technology include VMWare Workstation, Microsoft VirtualPC and Oracle Virtual Box. At the end of this chapter, you will create your own desktop virtual machine using VMWare.

4.4.4        Server Virtualisation

Server virtualisation applies the same principle as desktop virtualisation but to the server environment. Server virtualisation allows entire operating systems, applications and accessories to be packaged into a set of files that can be moved or copied and then run on common hardware in any compatible virtualised environment. Virtual machines allow us to decouple the computing infrastructure from the physical infrastructure, potentially leading to a number of important benefits in the running of a data centre.

4.4.4.1 Efficiency

It has been estimated that most servers are highly underutilised and run at about 15% of their total capacity. Virtualisation allows for a more efficient use of machine resources, which can lead to a reduction in power consumption, cooling and space requirements (see also Chap. 3). This improved use of resources can make a big difference to the overall energy consumption of a large data centre, and it has been suggested that virtualisation can be an important component in achieving reduced carbon emissions from the IT industry.

4.4.4.2 Isolation

Virtualisation means that secure, encapsulated clones of your entire work environment can be created for testing and backup purposes, leading to improved maintenance, flexibility and high availability. The failure of one virtual machine should not affect any of the other machines running on the same host. This has important and beneficial implications for vendors offering cloud-based resources which are available to multiple tenants. Thus, for example, if a user of an Amazon EC2 machine accidentally corrupts their operating system or runs a program with an infinite loop, it should not have a detrimental effect on other users running virtual machines in the same data centre.

4.4.4.3 Mobility

Virtual machine mobility enables applications to migrate from one server to another even whilst they are still running. For example, with systems such as vSphere, virtual machines can be configured to automatically restart on a different physical server in a cluster if a host fails. This capability can be a critical component in the provision of autonomic computing as well as leading to the following advantageous outcomes:
      A reduction in the number of servers required
      Data centre maintenance without downtime
      Improved availability, disaster avoidance and recovery
      Simplification of data centre migration or consolidation
      Data centre expansion
      Workload balancing across multiple sites
 There are, of course, some disadvantages:
      Virtual servers require a hypervisor to keep track of virtual machine identification (see below), local policies and security, which can be a complex task.
      The integration of the hypervisor with existing network management software can be problematic and may require changes to the network hardware.

4.4.5        Storage Virtualisation

Storage virtualisation provides different servers access to the same storage space. Data centre storage comprises storage area networks (SAN) and network-attached storage (NAS). In a cloud computing environment, a network can be enabled to address large virtual pools of storage resources. Storage virtualisation allows all the multiple disk arrays, often made by different vendors and scattered over the network, to be used as if they were a single storage device which can be centrally managed.

4.4.6        Implementing Virtualisation

As mentioned above, multiple virtual machines can run simultaneously on the same physical machine, and each virtual machine can have a different operating system. A virtual machine monitor (VMM) is used to control and manage the virtual machines on a single node. A VMM is often referred to as a hypervisor (see below). At a higher level, virtual machine infrastructure managers (VIMs) are used to manage, deploy and monitor virtual machines on a distributed pool of resources such as a data centre. Cloud infrastructure managers (CIMs) are web-based management solutions specifically for cloud systems.
Virtualisation management is the technology that abstracts the coupling between the hardware and the operating system, allowing computing environments to be dynamically created, expanded, shrunk, archived or removed. It is therefore extremely well suited to the dynamic cloud infrastructure. Load balancing is of course an important part of this management and soon becomes critical in the prevention of system bottlenecks due to unbalanced loads.

4.4.7       Hypervisor

Generally, virtualisation is achieved by the use of a hypervisor. The hypervisor is software that allows multiple virtual images to share a single physical machine and that logically assigns and separates physical resources. The hypervisor allows the computer hardware to run multiple guest operating systems concurrently. As mentioned above, each of the guest operating systems is isolated and protected from any others and is unaffected by problems or instability occurring on other virtual machines running on the same physical machine. The hypervisor presents the guest operating systems with an abstraction of the underlying computer system and controls the execution of the guest operating system.

Fig. 4.2 Hypervisor (bare metal)
Many powerful hypervisors, including KVM (Kernel-based Virtual Machine), Xen and QEMU, are open source. VMWare is currently the market leader in the field of virtualisation, and many of its products are based on open source. Amazon uses a modified version of Xen.
Figure 4.2 shows a high-level view of a hypervisor where the machine resources of the host are shared between a number of guests, each of which may be running applications and each has direct access to the underlying physical resources.

4.4.8        Types of Virtualisation

Virtualisation can be classified into two categories depending on whether or not the guest operating system kernel needs to be modified. In full virtualisation, a guest operating system runs unmodified on a hypervisor. In the case of para-virtualisation, there is no need to emulate the entire hardware environment. Para-virtualisation requires modifications to the guest operating system kernel so that it ‘becomes aware’ of the hypervisor and can communicate with it. Because they avoid the overhead of emulating the full hardware environment, guest operating systems using para-virtualisation are generally faster.
Figure 4.3 shows the situation where the hypervisor is running on a hosted operating system. This is the type of hypervisor we will be using in the exercise at the end of this chapter.

Fig. 4.3 Hosted hypervisor

4.5           MapReduce

MapReduce was originally introduced by Google as a distributed programming model using large server clusters to process massive (multi-terabyte) data sets. The model can be applied to many large-scale computing problems and offers a number of attractive features such as automatic parallelisation, load balancing, network and disk transfer optimisation and robust handling of machine failure. MapReduce works by breaking a large problem into smaller parts, solving each part in parallel and then combining results to produce the final answer (White 2009).
MapReduce runs on top of a specialised file system such as the Google File System (GFS) or the Hadoop File System (HDFS). Data is loaded and partitioned into chunks (commonly 64 MB), and each chunk is replicated. A key feature of MapReduce is that data processing is collocated with data storage, and as a distributed programming paradigm, a number of advantages are evident when compared to the traditional approach of moving the data to the computation, including:
1.   Scalability
2.   Reliability
3.   Fault tolerance
4.   Simplicity
5.   Efficiency
All these advantages are obviously very relevant to the cloud, where processing large amounts of data using distributed resources is a central task. Unsurprisingly, MapReduce is a programming model used widely in cloud computing environments for processing large data sets in a highly parallel way.
Fig. 4.4 MapReduce overview
MapReduce was inspired by the map and reduce functions which are commonly found in functional programming languages like LISP. In LISP, a map takes an input function and a sequence of values and then applies the function to each value in the sequence. Reduce combines all the elements in the sequence using an operator such as * or +. In MapReduce, the functions are not so rigidly defined, but programmers still specify the computation in terms of the two functions map and reduce, which can be carried out on subsets of the total data under analysis in parallel. The map function is used to generate a list of intermediate key/value pairs, and the reduce function merges all the intermediate values associated with the same intermediate key.
When all the tasks have been completed, the result is returned to the user. In MapReduce, input data is partitioned across multiple worker machines executing in parallel; intermediate values are output from the map worker machines and fed to a set of ‘reduce’ machines (there may be some intermediate steps such as sorting). It is, perhaps, useful to think of MapReduce as representing a data flow as shown in Figure 4.4 rather than a procedure. Jobs are submitted by a user to a master node that selects idle workers and assigns each one a MapReduce task to be performed in parallel. The process of moving map outputs to the reducers is known as ‘shuffling’.

4.5.1       MapReduce Example

 The canonical MapReduce example takes a document or a set of documents and outputs a listing of unique words and the number of occurrences throughout the text data. The map function takes as its key/value pair the document name and document contents. It then reads through the text of the document and outputs an intermediate key/value listing for each word encountered together with a value of 1. The reduce phase then counts up each of these individual 1’s and outputs the total value for each word. The pseudocode for the functions is shown below. The map function takes the name of the document as the key and the contents as the value.
Fig. 4.5 MapReduce word count example
map(document_name, document_contents) {
  for each word w in document_contents
    emit_intermediate(w, 1)
}

reduce(a_word, intermediate_vals) {
  result = 0
  for each value v in intermediate_vals
    result += v
  emit result
}
Figure 4.5 shows a simple example with the three input files shown on the left and the output word counts shown on the right.
The same kind of process can be applied to many other problems and is particularly useful where we need to process a huge amount of raw data, for example, from documents which have been returned by a web crawl. To create an inverted index (see Chap. 7) after a web crawl, we can create a map function to read each document and output a sequence of <word, documentID> pairs. The reduce function accepts all pairs for a given word and outputs pairs of <word, documentIDList> such that for any word, we can quickly identify all the documents in which the word occurs. The amount of data may be simply too big for traditional systems and needs to be distributed across hundreds or thousands of machines in order to be processed in a reasonable time frame. Other problems which are well suited to the MapReduce approach include but are not limited to:
1.   Searching
2.   Classification and clustering
3.   Machine learning (Apache Mahout is an open source library for solving machine learning problems with MapReduce)
4.   tf-idf
We will discuss the above tasks together with web crawling and inverted indexes in Chap. 7.

4.5.2        Scaling with MapReduce

To scale vertically (scale up) refers to the process of adding resources to a single computing resource. For example, we might add more memory, processing power or network throughput, which can then be used by virtual machines on that node. MapReduce is designed to work on a distributed file system and uses horizontal scaling (scale out), which refers to the process of achieving scaling by the addition of computing nodes.
The overarching philosophy is often summarised in the adage ‘don’t move data to workers – move workers to data’; in other words, store the data on the local disks of nodes in the cluster and then use the worker on the node to process its local data. For the kind of problem typical of MapReduce, it is simply not possible to hold all the data in memory. However, throughput can remain reasonable through use of multiple nodes, and reliability is achieved through redundancy. Rather than use expensive ‘high end’ machines, the approach generally benefits from standard commodity hardware, a philosophy sometimes summarised as ‘scale out not up’. In any MapReduce task, coordination is needed, and a ‘master’ is required to create chunks, balance and replicate them and communicate with the nodes.

4.5.3       Server Failure

Where processing occurs on one powerful and expensive machine, we might reasonably expect to run the machine for several years without experiencing hardware failure. However, in a distributed environment using hundreds or thousands of low-cost machines, failures are an expected and frequent occurrence. The creators of MapReduce realised they could combat machine failure simply by replicating jobs across machines. Server failure of a worker is managed by re-executing the task on another worker. Alternatively, several workers can be assigned the same task, and the result is taken from the first one to complete, thus also improving execution time.

4.5.4        Programming Model

The MapReduce model targets a distributed parallel platform with a large number of machines communicating on a network without any explicit shared memory. The programming model is simple and has the advantage that programmers without expertise in distributed systems are able to create MapReduce tasks. The programmer only needs to supply the two functions, map and reduce, both of which work with key/value pairs (often written as <k, v>). Common problems can be solved by writing two or more MapReduce steps which feed into each other.

4.5.5       Apache Hadoop

Hadoop is a popular open-source Java implementation of MapReduce and is used to build cloud environments in a highly fault-tolerant manner. Hadoop will process web-scale data of the order of terabytes or petabytes by connecting many commodity computers together to work in parallel. Hadoop includes a complete distributed batch processing infrastructure capable of scaling to hundreds or thousands of computing nodes, with advanced scheduling and monitoring capability. Hadoop is designed to have a ‘very flat scalability curve’, meaning that once a program is created and tested on a small number of nodes, the same program can then be run on a huge cluster of machines with minimal or no further programming required. Reliable performance growth should then be in proportion to the number of machines available.
The Hadoop File System (HDFS) splits large data files into chunks which are managed by different nodes on the cluster so that each node is operating on a subset of the data. This means that most data is read from the local disk directly into the CPU, thus reducing the need to transfer data across the network and therefore improving performance. Each chunk is also replicated across the cluster so that a single machine failure will not result in data becoming inaccessible.
Fault tolerance is achieved mainly through active monitoring and restarting tasks when necessary. Individual nodes communicate with a master node known as a ‘JobTracker’. If a node fails to communicate with the job tracker for a period of time (typically 1 min), the task may be restarted. A system of speculative execution is often employed such that once most tasks have been completed, the remaining tasks are copied across a number of nodes. Once a task has completed, the job tracker is informed, and any other nodes working on the same tasks can be terminated.
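To make the earlier word count pseudocode concrete, below is a hedged sketch of the canonical word count written against Hadoop’s Java MapReduce API (assuming a Hadoop 2.x release; the input and output paths are supplied as command-line arguments). It is illustrative rather than definitive:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit <word, 1> for every word in this worker's input split
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the 1's collected for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Setting the reducer as a combiner pre-aggregates counts on each map node, reducing the volume of intermediate data shuffled across the network.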

4.5.6        A Brief History of Hadoop

Hadoop was created by Doug Cutting (also responsible for Lucene), who named it after his son’s toy elephant. Hadoop was originally developed to support distribution of tasks associated with the Nutch web crawler project. In Chap. 7, we will discuss Lucene and Nutch in more detail and will use Nutch to perform a web crawl and create a searchable Lucene index in the end-of-chapter exercise.
Yahoo was one of the primary developers of Hadoop, but the system is now used by many companies including Facebook, Twitter, Amazon and most recently Microsoft. Hadoop is now a top-level Apache project and benefits from a global community of contributors.

4.5.7        Amazon Elastic MapReduce

Elastic MapReduce runs a hosted Hadoop instance on an EC2 (see Chap. 5) instance master which is able to provision other pre-configured EC2 instances to distribute the MapReduce tasks. Amazon currently allows you to specify up to 20 EC2 instances for data-intensive processing.

4.5.8       MapReduce.NET

MapReduce.NET is an implementation of MapReduce for the .NET platform which aims to provide support for a wide variety of compute-intensive applications. MapReduce.NET is designed for the Windows platform and is able to reuse many existing Windows components. An example of MapReduce.NET in action is found in MRPGA (MapReduce for Parallel GAs), an extension of MapReduce specifically for parallelising genetic algorithms, which are widely used in the machine learning community.

4.5.9       Pig and Hive

Pig, originally developed at Yahoo research, is a high-level platform for creating MapReduce programs using a language called ‘Pig Latin’ which compiles into physical plans that are executed on Hadoop. Pig aims to dramatically reduce the time required for the development of data analysis jobs when compared to writing Hadoop MapReduce programs directly. The creators of Pig describe the language as hitting ‘a sweet spot between the declarative style of SQL and the low-level, procedural style of MapReduce’.
You will create a small Pig Latin program in the tutorial section.
Apache Hive, which runs on Hadoop, offers data warehouse services.

4.6            Chapter Summary

 In this chapter, we have investigated some of the key technology underlying computing clouds. In particular, we have focused on web application technology, virtualisation and the MapReduce model.

4.7             End of Chapter Exercises

 Exercise 1: Create your own VMWare virtual machine
In this exercise, you will create your own virtual machine and install the Ubuntu version of the Linux operating system. In later chapters, we will use this same virtual machine to install more software and complete further exercises. Note that the virtual machine is quite large, and ideally you will have at least 20 GB of free disk space and at least 2 GB of memory. The machine may run with less memory or disk space, but you may find that performance is rather slow.

4.8          A Note on the Technical Exercises (Chaps. 4, 5, 6, 7)

As you will already be aware, the state of the cloud is highly dynamic and rapidly evolving. In these exercises, you are going to download a number of tools and libraries, and we are going to recommend that you choose the latest stable version so that you will always be working at the ‘cutting edge’. The downside of this is that we cannot guarantee that following the exercise notes will always work exactly as described, as you may well be working with a later version than the one we used for testing. You may need to adapt the notes, refer to information on the source web pages or use web searches to find solutions to any problems or inconsistencies you encounter. The advantage is that you will be developing critical skills as you go through the process, skills which would be an essential part of your everyday work should you be involved in building or working with cloud systems. Anyone who has developed software systems will know that it can be very frustrating and at times you may feel completely stuck. However, you are often closer to a solution than you might think, and persisting through difficulties to solve a problem can be very rewarding and a huge learning opportunity.

4.9            Create Your Ubuntu VM

(1)    Install VMWare Player on your machine: http://www.vmware.com/products/player/. You may need to register with VMWare, but VMWare Player is free.
(a) Click on download and follow the instructions until VM Player is installed.
(2)    Go to  www.ubuntu.com
(a)   Go to the Ubuntu download page.
(b)   Download the latest version of Ubuntu (e.g. ubuntu-11.10-desktop-i386.iso).
(c)   Save the .iso file to a folder on your machine.
(3)    Open VMplayer. Be aware that this process can be quite lengthy.
(a)   Select ‘Create New Virtual Machine’ (Fig.  4.6 ).
(b)   Select ‘Installer disc image file (iso)’ and browse to the folder where you saved your Ubuntu .iso file. Select next.
(c)    Enter a name for your VM and a username and password for the main user on your Linux VM.
(d)   Select a location where your VM will be stored: You will need plenty of disk space (~5 GB).
(e)    You now need to select additional disk space for the VM to use. If you have plenty of space, you can go with the recommendation (e.g. 20 GB). It is probably best to select at least 8 GB. Select ‘Split virtual disk into 2 GB files’. Select next.
Fig. 4.6 VMware Player
(f)    Check the options you have selected and select Finish. If you want to run the VM as soon as it is ready leave the ‘Power on this virtual machine after creation’ selected.
(4)    Your machine will now be created. This will take some time depending on the configuration of your underlying hardware.
(5)    You may be asked to select your country so that the keyboard layout can be set appropriately. You can test your keyboard to check it works as expected.

4.10          Getting Started

 Your machine is now ready to use.
1.   Once your Ubuntu VM has loaded up inside VMWare, log on using the username and password you gave when creating the VM. VMWare tools should be automatically installed, and you may need to restart your machine a number of times. Once complete, you should now be able to maximise the size of your VM screen.
2.   In the Virtual Machine menu you should see ‘power options’ where you can ‘power off’, ‘reset’ or ‘suspend’ your virtual machine. ‘Suspend’ is often a useful option as it will maintain the particular state that your virtual machine is in and restore it to that state next time you ‘resume’ the virtual machine.
3.   If you wish to copy your virtual machine, perhaps to a USB memory stick, you should first suspend or power off the virtual machine. You then need to locate the folder where you created the virtual machine and copy the entire folder to the required destination. You can then start the virtual machine on another computer, provided that VMware Player is installed, by locating the file ending with .vmx and double clicking on that file. Note that if something goes wrong later on, you can switch to using this copy. You can of course take a copy of the virtual machine at any point during your development.

4.11          Learn How to Use Ubuntu

In this tutorial and others that follow, we will be using the Ubuntu virtual machine. If you are not familiar with Ubuntu, it will be well worth getting familiar with the basics. There are many guides available, and a good starting point is the Ubuntu home page (http://www.ubuntu.com/ubuntu) where you can take a ‘tour’ of the important features. In your virtual machine, there is also help available: just click on the ‘dash home’ icon on the top left, type help and select the help icon (Fig. 4.7).
So take some time to explore the environment and get used to Ubuntu. Most of the features are quite intuitive, and you can learn by just ‘trying’. In terms of the tutorials in this book, we will only be using a small subset of the available commands and applications. In particular you will need to:
      Be able to browse, copy, move, delete and edit files.
      Use the archive manager to extract zipped files.
      Use the terminal to send basic Unix commands including:
      ls to list the contents of a folder (traditionally referred to as ‘Directory’ in Unix)
      cd to change directory
      pwd to identify your location in the directory structure
Fig. 4.7 Finding assistance from the Ubuntu home screen

4.12         Install Java

We will be using Java for a number of the tutorial exercises. Java is a freely available programming language. We are going to install the Java SDK from Oracle, which is suitable for all the tutorial material in this book. These notes are partly based on information from the Ubuntu community at https://help.ubuntu.com/community/Java. If you encounter problems, you may want to check the site for any updates.
1.   Download the latest Oracle JDK 7 for Linux from  http://www.oracle.com/technetwork/java/javase/downloads/java-se-jdk-7-download-432154.html (e.g. jdk7-linux-i586.tar.gz) in your Ubuntu VM.
2.   You will have to agree to Oracle Binary Code License to download this.
3.   Locate the file by clicking on the home folder icon and then selecting the ‘Downloads’ folder (Fig. 4.8).
Fig. 4.8 Downloads folder illustrating the Oracle Java Development Kit archive
4.   Right click on the file and select ‘open with archive manager’.
5.   In the archive manager select ‘extract’ and save. Select ‘show the files’ once the extraction is complete.
6.   A new folder called ‘jdk1.7.0’ or similar should have been created. Rename this folder to ‘java-7-oracle’ by right clicking and selecting ‘rename’.
7.   In your Ubuntu virtual machine open a terminal session. To get a terminal session, just click the Ubuntu logo in the side bar, type ‘terminal’ and then click on the terminal icon (Fig. 4.9). Move to the directory where you extracted the files.
8.   Now type the following commands one at a time. Enter your password if prompted and follow on-screen instructions where required. Make sure you type these exactly, including the case.
 user@ubuntu:~$ sudo mkdir -p /usr/lib/jvm/
 user@ubuntu:~$ sudo mv java-7-oracle/ /usr/lib/jvm/
 user@ubuntu:~$ sudo add-apt-repository ppa:nilarimogard/webupd8
 user@ubuntu:~$ sudo apt-get update
 user@ubuntu:~$ sudo apt-get install update-java
 user@ubuntu:~$ sudo update-java
Fig. 4.9 Invoking a terminal session in Ubuntu
9.   If prompted, select the Java version you just updated (/usr/lib/jvm/jdk1.7.0).
10.  Answer Y if prompted to continue with the Java install.
11.  Type the following commands to set the JAVA_HOME environment variable:
 user@ubuntu:~$ JAVA_HOME=/usr/lib/jvm/java-7-oracle
 user@ubuntu:~$ export JAVA_HOME
 user@ubuntu:~$ PATH=$PATH:$JAVA_HOME/bin
 user@ubuntu:~$ export PATH

4.13          MapReduce with Pig

In this tutorial we are going to implement the classic word count program using Pig Latin.
1.       Open your home folder and create a new folder called ‘Pig’.
2.       In your virtual machine, use a browser such as Firefox to go to http://www.apache.org/dyn/closer.cgi/pig, select a suitable mirror and then download the latest stable release of Pig, e.g. pig-0.9.2.tar.gz.
Fig. 4.10 Ubuntu download
3.       Open the home folder and then the download folder to see the downloaded pig file (Fig. 4.10).
4.       Right click on the file (e.g. pig-0.9.2.tar.gz) and select ‘open with archive manager’.
5.       In the archive manager select ‘extract’ and accept the current folder as the destination folder. Select ‘show the files’ once the extraction is complete.
6.       The archive manager will extract the file to a folder called pig-0.9.2 or similar. The only file we need for this tutorial is the .jar file. Locate this file (e.g. pig-0.9.2.jar), right click and select ‘copy’, then paste the file into the pig folder you created earlier.
7.       We need a text file which we can use to count the words. You can use any text file you like, but in this example we will copy over the NOTICE.txt file from the extracted folder to the pig folder for testing purposes.
8.       We are now going to create our first Pig Latin program. To do this, we first need a text editor, so select the ‘dash home’ icon, type ‘text’ and select the text editor.
9.       Paste in the following code (originally from http://en.wikipedia.org/wiki/Pig_(programming_language)) and save the file as wordCount.pig
A = load 'NOTICE.txt';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = filter B by word matches '\\w+';
D = group C by word;
E = foreach D generate COUNT(C) as count, group as word;
F = order E by count desc;
store F into 'wordcount.txt';
10.   Open a terminal and navigate to the pig folder.
11.   Type in the following command to run your word count program using Pig:
  java -Xmx512M -cp pig-0.9.2.jar org.apache.pig.Main -x local wordCount.pig
12.   Once the program has finished running you should see a new folder called wordcount.txt. Navigate into the folder and open the contents of the file. Compare your output with the text file you used, and hopefully you will see that the program has been successful and the frequency of the words has been counted and placed in order.
13.   Try and run the program with another text file of your choice.
14.   Alter the program so that it orders the output by word rather than frequency.
15.   Alter the program so that only words starting with a letter above ‘t’ are output.

4.14         Discussion

You might be thinking that the program seemed to take quite some time to complete this simple task on a small text file. However, notice that we are running the Pig Latin program using the ‘local’ mode. This is useful for testing our programs, but if we were really going to use Pig for a serious problem such as the results of a web crawl, we would switch to ‘hadoop’ mode. We could then run the same Pig Latin program, and the task would be automatically parallelised and distributed on a cluster of possibly thousands of nodes. This is actually much simpler than you might expect, especially when vendors such as Amazon offer a set of nodes preconfigured for MapReduce tasks using Pig.
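As a hedged illustration, switching modes is typically just a change to the -x flag; this assumes a working Hadoop installation whose configuration Pig can reach, and uses the same jar name as the earlier step:

  java -Xmx512M -cp pig-0.9.2.jar org.apache.pig.Main -x mapreduce wordCount.pig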

4.15          MapReduce with Cloudera

Should you wish to experiment with MapReduce, a nice way to start is to visit the Cloudera site (http://www.cloudera.com/) where you can download a virtual machine with Hadoop and Java pre-installed. Cloudera also offer training and support with a range of MapReduce-related tasks.

References

Golden, B.: Virtualization for Dummies. Wiley, Chichester (2008)
Nixon, R.: Learning PHP, MySQL, and JavaScript. O’Reilly Media, Inc, Sebastopol (2009). This is a useful introduction to developing web applications with the popular PHP language
White, T.: Hadoop: The Definitive Guide, 2nd edn. O’Reilly Media, Sebastopol (2009)
