What the reader will learn:
• The importance of search in web and cloud technology
• The challenges and potential of unstructured text data
• How collective intelligence is being used to enhance a variety of new applications
• An introduction to text visualisation
• An introduction to web crawling
7.1 Introduction
We have seen how web and cloud technology allow us to easily store and process vast amounts of information, and as we saw in Chap. 6, there are many different data storage models used in the cloud. This chapter looks at how we can start to unlock the hidden potential of that data, to find the ‘golden nuggets’ of truly useful information contained in the overwhelming mass of irrelevant or useless junk and to discover new knowledge via the intelligent analysis of the data. Many of the intelligent tools and techniques discussed here originated well before cloud computing. However, the nature of cloud data, its scalable access to huge resources and the sheer size of the available data means that the advantages of these tools are much more obvious. We have now reached a place where many common web-based tasks would not be possible without them.
Much of this new information is coming directly from users. The art of tapping into both the data created by users and the interaction of users with web and cloud-based applications brings us to the field of collective intelligence and ‘crowd sourcing’. Collective intelligence has roots predating the web in diverse fields including biology and sociology. The application of collective intelligence techniques to web-based data has rightly been receiving a lot of attention in recent years, and it has become clear that, particularly with the rapid expansion of cloud systems, this is a fruitful area for developing new ways of intelligently accessing, synthesising and analysing our data. It has been suggested that the eventual result will be Web 3.0.
We will start this chapter with a brief overview of the kind of data we are dealing with and the techniques that are already being employed to extract intelligence from user data, interaction and collaborations. We will then look in some detail at the process of searching, including an appraisal of the challenges of dealing with textual data and an overview of the procedures underlying a search engine. We then move to the field of collective intelligence and what it can offer web and cloud applications. We will finish by looking at the power of visualisation. At the end of this chapter, the exercises will use open source libraries to perform some of the stages of extracting and analysing online text.
7.2 Web 2.0
Ten years ago the majority of websites had the feel of a lecture: the static information flowed one way from the site to the user. Most sites are now dynamic and interactive, and the result is better described as a conversation. We should perhaps remember that HTTP itself is designed to work in a conversational manner. Web 2.0 is all about allowing and encouraging users to interact with websites. As the interaction increases, so does the amount of data. In this chapter, we will look at some of the tools we can employ to get more out of user-generated content. Often the process of using this content effectively to improve our site leads to increased interest and further activity, thus creating a virtuous circle.
7.3 Relational Databases
Relational database technology is mature and well known by most software developers and benefits from the compact and powerful SQL language. Data stored in relational databases has a number of distinct advantages when it comes to information retrieval, including:
• Data is stored in labelled fields.
• The fields (or columns) have predetermined data type and size attributes.
• We can specify constraints on the fielded data, for example, ‘all entered values must be unique’, ‘null values are not accepted’ or ‘entered data must fall within a particular set or range of values’.
• The well-understood normalisation technique can be applied, which has been shown to reduce redundancy and provide an easy to understand logical to physical storage mapping.
7.4 Text Data
Despite the benefits listed above, it has been estimated that approximately 80% of an organisation’s data is in an unstructured format. The data is of course multimedia, but text data has attracted the greatest interest as the primary source for web mining, although in recent years attention has been shifting towards other formats, particularly images and video, where significant advances have been occurring.
Table 7.1 Example synonyms
Physician | Doctor
Maize | Corn

Table 7.2 Example homonyms
Java (country) | Java (programming language)
Board (board of directors) | Board (wooden board)
Unstructured text data has the useful property of being both human and machine readable, even if it makes no ‘sense’ to a machine. The rules for text data are very different to those of relational databases. Text data may or may not be validated and is often duplicated many times. It may have some structure, such as in an academic article, or little or no structure, for example, in the case of blogs and email. In some cases, spellings and grammar are checked very carefully, but in others many mistakes, misspellings, slang words, abbreviations and acronyms are common. There is a notorious many-to-many relationship between words and meaning. In text data we frequently find synonyms, that is, different words with the same meaning, and homonyms, words with the same spelling but with distinct meanings. Examples are shown in Tables 7.1 and 7.2.
Text also has many examples of multi-word units such as ‘information retrieval’ where more than one word refers to a single concept; new words are constantly appearing; there are many examples of multinational crossover; and the meaning of words can vary over time. For these reasons, extracting useful information from textual data is in many ways a harder problem than with data stored in relational databases. Typical examples of text likely to be targeted for intelligent analysis are articles, white papers, product information, reviews, blogs, wikis and message boards. One effect of Web 2.0 has been to greatly increase the amount of online text data. The cloud is providing easy access to scalable computing power with which to process this data in innovative and fruitful ways. The web itself can be thought of as one huge data store. We will look at the tools and techniques which have been developed to maximise the potential of this fantastic resource at humanity’s fingertips.
7.5 Natural Language Processing
Natural language processing is a broad set of techniques used to process written and spoken human languages. Natural language processing tasks often involve categorising the type of word occurring in text. A particularly useful type of data refers to things like countries, organisations and individual people. The process of automatically identifying these is known as entity extraction. A second common task is to identify the ‘part of speech’ of particular words so that, for example, nouns, adjectives and adverbs can be automatically identified. GATE (http://gate.ac.uk/) is a freely available open source tool written in Java which performs the above tasks together with many more related to text processing. Natural language processing has made great advances in areas such as automatic translation, speech recognition and grammar checking. One of the long-term goals of natural language processing is to enable machines to actually understand human text, but there are still huge challenges to overcome, and we should be circumspect in the case of claims that this goal has been achieved by an existing system or that a solution is very close.
7.6 Searching
Searching was one of the first applications to which we might attach the term ‘cloud’, and searching remains, for most users, the most important tool on the web. Search engines such as Google, Bing and Yahoo are well-known examples of mature cloud applications and also give a good introduction to many important concepts relating to cloud data along with text engineering and collective intelligence.
It is therefore worth a brief look ‘under the hood’ to get an introduction to the inner workings of search engines. As well as providing an insight into one of the key tools for web and cloud data, our investigation will bring to light a number of important themes and concepts that are very relevant to web and cloud intelligence and to making the most of the noisy, unstructured data commonly found on the web. Of course, the commercial search engine market is a big business, and vendors keep the exact workings of their systems well hidden. Nonetheless, there are a number of freely available open source libraries, and at the end of this chapter, we build our own search engine. We will be using the widely used and much praised Apache Lucene index together with the related Apache projects Nutch, Tika and Solr. It is worth noting that Nutch is specifically designed to run on Apache Hadoop’s implementation of MapReduce, which we investigated in Chap. 4.
7.6.1 Search Engine Overview
Search engines have three major components, as shown in Fig. 7.1:
1. Crawler
2. Indexer
3. Web interface

Fig. 7.1 Search engine components
7.6.2 The Crawler
The web is unlike any other database previously encountered and has many unique features requiring new tools as well as the adaptation of existing tools. Whereas previous environments such as relational databases suggested that searching should cover the entire database in a precise and predictable manner, this is simply not possible when we scale to web magnitudes. Web crawlers (often called spiders or bots) browse the web in a methodical manner, often with the aim of covering a significant fraction of the publicly available part of the entire web.
To start the crawl, a good set of seed pages is needed. A common way to do this is to obtain the seeds from a project such as the Open Directory Project (http://www.dmoz.org/). Once the list of seed pages has been obtained, the crawler essentially works by performing a series of HTTP GET commands (see Chap. 4). Hyperlinks found in any page are stored and added to a list of sites to be fetched. Of course, there are many complications such as scheduling of the crawl, parallelisation, prioritisation, politeness (avoiding overwhelming particular sites by respecting their policy regarding crawling) and the handling of duplicates and dead links. There are a number of open source libraries which can be used for crawling, ranging from the fairly simple and easy to use HTTrack (http://www.httrack.com/) to Apache Nutch (http://nutch.apache.org/), which can be used to build a whole web search engine. We look at these in our end of chapter tutorials. Crawling the whole web is a major undertaking requiring massive resources in terms of storage, computation and management. In most cases organisations will be performing a crawl of their own data or perhaps a focused or intelligent web crawl on a particular topic area.
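The core fetch-and-harvest loop can be sketched in a few lines of Java. The sketch below is illustrative only: it assumes Java 11 or later, uses a hypothetical seed URL, extracts links with a crude regular expression and stops after a fixed fetch budget, whereas a real crawler such as Nutch uses a proper HTML parser, respects robots.txt and handles scheduling, duplicates and failures.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.*;
    import java.util.regex.*;

    public class SimpleCrawler {
        // Crude link extraction; real crawlers use an HTML parser instead
        private static final Pattern LINK = Pattern.compile("href=\"(http[^\"]+)\"");

        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            Queue<String> frontier = new ArrayDeque<>(List.of("http://example.com/")); // seed page (placeholder)
            Set<String> seen = new HashSet<>(frontier);
            int budget = 20; // stop after a fixed number of fetches

            while (!frontier.isEmpty() && budget-- > 0) {
                String url = frontier.poll();
                HttpResponse<String> resp = client.send(
                        HttpRequest.newBuilder(URI.create(url)).GET().build(),
                        HttpResponse.BodyHandlers.ofString());
                // The page body would be stored or handed to the indexer here;
                // then hyperlinks are harvested and unseen ones queued.
                Matcher m = LINK.matcher(resp.body());
                while (m.find()) {
                    String link = m.group(1);
                    if (seen.add(link)) frontier.add(link);
                }
                Thread.sleep(1000); // crude politeness delay between requests
            }
        }
    }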
7.6.3 The Indexer
Instead of trying to answer the question ‘what words are contained in a particular document?’, which you could answer by simply reading the document, an indexer aims to provide a quick answer to the question ‘which documents contain this particular word?’, and for this reason the index is referred to as an ‘inverted index’.
Fig. 7.2 Text processing steps
Once the crawler returns a page, we need to process it in some way. The page could be in a wide variety of formats such as HTML, XML, Microsoft Word, Adobe PDF and plain text. Generally, the first task of the indexer will be to extract the text data from any of the different formats likely to be encountered. This is a complex task, but luckily open source tools such as Apache Tika (http://tika.apache.org/) are freely available. Once the text is extracted, we can start to build the index by processing the stream of text. A number of processes are normally carried out before we build the index. Note that these steps will reduce the number of words stored, which is likely to be helpful when extracting intelligence from text data (Fig. 7.2).
7.6.3.1 Tokenisation
The task of the
tokeniser is to break the stream of characters into words. In English or most
European languages, this is fairly straightforward as the space character can
be used to separate words. Often punctuation is removed during this stage. In
languages such as Chinese where there is no direct equivalent of the space
character, this is a much more challenging task.
7.6.3.2 Set to Lower Case
W hen computer programs read text data, the upper- and
lower-case forms of individual characters are given separate codes, and
therefore, two words such as ‘cloud’ and ‘Cloud’ would be identi fi ed as
separate words. It is generally useful to set all the characters to a standard
form so that the same word is counted whether, for example, it is at the start
or middle of a sentence although some semantic information may be lost.
7.6.3.3 Stop Word Removal
It is often useful to
remove words which are very frequent in text but which carry low semantic
value. For example, the Apache Lucene ( http://lucene.apache.org ) indexing system
contains the following default list of stop words:
‘a’, ‘an’, ‘and’, ‘are’, ‘as’, ‘at’, ‘be’, ‘but’, ‘by’,
‘for’, ‘if’, ‘in’, ‘into’, ‘is’, ‘it’,
‘no’, ‘not’, ‘of’, ‘on’, ‘or’, ‘such’, ‘that’, ‘the’, ‘their’, ‘then’, ‘there’,
‘these’, ‘they’, ‘this’, ‘to’, ‘was’,
‘will’, ‘with’
Including even this small set of stop words can greatly reduce the total number of words which are stored in the index.
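As an illustration, here is a minimal sketch in plain Java of the three steps described so far (tokenisation, setting to lower case and stop word removal), using the Lucene default stop word list quoted above. The whitespace-and-punctuation split is a simplification that suits languages such as English.

    import java.util.*;

    public class TextPrep {
        // The Lucene default stop word list quoted above
        private static final Set<String> STOP_WORDS = Set.of(
                "a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
                "if", "in", "into", "is", "it", "no", "not", "of", "on", "or",
                "such", "that", "the", "their", "then", "there", "these",
                "they", "this", "to", "was", "will", "with");

        public static List<String> prepare(String text) {
            List<String> tokens = new ArrayList<>();
            // Tokenise on anything that is not a letter or digit,
            // set to lower case, then drop stop words
            for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
                if (!t.isEmpty() && !STOP_WORDS.contains(t)) tokens.add(t);
            }
            return tokens;
        }

        public static void main(String[] args) {
            System.out.println(prepare("The cloud is a Cloud, and it is not a board."));
            // -> [cloud, cloud, board]
        }
    }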
7.6.3.4 Stemming
Stemming allows us to represent various word forms using a single word. For example, if the text contains the words ‘process’, ‘processing’ and ‘processes’, we would consider these as the same word for indexing purposes. Again, this can greatly reduce the total number of words stored. Various methods for stemming have been proposed; the most widely used is the Porter stemming algorithm (Lucene contains classes to perform Porter stemming). However, there are some disadvantages to stemming, and some search engines do not use stemming as part of the indexing. If stemming is used when creating an index, the same process must be applied to the words that the user types into the search interface.
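For illustration, the sketch below uses Lucene’s EnglishAnalyzer, which chains tokenisation, lower-casing, stop word removal and Porter stemming in one pass. It assumes a reasonably recent Lucene release on the classpath; exact class locations and constructors vary between Lucene versions.

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class StemDemo {
        public static void main(String[] args) throws Exception {
            // EnglishAnalyzer tokenises, lower-cases, removes stop words
            // and applies Porter stemming in a single chain
            try (Analyzer analyzer = new EnglishAnalyzer();
                 TokenStream ts = analyzer.tokenStream("body",
                         new StringReader("process processing processes"))) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    System.out.println(term); // expected: 'process' three times
                }
                ts.end();
            }
        }
    }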
7.6.4 Indexing
Once the words have been extracted, tokenised, filtered and possibly stemmed, they can be added to an index. The index will be used to find documents relevant to users’ search queries as entered into the search interface. An index will typically be able to quickly return a list of documents which contain a particular word, together with other information such as the frequency or importance of that word in a particular document. Many search engines allow the user to simply enter one or more keywords. Alternatively, users may build more complex queries, for example, requiring that two words must appear before a document is returned or that a particular word does not occur. Lucene has a wide range of query types available.
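A minimal sketch of the idea behind an inverted index follows: a map from each term to the documents (and frequencies) in which it occurs, so that the question ‘which documents contain this word?’ becomes a single lookup. Real indexes such as Lucene’s use far more compact on-disk structures, but the logical shape is the same.

    import java.util.*;

    public class InvertedIndex {
        // term -> (document id -> term frequency in that document)
        private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();

        public void add(int docId, List<String> tokens) {
            for (String term : tokens) {
                postings.computeIfAbsent(term, t -> new HashMap<>())
                        .merge(docId, 1, Integer::sum);
            }
        }

        // Answer 'which documents contain this word?' directly from the index
        public Map<Integer, Integer> lookup(String term) {
            return postings.getOrDefault(term, Map.of());
        }

        public static void main(String[] args) {
            InvertedIndex idx = new InvertedIndex();
            idx.add(1, List.of("cloud", "data", "cloud"));
            idx.add(2, List.of("data", "search"));
            System.out.println(idx.lookup("cloud")); // {1=2}
            System.out.println(idx.lookup("data"));  // {1=1, 2=1}
        }
    }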
7.6.5 Ranking
A search engine which
indexes a collection of hundreds of documents belonging to an organisation
might produce acceptable results for most searches, especially if users gain
expertise in more advanced query types. However, even in this case some queries
might return too many documents to be useful. In the case of the web, the
numbers become overwhelming, and the situation is only made worse by the fact
that the web has no central authority to accept, categorise and manage
documents; anyone can publish on the web and data is not always trustworthy.
One of the most straightforward and widely used solutions is to place the
results of a query in order, such that the pages or documents most likely to
meet the user’s requirements are at the top of the list. There are a number of
ways to do this.
7.7 Vector Space Model
The vector space model (VSM), originally developed by Salton in the 1970s, is a powerful way of placing documents in order of relevance to a particular query. Each word or term remaining after stop word removal and stemming is considered to be a ‘dimension’, and each dimension is given a weight. The model takes no account of the order of words in a document and is sometimes called a ‘bag of words’ model. The weight value (w) is related to the frequency of each term in a document (the term frequency) and is stored in the form of a vector. A vector is a useful way of recording both magnitude and direction. The weight of each term should represent the relative importance of the term in a document. Both documents (d) and queries (q) can be stored this way:
d_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})
q = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})
The two vectors can then be compared in a
multidimensional space allowing for a ranked list of documents to be returned
based on their proximity to a particular query.
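The text does not spell out how this comparison is made, but a common choice (an assumption here, not stated above) is the cosine of the angle between the two vectors, which is large when the document and the query share heavily weighted terms:

\mathrm{sim}(d_j, q) = \frac{\sum_{i=1}^{t} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^{2}}\, \sqrt{\sum_{i=1}^{t} w_{i,q}^{2}}}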
We
could simply store a binary value indicating the presence or absence of a word
in a document or perhaps the word frequency as the weight value in the vector.
However, a popular and generally more effective way of computing the values to
store against each term is known as tf-idf weighting (term frequency–inverse
document frequency) which is based on two empirical observations regarding
collections of text:
1. The
more times a word occurs in a document, the more relevant it is to the topic of
the document.
2. The
more times the word occurs throughout the documents in the collection, the more
poorly it discriminates between documents.
It is useful to combine the term frequency (tf) with the inverse document frequency (idf), derived from the number of documents in the collection in which the term occurs at least once, to create a weight.
tf-idf
weighting assigns the weight to a word in a document in proportion to the
number of occurrences of the word in the document and in inverse proportion to
the number of documents in the collection for which the word occurs at least
once, that is,
w_{i,j} = tf_{i,j} \times \log(N / df_i)
The weight of term i in document j is the frequency of term i in document j multiplied by the log of the total number of documents in the collection (N) divided by the number of documents containing term i (df_i). The log is used as a way of ‘squashing’ or reducing the differences; it could be omitted but has been found to improve effectiveness. Perhaps it is easier to follow with an example.
Assume we have 500 documents in our collection and the word ‘cloud’ appears in 60 of these. Consider a particular document wherein the word ‘cloud’ appears 11 times. To calculate the tf-idf for that document:
The term frequency (tf) for ‘cloud’ is 11.
The inverse document frequency (idf) is log(500/60) = 0.92.
The tf-idf score, that is, the product of these quantities, is 11 × 0.92 = 10.12.
We could then repeat this calculation across all the documents and insert the tf-idf values into the term vectors.
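The same calculation can be expressed as a small Java method; this is a sketch assuming base-10 logarithms, as in the worked example above.

    public class TfIdf {
        /**
         * tf-idf weight of a term in one document.
         *
         * @param tf term frequency in the document
         * @param n  total number of documents in the collection
         * @param df number of documents containing the term
         */
        public static double weight(int tf, int n, int df) {
            return tf * Math.log10((double) n / df);
        }

        public static void main(String[] args) {
            // The worked example above: 'cloud' occurs 11 times in the document
            // and appears in 60 of the 500 documents in the collection
            System.out.printf("%.2f%n", weight(11, 500, 60));
            // prints 10.13 (the text rounds the idf to 0.92 first, giving 10.12)
        }
    }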
There are many variations of the above formula which take account of other factors such as the length of each document. Many indexing systems, such as Apache Lucene, will store the term vector using some form of tf-idf weighting. Documents (or web pages) can then be quickly returned in order depending on the comparison of the term vectors of the query and documents from the index. Inverted indexes, such as that created by Apache Lucene, are essentially a compact data structure in which the term vector representation of documents is stored.
7.8 Classification
We find classifications in all areas of human endeavour. Classification is used as a means of organising and structuring and generally making data of all kinds accessible and manageable. Unsurprisingly, classification has been the subject of intensive research in computer science for decades and has emerged as an essential component of intelligent systems.
In relation to unstructured text data, the ability to provide automatic classification has numerous advantages including:
• Labelling search results as they appear
• Restricting searches to a particular category (reducing errors and ambiguities)
• An aid to site navigation, allowing users to quickly find relevant sections
• Identifying similar documents/products/users as part of a recommendation engine (see below)
• Improving communication and planning by providing a common language (referred to as an ‘ontology’)
Classification is used extensively in enterprise search such as that provided by Autonomy (http://www.autonomy.com/), and many tools such as spam filters are built on the principles of classification.
The term vector representation of a document makes it relatively easy to compare two or more documents. Classifiers need to be supplied with a set of example training documents where the category of each document is identified. Once the classifier has built a model based on the training documents, the model can then be used to automatically classify new documents. Generation of the model is usually performed by a machine learning algorithm such as naive Bayes, neural networks, support vector machines (SVM) or an evolutionary algorithm. Each of the different methods has its own strengths and weaknesses and may be more applicable to particular domains. Commonly, different classification methods are combined.
7.9 Measuring Retrieval Performance
The aim of a search query is to retrieve all the documents which are relevant to the query and return no irrelevant documents. However, for a real-world data set above a minimum size and complexity, the situation is more likely to be similar to that shown in Fig. 7.3, where many relevant documents are missed and many of the documents retrieved are not relevant to the query.

Fig. 7.3 Search query results
When testing a search or classification engine, it is important to be able to give some measure of effectiveness. Recall measures how well the search system finds relevant documents, and precision measures how well the system filters out the irrelevant documents. We can only obtain values for recall and precision when a set of documents relevant to a particular query is already known.
The F1 measure is a commonly used way of
combining the two complementary measures to give an overall effectiveness
number and has the advantage of giving equal weight to precision and recall.
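In standard notation (spelling out what the text describes), precision is the fraction of retrieved documents that are relevant, recall is the fraction of relevant documents that are retrieved, and F1 is their harmonic mean:

\mathrm{precision} = \frac{|\mathrm{relevant} \cap \mathrm{retrieved}|}{|\mathrm{retrieved}|}, \qquad \mathrm{recall} = \frac{|\mathrm{relevant} \cap \mathrm{retrieved}|}{|\mathrm{relevant}|}

F_1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}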
The actual accuracy achieved will depend on a number of factors such as the learning algorithm used and the size and quality of the training documents. We should note that the accuracy is ultimately down to a human judgement. Even using a human classifier will not achieve 100% accuracy, as in most domains two humans are likely to disagree about the category labels for some documents in a collection of reasonable size and complexity. An impressive result of machine learning technology lies in the fact that, with a good set of example documents, it has been reported that automatic classifiers can achieve accuracy close to that of human experts as measured using F1 or similar.
7.10 Clustering
Clustering is another
useful task, especially when performed automatically. In this case no labelled
training documents are given, and the clustering algorithm is required to
discover ‘natural categories’ and group the documents accordingly. Some
algorithms such as the widely used k-means require that the number of
categories be supplied in advance, but otherwise no prior knowledge of the
collection is assumed. Again, the term vectors of documents are compared, and
groups of documents created based on similarity.
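As an illustration of the idea, here is a compact and deliberately naive k-means sketch over term vectors. It assumes dense vectors, seeds the centroids with the first k documents and runs a fixed number of iterations; production implementations use better seeding and proper convergence tests.

    import java.util.Arrays;

    public class KMeans {
        // Assign each vector to the nearest of k centroids, then move each
        // centroid to the mean of its members, repeating a fixed number of times
        public static int[] cluster(double[][] vectors, int k, int iterations) {
            int dims = vectors[0].length;
            double[][] centroids = new double[k][];
            for (int c = 0; c < k; c++) centroids[c] = vectors[c].clone(); // naive seeding
            int[] assignment = new int[vectors.length];

            for (int it = 0; it < iterations; it++) {
                // Assignment step: nearest centroid by squared Euclidean distance
                for (int v = 0; v < vectors.length; v++) {
                    int best = 0;
                    double bestDist = Double.MAX_VALUE;
                    for (int c = 0; c < k; c++) {
                        double dist = 0;
                        for (int d = 0; d < dims; d++) {
                            double diff = vectors[v][d] - centroids[c][d];
                            dist += diff * diff;
                        }
                        if (dist < bestDist) { bestDist = dist; best = c; }
                    }
                    assignment[v] = best;
                }
                // Update step: recompute each centroid as the mean of its members
                double[][] sums = new double[k][dims];
                int[] counts = new int[k];
                for (int v = 0; v < vectors.length; v++) {
                    counts[assignment[v]]++;
                    for (int d = 0; d < dims; d++) sums[assignment[v]][d] += vectors[v][d];
                }
                for (int c = 0; c < k; c++) {
                    if (counts[c] == 0) continue; // leave an empty cluster's centroid alone
                    for (int d = 0; d < dims; d++) centroids[c][d] = sums[c][d] / counts[c];
                }
            }
            return assignment;
        }

        public static void main(String[] args) {
            double[][] docs = { {5, 0}, {4, 1}, {0, 5}, {1, 4} }; // toy two-term vectors
            System.out.println(Arrays.toString(cluster(docs, 2, 10))); // [0, 0, 1, 1]
        }
    }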
By organising a collection into clusters, a number of tasks such as browsing can become more efficient. As with classification, clustering the descriptions of products/blogs/users/documents or other Web 2.0 data can help to improve the user experience. Automatic identification of similar items is an important component in a number of other tools such as recommendation engines (see below).
Classification is referred to as a ‘supervised learning’ task and is generally more accurate than clustering, which is ‘unsupervised’ and does not benefit from a set of labelled training documents. A good example of clustering is http://search.yippy.com/ (formerly clusty.com). Go to the site and search for ‘java’. Another example is provided by Google News (http://news.google.com/), which automatically collects thousands of news articles and then organises them by subject, subtopic and region and can be made to display the results in a personalised manner. Commonly, clustering is used in conjunction with other tools (e.g. classification rules) to produce the final result.
7.11 Web Structure Mining
A fundamental component of the web is of course the hyperlink. The way pages link together can be used for more than just navigation in a browser, and a good deal of research has gone into investigating the topology of hyperlinks. We can consider web pages to be nodes on a directed graph (put simply, a graph where the lines between nodes indicate a direction), and where there is a link from a particular page (p) to another page (q), we can consider this to be a directed edge. So for any page, we can assign it a value based on the number of links pointing to and from that page. The out degree of a node p is the number of nodes to which it links, and the in degree of p is the number of nodes that have links to p. Thus, in Fig. 7.4 page 1 has an out degree of 1, and page 2 has an in degree of 2. The link from page 2 to page 3 is considered more important than the other links shown because of the higher in degree of page 2.

Fig. 7.4 Link graph
In
web structure mining, we look at the topology of the web and the links between
pages rather than at just the content of web pages. Hyperlinks from one page to
another can be considered as endorsements or recommendations.
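In and out degrees are trivial to compute once the links are listed as directed edges. The edge set below is an assumption consistent with the description of Fig. 7.4 (page 1 linking to page 2, a second page also linking to page 2, and page 2 linking to page 3); the figure itself may differ.

    import java.util.*;

    public class LinkGraph {
        public static void main(String[] args) {
            // Assumed edges matching the Fig. 7.4 description: 1->2, 4->2, 2->3
            int[][] edges = { {1, 2}, {4, 2}, {2, 3} };

            Map<Integer, Integer> inDegree = new HashMap<>();
            Map<Integer, Integer> outDegree = new HashMap<>();
            for (int[] e : edges) {
                outDegree.merge(e[0], 1, Integer::sum); // e[0] links out
                inDegree.merge(e[1], 1, Integer::sum);  // e[1] is linked to
            }
            System.out.println("out degree of page 1: " + outDegree.getOrDefault(1, 0)); // 1
            System.out.println("in degree of page 2:  " + inDegree.getOrDefault(2, 0));  // 2
        }
    }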
7.11.1 HITS
In 1998 Kleinberg
created the HITS algorithm which was a method of using the link structure to
assign an importance value to a web page which could be factored into the
ranking of query results. The algorithm had some disadvantages including the
fact that the computation was performed after the query was submitted making
response times rather slow.
7.11.2 PageRank
The founders of Google (Larry Page and Sergey Brin) created the PageRank algorithm, which gives an estimate of the importance of a page. PageRank looks at the number of links pointing to a web page and the relative importance of those links and therefore depends not just on the number of links but on the quality of the links. PageRank asserts that if a page has important links pointing to it, then its own links to other pages are also important. The PageRank computation can be made before the user enters the query so that response times are very fast.
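The published PageRank computation is more involved, but a minimal power-iteration sketch conveys the idea. The damping factor of 0.85 is the conventional choice, the link graph is the same assumed example as above (with 0-based page ids), and the sketch ignores refinements such as redistributing the score of dangling pages.

    import java.util.Arrays;

    public class PageRank {
        // One common formulation: PR(p) = (1-d)/N + d * sum over pages q
        // linking to p of PR(q)/outDegree(q), iterated until values settle
        public static double[] rank(int[][] links, int n, int iterations) {
            double d = 0.85; // conventional damping factor
            double[] pr = new double[n];
            Arrays.fill(pr, 1.0 / n);

            int[] outDegree = new int[n];
            for (int[] link : links) outDegree[link[0]]++;

            for (int it = 0; it < iterations; it++) {
                double[] next = new double[n];
                Arrays.fill(next, (1 - d) / n);
                for (int[] link : links) {
                    next[link[1]] += d * pr[link[0]] / outDegree[link[0]];
                }
                pr = next;
            }
            return pr;
        }

        public static void main(String[] args) {
            // Same assumed link graph as before: 0->1, 3->1, 1->2 (0-based ids)
            int[][] links = { {0, 1}, {3, 1}, {1, 2} };
            System.out.println(Arrays.toString(rank(links, 4, 50)));
            // Page 3 (index 2) ends highest: its single in-link comes from the
            // well-linked page 2, echoing the point made about Fig. 7.4
        }
    }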
7.12 Enterprise Search
Searching is not limited to whole web search, and there is a huge market for focused searching, or searching within a particular domain or enterprise. Companies such as Autonomy and Recommind (http://www.recommind.com/) use artificial intelligence combined with advanced search technology to improve user productivity. One of the key challenges lies in the fact that much of an organisation’s data may exist in disparate repositories or database systems which have their own unique search facilities. More recently it has been noted that with the provision of cloud-based systems such as Google Docs and Microsoft Live Office, people are naturally storing vital information outside the organisation’s networks. The provision of a federated search capability, where one search interface can access the maximum amount of relevant data, possibly including data stored in external cloud sites, has become critical. Technology used by whole web search engines such as PageRank is not always applicable to the enterprise case, where links between documents may be minimal or non-existent.
7.13 Multimedia Search
There has been increasing interest and huge investment in the development of new technology for searching audio and visual content. In its simplest form, this is a case of giving special consideration to the text in HTML tags for multimedia data. However, many vendors have gone beyond this and now provide a ‘deep search’ of multimedia data, including facilities such as searching with images rather than text. The ability to automatically cluster or classify multimedia data has also advanced considerably.
7.14 Collective Intelligence
Collective intelligence is an active field of research but is not itself a new phenomenon, and examples are cited in biology (e.g. evolutionary processes), social science and economics. In this chapter, we are using collective intelligence to refer to the kind of intelligence that emerges from group collaboration in web- and cloud-based systems. Collective intelligence, like ‘cloud’, is a term for which an exact definition has not been agreed, although it is mostly obvious when collective intelligence is being harnessed. The Center for Collective Intelligence at the Massachusetts Institute of Technology (http://cci.mit.edu/) provides a nice description by posing the question that tools based on collective intelligence might answer:
How can people and computers be connected so that collectively they act more intelligently than any individuals, groups, or computers have ever done before?
The rise of the web and Web 2.0 has led to huge amounts of user-generated content. Users no longer need any special skills to add their own web content, and indeed Web 2.0 applications actively invite users to interact and collaborate. Cloud computing has accelerated the process, as ease of access, storage availability, improved efficiency, scalability and reduced cost have all increased the attractiveness of storing information on the web and also made it easier and simpler to develop new innovative applications seeking to exploit collective intelligence.
There are many examples of collective intelligence where systems leverage the power of the user community. We list a few below to give a flavour of the topic:
1. Wikipedia is an outstanding result of a huge collaboration, and each article is effectively maintained by a large collection of individuals.
2. Many sites, such as reddit.com, allow users to provide content and then decide through voting ‘what’s good and what’s junk’.
3. Google Suggest provides a list of suggestions as soon as the user begins typing. The suggestions are based on a number of factors, but a key one is the previous searches that individuals have entered (see Fig. 7.5).
4. Genius Mixes from iTunes automatically suggest songs that should go well with those already in a user’s library.
5. Facebook and other social networks provide systems for ‘finding friends’ based on factors such as the links already existing between friends.
6. Online stores such as Amazon will provide recommendations based on previous purchases.
7. Image tagging together with advanced software such as face recognition can be used to automatically label untagged images.
8. The algorithms used by movie rental site Netflix to recommend films to users are now responsible for 60% of rentals from the site.
9. The Google+ social network allows users to +1 any site as a means of recommendation.
10. Mail systems such as Gmail will suggest people to include when sending an email.
In his book, The Wisdom of Crowds, James Surowiecki suggests that ‘groups are remarkably intelligent, and are often smarter than the smartest people in them’.
Fig. 7.5 Google suggest
He goes on to argue that under the right circumstances, simply adding more people will get better results, which can often be better than those produced by a small group of ‘experts’. A number of conditions are suggested for ‘wise crowds’:
1. Diversity: the crowd consists of individuals with diverse opinions.
2. Independence: the individuals feel free to express their opinions.
3. Decentralisation: people are able to specialise.
4. Aggregation: there is some mechanism to aggregate that information and use it to support decision-making.
Note that we are not talking about consensus building or compromise but rather a competition between heterogeneous views; indeed, certain parallels have been drawn with Darwinian evolution.
The PageRank algorithm discussed above is actually an example of collective intelligence, as the ordering is partly the result of a kind of voting occurring in the form of links between websites, and we can see how it easily meets all four criteria set out above. The web is a hugely diverse environment, and independent individuals are free to add links to any other page from their own pages. The underlying system is famously decentralised, and aggregation is central to the PageRank algorithm. Real search engines actually use a complex mix of mathematically based algorithms, text mining and machine learning techniques together with something similar to PageRank. Let’s take a look at some more examples of ‘Collective Intelligence in Action’.
7.14.1 Tagging
Tagging is a process akin to classification whereby items such as products, web pages and documents are given a label. In the case of automatic text classification, we saw that both a predefined set of categories and a set of labelled example documents were required to build a classifier. In this case people with some expertise in the domain would be required to give each example document the appropriate category labels. One approach to tagging is to use professionals to label items, although this could take up significant resources and could become infeasible, for example, where there is a large amount of user-generated content.
A second approach is to produce tags automatically, either via clustering or classification as described above or by analysing the text. Again we can use the term vector to obtain the relative weights of terms found in the text, which can be presented to the user in various visual formats (see below). A third option, which has proved increasingly popular in recent years and indeed has become ubiquitous on the web, is to allow users to create their own tags, either by selecting tags from a predefined list or by using whatever words or phrases they choose to label items.
This approach, sometimes referred to as a ‘folksonomy’ (a combination of ‘folk’ and ‘taxonomy’), can be much cheaper than developing a controlled taxonomy and allows users to choose the terms which they find the most relevant. In a folksonomy users are creating their own classification, and as they do, collective intelligence tools can gather information about the items being tagged and about the users who are creating the tags. Once a reasonable number of items have been tagged, we can then use those items as examples for a classifier which can then automatically find similar items. Tags can help with finding users with similar tagged items, finding items which are similar to the one tagged and creating a dynamic, user-centric vocabulary for analysis and collaboration in a particular domain.
7.14.2 Recommendation Engines
Recommendation engines are a prime example of ‘Collective Intelligence in Action’. In 2006, Netflix held the first Netflix Prize, a competition to find a program to better predict user preferences and beat its existing Netflix movie recommendation system by at least 10%. The prize of $1 million was won in 2009, and the recommendation engine is now reported to contribute to a large fraction of the overall profits. Perhaps the best known recommendation engine comes from Amazon. If we select the book ‘Lucene in Action’, we get the screen shown in Fig. 7.6, where two sets of suggestions are shown.

Fig. 7.6 Amazon recommendation engine
In fact, if you click on some of the suggestions you will soon pick up most of the recommended reading for this chapter. As is indicated on the page, the information is obtained largely by analysing customers’ previous procurement patterns. Many other features can be factored into the recommendations, such as similarity of the items analysed, similarity between users and similarity between users and items.
7.14.3 Collective Intelligence in the Enterprise
In large organisations, employees can waste significant time in trying to locate documents from their searches. Enterprise search companies such as Recommind (http://www.recommind.com/) encourage users to tag and rate items. The information is aggregated and then fed into the search engines so that other users can quickly get a feel for a particular topic from the tags and the usefulness derived from the classifications and recommendations of others. This also allows for the option of removing or reducing the ranking of poorly rated items, a process sometimes referred to as collaborative filtering.
7.14.4 User Ratings
Many e-commerce sites give users the option of registering their opinion regarding a product, often in the form of a simple star system. The ratings of users can be aggregated and reported to prospective buyers. Often the users are also able to write a review which other customers can then read, and many sites now offer links to social network sites such as Facebook and Twitter where products can be discussed. The ratings of a particular user can also be evaluated by comparing them with those given by other users.
Recently there has been significant interest in tools which can automatically mine textual reviews and identify the sentiment or opinion being expressed, particularly in the case of social network data. The software may not be completely accurate due to the challenges of natural language text mentioned above, but, as a recurring theme in collective intelligence, if the accuracy is reasonable and the number of reviews being mined is above a minimal threshold, there may still be great value in using the tool.
7.14.5 Personalisation
Personalisation services can help in the fight against information overload. Information presented to the user can be automatically filtered and adapted without the explicit intervention of the user. Personalisation is normally based on a user profile which may be partly or completely machine generated. Both the structure and content of websites can be dynamically adapted to users based on their profile. A successful system must be able to identify users, gather knowledge about their preferences and then make the appropriate personalisation. Of course, a degree of confidence in the system is required, and it would be much better to skip personalisation functions than to make changes based on incorrect assumptions, leading to a detrimental user experience and reduced effectiveness of the site.
We have already looked at web content mining, at least in terms of textual data, and web structure mining in terms of PageRank. Users also provide valuable information via interacting with websites. Sometimes the information is quite explicit, as in the case of purchasing, rating, bookmarking or voting. In e-commerce, products and services can be recommended to users based not only on the purchasing actions of other users but also depending on specific needs and preferences relating to a profile. Web usage mining, which deals with data gathered from user visits to a site, is especially useful for developing personalisation systems. Data such as frequency of visits, pages visited, time spent on each page and the route navigated through a site can be used for pattern discovery, often with the aid of machine learning algorithms, which can then be fed into the personalisation engine.
7.14.6 Crowd Sourcing
Many of the systems for harnessing collective intelligence require minimal or no additional input from the users. However, directly asking customers or interested parties to help with projects has proved a success in a number of areas. One of the earliest and most famous examples dates from 2000 when NASA started its ClickWorker study, which used public volunteers to help with the identification of craters. There are a number of systems springing up, such as IdeaExchange from Salesforce.com (http://success.salesforce.com/), where customers can propose new product solutions and provide evaluations of existing products. Amazon Mechanical Turk (https://www.mturk.com/mturk/welcome) uses crowdsourcing to solve a variety of problems requiring human intelligence.
7.15 Text Visualisation
We finish this chapter with a brief introduction to the field of text visualisation. It has long been known that where there is a huge amount of information to digest, visualisation can be a huge aid to users of that data. Recent years have seen a growing set of tools to automatically visualise the textual content of documents and web pages.
Tag clouds are simple visualisations, in use on the web since 1997, that display word frequency information via font size and colour. Users have found the visualisations useful in providing an overview of the content of text documents and websites. Whereas many systems are formed using user-provided tags, there has been significant interest in ‘word clouds’ or ‘text tags’ which are automatically generated using the text found in documents or websites. For example, the popular tool Wordle has seen a steady increase in usage, and Wordle or similar diagrams are commonly seen as part of a website to give users a quick view of the important topics.
Generally the word clouds are based on
frequency information after stop word removal and possibly stemming. If
stemming is used, it is important to display a recognisable word (often the
most frequently occurring form) rather than the stemmed form which may be
confusing to users. We can think of these diagrams as a visual representation of the term vector for a document; as with the term vector, the representation is based on a ‘bag of words’, and word proximity is generally not taken into account when generating the word cloud.
If
you go to the Wordle site ( http://www.wordle.net/ ), you can see examples
or you can create your own word clouds by pasting in text or pointing to resources
containing text. The system selects the most frequent words and then presents
them using various techniques to adjust font, colour, size and position, in a
way that is pleasing and useful to the user.
An experimental alternative is available at http://txt2vz.appspot.com/. Txt2vz is similar to Wordle but also shows how words are linked in the text and has some of the features of a mind map. Two formats of the txt2vz word cloud are available, and examples generated from the text of this chapter are shown below (Figs. 7.7 and 7.8).
Word clouds are simple and are commonly presented on websites with little or no explanation of how they should be used or interpreted. Often the words presented are made clickable as a means to provide a dynamic navigation tool which adjusts in real time as users add content.
Fig. 7.7 txt2vz.appspot.com graph of this chapter, simple graph format
7.16 Chapter Summary
In this chapter, we
have taken an overview of the various ways in which we can get the most out of
web- and cloud-based systems. In particular we have looked at Web 2.0
applications and the exciting possibilities of using intelligent tools to gain
valuable information and to enhance the user experience.
7.17 End of Chapter Exercise
7.17.1 Task 1: Explore Visualisations
We begin with a brief look at some examples of the many freely available visualisation tools.
1. Wordle (http://www.wordle.net/) is probably the most popular of the tools and a good starting point for the exercise. Experiment with the different clouds and create your own Wordle by pasting in plain text.
Fig. 7.8 txt2vz.appspot.com graph of this chapter, radial graph format
2. Visit the Many Eyes site (http://www-958.ibm.com/software/data/cognos/manyeyes/), look at the various visualisations and try creating your own.
3. Explore the tree cloud site (http://www2.lirmm.fr/~gambette/treecloud/) and create a tree cloud using the same text you used for Wordle. Explore the different options.
4. Try txt2vz (http://txt2vz.appspot.com/). Use the same text again and notice any similarities or differences from the other tools. Try the adjustments to find the best setting for your document.
7.17.2 Task 2: Extracting Text with Apache Tika
Online text is stored in a wide variety of formats such as HTML, XML, Microsoft Office, Open Office, PDF and RTF. For most applications trying to extract useful information from documents, the first step is the extraction of the plain text from the various formats.
Text extraction is a challenging task but can be rendered quite straightforward via the freely available Tika Java library from the Apache Software Foundation: http://tika.apache.org/. Tika can smoothly handle a wide range of file formats and even extract text from multimedia data. In this tutorial we will begin by downloading Tika and writing a short Java program to extract text. It would certainly help if you have some Java knowledge before you try this tutorial, and you can learn the basics of Java at http://docs.oracle.com/javase/tutorial/. However, you can run this tutorial without Java knowledge by just carefully following each of the steps below:
1. Open your Ubuntu VM.
2. Create a new folder on your VM under ‘home’ called ‘dist’.
3. Go to the Tika site http://tika.apache.org/ and locate the latest download: http://tika.apache.org/download.html
4. Download the latest version of the jar file (e.g. http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.1.jar) to the dist folder.
5. Open Eclipse, select File/New/Project, select ‘Java Project’ and select ‘next’.
6. Give your project a name such as ‘Text Analysis’, leave all the other options and select ‘next’.
7. Click on the ‘libraries’ tab and select ‘Add External JARs’.
8. Locate the ‘dist’ folder you created above, select the Tika jar file and then select ‘finish’.
9. If you are asked if you would like to switch to the Java perspective, select ‘yes’.
10. In the package explorer on the left, expand your project, right click on the ‘src’ folder and select ‘New/Class’.
11. In the name field, call the class ‘TikaTest’ and select ‘finish’.
12. Paste in the following code, replacing any code stubs automatically generated by Eclipse. Replace ‘testDoc.docx’ with a file of your choice, ensuring that the path to the document is correct on your VM.
    import java.io.File;
    import java.io.IOException;
    import org.apache.tika.Tika;
    import org.apache.tika.exception.TikaException;

    public class TikaTest {

        public static void main(String[] args) {
            TikaTest tikatest1 = new TikaTest();
            File f = new File("/home/hirsch/testDoc.docx");
            try {
                String s = tikatest1.getText(f, 1000);
                System.out.println(s);
            } catch (IOException e) {
                e.printStackTrace();
            } catch (TikaException e) {
                e.printStackTrace();
            }
        }

        public String getText(File f, int maxSize) throws IOException, TikaException {
            Tika tika = new Tika();
            // Limit the amount of text returned by the parser
            tika.setMaxStringLength(maxSize);
            try {
                return tika.parseToString(f);
            } catch (IOException e) {
                return "Error: " + e;
            }
        }
    }
13. Click on the save icon, then right click on ‘TikaTest.java’ in the package explorer and select ‘Run as’ and then ‘Java Application’. If all is well, you will see the first part of the text of your document appear in the console window. Test the code with several different file formats such as docx, doc, rtf, pdf, odt, html and xml.
14. If you are reasonably confident with Java, try to modify the program so that stop words are removed. You can search the web for a suitable set or use the following set:
‘a’, ‘an’, ‘and’, ‘are’, ‘as’, ‘at’, ‘be’, ‘but’, ‘by’,
‘for’, ‘if’, ‘in’, ‘into’, ‘is’, ‘it’,
‘no’, ‘not’, ‘of’, ‘on’, ‘or’, ‘such’, ‘that’, ‘the’, ‘their’, ‘then’, ‘there’,
‘these’, ‘they’, ‘this’, ‘to’, ‘was’, ‘will’, ‘with’
7.17.3 Advanced Task 3: Web Crawling with Nutch and Solr
Nutch is an open source library for web crawling and again uses a Lucene index for storing the crawled data. This task will take some time and effort and is probably for more experienced users. At the end of it, you will be able to perform a small crawl of the web, create a Lucene index and search that index.
Go to the Nutch page and examine the documentation (http://nutch.apache.org/) and follow the tutorial (http://wiki.apache.org/nutch/NutchTutorial). Try to run a small crawl on your virtual machine. The tutorial shows you how you can combine a web crawl with Apache Solr. Solr provides a server-based solution for projects based around Lucene indexes and will allow you to view and search the results of your crawl.
References
Alag, S.: Collective Intelligence in Action. Manning, Greenwich (2009)
Langville, A.N., Meyer, C.D.: Google’s PageRank and Beyond. Princeton University Press, Princeton (2006)
Lingras, P., Akerkar, R.: Building an Intelligent Web: Theory and Practice. Jones and Bartlett, Sudbury (2008)
Marmanis, H., Babenko, D.: Algorithms of the Intelligent Web. Manning, Greenwich (2009)
McCandless, M., Hatcher, E., Gospodnetic, O.: Lucene in Action. Manning, Greenwich (2010)
Segaran, T.: Programming Collective Intelligence. O’Reilly Media, Sebastopol (2007)
Surowiecki, J.: The Wisdom of Crowds: Why the Many are Smarter than the Few and How Collective Wisdom Shapes Business, Economies, Societies, and Nations. Anchor (2005)