What the reader will learn:
• The importance of search in web and cloud technology
• The challenges and potential of unstructured text data
• How collective intelligence is being used to enhance a variety of new applications
• An introduction to text visualisation
• An introduction to web crawling
7.1 Introduction
We have seen how web and cloud technology allow us to easily store and process vast amounts of information, and as we saw in Chap. 6, there are many different data storage models used in the cloud. This chapter looks at how we can start to unlock the hidden potential of that data, to find the ‘golden nuggets’ of truly useful information contained in the overwhelming mass of irrelevant or useless junk and to discover new knowledge via the intelligent analysis of the data. Many of the intelligent tools and techniques discussed here originated well before cloud computing. However, the nature of cloud data, its scalable access to huge resources and the sheer size of the available data means that the advantages of these tools are much more obvious. We have now reached a place where many common web-based tasks would not be possible without them.
Much of this new information is coming directly from users. The art of tapping into both the data created by users and the interaction of users with web and cloud-based applications brings us to the field of collective intelligence and ‘crowd sourcing’. Collective intelligence has roots predating the web in diverse fields including biology and sociology. The application of collective intelligence techniques to web-based data has rightly been receiving a lot of attention in recent years, and it has become clear that, particularly with the rapid expansion of cloud systems, this is a fruitful area for developing new ways of intelligently accessing, synthesising and analysing our data. It has been suggested that the eventual result will be Web 3.0.
We will start this chapter with a brief overview of the kind of data we are dealing with and the techniques that are already being employed to extract intelligence from user data, interaction and collaborations. We will then look in some detail at the process of searching, including an appraisal of the challenges of dealing with textual data and an overview of the procedures underlying a search engine. We then move to the field of collective intelligence and what it can offer web and cloud applications. We will finish by looking at the power of visualisation. At the end of this chapter, the exercises will use open source libraries to perform some of the stages of extracting and analysing online text.
7.2 Web 2.0
Ten years ago the majority of websites had the feel of a lecture: the static information flowed one way from the site to the user. Most sites are now dynamic and interactive, and the result is better described as a conversation. We should perhaps remember that HTTP itself is designed to work in a conversational manner. Web 2.0 is all about allowing and encouraging users to interact with websites. As the interaction increases, so does the amount of data. In this chapter, we will look at some of the tools we can employ to get more out of user-generated content. Often the process of using this content effectively to improve our site leads to increased interest and further activity, thus creating a virtuous circle.
7.3 Relational Databases
Relational database technology is mature and well known by most software developers and benefits from the compact and powerful SQL language. Data stored in relational databases has a number of distinct advantages when it comes to information retrieval, including:
• Data is stored in labelled fields.
• The fields (or columns) have predetermined data type and size attributes.
• We can specify constraints on the fielded data, for example, ‘all entered values must be unique’, ‘null values are not accepted’ or ‘entered data must fall within a particular set or range of values’.
• The well-understood normalisation technique can be applied, which has been shown to reduce redundancy and provide an easy to understand logical to physical storage mapping.
7.4 Text Data
Despite the benefits listed above, it has been estimated that approximately 80% of an organisation’s data is in an unstructured format. The data is of course multimedia, but text data has attracted the greatest interest as the primary source for web mining, although in recent years attention has been shifting towards other formats, particularly images and video, where significant advances have been occurring.
Table 7.1 Example synonyms
Physician | Doctor
Maize | Corn

Table 7.2 Example homonyms
Java (country) | Java (programming language)
Board (board of directors) | Board (wooden board)
Unstructured text data has the useful property of being both human and machine readable, even if it makes no ‘sense’ to a machine. The rules for text data are very different to those of relational databases. Text data may or may not be validated and is often duplicated many times. It may have some structure, such as in an academic article, or little or no structure, for example, in the case of blogs and email. In some cases, spellings and grammar are checked very carefully, but in others many mistakes, misspellings, slang words, abbreviations and acronyms are common. There is a notorious many-to-many relationship between words and meaning. In text data we frequently find synonyms, that is, different words with the same meaning, and homonyms, words with the same spelling but with distinct meanings. Examples are shown in Tables 7.1 and 7.2.
Text also has many examples of multi-word units such as ‘information retrieval’ where more than one word refers to a single concept; new words are constantly appearing; there are many examples of multinational crossover; and the meaning of words can vary over time. For these reasons, extracting useful information from textual data is in many ways a harder problem than with data stored in relational databases. Typical examples of text likely to be targeted for intelligent analysis are articles, white papers, product information, reviews, blogs, wikis and message boards. One effect of Web 2.0 has been to greatly increase the amount of online text data. The cloud is providing easy access to scalable computing power with which to process this data in innovative and fruitful ways. The web itself can be thought of as one huge data store. We will look at the tools and techniques which have been developed to maximise the potential of this fantastic resource at humanity’s fingertips.
7.5 Natural Language Processing
Natural language processing is a broad set of techniques used to process written and spoken human languages. Natural language processing tasks often involve categorising the type of word occurring in text. A particularly useful type of data refers to things like countries, organisations and individual people. The process of automatically identifying these is known as entity extraction. A second common task is to identify the ‘part of speech’ of particular words so that, for example, nouns, adjectives and adverbs can be automatically identified. GATE (http://gate.ac.uk/) is a freely available open source tool written in Java which performs the above tasks together with many more related to text processing. Natural language processing has made great advances in areas such as automatic translation, speech recognition and grammar checking. One of the long-term goals of natural language processing is to enable machines to actually understand human text, but there are still huge challenges to overcome, and we should be circumspect in the case of claims that this goal has been achieved by an existing system or that a solution is very close.
7.6 Searching
Searching was one of the first applications to which we might attach the term ‘cloud’, and searching remains, for most users, the most important tool on the web. Search engines such as Google, Bing and Yahoo are well-known examples of mature cloud applications and also give a good introduction to many important concepts relating to cloud data along with text engineering and collective intelligence.
It is therefore worth a brief look ‘under the hood’ to get an introduction to the inner workings of search engines. As well as providing an insight into one of the key tools for web and cloud data, our investigation will bring to light a number of important themes and concepts that are very relevant to web and cloud intelligence and to making the most of the noisy, unstructured data commonly found on the web. Of course, the commercial search engine market is a big business, and vendors keep the exact workings of their systems well hidden. Nonetheless, there are a number of freely available open source libraries, and at the end of this chapter, we build our own search engine. We will be using the widely used and much praised Apache Lucene index together with the related Apache projects Nutch, Tika and Solr. It is worth noting that Nutch is specifically designed to run on Apache Hadoop’s implementation of MapReduce, which we investigated in Chap. 4.
7.6.1 Search Engine Overview
Search engines have three major components, as shown in Fig. 7.1:
1. Crawler
2. Indexer
3. Web interface

Fig. 7.1 Search engine components
7.6.2 The Crawler
The web is unlike any other database previously encountered and has many unique features requiring new tools as well as the adaptation of existing tools. Whereas previous environments such as relational databases suggested that searching should cover the entire database in a precise and predictable manner, this is simply not possible when we scale to web magnitudes. Web crawlers (often called spiders or bots) browse the web in a methodical manner, often with the aim of covering a significant fraction of the publicly available part of the entire web.
To start the crawl, a good set of seed pages is needed. A common way to do this is to obtain the seeds from a project such as the Open Directory Project (http://www.dmoz.org/). Once the list of seed pages has been obtained, the crawler essentially works by performing a series of HTTP GET commands (see Chap. 4). Hyperlinks found in any page are stored and added to a list of sites to be fetched. Of course, there are many complications such as scheduling of the crawl, parallelisation, prioritisation, politeness (avoiding overwhelming particular sites by respecting their policy regarding crawling) and the handling of duplicates and dead links. There are a number of open source libraries which can be used for crawling, ranging from the fairly simple and easy to use HTTrack (http://www.httrack.com/) to Apache Nutch (http://nutch.apache.org/), which can be used to build a whole web search engine. We look at these in our end of chapter tutorials. Crawling the whole web is a major undertaking requiring massive resources in terms of storage, computation and management. In most cases organisations will be performing a crawl of their own data or perhaps a focused or intelligent web crawl on a particular topic area.
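The core fetch-and-harvest loop can be sketched in a few lines of Java. The sketch below is illustrative only: it assumes Java 11 or later, uses a hypothetical seed URL, extracts links with a crude regular expression and stops after a fixed fetch budget, whereas a real crawler such as Nutch uses a proper HTML parser, respects robots.txt and handles scheduling, duplicates and failures.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.*;
    import java.util.regex.*;

    public class SimpleCrawler {
        // Crude link extraction; real crawlers use an HTML parser instead
        private static final Pattern LINK = Pattern.compile("href=\"(http[^\"]+)\"");

        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            Queue<String> frontier = new ArrayDeque<>(List.of("http://example.com/")); // seed page (placeholder)
            Set<String> seen = new HashSet<>(frontier);
            int budget = 20; // stop after a fixed number of fetches

            while (!frontier.isEmpty() && budget-- > 0) {
                String url = frontier.poll();
                HttpResponse<String> resp = client.send(
                        HttpRequest.newBuilder(URI.create(url)).GET().build(),
                        HttpResponse.BodyHandlers.ofString());
                // The page body would be stored or handed to the indexer here;
                // then hyperlinks are harvested and unseen ones queued.
                Matcher m = LINK.matcher(resp.body());
                while (m.find()) {
                    String link = m.group(1);
                    if (seen.add(link)) frontier.add(link);
                }
                Thread.sleep(1000); // crude politeness delay between requests
            }
        }
    }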
7.6.3 The Indexer
Instead of trying to answer the question ‘what words are contained in a particular document?’, which you could answer by simply reading the document, an indexer aims to provide a quick answer to the question ‘which documents contain this particular word?’, and for this reason the index is referred to as an ‘inverted index’.
Fig. 7.2 Text processing steps
Once the crawler returns a page, we need to process it in some way. The page could be in a wide variety of formats such as HTML, XML, Microsoft Word, Adobe PDF and plain text. Generally, the first task of the indexer will be to extract the text data from any of the different formats likely to be encountered. This is a complex task, but luckily open source tools such as Apache Tika (http://tika.apache.org/) are freely available. Once the text is extracted, we can start to build the index by processing the stream of text. A number of processes are normally carried out before we build the index. Note that these steps will reduce the number of words stored, which is likely to be helpful when extracting intelligence from text data (Fig. 7.2).
7.6.3.1 Tokenisation
The task of the
tokeniser is to break the stream of characters into words. In English or most
European languages, this is fairly straightforward as the space character can
be used to separate words. Often punctuation is removed during this stage. In
languages such as Chinese where there is no direct equivalent of the space
character, this is a much more challenging task.
7.6.3.2 Set to Lower Case
W hen computer programs read text data, the upper- and
lower-case forms of individual characters are given separate codes, and
therefore, two words such as ‘cloud’ and ‘Cloud’ would be identi fi ed as
separate words. It is generally useful to set all the characters to a standard
form so that the same word is counted whether, for example, it is at the start
or middle of a sentence although some semantic information may be lost.
7.6.3.3 Stop Word Removal
It is often useful to
remove words which are very frequent in text but which carry low semantic
value. For example, the Apache Lucene ( http://lucene.apache.org ) indexing system
contains the following default list of stop words:
‘a’, ‘an’, ‘and’, ‘are’, ‘as’, ‘at’, ‘be’, ‘but’, ‘by’,
‘for’, ‘if’, ‘in’, ‘into’, ‘is’, ‘it’,
‘no’, ‘not’, ‘of’, ‘on’, ‘or’, ‘such’, ‘that’, ‘the’, ‘their’, ‘then’, ‘there’,
‘these’, ‘they’, ‘this’, ‘to’, ‘was’,
‘will’, ‘with’
Including even this small set of stop words can greatly reduce the total number of words which are stored in the index.
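As an illustration, here is a minimal sketch in plain Java of the three steps described so far (tokenisation, setting to lower case and stop word removal), using the Lucene default stop word list quoted above. The whitespace-and-punctuation split is a simplification that suits languages such as English.

    import java.util.*;

    public class TextPrep {
        // The Lucene default stop word list quoted above
        private static final Set<String> STOP_WORDS = Set.of(
                "a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
                "if", "in", "into", "is", "it", "no", "not", "of", "on", "or",
                "such", "that", "the", "their", "then", "there", "these",
                "they", "this", "to", "was", "will", "with");

        public static List<String> prepare(String text) {
            List<String> tokens = new ArrayList<>();
            // Tokenise on anything that is not a letter or digit,
            // set to lower case, then drop stop words
            for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
                if (!t.isEmpty() && !STOP_WORDS.contains(t)) tokens.add(t);
            }
            return tokens;
        }

        public static void main(String[] args) {
            System.out.println(prepare("The cloud is a Cloud, and it is not a board."));
            // -> [cloud, cloud, board]
        }
    }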
7.6.3.4 Stemming
Stemming allows us to represent various word forms using a single word. For example, if the text contains the words ‘process’, ‘processing’ and ‘processes’, we would consider these as the same word for indexing purposes. Again, this can greatly reduce the total number of words stored. Various methods for stemming have been proposed; the most widely used is the Porter stemming algorithm (Lucene contains classes to perform Porter stemming). However, there are some disadvantages to stemming, and some search engines do not use stemming as part of the indexing. If stemming is used when creating an index, the same process must be applied to the words that the user types into the search interface.
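For illustration, the sketch below uses Lucene’s EnglishAnalyzer, which chains tokenisation, lower-casing, stop word removal and Porter stemming in one pass. It assumes a reasonably recent Lucene release on the classpath; exact class locations and constructors vary between Lucene versions.

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class StemDemo {
        public static void main(String[] args) throws Exception {
            // EnglishAnalyzer tokenises, lower-cases, removes stop words
            // and applies Porter stemming in a single chain
            try (Analyzer analyzer = new EnglishAnalyzer();
                 TokenStream ts = analyzer.tokenStream("body",
                         new StringReader("process processing processes"))) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    System.out.println(term); // expected: 'process' three times
                }
                ts.end();
            }
        }
    }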
7.6.4 Indexing
Once the words have been extracted, tokenised, filtered and possibly stemmed, they can be added to an index. The index will be used to find documents relevant to users’ search queries as entered into the search interface. An index will typically be able to quickly return a list of documents which contain a particular word, together with other information such as the frequency or importance of that word in a particular document. Many search engines allow the user to simply enter one or more keywords. Alternatively, users may build more complex queries, for example, requiring that two words must appear before a document is returned or that a particular word does not occur. Lucene has a wide range of query types available.
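A minimal sketch of the idea behind an inverted index follows: a map from each term to the documents (and frequencies) in which it occurs, so that the question ‘which documents contain this word?’ becomes a single lookup. Real indexes such as Lucene’s use far more compact on-disk structures, but the logical shape is the same.

    import java.util.*;

    public class InvertedIndex {
        // term -> (document id -> term frequency in that document)
        private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();

        public void add(int docId, List<String> tokens) {
            for (String term : tokens) {
                postings.computeIfAbsent(term, t -> new HashMap<>())
                        .merge(docId, 1, Integer::sum);
            }
        }

        // Answer 'which documents contain this word?' directly from the index
        public Map<Integer, Integer> lookup(String term) {
            return postings.getOrDefault(term, Map.of());
        }

        public static void main(String[] args) {
            InvertedIndex idx = new InvertedIndex();
            idx.add(1, List.of("cloud", "data", "cloud"));
            idx.add(2, List.of("data", "search"));
            System.out.println(idx.lookup("cloud")); // {1=2}
            System.out.println(idx.lookup("data"));  // {1=1, 2=1}
        }
    }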
7.6.5 Ranking
A search engine which
indexes a collection of hundreds of documents belonging to an organisation
might produce acceptable results for most searches, especially if users gain
expertise in more advanced query types. However, even in this case some queries
might return too many documents to be useful. In the case of the web, the
numbers become overwhelming, and the situation is only made worse by the fact
that the web has no central authority to accept, categorise and manage
documents; anyone can publish on the web and data is not always trustworthy.
One of the most straightforward and widely used solutions is to place the
results of a query in order, such that the pages or documents most likely to
meet the user’s requirements are at the top of the list. There are a number of
ways to do this.
7.7 Vector Space Model
The vector space model (VSM), originally developed by Salton in the 1970s, is a powerful way of placing documents in order of relevance to a particular query. Each word or term remaining after stop word removal and stemming is considered to be a ‘dimension’, and each dimension is given a weight. The model takes no account of the order of words in a document and is sometimes called a ‘bag of words’ model. The weight value (w) is related to the frequency of each term in a document (the term frequency) and is stored in the form of a vector. A vector is a useful way of recording both magnitude and direction. The weight of each term should represent the relative importance of the term in a document. Both documents (d) and queries (q) can be stored this way:
d_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})
q = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})
The two vectors can then be compared in a
multidimensional space allowing for a ranked list of documents to be returned
based on their proximity to a particular query.
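The text does not spell out how this comparison is made, but a common choice (an assumption here, not stated above) is the cosine of the angle between the two vectors, which is large when the document and the query share heavily weighted terms:

\mathrm{sim}(d_j, q) = \frac{\sum_{i=1}^{t} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^{2}}\, \sqrt{\sum_{i=1}^{t} w_{i,q}^{2}}}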
We
could simply store a binary value indicating the presence or absence of a word
in a document or perhaps the word frequency as the weight value in the vector.
However, a popular and generally more effective way of computing the values to
store against each term is known as tf-idf weighting (term frequency–inverse
document frequency) which is based on two empirical observations regarding
collections of text:
1. The
more times a word occurs in a document, the more relevant it is to the topic of
the document.
2. The
more times the word occurs throughout the documents in the collection, the more
poorly it discriminates between documents.
It is useful to combine the term frequency (tf) with the inverse document frequency (idf), derived from the number of documents in the collection in which the term occurs at least once, to create a weight.
tf-idf
weighting assigns the weight to a word in a document in proportion to the
number of occurrences of the word in the document and in inverse proportion to
the number of documents in the collection for which the word occurs at least
once, that is,
w_{i,j} = tf_{i,j} \times \log(N / df_i)
The weight of term i in document j is the frequency of term i in document j multiplied by the log of the total number of documents in the collection (N) divided by the number of documents containing term i (df_i). The log is used as a way of ‘squashing’ or reducing the differences; it could be omitted but has been found to improve effectiveness. Perhaps it is easier to follow with an example.
Assume we have 500 documents in our collection and the word ‘cloud’ appears in 60 of these. Consider a particular document wherein the word ‘cloud’ appears 11 times. To calculate the tf-idf for that document:
The term frequency (tf) for ‘cloud’ is 11.
The inverse document frequency (idf) is log(500/60) = 0.92.
The tf-idf score, that is, the product of these quantities, is 11 × 0.92 = 10.12.
We could then repeat this calculation across all the documents and insert the tf-idf values into the term vectors.
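The same calculation can be expressed as a small Java method; this is a sketch assuming base-10 logarithms, as in the worked example above.

    public class TfIdf {
        /**
         * tf-idf weight of a term in one document.
         *
         * @param tf term frequency in the document
         * @param n  total number of documents in the collection
         * @param df number of documents containing the term
         */
        public static double weight(int tf, int n, int df) {
            return tf * Math.log10((double) n / df);
        }

        public static void main(String[] args) {
            // The worked example above: 'cloud' occurs 11 times in the document
            // and appears in 60 of the 500 documents in the collection
            System.out.printf("%.2f%n", weight(11, 500, 60));
            // prints 10.13 (the text rounds the idf to 0.92 first, giving 10.12)
        }
    }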
There are many variations of the above formula which take account of other factors such as the length of each document. Many indexing systems, such as Apache Lucene, will store the term vector using some form of tf-idf weighting. Documents (or web pages) can then be quickly returned in order depending on the comparison of the term vectors of the query and documents from the index. Inverted indexes, such as that created by Apache Lucene, are essentially a compact data structure in which the term vector representation of documents is stored.
7.8 Classification
We find classifications in all areas of human endeavour. Classification is used as a means of organising and structuring and generally making data of all kinds accessible and manageable. Unsurprisingly, classification has been the subject of intensive research in computer science for decades and has emerged as an essential component of intelligent systems.
In relation to unstructured text data, the ability to provide automatic classification has numerous advantages including:
• Labelling search results as they appear
• Restricting searches to a particular category (reducing errors and ambiguities)
• An aid to site navigation, allowing users to quickly find relevant sections
• Identifying similar documents/products/users as part of a recommendation engine (see below)
• Improving communication and planning by providing a common language (referred to as an ‘ontology’)
Classification is used extensively in enterprise search such as that provided by Autonomy (http://www.autonomy.com/), and many tools such as spam filters are built on the principles of classification.
The term vector representation of a document makes it relatively easy to compare two or more documents. Classifiers need to be supplied with a set of example training documents where the category of each document is identified. Once the classifier has built a model based on the training documents, the model can then be used to automatically classify new documents. Generation of the model is usually performed by a machine learning algorithm such as naive Bayes, neural networks, support vector machines (SVM) or an evolutionary algorithm. Each of the different methods has its own strengths and weaknesses and may be more applicable to particular domains. Commonly, different classification methods are combined.
7.9 Measuring Retrieval Performance
The aim of a search query is to retrieve all the documents which are relevant to the query and return no irrelevant documents. However, for a real-world data set above a minimum size and complexity, the situation is more likely to be similar to that shown in Fig. 7.3, where many relevant documents are missed and many of the documents retrieved are not relevant to the query.

Fig. 7.3 Search query results
When testing a search or classification engine, it is important to be able to give some measure of effectiveness. Recall measures how well the search system finds relevant documents, and precision measures how well the system filters out the irrelevant documents. We can only obtain values for recall and precision when a set of documents relevant to a particular query is already known.
The F1 measure is a commonly used way of
combining the two complementary measures to give an overall effectiveness
number and has the advantage of giving equal weight to precision and recall.
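In standard notation (spelling out what the text describes), precision is the fraction of retrieved documents that are relevant, recall is the fraction of relevant documents that are retrieved, and F1 is their harmonic mean:

\mathrm{precision} = \frac{|\mathrm{relevant} \cap \mathrm{retrieved}|}{|\mathrm{retrieved}|}, \qquad \mathrm{recall} = \frac{|\mathrm{relevant} \cap \mathrm{retrieved}|}{|\mathrm{relevant}|}

F_1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}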
The actual accuracy achieved will depend on a number of factors such as the learning algorithm used and the size and quality of the training documents. We should note that the accuracy is ultimately down to a human judgement. Even using a human classifier will not achieve 100% accuracy, as in most domains two humans are likely to disagree about the category labels for some documents in a collection of reasonable size and complexity. An impressive result of machine learning technology lies in the fact that, with a good set of example documents, it has been reported that automatic classifiers can achieve accuracy close to that of human experts as measured using F1 or similar.
7.10 Clustering
Clustering is another
useful task, especially when performed automatically. In this case no labelled
training documents are given, and the clustering algorithm is required to
discover ‘natural categories’ and group the documents accordingly. Some
algorithms such as the widely used k-means require that the number of
categories be supplied in advance, but otherwise no prior knowledge of the
collection is assumed. Again, the term vectors of documents are compared, and
groups of documents created based on similarity.
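As an illustration of the idea, here is a compact and deliberately naive k-means sketch over term vectors. It assumes dense vectors, seeds the centroids with the first k documents and runs a fixed number of iterations; production implementations use better seeding and proper convergence tests.

    import java.util.Arrays;

    public class KMeans {
        // Assign each vector to the nearest of k centroids, then move each
        // centroid to the mean of its members, repeating a fixed number of times
        public static int[] cluster(double[][] vectors, int k, int iterations) {
            int dims = vectors[0].length;
            double[][] centroids = new double[k][];
            for (int c = 0; c < k; c++) centroids[c] = vectors[c].clone(); // naive seeding
            int[] assignment = new int[vectors.length];

            for (int it = 0; it < iterations; it++) {
                // Assignment step: nearest centroid by squared Euclidean distance
                for (int v = 0; v < vectors.length; v++) {
                    int best = 0;
                    double bestDist = Double.MAX_VALUE;
                    for (int c = 0; c < k; c++) {
                        double dist = 0;
                        for (int d = 0; d < dims; d++) {
                            double diff = vectors[v][d] - centroids[c][d];
                            dist += diff * diff;
                        }
                        if (dist < bestDist) { bestDist = dist; best = c; }
                    }
                    assignment[v] = best;
                }
                // Update step: recompute each centroid as the mean of its members
                double[][] sums = new double[k][dims];
                int[] counts = new int[k];
                for (int v = 0; v < vectors.length; v++) {
                    counts[assignment[v]]++;
                    for (int d = 0; d < dims; d++) sums[assignment[v]][d] += vectors[v][d];
                }
                for (int c = 0; c < k; c++) {
                    if (counts[c] == 0) continue; // leave an empty cluster's centroid alone
                    for (int d = 0; d < dims; d++) centroids[c][d] = sums[c][d] / counts[c];
                }
            }
            return assignment;
        }

        public static void main(String[] args) {
            double[][] docs = { {5, 0}, {4, 1}, {0, 5}, {1, 4} }; // toy two-term vectors
            System.out.println(Arrays.toString(cluster(docs, 2, 10))); // [0, 0, 1, 1]
        }
    }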
By organising a collection into clusters, a number of tasks such as browsing can become more efficient. As with classification, clustering the descriptions of products/blogs/users/documents or other Web 2.0 data can help to improve the user experience. Automatic identification of similar items is an important component in a number of other tools such as recommendation engines (see below).
Classification is referred to as a ‘supervised learning’ task and is generally more accurate than clustering, which is ‘unsupervised’ and does not benefit from a set of labelled training documents. A good example of clustering is http://search.yippy.com/ (formerly clusty.com). Go to the site and search for ‘java’. Another example is provided by Google News (http://news.google.com/), which automatically collects thousands of news articles and then organises them by subject, subtopic and region and can be made to display the results in a personalised manner. Commonly, clustering is used in conjunction with other tools (e.g. classification rules) to produce the final result.
7.11 Web Structure Mining
A fundamental component of the web is of course the hyperlink. The way pages link together can be used for more than just navigation in a browser, and a good deal of research has gone into investigating the topology of hyperlinks. We can consider web pages to be nodes on a directed graph (put simply, a graph where the lines between nodes indicate a direction), and where there is a link from a particular page (p) to another page (q), we can consider this to be a directed edge. So for any page, we can assign it a value based on the number of links pointing to and from that page. The out degree of a node p is the number of nodes to which it links, and the in degree of p is the number of nodes that have links to p. Thus, in Fig. 7.4 page 1 has an out degree of 1, and page 2 has an in degree of 2. The link from page 2 to page 3 is considered more important than the other links shown because of the higher in degree of page 2.

Fig. 7.4 Link graph
In
web structure mining, we look at the topology of the web and the links between
pages rather than at just the content of web pages. Hyperlinks from one page to
another can be considered as endorsements or recommendations.
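In and out degrees are trivial to compute once the links are listed as directed edges. The edge set below is an assumption consistent with the description of Fig. 7.4 (page 1 linking to page 2, a second page also linking to page 2, and page 2 linking to page 3); the figure itself may differ.

    import java.util.*;

    public class LinkGraph {
        public static void main(String[] args) {
            // Assumed edges matching the Fig. 7.4 description: 1->2, 4->2, 2->3
            int[][] edges = { {1, 2}, {4, 2}, {2, 3} };

            Map<Integer, Integer> inDegree = new HashMap<>();
            Map<Integer, Integer> outDegree = new HashMap<>();
            for (int[] e : edges) {
                outDegree.merge(e[0], 1, Integer::sum); // e[0] links out
                inDegree.merge(e[1], 1, Integer::sum);  // e[1] is linked to
            }
            System.out.println("out degree of page 1: " + outDegree.getOrDefault(1, 0)); // 1
            System.out.println("in degree of page 2:  " + inDegree.getOrDefault(2, 0));  // 2
        }
    }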
7.11.1 HITS
In 1998 Kleinberg
created the HITS algorithm which was a method of using the link structure to
assign an importance value to a web page which could be factored into the
ranking of query results. The algorithm had some disadvantages including the
fact that the computation was performed after the query was submitted making
response times rather slow.
7.11.2 PageRank
The founders of Google (Larry Page and Sergey Brin) created the PageRank algorithm, which gives an estimate of the importance of a page. PageRank looks at the number of links pointing to a web page and the relative importance of those links and therefore depends not just on the number of links but on the quality of the links. PageRank asserts that if a page has important links pointing to it, then its own links to other pages are also important. The PageRank computation can be made before the user enters the query so that response times are very fast.
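The published PageRank computation is more involved, but a minimal power-iteration sketch conveys the idea. The damping factor of 0.85 is the conventional choice, the link graph is the same assumed example as above (with 0-based page ids), and the sketch ignores refinements such as redistributing the score of dangling pages.

    import java.util.Arrays;

    public class PageRank {
        // One common formulation: PR(p) = (1-d)/N + d * sum over pages q
        // linking to p of PR(q)/outDegree(q), iterated until values settle
        public static double[] rank(int[][] links, int n, int iterations) {
            double d = 0.85; // conventional damping factor
            double[] pr = new double[n];
            Arrays.fill(pr, 1.0 / n);

            int[] outDegree = new int[n];
            for (int[] link : links) outDegree[link[0]]++;

            for (int it = 0; it < iterations; it++) {
                double[] next = new double[n];
                Arrays.fill(next, (1 - d) / n);
                for (int[] link : links) {
                    next[link[1]] += d * pr[link[0]] / outDegree[link[0]];
                }
                pr = next;
            }
            return pr;
        }

        public static void main(String[] args) {
            // Same assumed link graph as before: 0->1, 3->1, 1->2 (0-based ids)
            int[][] links = { {0, 1}, {3, 1}, {1, 2} };
            System.out.println(Arrays.toString(rank(links, 4, 50)));
            // Page 3 (index 2) ends highest: its single in-link comes from the
            // well-linked page 2, echoing the point made about Fig. 7.4
        }
    }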
7.12 Enterprise Search
Searching is not limited to whole web search, and there is a huge market for focused searching, or searching within a particular domain or enterprise. Companies such as Autonomy and Recommind (http://www.recommind.com/) use artificial intelligence combined with advanced search technology to improve user productivity. One of the key challenges lies in the fact that much of an organisation’s data may exist in disparate repositories or database systems which have their own unique search facilities. More recently it has been noted that with the provision of cloud-based systems such as Google Docs and Microsoft Live Office, people are naturally storing vital information outside the organisation’s networks. The provision of a federated search capability, where one search interface can access the maximum amount of relevant data, possibly including data stored in external cloud sites, has become critical. Technology used by whole web search engines such as PageRank is not always applicable to the enterprise case, where links between documents may be minimal or non-existent.
7.13 Multimedia Search
There has been increasing interest and huge investment in the development of new technology for searching audio and visual content. In its simplest form, this is a case of giving special consideration to the text in HTML tags for multimedia data. However, many vendors have gone beyond this and now provide a ‘deep search’ of multimedia data, including facilities such as searching with images rather than text. The ability to automatically cluster or classify multimedia data has also advanced considerably.
7.14 Collective Intelligence
Collective intelligence is an active field of research but is not itself a new phenomenon, and examples are cited in biology (e.g. evolutionary processes), social science and economics. In this chapter, we are using collective intelligence to refer to the kind of intelligence that emerges from group collaboration in web- and cloud-based systems. Collective intelligence, like ‘cloud’, is a term for which an exact definition has not been agreed, although it is mostly obvious when collective intelligence is being harnessed. The Center for Collective Intelligence at the Massachusetts Institute of Technology (http://cci.mit.edu/) provides a nice description by posing the question that tools based on collective intelligence might answer:
How can people and computers be connected so that collectively they act more intelligently than any individuals, groups, or computers have ever done before?
The rise of the web and Web 2.0 has led to huge amounts of user-generated content. Users no longer need any special skills to add their own web content, and indeed Web 2.0 applications actively invite users to interact and collaborate. Cloud computing has accelerated the process, as ease of access, storage availability, improved efficiency, scalability and reduced cost have all increased the attractiveness of storing information on the web and also made it easier and simpler to develop new innovative applications seeking to exploit collective intelligence.
There are many examples of collective intelligence where systems leverage the power of the user community. We list a few below to give a flavour of the topic:
1. Wikipedia is an outstanding result of a huge collaboration, and each article is effectively maintained by a large collection of individuals.
2. Many sites, such as reddit.com, allow users to provide content and then decide through voting ‘what’s good and what’s junk’.
3. Google Suggest provides a list of suggestions as soon as the user begins typing. The suggestions are based on a number of factors, but a key one is the previous searches that individuals have entered (see Fig. 7.5).
4. Genius Mixes from iTunes automatically suggest songs that should go well with those already in a user’s library.
5. Facebook and other social networks provide systems for ‘finding friends’ based on factors such as the links already existing between friends.
6. Online stores such as Amazon will provide recommendations based on previous purchases.
7. Image tagging together with advanced software such as face recognition can be used to automatically label untagged images.
8. The algorithms used by movie rental site Netflix to recommend films to users are now responsible for 60% of rentals from the site.
9. The Google+ social network allows users to +1 any site as a means of recommendation.
10. Mail systems such as Gmail will suggest people to include when sending an email.
In his book, The Wisdom of Crowds, James Surowiecki suggests that ‘groups are remarkably intelligent, and are often smarter than the smartest people in them’.
Fig. 7.5 Google suggest
He goes on to argue that under the right circumstances, simply adding more people will get better results, which can often be better than those produced by a small group of ‘experts’. A number of conditions are suggested for ‘wise crowds’:
1. Diversity: the crowd consists of individuals with diverse opinions.
2. Independence: the individuals feel free to express their opinions.
3. Decentralisation: people are able to specialise.
4. Aggregation: there is some mechanism to aggregate that information and use it to support decision-making.
Note that we are not talking about consensus building or compromise but rather a competition between heterogeneous views; indeed, certain parallels have been drawn with Darwinian evolution.
The PageRank algorithm discussed above is actually an example of collective intelligence, as the ordering is partly the result of a kind of voting occurring in the form of links between websites, and we can see how it easily meets all four criteria set out above. The web is a hugely diverse environment, and independent individuals are free to add links to any other page from their own pages. The underlying system is famously decentralised, and aggregation is central to the PageRank algorithm. Real search engines actually use a complex mix of mathematically based algorithms, text mining and machine learning techniques together with something similar to PageRank. Let’s take a look at some more examples of ‘Collective Intelligence in Action’.
7.14.1 Tagging
Tagging is a process akin to classification whereby items such as products, web pages and documents are given a label. In the case of automatic text classification, we saw that both a predefined set of categories and a set of labelled example documents were required to build a classifier. In this case people with some expertise in the domain would be required to give each example document the appropriate category labels. One approach to tagging is to use professionals to label items, although this could take up significant resources and could become infeasible, for example, where there is a large amount of user-generated content.
A second approach is to produce tags automatically, either via clustering or classification as described above or by analysing the text. Again we can use the term vector to obtain the relative weights of terms found in the text, which can be presented to the user in various visual formats (see below). A third option, which has proved increasingly popular in recent years and indeed has become ubiquitous on the web, is to allow users to create their own tags, either by selecting tags from a predefined list or by using whatever words or phrases they choose to label items.
This approach, sometimes referred to as a ‘folksonomy’ (a combination of ‘folk’ and ‘taxonomy’), can be much cheaper than developing a controlled taxonomy and allows users to choose the terms which they find the most relevant. In a folksonomy users are creating their own classification, and as they do, collective intelligence tools can gather information about the items being tagged and about the users who are creating the tags. Once a reasonable number of items have been tagged, we can then use those items as examples for a classifier which can then automatically find similar items. Tags can help with finding users with similar tagged items, finding items which are similar to the one tagged and creating a dynamic, user-centric vocabulary for analysis and collaboration in a particular domain.
7.14.2 Recommendation Engines
Recommendation engines are a prime example of ‘Collective Intelligence in Action’. In 2006, Netflix held the first Netflix Prize, a competition to find a program to better predict user preferences and beat its existing Netflix movie recommendation system by at least 10%. The prize of $1 million was won in 2009, and the recommendation engine is now reported to contribute to a large fraction of the overall profits. Perhaps the best known recommendation engine comes from Amazon. If we select the book ‘Lucene in Action’, we get the screen shown in Fig. 7.6, where two sets of suggestions are shown.

Fig. 7.6 Amazon recommendation engine
In fact, if you click on some of the suggestions you will soon pick up most of the recommended reading for this chapter. As is indicated on the page, the information is obtained largely by analysing customers’ previous procurement patterns. Many other features can be factored into the recommendations, such as similarity of the items analysed, similarity between users and similarity between users and items.
7.14.3 Collective Intelligence in the Enterprise
In large organisations, employees can waste significant time in trying to locate documents from their searches. Enterprise search companies such as Recommind (http://www.recommind.com/) encourage users to tag and rate items. The information is aggregated and then fed into the search engines so that other users can quickly get a feel for a particular topic from the tags and the usefulness derived from the classifications and recommendations of others. This also allows for the option of removing or reducing the ranking of poorly rated items, a process sometimes referred to as collaborative filtering.
7.14.4 User Ratings
Many e-commerce sites give users the option of registering their opinion regarding a product, often in the form of a simple star system. The ratings of users can be aggregated and reported to prospective buyers. Often the users are also able to write a review which other customers can then read, and many sites now offer links to social network sites such as Facebook and Twitter where products can be discussed. The ratings of a particular user can also be evaluated by comparing them with those given by other users.
Recently there has been significant interest in tools which can automatically mine textual reviews and identify the sentiment or opinion being expressed, particularly in the case of social network data. The software may not be completely accurate due to the challenges of natural language text mentioned above, but, as a recurring theme in collective intelligence, if the accuracy is reasonable and the number of reviews being mined is above a minimal threshold, there may still be great value in using the tool.
7.14.5 Personalisation
Personalisation services can help in the fight against information overload. Information presented to the user can be automatically filtered and adapted without the explicit intervention of the user. Personalisation is normally based on a user profile which may be partly or completely machine generated. Both the structure and content of websites can be dynamically adapted to users based on their profile. A successful system must be able to identify users, gather knowledge about their preferences and then make the appropriate personalisation. Of course, a degree of confidence in the system is required, and it would be much better to skip personalisation functions than to make changes based on incorrect assumptions, leading to a detrimental user experience and reduced effectiveness of the site.
We have already looked at web content mining, at least in terms of textual data, and web structure mining in terms of PageRank. Users also provide valuable information via interacting with websites. Sometimes the information is quite explicit, as in the case of purchasing, rating, bookmarking or voting. In e-commerce, products and services can be recommended to users based not only on the purchasing actions of other users but also depending on specific needs and preferences relating to a profile. Web usage mining, which deals with data gathered from user visits to a site, is especially useful for developing personalisation systems. Data such as frequency of visits, pages visited, time spent on each page and the route navigated through a site can be used for pattern discovery, often with the aid of machine learning algorithms, which can then be fed into the personalisation engine.
7.14.6 Crowd Sourcing
Many of the systems for harnessing collective intelligence require minimal or no additional input from the users. However, directly asking customers or interested parties to help with projects has proved a success in a number of areas. One of the earliest and most famous examples dates from 2000 when NASA started its ClickWorker study, which used public volunteers to help with the identification of craters. There are a number of systems springing up, such as IdeaExchange from Salesforce.com (http://success.salesforce.com/), where customers can propose new product solutions and provide evaluations of existing products. Amazon Mechanical Turk (https://www.mturk.com/mturk/welcome) uses crowdsourcing to solve a variety of problems requiring human intelligence.
7.15 Text Visualisation
We finish this chapter with a brief introduction to the field of text visualisation. It has long been known that where there is a huge amount of information to digest, visualisation can be a huge aid to users of that data. Recent years have seen a growing set of tools to automatically visualise the textual content of documents and web pages.
Tag clouds are simple visualisations, in use on the web since 1997, that display word frequency information via font size and colour. Users have found the visualisations useful in providing an overview of the content of text documents and websites. Whereas many systems are formed using user-provided tags, there has been significant interest in ‘word clouds’ or ‘text tags’ which are automatically generated using the text found in documents or websites. For example, the popular tool Wordle has seen a steady increase in usage, and Wordle or similar diagrams are commonly seen as part of a website to give users a quick view of the important topics.
Generally the word clouds are based on
frequency information after stop word removal and possibly stemming. If
stemming is used, it is important to display a recognisable word (often the
most frequently occurring form) rather than the stemmed form which may be
confusing to users. We can think of these diagrams as a visual representation of the term vector for a document; as with the term vector, the representation is based on a ‘bag of words’, and word proximity is generally not taken into account when generating the word cloud.
If
you go to the Wordle site ( http://www.wordle.net/ ), you can see examples
or you can create your own word clouds by pasting in text or pointing to resources
containing text. The system selects the most frequent words and then presents
them using various techniques to adjust font, colour, size and position, in a
way that is pleasing and useful to the user.
An experimental alternative is available at http://txt2vz.appspot.com/. Txt2vz is similar to Wordle but also shows how words are linked in the text and has some of the features of a mind map. Two formats of the txt2vz word cloud are available, and examples generated from the text of this chapter are shown below (Figs. 7.7 and 7.8).
Word clouds are simple and are commonly presented on websites with little or no explanation of how they should be used or interpreted. Often the words presented are made clickable as a means to provide a dynamic navigation tool which adjusts in real time as users add content.
Fig. 7.7 txt2vz.appspot.com graph of this chapter, simple graph format
7.16 Chapter Summary
In this chapter, we
have taken an overview of the various ways in which we can get the most out of
web- and cloud-based systems. In particular we have looked at Web 2.0
applications and the exciting possibilities of using intelligent tools to gain
valuable information and to enhance the user experience.
7.17 End of Chapter Exercise
7.17.1 Task 1: Explore Visualisations
We begin with a brief look at some examples of the many freely available visualisation tools.
1. Wordle (http://www.wordle.net/) is probably the most popular of the tools and a good starting point for the exercise. Experiment with the different clouds and create your own Wordle by pasting in plain text.
Fig. 7.8 txt2vz.appspot.com graph of this chapter, radial graph format
2. Visit the Many Eyes site (http://www-958.ibm.com/software/data/cognos/manyeyes/), look at the various visualisations and try creating your own.
3. Explore the tree cloud site (http://www2.lirmm.fr/~gambette/treecloud/) and create a tree cloud using the same text you used for Wordle. Explore the different options.
4. Try txt2vz (http://txt2vz.appspot.com/). Use the same text again and notice any similarities or differences from the other tools. Try the adjustments to find the best setting for your document.
7.17.2 Task 2: Extracting Text with Apache Tika
Online text is stored in a wide variety of formats such as HTML, XML, Microsoft Office, Open Office, PDF and RTF. For most applications trying to extract useful information from documents, the first step is the extraction of the plain text from the various formats.
Text extraction is a challenging task but can be rendered quite straightforward via the freely available Tika Java library from the Apache Software Foundation: http://tika.apache.org/. Tika can smoothly handle a wide range of file formats and even extract text from multimedia data. In this tutorial we will begin by downloading Tika and writing a short Java program to extract text. It would certainly help if you have some Java knowledge before you try this tutorial, and you can learn the basics of Java at http://docs.oracle.com/javase/tutorial/. However, you can run this tutorial without Java knowledge by just carefully following each of the steps below:
1. Open your Ubuntu VM.
2. Create a new folder on your VM under ‘home’ called ‘dist’.
3. Go to the Tika site http://tika.apache.org/ and locate the latest download: http://tika.apache.org/download.html
4. Download the latest version of the jar file (e.g. http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.1.jar) to the dist folder.
5. Open Eclipse, select File/New/Project, select ‘Java Project’ and select ‘next’.
6. Give your project a name such as ‘Text Analysis’, leave all the other options and select ‘next’.
7. Click on the ‘libraries’ tab and select ‘Add External JARs’.
8. Locate the ‘dist’ folder you created above, select the Tika jar file and then select ‘finish’.
9. If you are asked if you would like to switch to the Java perspective, select ‘yes’.
10. In the package explorer on the left, expand your project, right click on the ‘src’ folder and select ‘New/Class’.
11. In the name field, call the class ‘TikaTest’ and select ‘finish’.
12. Paste in the following code, replacing any code stubs automatically generated by Eclipse. Replace ‘testDoc.docx’ with a file of your choice, ensuring that the path to the document is correct on your VM.
    import java.io.File;
    import java.io.IOException;
    import org.apache.tika.Tika;
    import org.apache.tika.exception.TikaException;

    public class TikaTest {

        public static void main(String[] args) {
            TikaTest tikatest1 = new TikaTest();
            File f = new File("/home/hirsch/testDoc.docx");
            try {
                String s = tikatest1.getText(f, 1000);
                System.out.println(s);
            } catch (IOException e) {
                e.printStackTrace();
            } catch (TikaException e) {
                e.printStackTrace();
            }
        }

        public String getText(File f, int maxSize) throws IOException, TikaException {
            Tika tika = new Tika();
            // Limit the amount of text returned by the parser
            tika.setMaxStringLength(maxSize);
            try {
                return tika.parseToString(f);
            } catch (IOException e) {
                return "Error: " + e;
            }
        }
    }
13. Click on the save icon, then right click on ‘TikaTest.java’ in the package explorer and select ‘Run as’ and then ‘Java Application’. If all is well, you will see the first part of the text of your document appear in the console window. Test the code with several different file formats such as docx, doc, rtf, pdf, odt, html and xml.
14. If you are reasonably confident with Java, try to modify the program so that stop words are removed. You can search the web for a suitable set or use the following set:
‘a’, ‘an’, ‘and’, ‘are’, ‘as’, ‘at’, ‘be’, ‘but’, ‘by’,
‘for’, ‘if’, ‘in’, ‘into’, ‘is’, ‘it’,
‘no’, ‘not’, ‘of’, ‘on’, ‘or’, ‘such’, ‘that’, ‘the’, ‘their’, ‘then’, ‘there’,
‘these’, ‘they’, ‘this’, ‘to’, ‘was’, ‘will’, ‘with’
7.17.3 Advanced Task 3: Web Crawling with Nutch and Solr
Nutch is an open source library for web crawling and again uses a Lucene index for storing the crawled data. This task will take some time and effort and is probably for more experienced users. At the end of it, you will be able to perform a small crawl of the web, create a Lucene index and search that index.
Go to the Nutch page and examine the documentation (http://nutch.apache.org/) and follow the tutorial (http://wiki.apache.org/nutch/NutchTutorial). Try to run a small crawl on your virtual machine. The tutorial shows you how you can combine a web crawl with Apache Solr. Solr provides a server-based solution for projects based around Lucene indexes and will allow you to view and search the results of your crawl.
References
Alag, S.: Collective Intelligence in Action. Manning, Greenwich (2009)
Langville, A.N., Meyer, C.D.: Google’s PageRank and Beyond. Princeton University Press, Princeton (2006)
Lingras, P., Akerkar, R.: Building an Intelligent Web: Theory and Practice. Jones and Bartlett, Sudbury (2008)
Marmanis, H., Babenko, D.: Algorithms of the Intelligent Web. Manning, Greenwich (2009)
McCandless, M., Hatcher, E., Gospodnetic, O.: Lucene in Action. Manning, Greenwich (2010)
Segaran, T.: Programming Collective Intelligence. O’Reilly Media, Sebastopol (2007)
Surowiecki, J.: The Wisdom of Crowds: Why the Many are Smarter than the Few and How Collective Wisdom Shapes Business, Economies, Societies, and Nations. Anchor (2005)