What the reader will learn:
• That there are many types of data storage available, meeting different needs
• That the web and cloud provide special challenges to database users but also provide opportunities to design data-centric solutions that are not just relational
• Some hands-on experience of using NoSQL databases and understanding of the business needs addressed by each solution
• An appreciation of the other data-centric uses the cloud can help with, such as backup and recovery and disaster recovery
6.1 Historic Review of Database Storage Methods
These days we tend to think only of computer-based databases, but databases have been with us for centuries. The ancient Egyptians stored data on stone tables in columns and rows. More recently, libraries used a card index system to allow users to retrieve an ISBN number from an indexed drawer full of cards. The key attributes of a database could be seen as:
• They contain data (though not necessarily information).
• They allow the data to be read and can allow data to be updated, deleted and inserted.
• There are often mechanisms to (rapidly) search for data items.
In the modern era, the first universally accepted standard model
was that of the hierarchical database. They stored data in a tree-like
hierarchy. When querying the data, you navigated down the hierarchy to get to
the record you needed (see Fig. 6.1 ).
Users of current relational databases may recognise the idea as very similar to that of a B-tree index, which is usually the default form of index storage in current Relational Database Management Systems (RDBMS). This is not accidental. The speed with which you get to an item of data is potentially much greater than with other search algorithms, and retrieval always takes a known, manageable number of data reads.
Fig. 6.1 The hierarchical data storage model
6.2 Relational Is the New Hoover
Rapid though certain queries can be in hierarchical databases, they do have some significant downsides. Especially from the perspective of today, when we have extremely flexible RDBMS available, the rigidity of the hierarchy means that accessing data in any other way than through the tree structure is very resource costly and slow.
Unlike today, at the time when
these databases were in common use, hard disk space was very expensive. The
only way to provide some alternative routes through the data was effectively to
duplicate the data within different trees. This was very costly.
Ad hoc queries of the type we take
for granted now were not easy either. In the example in Fig. 6.1 , we
could probably already guess that we would have a product, buyer and salesman
table in a relational schema for the same data. Once stored, we can select our
required data at will, by forming joins between tables.
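To make the contrast with the hierarchical approach concrete, the following is a minimal sketch of such an ad hoc join, written in Python against the standard sqlite3 module purely for illustration; the table and column names are our own invention rather than part of Fig. 6.1.

import sqlite3

# In-memory database purely for illustration; the schema is a guess at the
# product/buyer/salesman split mentioned above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE salesman (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE buyer    (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE product  (id INTEGER PRIMARY KEY, name TEXT,
                           salesman_id INTEGER, buyer_id INTEGER);
    INSERT INTO salesman VALUES (1, 'Jones');
    INSERT INTO buyer    VALUES (1, 'Smith');
    INSERT INTO product  VALUES (1, 'Widget', 1, 1);
""")

# An ad hoc query formed by joining the three tables at will.
for row in conn.execute("""
    SELECT p.name, s.name, b.name
    FROM product p
    JOIN salesman s ON s.id = p.salesman_id
    JOIN buyer    b ON b.id = p.buyer_id"""):
    print(row)  # ('Widget', 'Jones', 'Smith')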
At a time when hard disk drives were expensive
and managers wanting to understand the data in the system would often have to
wait days, or even weeks, for the query to be coded, punched and run, it is not
surprising that Codd’s seminal works on the relational model in the 1970s
caused such an upheaval in the database world.
SQL, the standardised language which comes
with relational databases, is, provided the database is designed properly,
extremely fl exible in what it allows a
user to query. The query language itself, whilst it can challenge a newcomer,
is a far cry from the difficult alternative of writing your own data access
mechanisms in a programming language. Its near-English nature means that many
managers became free of the constraints of having to use programmers to get the
data they needed to understand their businesses better.
Note: It is the change in the way in which we store the data that changes the way we can access it. No one storage model is perfect. Hierarchical still works well if you only ever query data using a known tree structure. Other models have coexisted with relational over the last couple of decades. Object and object relational are perhaps
still the most common, for example. It is the change in storage
brought about by web and cloud technologies that have now generated a number of
alternative ‘NoSQL’ databases that we explore later in this chapter.
6.3 Database as a Service
In previous chapters, we have described a variety of cloud
service models. One that is important in terms of this chapter is the idea of
databases being available as a service.
Cloud computing provides opportunities for efficient data management where users define, update and query a database which is running in the cloud environment. As with the other 'as-a-service' offerings, it has the familiar characteristics of pay-per-use and an on-demand model. Instead of having the necessary resources for deploying a database management system, companies may prefer to employ DaaS, mainly for reasons of flexibility and cost.
Many organisations have invested heavily in client–server-based database systems over the last two decades. Names like Oracle, SQL Server and DB2 are in everyday parlance for most corporate data professionals. However, heavy investment has also been made to open up the pay-for-use approach to storing data. The big players include Amazon RDS and Microsoft Azure, although most vendors now include cloud versions of their RDBMS. There are new players in the smaller-scale database field too, such as Amazon's SimpleDB.
6.4 Data Storage in the Cloud
The cloud has not only
enabled database management systems to become pay-as-you-go; it has also
provided organisations with the opportunity to store their raw data, or perhaps
copies of it, off-site. By storing multiple copies of the data on several
servers, these sites have built in ‘redundancy’, that is, the objects you store
in them are replicated. This means a single point of failure will not cause the
loss of any data overall. Organisations can use services such as Amazon’s S3 or
more specialist backup and disaster recovery vendors like Backup Technology to
keep copies of data in the cloud, rather than having an expensive second,
standby server on-site.
6.5 Backup or Disaster Recovery?
It is to be hoped that the computer-using reader will
understand the concept of backing up their valuable data! Hard drives do go
wrong, and when they do, they can destroy information. For many organisations,
this information is vital to their livelihood. Losing it would not be a minor
inconvenience but could actually put them out of business.
In database systems, it is normal for
database administrators (DBA) to be responsible for ensuring no data is ever
lost. The difference between backup and DR can be seen as a matter of scale. We
can describe a backup strategy as an approach to ensuring no data is
permanently lost. At its simplest, this means taking copies of the changes that
are made to the database over a certain period (a week for example) and storing
those changes externally to the database. Should data go missing, as a result
of some disk failure, for example, a recovery mechanism comes into play to
allow the DBA to ‘replay’ the changes to the database since a point in time
when the entire database was consistent. Depending upon how much data is lost
and which parts of the database are affected, this can be a relatively simple
process or can mean retrieving the last full backup copy of the database and
then replaying all changes since that backup.
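As a minimal sketch of the 'last full backup plus replayed changes' idea, assuming nothing about any particular DBMS (a real system replays a redo or transaction log rather than a Python list):

# Hypothetical, simplified recovery: restore the last full backup, then
# replay every logged change in order to reach a consistent state.
full_backup = {"acct_1": 500, "acct_2": 200}        # last consistent copy
change_log = [("acct_1", -100), ("acct_2", +100)]   # changes made since that copy

def recover(backup, log):
    db = dict(backup)                 # restore the full backup
    for key, delta in log:            # replay the changes in order
        db[key] = db.get(key, 0) + delta
    return db

print(recover(full_backup, change_log))  # {'acct_1': 400, 'acct_2': 300}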
Backup and recovery normally works on
the premise that the server and DBMS are up and running and that what needs
repairing are holes in the database itself. Disaster recovery is focused on how
to keep a business’s critical systems available when calamities happen:
earthquakes, fires, floods, accidental deletion of tables or
other user mistake—things that can affect the very fabric of our IT systems.
DR
often requires the duplication of systems so that if the master system goes
down, a parallel system in a different site kicks in, allowing users to
continue using the system. This is, of course, costly, as it means duplicating
expensive IT equipment. The speed with which a system becomes available again
is measured as the Mean Time to Recovery (MTTR). Better MTTR will normally
require expensive solutions. Parallel systems connected by high-speed networks
mean that the system outage time could be measured in seconds but do mean
virtually doubling the hardware and maintenance costs of an organisation’s IT
systems.
Less costly alternatives exist. A
standby server can, for example, be separate from the active system but has a
complete backup of the database stored on it and has changes applied to it on a
regular basis, often overnight.
Now cloud brings other alternatives and
is causing organisations to rethink their backup and DR strategies. Traditional
backup and recovery methods have largely been based around magnetic tape as a
medium for storing data. The advantage of this medium is that it is relatively
cheap. The disadvantage is that it is slow to access and can be quite labour
intensive. Systems where the user simply clicks a button to send their backup
information to one of several servers somewhere in the cloud are putting increasing pressure on the traditional approach.
However, there are issues, real and perceived,
about the security of storing your data in the cloud. Many decision-makers are
just more comfortable with knowing where their data is physically stored and
putting in place a security mechanism to protect that data. Putting data in the
cloud, they fear, makes it open to hacking.
Of
course, data on your own server is not necessarily always safe. Encrypting data
before it is passed into the backup environment is one way to add extra layers
of protection around the data. True though these facts may be, to the
frustration of the providers of disk in the cloud, the perception is still
sometimes that you are safer having control of your own data on your own
servers. We cover security and governance in the cloud in more detail in Chap. 10 .
6.6 If You Only Have a Hammer – Or Why Relational May Not Always Be the Right Answer
Relational databases (RDBMS) have dramatically changed the way we deal with organisational data. They allow nonprogrammers to access data relatively easily on an ad hoc basis. They are also very safe, and it is a given that the data you look at when you query the database is correct, valid and current. This is no accident. RDBMS are designed around the ACID characteristics; that is, data held in them have:
• Atomicity. All changes must be 'all or nothing' to allow transactions to work (see below).
• Consistency. All data must always follow the rules as defined in the schema.
• Isolation. Any data being affected by one transaction must not be used by another transaction until the first is complete.
• Durability. Once a transaction is committed, it should become permanent and unalterable by machine failure of any type.
The business need addressed by an RDBMS is often based around the concept of a transaction as opposed to a process. As a simple example, if I wanted to transfer £100 from my account to Richard's to settle a debt, we can see two basic processes:
1. Reduce my account by £100
2. Increase Richard's account by £100
Once both processes have completed, the
transaction is complete, and we can say the changes have been committed to the
database. However, should the database crash after step (1), but before step
(2), the transaction is not complete. If we were only dealing with successful
processes, then my account would be £100 lighter, but Richard would still think
that I owed him £100.
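A minimal sketch of the same transfer as a single atomic transaction, using Python's built-in sqlite3 module (the account table and balances are invented for illustration; any ACID-compliant RDBMS behaves in the same spirit):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES ('me', 500), ('richard', 200)")
conn.commit()

try:
    with conn:  # both UPDATEs commit together, or neither does
        conn.execute("UPDATE account SET balance = balance - 100 WHERE name = 'me'")
        conn.execute("UPDATE account SET balance = balance + 100 WHERE name = 'richard'")
except sqlite3.Error:
    # If anything fails between the two steps, the whole transaction rolls back,
    # so the database never shows my account debited without Richard's credited.
    pass

print(dict(conn.execute("SELECT name, balance FROM account")))
# {'me': 400, 'richard': 300}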
Both in the world of work and on university
courses, the ACID model is often taken for granted as a requirement for any
database. We can lose sight of the fact that sometimes our data needs are not
transactionally focused, and therefore, sometimes, the relational model isn’t
the most appropriate for what we need to do with the data we are storing. This
statement may come as a shock to people for whom a database is a relational
database!
6.7 Business Drivers for the Adoption of Different Data Models
Relational databases have been at the centre of most business
systems for decades. For large organisations, with large volumes of data, they
are essential. However, as web-driven systems began to expand, it became clear
that the relational model is not good at everything. Google eventually took the
decision to write their own database system to support their global requirement
for information searching, for example.
The story behind BigTable is well documented, especially by Chang et al. (2008). Its basic data model is not relational, as
that model did not allow Google to provide rapid text-based searches across the
vast volumes of data stored in the web. It is rather a sparse, distributed
multidimensional database. Although it is Google’s own product and not openly
available, other such databases do exist. HBase is an open source database that
claims to have a similar data model to that of BigTable, for example.
Why would Google feel it necessary to put so much effort and resource into creating their own database? Simply because relational databases, with their large processing overhead in terms of maintaining the ACID attributes of the data they store and their reliance on potentially processor-hungry joins, are not the right tool for the task they had before them: quickly finding relevant data from terabytes of unstructured data (web content) which may be stored across thousands of geographically disparate nodes. In short, they found that relational does not scale well for this type of data.
6.8 You Can’t Have Everything
DBAs have long had to
balance a number of business drivers. And DBAs have long realised that there
are balances to be found between competing drivers. Availability (the fact that
the database and systems using it should ideally be always ready to accept work)
is one such driver, for example. A bit like grandma’s apple pie, everyone wants
their database to be five 9s (99.999%)
available. But that sort of availability costs both money and time. Moreover,
particularly in an ACID environment, built-in redundancy (having multiple
copies of the data in case one node fails) is also likely to slow performance
(another key driver for a DBA) as more writes are required to keep the system
consistent.
As
we have alluded to already, some of today’s data processing tasks involve data
which is duplicated (for redundancy and performance reasons) across potentially
many different nodes in the web. This can be when RDBMS start to struggle,
especially with large volumes of such distributed data.
The DBA of such an RDBMS may start by
examining ways that he or she can speed retrieval up by using secondary
indexing. The problem is that whilst this may speed a read operation, it is
also likely to slow every write operation. Further, they may find that the RDBMS, in a quest to maintain
ACID properties, struggles with keeping every node consistent.
There have been some interesting papers
published which attempt to justify the NoSQL approach by referring to CAP
theorem. Brewer (2012) makes the point that there is a three-way pull when
designing distributed systems in general and for the web in particular. The
three forces competing are the needs for the data to be:
• Consistent
• Available
• Partition tolerant (coping with breaks in physical network connections)
Most traditional RDBMS would guarantee that all the values in all our nodes are identical before allowing another user to read the values. But as we have seen, that comes at a significant cost in terms of performance.
6.9 Basically Available, Soft State, Eventually Consistent (BASE)
BASE is the NoSQL
operating premise, in the same way that traditional transactionally focused
databases use ACID. You will note that we move from a world of certainty in
terms of data consistency to a world where all we are promised is that all
copies of the data will, at some point, be the same.
Even here there is no single approach. The
eventual state of consistency can be provided as a result of a read-repair,
where any outdated data is refreshed with the latest version of the data as a
result of the system detecting stale data during a read operation. Another
approach is that of weak consistency. In this case, the read operation will
return the first value found, not checking for staleness. Any stale nodes
discovered are simply marked for updating at some stage in the future. This is
a performance-focused approach but has the associated risk that data retrieved
may not be the most current.
Again, some people used to traditional RDBMS
environments will find it hard to
handle the last sentence! In some applications, however, it is not critical for
every user to have identical data at all times. Think, for example, of a
discussion board application:
In
a thread, let’s say 100 answers exist as a result of a user, A, asking a
question. User B decides to add comment 101. A user C, logging on to review the
threads after user B has committed their comment, should expect to see 101
answers. But if one node (node N) fails to get rapidly updated and user C
retrieves their data from N, they will only see 100 answers. But does this
really matter? User C is in the same position as if they had viewed the thread
a minute ago. It’s easy to say it does matter, but if we give you the option of
free access to an uncertain number of answers or charge you £10 to access a
guaranteed up-to-date list, you may start to think it matters less!
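The following is a minimal sketch, in Python, of the discussion-board scenario with one lagging replica and a read-repair on access; the classes are invented for illustration and ignore networking, node failure and timestamps:

class Node:
    def __init__(self):
        # Each replica starts with the same 100 answers.
        self.answers = ["answer %d" % i for i in range(1, 101)]

nodes = [Node(), Node(), Node()]

def write(comment):
    # The write from user B reaches two replicas immediately; node N lags.
    for node in nodes[:2]:
        node.answers.append(comment)

def read_with_repair(node):
    # Read-repair: if this replica is stale, refresh it from the most
    # up-to-date copy before returning data to the client.
    newest = max(nodes, key=lambda n: len(n.answers))
    if len(node.answers) < len(newest.answers):
        node.answers = list(newest.answers)
    return node.answers

write("answer 101 from user B")
print(len(nodes[2].answers))            # 100 -- user C could read stale data here
print(len(read_with_repair(nodes[2])))  # 101 -- repaired during the read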
6.10 So What Alternative Ways to Store Data Are There?
As we have seen, the ACID requirements that have guided
relational database designers for decades are not sacrosanct. Moreover,
implementing the procedures needed to ensure that the ACID properties can be
guaranteed is a large overhead on the DBMS—an overhead that costs in terms of
performance.
There may, for example, be occasions when the database user would be willing to forgo the guarantee that every user sees the same version of the data (consistency) in exchange for a rule that everyone can see a version of the data immediately. Techniques used to ensure consistency in
RDBMS environments typically include storing old values in an undo area and
preventing other users seeing data by locking rows, or even tables, whilst
transactions happen. For the locked-out user, this seems like the database is
performing poorly or, worse, is not functioning properly.
6.11 Column Oriented
We have become very used to thinking of rows of data. Many of the database professionals' standard tools force this way of looking at data upon us. Not only do tables hold rows, but data can be exported as a CSV (comma-separated value) or fixed-length text file. For many applications, this model is entirely appropriate.
However, in some circumstances, the majority of activity against the data happens in a column-focused way. Often in data analysis, for example, we will be taking averages, looking for high and low values or looking for trends in one particular column.
In those circumstances, it is rather inefficient (in terms of disk reads and CPU usage) to have to find each row, locate the column concerned, check if it contains data and then read the value. Many high-end RDBMS have invented methods for speeding this read-and-search process, but, nonetheless, it is wasteful if all you want to do is read the same attribute from each row. We should remember that reading from disk is a slow process and limiting disk reads is an important aspect of any DBMS optimisation process.
The column-oriented approach improves read efficiency (though not necessarily write efficiency) by not having any row identifier and by packing data to allow compression schemes to reduce the volume of disk storage required. This allows for more rapid analysis of those column values but does mean that full tuples (or records) have to be recreated by using positional information to look up the other data items from a row, and that is inefficient. Column based is therefore better suited to implementations where analysis of one column is needed.
Moreover, because the best performance, in terms of compression, will happen when similar values are adjacent to each other, the values of a column should ideally be known at the outset, so they can be written to disk in the most efficient way. One insert could result in many rewrites as the data is shuffled around to accommodate the new value, making low data volatility essential.
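A minimal sketch of the difference between the two layouts, using plain Python structures invented for illustration (no particular product's storage format is implied):

# Row-oriented: averaging one attribute means touching every whole row.
rows = [
    {"sale_id": 1, "book_id": 123, "price": 15},
    {"sale_id": 2, "book_id": 143, "price": 25},
    {"sale_id": 3, "book_id": 144, "price": 43},
]
avg_from_rows = sum(r["price"] for r in rows) / len(rows)

# Column-oriented: each attribute is held as its own contiguous (and easily
# compressed) list, so a column scan reads only the data it needs.
columns = {
    "sale_id": [1, 2, 3],
    "book_id": [123, 143, 144],
    "price":   [15, 25, 43],
}
avg_from_columns = sum(columns["price"]) / len(columns["price"])

# Rebuilding a full record needs positional look-ups across every column,
# which is why column stores favour analysis over record-at-a-time access.
second_record = {name: values[1] for name, values in columns.items()}
print(avg_from_rows, avg_from_columns, second_record)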
Examples include Ingres Vectorwise, Apache
HBase and Cassandra. We will be using the latter as an example of the type for
you to experiment with later in this chapter.
6.12 Document Oriented
Relational database design techniques all work on the assumption that the data you wish to store is predefinable and will always contain the same type of data. But in life, we sometimes collect data which is less well structured.
Take as an example the following contact
information from a CRM database:
FirstName="George", Surname="Smith",
Pets=[{Type:"Dog", Name:"Flint", Age:5},
      {Type:"Cat", Name:"Jade", Age:7}],
Tel="0111223344"

FirstName="Mary", Surname="Songo",
Pets=[{Type:"Dog", Name:"Ajax", Age:2}],
Mob="077777777", Tel="0211225555"
Both of these are entirely realistic data
about the customer, but the data items are not
identical. Whilst a relational database would have a schema which included
columns to store all potential data items, document-oriented databases have the
flexibility to allow you to store this sort of semi-structured data in a schema-less environment.
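As a minimal sketch of what 'schema-less' means in practice, the two contact records above can be stored side by side even though they carry different fields. The example below uses Python with the pymongo driver and assumes a MongoDB server running locally; the database and collection names ('crm', 'contacts') are invented:

from pymongo import MongoClient

db = MongoClient()["crm"]  # assumes mongod is running on localhost

db.contacts.insert_many([
    {"FirstName": "George", "Surname": "Smith",
     "Pets": [{"Type": "Dog", "Name": "Flint", "Age": 5},
              {"Type": "Cat", "Name": "Jade", "Age": 7}],
     "Tel": "0111223344"},
    {"FirstName": "Mary", "Surname": "Songo",
     "Pets": [{"Type": "Dog", "Name": "Ajax", "Age": 2}],
     "Mob": "077777777",   # extra field, no schema change needed
     "Tel": "0211225555"},
])

# Query across the differently shaped documents.
for doc in db.contacts.find({"Pets.Type": "Dog"}):
    print(doc["FirstName"], doc["Surname"])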
Many of the document-centric databases don't allow data to be locked in the way that is required for atomic transactions in RDBMS. Since locking is a significant performance overhead, this enables them to claim performance advantages over traditional databases.
The downside, of course, is that some applications absolutely require secure
transactional locking mechanisms. Document oriented is probably not the right
vehicle for highly structured data in a transaction-dependent environment, as
occurs in many OLTP systems.
The performance advantages that accrue from
document-centric implementations mean that they are often used when there are
large volumes of semi-structured data, such as web page content, comment
storage and event logging. They may well also support sharding (spreading a
table’s rows) of data across multiple nodes, again as a performance device.
Examples include CouchDb and MongoDB. We will
be using the latter as an example of the type for you to experiment with later
in this chapter.
6.13 Key–Value Stores (K–V Store)
In some situations, the database user needs are even more
basic than either of the two examples above. Sometimes you just want to store
some information against a particular key (or index). If you always know the
key value when you want to get data out from the store and you don’t need the
rule enforcement that comes with an RDBMS, then a K–V store will probably be
the fastest option. See the section about MapReduce in Chap. 5 .
Because of their light footprints, such
databases can often be built into applications, removing the need for any DBMS.
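As a minimal sketch of the idea, the snippet below uses the redis-py client against Redis (one of the examples named below); the key name, the stored value and the locally running server are all assumptions for illustration:

import redis

kv = redis.Redis(host="localhost", port=6379)

kv.set("session:42", "user=george;basket=3 items")  # write a value under a known key
print(kv.get("session:42").decode())                # read it back by the same key

# There is no schema, no joins and no ad hoc querying: if you do not know the
# key, the store cannot help you find the value.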
As
with all the data storage options mentioned here, the real driver for the move
from RDBMS being the single answer to all database needs has been the web and
latterly the cloud. The enormous amounts of data for applications like Facebook
and Twitter will be stored in data centres across many different geographic
locations.
Horizontal scaling is the process of having
that data spread across many nodes, and many of these newer, NoSQL-type
database systems are designed from the start to be able to cope with this
physical aspect of the data store. RDBMS on the other hand, with query
responses typically created by joining data from two or more tables, can
struggle with managing distributed data that is spread across many nodes.
Examples include
Voldemort and Redis.
6.14 When to Use Which Type of Data Storage?
The first point to make is that the various
types of database need not be mutually exclusive. This is not, as some websites
argue, a great battle to the death for the future of databases. Relational
databases certainly handle transactional data well, and, for the foreseeable
future, especially given the huge levels of investment in RDBMS made by most
organisations, they will remain the most frequently used type of database.
When data professionals are looking at
querying large datasets that may be distributed across the web, however, some
other tools have entered their armoury. In undertaking the tutorials later this
chapter, you will be able to gauge for yourself what the relative strengths and
weaknesses are. And it is important to say that every application will be
different, but as a rule of thumb, we can say:
1. Column-based databases allow for rapid location and return of data from one particular attribute. They are potentially very slow with writing, however, since data may need to be shuffled around to allow a new data item to be inserted. As a rough guide then, traditional transactionally orientated workloads will probably fare better in an RDBMS. Column based will probably thrive in areas where speed of access to non-volatile data is important, for example, in some decision support applications. You only need to review marketing material from commercial contenders, like Ingres Vectorwise, to see that business analytics is seen as the key market and speed of data access the main product differentiator.
2. Document-centric databases are good where the data is difficult to structure. Web pages and blog entries are two oft-quoted examples. Unlike RDBMS, which impose structure by their very nature, document-centric databases allow free-form data to be stored. The onus is then on the data retriever to make sense of the data that is stored.
3. If you do not need large and complex data structures and can always access your data using a known key, then key–value stores have a performance advantage over most RDBMS. Oracle has a feature within their RDBMS that allows you to define a table as an index-organised table (IOT), and this works in a similar way. However, you do still have the overhead of consistency checking, and these IOTs are often just a small part of a larger schema. RDBMS have a reputation for poor scaling in distributed systems, and this is where key–value stores can be a distinct advantage.
6.15 Summary
In the previous chapters in this part, we have looked at cloud from the application perspective. In this chapter, we have seen that there are many different data storage methods available for a database. The decision as to which is the right one is driven by the task at hand. How we estimate the potential benefits of a move to cloud is covered in more detail in Chap. 8.
We
have also seen how cloud can help with typical database tasks such as backup
and recovery and disaster recovery whilst reviewing the business drivers that
would come into play for using the cloud for these tasks.
This chapter is different to the others
in that the following tutorials are longer. Many readers will have had some
experience of the relational model and SQL already, but we felt that readers
need to appreciate the different approaches on offer. In the following
tutorials, therefore, we concentrate on two of the more modern approaches—one
column based and one document based.
6.16 Further Reading
The tutorials that follow provide pointers to more resources about Cassandra and MongoDB.
'Transforming relational database into HBase: a case study' by Chongxin Li is an interesting discussion of the complexities of mapping relational data to HBase. Chongxin Li: Transforming relational database into HBase: a case study. 2010 IEEE International Conference on Software Engineering and Service Sciences (ICSESS), pp. 683–687 (16–18 July 2010). doi: 10.1109/ICSESS.2010.5552465. See also Chang et al. (2008), cited earlier in this chapter.
6.17 Tutorials
In this section, we ask
you to try examples of NoSQL databases. The business case for these trials is
given below, and sample datasets with further work can be downloaded from our
book website.
6.18 BookCo
In this case study, we
are looking at data from a chain of booksellers called BookCo. You are the
chief information officer of this company that sells books and other related
products via the Internet. You have many information needs to address, but you
feel you need to assess some new tools which may help with some of those tasks.
BookCo has already invested heavily in a leading client/server database
solution, and that will remain at the centre of most transactional business
needs. However, there are two areas that you feel are worthy of further
investigation:
1. One task that has been problematic in the past has been the statistical analysis of the sales data. In particular, you need to increase the speed at which the analysts get data back on trends and averages. You decide to investigate a column-based approach as you have heard that this can dramatically increase query performance.
2. You capture communications from customers from a variety of sources. As the BookCo chain is international, the different sources may well be generated anywhere in the world. You need to investigate if it would be useful to allow this data to be stored using a document-centric data store, rather than an RDBMS.
6.19 The Column-Based Approach
There are a number of
column-based solutions available. To get a feel for the approach, you have
decided to give Cassandra a try. It is an open source solution, so there are no
costs for this experiment.
6.20 Cassandra Tutorial
6.20.1 Installation and Configuration
The tutorial
demonstrates the use of Cassandra on Ubuntu Linux. Cassandra will also run on
other Linux servers and can be installed on Windows.
This guide was created using Ubuntu Linux
11.10. As at April 2012, the current stable release is 1.0.9. It is expected
that basic Unix commands are known in order to install the system. If you are
unsure, there are many web-based Linux command guides. If you don’t mind being
called 'clueless', this is a good starting point: http://www.unixguide.net/linux/linuxshortcuts.shtml
The first step is to download the latest version of Cassandra (http://cassandra.apache.org/download/). Make sure it is the latest stable version. Once you have the download page in your browser:
• Select to download the latest bin version. As at April 2012, the current version is apache-cassandra-1.0.9-bin.tar.gz.
• Save the binary to your preferred location, for example, your desktop.
• Open a Linux terminal window and navigate to the folder where the file was saved.
• Decompress the .gz file using the following process.
#Assuming that the file was saved in the folder Desktop
$cd Desktop
$tar xzf apache-cassandra-1.0.9-bin.tar.gz
#where 1.0.9 is the version and should be changed to the version you have downloaded
Cassandra needs you to install the JRE dependency if you don't already have it. Use the following command:
$sudo apt-get install openjdk-6-jre
#wait for the download and installation process to finish
Moreover, some directories need to be created for Cassandra to store its data and logs, and appropriate permissions should be granted to users. This is a once-only task. Open a terminal session and perform the following operations. (Replace USERNAME with the OS authentication username.)
$sudo mkdir -p /var/log/cassandra
$sudo chown -R USERNAME /var/log/cassandra
$sudo mkdir -p /var/lib/cassandra
$sudo chown -R USERNAME /var/lib/cassandra
The next step is to run the server. This has
to be done through a terminal session. This step must be repeated each time you
reboot your OS.
Navigate again to the folder into which
you have extracted Cassandra. Then, navigate to the bin directory.
To start the server, use the 'cassandra' command:
$cd apache-cassandra-1.0.9
$cd bin
$./cassandra
Note: It is important not to close this terminal.
Now that the server is running, the client interface should be started. Open a new terminal session and execute the commands below, after navigating to the bin folder as before:
$cd apache-cassandra-1.0.9
$cd bin
$./cassandra-cli
Welcome to the Cassandra CLI. Type ‘help;’ or ‘?’ for help.
Type ‘quit;’ or ‘exit;’ to quit.
[default@unknown]
Once the client interface is open, we need to connect to the local server you started earlier by using the connect command. By default, the port is 9160, and notice that commands should end with a semicolon ';'. By default, the system will connect to the 'Test' cluster/keyspace.
[default@unknown] connect localhost/9160;
Connected to: "Test Cluster" on localhost/9160
To recap, we have downloaded and installed Cassandra. We then started the Cassandra server in a terminal session. In a separate session, we then started the client. Finally, we used the client to connect to our default 'Test Cluster' using port 9160.
Note: There are alternative graphical
user interface (GUI) tools you could use with Cassandra instead of the
Cassandra CLI. However, the CLI makes the technical underpinning far more
visible and therefore gives the new user a better insight into how Cassandra
works.
6.20.2 Data Model and Types
Before we get our hands dirty with our new toy, we need to
understand some of Cassandra’s basic concepts.
A keyspace in Cassandra can be seen as the
equivalent to a database schema in a relational DBMS, and it is a group of
related column families. With an attribute called placement_strategy, you can
determine how you will distribute replicas around the nodes you will be using.
A column family is very broadly equivalent to a table in an RDBMS. Each such family is stored in a separate file. The file itself is sorted into key order. It is therefore important, in terms of reducing disk reads and therefore improving performance, that columns which are regularly accessed together should be kept together, within the same column family, and that data types within a family should ideally be the same.
A column is the smallest unit of storage in Cassandra. A standard column is composed of the tuple of a unique name (key), value and a timestamp. The key identifies a row in a column family. The timestamp corresponds to the last update time.
A supercolumn is a list of columns used to
store multiple items of a similar type.
Supercolumns do not have timestamps.
Put another way, the building blocks we will be reviewing in Cassandra are:
• A column, which is a key–value pair.
• A supercolumn, which is a collection of associated columns that are sorted.
• A column family, which is a container for columns which are sorted. These columns can be referenced through a row id.
Examples may help clarify the difference between standard and supercolumns. This is an example of a standard column in use:
{
  "example" : {
    column: "emailaddress", value: "example@example.com", timestamp: 123456
  }
}
And here is an example of a supercolumn in use:
{
  name: "homeaddress",
  value: {
    street: {name: "street", value: "Oxford", timestamp: 12345},
    postcode: {name: "postcode", value: "SO234", timestamp: 123456}
  }
}
Note how there are a variety of values in the
supercolumn but one, and only one, value in the column.
Now we have Cassandra installed, we should get used to the basic commands. One of the first useful commands to learn is the show command. If we get into the situation where we forget which cluster we are working on, we can ask Cassandra by using the 'show cluster name;' command.
And if we need to know which keyspaces there are, we again use show:
[default@unknown] show cluster name;
Test Cluster
[default@unknown] show keyspaces;
Keyspace: system:
....other names would appear here
6.20.3 Working with Keyspaces
The output of the last command displays various information about the 'system' keyspace. This is used mainly for system tasks and is not dissimilar to the system tablespace in Oracle, for example. Once a new keyspace is created, it would also be displayed as an output of the command.
Initially, a new keyspace (analogous to a relational database) has to be created using the 'create keyspace KeyspaceName [with att1="value1"] [and att2="value2"]' command, where KeyspaceName corresponds to the name you want to give to the keyspace. Two attributes can be added to a keyspace:
1. The replication factor
2. The replica placement strategy
Until we get a feel for Cassandra, however,
these are not required as we are using Cassandra locally instead of on many
machines. But don’t lose sight of the fact that one of the main reasons that
Cassandra was invented was to be able to cope with large volumes of data in a
distributed, replicated environment.
After creation, to connect to a keyspace, we use the use KeyspaceName command, where KeyspaceName is the name given to the created keyspace.
For the BookCo sales records, we are going to
create a keyspace with the name sales. When doing this, we will see that
Cassandra will respond with a unique UUID number:
[default@unknown] create keyspace sales;
89478600-c772-11e0-0000-242d50cf1fbf #UUID
Waiting for schema agreement…
… schemas agree across the cluster
[default@unknown] use sales;
Authenticated to keyspace: sales
Working with Column Families
To create a column family (analogous to a relational table), we should provide a
name and a comparator. The comparator is required by Cassandra because it needs
to know the order in which to store data, and the comparator is responsible for
this sorting of the columns within the column family. Once any new column is
added, it is compared with others and is sorted accordingly—during the insert.
This is important as it is not possible to sort results during a query. One way
of dealing with the lightweight query facility is to have multiple column
families, one for each expected query return type.
Columns in Cassandra are sorted based on their name. So the comparator can be used to tell Cassandra how to order columns inside their row and, by extension, how the columns will be sorted when they are returned. The comparator can also be used in subcolumns (of supercolumns). These are the comparators that are available out of the box:
Type          Explanation
BytesType     The default type; sorts based on bytes. If validation is not required, choose this instead of UTF8Type.
AsciiType     Sorting by using the ASCII character encoding.
UTF8Type      Sorting based on a string that uses UTF-8 as the character encoder. It has the advantage of allowing data validation.
LongType      Sorting based on 64-bit long numbers.
TimeUUIDType  Sorting based on a 16-byte timestamp.
Advanced users can, if this list doesn't cover an application's needs, write their own comparators using Java. The comparator is a required element for a valid column family. However, the other column metadata is optional:
• Data validation. Remember, validation is a costly overhead, but we may need, for some applications, to test the data being entered. The types are the same as in the table above but are used for validation purposes instead of sorting. Predefinition is not required, but if you are aware of the columns that need validation, such rules should be implemented at family creation time. The 'schema' is flexible, so new columns can be added later, if required.
• Index keys. This is another optional attribute. With it, you can create indexing keys which allow search queries against the indexed values. If no indexes exist, querying based on certain values will not be possible. An example of creating a primary index key will be demonstrated later in these notes. More than one column can be indexed in a column family, and these are called secondary indexes. Secondary indexes are recommended for columns with low cardinality because rewriting the indexes is a very costly overhead, and so, where possible, we avoid indexing volatile data.
The command for creating a column family is:
create column family ColumnFamilyName with comparator = 'Type' and column_metadata = [ {column_name: ColumnName, validation_class: 'Type', index_type: KEYS}, …];
Note: When referring to types, look at
the types table and add them without the single quotes. As mentioned before,
the column_metadata is optional.
For the BookCo sales, we should create a 'sale' standard column family by using the UTF8Type and then again this type for the validation class, without indexing any key at the moment.
[default@sales] create column family sale with comparator = UTF8Type
... AND column_metadata =
... [ {column_name: sale_id, validation_class: UTF8Type},
... {column_name: book_id, validation_class: UTF8Type},
... {column_name: seller, validation_class: UTF8Type},
... {column_name: buyer, validation_class: UTF8Type},
... {column_name: price, validation_class: UTF8Type},
... {column_name: payment_method, validation_class: UTF8Type},
... {column_name: date, validation_class: UTF8Type}];
8760e289-c84f-11e0-9aeb-e700f669bcfc
Waiting for schema agreement…
… schemas agree across the cluster
Now we are going to create another column family about novel ratings and recommendations. To do so, we should create a new supercolumn column family using the
create column family ColumnFamilyName with comparator = 'Type' AND column_type='Super' AND subcomparator= 'Type' [AND column_metadata = [{column_name: ColumnName, validation_class: 'Type', index_type: KEYS}, …]];
command. Everything after the subcomparator attribute could have been omitted. You should notice the 'column_type=Super' attribute and the 'subcomparator=Type' attribute, which refers to the comparator used for the subcolumns of this type of column family.
[default@sales] create column family Novelrating with comparator = 'UTF8Type'
... AND column_type='Super' AND subcomparator= 'UTF8Type' AND column_metadata =
... [{column_name: author, validation_class: UTF8Type},
... {column_name: comment, validation_class: UTF8Type},
... {column_name: rate, validation_class: UTF8Type}];
c12796d9-c8f3-11e0-af3e-e700f669bcfc
Waiting for schema agreement…
… schemas agree across the cluster
A very important operation with column families is update. We may need to update a column family with the specified attribute(s) and value(s) or even add index keys (see previous paragraphs).
To do so, we use the
update column family ColumnFamilyName with comparator = 'Type' AND column_metadata = [{column_name: ColumnName, validation_class: 'Type', index_type: KEYS}, …];
command. This is similar to the create command but with the changed attributes and values.
If, for example, we want to create an index key on the sale_id, we can follow this example of how to use the update command:
[default@sales] update column family sale with comparator = 'UTF8Type'
and default_validation_class=UTF8Type and key_validation_class=UTF8Type
and column_metadata= [ {column_name: sale_id, validation_class: UTF8Type, index_type: KEYS},
{column_name: book_id, validation_class: UTF8Type},
{column_name: seller, validation_class: UTF8Type},
{column_name: buyer, validation_class: UTF8Type},
{column_name: price, validation_class: UTF8Type},
{column_name: payment_method, validation_class: UTF8Type},
{column_name: date, validation_class: UTF8Type}];
58f6aa1b-c9a2-11e0-af3e-e700f669bcfc
Waiting for schema agreement…
… schemas agree across the cluster
6.20.4 Working with Columns
6.20.4.1 Inserting and Updating Values
As we pointed out above, the column is the base storage unit in Cassandra. We will find ourselves inserting to, deleting from, updating and creating columns extensively as we build our applications.
To add or update data in a column, use the set ColumnFamilyName['key']['ColumnName'] = 'value'; command.
We mentioned that a column also has a timestamp associated with it. This is added
or updated automatically by Cassandra.
The key refers to the row key. An example of
this command being used is below.
We are going to create a new entry for sale1:
[default@sales] set sale['sale1']['sale_id'] = '1';
Value inserted.
[default@sales] set sale['sale1']['book_id'] = '123';
Value inserted.
[default@sales] set sale['sale1']['seller'] = 'mike';
Value inserted.
[default@sales] set sale['sale1']['buyer'] = 'john';
Value inserted.
[default@sales] set sale['sale1']['price'] = '15';
Value inserted.
[default@sales] set sale['sale1']['payment_method'] = 'PayPal';
Value inserted.
[default@sales] set sale['sale1']['date'] = '10Aug2011';
The column family, as it stands at the moment, is shown in the figure below.
Now we will populate the sale column family with more data. You can use these examples, but do try to add more of your own:
#Creating a new row for sale 2
set sale['sale2']['sale_id'] = '2';
set sale['sale2']['book_id'] = '143';
set sale['sale2']['seller'] = 'peter';
set sale['sale2']['buyer'] = 'john';
set sale['sale2']['price'] = '25';
set sale['sale2']['payment_method'] = 'PayPal';
set sale['sale2']['date'] = '12Aug2011';
#Creating a new row for sale 3
set sale['sale3']['sale_id'] = '3';
set sale['sale3']['book_id'] = '144';
set sale['sale3']['seller'] = 'paul';
set sale['sale3']['buyer'] = 'peter';
set sale['sale3']['price'] = '43';
set sale['sale3']['payment_method'] = 'Visa';
set sale['sale3']['date'] = '13Aug2011';
#Creating a new row for sale 4
set sale['sale4']['sale_id'] = '4';
set sale['sale4']['book_id'] = '127';
set sale['sale4']['seller'] = 'josh';
set sale['sale4']['buyer'] = 'alex';
set sale['sale4']['price'] = '45';
set sale['sale4']['payment_method'] = 'Visa';
set sale['sale4']['date'] = '12Aug2011';
To add data in a supercolumn and its subcolumns, use the set ColumnFamilyName['key']['superColumnName']['col'] = 'value'; command. The key refers to the row key. The example here adds rating entries for a novel called 'Alpha Mike'.
[default@sales] set Novelrating['Alpha Mike']['entry1']['author'] = 'Peter Pond';
Value inserted.
[default@sales] set Novelrating['Alpha Mike']['entry1']['comment'] = 'Alpha Mike is a very good novel';
Value inserted.
[default@sales] set Novelrating['Alpha Mike']['entry1']['rate'] = '5/5';
[default@sales] set Novelrating['Alpha Mike']['entry2']['author'] = 'Jason Stream';
Value inserted.
[default@sales] set Novelrating['Alpha Mike']['entry2']['comment'] = 'Good storyline';
Value inserted.
[default@sales] set Novelrating['Alpha Mike']['entry2']['rate'] = '4/5';
Value inserted.
Now we will populate the Novelrating column family with more data. You can use these examples, but do try to add more of your own:
#Some more recommendations for Alpha Mike
set Novelrating['Alpha Mike']['entry3']['author'] = 'Phil Nastis';
set Novelrating['Alpha Mike']['entry3']['comment'] = 'Excellent thriller';
set Novelrating['Alpha Mike']['entry3']['rate'] = '5/5';
#Creating a new row for a novel called Catch 22 and adding 2 entries for recommendations
set Novelrating['Catch 22']['entry1']['author'] = 'Heller';
set Novelrating['Catch 22']['entry1']['comment'] = 'Brilliant book';
set Novelrating['Catch 22']['entry1']['rate'] = '5/5';
set Novelrating['Catch 22']['entry2']['author'] = 'Heller';
set Novelrating['Catch 22']['entry2']['comment'] = 'Dated, but still good';
set Novelrating['Catch 22']['entry2']['rate'] = '4/5';
6.20.4.2 Reading Data from Column Values
There are different ways to retrieve data from an existing keyspace. Initially, we can use the count ColumnFamilyName['key']; command to count the number of columns stored against the key. To do the same thing for the subcolumns of a supercolumn, use the count ColumnFamilyName['key']['superColumnName']; command.
Note: The 'sale1' is the row key and not the column. In the second command, 'Alpha Mike' is the key and the supercolumn is 'entry1'.
[default@sales] count sale['sale1'];
7 columns
[default@sales] count Novelrating['Alpha Mike']['entry1'];
3 columns
#Try to retrieve a key that does not exist
[default@sales] count sale['sale10'];
0 columns
To retrieve all the information about 'sale1', use the get ColumnFamilyName['key']; command. Notice that the records have an added 'timestamp' value that is automatically added or updated when the record is added or updated:
[default@sales] get sale['sale1'];
=> (column=book_id, value=123, timestamp=1313532897200000)
=> (column=buyer, value=john, timestamp=1313532920840000)
=> (column=date, value=10Aug2011, timestamp=1313532967504000)
=> (column=payment_method, value=PayPal, timestamp=1313532956880000)
=> (column=price, value=15, timestamp=1313532931944000)
=> (column=sale_id, value=1, timestamp=1313532878948000)
=> (column=seller, value=mike, timestamp=1313532909544000)
Returned 7 results.
[default@sales]
Now we could try to retrieve the information
from the supercolumns we have created previously. To retrieve all the
information, the command is the same as above. Try the command for yourself to
see the full output:
[default@sales] get Novelrating['Alpha Mike'];
=> (super_column=entry1,
    (column=author, value=Peter Pond, timestamp=1313601352679000)
    (column=comment, value=Alpha Mike is a very good novel, timestamp=1313601369614000)
    (column=rate, value=5/5, timestamp=1313601388350000))
=> ….
And so on…
We can also retrieve only the value of a column by using the get ColumnFamilyName['key']['columnName'] command. To retrieve a value from a subcolumn of a supercolumn, we use the get ColumnFamilyName['key']['superColumnName']['columnName'] command. Examples are:
[default@sales] get sale['sale1']['book_id'];
=> (column=book_id, value=123, timestamp=1313532897200000)
[default@sales] get Novelrating['Alpha Mike']['entry1']['rate'];
=> (column=rate, value=5/5, timestamp=1313601388350000)
We can put conditions on our queries, just like a WHERE clause in an SQL command. However, in order to use this command properly, we need to add indexes to support specific queries. For example, to retrieve data based on the index created for the 'sale_id', we would use the get CF where column = 'value' [and column = 'value' and …] command.
To help understand why an index key is important, let us initially try to retrieve data based on book_id, which was not previously indexed.
[default@sales] get sale where book_id = '143';
No indexed columns present in index clause with operator EQ.
The system responds that the book_id column is not indexed. Now try with the sale_id column, which was indexed previously during the update command demonstration:
[default@sales] get sale where sale_id = '1';
-------------------
RowKey: sale1
=> (column=book_id, value=123, timestamp=1313532897200000)
...........
1 Row Returned.
To allow the retrieval of book information using the book_id, we can update the column family to add a secondary index as shown below. We will also add an index on the price column:
[default@sales] update column family sale with comparator = 'UTF8Type' and column_metadata=
... [ {column_name: sale_id, validation_class: UTF8Type, index_type: KEYS},
... {column_name: book_id, validation_class: UTF8Type, index_type: KEYS},
... {column_name: seller, validation_class: UTF8Type},
... {column_name: buyer, validation_class: UTF8Type},
... {column_name: price, validation_class: UTF8Type, index_type: KEYS},
... {column_name: payment_method, validation_class: UTF8Type},
... {column_name: date, validation_class: UTF8Type}];
b6cee09c-c9a6-11e0-af3e-e700f669bcfc
Waiting for schema agreement…
… schemas agree across the cluster
Now we can try to retrieve information based on the book_id again. Then, we will use price and then combine both. Just as with SQL, with the price column, we can use comparison operators such as greater than or less than:
[default@sales] get sale where book_id='143';
-------------------
RowKey: sale2
=> (column=book_id, value=143, timestamp=1313673572425000)
...........
1 Row Returned.
[default@sales] get sale where price = 15;
-------------------
RowKey: sale1
=> (column=book_id, value=123, timestamp=1313532897200000)
...........
1 Row Returned.
[default@sales] get sale where book_id='143' and price > 25;
0 Row Returned.
[default@sales] get sale where book_id='143' and price < 35;
-------------------
RowKey: sale2
=> (column=book_id, value=143, timestamp=1313673572425000)
...........
1 Row Returned.
Another way to retrieve all the rows in a
column family is the list ColumnFamilyName;
command:
[default@sales] list Novelrating;
Before you try this, imagine what the output might be like to make sure you understand the data being stored. Once you run it, did you get the data back you expected?
List can also be used with the limit condition to control the number of records that will be retrieved. The returned results indicate that records come back in an arbitrary order rather than sorted.
[default@sales] list sale LIMIT 2;
-------------------
RowKey: sale3
=> (column=book_id, value=144, timestamp=1313673579083000)
................
-------------------
RowKey: sale2
=> (column=book_id, value=143, timestamp=1313673572425000)
................
2 Rows Returned.
6.20.4.3 Deleting Rows and Columns
In keeping our data current, we may need to delete a column from a row. For example, to delete the 'date' column from 'sale1', we would use the del ColumnFamilyName['key']['col']; command.
To delete the entire row of 'sale1', we would use the del ColumnFamilyName['key']; command.
To delete a supercolumn, we would use the del ColumnFamilyName['key']['super']; command.
To delete a subcolumn, we would use the del ColumnFamilyName['key']['super']['col']; command.
Examples of all these are given below:
[default@sales] del sale['sale1']['date'];
column removed.
[default@sales] get sale['sale1']['date'];
Value was not found
[default@sales] del sale['sale1'];
row removed.
[default@sales] get sale['sale1'];
Returned 0 results.
[default@sales] del Novelrating['Alpha Mike']['entry1']['rate'];
column removed.
[default@sales] del Novelrating['Alpha Mike']['entry1'];
column removed.
[default@sales] get Novelrating['Alpha Mike']['entry1'];
Returned 0 results.
6.20.4.4 Drop Column Families and Keyspace
To drop a column family, we use the drop column family ColumnFamilyName; command.
To drop a Cassandra keyspace, we use the drop keyspace KeyspaceName; command.
[default@sales] drop column family sale;
ade3bc44-236f-11e0-8410-56547f39a44b
#To drop the keyspace follow the example
[default@sales] drop keyspace sales;
4eb7c292-c841-11e0-9aeb-e700f669bcfc
Waiting for schema agreement…
… schemas agree across the cluster
6.20.5 Shutdown
At the end of our busy
day, we may need to bring down the Cassandra client. To do this, we simply use
the quit; or exit; command from the client terminal session.
If we also want to
shutdown the Cassandra server, we use the Ctrl + C keys.
6.20.6 Using a Command-Line Script
You can build scripts which create column families and insert data. This is very useful in the prototyping phase when we typically need to keep recreating elements. In the example below, we are going to create a new family called tsale in the sales keyspace; don't forget to have your authentication as the first line.
Open a new editor
page and add this into it:
use sales;
create column family tsale with comparator = 'UTF8Type' and default_validation_class=UTF8Type and key_validation_class=UTF8Type and column_metadata=
[ {column_name: sale_id, validation_class: UTF8Type, index_type: KEYS},
{column_name: book_id, validation_class: UTF8Type},
{column_name: Novel, validation_class: UTF8Type},
{column_name: buyer, validation_class: UTF8Type},
{column_name: price, validation_class: UTF8Type},
{column_name: payment_method, validation_class: UTF8Type},
{column_name: date, validation_class: UTF8Type}];
set tsale['sale1']['sale_id'] = '1';
set tsale['sale1']['book_id'] = '123';
set tsale['sale1']['Novel'] = 'mike';
set tsale['sale1']['buyer'] = 'john';
set tsale['sale1']['price'] = '15';
set tsale['sale1']['payment_method'] = 'PayPal';
set tsale['sale1']['date'] = '10Aug2011';
Now save this as cass.txt into your Cassandra/bin directory.
Open a new terminal session and cd to the Cassandra/bin directory. Do not run cassandra-cli interactively at this stage. Instead, type:
./cassandra-cli -host localhost -port 9160 -f cass.txt
This will, if it can, run the content of the script file called cass.txt that you have just created.
6.20.7 Useful Extra Resources
We have only just got
started with Cassandra with these notes. We hope that there is enough here to
give you an understanding of the key principles and to help identify how
Cassandra is different to an RDBMS.
If you wish to find out more, some useful resources
include:
• This
book’s website (link). There are a bigger dataset and some more detailed
queries to give you practice in using the product.
• Hewitt,
E. (2010) Cassandra: The De fi nitive Guide, published by O’Reilly Media.
• Apache
Cassandra (2011) Of fi cial Site. Last accessed on 1 Sept 2011 at http:// cassandra.apache.org
• Apache
Wiki (2011) Wiki Frontpage. Last accessed on 1 Sept 2011 at http:// wiki.apache.org/cassandra/
6.20.8 The Document-Based Approach
There are a number of document-based solutions available. To get a feel for the approach, you have decided to give MongoDB a try. It is an open source solution, so there are no costs for this experiment.
6.21 MongoDB Tutorial
6.21.1 Installation and Configuration
This guide was made and tested with Ubuntu Linux 11.10. Several Linux systems include package managers that can help to install MongoDB. At the time of writing, packages are available for Ubuntu, Debian, CentOS, Fedora and Red Hat. It is expected that basic Unix commands are known. If you are unsure, there are many web-based Linux command guides. If you don't mind being called 'clueless', this is a good starting point: http://www.unixguide.net/linux/linuxshortcuts.shtml
Follow the following steps:
1. Go
to MongoDB downloads page (http://www.mongodb.org/downloads) and select the
Linux 32-bit or Linux 64-bit product version.
2. Save
the binary to preferred location.
3. Open a terminal window and navigate to the folder where the file was saved.
4. Decompress the .tgz file using the following process:
#Assuming that the file was saved in the Desktop folder
$cd Desktop
$tar zxf mongodb-linux-i686-2.0.4.tgz
#where 2.0.4 is the current version
$cd mongodb-linux-i686-2.0.4
Important tip: By default, MongoDB stores data in the /data/db directory. This has to be created manually, but only once. To do so, in a terminal type:
$sudo mkdir -p /data/db/
#We have to grant access to the directory for our username
#Where USERNAME is the operating system username
$sudo chown USERNAME /data/db
The next step is to run the server. This has to be done through a terminal window. This step is essential each time you reboot your OS:
#In a terminal, navigate to the previously extracted mongodb folder. For this example it is on the Desktop
$cd Desktop
$cd mongodb-linux-i686-2.0.4
#Remember that the version number may vary
#Navigate to the bin directory and start the mongod server
$cd bin
$./mongod
You should see a series of feedback lines. If
you receive any errors indicating that the binary cannot be executed properly,
you probably downloaded a 64-bit version which is not supported by your OS, or
vice versa. Repeat the process by downloading the correct version.
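If you are unsure which build matches your machine, a standard Linux command (not part of the original instructions) reports the architecture before you download:
$uname -m
#x86_64 indicates a 64-bit system; i686 or i386 indicates 32-bit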
Do not close the
current terminal window. Finally, in a new terminal window, start the MongoDB
shell:
#Navigate again in the BIN directory and start
the shell
$cd Desktop/mongodb-linux-i686-2.0.4/bin
$./mongo
By default, the MongoDB shell connects to the 'test' database each time it starts. This will be explained in the next section.
6.21.2 Documents, Data Types and Basic Commands
To recap, MongoDB as a document-oriented storage system
replaces tables, rows and columns with documents. Each document includes one or
more key–value pairs. Each document is part of a collection, and in a database,
more than one collection can exist. A collection can be seen as similar to a
table in an RDBMS.
Below is an example of the key–value pairs that constitute documents. A document can have any structure, as we can see in the examples below.
{"key" : "value"}
{"name" : "George"}
#The previous example shows a single key "name" with the value "George". A document may include multiple key/value pairs, as below
{"name" : "George", "surname" : "James"}
{"name" : "John", "surname" : "King", "email" : "example@example.com"}
#Example documents of a collection with different fields
6.21.3 Data Types
Many data types can be part of a document in a MongoDB database, and some of them are listed below:
Data type | Explanation and examples
Boolean | {"x" : true}. A Boolean value can be either true or false
Null | {"x" : null}. Used to express a null value
Numbers | {"x" : 3.14} or {"x" : 3}
String | {"x" : "George"}. Any string can be used with this type
Date | {"x" : new Date()}. Dates are stored as milliseconds
Array | {"x" : ["1","2","3"]}. Lists of values can be stored as an array
Embedded document | {"x" : {"name" : "George"}}. Documents can contain embedded documents, that is, a document inside another document
6.21.4 Embedding and Referencing
By embedding a document, we mean that one document is nested within another document, as explained in the table above. Referencing refers to storing the data in a separate collection (say collection A) and referencing it from another collection (say collection B). The MongoDB manual suggests that embedding is the preferred of the two approaches: the use of references requires extra queries to retrieve data from the other collections, slowing the query.
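To make the distinction concrete, here is a minimal mongo shell sketch; the collection and field names (books, publishers, publisher_id) are invented purely for illustration and are not part of the case study:
#Embedding: the publisher detail lives inside the book document itself
> db.books.insert({"title" : "Book X", "publisher" : {"name" : "Willeyy", "location" : "UK"}})
#Referencing: the publisher is stored once in its own collection and pointed to by a key
> db.publishers.insert({"_id" : "pub1", "name" : "Willeyy", "location" : "UK"})
> db.books.insert({"title" : "Book Y", "publisher_id" : "pub1"})
#Reading the referenced detail back requires a second query
> db.publishers.findOne({"_id" : "pub1"})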
Any of these data types can be used in a document. As there is no schema in a document, data types do not have to be predefined. Examples will be shown later.
Having started the database server and the shell, we can now start using the database. But first, here are some general navigational tips to start with:
1. Commands related to collections have the syntax db.COLLECTION_NAME.COMMAND(). The db variable is the currently used database.
2. To switch between databases, we use the command use Name_of_Database.
3. To see which database is currently in use, execute the db command.
4. The database is automatically created upon the creation of the first document.
5. To identify which collections are in a database, we can use db.getCollectionNames().
6. To see a list of useful database commands, type 'help' in the shell.
Review the examples of these commands below and then execute them for yourself in your terminal session. First of all, switch to the 'products' database, which will be used for our case study.
Note: In all the examples, commands typed in the shell are indicated by '>'. Responses from the system do not include that sign.
cloud@project:~/Desktop/mongodb-linux-i686-2.0.4/bin$ ./mongo
MongoDB shell version: 2.0.4
connecting to: test
> db
test
> use products
switched to db products
> db
products
> db.getCollectionNames();
[
]
Note: The last command returns an empty array
as at the moment no collection exists in the database.
As with most databases, there are four
basic types of operation we will need to complete. They are often described as
CRUDing: create, read, update and delete. We will use MongoDB to do each of
these tasks.
6.21.4.1 Create
This operation adds a new document to a collection. To enter some new components into the database, we can create a local variable (one for each new document) and then insert it into the collection.
As the shell uses JavaScript objects, we are going to use a variable to represent the document. In this example, we need to create two documents corresponding to two books that the company sells.
To create a variable, use this syntax:
book1 = {"isbn" : "123", "type" : "fictional", "price" : "50", "description" : "A fictional book"}
As shown in the example below, the shell returns the details of the variable once it is created.
For this example, create the two variables as shown below.
Then, use the
db.COLLECTION_NAME.insert(variable) command
to store it as a document in a collection.
Note: Data can also be inserted directly through the shell without defining the document as a variable. This will be covered later in the more advanced command section. (Note: Depending on your terminal settings, when you paste this in, you may only see the end of the line in the terminal.)
# Creating variable for book1
> book1 = {"isbn" : "123", "type" : "fictional", "price" : "50", "description" : "A fictional book"}
#System response indicating that the variable has been created
{
"isbn" : "123",
"type" : "fictional",
"price" : "50",
"description" : "A fictional book"
}
# Creating variable for book2
> book2 = {"isbn" : "124", "type" : "tech", "price" : "40", "description" : "Book for tech", "pages" : "150"}
#System response indicating that the variable has been created
{
"isbn" : "124",
"type" : "tech",
"price" : "40",
"description" : "Book for tech",
"pages" : "150"
}
# Inserting variables as documents in a collection
# The collection will be automatically created once the first insert occurs
> db.products.insert(book1);
> db.products.insert(book2);
At the moment, we have created two documents in the products collection, which was created as a result of the first insert. You may have noticed that the schema-free feature of MongoDB allows us to use different data elements in each document. The first document does not have the 'pages' element which exists in the second document, for example.
6.21.4.2 Read
To start with, let's use the count operation to see how many documents are in a collection. The command is db.COLLECTION_NAME.count(). To retrieve the documents themselves, we can use db.COLLECTION_NAME.find(). This command returns all the documents in a collection. db.COLLECTION_NAME.findOne() returns only one document from a collection.
We can also filter our results by specifying selectors as parameters to find. A parameter is expressed in the form {key : value}, which means 'where key == value'. This will be shown in more depth later in this tutorial.
#Counting the number of documents in a
collection
> db.products.count();
2
#Retrieving all the documents in a collection
> db.products.find();
{ "_id" : ObjectId("4e2c2a88ae1a6bc450311480"), "isbn" : "123", "type" : "fictional", "price" : "50", "description" : "A fictional book" }
{ "_id" : ObjectId("4e2c2a99ae1a6bc450311481"), "isbn" : "124", "type" : "tech", "price" : "40", "description" : "Book for tech", "pages" : "150" }
#Retrieving the first document in a collection. Alternatively, we can use parameters to specify the result; we will see this later.
> db.products.findOne();
{
"_id" : ObjectId("4e2c2a88ae1a6bc450311480"),
"isbn" : "123",
"type" : "fictional",
"price" : "50",
"description" : "A fictional book"
}
#Retrieving the document in the collection whose isbn equals 123
> db.products.find({"isbn" : "123"});
{ "_id" : ObjectId("4e2c2a88ae1a6bc450311480"), "isbn" : "123", "type" : "fictional", "price" : "50", "description" : "A fictional book" }
As you can see, the document has an added field, _id. This is automatically generated by the system when a document is inserted into the database, to ensure that each document has a unique identity, similar to a primary key in relational databases. If we wish, we can define that object id manually, but for the moment, we will make do with the system-generated version.
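If you do want to supply your own identifier, a minimal sketch (the value "book3" is simply an invented example) looks like this; note that _id values must be unique within the collection:
#Supplying our own _id instead of letting MongoDB generate one
> db.products.insert({"_id" : "book3", "isbn" : "125", "type" : "fictional", "price" : "35"})
> db.products.findOne({"_id" : "book3"})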
6.21.4.3 Update
Of course, we may need to update the product details in our database at some time. For example, the company buyer may wish to add stock quantity information for a book. Using the update operation, we need to pass two parameters:
(i) The document to update
(ii) The changes to be made
The command is db.COLLECTION_NAME.update({documentToChange}, newVariable).
Initially, we have to use an object as before. We can use the same one (book1) or a new one which includes the same information.
Review the example below and execute the same commands for yourself. Then, try some changes of your own.
#Here we add a new element to the document, with the key 'quantity' and value '3'.
> book1.quantity=3
3
#Now we have to update the document which has 'isbn' 123 with the updated object
# db.COLLECTION_NAME.update({parameter1},parameter2)
> db.products.update({"isbn":"123"},book1)
#Use the find command to query the changed document. In this example we are using a selector to filter our results. Further explanation will be given later.
> db.products.find({isbn:"123"});
{
"_id" : ObjectId("4e2c2a88ae1a6bc450311480"),
"isbn" : "123",
"type" : "fictional",
"price" : "50",
"description" : "A fictional book",
"quantity" : 3
}
#The document has been successfully updated
6.21.4.4 Delete
We may need to remove documents from a collection. For this, we use db.COLLECTION_NAME.remove(). Parameters may be passed to specify the document to be deleted, or (CAUTION!) the command can be run without parameters to remove all the documents within a collection. We can also use the drop operation to remove a collection.
Follow the sequence
of the commands:
#Removing the document with isbn 123
> db.products.remove({isbn:"123"});
#Query the collection to verify that the document has been removed
> db.products.findOne({isbn:"123"});
null
#As there is no entry for the document with isbn '123', the shell returns a null value.
#Now check the collections that are currently in the database and then drop "products".
> db.getCollectionNames();
[ "products", "system.indexes" ]
> db.products.drop();
#System confirmation of the action
true
> db.getCollectionNames();
[ "system.indexes" ]
6.21.5 Advanced Commands and Queries
This section contains examples of more advanced usage of the MongoDB command set. From the case study, we will try to implement these requirements:
Each product has its own ISBN, description, author and other characteristics, as well as a number of available sellers for it. Moreover, each book may be associated with a number of tags for easier searching. User reviews and comments are available. Finally, users should be able to find a book based on its characteristics, prices and other filtering options.
6.21.6 More CRUDing
We have dropped the collection, so a new one has to be created. Moreover, we need to create at least 20 new documents to represent books. This tutorial gives ten examples that can be used, but it is recommended that you create your own documents as well in order to familiarise yourself with the concept.
At this stage, data will be inserted directly in the shell without defining the document as an object. You may still create objects and insert them in the 'product' collection if you wish.
An example of how the different data types are used is shown below. Then, ten documents are listed after the code box.
{
"isbn" : 123,
"title" : "Book1",
"author" : "Mike",
"coauthor" : "Jessy",
"description" : "A fictional book",
"pages" : 150,
"publicationdate" : new Date('10-06-2010'), #Date
"publisher" : { "name" : "Willeyy", "location" : "UK" }, #Embedded document
"tag-category" : ["fictional","science"], #Array
"sellers" : [ #Array of documents
{"sellername" : "bookstore", "price" : 50},
{"sellername" : "seller4523", "price" : 60} ],
"comments" : [
{"email" : "test@example.com", "comment" : "test test"}
]
}
#Now insert it in the collection as shown below:
db.product.insert({"isbn" : 123, "title" : "Book 1", "author" : "Mike", "coauthor" : "Jessy", "description" : "A fictional book", "pages" : 150, "publicationdate" : new Date('10-06-2010'), "publisher" : { "name" : "Willeyy", "location" : "UK" }, "tag-category" : ["fictional","science"], "sellers" : [{"sellername" : "bookstore", "price" : 50},{"sellername" : "seller4523", "price" : 60}], "comments" : [ {"email" : "test@example.com", "comment" : "test test"}]});
6.21.7 Sample Data Set
Insert these data sets into your MongoDB collection. You should then create at least another ten similar documents for yourself. We are using the book ISBN as the primary key.
#Book 1
db.product.insert({"isbn" : 123, "title" : "Book 1", "author" : "Mike", "coauthor" : "Jessy", "description" : "A fictional book", "pages" : 150, "publicationdate" : new Date('10-06-2010'), "publisher" : { "name" : "Willeyy", "location" : "UK" }, "tag-category" : ["fictional","science"], "sellers" : [{"sellername" : "bookstore", "price" : 50},{"sellername" : "seller4523", "price" : 60}], "comments" : [{"email" : "test@example.com", "comment" : "test test"}]});
#Book 2
db.product.insert({"isbn" : 124, "title" : "Book 2", "author" : "George", "description" : "Comedy book about", "pages" : 120, "publicationdate" : new Date('12-06-2010'), "publisher" : { "name" : "Wil", "location" : "USA" }, "tag-category" : ["comedy","family"], "sellers" : [{"sellername" : "bookstore", "price" : 50}], "comments" : [{"email" : "test1@example.com", "comment" : "test test"}]});
#Book 3
db.product.insert({"isbn" : 125, "title" : "Book 3", "author" : ["Tony","James"], "description" : "geographic", "pages" : 80, "publicationdate" : new Date('14-07-2010'), "publisher" : { "name" : "Wil", "location" : "USA" }, "tag-category" : ["geographic"], "sellers" : [{"sellername" : "John", "price" : 45},{"sellername" : "Joe", "price" : 55}], "comments" : [{"email" : "test1@example.com", "comment" : "test test"},{"email" : "bookreviewer@example.com", "comment" : "very good book"}]});
#Book 4
db.product.insert({"isbn" : 128, "title" : "Book 4", "author" : ["Mike","Jack"], "description" : "A physics book", "pages" : 150, "publicationdate" : new Date('10-06-2010'), "publisher" : { "name" : "Mike", "location" : "CHINA" }, "tag-category" : ["physics","science"], "sellers" : [{"sellername" : "Mikee", "price" : 30},{"sellername" : "jackk", "price" : 10}]});
#Book 5
db.product.insert({"isbn" : 133, "title" : "Book 5", "author" : "Bob", "description" : "chemistry", "pages" : 100, "publicationdate" : new Date('2006-05-01'), "publisher" : { "name" : "Jason", "location" : "India" }, "tag-category" : ["chemistry"], "sellers" : [{"sellername" : "take", "price" : 80}]});
#Book 6
db.product.insert({"isbn" : 137, "title" : "Book 6", "author" : ["Chuck","Paul"], "description" : "Mathematics Computer Science", "pages" : 430, "publicationdate" : new Date('3-04-2001'), "publisher" : { "name" : "Willey", "location" : "UK" }, "tag-category" : ["mathematics","computers","science"], "sellers" : [{"sellername" : "pricess", "price" : 20}], "comments" : [{"email" : "test1@example.com", "comment" : "not good"},{"email" : "test2@example.com", "comment" : "this book is great"}]});
#Book 7
db.product.insert({"isbn" : 136, "title" : "Book 7", "author" : "George", "description" : "Crime", "pages" : 120, "publicationdate" : new Date('20-05-2000'), "publisher" : { "name" : "Willey", "location" : "UK" }, "tag-category" : ["crime","action"], "sellers" : [{"sellername" : "bookstore3", "price" : 40},{"sellername" : "book1231", "price" : 50}], "comments" : [{"email" : "test1@example.com", "comment" : "good book"}]});
#Book 8
db.product.insert({"isbn" : 138, "title" : "Book 8", "author" : ["Tony","Jack"], "description" : "Drama", "pages" : 80, "publicationdate" : new Date('24-07-2010'), "publisher" : { "name" : "John", "location" : "Canada" }, "tag-category" : ["drama","thriller"], "sellers" : [{"sellername" : "John", "price" : 55},{"sellername" : "Joesw", "price" : 65}], "comments" : [{"email" : "tasdfas1@example.com", "comment" : "test is good"},{"email" : "bookreviewer@example.com", "comment" : "very good book"}]});
#Book 9
db.product.insert({"isbn" : 140, "title" : "Book 9", "author" : ["Mike","George"], "description" : "Architect", "pages" : 150, "publicationdate" : new Date('10-06-2010'), "publisher" : { "name" : "Take", "location" : "CHINA" }, "tag-category" : ["architecture","science"], "sellers" : [{"sellername" : "Mike", "price" : 30},{"sellername" : "jack", "price" : 10}]});
#Book 10
db.product.insert({"isbn" : 141, "title" : "Book 10", "author" : "Bob", "description" : "English", "pages" : 100, "publicationdate" : new Date('2006-05-01'), "publisher" : { "name" : "Mikes", "location" : "Sweden" }, "tag-category" : ["English","language"], "sellers" : [{"sellername" : "Paul", "price" : 60}]});
6.21.8 More on Deleting Documents
It is important to understand that once data is removed, there is no undo of the action or recovery of the removed documents.
An example we could use from the case study is that the books of the author "Tony" have been suspended and are no longer available for sale. This example will delete any documents from the product collection where the author is "Tony". If you have inspected the data set above, you should have noticed that two documents include this author: books 3 and 8.
>db.product.count();
10
>db.product.remove({“author”: “Tony”});
>db.product.count();
8
At the beginning of the process, there were ten documents in the collection. After removing those where the value of the author was "Tony", only eight documents were left in the collection. Note that even though the author names were held inside an array, the query mechanism still matched and removed those documents.
Over to you: To familiarise yourself with the process, use any other criteria to remove documents. Once finished, reinsert the removed documents.
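One cautious pattern worth adopting (our suggestion, not from the original text) is to run the same selector through find() and count() first, so that you can see exactly which documents a remove() would delete before you commit to it:
#Check how many documents match the selector before deleting them
> db.product.find({"tag-category" : "drama"}).count()
> db.product.remove({"tag-category" : "drama"})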
6.21.9 More on Updating Documents
A document that is stored in the database can be changed at any time. As previously explained, the update method requires two parameters. There are several ways to replace a document, and these are explained in this section:
• The simplest way to update a document was illustrated in the basic commands. That method is useful when we need to make changes to the overall structure (schema) of the document.
• The second way is to use modifiers in cases where only specific parts of the document will be updated. This section presents the most important modifiers.
6.21.10 The Modifiers
6.21.10.1 $set/$unset
These are used to set the value of a key; if the key does not exist, it is created automatically. $unset is used to delete a given field. The general syntax is:
db.COLLECTION_NAME.update({documentToChange}, {"$set" : {key : value}});
db.COLLECTION_NAME.update({documentToChange}, {"$unset" : {fieldToRemove : 1}});
Note: The number 1 against fieldToRemove is required to confirm the removal.
When a book is sold, appropriate actions should be taken. Consider the need for an update where one stock item is sold. For this, we can use $set. Moreover, $set can be used to change the type of a value, for instance, changing a string to an array. We can also use $unset to delete the comments field, just for demonstration purposes.
>db.product.find({"isbn" : 128});
{ "_id" : ObjectId("4e302e849d72be29a7d749e9"), "isbn" : 128, "title" : "Book 4", "author" : [ "Mike", "Jack" ], "description" : "A physics book", "pages" : 150, "publicationdate" : ISODate("1970-01-01T00:00:00Z"), "publisher" : { "name" : "Mike", "location" : "CHINA" }, "tag-category" : [ "physics", "science" ], "sellers" : [ { "sellername" : "Mikee", "price" : 30 }, { "sellername" : "jackk", "price" : 10 } ] }
#The document with isbn 128 has two sellers at the beginning
#Once a book is sold, the associated entry should be removed, as illustrated below
>db.product.update({"isbn" : 128}, {"$set" : {"sellers" : [ { "sellername" : "Mikee", "price" : 30 }]}});
>db.product.find({"isbn" : 128});
{ "isbn" : 128, "_id" : ObjectId("4e302e849d72be29a7d749e9"), "author" : [ "Mike", "Jack" ], "description" : "A physics book", "pages" : 150, "publicationdate" : ISODate("1970-01-01T00:00:00Z"), "publisher" : { "name" : "Mike", "location" : "CHINA" }, "sellers" : [ { "sellername" : "Mikee", "price" : 30 } ], "tag-category" : [ "physics", "science" ], "title" : "Book 4" }
#The query returns the updated and correct document
#Now we will remove the comments field from another document with the $unset command. First retrieve the document to see that the comments field exists and then use $unset
> db.product.find({"isbn" : 124})
{ "isbn" : 124, "title" : "Book 2", "author" : "George", "description" : "Comedy book about", "pages" : 120, "publicationdate" : ISODate("1970-01-01T00:00:00Z"), "publisher" : { "name" : "Wil", "location" : "USA" }, "tag-category" : [ "comedy", "family" ], "sellers" : [ { "sellername" : "bookstore", "price" : 50 } ], "comments" : [ { "email" : "test1@example.com", "comment" : "test test" } ] }
> db.product.update({"isbn" : 124},{"$unset" : {comments : 1}});
> db.product.find({"isbn" : 124})
{ "isbn" : 124, "_id" : ObjectId("4e30801e9889f6cdcf761296"), "author" : "George", "description" : "Comedy book about", "pages" : 120, "publicationdate" : ISODate("1970-01-01T00:00:00Z"), "publisher" : { "name" : "Wil", "location" : "USA" }, "sellers" : [ { "sellername" : "bookstore", "price" : 50 } ], "tag-category" : [ "comedy", "family" ], "title" : "Book 2" }
#The comments field was completely removed from the document
Try to update at least two more documents by modifying (adding/removing) the different sellers, or by changing the names of author(s) or other fields.
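As mentioned earlier, $set can also change the type of a value. The sketch below (the second author, "Anna", is invented purely for illustration) turns the single author string of Book 7 into an array:
#Book 7 currently stores its author as a single string
> db.product.update({"isbn" : 136}, {"$set" : {"author" : ["George", "Anna"]}});
> db.product.find({"isbn" : 136}, {"author" : 1, "_id" : 0});
{ "author" : [ "George", "Anna" ] }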
6.21.10.2 $inc
It is similar to $set, but it is strictly used for numbers. It increases an existing number by adding a given number to it. It is regularly used for analytics, such as updating the visitor counter of a website. The case study does not refer to these issues, but a suitable example could be to change the pages of a book:
#Using the update query to increment the pages of a particular book. The pages were 150 (see the previous query) and we now increment them by 205
> db.product.update({"isbn" : 128},{"$inc" : {pages : 205}});
> db.product.find({"isbn" : 128});
{ "isbn" : 128, "_id" : ObjectId("4e302e849d72be29a7d749e9"), "author" : [ "Mike", "Jack" ], "description" : "A physics book", "pages" : 355, "publicationdate" : ISODate("1970-01-01T00:00:00Z"), "publisher" : { "name" : "Mike", "location" : "CHINA" }, "sellers" : [ { "sellername" : "Mikee", "price" : 30 } ], "tag-category" : [ "physics", "science" ], "title" : "Book 4" }
#The update was successful and the pages are currently 355
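Returning to the 'book is sold' scenario mentioned above, $inc also accepts negative numbers, so it can decrement as well as increment. The 'stock' field below is invented for illustration and is not part of the sample data set:
#Give Book 4 a stock level with $set, then decrement it by one when a copy is sold
> db.product.update({"isbn" : 128}, {"$set" : {"stock" : 5}});
> db.product.update({"isbn" : 128}, {"$inc" : {"stock" : -1}});
> db.product.find({"isbn" : 128}, {"stock" : 1, "_id" : 0});
{ "stock" : 4 }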
6.21.11 Querying Documents
The find method is used to search for documents in a collection. It is possible to define more than one key/value pair as a parameter to search. For instance, a customer wants to find all the books that are in the 'science' category, and then the one among them which was written by George. Moreover, we can specify which keys will be returned. For this example, we will return only the title, ISBN and author(s) of the book(s).
The command we will use is db.COLLECTION_NAME.find({"condition"});
For specifying which keys to return, use the extended command:
db.COLLECTION_NAME.find({"condition"},{"key" : 1});
The number 1 is a true/false setting used to define that a key should be returned; the number 0 is used to define that a key should not be returned.
# In this query we want to find science books, and present only their isbn, author and title
> db.product.find({"tag-category" : "science"},{"isbn" : 1, "author" : 1, "title" : 1, "_id" : 0});
{ "isbn" : 123, "title" : "Book 1", "author" : "Mike" }
{ "isbn" : 128, "title" : "Book 4", "author" : [ "Mike", "Jack" ] }
{ "isbn" : 137, "title" : "Book 6", "author" : [ "Chuck", "Paul" ] }
{ "isbn" : 140, "title" : "Book 9", "author" : [ "Mike", "George" ] }
# Then we use a second condition to query the books that are in the science category and are also written by George. In this example we don't want to return the "_id" key.
> db.product.find({"tag-category" : "science", "author" : "George"},{"isbn" : 1, "author" : 1, "title" : 1, "_id" : 0});
{ "isbn" : 140, "title" : "Book 9", "author" : [ "Mike", "George" ] }
6.21.11.1 Query Criteria and Operators
Query condition | Explanation
$lt (<) | less than
$lte (<=) | less than or equal to
$gt (>) | greater than
$gte (>=) | greater than or equal to
$ne | not equal to
$in | matches any value in a given list
$or | matches documents that satisfy either of the given conditions
Using the query criteria and operators (or combinations of them) in the table above, examples are presented below to demonstrate their usage. All the examples relate to the case study, in which customers use the search engine with different criteria to find their books.
#Using $gte and $lte to look for books with a page count greater than or equal to 20 and less than or equal to 80. Also specifying what to return.
> db.product.find({"pages" : {"$gte" : 20, "$lte" : 80}},{"isbn" : 1, "_id" : 0});
{ "isbn" : 125 }
{
“isbn” : 138 }
#Create a variable holding a date and use it to find the books published before that date
> thedate = new Date("01/01/2009");
ISODate("2009-01-01T00:00:00Z")
> db.product.find({"publicationdate" : {"$lt" : thedate}})
#Using $in to match books whose page count is in a given list of values
> db.product.find({"pages" : {"$in" : [80,100]}},{"isbn" : 1, "_id" : 0});
{ "isbn" : 125 }
{ "isbn" : 133 }
{
“isbn” : 138 }
{
“isbn” : 141 }
#Using $or to find books that are written by Tony or are in the science category
> db.product.find({"$or" : [{"author" : "Tony"},{"tag-category" : "science"}]},{"isbn" : 1, "_id" : 0});
{ "isbn" : 123 }
{
“isbn” : 125 }
{
“isbn” : 128 }
{
“isbn” : 137 }
{
“isbn” : 138 }
{
“isbn” : 140 }
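For completeness, $ne from the table above follows the same pattern. A small sketch (output omitted, since it depends on which documents you have removed or reinserted) might be:
#Using $ne to find books whose author is not "Mike"; for array values, any document whose author array contains "Mike" is excluded
> db.product.find({"author" : {"$ne" : "Mike"}},{"isbn" : 1, "_id" : 0});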
Limits, Skips and Sorting
To extend the query options, this section presents limits, skips and sorting on the returned results of a query.
Limit sets the maximum number of results that will be returned.
Skip skips the given number of results and then returns the rest of the results. If the collection has fewer documents than the given number, then nothing will be returned.
Sort sorts documents on a given key in ascending (1) or descending (−1) order.
See the examples below and then try some similar queries for yourself:
#Requesting all results (hence the empty {}) from the collection and limiting the output to 3
> db.product.find({},{"isbn" : 1, "_id" : 0}).limit(3);
{ "isbn" : 123 }
{ "isbn" : 128 }
{ "isbn" : 140 }
#Requesting all results from the collection, skipping the first 5 and returning the rest of the collection
> db.product.find({},{"isbn" : 1, "_id" : 0}).skip(5);
{
“isbn” : 137 }
{
“isbn” : 136 }
{
“isbn” : 141 }
{
“isbn” : 125 }
{
“isbn” : 138 }
#Requesting all results (hence the {}) from the collection, sorted by number of pages in descending order
> db.product.find({},{"title" : 1, "pages" : 1, "_id" : 0}).sort({pages : -1});
{
“title” : “Book 6”, “pages” : 430 }
{
“title” : “Book 1”, “pages” : 150 }
{
“title” : “Book 4”, “pages” : 150 }
{
“title” : “Book 9”, “pages” : 150 } .........
{
“title” : “Book 3”, “pages” : 80 }
{
“title” : “Book 8”, “pages” : 80 }
#Finally, we will combine all the query options together
> db.product.find({},{"title" : 1, "pages" : 1, "_id" : 0}).sort({pages : -1}).limit(3).skip(2);
{ "title" : "Book 4", "pages" : 150 }
{
“title” : “Book 9”, “pages” : 150 }
{
“title” : “Book 2”, “pages” : 120 }
Shutdown
To shut down the MongoDB client, use the 'exit' command.
To shut down the MongoDB server, press Ctrl + C in the terminal in which it is running.
6.21.11.2 Useful Extra Resources
We have only just got started with MongoDB in these notes. We hope that there is enough here to give you an understanding of the key principles and to help identify how MongoDB is different from an RDBMS.
If you wish to find out more, some useful resources include:
• Chodorow, K., Dirolf, M.: MongoDB: The Definitive Guide: Powerful and Scalable Data Storage. O'Reilly Media, Sebastopol (2010)
6.22 Review Questions
The answers to these
questions can be found in the text of this chapter.
• What
does ACID stand for, and how sacrosanct are each of the elements in NoSQL
databases?
• What
was Cassandra designed for in terms of the types of data it stores?
• What
is meant by horizontal scaling?
• How
might the cloud be used as part of a disaster recovery strategy?
• What
types of data are ideal for MongoDB, and what for Cassandra?
6.23 Group Work Research Activities
These activities
require you to research beyond the contents of the book and can be tackled
individually or as a discussion group.
6.24 Discussion Topic 1
Once you have
become conversant with one or other of the databases in the tutorials of this
chapter, attempt to draw up a SWOT analysis to see what the strengths and
weaknesses, threats and opportunities there may be from adopting the database
in an organisation.
6.25 Discussion Topic 2
Try to think of useful criteria for comparing different types of database to help you decide which might be the most appropriate for a given application. Two examples of criteria might be scalability and performance, which we have already referred to in the text above.
Acknowledgements
Much of the tutorial material was initially created by Antonis Petrou as an output from his MSc dissertation at Sheffield Hallam University. He was an excellent student to supervise and achieved an excellent final mark for his work. I am grateful for his permission to use portions of his work. Peter Lake
References
Hewitt, E.: Cassandra: The Definitive Guide. O'Reilly Media, Sebastopol (2010). Print ISBN 978-1-4493-9041-9, Ebook ISBN 978-1-4493-9042-6
Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A., Gruber, R.E.: Bigtable: a distributed storage system for structured data. ACM Trans. Comput. Syst. 26(2), Article 4 (June 2008). doi:10.1145/1365815.1365816, http://doi.acm.org/10.1145/1365815.1365816
Brewer, E.: Pushing the CAP: strategies for consistency and availability. Computer 45(2), 23–29 (2012)