CHAPTER 14
Database Technology
The changing characteristics of data and the need for new data processing models and file
systems in high-performance computing environments have been discussed in previous chapters.
Large data-sets contain not only structured data but unstructured data too. The storage
requirements of unstructured data are entirely different from those of structured data. Fast data
storage, search and retrieval are basic requirements in high-performance computing.
Cloud computing not only supports traditional DBMSs; it also caters to modern data processing
requirements. This chapter focuses on the characteristics of this new type of database and
discusses how unstructured data are stored in such databases for efficient processing. The
chapter also discusses the different forms of database solutions available in the
high-performance cloud computing environment.
Data storage and databases on the cloud are intimately tied to one another, which provides
scope for solutions that optimize database performance. This has changed the way databases
are managed. Many cloud computing vendors have developed new methods of storing data objects
which are different from the traditional methods of storing data.
Data storage and databases on cloud-like high-performance systems are often intimately tied
to one another for efficient processing of large-volume unstructured data-sets.
14.1 DATABASE IN CLOUD
Consumers can avail database facilities in the cloud in two forms. The first is a general database
solution that the user installs on IaaS (a virtual machine delivered as IaaS). The other is
delivered by service providers as Database-as-a-Service, where the vendor fully manages backend
administration jobs like installation, security management and resource assignment.
In the first approach, users can deploy database applications on cloud virtual machines
like any other application software. In addition, vendors supply ready-made machine images
with pre-installed and pre-configured databases. For example, Amazon provides ready-made EC2
machine images with Oracle Database pre-installed.
In the Database-as-a-Service (DBaaS) model, the operational burden of provisioning,
configuration and backup is handled by the service operator. The first approach is essentially
the same kind of database deployment that users earlier carried out on traditional computers,
moved onto cloud infrastructure; the DBaaS approach provides the actual flavour of cloud
computing to database users.
240 Cloud Computing
Users can deploy and manage database solutions of their choice over an IaaS facility.
Database-as-a-Service, on the other hand, is a PaaS offering delivered by the provider.
14.2 DATA MODELS
Traditional database systems with relational models deal with structured data. But the
characteristics of data have changed rapidly over the last decade or so. Huge volumes of
unstructured data are being generated and stored every day. The storage and processing
requirements of these data are different, and they cannot be managed efficiently by traditional
systems with relational models.
As a result, new data models have been introduced in different database solutions to meet
the needs of unstructured data, while traditional database solutions continue to be used for
structured data. The query languages used in these two categories of databases are also
different. The data models used in the two categories are known as the SQL data model and
the NoSQL data model respectively.
14.2.1 SQL Model or Relational Model
This is the data model used in traditional database systems that process structured data-sets. But
this relational data model has a limitation: it was not made for distributed data storage,
and this makes the scaling of a database difficult. Hence relational data models are not natively
suited to the cloud environment, where scaling is an essential attribute. Databases built on
this data model are called relational databases or SQL databases. Oracle Database, Microsoft SQL
Server and the open-source MySQL and PostgreSQL come under this category.
14.2.2 NoSQL Model or Non-relational Model
Distributed data storage is an inherent characteristic of this data model. Thus, this model is
suitable for building scalable systems. Databases built on this data model are known as NoSQL
databases. Such databases are cloud native and able to scale efficiently. NoSQL databases are
built to serve heavy read-write loads and are suited to the storage and retrieval of unstructured
data-sets. Amazon SimpleDB, Google Datastore and Apache Cassandra are a few examples of NoSQL
database systems.
The NoSQL data model is more cloud native than the relational data model. The relational or SQL
data model makes scaling difficult for databases.
14.3 DATABASE-AS-A-SERVICE
Database-as-a-Service (DBaaS) is a cloud service offering which is managed by cloud service
providers. DBaaS has all of the characteristics of cloud services, such as scaling and metered
billing. It is offered on a pay-per-usage basis and provides on-demand access to a
database for the storage of data. Database-as-a-Service allows storing and retrieving of data
without having to deal with lower-level database administration functions. It provides
significant benefits in terms of automated provisioning, monitoring, increased security and
management of the database in comparison to traditional architectures.
The DBaaS offering of cloud computing is available to serve the processing requirements of
both structured and unstructured data. For structured data, the early DBaaS efforts include
Amazon RDS and Microsoft SQL Azure. Examples of DBaaS for unstructured data include
Amazon SimpleDB, Google Datastore and Apache Cassandra.
Database-as-a-Service comes under the Platform-as-a-Service offering category.
14.4 RELATIONAL DBMS IN CLOUD
There are two ways to use an RDBMS on the cloud. In the first, consumers of an IaaS service
deploy a traditional relational database (like Oracle or SQL Server) on a cloud server, and
the consumers bear full responsibility for installing and managing the database. The
second way is to use the ready-to-use relational database services offered by cloud
service providers. Such services include Amazon Relational Database Service (Amazon RDS),
Google Cloud SQL, Azure SQL Database and others. These two approaches can be
categorized as
■■ Relational database deployment on cloud
■■ Relational Database-as-a-Service or fully-managed RDBMS
14.4.1 Relational Database Deployment on Cloud
Users can avail the facility of traditional RDBMS deployment over cloud infrastructure. Even
third-party database applications can be deployed over virtual machines in the cloud. There are
two ways of deploying them. Firstly, users can install such database applications on cloud
servers just as they do on local computers. Secondly, many cloud services provide ready-made
machine images that already include an installation of a database. The major cloud
vendors mostly support the common RDBMS applications like Oracle Server, Microsoft’s SQL
Server or the open-source MySQL Server.
Amazon’s cloud computing platform provides support for both Microsoft’s SQL Server and
Oracle Server. In fact, AWS provides an ideal platform for running many other traditional,
third-party relational database systems including IBM DB2 on their virtual machines. Users
can run their own relational database on Amazon EC2 VM instances. Rackspace also provides
deployment support for Oracle Server, SQL Server and MySQL Server. Microsoft Azure
supports the deployment option for Oracle database and SQL Server using Azure VM.
When users deploy a database on the cloud, they get full control over database administration,
including backups and recovery. They can install the database application on the operating
system of their choice and can also tune the operating system and database parameters
according to their requirements.
Deploying a relational database on a cloud server is the ideal choice for users who require
absolute control over the management of the database.
14.4.2 Relational Database-as-a-Service
As with other computing services like machine, storage and network, cloud providers
also provide relational DBMS as a service, referred to as DBaaS. Many cloud service providers
offer the customary relational database systems as fully-managed services which provide
functionality similar to that found in Oracle Server, SQL Server or MySQL Server.
Management of such services is the responsibility of the provider, who handles routine database
tasks like provisioning, patching, backup, recovery, failure detection and repair. The following
sections briefly describe a few such popular services.
Relational database management system offerings are fully-managed by cloud providers.
[Link] Amazon RDS
Amazon Relational Database Service, or Amazon RDS, is a relational database service available
with AWS. Amazon RDS was first released in 2009, supporting the functionality of the open-source
MySQL database. Later, it added the capabilities of Oracle Server, Microsoft SQL Server and
the open-source PostgreSQL. In 2014, AWS launched a MySQL-compatible relational database
called Amazon Aurora and added it as the fifth database engine available to customers
through Amazon RDS. AWS positions Amazon Aurora as offering much better performance than
MySQL at a lower price.
Two pricing options are available with Amazon RDS: reserved pricing and on-demand
pricing. These two schemes are known as Reserved DB Instances and On-Demand
DB Instances. Reserved pricing provides the option of a one-time payment and offers three
different DB Instance types (for light, medium and heavy utilization). This scheme is useful
for extensive database use in the long run. On-Demand DB Instances are paid for by the hour
with no long-term commitment.
[Link] Google Cloud SQL
Google Cloud SQL is a MySQL database that lives in Google’s cloud and is fully managed by
Google. It is very simple to use and integrates very well with Google App Engine applications
written in Java, Python, PHP and Go. Google Cloud SQL is also accessible through the MySQL
client and other tools that work with MySQL databases. Google Cloud SQL offers updated
releases of MySQL.
Google offers two billing options for Cloud SQL, namely Packages and Per Use.
The Packages option is suitable for users who use the database extensively every month; otherwise
hourly billing, which is available under the Per Use option, is preferable. In both schemes,
the billing amount depends mainly on RAM, storage usage and the number of I/O operations.
[Link] Azure SQL Database
Microsoft Azure SQL Database (formerly SQL Azure) is a relational Database-as-a-Service
offering the functionality of Microsoft SQL Server. Like Microsoft SQL Server, SQL Database
uses T-SQL as the query language. SQL Database is available in three service tiers (Basic,
Standard and Premium) with different hourly pricing options. The Premium tier supports
mission-critical workloads with high transaction volumes and many concurrent users, whereas
the Basic tier is for small databases with a single operation at a given point in time.
There are many more such fully-managed relational database services available in the market
offered by different providers. But this is still an emerging field and many other companies are
in the process of developing their own services with more functionalities.
Amazon RDS, Google Cloud SQL and Azure SQL Database deliver RDBMS-as-a-Service.
14.5 NON-RELATIONAL DBMS IN CLOUD
The non-relational database system is another unique offering in the field of data-intensive
computing. With this database model, the new age non-relational data-sets can be processed
more efficiently than with traditional relational database systems. New age high-performance
computing environments like cloud computing systems are heavily dependent
on these non-relational database systems for efficient storage and retrieval of data. This section
focuses on the need for such a database, discusses its storage architecture and briefly
describes a few popular databases of this kind.
14.5.1 Emergence of Large Volume of Unstructured Data-sets
Earlier, the data managed by enterprise applications were structured in nature and small in
volume. But with the introduction of web-based portals towards the end of the last century, the
nature of web content or data started changing. The volume of data started to grow exponentially
and data became unstructured in character. Such data-sets were later classified and their
characteristics identified. This type of data or data-set is referred to as ‘Big data’.
[Link] Big data
Big data is a term used to describe both structured and unstructured data that are massive in
volume. It also covers data that are too diverse in nature and highly dynamic (very
fast-changing). Put differently, new age data whose volume, velocity or variety is too great
are termed Big data. These three characteristics of Big data are described below.
Volume: A typical PC probably had 10 gigabytes of storage in the year 2000. At
that time, excessive data volume was a storage issue as storage was not as cheap
as it is today. Today social networking sites generate a few thousand terabytes of data
every day.
Velocity: Data now stream in at an unprecedented rate and speed, so they must be dealt
with in a timely manner. Responding quickly to customers’ actions is a business challenge
for any organization.
Variety: Data of all formats are important today. Structured or unstructured text, audio,
video, images, 3D data and others are all being produced every day.
The above characteristics cause variability and complexity in managing big data.
Variability, in the sense that data flow can be highly inconsistent with periodic peaks, as it is
in social media or e-commerce portals. Complexity arises when it becomes
difficult to connect and correlate data or to define their hierarchies.
It is to be noted that big data is not only about the volume of data; it also takes into account
other characteristics of new age data, like their variety or speed of generation.
14.5.2 Time Appeared for an Alternative Database Model
Enterprise applications have used relational databases since the 1980s.
Those relational database systems were developed to store and process structured data-sets.
But these database systems started facing challenges as the volume of data began to increase
exponentially from the end of the last century, and the situation worsened after the introduction
of web-based social networking and e-commerce portals. Soon, the concept of big data emerged.
Online Transaction Processing (OLTP) applications flooded the web with very high volumes
of data from the beginning of the current century. These applications needed to function under
stiff latency constraints to provide consistent performance to a very large number of users, as
hundreds of millions of clients throughout the world were accessing them. These
sites were also experiencing massive variations in traffic. Some of these spikes were due to
predictable events like New Year, a business release or a sporting event, but most of the others
were due to unpredictable events, which were more difficult to manage. Data were being accessed
more frequently and needed to be processed more intensively.
Relational databases are appropriate for a wide range of tasks but not for every task.
The basic operations of any database are read and write. Read operations can be scaled by
distributing and replicating data to multiple servers. But data may become inconsistent
when a write or update operation takes place. And with new age data, the number of writers
is often much larger than the number of readers, especially on popular social networking sites.
One solution to this problem is to partition the data exclusively during distribution. But even
then, distributed unions (of data from database tables) may become slower and harder
to implement if the underlying storage architecture does not support them.
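The exclusive partitioning described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `PartitionedTable`, the function `shard_for` and the use of a SHA-256 hash to pick a partition are hypothetical choices made for this sketch, not the design of any particular product.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to exactly one shard using a stable hash (assumed scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


class PartitionedTable:
    """Hypothetical table whose rows are split exclusively across shards."""

    def __init__(self, shard_count: int) -> None:
        # Each dict stands in for the storage of one server.
        self.shards = [dict() for _ in range(shard_count)]

    def write(self, key: str, row: dict) -> int:
        # A write touches only the single shard that owns the key, so no
        # two servers ever hold conflicting copies of the same row.
        index = shard_for(key, len(self.shards))
        self.shards[index][key] = row
        return index

    def read(self, key: str):
        # A keyed read also goes to exactly one shard.
        return self.shards[shard_for(key, len(self.shards))].get(key)

    def union_where(self, predicate) -> list:
        # A query that is not keyed must visit every shard and merge the
        # partial results; this is the slower "distributed union".
        result = []
        for shard in self.shards:
            result.extend(row for row in shard.values() if predicate(row))
        return result
```

Keyed reads and writes stay cheap because each touches one partition, whereas `union_where` has to contact every partition and combine the answers, which is the cost the paragraph above refers to.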
Here, the main problem was that traditional SQL databases with relational systems
do not scale well. Traditional DBMSs can only ‘scale up’ (vertical scaling), that is, increase the
resources of a central server. But efficient processing of big data requires an excellent ‘scale
out’ (horizontal scaling) capability.
Web applications were moving towards the cloud computing model, and it did not take long
for pioneers of cloud computing services like Google and Amazon, many other e-commerce
and social networking companies, and technologists to realize that traditional relational
databases were no longer enough for handling new age data. They started to look for a suitable
database solution.
Traditional SQL databases do not fit well with the concept of horizontal scaling, yet horizontal
scaling is the only way to scale a database indefinitely.
[Link] Modern Age Database Requirements
Horizontal scaling emerged as one of the necessary attributes of a database system to keep pace
with the processing needs of large data-sets. It appeared impossible to deliver high performance
without distributing the data among multiple nodes and processing them in parallel. The other
major concern was the latency associated with transactions. This latency could be reduced by
caching frequently-used data in memory on dedicated servers, instead of fetching them every
time they were required. These facilities had to be incorporated in new age database systems to
reduce the response time and enhance the performance of applications. The databases had to be
highly optimized for simple retrieval and append operations. These things, along with many other
issues, worked as the driving forces behind the development of an alternative database system.
[Link] Role of Cloud Storage System
The characteristics of storage system had changed during this time. From the earlier concern
regarding cost of storage space, the cost of storage management was gradually becoming the
dominant element of storage systems. That opened the opportunity for replication of files into
storage across different geographic locations and hence the uses of distributed file systems
became widespread.
In such a scenario, the evolution of storage strategy introduced many different
models of distributed file systems like General Parallel File System (GPFS), Google File System
(GFS), Hadoop Distributed File System (HDFS) and others. All of these work well in
high-performance computing environments. The characteristics of such file systems and their
storage strategy suited the cloud’s dynamic architecture well. This created the opportunity to
develop scalable database systems (over these file systems) to store and manage modern age data.
Cloud native databases are facilitated by distributed storage systems and they are closely
associated with one another. Hence, the storage and database systems often overlap.
14.5.3 NoSQL DBMS
NoSQL is a class of database management system that does not follow all of the rules of a
relational DBMS. The term NoSQL can be interpreted as ‘Not Only SQL’, as it is not a replacement
for RDBMS but a complementary addition to it. This class of database uses SQL-like query
languages to make queries but does not use traditional SQL (Structured Query Language).
The term NoSQL was coined by Carlo Strozzi in 1998 to name the file-based
open-source relational database he was developing, which did not have an SQL interface.
However, this initial usage of the term is not directly linked with NoSQL as the term is
used at present. The term drew attention in 2009 when Eric Evans (an employee of the cloud
hosting company Rackspace) used it at a conference to represent the surge in development of
non-relational distributed databases at the time.
NoSQL is not against SQL; it was developed to handle unstructured big data efficiently
and to provide maximum business value.
[Link] The Evolution
The NoSQL movement started slowly in the early years of the current century as the IT industry
began to realize the need for a new database system to support web-based applications.
The early advances gained momentum when computing majors Google and Amazon published two
papers, in 2006 and 2007 respectively.
[Link] The BigTable Revolution
In 2004, Google employed a team to develop a storage system to manage Big data. BigTable is
the outcome of that effort. It is a proprietary distributed storage system built by Google on GFS
and has been in use since 2005. The storage system was built to manage large structured data-sets
and was designed to scale to a very large size. It is structured as a large table which may be
petabytes in size and distributed among tens of thousands of machines. BigTable has successfully
provided a flexible, high-performance solution for Google products like Google Earth, Google
Analytics and Orkut.
BigTable had a large impact on NoSQL database design after Google publicly
disclosed its details in a technical paper in 2006. This gave technologists scope
for open-source development of BigTable-like databases. Thus HBase, developed
by the Apache Foundation, and Cassandra, developed at Facebook, surfaced in the market.
Meanwhile, Amazon published a paper on its Dynamo storage system in 2007; Dynamo was also
built to address the challenges of working with big data.
BigTable, although built as a storage system, resembles a database system in many ways. It also
shares many implementation strategies with database technologies.
NoSQL database development remained closely associated with developments
in the field of cloud native file systems (or cloud storage systems) during those days. Soon, many
other web service players started working on the technology and, in a short period of time
starting around 2008, all of these developments became the source of a technology
revolution. NoSQL databases became prominent after 2009, when the general term
‘NoSQL’ was adopted to set apart these new databases (or, more correctly, file systems).
NoSQL database development has been closely associated with scalable file system
development in computing.
[Link] CAP Theorem
The abbreviation CAP stands for Consistency, Availability and Partition tolerance of data.
The CAP theorem (also known as Brewer’s theorem) says that it is impossible for a distributed
computer system to meet all three aspects of CAP simultaneously. Eric Brewer of the University
of California, Berkeley presented the theorem at an ACM (Association for Computing
Machinery) conference in 2000.
■■ Consistency: This means that data in the database remain consistent after the execution of an
operation. For example, once a data item is written or updated, all future read requests will
see that data.
■■ Availability: This guarantees that the database always remains available without any downtime.
■■ Partition tolerance: Here the database should be partitioned in such a way that if one part of
the database becomes unavailable, the other parts remain unaffected and can function properly.
This ensures availability of information.
Any database system must follow this ‘two-of-three’ philosophy. Thus, the relational database,
which focuses heavily on consistency, sacrifices the ‘partition tolerance’ attribute of CAP
(Figure 14.1). As already discussed, one of the primary goals of NoSQL systems is to
boost horizontal scalability. To scale horizontally, a system needs strong network partition
tolerance, which requires giving up either the ‘consistency’ or the ‘availability’ attribute of
CAP. Thus, all NoSQL databases follow either the CP (consistency-partition tolerance) or the
AP (availability-partition tolerance) combination of CAP attributes. This means that some
NoSQL databases even drop consistency as an essential attribute. For example, while
HBase maintains the CP criteria, the other popular database Cassandra maintains the AP criteria.
[Figure: the three CAP attributes (Consistency, Availability, Partition tolerance) and their
pairwise combinations. CA: RDBMS. CP: NoSQL databases of the column-family category (HBase,
MongoDB). AP: NoSQL databases of the document-oriented and key-value categories (DynamoDB,
Cassandra, CouchDB).]
FIG 14.1: ‘Two-of-three’ combination of CAP philosophy
Some of the NoSQL databases even choose to relax the ‘consistency’ issue from the CAP
criteria and this philosophy suits well in certain distributed applications.
Different combinations of CAP criteria are to serve different kinds of requirements. Database
designers analyze specific data processing requirements before choosing one.
CA: This is suitable for systems designed to run over a cluster on a single site, so that all
of the nodes always remain in contact. Hence, the worry of network partitioning
almost disappears. But if a partition does occur, the system fails.
CP: This model is tolerant of network partitioning, but is suitable for systems where
24 × 7 availability is not a critical issue. Some data may become inaccessible for a while, but
the rest remain consistent and accurate.
AP: This model is also tolerant of network partitioning, as partitions are designed
to work independently. 24 × 7 availability of data is also assured, but sometimes some of the
data returned may be inaccurate.
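The difference between the CP and AP choices can be made concrete with a toy model. The sketch below is only an illustration: the class `TwoNodeStore`, its `mode` argument and the `partitioned` flag are hypothetical names invented here, and a real database handles partitions with far more machinery (quorums, timeouts, repair).

```python
class TwoNodeStore:
    """Hypothetical two-replica store that runs in 'CP' or 'AP' mode."""

    def __init__(self, mode: str) -> None:
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.values = [None, None]   # one stored value per replica
        self.partitioned = False     # True when the replicas cannot talk

    def write(self, node: int, value) -> bool:
        if not self.partitioned:
            # Healthy network: update both replicas together.
            self.values = [value, value]
            return True
        if self.mode == "CP":
            # The other replica is unreachable, so the write is refused
            # to keep both copies identical (availability is given up).
            return False
        # AP: accept the write on the reachable replica only, so the
        # two copies diverge (consistency is given up).
        self.values[node] = value
        return True

    def read(self, node: int):
        return self.values[node]
```

With the network split, the CP store rejects the update and both replicas still return the old value, while the AP store accepts it and the two replicas return different values until they are reconciled.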
Partition tolerance is an essential criterion for NoSQL databases, as one of their primary goals
is horizontal scalability.
[Link] BASE Theorem
Relational database systems treat consistency and availability as essential criteria.
Fulfillment of these criteria is ensured by following the ACID (Atomicity, Consistency,
Isolation and Durability) properties in RDBMS. NoSQL databases tackle the consistency issue
in a different way. They are not so stringent on consistency; rather, they focus on partition
tolerance and availability. Hence, NoSQL databases no longer need to follow the ACID rules.
A NoSQL database should be much easier to scale out (horizontal scaling) and capable of
handling large volumes of unstructured data. To achieve this, NoSQL databases usually follow
the BASE principle, which stands for ‘Basically Available, Soft state, Eventual consistency’. The
BASE theorem was also defined by Eric Brewer, who is known for formulating the CAP theorem.
The three criteria of BASE are explained below:
■■ Basically Available: This principle states that data should remain available even in the
presence of multiple node failures. This is achieved by using a highly-distributed approach
with multiple replications in database management.
■■ Eventual Consistency: This principle states that immediately after an operation, data may
appear inconsistent, but ultimately they should converge to a consistent state. For
example, two users querying for the same data immediately after a transaction (on that data)
may get different values. But finally, consistency will be regained.
■■ Soft State: The eventual consistency model allows the database to be inconsistent for some
time. But to bring it back to a consistent state, the system should allow its state to change
over time even without any input. This is known as the soft state of the system.
BASE does not require immediate consistency. The AP region of Figure 14.1 follows the BASE
theory. The idea behind this is that data consistency is the application developer’s problem and
should be handled by the developer through appropriate programming techniques. The database
no longer handles the consistency issue. This philosophy helps to achieve the scalability goal.
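Eventual consistency and soft state can be illustrated with a small simulation. The following sketch assumes a simple "newest version wins" rule and a background synchronization pass; the names `EventualReplica` and `anti_entropy` and the integer version numbers are hypothetical devices for this example, and real systems use more elaborate conflict-resolution schemes.

```python
class EventualReplica:
    """Hypothetical replica that keeps the newest version of each key."""

    def __init__(self) -> None:
        self.data = {}  # key -> (version, value)

    def put(self, key, value, version: int) -> None:
        current = self.data.get(key)
        # Accept the update only if it is newer than what is stored.
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def get(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def anti_entropy(replicas) -> None:
    """Background sync: every replica offers its entries to all others.

    The state of a replica changes here without any client input, which
    is the 'soft state' idea; after the pass all replicas agree.
    """
    for source in replicas:
        for key, (version, value) in list(source.data.items()):
            for target in replicas:
                target.put(key, value, version)
```

A possible run: both replicas hold version 1 of a record, a client then writes version 2 to the first replica only, and two users reading from different replicas briefly see different values. Once `anti_entropy` has run, both replicas return the version 2 value.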
To satisfy the scalability and data distribution demands in NoSQL, it was no longer possible to
meet all the four criteria of ACID simultaneously. Hence, BASE theorem was proposed as an
alternative.
14.5.4 Features of NoSQL Database
NoSQL databases introduce many new features in comparison with relational databases. A few
of these features run counter to relational DBMS concepts. They can be listed as schema-free,
non-relational, horizontally scalable and distributed.
[Link] Flexible Schemas
Relational database systems cannot handle data whose structure is not known in advance. The
schema of the database and its tables must be defined before any data are stored in it. With
this schema-based design, it becomes difficult to manage agile data-sets. When a new field
(column) has to be introduced in some table in the middle of business operations, the change
can be extremely disruptive because it requires alteration of the schema. On large production
databases this can be a slow process and may involve significant downtime.
NoSQL databases are designed to allow insertion of data without a pre-defined schema. This
makes it very easy to incorporate real-time changes in an application whenever required, as
such changes do not cause service interruption.
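The contrast can be tried out with Python's built-in `sqlite3` module standing in for a relational database and a plain dictionary standing in for a schema-free store. This is a minimal sketch: the table name `customer`, the field `twitter` and the sample names are made up for illustration, and SQLite's quick `ALTER TABLE` says nothing about how long a schema change takes on a large production system.

```python
import sqlite3

# Relational side: the schema must exist before any row is stored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customer (id, name) VALUES (1, 'Asha')")

# A row carrying a field that the schema does not define is rejected.
try:
    conn.execute(
        "INSERT INTO customer (id, name, twitter) VALUES (2, 'Ravi', '@ravi')"
    )
    relational_insert_ok = True
except sqlite3.OperationalError:
    relational_insert_ok = False

# The schema has to be altered first; only then can the new field be used.
conn.execute("ALTER TABLE customer ADD COLUMN twitter TEXT")
conn.execute(
    "INSERT INTO customer (id, name, twitter) VALUES (2, 'Ravi', '@ravi')"
)

# Schema-free side: each record simply carries whatever fields it has.
customers = {}
customers[1] = {"name": "Asha"}
customers[2] = {"name": "Ravi", "twitter": "@ravi"}  # new field, no migration
```

In the relational case every existing row gains the new column (with an empty value), while in the schema-free case older records are left exactly as they were.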
Unlike relational databases, NoSQL databases are schema-free.
[Link] Non-relational
NoSQL databases can manage non-relational data efficiently along with relational data. The
relational constraints of RDBMS are not applicable in these databases. This makes it easier to
manage non-relational data using a NoSQL database.
[Link] Scalability
Relational databases are designed to scale vertically. But vertical scaling has its own
limitations, as it does not allow new servers to be introduced into the system to share the load.
Horizontal scalability is the only way to scale indefinitely, and it is also cheaper than vertical
scaling. NoSQL databases are designed to scale horizontally with minimum effort.
[Link] Auto-distribution
Distributed relational databases allow fragmentation and distribution of a database across
multiple servers. But this does not happen automatically; it is a manual process that has to be
handled by the application, which makes it difficult to manage. In NoSQL databases, on the other
hand, distribution happens automatically. Application developers need not worry about it:
distribution and load balancing are automated in the database itself.
Distribution and replication of data segments are not inherent features of relational databases;
they are the responsibility of application developers. In NoSQL, they happen automatically.
[Link] Auto-replication
Not only fragmentation and distribution; replication of database fragments is also an
automatic process in NoSQL. No external programming is required to replicate fragments
across multiple servers. Replication ensures high availability of data and supports recovery.
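Automatic distribution and replication can be sketched together. In the toy model below the application only calls `put` and `get`; the hypothetical `Cluster` class decides by itself which nodes hold each key. The hashing scheme, the node names and the `down` set used to simulate failures are assumptions of this sketch, not a description of any specific NoSQL product.

```python
import hashlib


class Cluster:
    """Hypothetical cluster that places every key on several nodes itself."""

    def __init__(self, node_names, replicas: int = 2) -> None:
        if replicas > len(node_names):
            raise ValueError("replicas cannot exceed the number of nodes")
        self.order = sorted(node_names)
        self.nodes = {name: {} for name in self.order}
        self.replicas = replicas
        self.down = set()  # names of nodes that are currently unreachable

    def _owners(self, key: str) -> list:
        # Hash the key to pick a starting node, then take the next nodes
        # in order as additional replicas.
        digest = int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)
        start = digest % len(self.order)
        return [
            self.order[(start + i) % len(self.order)]
            for i in range(self.replicas)
        ]

    def put(self, key: str, value) -> None:
        # Simplification: writes are assumed to happen while all owners
        # are reachable, so every replica receives the value.
        for name in self._owners(key):
            self.nodes[name][key] = value

    def get(self, key: str):
        # Read from the first owner that is still reachable.
        for name in self._owners(key):
            if name not in self.down:
                return self.nodes[name].get(key)
        raise RuntimeError("all replicas for this key are down")
```

Because each value is stored on two nodes in this configuration, marking one owner of a key as down still leaves a replica that can answer the read, which is how replication supports availability and recovery.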
[Link] Integrated Caching
NoSQL database often provides integrated caching capability. This feature reduces latency
and increases throughput by keeping frequently-used data in system memory as much as
possible. In relational database, a separate caching layer needs to be maintained to achieve this
performance goal.
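The effect of keeping frequently-used data in memory can be shown with a small read-through cache. This is a sketch only: `CachedStore`, the least-recently-used eviction policy and the `backing_reads` counter are illustrative assumptions, and the dictionary passed as `backing` merely stands in for slower disk or network storage.

```python
from collections import OrderedDict


class CachedStore:
    """Hypothetical store with a small in-memory cache in front of it."""

    def __init__(self, backing: dict, capacity: int) -> None:
        self.backing = backing          # stands in for slow storage
        self.capacity = capacity        # maximum number of cached items
        self.cache = OrderedDict()      # oldest entry first
        self.backing_reads = 0          # how often slow storage was hit

    def get(self, key):
        if key in self.cache:
            # Cache hit: mark the entry as recently used and return it.
            self.cache.move_to_end(key)
            return self.cache[key]
        # Cache miss: fetch from slow storage and remember the result.
        self.backing_reads += 1
        value = self.backing[key]
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            # Evict the least recently used entry.
            self.cache.popitem(last=False)
        return value
```

Repeated reads of the same key reach the backing storage only once, so latency falls for hot data; the capacity limit keeps memory use bounded at the cost of occasional misses.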
It should be added, however, that although NoSQL databases offer many advantages
over relational databases, in some specific scenarios they fail to provide the rich reporting
and analytical functionality that an RDBMS offers.
Despite their many benefits, NoSQL databases fail in specific cases to provide the rich analytical
functionality that an RDBMS provides.
14.5.5 NoSQL Database Types
There are four different types of NoSQL databases. Each of them is designed to address the needs
of a particular class of problems. Various NoSQL database service providers try to offer solutions
for different types of problems. The following sections describe the four NoSQL database types.
14.5.5.1 Key-Value Database
The Key-Value Database (or KV Store) is the simplest among the various NoSQL databases.
It pairs up data with a key and maintains the database like a hash-table where data values are
referred to by their keys. The main benefit of such pairing is that it makes the database easily
scalable. However, it is not suitable where queries are based on the value rather than on the key.
Amazon’s DynamoDB, Azure Table Storage and Riak are a few popular examples of this type of
NoSQL database.
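The hash-table behaviour described above, and the reason value-based queries fit poorly, can be seen in a small sketch. The class `KeyValueStore` and its method names are hypothetical and do not correspond to the API of any product named in the text.

```python
class KeyValueStore:
    """Toy key-value store backed by a hash table (a Python dict)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        # Lookup by key goes straight to the entry through the hash table.
        return self._data.get(key)

    def find_keys_by_value(self, predicate):
        # No index exists on values, so every entry has to be inspected.
        return [key for key, value in self._data.items() if predicate(value)]


store = KeyValueStore()
store.put("cart:alice", ["book", "pen"])
store.put("cart:bob", ["lamp"])

print(store.get("cart:alice"))                                   # direct lookup by key
print(store.find_keys_by_value(lambda items: "lamp" in items))   # full scan
```

The lookup by key does not depend on how many entries are stored, whereas the value-based query has to visit all of them.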
14.5.5.2 Document-Oriented Database
A Document-Oriented Database (or Document Store) is a database in which data is stored in
documents. It is similar to a key-value store, with the values stored as structured documents.
Each document is addressed by a key and can be retrieved from the database using that key.
This key can be a path, a URI or a simple string.
The documents are schema-free and can be of any format as long as the database application
can understand their internal structure. Generally, document-oriented databases use formats
such as XML, JSON (JavaScript Object Notation) or Binary JSON (BSON).
One document can be referred to by multiple keys, and a document can refer to other
documents by storing their keys. But each document is treated as stand-alone and there is no
constraint to enforce relational integrity. MongoDB, Apache CouchDB and Couchbase are a few
popular examples of document-oriented databases.
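The following sketch illustrates the points above with JSON documents held in memory. `DocumentStore` and its methods are hypothetical names, the keys are written as paths purely by choice, and the `find` method is an assumption about how a store that understands document structure might answer a query on a field.

```python
import json


class DocumentStore:
    """Toy document store: keys map to schema-free JSON documents."""

    def __init__(self):
        self._docs = {}

    def save(self, key, document):
        self._docs[key] = json.dumps(document)  # stored as JSON text

    def load(self, key):
        return json.loads(self._docs[key])

    def exists(self, key):
        return key in self._docs

    def find(self, field, value):
        # Possible because the store can parse the documents' internal structure.
        return [key for key, raw in self._docs.items()
                if json.loads(raw).get(field) == value]


store = DocumentStore()
# Two documents with different fields: no common schema is required.
store.save("books/1", {"title": "Cloud Basics", "author_key": "authors/42"})
store.save("books/2", {"title": "Data Models", "pages": 310, "tags": ["db"]})

book = store.load("books/1")
# The reference to another document is just a stored key; nothing checks it.
print(store.exists(book["author_key"]))
print(store.find("pages", 310))
```

The first document refers to `authors/42`, which was never saved. The store accepts this, which reflects the absence of relational integrity constraints noted above.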
14.5.5.3 Column-Family Database
A Column-Family Database (or Wide-Column Data Store/Column Store) stores data grouped
in columns. Each column consists of three elements: a name, a value and a time-stamp. The name
is used to refer to the column and the time-stamp is used to identify the required version of
the content. For example, the time-stamp is useful in finding the most up-to-date content.
Columns of a similar type together form a column family, whose columns are often accessed
together. A column family can contain a virtually unlimited number of columns.
In relational databases, each row is stored as a continuous disk entry. Different rows may
get stored in different places on disk. Contrary to this, in a column-family database, all of the
cells corresponding to a column are stored as a continuous disk entry. This makes access to
the data of a column faster. For example, searching for a particular title among the records of
a million books stored in a relational data model is an intensive task, as it may require reading
every row from disk. Using the column-family data model, on the other hand, only the
contiguous title column needs to be read, which takes far fewer disk accesses.
The difference between column stores and key-value stores is that column stores are
optimized to handle data along columns. Column stores show better analytical power and
provide improved performance by imposing a certain amount of rigidity on the database schema.
In some ways, column stores are an intermediate solution between traditional RDBMSs and
key-value stores. HBase, from the Hadoop project, is a popular example of a column store-based
database.
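The two ideas in this section, time-stamped columns and column-wise storage, can be sketched as follows. The sample book records and the names `Column`, `latest`, `find_title_row_store` and `find_title_column_store` are invented for illustration, and Python lists stand in for disk layout; the sketch does not reflect the storage format of any real product.

```python
from collections import namedtuple

# A column as described above: a name, a value and a time-stamp.
Column = namedtuple("Column", ["name", "value", "timestamp"])


def latest(columns, name):
    """Return the most recent value stored under a column name."""
    versions = [c for c in columns if c.name == name]
    return max(versions, key=lambda c: c.timestamp).value


# Row-wise layout: all fields of one record are kept together.
rows = [
    {"title": "Cloud Basics", "author": "A. Rao", "year": 2015},
    {"title": "Data Models", "author": "B. Sen", "year": 2017},
    {"title": "Graph Theory", "author": "C. Das", "year": 2012},
]

# Column-wise layout: all values of one field are kept together.
columns = {field: [row[field] for row in rows] for field in rows[0]}


def find_title_row_store(rows, title):
    # Every full row is touched, although only one field is of interest.
    for position, row in enumerate(rows):
        if row["title"] == title:
            return position
    return None


def find_title_column_store(columns, title):
    # Only the contiguous "title" column is read.
    return columns["title"].index(title)


price_history = [Column("price", 300, 1), Column("price", 280, 2)]
print(latest(price_history, "price"))
print(find_title_row_store(rows, "Data Models"),
      find_title_column_store(columns, "Data Models"))
```

Both search functions return the same position. The difference lies in how much unrelated data has to be read along the way.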
14.5.5.4 Graph Database
In a Graph Database (or Graph Store), data is stored as graph structures using nodes and
edges. Entities are represented as nodes and the relationships between entities as edges.
A graph database follows index-free adjacency, where every node directly points to its adjacent
nodes. In this setup, the cost of a hop from a node to its neighbour remains the same as the
number of nodes increases.
This is useful for storing information about relationships when the number of elements is huge,
such as social connections. Twitter uses such a database to track who is following whom. Examples
of popular graph-based databases include Neo4j, InfoGrid, InfiniteGraph and a few others.
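A small sketch of the "who is following whom" case is shown below. The class `GraphStore` is hypothetical; it keeps, for every node, direct references to its neighbours in both directions, which is a simplified stand-in for index-free adjacency.

```python
class GraphStore:
    """Toy graph store: each node holds direct references to its neighbours."""

    def __init__(self):
        self._following = {}  # node -> set of nodes it follows
        self._followers = {}  # node -> set of nodes that follow it

    def add_node(self, name):
        self._following.setdefault(name, set())
        self._followers.setdefault(name, set())

    def follow(self, follower, followee):
        # A directed edge follower -> followee, recorded on both end nodes.
        self.add_node(follower)
        self.add_node(followee)
        self._following[follower].add(followee)
        self._followers[followee].add(follower)

    def following(self, name):
        return set(self._following.get(name, ()))

    def followers(self, name):
        return set(self._followers.get(name, ()))


graph = GraphStore()
graph.follow("asha", "ravi")
graph.follow("meena", "ravi")
graph.follow("ravi", "asha")

print(sorted(graph.followers("ravi")))
print(sorted(graph.following("ravi")))
```

Finding the neighbours of a node only touches that node's own adjacency sets, so the cost of one hop does not grow with the total number of nodes in the graph.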
14.5.6 Selecting the Suitable NoSQL Database Solution
Each of the NoSQL database types has its own strengths and weaknesses. They are designed to
serve different kinds of data storage requirements and hence are not comparable to each other.
The ‘one-size-fits-all’ philosophy of relational databases is not applicable in the NoSQL database
domain. Here the users have the flexibility of choosing among multiple options after analyzing the
requirements of their applications.
Selecting the NoSQL database strategy is not a one-time decision. First, one will have
to identify the requirements of an application which are not met by relational database
systems. Then a suitable NoSQL database solution has to be identified to meet those unfulfilled
requirements. More than one NoSQL database type may even be used to meet all of these
necessities.
Sometimes a single application may provide optimized performance when more than one
NoSQL database type is employed together. In such cases, a multi-model NoSQL database can be
used, which is designed to support several of the four primary NoSQL data models.
The days when one DBMS was used to fit all needs are over. Now, a single application may use
several different data stores at the back-end.
14.5.7 Commercial NoSQL Databases
Commercial NoSQL databases started surfacing two years after the publication of Google’s
paper on BigTable in 2006 and Amazon’s paper on Dynamo in 2007. After these publications,
many initiatives were taken up for both open-source and closed-source development of NoSQL
databases. By the end of 2009, there were several releases, including the BigTable-inspired HBase
and the Dynamo-inspired Riak and Cassandra. The following sections briefly describe some of
the popular NoSQL databases.
14.5.7.1 Apache’s HBase
HBase is an open-source NoSQL database system written in Java. It was developed by the Apache
Software Foundation as part of its Hadoop project. HBase’s design architecture was
inspired by Google’s internal storage system BigTable. Just as Google’s BigTable uses GFS,
HBase uses HDFS as its underlying file system. HBase is a column-oriented database management
system.
14.5.7.2 Amazon’s DynamoDB
DynamoDB is a key-value NoSQL database developed by Amazon. It was launched in 2012 and
derives its name from Dynamo, which is Amazon’s internal storage system. The database
service is fully managed by Amazon and offered as part of the Amazon Web Services portfolio.
DynamoDB is especially useful for supporting a large volume of concurrent updates and is well
suited to shopping-cart-like operations.
14.5.7.3 Apache’s Cassandra
Cassandra is an open-source NoSQL database management system developed in Java. It was
initially developed at Facebook and was released as an open-source project in 2008 with
the goal of further advancement. Although Facebook depended heavily on Cassandra, it still
released the system as open source, possibly assured that it was already too late for
others to use the technology to challenge its position. Cassandra became an Apache
Incubator project in 2009. Cassandra is a hybrid of column-oriented and key-value data stores
and is suitable for deployment across many commodity servers as well as on cloud infrastructure.
14.5.7.4 Google Cloud Datastore
Cloud Datastore is developed by Google and is available as a fully managed NoSQL database
service. Cloud Datastore is easy to use and supports SQL-like queries through a query language
called GQL. The Datastore is a NoSQL key-value database where users can store data as key-value
pairs. Cloud Datastore also supports ACID transactions using optimistic concurrency control.
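Optimistic concurrency control in general can be sketched as a version check at write time. The example below is a generic illustration and not the Cloud Datastore API: the names `VersionedStore`, `VersionConflict` and `increment` are hypothetical, and the version-number scheme is an assumption made for the sketch.

```python
class VersionConflict(Exception):
    """Raised when a value changed after the writer had read it."""


class VersionedStore:
    """Toy store in which every key carries a version number."""

    def __init__(self):
        self._items = {}  # key -> (version, value)

    def read(self, key):
        # Unknown keys are reported as version 0 with no value.
        return self._items.get(key, (0, None))

    def write(self, key, new_value, expected_version):
        current_version, _ = self._items.get(key, (0, None))
        if current_version != expected_version:
            # Someone else committed first; no lock was ever held.
            raise VersionConflict(key)
        self._items[key] = (current_version + 1, new_value)
        return current_version + 1


def increment(store, key, retries=3):
    """Read, modify and write back, retrying when a conflict is detected."""
    for _ in range(retries):
        version, value = store.read(key)
        try:
            return store.write(key, (value or 0) + 1, version)
        except VersionConflict:
            continue
    raise VersionConflict(key)


store = VersionedStore()
version, _ = store.read("cart")            # two clients both read version 0
store.write("cart", ["book"], version)     # the first client commits
try:
    store.write("cart", ["pen"], version)  # the second client is now stale
except VersionConflict:
    print("conflict detected, second client must re-read and retry")
```

No lock is taken while a client works on its copy. The conflict is detected only when the stale write is attempted, and the losing client re-reads and retries.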
14.5.7.5 MongoDB
MongoDB is a popular document-oriented open-source NoSQL database. It is developed by
New York City-based MongoDB Inc. and was first released as a product in 2009. It is written
in the C++, JavaScript and C programming languages and provides GridFS as a built-in mechanism
for storing large files. MongoDB runs well on many cloud-based environments including Amazon EC2.
14.5.7.6 Amazon’s SimpleDB
SimpleDB is a fully managed NoSQL data store offered by Amazon. It is a key-value store and
not actually a full database implementation. SimpleDB was first announced in December 2007
and works with both Amazon EC2 and Amazon S3.
14.5.7.7 Apache’s CouchDB
CouchDB is an open-source document-oriented NoSQL database. CouchDB was first developed
in 2005 by a former IBM developer. In 2008, it was adopted as an Apache Incubator
project. The first stable version of CouchDB was released in 2010, and it then became popular.
14.5.7.8 Neo4j
Neo4j is an open-source graph database. It is developed in Java. Neo4j was developed by Neo
Technology of the United States and was initially released in 2007. Its stable versions
started appearing from 2010.
Apart from the few NoSQL databases mentioned above, numerous other products are
available in the market. New developments are happening in this field and many
new products are being launched. Database management giants such as Oracle have also
launched their own NoSQL databases. Hence, many more advancements are expected in
this domain in the coming years.
SUMMARY
❖❖ Database facility in cloud can be availed in two forms. One is by installing available database
applications over cloud servers. The other one is offered by the providers as fully-managed
Database-as-a-Service delivery.
❖❖ Most popular virtual machines provide support for the common RDBMS applications like
Oracle Server, Microsoft’s SQL Server or open-source MySQL Database.
❖❖ Amazon RDS, Google Cloud SQL, Azure SQL Database are some popular examples of relational
Database-as-a-Service offerings. Such database services are fully-managed by the providers.
❖❖ New age data are huge in volume, unstructured in nature, come in various formats and are
accessed or produced more frequently. Such data sets were identified as Big data.
❖❖ The data-intensive computing requirements of large volume unstructured data cannot
be fulfilled with schema-based relational databases. Distribution of data does not happen
automatically in them and it becomes the programmer’s responsibility. Hence it becomes
difficult to develop scalable applications.
❖❖ Performance became an issue when such large volumes of data were being accessed
frequently from all over the world. Horizontal scaling of the database was the only solution to
this problem.
❖❖ Non-relational database systems (NoSQL database) emerged to fill this gap. NoSQL scales
automatically and is able to fulfill the processing and storage requirements of large
unstructured data-sets.
❖❖ NoSQL databases generally do not guarantee the ACID properties. They follow the BASE model
instead.
❖❖ Four different types of NoSQL databases are present in the market. Each one of them is built
to serve specific purposes. One application can use multiple databases if required.
❖❖ Google made significant contributions in the development of NoSQL database systems as
it introduced a data-store called BigTable. Later, many open-source and proprietary NoSQL
databases were released by different vendors.
REVIEW QUESTIONS
How is accessing the Azure SQL Database-as-a-Service different from deploying SQL
Server Application on Microsoft Azure cloud and using it?
Azure SQL is a database-as-a-Service (DBaaS) facility offered by Microsoft. It is the implementation of
Microsoft SQL Server over Azure cloud. This is a fully-managed service which users can access like any