Slides PDF - Module 3

The document discusses NoSQL Big Data management, focusing on distributed systems, MongoDB, and Cassandra. It outlines the features and advantages of NoSQL databases, including flexibility, scalability, and performance, while also addressing the differences between traditional SQL and NoSQL systems. Additionally, it covers various NoSQL data architectures, such as key-value stores and document stores, along with their characteristics and typical use cases.

Uploaded by

sudeepmansh

Big Data Analytics

18CS72

Module 3
NoSQL Big Data management,
MongoDB and Cassandra
Sudeep Manohar
Department of ISE
JNNCE, Shivamogga
3.1 Introduction
▪ Big Data uses distributed systems.
▪ A distributed system consists of multiple data nodes at clusters
of machines and distributed software components.
▪ The tasks execute in parallel with data at nodes in clusters.
▪ The computing nodes communicate with the applications
through a network.
Following are the features of distributed-computing architecture:
1. Increased reliability and fault tolerance: The important
advantage of a distributed computing system is reliability. If a
segment of machines in a cluster fails, the rest of the
machines continue to work. When the datasets are replicated at a
number of data nodes, the fault tolerance increases further. The
replicas in the remaining segments continue the same computations
that were being done at the failed segment's machines.
2. Flexibility makes it very easy to install, implement and debug
new services in a distributed environment.
3. Sharding is storing the different parts of data onto different
sets of data nodes, clusters or servers. For example, a huge
database of university students, on sharding, divides into
smaller parts called shards. Each shard may correspond to a
database for an individual course and year. Each shard is stored
at a different node or server.
4. Speed: Computing power increases in a distributed computing
system as shards run in parallel on individual data nodes in
clusters independently (no data sharing between shards).
5. Scalability: Consider sharding of a large database into a
number of shards, distributed for computing in different systems.
When the database expands further, adding more machines
and increasing the number of shards provides horizontal
scalability. Increasing the computing power and running more
algorithms on the same machines provides vertical scalability.
6. Resource sharing: Shared resources of memory, machines
and network architecture reduce the cost.
7. Open system makes the service accessible to all nodes.
8. Performance: The collection of processors in the system
provides higher performance than a centralized computer, due to
lesser cost of communication among machines (cost means time
taken up in communication).
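
The sharding example in feature 3 can be sketched in a few lines of Python. This is only an illustration: the record fields, the node names and the round-robin placement are assumptions made for the sketch, not part of the slides.

```python
from collections import defaultdict

def shard_key(record):
    # Assumed shard key: one shard per (course, year), as in the example above.
    return (record["course"], record["year"])

def shard_records(records):
    """Split a list of records into shards keyed by (course, year)."""
    shards = defaultdict(list)
    for record in records:
        shards[shard_key(record)].append(record)
    return dict(shards)

def assign_to_nodes(shards, nodes):
    """Place shards on nodes round-robin (a deliberately simple policy)."""
    placement = {}
    for i, key in enumerate(sorted(shards)):
        placement[key] = nodes[i % len(nodes)]
    return placement

# Hypothetical student records and node names.
students = [
    {"id": 1, "course": "CS", "year": 2021},
    {"id": 2, "course": "CS", "year": 2022},
    {"id": 3, "course": "ISE", "year": 2021},
    {"id": 4, "course": "CS", "year": 2021},
]
shards = shard_records(students)
placement = assign_to_nodes(shards, ["node-a", "node-b"])
```

Adding a third node name to the list is the horizontal-scaling step of feature 5: the same shards spread over more machines.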
3.2 NOSQL Data store
▪ SQL is a programming language based on relational algebra.
▪ It is a declarative language and it defines the data schema.
▪ SQL creates and manages databases in an RDBMS.
▪ An RDBMS uses a tabular data store with relational algebra, that is,
precisely defined operators with relations as the operands.
▪ A relation is a set of tuples.
▪ A tuple is uniquely identified by keys called candidate keys.
ACID Properties in SQL Transactions

▪ Atomicity of a transaction means all operations in the transaction
must complete, and if interrupted, must be undone (rolled
back). For example, if a customer withdraws an amount, the
bank in the first operation enters the withdrawn amount in the
table and in the next operation modifies the balance with the new
amount available. Atomicity means both should be completed,
or else both undone if interrupted in between.
▪ Consistency in transactions means that a transaction must
maintain the integrity constraints, and follow the consistency
principle. For example, the difference between the sum of deposited
amounts and the sum of withdrawn amounts in a bank account must
equal the last balance. All three data need to be consistent.
▪ Isolation of transactions means concurrent transactions on the
database must be isolated from each other, so that each behaves
as if it were done separately.
▪ Durability means a transaction must persist once completed.
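
The withdrawal example for atomicity can be tried with Python's built-in sqlite3 module. The sketch below is illustrative only: the table names, column names and the `withdraw` helper are assumptions, not taken from the slides.

```python
import sqlite3

def withdraw(conn, account_id, amount):
    """Record a withdrawal and update the balance as one atomic transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "INSERT INTO withdrawals (account_id, amount) VALUES (?, ?)",
                (account_id, amount),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, account_id),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE id = ?", (account_id,)
            ).fetchone()
            if balance < 0:
                # Integrity constraint violated: abort both operations.
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False

# Hypothetical schema and starting balance.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("CREATE TABLE withdrawals (account_id INTEGER, amount INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100)")
conn.commit()

ok = withdraw(conn, 1, 30)       # both operations commit
failed = withdraw(conn, 1, 500)  # both operations are rolled back
balance = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]
rows = conn.execute("SELECT COUNT(*) FROM withdrawals").fetchone()[0]
```

After the failed call, neither the withdrawal row nor the balance change survives, which is the "both completed, or both undone" behaviour described above.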

SQL supports triggers, views, schedules and joins

▪ A trigger is a special stored procedure which executes when a
specific action occurs within a database.
▪ A view is a logical construct which saves a part of a
complex query, and thereby reduces the complexity.
▪ A schedule is a chronological sequence of instructions
which execute concurrently.
▪ A join operation pairs two tuples obtained from
different relational expressions.
3.2.1 NOSQL

▪ A new category of data stores is NoSQL (means Not Only SQL)
data stores.
▪ NoSQL is an altogether new approach to thinking about
databases, with features such as schema flexibility, simple
relationships, dynamic schemas, auto sharding, replication,
integrated caching, horizontal scalability of shards,
distributable tuples, semi-structured data and flexibility in
approach.
▪ Issues with NoSQL data stores are lack of standardization in
approaches, processing difficulties for complex queries, and
dependence on eventually consistent results in place of
consistency in all states.
Big Data NoSQL

▪ NoSQL records are in non-relational data store systems.
▪ They use flexible data models.
▪ The records use multiple schemas.
▪ NoSQL data stores typically hold semi-structured data.
▪ Big Data stores use NoSQL.
NoSQL data store characteristics are as follows:
1. NoSQL is a class of non-relational data storage systems with
a flexible data model. Examples are key-value pairs, name/value
pairs, column-family Big Data store, tabular data store, Cassandra,
HBase, hash table, unordered keys using JSON (CouchDB), JSON
(PNUTS), JSON (MongoDB), graph data store, object store,
ordered keys and semi-structured data storage systems.
2. NoSQL data stores do not necessarily have a fixed schema, such
as a table, and do not use the concept of joins (in distributed
data storage systems). Data written at one node can be replicated
to multiple nodes. The data store is thus fault tolerant. The store
can be partitioned into unshared shards.
Features in NoSQL Transactions

NoSQL transactions have the following features:

1. Relax one or more of the ACID properties.
2. Are characterized by two out of the three properties of the CAP
theorem (consistency, availability and partition tolerance). A
distributed data store can guarantee at most two of the three at
the same time for the application/service/process.
3. Can be characterized by BASE properties (BASE stands for
Basically Available, Soft State, and Eventual Consistency).
Big Data NoSQL Solutions

NoSQL DBs are needed for Big Data solutions. The table below gives
examples of widely used NoSQL data stores.

NoSQL data store | Description
Apache HBase | HDFS compatible, open-source and non-relational data store written in Java; a column-family based NoSQL data store providing BigTable-like capabilities: scalability, strong consistency, versioning, configuring and maintaining data store characteristics
MongoDB | HDFS compatible; master-slave distribution model; document-oriented data store with JSON-like documents and dynamic schemas; open-source, NoSQL, scalable and non-relational database; used by websites Craigslist, eBay and Foursquare at the backend
Apache Cassandra | HDFS compatible; decentralized peer-to-peer distribution model; open source; NoSQL; scalable, non-relational, column-family based, fault-tolerant, with tuneable consistency; used by Facebook and Instagram
Apache CouchDB | An Apache project which is also a widely used database for the web. CouchDB is a document store. It uses the JSON data exchange format to store its documents, JavaScript for indexing, combining and transforming documents, and HTTP APIs
Oracle NoSQL | Step towards NoSQL data store; distributed key-value data store; provides transactional semantics for data manipulation, horizontal scalability, simple administration and monitoring
Riak | An open-source key-value store written in Erlang; high availability (using the replication concept), fault tolerance, operational simplicity and scalability
CAP Theorem

Among C, A and P, a distributed data store can guarantee at most
two at the same time for the application/service/process.
▪ Consistency means all copies have the same value, as in
traditional DBs.
▪ Availability means at least one copy is available in case a
partition becomes inactive or fails. For example, in web
applications, the other copy in the other partition is available.
▪ Partition means division of a large database into different
databases without affecting the operations on them, by adopting
specified procedures.
▪ Partition tolerance refers to continuation of operations as a
whole even in case of message loss, node failure or an unreachable
node.
3.2.2 Schema Less Database

▪ The schema of a database system refers to the design of a structure
for datasets and the data structures for storing them in the database.
▪ NoSQL data do not necessarily have a fixed table schema.
3.2.3 Increasing Flexibility for Data Manipulation

▪ NoSQL data stores possess the characteristic of increasing
flexibility for data manipulation.
▪ New attributes can be added to the database incrementally.
▪ Late binding of them is also permitted.

BASE Properties

BA stands for basic availability, S stands for soft state and E
stands for eventual consistency.
▪ Basic availability is ensured by distribution of shards (many
partitions of a huge data store) across many data nodes with a
high degree of replication. Then, a segment failure does not
necessarily mean that the complete data store is unavailable.
▪ Soft state ensures processing even in the presence of
inconsistencies, while achieving consistency eventually. A program
suitably takes into account the inconsistency found during
processing. NoSQL database design does not assume the need
for consistency all along the processing time.
▪ Eventual consistency means the consistency requirement in NoSQL
databases is met at some point of time in the future. Data
converge eventually to a consistent state, with no time-frame
specification for achieving that. ACID rules require consistency
all along the processing, on completion of each transaction.
BASE does not have that requirement and has the flexibility.
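
Soft state and eventual consistency can be illustrated with a toy model of two replicas. The sketch assumes a last-write-wins rule based on a version number and a simple synchronization pass; real NoSQL stores use more elaborate mechanisms, and all names here are hypothetical.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (version, value)

    def write(self, key, value, version):
        # Last-write-wins: keep the entry with the highest version number.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]

def sync(replicas):
    """Synchronization pass: each replica receives every other replica's entries."""
    for source in replicas:
        for key, (version, value) in list(source.data.items()):
            for target in replicas:
                target.write(key, value, version)

a, b = Replica("a"), Replica("b")
a.write("balance", 100, version=1)
b.write("balance", 100, version=1)
a.write("balance", 70, version=2)   # the update reaches replica a only

stale = b.read("balance")           # soft state: b still returns the old value
sync([a, b])                        # replicas exchange updates at some later time
converged = a.read("balance") == b.read("balance")
```

Between the write and the `sync` call the two replicas disagree; BASE accepts that window, whereas ACID would not.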
3.3 NOSQL data architecture patterns
3.3.1 Key value store

▪ The simplest way to implement a schema-less data store is to
use key-value pairs.
▪ The data store characteristics are high performance, scalability
and flexibility.
▪ Data retrieval is fast in a key-value pairs data store.
▪ A key-value store uses a primary key for accessing the
values.
▪ Therefore, the store can be easily scaled up for very large data.
Figure 3.4 Key-value pairs architectural pattern and example of students'
database as key-value pairs
Advantages of a key-value store are as follows:
▪ The data store can store any data type in a value field.
The key-value system stores the information as a BLOB of data.
▪ A query just requests the values and returns the values as a
single item. Values can be of any data type.
▪ A key-value store is eventually consistent.
▪ A key-value data store may be hierarchical or may be an ordered
key-value store.
▪ Values returned by queries can be converted into lists,
table columns, data-frame fields and columns.
▪ Key-value stores have (i) scalability, (ii) reliability, (iii)
portability and (iv) low operational cost.
▪ The key can be synthetic or auto-generated.
Limitations of key-value store architectural pattern are:
1. No indexes are maintained on values, thus a subset of values
is not searchable.
2. Key-value store does not provide traditional database
capabilities, such as atomicity of transactions, or consistency
when multiple transactions are executed simultaneously.
3. Maintaining unique values as keys may become more difficult
when the volume of data increases.
4. Queries cannot be performed on individual values.
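
The advantages and limitations above can be seen in a minimal in-memory key-value store. This is a sketch under stated assumptions: the `KeyValueStore` class, the key naming scheme and the sample values are invented for illustration.

```python
import json

class KeyValueStore:
    """Toy key-value store: values are opaque BLOBs (bytes) looked up by key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value: bytes):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)

    def keys(self):
        return list(self._data)

store = KeyValueStore()
store.put("student:101", json.dumps({"name": "Asha", "course": "CS"}).encode())
store.put("student:102", json.dumps({"name": "Ravi", "course": "ISE"}).encode())
store.put("photo:101", b"\x89PNG...")  # any data type fits in the value field

# Fast path: retrieval by primary key returns the value as a single item.
record = json.loads(store.get("student:101"))

# Limitation: there is no index on values, so "all CS students" requires
# the application to fetch and decode every candidate value itself.
cs_students = [
    key for key in store.keys()
    if key.startswith("student:") and json.loads(store.get(key))["course"] == "CS"
]
```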
Traditional relational data model vs. the key-value store model

Traditional relational model | Key-value store model
Result set based on row values | Queries return a single item
Values of rows for large datasets are indexed | No indexes on values
Same data type values in columns | Any data type values
Typical uses of key-value store are:
(i) Image store
(ii) Document or file store
(iii) Lookup table
(iv) Query-cache
Riak

▪ It is an open-source data store written in the Erlang language.
▪ It is a key-value data store system.
▪ Data auto-distribute and replicate in Riak.
▪ It is thus fault tolerant and reliable.
▪ Some other widely used key-value stores among NoSQL DBs are
Amazon's DynamoDB, Redis (often referred to as a data structure
server), Memcached and its flavours, Berkeley DB, upscaledb
(used for embedded databases), Project Voldemort and
Couchbase.
3.3.2 Document store

▪ Characteristics of a document data store are high performance
and flexibility.
Following are the features of a document store:
1. A document store holds unstructured data.
2. Storage has similarity with an object store.
3. Data are stored in nested hierarchies.
4. Querying is easy.
5. No object-relational mapping is needed, which enables easy
search by following paths from the root of the document tree.
6. Transactions on the document store exhibit ACID properties.
Typical uses of a document store are: (i) office documents, (ii)
inventory store, (iii) forms data, (iv) document exchange and (v)
document search. Examples of document data stores are
CouchDB and MongoDB.
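
Feature 5, searching by following a path from the document root, can be sketched with nested Python dictionaries. The dotted-path convention, the helper names and the sample documents are assumptions for illustration, not the query API of any particular document store.

```python
def get_path(document, path):
    """Follow a dotted path from the document root; return None if a step is missing."""
    node = document
    for step in path.split("."):
        if isinstance(node, dict) and step in node:
            node = node[step]
        else:
            return None
    return node

def find(collection, path, value):
    """Return the documents whose field at `path` equals `value`."""
    return [doc for doc in collection if get_path(doc, path) == value]

# Hypothetical documents: nested hierarchies, and fields that differ per document.
students = [
    {"_id": 1, "name": "Asha", "address": {"city": "Shivamogga", "pin": "577204"}},
    {"_id": 2, "name": "Ravi", "address": {"city": "Mysuru"}, "email": "ravi@example.com"},
]
matches = find(students, "address.city", "Mysuru")
```

No mapping layer between objects and tables is involved: the document is stored and searched in the same nested shape the application uses.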
CSV File Format

▪ CSV is a data store format for records.
▪ CSV stands for comma-separated values.
▪ CSV does not represent object-oriented databases or
hierarchical data records.
JSON Files

▪ JSON (JavaScript Object Notation) is a language format for
semi-structured data.
▪ JSON represents object-oriented and hierarchical data records,
objects, and resource arrays in JavaScript.
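
The difference between the two formats shows up when the same kind of record is parsed with Python's standard csv and json modules. The sample data below are hypothetical (only the course code 18CS72 comes from this module).

```python
import csv
import io
import json

# CSV: flat records, one row per line, every value read back as a string.
csv_text = "id,name,course\n1,Asha,CS\n2,Ravi,ISE\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# JSON: nested objects and arrays, with typed values.
json_text = (
    '{"id": 1, "name": "Asha", '
    '"courses": [{"code": "18CS72", "marks": {"cie": 35, "see": 52}}]}'
)
doc = json.loads(json_text)
see_marks = doc["courses"][0]["marks"]["see"]
```

Representing the nested `courses` array in CSV would need either repeated rows or a second file, which is the limitation noted above.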
Document JSON format CouchDB Database

Apache CouchDB is an open-source database. Its features are:

▪ CouchDB provides mapping functions for querying, combining
and filtering of information.
▪ CouchDB deploys the JSON data store model for documents. Each
document maintains separate data and metadata (schema).
▪ CouchDB is a multi-master application. A write does not require field
locking when controlling the concurrency during multi-master
operation.
▪ The CouchDB querying language is JavaScript.
▪ CouchDB indices can be queried using a web browser.
▪ CouchDB data replication is the distribution model that results in
fault tolerance and reliability.
XML

▪ An extensible, simple and scalable language. Its self-describing
format describes structure and contents in an easy to
understand format.
▪ XML is widely used. The document model consists of a root
element and its sub-elements. The XML document model has
features of object-oriented records.
▪ The XML format finds wide use in data stores, and the XML
document model has a hierarchical structure.
3.3.3 Tabular data

Tabular data stores use rows and columns. A row-head field may
be used as a key which accesses and retrieves multiple values from
the successive columns in that row. OLTP is fast on in-memory
row-format data.
[Link] Columnar family store

Columnar Data Store - A way to implement a schema is the
division into columns. The successive values of each column are
stored at successive memory addresses. A pair of row-head and
column-head is a key-pair. The pair accesses a field in the table.
Column-Family Data Store - A column-family data store has a group of
columns as a column family. A combination of row-head, column-family
head and table column head can also be a key to access a
field in a column of the table during querying.
▪ Sparse Column Fields - A row may be associated with a large number
of columns but contain values in only a few column fields. Similarly,
many column fields may not have data. Most elements of such a sparse
matrix are empty.
▪ Grouping of Column Families - Two or more column-families in a data
store form a super group, called a super column.
▪ Grouping into Rows - When the number of rows is very large,
horizontal partitioning of the table is a necessity. Each partition
forms one row-group.
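
A small Python sketch can contrast the row layout with the columnar layout and show key-based access in a column-family store. The table contents, the family names and the `get_cell` helper are assumptions made for illustration.

```python
# Row-oriented layout: one record per row.
rows = [
    {"id": 1, "name": "Asha", "marks": 81},
    {"id": 2, "name": "Ravi", "marks": 67},
    {"id": 3, "name": "Meena", "marks": 92},
]

# Columnar layout: the successive values of each column are stored together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An aggregate over one column touches only that column's values.
average = sum(columns["marks"]) / len(columns["marks"])

# Column-family store: the key is (row-head, column-family head, column head).
# Cells without data are simply absent, which suits sparse column fields.
table = {
    ("s1", "personal", "name"): "Asha",
    ("s1", "academic", "marks"): 81,
    ("s2", "personal", "name"): "Ravi",
}

def get_cell(table, row, family, column):
    """Return the field addressed by the three-part key, or None if empty."""
    return table.get((row, family, column))
```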
Characteristics of Columnar Family Data Store

▪ Scalability
▪ Partitionability
▪ Availability
▪ Tree-like columnar structure
▪ Adding new data at ease
▪ Querying all field values
▪ Replication of columns
▪ No optimization for join
[Link] Big Table Data Store

Examples of widely used column-family data stores are Google's
BigTable, HBase and Cassandra. Keys for row key, column key,
timestamp and attribute uniquely identify the values in the fields.
Following are features of BigTable:
▪ Massively scalable NoSQL. BigTable scales up to 100s of petabytes.
▪ Integrates easily with Hadoop and Hadoop compatible systems.
▪ Compatibility with MapReduce and HBase APIs, which are open-source
Big Data platforms.
▪ The key for a field uses not only row_ID and Column_ID, but also
timestamp and attributes. Values are ordered bytes. Therefore,
multiple versions of values may be present in the BigTable.
▪ Handles millions of operations per second.
▪ Handles large workloads with consistently low latency and high
throughput.
▪ APIs include security and permissions.
▪ BigTable, being Google's cloud service, has global availability
and its service is seamless.
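
The role of the timestamp in the key can be illustrated with a toy versioned table. This is not BigTable's API: the `VersionedTable` class, its methods and the integer timestamps are assumptions for the sketch.

```python
class VersionedTable:
    """Toy table whose cells are addressed by (row_ID, Column_ID, timestamp)."""

    def __init__(self):
        self._cells = {}  # (row_id, column_id) -> {timestamp: value}

    def put(self, row_id, column_id, timestamp, value):
        self._cells.setdefault((row_id, column_id), {})[timestamp] = value

    def get(self, row_id, column_id, timestamp=None):
        """Return the latest version, or the latest version at or before `timestamp`."""
        versions = self._cells.get((row_id, column_id), {})
        if timestamp is not None:
            versions = {t: v for t, v in versions.items() if t <= timestamp}
        if not versions:
            return None
        return versions[max(versions)]

table = VersionedTable()
table.put("row1", "marks", timestamp=10, value=65)
table.put("row1", "marks", timestamp=20, value=72)

latest = table.get("row1", "marks")
as_of_15 = table.get("row1", "marks", timestamp=15)
before_any = table.get("row1", "marks", timestamp=5)
```

Because each write adds a new timestamped entry instead of overwriting, multiple versions of a value coexist, as the feature list above states.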
[Link] RC File Format

Hive uses Record Columnar (RC) file-format records for querying.
RC is the best choice for intermediate tables for a fast
column-family store in HDFS with Hive. Serializability of RC table
column data is the advantage. An RC file is deserializable into
column data.
[Link] ORC File Format

▪ An ORC (Optimized Row Columnar) file consists of row-group
data called stripes. ORC enables concurrent reads of the same
file using separate RecordReaders.
▪ ORC is an intelligent Big Data file format for HDFS and Hive.
▪ An ORC file stores a collection of rows as a row-group. Each
row-group's data are stored in columnar format.
[Link] Parquet File Formats

▪ Parquet is a nested hierarchical columnar-storage concept.
▪ The nesting sequence is the table, row group, column chunk and
chunk page.
▪ An Apache Parquet file is a columnar-family store file.
▪ Apache Spark SQL executes user defined functions (UDFs)
which query the Parquet file columns.
▪ A Parquet file uses an HDFS block. The block stores the file for
processing queries on Big Data.
▪ The Parquet file consists of row groups.
▪ The column data of a row group are processed in memory after the
data are cached and buffered in memory from the disk.
▪ Each row group has a number of columns.
▪ A column chunk can be divided into pages and thus consists of
one or more interleaved pages.
▪ A page is a conceptualized unit which can be compressed or
encoded together at an instance.

Combination of keys for content page in the Parquet file format


3.3.4 Object Data Store

An object store refers to a repository which stores the:

1. Objects (such as files, images, documents, folders, and
business reports)
2. System metadata, which provides information such as filename,
creation_date, last_modified, language_used (such as Java, C,
C#, C++, Smalltalk, Python), access_permissions and supported
query languages
3. Custom metadata, which provides information such as subject,
category and sharing permissions.
▪ Metadata enables the gathering of metrics of objects, searches,
finds the contents and specifies the objects in an object
data-store tree.
▪ Metadata finds the relationships among the objects, and maps the
object relations and trends.
▪ Object store metadata interfaces with the Big Data.
▪ An API first mines the metadata to enable mining of the trends and
analytics.
▪ The metadata defines classes and properties of the objects.
Each object store may consist of a database.
▪ Document content can be stored in either the object store
database storage area or in a file storage area.
[Link] Object Relational Mapping
3.3.5 Graph Data Base

▪ One way to implement a data store is to use a graph database.
▪ A characteristic of a graph is high flexibility. Any number of nodes
and any number of edges can be added to expand a graph.
▪ The complexity is high and the performance is variable with
scalability.
▪ Data are stored as a series of interconnected nodes. A graph with
interconnected data nodes provides one of the best database systems
when relationships and relationship types have critical values.
▪ Nodes represent entities or objects. Edges encode relationships
between nodes.
Characteristics of graph databases are:
1. Use specialized query languages, such as RDF uses SPARQL
2. Create a database system which models the data in a
completely different way than the key-values, document,
columnar and object data store models.
3. Can have hyper-edges. A hyper-edge is a set of vertices of a
hypergraph. A hypergraph is a generalization of a graph in which
an edge can join any number of vertices (not only the
neighbouring vertices).
4. Consists of a collection of small data size records, which have
complex interactions between graph-nodes and hypergraph
nodes. Nodes represent the entities or objects. Nodes use Joins.
▪ Graph databases have poor scalability. They are difficult to scale
out on multiple servers. This is due to the close connectivity
of each node in the graph.
▪ Data can be replicated on multiple servers to enhance read and
query processing performance.
▪ Write operations to multiple servers, and graph queries that
span multiple nodes, can be complex to implement.

Examples of graph DBs are Neo4J, AllegroGraph, HyperGraph,
Infinite Graph, Titan and FlockDB.
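
The node and edge model can be sketched with an adjacency list in Python. This is a minimal illustration with hypothetical node names and relationship types; it does not reflect the query language of any of the graph DBs listed above.

```python
from collections import defaultdict, deque

class Graph:
    """Toy graph store: nodes are entities, edges carry a relationship type."""

    def __init__(self):
        self.edges = defaultdict(list)  # node -> [(relationship, neighbour)]

    def add_edge(self, source, relationship, target):
        self.edges[source].append((relationship, target))

    def neighbours(self, node, relationship=None):
        return [
            target for rel, target in self.edges.get(node, [])
            if relationship is None or rel == relationship
        ]

    def reachable(self, start, relationship=None):
        """Breadth-first traversal along edges of the given relationship type."""
        seen, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nxt in self.neighbours(node, relationship):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        seen.discard(start)
        return seen

g = Graph()
g.add_edge("Asha", "FRIEND", "Ravi")
g.add_edge("Ravi", "FRIEND", "Meena")
g.add_edge("Asha", "ENROLLED_IN", "18CS72")

friends_of_friends = g.reachable("Asha", "FRIEND")
```

The traversal follows relationships directly from node to node, which is why such queries become hard to distribute once connected nodes live on different servers.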
3.4 NO SQL to Manage Big Data
[Link] NoSQL solutions for Big Data

Characteristics of a Big Data NoSQL solution are:

1. High and easy scalability: NoSQL data stores are designed to
expand horizontally. Horizontal scaling means scaling out by
adding more machines as data nodes (servers) into the pool of
resources (processing, memory, network connections).
2. Support for replication: Multiple copies of data are stored across
multiple nodes of a cluster. This ensures high availability,
partition tolerance, reliability and fault tolerance.
3. Distributable: Big Data solutions permit sharding and
distribution of shards on multiple clusters, which enhances
performance and throughput.
4. Use of NoSQL servers which are less expensive: NoSQL
data stores require less management effort. They support many
features that make database administrator (DBA)
and tuning requirements less stringent, like automatic repair,
easier data distribution and simpler
5. Use of open-source tools: NoSQL data stores are cheap and
open source. Database implementation is easy and typically uses
cheap servers to manage the exploding data and transactions,
while RDBMS databases are expensive and use big servers and
storage systems.
6. Support for a schema-less data model: A NoSQL data store is
schema-less, so data can be inserted in a NoSQL data store
without any predefined schema. So, the format or data model
can be changed at any time, without disruption of the application.
7. Support for integrated caching: NoSQL data stores support
caching in system memory. That increases output performance.
An SQL database needs a separate infrastructure for that.
8. No inflexibility: Unlike SQL/RDBMS, NoSQL DBs are flexible
(not rigid) and have no fixed structured way of storing and
manipulating data. SQL stores data in the form of tables consisting
of rows and columns. NoSQL data stores have flexibility in following
ACID rules.
[Link] Types of Big Data Problems

Big Data problems arise due to limitations of NoSQL and other DBs.
1. Big Data need scalable storage and the use of distributed servers
together as a cluster. Therefore, the solutions must drop support for
database joins.
2. NoSQL databases are open source, and that is their greatest strength
but at the same time also their greatest weakness, because there are not
many defined standards for NoSQL data stores. Hence, no two NoSQL
data stores are equal. For example:
(i) No stored procedures in MongoDB (NoSQL data store)
(ii) GUI mode tools to access the data store are not available in the market
(iii) Lack of standardization
(iv) NoSQL data stores sacrifice ACID compliance for flexibility and
processing speed.
Comparison of NOSQL/RDBMS
Feature | NoSQL Data Store | SQL/RDBMS
Model | Schema-less model | Relational
Schema | Dynamic schema | Predefined
Types of data architecture patterns | Key/value based, column-family based, document based, graph based, object based | Table based
Scalable | Horizontally scalable | Vertically scalable
Use of SQL | No | Yes
Dataset size preference | Prefers large datasets | Large dataset not preferred
Consistency | Variable | Strong
Vendor support | Open source | Strong
ACID properties | May not support, instead follows Brewer's CAP theorem or BASE properties | Strictly follows
3.5 Shared-nothing Architecture For Big Data Tasks
▪ In an RDBMS, the columns of two tables relate by a relationship. A
relational algebraic equation specifies the relation.
▪ Keys are shared between two or more SQL tables in an RDBMS.
▪ Shared nothing (SN) is a cluster architecture. A node does not
share data with any other node.
▪ The features of SN architecture are as follows:
1. Independence: Each node shares no memory; thus it possesses
computational self-sufficiency
2. Self-healing: A link failure causes creation of another link
3. Each node functioning as a shard: Each node stores a shard (a partition of
a large DB)
4. No network contention
3.5.1 Choosing the Distribution Models

▪ Big Data requires distribution on multiple data nodes at
clusters. Distributed software components give the advantage of
parallel processing, thus providing horizontal scalability.
▪ Distribution gives (i) the ability to handle large-sized data, and (ii)
processing of many read and write operations simultaneously in
an application.
▪ Distribution increases the availability when a network slows or
a link fails.
Four models for distribution of the data store are given below:
Single Server Model

▪ The simplest distribution option for NoSQL data store and access is
Single Server Distribution (SSD) of an application.
▪ A graph database processes the relationships between nodes at
a server.
▪ The SSD model suits graph DBs well. It also suits aggregates of
datasets, which may be key-value, column-family or BigTable data
stores, that require sequential processing.
Sharding Very Large Databases

• Sharding provides horizontal scalability.
• A data store may add an auto-sharding feature.
• The performance improves in the SN.
• However, in case of a link failure with the application, the
application can migrate the shard DB to another node.
Master Slave Distribution
▪ The master directs the slaves. Data replicate on multiple slave
servers in the Master Slave Distribution (MSD) model.
▪ When a process updates the master, it updates the slaves also. A process
uses the slaves for read operations.
▪ Processing performance improves when a process runs on large datasets
distributed onto the slave nodes.
Master-Slave Replication
▪ Processing performance decreases due to replication in the MSD
distribution model.
▪ Resilience for read operations is high, which means that if data are not
available from a slave node, they are available from the replicated
nodes.
▪ The master uses distinct write and read paths.
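
The distinct write and read paths of the MSD model can be sketched as follows. The sketch assumes synchronous replication and round-robin reads for simplicity; the class and node names are hypothetical.

```python
import itertools

class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}

class MasterSlaveStore:
    """Toy MSD model: writes go to the master, reads are served by slaves."""

    def __init__(self, slave_count):
        self.master = Node("master")
        self.slaves = [Node(f"slave-{i}") for i in range(slave_count)]
        self._next_slave = itertools.cycle(self.slaves)

    def write(self, key, value):
        # Write path: update the master, which then updates every slave.
        self.master.data[key] = value
        for slave in self.slaves:
            slave.data[key] = value

    def read(self, key):
        # Read path: pick the next slave in round-robin order.
        slave = next(self._next_slave)
        return slave.name, slave.data.get(key)

store = MasterSlaveStore(slave_count=2)
store.write("student:101", "Asha")
first = store.read("student:101")
second = store.read("student:101")
```

Each write costs one update per slave, which is the replication overhead noted above, while reads spread across the slaves and survive the loss of any one of them.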
Peer-to-Peer Distribution Model

The Peer-to-Peer distribution (PPD) model and replication show the
following characteristics:
(1) All replication nodes accept read requests and send the responses.
(2) All replicas function equally.
(3) Node failures do not cause loss of write capability, as another
replicated node responds.
▪ Cassandra adopts the PPD model.
▪ The data distribute among all the nodes in a cluster.
▪ Performance can be further enhanced by adding nodes. Since
nodes both read and write, a replicated node also has updated data.
Shards replicating on the nodes, which does read and write operations both
3.5.2 Ways of Handling Big Data Problems
3.6 MONGODB DATABASE
▪ MongoDB is an open source DBMS.
▪ MongoDB manages the collection and document data store.
▪ MongoDB functions do querying and accessing the required
information. The functions include viewing, querying, changing,
visualizing and running the transactions. Changing includes
updating, inserting, appending or deleting.
▪ MongoDB is (i) non-relational, (ii) NoSQL, (iii) distributed, (iv)
open source, (v) document based (vi) cross-platform, (vii)
Scalable, (viii) flexible data model, (ix) Indexed, (x) multi-
master and (xi) fault tolerant.
Features of MongoDB

▪ MongoDB data store - is a physical container for collections. Each DB
gets its own set of files on the file system. The database server of
MongoDB is mongod and the client is mongo.
▪ Collection - stores a number of MongoDB documents. It is analogous
to a table of an RDBMS.
▪ Document model - is well defined. The structure of a document is clear.
A document is the unit of storing data in a MongoDB database.
▪ Documents use the JSON (JavaScript Object Notation) approach for
storing data.
▪ Storing of data is flexible - which implies that the fields can vary
from document to document and the data structure can be changed
over time.
▪ Storing of documents - on disk is in BSON serialization format.
BSON is a binary representation of JSON documents. The
mongo JavaScript shell and MongoDB language drivers perform
translation between BSON and the language-specific document
representation.
▪ Querying, indexing, and real time aggregation - allow
accessing and analyzing the data efficiently.
▪ Deep query-ability - supports dynamic queries on documents
using a document-based query language that's nearly as
powerful as SQL.
▪ No complex joins.
▪ Distributed DB - makes availability high, and provides horizontal
scalability.
▪ Indexes on any field - Users can create indexes on any field in a
document. Indices support queries and operations. By default,
MongoDB creates an index on the _id field of every collection.
▪ Atomic operations on a single document - can be performed. Older
MongoDB versions did not support multi-document transactions
(support was added in version 4.0), so these operations served as the
alternative to the ACID transaction requirement of a relational DB.
▪ Fast in-place updates: The DB does not have to allocate a new
memory location and write a full new copy of the object in case of
data updates. This results in high performance for frequent
update use cases.
▪ No configurable cache: MongoDB uses all free memory on the
system automatically by way of memory-mapped files. The
most recently used data are kept in RAM. If indexes are created
for queries and the working dataset fits in RAM, MongoDB
serves all queries from memory.
▪ Conversion/mapping - of application objects to data store
objects is not needed.
Dynamic Schema - Dynamic schema implies that documents in the
same collection do not need to have the same set of fields or
structure. Also, the similar fields in a document may contain
different types of data
Replication - Replication ensures high availability in Big Data.
Keeping multiple copies of the data on different database
servers makes the DB fault tolerant against the failure of any
one database server. MongoDB replicates with the help of a replica set.
A replica set in MongoDB is a group of mongod (MongoDB server)
processes that store the same dataset. Replica sets provide
redundancy and high availability.
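The idea of a replica set can be sketched with a toy model. The `ReplicaSet` class below is hypothetical and greatly simplified: it copies every write to all live members immediately and promotes the alphabetically first survivor on failure, whereas a real MongoDB replica set replicates asynchronously and holds an election.

```python
class ReplicaSet:
    """Toy replica set: one primary, the rest secondaries (illustrative only)."""

    def __init__(self, member_names):
        self.data = {name: {} for name in member_names}  # one dataset copy per member
        self.alive = set(member_names)
        self.primary = member_names[0]

    def write(self, key, value):
        # The write is applied on the primary, then copied to each live secondary.
        self.data[self.primary][key] = value
        for name in self.alive - {self.primary}:
            self.data[name][key] = value

    def fail(self, name):
        # Mark a member as down; if it was the primary, promote a survivor.
        self.alive.discard(name)
        if name == self.primary and self.alive:
            self.primary = sorted(self.alive)[0]

    def read(self, key):
        return self.data[self.primary].get(key)


rs = ReplicaSet(["node-a", "node-b", "node-c"])
rs.write("sku", "lego-42")
rs.fail("node-a")            # the original primary goes down
value_after_failure = rs.read("sku")
```

Because every member stores the same dataset, the read still succeeds after the original primary fails.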
Auto-sharding - Sharding is a method for distributing data across
multiple machines in a distributed application environment.
MongoDB uses sharding to provide services to Big Data
applications. Sharding automatically balances the data and load
across various servers.
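A minimal sketch of how a shard key can spread documents across servers, assuming a hashed shard key. The helper names `shard_for` and `distribute` are hypothetical, and MD5 is used here only as a convenient deterministic hash, not because MongoDB is known to use it in this way.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a shard-key value to a shard number with a deterministic hash."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def distribute(docs, shard_key, num_shards):
    """Place each document in the shard chosen by hashing its shard-key value."""
    shards = [[] for _ in range(num_shards)]
    for doc in docs:
        shards[shard_for(doc[shard_key], num_shards)].append(doc)
    return shards


docs = [{"_id": i, "name": f"set-{i}"} for i in range(1000)]
shards = distribute(docs, "_id", 4)
sizes = [len(s) for s in shards]  # roughly even, since the hash spreads the keys
```

Hashing gives an approximately even load without any manual placement; the trade-off is that range queries on the shard key must consult several shards.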
Rich Queries and Other DB Functionalities - MongoDB offers a rich set
of features and functionality compared to those offered in simple
key-value stores. MongoDB has a complete query language,
highly-functional secondary indexes and a powerful aggregation
framework for data analysis
MongoDB querying commands

▪ To create a database
use lego
This switches to a database named lego; MongoDB creates it when
data is first stored in it.
The default database in MongoDB is test.
▪ To check the currently selected database
db
▪ To get list of all the databases
show dbs
▪ To drop a database
use lego
db.dropDatabase()
▪ To create a collection
To create a collection, the easiest way is to insert a record
(a document consisting of keys (Field names) and Values) into a
collection

▪ To view all documents in a collection
db.<collection name>.find()
[Link]()
▪ To update a document
db.<collection name>.update()
▪ To delete a document
db.<collection name>.remove()
▪ To add array in a collection
3.7 Cassandra Database
▪ Cassandra was developed by Facebook and released by Apache.
▪ IBM also released the enhancement of Cassandra
▪ Cassandra is basically a column family database that stores and
handles massive data of any format including structured, semi-
structured and unstructured data.
▪ Cassandra provides functions (commands) for querying the data
and accessing the required information. Functions do the
viewing, querying and changing (update, insert or append or
delete), visualizing and perform transactions on the DB.
▪ Apache Cassandra has the distributed design of Dynamo.
Cassandra is written in Java.
Characteristics of Cassandra are (i) open source, (ii) scalable, (iii)
nonrelational, (iv) NoSQL, (v) distributed, (vi) column based, (vii)
decentralized, (viii) fault tolerant and (ix) tuneable consistency.
Features of Cassandra are as follows:
1. Maximizes the number of writes - writes are not very expensive
2. Maximizes data duplication
3. Does not support Joins, group by, OR clause and aggregations
4. Uses Classes consisting of ordered keys and semi-structured data storage
systems
5. Is fast and easily scalable with write operations spread across the cluster.
The cluster does not have a master-node, so any read and write can be
handled by any node in the cluster.
6. Is a distributed DBMS designed for handling a high volume of structured
data across multiple cloud servers
▪ Data Replication - Cassandra stores data on multiple nodes (data
replication) and thus has no single point of failure, and ensures
availability, a requirement in CAP theorem
▪ Components of Cassandra
✓ Node - Place where data is stored for processing
✓ Data Center - Collection of many related nodes
✓ Cluster - Collection of many data centers
✓ Commit log - Used for crash recovery; each write operation is written
to the commit log
✓ Mem-table - Memory-resident data structure; after data is written to
the commit log, it is written to the mem-table temporarily
✓ SSTable - When the mem-table reaches a certain threshold, data is
flushed into an SSTable disk file
✓ Bloom filter - Fast and memory-efficient probabilistic data structure
to find whether an element is present in a set. Bloom
filters are accessed after every query.
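The write path described by these components (commit log, then mem-table, then a flush to an SSTable guarded by a Bloom filter) can be sketched as follows. The `Node` and `BloomFilter` classes, the flush threshold of two rows, and the filter size are all assumptions chosen to keep the example small; they are not Cassandra's actual implementation.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: may report false positives, never false negatives."""

    def __init__(self, size=256, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = [False] * size

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


class Node:
    def __init__(self, memtable_limit=2):
        self.commit_log = []      # append-only record of every write
        self.memtable = {}        # in-memory rows awaiting a flush
        self.sstables = []        # flushed (rows, bloom filter) pairs
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. log first, for crash recovery
        self.memtable[key] = value             # 2. then the mem-table
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                       # 3. flush once the threshold is hit

    def flush(self):
        bloom = BloomFilter()
        for key in self.memtable:
            bloom.add(key)
        self.sstables.append((dict(sorted(self.memtable.items())), bloom))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for rows, bloom in reversed(self.sstables):   # newest SSTable first
            if bloom.might_contain(key) and key in rows:
                return rows[key]
        return None


node = Node()
for k, v in [("a", 1), ("b", 2), ("c", 3)]:
    node.write(k, v)
```

In this sketch the Bloom filter lets a read skip any SSTable that definitely does not hold the key, which is why a compact probabilistic structure is worth keeping in memory.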
▪ Scalability - Cassandra provides linear scalability, which increases
the throughput and decreases the response time as the number of
nodes in the cluster increases.
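▪ Transaction Support - Provides atomic, isolated and durable writes
with tuneable consistency, rather than the full ACID (Atomicity,
Consistency, Isolation, and Durability) transactions of a relational DB.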
▪ Replication Option - Specifies either of the two replica placement
strategy names, SimpleStrategy or NetworkTopologyStrategy. The
replica placement strategies are:
✓ Simple Strategy: Specifies simply a replication factor for the cluster.
✓ Network Topology Strategy: Allows setting the replication factor for
each data center independently.
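The difference between the two placement strategies can be sketched on a toy ring of nodes. Both functions below are simplified assumptions: they walk the ring clockwise from a given start position and ignore racks, which the real Network Topology Strategy also takes into account. The node and data center names are hypothetical.

```python
def simple_strategy(ring, start, replication_factor):
    """One replication factor for the whole cluster: take consecutive nodes."""
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)] for i in range(count)]


def network_topology_strategy(ring, start, factors):
    """Per-data-center factors: walk the ring until each quota is filled."""
    chosen = []
    counts = {dc: 0 for dc in factors}
    for i in range(len(ring)):
        name, dc = ring[(start + i) % len(ring)]
        if counts.get(dc, 0) < factors.get(dc, 0):
            chosen.append((name, dc))
            counts[dc] += 1
    return chosen


# Ring of (node name, data center) pairs in clockwise order.
ring = [("n1", "dc1"), ("n2", "dc2"), ("n3", "dc1"),
        ("n4", "dc2"), ("n5", "dc1"), ("n6", "dc2")]

simple = simple_strategy(ring, 4, 3)
per_dc = network_topology_strategy(ring, 0, {"dc1": 2, "dc2": 1})
```

The simple variant knows nothing about data centers, so replicas land wherever the ring order puts them, while the topology-aware variant guarantees the requested number of copies in each data center.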
Data types
Cassandra Data Model

Cassandra Data model is based on Google's BigTable. Each value maps
with two strings (row key, column key) and timestamp, similar to
HBase.
▪ The database can be considered as a sparse distributed multi-
dimensional sorted map.
▪ Google file system splits the table into multiple tablets (segments of
the table) along a row.
▪ Each tablet, called a META1 tablet, has a maximum size of 200 MB,
above which a compression algorithm is used.
▪ META0 is the master-server. Querying by META0 server retrieves a
META1 tablet.
▪ During execution of the application, caching of locations of tablets
reduces the number of queries.
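The "sparse multi-dimensional sorted map" view can be sketched with a Python dictionary keyed by (row key, column key, timestamp). The helper names `put` and `get_latest` and the sample values are hypothetical; the sketch ignores distribution and tablets entirely.

```python
# (row key, column key, timestamp) -> value
table = {}


def put(row, column, timestamp, value):
    """Store one timestamped version of a cell."""
    table[(row, column, timestamp)] = value


def get_latest(row, column):
    """Return the value with the highest timestamp for a cell, or None."""
    versions = [(ts, v) for (r, c, ts), v in table.items() if r == row and c == column]
    return max(versions, key=lambda tv: tv[0])[1] if versions else None


put("user1", "email", 1, "a@example.com")
put("user1", "email", 2, "b@example.com")   # newer version of the same cell
put("user2", "name", 1, "Ann")              # user2 has no "email" column: the map is sparse

sorted_keys = sorted(table)                 # keys ordered by row, then column, then timestamp
latest_email = get_latest("user1", "email")
```

Only cells that were actually written occupy space, and sorting the keys keeps all versions of one row next to each other.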
Cassandra Data Model consists of four main components:
▪ Cluster: Made up of multiple nodes and keyspaces,
▪ Keyspace: a namespace to group multiple column families,
especially one per partition,
▪ Column: consists of a column name, value and timestamp
▪ Column family: multiple columns with row key reference.
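How the four components nest can be sketched with a few Python dataclasses. The class layout and the sample names (`shop`, `toys`) are assumptions made for illustration; they mirror the list above rather than any Cassandra API.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str
    value: object
    timestamp: int


@dataclass
class ColumnFamily:
    name: str
    rows: dict = field(default_factory=dict)   # row key -> {column name -> Column}

    def insert(self, row_key, column):
        self.rows.setdefault(row_key, {})[column.name] = column


@dataclass
class Keyspace:
    name: str
    column_families: dict = field(default_factory=dict)


@dataclass
class Cluster:
    nodes: list
    keyspaces: dict = field(default_factory=dict)


cluster = Cluster(nodes=["node1", "node2"])
keyspace = Keyspace("shop")
cluster.keyspaces[keyspace.name] = keyspace

toys = ColumnFamily("toys")
keyspace.column_families[toys.name] = toys

toys.insert("row1", Column("name", "Castle", 1))
toys.insert("row1", Column("pieces", 1200, 1))
toys.insert("row2", Column("name", "Racer", 2))   # rows need not share the same columns
```

Reading a value then follows the nesting from the outside in: cluster, keyspace, column family, row key, column.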

Keyspaces

A keyspace in NoSQL data store is an object that contains all
column families of a design as a bundle. Keyspace is the
outermost grouping of data in the data store.
Cassandra Query Language (CQL)
