Nosql Qbsol Ia-02

Module-03

1. Explain the basic Map-Reduce taking suitable examples

Ans:

 Map-Reduce is a programming model used for processing and generating large datasets.
 It divides the task into two main steps
o Map: Processes input data and transforms it into intermediate key-value pairs.
o Reduce: Combines the intermediate key-value pairs with the same key.
 Basic Idea: Let's assume we have chosen orders as our aggregate, with each order having
line items. Each line item has a product ID, quantity, and the price charged. This aggregate
makes a lot of sense as usually people want to see the whole order in one access. We
have lots of orders, so we've sharded the dataset over many machines.
 However, sales analysis people want to see a product and its total revenue for the last seven days. This report doesn't fit the aggregate structure that we have, which is the downside of using aggregates. In order to get the product revenue report, you'll have to visit every machine in the cluster and examine many records on each machine.
 The first stage in a map-reduce job is the map.
 A map is a function whose input is a single aggregate and whose output is a bunch of key-value pairs. In this case, the input would be an order.
 The output would be key-value pairs corresponding to the line items. Each one would have the product ID as the key and an embedded map with the quantity and price as the values.
 Each application of the map function is independent of all the others. This allows them to be safely parallelized, so that a map-reduce framework can create efficient map tasks on each node and freely allocate each order to a map task. This yields a great deal of parallelism and locality of data access.
 A map operation only operates on a single record; the reduce function takes multiple map outputs with the same key and combines their values. So, a map function might yield 1000 line items from orders for “Database Refactoring”; the reduce function would reduce these down to one record, with the totals for the quantity and revenue. While the map function is limited to working only on data from a single aggregate, the reduce function can use all values emitted for a single key.
 The map-reduce framework arranges for map tasks to be run on the correct nodes to process all the documents and for data to be moved to the reduce function. A minimal sketch of this flow follows.
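A minimal, framework-free Java sketch (Java 16+) of this order example: map() turns one order aggregate into key-value pairs keyed by product ID, and reduce() totals all values that share a key. The record names and figures are illustrative assumptions, not part of any specific product's API.

    import java.util.*;

    public class BasicMapReduceSketch {

        record LineItem(String productId, int quantity, double price) {}
        record Total(int quantity, double revenue) {}

        // Map: one order (a list of line items) in, one key-value pair per product out.
        static Map<String, Total> map(List<LineItem> order) {
            Map<String, Total> out = new HashMap<>();
            for (LineItem li : order)
                out.merge(li.productId(), new Total(li.quantity(), li.quantity() * li.price()),
                          BasicMapReduceSketch::add);
            return out;
        }

        // Reduce: all values emitted for one key collapse into a single total.
        static Total reduce(List<Total> valuesForOneKey) {
            return valuesForOneKey.stream().reduce(new Total(0, 0), BasicMapReduceSketch::add);
        }

        static Total add(Total a, Total b) {
            return new Total(a.quantity() + b.quantity(), a.revenue() + b.revenue());
        }

        public static void main(String[] args) {
            List<List<LineItem>> orders = List.of(
                List.of(new LineItem("database-refactoring", 1, 40.0),
                        new LineItem("nosql-distilled", 2, 25.0)),
                List.of(new LineItem("database-refactoring", 3, 40.0)));

            // Shuffle: group every map output by key, then reduce each group.
            Map<String, List<Total>> byKey = new HashMap<>();
            for (List<LineItem> order : orders)
                map(order).forEach((k, v) -> byKey.computeIfAbsent(k, x -> new ArrayList<>()).add(v));
            byKey.forEach((product, totals) -> System.out.println(product + " -> " + reduce(totals)));
        }
    }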


2. Explain partitioning and combining in map reduce by taking suitable examples

Ans:

 Partitioning
o When the mapper output is collected, it is partitioned, which means it is written to the output location chosen by the partitioner.
o Partitioning is responsible for dividing up the intermediate key space and assigning intermediate key-value pairs to reducers.
o It assigns approximately the same number of keys to each reducer.
o The first thing we can do is increase parallelism by partitioning the output of the
mappers. Each reduce function operates on the results of a single key.
o This is a limitation—it means you can’t do anything in the reduce that operates
across keys—but it’s also a benefit in that it allows you to run multiple reducers
in parallel.
o To take advantage of this, the results of the mapper are divided up based on the key on each processing node. Typically, multiple keys are grouped together into partitions.
o The framework then takes the data from all the nodes for one partition, combines it into a single group for that partition, and sends it off to a reducer. Multiple reducers can then operate on the partitions in parallel, with the final results merged together. (This step is also called “shuffling,” and the partitions are sometimes referred to as “buckets” or “regions.”)
o Uses
 Partitioners are used to group all the values for a specific key. Values of the same key are sent to the same reducer; this determines which reducer is responsible for which key. A simple hash-style partitioner is sketched below.
 In MapReduce, the number of partitions equals the number of reducers; the output of a single partition is sent to a single reducer, and if there is only one reducer, no partitioning is needed.
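A minimal Java sketch of such a hash-style partitioner (an illustration, not any particular framework's API): each intermediate key is assigned to one of R reducer partitions, so all values for the same key always reach the same reducer.

    public class HashPartitionerSketch {

        // Assign a key to one of numReducers partitions.
        static int partition(String key, int numReducers) {
            // Mask the sign bit so a negative hashCode cannot produce a negative partition.
            return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
        }

        public static void main(String[] args) {
            int reducers = 3;
            for (String productId : new String[] {"database-refactoring", "nosql-distilled", "refactoring"})
                System.out.println(productId + " -> reducer " + partition(productId, reducers));
        }
    }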

 Combining
o A lot of data moves from node to node between the map and reduce stages. Much of this data is repetitive, consisting of multiple key-value pairs for the same key. A combiner function cuts this data down by combining all the data for the same key into a single value. A combiner function is, in essence, a reducer function; indeed, in many cases the same function can be used for combining as for the final reduction. The reduce function needs a special shape for this to work: its output must match its input. We call such a function a combinable reducer.
o Combiners are an optimization in map-reduce that allows local aggregation before the shuffle-and-sort phase (sketched below).
o If a combiner is used, the map output key-value pairs are not immediately written out. Instead, they are collected in lists, one list per key, and combined locally before being shipped to the reducers.
o Not every reducer is combinable. To count how many customers bought a particular product, for example, the map function would need to emit the product and the customer; the reducer can then combine them and count how many times each customer appears for a particular product, emitting the product and the count. But this reducer's output is different from its input, so it can't be used as a combiner. You can still run a combining function here, one that just eliminates duplicate product-customer pairs, but it will be different from the final reducer.
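A minimal Java sketch of local combining (illustrative, no framework API): each map task folds its own output so that only one pair per key leaves the node; because the combiner here is the same summing function the final reducer would use, the reducer is combinable.

    import java.util.*;

    public class CombinerSketch {

        // Combiner == reducer here: sum the values for each key.
        static Map<String, Integer> combine(List<Map.Entry<String, Integer>> pairs) {
            Map<String, Integer> combined = new HashMap<>();
            for (Map.Entry<String, Integer> e : pairs)
                combined.merge(e.getKey(), e.getValue(), Integer::sum);
            return combined;
        }

        public static void main(String[] args) {
            // Raw map output on one node: many repeated keys.
            List<Map.Entry<String, Integer>> mapOutput = List.of(
                Map.entry("database-refactoring", 1), Map.entry("database-refactoring", 1),
                Map.entry("nosql-distilled", 2), Map.entry("database-refactoring", 3));

            System.out.println("Pairs shipped without a combiner: " + mapOutput.size());
            Map<String, Integer> afterCombine = combine(mapOutput);
            System.out.println("Pairs shipped with a combiner: " + afterCombine.size() + " " + afterCombine);
        }
    }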

3. Describe how Map-Reduce calculations are composed. Explain with a diagram

Ans:

 The map-reduce approach is a way of thinking about concurrent processing that trades off flexibility in how you structure your computation for a relatively straightforward model for parallelizing the computation over a cluster. Since it's a tradeoff, there are constraints on what you can do in your calculations.
 Within a map task, you can only operate on a single aggregate. Within a reduce task, you
can only operate on a single key
 One simple limitation is that you have to structure your calculations around operations
that fit in well with the notion of a reduce operation
 An important property of averages is that they are not composable: if I take two groups of orders, I can't combine their averages alone. Instead, I need to take the total amount and the count of orders from each group, combine those, and then calculate the average from the combined sum and count.
 To make a count, the mapping function will emit count fields with a value of 1, which can be summed to get a total count. A sketch of this sum-and-count composition follows.
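A minimal Java sketch of this composition (record and method names are illustrative): each map emission carries a (sum, count) pair, which composes across groups in a way a finished average does not.

    public class ComposableAverageSketch {

        record SumCount(double sum, long count) {
            SumCount plus(SumCount other) { return new SumCount(sum + other.sum, count + other.count); }
            double average() { return sum / count; }
        }

        // Each map emission carries the order amount and a count field of 1.
        static SumCount mapEmit(double orderAmount) { return new SumCount(orderAmount, 1); }

        public static void main(String[] args) {
            // Two groups of orders reduced independently...
            SumCount groupA = mapEmit(10).plus(mapEmit(30));   // average 20
            SumCount groupB = mapEmit(100);                    // average 100
            // ...can still be combined correctly, unlike averaging 20 and 100 directly.
            SumCount overall = groupA.plus(groupB);
            System.out.println("Overall average = " + overall.average());  // 46.67, not 60
        }
    }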


4. Explain the following with respect to Map Reduce
a. Two stage Map Reduce
b. Incremental Map-Reduce

Ans:

 Two Stage Map Reduce


o Two-stage Map-Reduce refers to chaining two (or more) Map-Reduce jobs such that
the output of one job serves as the input for the next. This approach is useful when
the problem requires more complex computations that cannot be completed in a
single Map-Reduce cycle.
o As map-reduce calculations get more complex, it's useful to break them down into stages using a pipes-and-filters approach, with the output of one stage serving as input to the next, rather like the pipelines in UNIX.
o Consider an example where we want to compare the sales of products for each
month in 2011 to the prior year. To do this, we’ll break the calculations down into two
stages.
 The first stage will produce records showing the aggregate figures for a single
product in a single month of the year.
 The second stage then uses these as inputs and produces the result for a single product by comparing one month's results with the same month in the prior year, as sketched below.
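A compact, framework-free Java sketch of this two-stage idea (record shapes, product names, and figures are illustrative assumptions): stage one reduces raw sales records to totals per product, year, and month; stage two takes stage one's output as its input and pairs each 2011 month with the same month of 2010.

    import java.util.*;

    public class TwoStageSketch {

        record Sale(String product, int year, int month, double amount) {}
        record MonthlyTotal(String product, int year, int month, double total) {}

        // Stage 1: aggregate figures for a single product in a single month.
        static List<MonthlyTotal> stageOne(List<Sale> sales) {
            Map<String, Double> totals = new LinkedHashMap<>();
            for (Sale s : sales)
                totals.merge(s.product() + "|" + s.year() + "|" + s.month(), s.amount(), Double::sum);
            List<MonthlyTotal> out = new ArrayList<>();
            totals.forEach((key, total) -> {
                String[] parts = key.split("\\|");
                out.add(new MonthlyTotal(parts[0], Integer.parseInt(parts[1]), Integer.parseInt(parts[2]), total));
            });
            return out;
        }

        // Stage 2: compare each month of 2011 with the same month in the prior year.
        static void stageTwo(List<MonthlyTotal> monthly) {
            for (MonthlyTotal current : monthly) {
                if (current.year() != 2011) continue;
                monthly.stream()
                       .filter(prior -> prior.year() == 2010 && prior.month() == current.month()
                               && prior.product().equals(current.product()))
                       .findFirst()
                       .ifPresent(prior -> System.out.printf("%s %d/2011: %.2f vs %.2f in 2010%n",
                               current.product(), current.month(), current.total(), prior.total()));
            }
        }

        public static void main(String[] args) {
            stageTwo(stageOne(List.of(
                new Sale("database-refactoring", 2010, 5, 120),
                new Sale("database-refactoring", 2011, 5, 180),
                new Sale("database-refactoring", 2011, 5, 40))));
        }
    }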

 Incremental Map-Reduce
o Incremental Map-Reduce is a variant of Map-Reduce that handles incremental data
updates without requiring the entire dataset to be reprocessed. This is especially
beneficial in scenarios with frequently changing data, such as logs, streams, or real-
time analytics.
o The examples we've discussed so far are complete map-reduce computations,
where we start with raw inputs and create a final output.
o Many map-reduce computations take a while to perform, even with clustered hardware, and new data keeps coming in, which means we need to rerun the computation to keep the output up to date.
o Starting from scratch each time can take too long, so often it's useful to structure a
map-reduce computation to allow incremental updates, so that only the minimum
computation needs to be done.
o The map stages of a map-reduce are easy to handle incrementally: only if the input data changes does the mapper need to be rerun. Since maps are isolated from each other, incremental updates are straightforward.
o The more complex case is the reduce step, since it pulls together the outputs from
many maps and any change in the map outputs could trigger a new reduction.
o This recomputation can be lessened depending on how parallel the reduce step is.
o If we are partitioning the data for reduction, then any partition that's unchanged does
not need to be re-reduced.
o Similarly, if there's a combiner step, it doesn't need to be rerun if its source data hasn't
changed.
o If our reducer is combinable, there are more opportunities for avoiding computation.
o If the changes are additive, that is, if we are only adding new records and are not changing or deleting any old records, then we can just run the reduce with the existing result and the new additions, as sketched below.
o If there are destructive changes, that is, updates and deletes, then we can avoid some recomputation by breaking up the reduce operation into steps and only recalculating those steps whose inputs have changed, essentially using a Dependency Network to organize the computation.
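A minimal Java sketch of the additive case (illustrative figures, no framework API): because the reducer's output has the same shape as its input, the stored result of the previous reduce can simply be reduced together with the newly arrived map outputs instead of re-reducing everything from scratch.

    import java.util.*;

    public class IncrementalReduceSketch {

        // Combinable reducer: sums per-product revenue (output shape == input shape).
        static double reduce(List<Double> revenues) {
            return revenues.stream().mapToDouble(Double::doubleValue).sum();
        }

        public static void main(String[] args) {
            // Result already computed and stored from the previous run.
            double previousResult = reduce(List.of(40.0, 120.0, 80.0));   // 240.0

            // Only the new additions arrive; no old records were changed or deleted.
            List<Double> newMapOutputs = List.of(40.0, 40.0);

            // Incremental update: reduce the old result together with the new data.
            List<Double> incrementalInput = new ArrayList<>(newMapOutputs);
            incrementalInput.add(previousResult);
            System.out.println("Updated total revenue = " + reduce(incrementalInput));  // 320.0
        }
    }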

5. Explain the purpose of using Key-Value stores. Explain how the entire data can be stored in a single bucket of a key-value data store. List some popular Key-Value databases.

Ans:

 A key-value store is a simple hash table, primarily used when all access to the database is via
primary key.
 Think of a table in a traditional RDBMS with two columns, such as ID and NAME.
 The ID column is the key and the NAME column stores the value. In an RDBMS, the NAME column is restricted to storing data of type String.
 If the ID already exists the current value is overwritten, otherwise a new entry is created.

 The three operations performed on a key-value database are:
o Put(Key,Value)
o Get(Key)
o Delete(key)
 Key-value stores are the simplest NoSQL data stores to use from an API perspective.
 The client can either get the value for the key, put a value for a key, or delete a key from the data store.
 The value is a blob that the data store just stores, without caring or knowing what’s inside; it’s
the responsibility of the application to understand what was stored.
 Since key-value stores always use primary-key access, they generally have great performance and can be easily scaled. A minimal in-memory sketch of the three operations follows.
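A minimal in-memory Java sketch of put, get, and delete (an illustration of the semantics, not any particular product's API): the value is an opaque blob that the store neither inspects nor understands.

    import java.util.HashMap;
    import java.util.Map;

    public class KeyValueStoreSketch {

        private final Map<String, byte[]> table = new HashMap<>();

        // put: if the key already exists, the current value is overwritten;
        // otherwise a new entry is created.
        public void put(String key, byte[] value) { table.put(key, value); }

        public byte[] get(String key) { return table.get(key); }

        public void delete(String key) { table.remove(key); }

        public static void main(String[] args) {
            KeyValueStoreSketch store = new KeyValueStoreSketch();
            store.put("session:1234", "{\"cart\":[\"nosql-distilled\"]}".getBytes());
            System.out.println(new String(store.get("session:1234")));
            store.delete("session:1234");
        }
    }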
 Examples of key-value databases:
o Redis
o Riak
o Oracle NoSQL Database
 If we wanted to store user session data, shopping cart information, and user preferences in
Riak, we could just store all of them in the same bucket with a single key and single value for
all of these objects.
 Storing all the data in a single bucket
 Domain Buckets
o We could also create buckets which store specific data. In Riak, they are known as
domain buckets allowing the serialization and deserialization to be handled by the
client driver.
o Using domain buckets or different buckets for different objects (such as UserProfile and ShoppingCart) segments the data across different buckets, allowing you to read only the object you need without having to change the key design.
o For example (Riak Java client):
    Bucket bucket = client.fetchBucket(bucketName).execute();
    DomainBucket<UserProfile> profileBucket =
        DomainBucket.builder(bucket, UserProfile.class).build();

6. Describe the features of Key-Value stores

Ans:

 Key-Value Store Features


o Consistency
o Transactions
o Query Features
o Structure of Data
o Scaling
 Consistency
o Consistency is applicable only for operations on a single key, since these
operations are either a get, put, or delete on a single key.
o Optimistic writes can be performed, but are very expensive to implement,
because a change in value cannot be determined by the data store.
o In distributed key-value store implementations like Riak, the eventually
consistent model of consistency is implemented.
o Since the value may have already been replicated to other nodes, Riak has two ways of resolving update conflicts: either the newest write wins and older writes lose, or both (all) values are returned, allowing the client to resolve the conflict.
o In Riak, these options can be set up during bucket creation. Buckets are just a way to namespace keys so that key collisions can be reduced; for example, all customer keys may reside in the customer bucket.
o When creating a bucket, default values for consistency can be provided, for
example that a write is considered good only when the data is consistent across
all the nodes where the data is stored.
o Implementation (Riak Java client):
    Bucket bucket = connection.createBucket(bucketName)
        .withRetrier(attempts(3))
        .allowSiblings(siblingsAllowed)
        .nVal(numberOfReplicasOfTheData)
        .w(numberOfNodesToRespondToWrite)
        .r(numberOfNodesToRespondToRead)
        .execute();
 If we need the data in every node to be consistent, we can increase numberOfNodesToRespondToWrite, set by w, to be the same as nVal.
 Transactions
o Different products of the key-value store kind have different specifications of transactions. Generally speaking, there are no guarantees on the writes.
o Many data stores do implement transactions in different ways.
o Riak uses the concept of quorum, implemented by using the W value (the write replication factor) during the write API call.
o Assume we have a Riak cluster with a replication factor of 5 and we supply the W
value of 3.
o When writing, the write is reported as successful only when it is written and
reported as a success on at least three of the nodes.
o This allows Riak to have write tolerance; in our example, with N equal to 5 and a W value of 3, the cluster can tolerate N - W = 2 nodes being down for write operations, though we would still have lost some data on those nodes for read.
 Query Features
o All key-value stores can query by the key, and that's about it.
o If you have requirements to query by using some attribute of the value column, it's not possible to use the database: your application needs to read the value to figure out whether the attribute meets the conditions.
o Query by key also has an interesting side effect. What if we don't know the key, especially during ad-hoc querying while debugging? Most of the data stores will not give you a list of all the primary keys; even if they did, retrieving the list of keys and then querying for each value would be very cumbersome.
o Some key-value databases get around this by providing the ability to search inside the value, such as Riak Search, which allows you to query the data much as you would with Lucene indexes.
 Structure of Data
o Key-value databases don't care what is stored in the value part of the key-value pair. The value can be a blob, text, JSON, XML, and so on. In Riak, we can use the Content-Type header in the POST request to specify the data type.
 Scaling
o Many key-value stores scale by using sharding.
o With sharding, the value of the key determines on which node the key is stored.
o Let's assume we are sharding by the first character of the key; if the key is f4b19d79587d, which starts with an f, it will be sent to a different node than the key ad9c7a396542.
o This kind of sharding setup can increase performance as more nodes are added to the cluster, as sketched below.
o Sharding also introduces some problems: if the node used to store keys starting with f goes down, the data stored on that node becomes unavailable, and new data with keys that start with f cannot be written.
o Data stores such as Riak let you control the aspects of the CAP Theorem: N (the number of nodes that store replicas of the key-value pair), R (the number of nodes that must return the data before a read is considered successful), and W (the number of nodes the write has to be written to before it is considered successful).
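A minimal Java sketch of key-based sharding (node names are illustrative; real stores typically use consistent hashing rather than this naive first-character scheme): the key alone decides which node holds the pair.

    public class ShardingSketch {

        static final String[] NODES = {"node-a", "node-b", "node-c"};

        // Route a key to a node using its first character.
        static String nodeFor(String key) {
            return NODES[(key.charAt(0) & 0xFF) % NODES.length];
        }

        public static void main(String[] args) {
            System.out.println("f4b19d79587d -> " + nodeFor("f4b19d79587d"));  // node-a
            System.out.println("ad9c7a396542 -> " + nodeFor("ad9c7a396542"));  // node-b
        }
    }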

7. Describe suitable use cases for Key-Value stores.

Ans:

 Suitable Use Cases


o Storing Session Information
o User Profiles, Preferences
o Shopping Cart Data
 Storing Session Information
o Generally, every web session is unique and is assigned a unique sessionid value.
o Applications that store the sessionid on disk or in an RDBMS will greatly benefit from
moving to a key-value store, since everything about the session can be stored by a
single PUT request or retrieved using GET.
o This single-request operation makes it very fast, as everything about the session is
stored in a single object.
o Solutions such as Memcached are used by many web applications, and Riak can be used when availability is important.
 User Profiles, Preferences
o Almost every user has a unique userId, username, or some other attribute, as well as
preferences such as language, color, timezone, which products the user has access
to, and so on.
o Similarly, product profiles can be stored.
 Shopping Cart Data
o E-commerce websites have shopping carts tied to the user.
o As we want the shopping carts to be available all the time, across browsers,
machines, and sessions, all the shopping information can be put into the value where
the key is the userid.
o A Riak cluster would be best suited for these kinds of applications
 When Not to Use
o Relationships among Data: If you need to have relationships between different sets of data, or to correlate the data between different sets of keys, key-value stores are not the best solution to use, even though some key-value stores provide link-walking features.
o Multioperation Transactions: If you’re saving multiple keys and there is a failure to
save any one of them, and you want to revert or roll back the rest of the operations,
key-value stores are not the best solution to be used.
o Query by Data: If you need to search the keys based on something found in the value
part of the key-value pairs, then key-value stores are not going to perform well for you.
There is no way to inspect the value on the database side, with the exception of some
products like Riak Search or indexing engines like Lucene or Solr.
o Operations by Sets: Since operations are limited to one key at a time, there is no way
to operate upon multiple keys at the same time. If you need to operate upon multiple
keys, you have to handle this from the client side
Module-04

1. Explain the Document data model, taking suitable examples. Differentiate between Oracle and MongoDB

Ans:

 Document Databases
o Documents are the main concept in document databases.
o The database stores and retrieves documents, which can be XML, JSON, BSON,
and so on.
o These documents are self-describing, hierarchical tree data structures which
can consist of maps, collections, and scalar values.
o The documents stored are similar to each other but do not have to be exactly the
same.
o Document databases store documents in the value part of a key-value store; think of document databases as key-value stores where the value is examinable. A sample document is sketched below.
o Compared with Oracle (a relational database), MongoDB has no fixed schema: an Oracle table corresponds to a MongoDB collection, a row to a document, a column to a field, and joins are replaced by embedded documents or references. Documents in the same collection do not all have to share the same structure.
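A minimal sketch of such a self-describing, hierarchical document, built and printed as JSON with the MongoDB Java driver's org.bson.Document class (field names and values are illustrative):

    import org.bson.Document;
    import java.util.List;

    public class SampleDocumentSketch {
        public static void main(String[] args) {
            Document order = new Document("customerId", 1001)
                    .append("orderDate", "2011-05-21")
                    .append("items", List.of(
                            new Document("productId", "nosql-distilled")
                                    .append("qty", 2)
                                    .append("price", 25.0)))
                    .append("shippingAddress", new Document("city", "Bengaluru"));
            System.out.println(order.toJson());   // hierarchical, self-describing JSON
        }
    }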

2. Describe the purpose of using Document databases with suitable examples. Explain the
features of Document databases
3. Explain the following w.r.t Document databases taking suitable examples
(i) Availability
(ii) Scaling

(OR)
4. Explain Consistency and Transaction features of Document data model

Ans:

 The features of document databases are:


o Consistency
o Transactions
o Availability
o Query Features
o Scaling
 Consistency
o Consistency in MongoDB database is configured by using the replica sets and
choosing to wait for the writes to be replicated to all the slaves or a given
number of slaves.
o Every write can specify the number of servers the write has to be propagated to
before it returns as successful.
o Example
 A command like db.runCommand({ getlasterror : 1, w : "majority" }) tells the database how strong a consistency level you want. For example, if you have one server and specify w as majority, the write will return immediately, since there is only one node.
 If you have three nodes in the replica set and specify w as majority,
the write will have to complete at a minimum of two nodes before it is
reported as a success.
 You can increase the w value for stronger consistency, but you will suffer
on write performance, since now the writes have to complete at more
nodes.
 Replica sets also allow you to increase read performance by allowing reads from slaves by setting slaveOk; this parameter can be set on the connection, the database, the collection, or individually for each operation.

o Write consistency
 By default, a write is reported successful once the database receives it; you can change this so as to wait for the writes to be synced to disk or to propagate to two or more slaves. This is known as WriteConcern.
 You make sure that certain writes are written to the master and some slaves by setting the WriteConcern to REPLICAS_SAFE.
o WriteConcern can be set for all writes to a collection, or per operation by specifying it on the save command, as sketched below.
 There is a tradeoff that you need to think about carefully, based on your application needs and business requirements, to decide what setting makes sense for slaveOk during reads and what safety level you desire during writes with WriteConcern.
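The code samples these notes refer to are not reproduced above, so here is a hedged sketch using the current MongoDB Java driver, where the legacy REPLICAS_SAFE setting corresponds roughly to acknowledgement by a majority (WriteConcern.MAJORITY); the connection string, database, and collection names are illustrative.

    import com.mongodb.WriteConcern;
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    public class WriteConcernSketch {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                // WriteConcern for all writes to a collection.
                MongoCollection<Document> orders = client.getDatabase("ecommerce")
                        .getCollection("order")
                        .withWriteConcern(WriteConcern.MAJORITY);
                orders.insertOne(new Document("customer", "Ann").append("total", 42.0));

                // Per-operation choice: lowest safety for throwaway log entries.
                client.getDatabase("ecommerce").getCollection("log")
                      .withWriteConcern(WriteConcern.UNACKNOWLEDGED)
                      .insertOne(new Document("event", "page_view"));
            }
        }
    }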


 Transactions
o Transactions, in the traditional RDBMS sense, mean that you can start
modifying the database with insert, update, or delete commands over different
tables and then decide if you want to keep the changes or not by using commit
or rollback.
o These constructs are generally not available in NoSQL solutions—a write either
succeeds or fails. Transactions at the single-document level are known as
atomic transactions.
o Transactions involving more than one operation are not possible, although there
are products such as RavenDB that do support transactions across multiple
operations.
o WriteConcern parameter
 By default, all writes are reported as successful. Finer control over the write can be achieved by using the WriteConcern parameter.
 We ensure that an order is written to more than one node before it is reported successful by using WriteConcern.REPLICAS_SAFE.
 Different levels of WriteConcern let you choose the safety level during writes; for example, when writing log entries, you can use the lowest level of safety, WriteConcern.NONE.

 Availability
o The CAP theorem (“The CAP Theorem,” p. 53) dictates that we can have only two of
Consistency, Availability, and Partition Tolerance.
o Document databases try to improve on availability by replicating data using the
master-slave setup.
o The same data is available on multiple nodes and the clients can get to the data even
when the primary node is down.
o Usually, the application code does not have to determine if the primary node is
available or not.
o MongoDB implements replication, providing high availability using replica sets.
 Query Features
o Document databases provide different query features.
o CouchDB allows you to query via views: complex queries on documents, which can be either materialized (“Materialized Views,” p. 30) or dynamic (think of them as RDBMS views which are either materialized or not).
o With CouchDB, if you need to aggregate the number of reviews for a product as well as the average rating, you could add a view implemented via map-reduce (“Basic Map-Reduce,” p. 68) to return the count of reviews and the average of their ratings.
 Scaling
o The idea of scaling is to add nodes or change data storage without simply migrating
the database to a bigger box.
o We are not talking about making application changes to handle more load; instead,
we are interested in what features are in the database so that it can handle more load.
o Scaling for heavy-read loads can be achieved by adding more read slaves, so that all the reads can be directed to the slaves.
o Given a heavy-read application with a 3-node replica-set cluster, we can add more read capacity as the read load increases just by adding more slave nodes to the replica set and executing reads with the slaveOk flag, as sketched below.
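A hedged sketch of directing reads at replica-set secondaries with the current MongoDB Java driver, where slaveOk is the older name for what the driver now exposes as a ReadPreference; the connection string and names are illustrative.

    import com.mongodb.ReadPreference;
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    public class ReadScalingSketch {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create(
                    "mongodb://node1:27017,node2:27017,node3:27017/?replicaSet=rs0")) {
                // Reads go to a secondary when one is available, easing load on the primary.
                MongoCollection<Document> products = client.getDatabase("ecommerce")
                        .getCollection("product")
                        .withReadPreference(ReadPreference.secondaryPreferred());
                System.out.println(products.find().first());
            }
        }
    }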

5. Describe suitable use cases of document databases.

Ans:

 Suitable use cases of document databases:


o Event Logging
o Content Management Systems, Blogging Platforms
o Web Analytics or Real-Time Analytics
o E-Commerce Applications
 Event Logging
o Applications have different event logging needs; within the enterprise, there are many different applications that want to log events.
o Document databases can store all these different types of events and can act as a central data store for event storage.
o This is especially true when the type of data being captured by the events keeps changing.
o Events can be sharded by the name of the application where the event originated or by the type of event, such as order_processed or customer_logged.
 Content Management Systems, Blogging Platforms
o Since document databases have no predefined schemas and usually understand JSON documents, they work well in content management systems or applications for publishing websites, managing user comments, user registrations, profiles, and web-facing documents.
 Web Analytics or Real-Time Analytics
o Document databases can store data for real-time analytics; since parts of the document can be updated, it's very easy to store page views or unique visitors, and new metrics can be easily added without schema changes, as sketched below.
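A hedged sketch of updating only part of a document with the current MongoDB Java driver: each page view increments a counter inside that page's analytics document, and upsert creates the document on the first view. Database, collection, and field names are illustrative.

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.model.Filters;
    import com.mongodb.client.model.UpdateOptions;
    import com.mongodb.client.model.Updates;

    public class PageViewCounterSketch {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                client.getDatabase("analytics").getCollection("pageviews")
                      .updateOne(Filters.eq("url", "/products/nosql-distilled"),
                                 Updates.inc("views", 1),                  // update only this field
                                 new UpdateOptions().upsert(true));        // create on first view
            }
        }
    }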
 E-Commerce Applications
o E-commerce applications often need to have flexible schema for products and
orders, as well as the ability to evolve their data models without expensive database
refactoring or data migration.
