NoSQL QBSOL IA-02
1. Explain the Map-Reduce programming model and its basic idea with a suitable example
Ans:
Map-Reduce is a programming model used for processing and generating large datasets.
It divides the task into two main steps:
o Map: Processes input data and transforms it into intermediate key-value pairs.
o Reduce: Combines the intermediate key-value pairs that share the same key.
Basic Idea: Let's assume we have chosen orders as our aggregate, with each order having
line items. Each line item has a product ID, quantity, and the price charged. This aggregate
makes a lot of sense as usually people want to see the whole order in one access. We
have lots of orders, so we've sharded the dataset over many machines.
However, sales analysis people want to see a product and its total revenue for the last
seven days. This report doesn't fit the aggregate structure we have, which is the
downside of using aggregates. To get the product revenue report, you'll have to
visit every machine in the cluster and examine many records on each machine.
The first stage in a map-reduce job is the map.
A map is a function whose input is a single aggregate and whose output is a bunch of
key-value pairs. In this case, the input would be an order.
The output would be key-value pairs corresponding to the line items. Each one would
have the product ID as the key and an embedded map with the quantity and price as the
values.
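A minimal sketch of such a map function in plain Java (no particular framework; the Order and LineItem types are hypothetical, for illustration only):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical domain types for the orders aggregate.
record LineItem(String productId, int quantity, double price) {}
record Order(String orderId, List<LineItem> items) {}

class OrderMap {
    // Map: input is one order aggregate; output is one key-value pair per
    // line item, keyed by product ID, with quantity and price as the value.
    static List<Map.Entry<String, double[]>> map(Order order) {
        List<Map.Entry<String, double[]>> pairs = new ArrayList<>();
        for (LineItem item : order.items()) {
            pairs.add(Map.entry(item.productId(),
                      new double[] { item.quantity(), item.price() }));
        }
        return pairs;
    }
}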
Each application of the map function is independent of all the others. This allows them
to be safely parallelizable, so that a map-reduce framework can create efficient map
tasks on each node and freely allocate each order to a map task. This yields a great deal
of parallelism and locality of data access.
A map operation only operates on a single record; the reduce function takes multiple map
outputs with the same key and combines their values. So, a map function might yield
1000 line items from orders for “Database Refactoring”; the reduce function would
reduce down to one, with the totals for the quantity and revenue. While the map function
is limited to working only on data from a single aggregate, the reduce function can use all
values emitted for a single key.
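A matching reduce sketch: it receives all values emitted for one product key and collapses them into a single total quantity and total revenue (again plain Java, not any specific framework's API):

import java.util.List;

class RevenueReduce {
    // Reduce: all {quantity, price} values emitted for one product key
    // are combined into one {totalQuantity, totalRevenue} result.
    static double[] reduce(String productId, List<double[]> values) {
        double totalQuantity = 0, totalRevenue = 0;
        for (double[] v : values) {
            totalQuantity += v[0];          // quantity
            totalRevenue  += v[0] * v[1];   // quantity * price
        }
        return new double[] { totalQuantity, totalRevenue };
    }
}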
The map-reduce framework arranges for map tasks to be run on the correct nodes to
process all the documents and for data to be moved to the reduce function.
2. Explain partitioning and combining in Map-Reduce by taking suitable examples
Ans:
Partitioning
o When the mapper output is collected, it is partitioned, which means it will be
written to the output specified by the partitioner.
o Partitioning is responsible for dividing up the intermediate key space and
assigning intermediate key-value pairs to reducers.
o It assigns approximately the same number of keys to each reducer.
o The first thing we can do is increase parallelism by partitioning the output of the
mappers. Each reduce function operates on the results of a single key.
o This is a limitation—it means you can’t do anything in the reduce that operates
across keys—but it’s also a benefit in that it allows you to run multiple reducers
in parallel.
o To take advantage of this, the results of the mapper are divided up based on the key
on each processing node. Typically, multiple keys are grouped together into
partitions.
o The framework then takes the data from all the nodes for one partition, combines
it into a single group for that partition, and sends it off to a reducer. Multiple
reducers can then operate on the partitions in parallel, with the final results
merged together. (This step is also called “shuffling,” and the partitions are
sometimes referred to as “buckets” or “regions.”)
o Uses
Partitioning is used to group all the values for a specific key. Values with the
same key are sent to the same reducer, which helps determine which reducer is
responsible for which key (see the hash sketch after this list).
In MapReduce, the number of partitions is the same as the number of reducers.
The output of a single partition is sent to a single reducer; if there is only
one reducer, partitioning has no effect since all keys go to it.
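A sketch of the usual hash-based rule for assigning a key to a reducer; the modulo logic below mirrors the common default (for example, Hadoop's HashPartitioner), with numReducers being a configuration value:

class KeyPartitioner {
    // Every occurrence of the same key maps to the same partition,
    // and therefore to the same reducer.
    static int partition(String key, int numReducers) {
        // Clear the sign bit so the result is a valid, non-negative index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}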
Combining
o Partitioning still leaves the problem of the amount of data being moved from node
to node between the map and reduce stages. Much of this data is repetitive,
consisting of multiple key-value pairs for the same key. A combiner function cuts
this data down by combining all the data for the same key into a single value. A
combiner function is, in essence, a reducer function; indeed, in many cases the same
function can be used for combining as for the final reduction. The reduce function
needs a special shape for this to work: its output must match its input. We call
such a function a combinable reducer (a sketch of one follows this list).
o Combiners are an optimization in Map-Reduce that allows local aggregation
before the shuffle-and-sort phase.
o If a combiner is used, the map output key-value pairs are not immediately written
to the output. Instead, they are collected in lists, one list per key.
o Not every reduce function is combinable. For example, to count how many unique
customers order a particular product, the map function would need to emit the
product and the customer. The reducer can then combine them and count how many
times each customer appears for a particular product, emitting the product and the
count. But this reducer's output is different from its input, so it can't be used as
a combiner. You can still run a combining function here: one that just eliminates
duplicate product-customer pairs, but it will be different from the final reducer.
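A sketch of a combinable reducer for the revenue example, assuming the map emits per-line {quantity, revenue} values so that the reducer's output has the same shape as its input; the same function can then run locally as a combiner and again as the final reducer:

import java.util.List;

class TotalsCombiner {
    // Input values and output value share the shape {quantity, revenue}, so
    // this function can be used both for combining and for the final reduce.
    static double[] combine(String productId, List<double[]> values) {
        double quantity = 0, revenue = 0;
        for (double[] v : values) {
            quantity += v[0];
            revenue  += v[1];
        }
        return new double[] { quantity, revenue };
    }
}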
3. Describe how Map-Reduce calculations are composed. Explain with a diagram
Ans:
The map-reduce approach is a way of thinking about concurrent processing that trades
off flexibility in how you structure your computation for a relatively straightforward
model for parallelizing the computation over a cluster. Since it's a trade-off, there are
constraints on what you can do in your calculations.
Within a map task, you can only operate on a single aggregate; within a reduce task, you
can only operate on a single key.
One simple limitation is that you have to structure your calculations around operations
that fit in well with the notion of a reduce operation.
An important property of averages is that they are not composable; that is, if I take two
groups of orders, I can't combine their averages alone. Instead, I need to take the total
amount and the count of orders from each group, combine those, and then calculate the
average from the combined sum and count.
To make a count, the mapping function will emit count fields with a value of 1, which can
be summed to get a total count.
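A small sketch of the composability point: partial results carry the total amount and the order count (the counts start as 1s from the mappers); those compose across groups, and the average is computed only at the end.

class AverageOfOrders {
    // A partial result: composable, unlike a pre-computed average.
    record Partial(double totalAmount, long orderCount) {}

    // Mapper side: each order contributes its amount and a count of 1.
    static Partial fromOrder(double orderAmount) {
        return new Partial(orderAmount, 1);
    }

    // Reducer/combiner side: partials from two groups combine by addition.
    static Partial combine(Partial a, Partial b) {
        return new Partial(a.totalAmount() + b.totalAmount(),
                           a.orderCount() + b.orderCount());
    }

    // Final step: the average is derived from the combined sum and count.
    static double average(Partial p) {
        return p.totalAmount() / p.orderCount();
    }
}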
4. Explain the following with respect to Map Reduce
a. Two stage Map Reduce
b. Incremental Map-Reduce
Ans:
Two-Stage Map-Reduce
o As map-reduce calculations become more complex, it is useful to break them down into
stages using a pipes-and-filters approach, with the output of one stage serving as the
input to the next.
o For example, to compare a product's monthly sales with the same month of the prior
year, a first stage produces records of the aggregate sales of a product per month, and
a second map-reduce stage joins each month's figure with the corresponding month of the
previous year.
o Breaking the calculation into stages keeps each individual map and reduce step simple,
and the intermediate outputs can be saved and reused by other calculations.
Incremental Map-Reduce
o Incremental Map-Reduce is a variant of Map-Reduce that handles incremental data
updates without requiring the entire dataset to be reprocessed. This is especially
beneficial in scenarios with frequently changing data, such as logs, streams, or real-
time analytics.
o The examples we've discussed so far are complete map-reduce computations,
where we start with raw inputs and create a final output.
o Many map-reduce computations take a while to perform, even with clustered
hardware, and new data keeps coming in which means we need to rerun the
computation to keep the output up to date.
o Starting from scratch each time can take too long, so often it's useful to structure a
map-reduce computation to allow incremental updates, so that only the minimum
computation needs to be done.
o The map stages of a map-reduce are easy to handle incrementally: only if the input
data changes does the mapper need to be rerun. Since maps are isolated from each
other, incremental updates are straightforward.
o The more complex case is the reduce step, since it pulls together the outputs from
many maps and any change in the map outputs could trigger a new reduction.
o This recomputation can be lessened depending on how parallel the reduce step is.
o If we are partitioning the data for reduction, then any partition that's unchanged does
not need to be re-reduced.
o Similarly, if there's a combiner step, it doesn't need to be rerun if its source data hasn't
changed.
o If our reducer is combinable, there's some more opportunities for computation
avoidance.
o If the changes are additive, that is, if we are only adding new records but not
changing or deleting any old records, then we can just run the reduce with the existing
result and the new additions (see the sketch after this list).
o If there are destructive changes, that is, updates and deletes, then we can avoid some
recomputation by breaking up the reduce operation into steps and only recalculating
those steps whose inputs have changed; essentially, this uses a Dependency Network to
organize the computation.
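A sketch of the additive case described above: when only new records arrive, a combinable reducer can fold the map outputs of just the new orders into the previously stored {quantity, revenue} result instead of re-reducing everything (plain Java, illustrative names):

import java.util.List;

class IncrementalTotals {
    // existing: the {quantity, revenue} result saved from the last run.
    // newValues: map outputs produced only from the newly added records.
    static double[] update(double[] existing, List<double[]> newValues) {
        double quantity = existing[0], revenue = existing[1];
        for (double[] v : newValues) {
            quantity += v[0];
            revenue  += v[1];
        }
        return new double[] { quantity, revenue };
    }
}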
5. Explain the purpose of using Key-Value stores. Explain how the entire data can be stored in a
single bucket of a key-value data store. List some popular Key-Value databases
Ans:
A key-value store is a simple hash table, primarily used when all access to the database is via
primary key.
Think of a table in a traditional RDBMS with two columns, such as ID and NAME.
The ID column is the key, and the NAME column stores the value. In an RDBMS, the NAME
column is restricted to storing data of type String.
If the ID already exists, the current value is overwritten; otherwise, a new entry is created.
The three operations performed on a key-value database are (a small sketch follows this list):
o Put(Key,Value)
o Get(Key)
o Delete(key)
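A minimal sketch of the three operations, using an in-memory java.util.Map as a stand-in for a real key-value store; the value is treated as an opaque blob:

import java.util.HashMap;
import java.util.Map;

class SimpleKeyValueStore {
    private final Map<String, byte[]> data = new HashMap<>();

    // put(key, value): overwrites the current value if the key already exists.
    void put(String key, byte[] value) { data.put(key, value); }

    // get(key): returns the stored blob, or null if the key is absent.
    byte[] get(String key) { return data.get(key); }

    // delete(key): removes the entry for the key, if present.
    void delete(String key) { data.remove(key); }
}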
Key-value stores are the simplest NoSQL data stores to use from an API perspective. The
client can either get the value for the key, put a value for a key, or delete a key from the data
store.
The value is a blob that the data store just stores, without caring or knowing what’s inside; it’s
the responsibility of the application to understand what was stored.
Since key-value stores always use primary-key access, they generally have great
performance and can be easily scaled.
Examples of KeyValue DBMS
o Redis
o Riak
o Oracle NoSQL Database
If we wanted to store user session data, shopping cart information, and user preferences in
Riak, we could just store all of them in the same bucket with a single key and single value for
all of these objects.
Storing all the data in a single bucket
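A sketch of storing all three kinds of objects in one bucket, assuming the legacy com.basho.riak.client Java driver (the same API the fetchBucket/DomainBucket snippet below uses); the bucket name, keys, and payloads are illustrative:

import com.basho.riak.client.IRiakClient;
import com.basho.riak.client.RiakException;
import com.basho.riak.client.RiakFactory;
import com.basho.riak.client.bucket.Bucket;

class SingleBucketExample {
    public static void main(String[] args) throws RiakException {
        IRiakClient client = RiakFactory.pbcClient();
        // One bucket for everything; the key distinguishes the object type.
        Bucket bucket = client.fetchBucket("userData").execute();

        bucket.store("u1_sessionData", "{\"cartId\":\"cart1\"}").execute();
        bucket.store("u1_shoppingCart", "{\"items\":[]}").execute();
        bucket.store("u1_userPreferences", "{\"lang\":\"en\"}").execute();

        client.shutdown();
    }
}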
Domain Buckets
o We could also create buckets that store specific data. In Riak, these are known as
domain buckets, which allow the serialization and deserialization to be handled by the
client driver.
o Using domain buckets or different buckets for different objects (such as UserProfile
and ShoppingCart) segments the data across different buckets, allowing you to read
only the object you need without having to change the key design.
o Bucket bucket = client.fetchBucket(bucketName).execute();
o DomainBucket<UserProfile> profileBucket =
      DomainBucket.builder(bucket, UserProfile.class).build();
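A possible follow-up use of that domain bucket (again assuming the legacy Riak Java client's DomainBucket API; UserProfile is an application class, and exception handling is omitted as in the snippet above):
o profileBucket.store(new UserProfile("u1", "Martin"));
o UserProfile profile = profileBucket.fetch("u1");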
1. Explain the Document data model taking suitable examples. Differentiate between Oracle and
MongoDB
Ans:
Document Databases
o Documents are the main concept in document databases.
o The database stores and retrieves documents, which can be XML, JSON, BSON,
and so on.
o These documents are self-describing, hierarchical tree data structures which
can consist of maps, collections, and scalar values.
o The documents stored are similar to each other but do not have to be exactly the
same.
o Document databases store documents in the value part of the key-value store;
think about document databases as key-value stores where the value is
examinable (a MongoDB sketch follows this list).
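For example, an order can be stored as a single self-describing document; a sketch using the MongoDB Java driver (the database, collection, and field names are illustrative):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.util.List;

class OrderDocumentExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> orders =
                    client.getDatabase("ecommerce").getCollection("orders");

            // A hierarchical, self-describing document: maps, lists, and scalars.
            Document order = new Document("customerId", 42)
                    .append("items", List.of(
                            new Document("productId", "P100")
                                    .append("quantity", 2)
                                    .append("price", 12.50)))
                    .append("shippingAddress", new Document("city", "Bengaluru"));

            orders.insertOne(order);
        }
    }
}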
2. Describe the purpose of using Document databases with suitable examples. Explain the
features of Document databases
3. Explain the following w.r.t Document databases taking suitable examples
(i) Availability
(ii) Scaling
(OR)
4. Explain Consistency and Transaction features of Document data model
Ans:
Transactions
o Transactions, in the traditional RDBMS sense, mean that you can start
modifying the database with insert, update, or delete commands over different
tables and then decide if you want to keep the changes or not by using commit
or rollback.
o These constructs are generally not available in NoSQL solutions—a write either
succeeds or fails. Transactions at the single-document level are known as
atomic transactions.
o Transactions involving more than one operation are not possible, although there
are products such as RavenDB that do support transactions across multiple
operations.
o WriteConcern parameter
By default, all writes are reported as successful. Finer control over the
write can be achieved by using the WriteConcern parameter.
We can ensure that an order is written to more than one node before it is
reported successful by using WriteConcern.REPLICAS_SAFE.
Different levels of WriteConcern let you choose the safety level during writes;
for example, when writing log entries, you can use the lowest level of safety,
WriteConcern.NONE (a sketch follows this list).
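A sketch of the same idea with the current MongoDB Java driver, where WriteConcern.W2 roughly plays the role of the older REPLICAS_SAFE and WriteConcern.UNACKNOWLEDGED the role of NONE (the collection and document contents are illustrative):

import com.mongodb.WriteConcern;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

class WriteConcernExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> orders =
                    client.getDatabase("ecommerce").getCollection("orders");

            // The order must reach at least two replica-set members before
            // the write is reported as successful.
            orders.withWriteConcern(WriteConcern.W2)
                  .insertOne(new Document("orderId", 99).append("total", 32.00));

            // Log entries: lowest safety, the driver does not wait for acknowledgement.
            client.getDatabase("ecommerce").getCollection("log")
                  .withWriteConcern(WriteConcern.UNACKNOWLEDGED)
                  .insertOne(new Document("msg", "order 99 created"));
        }
    }
}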
Availability
o The CAP theorem (“The CAP Theorem,” p. 53) dictates that we can have only two of
Consistency, Availability, and Partition Tolerance.
o Document databases try to improve on availability by replicating data using the
master-slave setup.
o The same data is available on multiple nodes and the clients can get to the data even
when the primary node is down.
o Usually, the application code does not have to determine if the primary node is
available or not.
o MongoDB implements replication, providing high availability using replica sets (a
connection sketch follows).
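A sketch of how the client stays unaware of which node is primary: it is given several replica-set members as seeds, and the driver routes operations to whichever node is currently primary (the host names and set name are illustrative):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;

class ReplicaSetConnection {
    public static void main(String[] args) {
        // Any reachable seed lets the driver discover the current primary,
        // so application code does not track primary failover itself.
        try (MongoClient client = MongoClients.create(
                "mongodb://mongo1:27017,mongo2:27017,mongo3:27017/?replicaSet=rs0")) {
            System.out.println(client.getDatabase("ecommerce").getName());
        }
    }
}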
Query Features
o Document databases provide different query features.
o CouchDB allows you to query via views: complex queries on documents, which can
be either materialized (“Materialized Views,” p. 30) or dynamic (think of them as
RDBMS views which are either materialized or not).
o With CouchDB, if you need to aggregate the number of reviews for a product as well
as the average rating, you could add a view implemented via map-reduce (“Basic
Map-Reduce,” p. 68) to return the count of reviews and the average of their ratings.
Scaling
o The idea of scaling is to add nodes or change data storage without simply migrating
the database to a bigger box.
o We are not talking about making application changes to handle more load; instead,
we are interested in what features are in the database so that it can handle more load.
o Scaling for heavy-read loads can be achieved by adding more read slaves, so that all
the reads can be directed to the slaves.
o Given a heavy-read application with our 3-node replica-set cluster, we can add more
read capacity to the cluster as the read load increases just by adding more slave
nodes to the replica set and executing reads with the slaveOk flag (a read-preference
sketch follows).
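The slaveOk flag belongs to older MongoDB drivers; a sketch of the same intent with the current Java driver, which directs reads to secondaries via a read preference (names are illustrative):

import com.mongodb.ReadPreference;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

class ReadFromSecondaries {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create(
                "mongodb://mongo1:27017,mongo2:27017,mongo3:27017/?replicaSet=rs0")) {
            // Prefer secondaries for reads, spreading read load across slaves.
            MongoCollection<Document> orders = client.getDatabase("ecommerce")
                    .getCollection("orders")
                    .withReadPreference(ReadPreference.secondaryPreferred());

            System.out.println(orders.countDocuments());
        }
    }
}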