MST answers Big Data
Absolutely — I will now give you slightly longer, exam-ready answers, written in
simple RGPV language, 5–8 marks length, paragraph + points + diagrams.
These will be longer than before, but still easy to remember and write.
I will start with UNIT-III complete expanded answers.
After that, I will expand UNIT-IV and UNIT-V as well (just tell me “continue Unit IV”).
⭐ UNIT–III (EXPANDED EXAM-READY ANSWERS)
(RGPV Dec/Nov 2024–2023)
Q.1 Explain the procedure to write User Defined
Functions (UDFs) in Hive. (RGPV Dec 2024)
User Defined Functions (UDFs) in Hive allow developers to add custom processing
logic that is not available in Hive’s built-in functions. UDFs make Hive more flexible
and suitable for complex data transformations. Since Hive itself is written in Java,
UDFs are normally written in Java as well.
Detailed Procedure
1. Create Java Class
Create a Java file and import Hive libraries.
Extend the class using:
org.apache.hadoop.hive.ql.exec.UDF .
This helps Hive recognize your class as a function.
2. Implement evaluate() Method
Hive calls the evaluate() method of the UDF once for every input row.
It receives one or more parameters and returns the result.
Example: convert a string to uppercase or perform complex calculations.
3. Compile the Java Code
Compile the class using javac , with the Hive JARs on the classpath.
All .class files are bundled into a JAR file using:
jar cf myudf.jar *.class .
4. Add JAR to Hive
Start the Hive shell and load the JAR:
ADD JAR /path/myudf.jar;
Hive now has access to the custom function.
5. Register the Function
A temporary function is created using:
CREATE TEMPORARY FUNCTION myfunc AS 'mypackage.MyUDF';
You may also create a permanent function, stored in the metastore, using
CREATE FUNCTION ... USING JAR.
6. Use Function in HiveQL Queries
UDF works like any built-in function:
SELECT myfunc(column_name) FROM table;
7. Test & Optimize
Check output type, performance, and exceptions.
Diagram – Hive UDF Workflow
Java Code → Compile → Create JAR → ADD JAR → CREATE FUNCTION → Run in Hive Query
Q.2 Write a word count program in Pig to count
occurrences of similar words. (RGPV Dec 2024)
Word count is the most common Pig Latin example. Pig makes this easier because it
supports built-in functions like TOKENIZE() and relational operators like GROUP and
COUNT() .
Pig Latin Code (Word Count Script)
lines = LOAD 'input.txt' USING TextLoader() AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;
STORE wordcount INTO 'output';
Explanation in Steps
1. LOAD loads each line from file.
2. TOKENIZE splits lines into individual words.
3. FLATTEN removes nested structure.
4. GROUP collects identical words together.
5. COUNT calculates frequency.
6. STORE writes output back to HDFS.
Diagram
Input Lines → Tokenize → Words → Group by Words → Count → Output
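To see what the Pig script computes, the same pipeline can be sketched in plain Python. This is only an illustration of the logic: the sample lines are made up, and splitting on whitespace is a simplification of what TOKENIZE does.

```python
from collections import Counter

def word_count(lines):
    """Mirror the Pig pipeline: TOKENIZE + FLATTEN, then GROUP + COUNT."""
    # TOKENIZE + FLATTEN: split every line and collect one flat list of words
    words = [word for line in lines for word in line.split()]
    # GROUP BY word + COUNT: Counter tallies identical words together
    return dict(Counter(words))

# Hypothetical input standing in for the lines of input.txt
sample_lines = ["big data is big", "pig makes big data easy"]
counts = word_count(sample_lines)
print(counts)
```

Each key in the result corresponds to one output row of the Pig relation wordcount (word, total).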
Q.3 Explain the working of Hive with proper steps and
diagram. (RGPV Dec 2024)
Hive is a data warehousing tool that allows SQL-like queries on top of Hadoop. It
converts SQL queries (HiveQL) into MapReduce, Tez, or Spark jobs. Hive is mainly
used for batch processing, aggregation, and data summarization.
Working Steps of Hive
1. Query Submission
User submits HiveQL query using CLI, JDBC, Web UI, or beeline.
2. Parsing
Hive compiler checks syntax errors.
Converts query into an Abstract Syntax Tree (AST).
3. Semantic Analysis
Meaning of query verified using metadata in Metastore.
Ensures tables, columns, and data types exist.
4. Logical Plan Generation
Hive creates a logical plan showing operations like SELECT, FILTER, JOIN.
5. Optimization
Optimizer performs:
predicate pushdown
column pruning
join reordering
Map-side aggregation
6. Physical Plan
Converts logical operators into tasks.
Tasks executed using MapReduce, Tez, or Spark.
7. Execution
The jobs run on the cluster under YARN and read/write data in HDFS.
8. Result Return
Output stored back in HDFS or displayed on console.
Hive Architecture Diagram
+-------------------+
|   HiveQL Query    |
+---------+---------+
          |
   Query Compiler
          |
+---------+---------+
|  Query Optimizer  |
+---------+---------+
          |
   Execution Engine
          |
   Hadoop/YARN Jobs
          |
Output Returned to User
Q.4 What are the different Hive Data Types? Explain
them. (RGPV Dec 2024)
Hive supports primitive, complex, and miscellaneous data types.
⭐ A. Primitive Types
1. Numeric Types
TINYINT, SMALLINT, INT, BIGINT: store integers.
FLOAT, DOUBLE: floating-point (approximate) decimal values.
DECIMAL: exact numeric type with a defined precision and scale.
2. String Types
STRING: unlimited length text.
VARCHAR(n): variable length up to n.
CHAR(n): fixed length.
3. Date/Time Types
DATE, TIMESTAMP, INTERVAL.
4. Boolean Type
BOOLEAN: true or false.
5. Binary Type
BINARY: raw bytes.
⭐ B. Complex Data Types
1. ARRAY<T>
Ordered list of elements.
Example: [1, 2, 3] .
2. MAP<K,V>
Key-value pairs.
Example: {"name": "Garvit"} .
3. STRUCT<…>
Group of fields with different data types.
Example: struct<name:string, age:int> .
4. UNIONTYPE
Stores one of many specified types.
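As a rough analogy only, the complex types can be pictured with ordinary Python structures. The class name Person and all the values below are illustrative assumptions, not Hive syntax.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

@dataclass
class Person:
    """Plays the role of STRUCT<name:string, age:int>."""
    name: str
    age: int

marks: List[int] = [1, 2, 3]                   # like ARRAY<INT>
attrs: Dict[str, str] = {"name": "Garvit"}     # like MAP<STRING,STRING>
person = Person(name="Garvit", age=21)         # like a STRUCT value
value: Union[int, str] = "abc"                 # like UNIONTYPE<INT,STRING>

# Access patterns resemble HiveQL: marks[0], attrs['name'], person.name
print(marks[0], attrs["name"], person.name, value)
```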
Q.5 What is Hive Metastore? Which classes are used by
Hive to read and write HDFS files? (RGPV Nov 2023)
Hive Metastore (HMS)
Hive Metastore is a central repository that stores metadata required by Hive for
query execution. Hive uses HMS to know how data is stored in HDFS.
Metadata Stored in HMS
Table names, columns, and data types
Partition information
File formats (ORC, Parquet, Text)
HDFS location of tables
SerDe (Serializer-Deserializer) classes
Storage information
Why Metastore is important?
Without HMS, Hive cannot execute queries.
Helps compiler verify schema during semantic analysis.
The metadata itself is kept in a relational database such as MySQL or embedded Derby.
Classes Used by Hive to Read/Write HDFS Files
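At the file level, Hive uses Hadoop’s FileSystem API. (The records inside the files
are read and written through the table’s InputFormat, OutputFormat, and SerDe classes.)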
1. FileSystem Class
Handles file operations like open, create, delete.
2. FSDataInputStream
Used to read files from HDFS.
Supports seek operations.
3. FSDataOutputStream
Used to write files to HDFS.
4. Path Class
Represents path of file in HDFS.
Q.6 Write example Hive Queries for Natural Join and
Outer Join. (RGPV Nov 2023)
Natural Join
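HiveQL has no NATURAL JOIN keyword, so the same result is written as an inner join
on the matching column:
SELECT e.*, d.dept_name
FROM employee e
JOIN department d
ON e.dept_id = d.id;
Returns only the rows that have a match in both tables.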
Left Outer Join
SELECT e.name, d.dept_name
FROM employee e
LEFT OUTER JOIN department d
ON e.dept_id = d.id;
Right Outer Join
SELECT e.name, d.dept_name
FROM employee e
RIGHT OUTER JOIN department d
ON e.dept_id = d.id;
Full Outer Join
SELECT *
FROM employee e
FULL OUTER JOIN department d
ON e.dept_id = d.id;
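The difference between the outer joins is easiest to see on a tiny dataset. The sketch below imitates the three queries in plain Python; the employee and department rows are invented for illustration, and None plays the role of NULL.

```python
# Hypothetical rows: employee(name, dept_id) and department(id, dept_name)
employees = [("Asha", 1), ("Ravi", 2), ("Meena", None)]
departments = [(1, "Sales"), (2, "IT"), (3, "HR")]

def left_outer(emps, depts):
    """Keep every employee; dept_name is None when there is no match."""
    dept_by_id = dict(depts)
    return [(name, dept_by_id.get(dept_id)) for name, dept_id in emps]

def right_outer(emps, depts):
    """Keep every department; name is None when nobody works in it."""
    rows = []
    for dept_id, dept_name in depts:
        matches = [name for name, d in emps if d == dept_id]
        rows.extend((name, dept_name) for name in (matches or [None]))
    return rows

def full_outer(emps, depts):
    """Keep unmatched rows from both sides."""
    rows = left_outer(emps, depts)
    used_ids = {dept_id for _, dept_id in emps}
    rows += [(None, dept_name) for dept_id, dept_name in depts
             if dept_id not in used_ids]
    return rows

print(left_outer(employees, departments))
print(right_outer(employees, departments))
print(full_outer(employees, departments))
```

Meena has no department, so she appears only in the left and full results; HR has no employee, so it appears only in the right and full results.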
Q.7 List and explain Relational Operators in Pig. (RGPV
Nov 2023)
Pig provides relational operators to process data step-by-step like SQL but more
flexible.
Major Relational Operators
1. LOAD
Loads data from external source.
Example:
LOAD 'file.txt' AS (id:int, name:chararray);
2. STORE
Saves results to file.
3. FILTER
Select rows based on condition.
Example: FILTER data BY age > 20;
4. FOREACH…GENERATE
Performs transformations.
Similar to SELECT in SQL.
5. GROUP
Groups records by a key.
6. JOIN
Combines datasets based on keys.
7. ORDER BY
Sorts data.
8. DISTINCT
Removes duplicates.
9. LIMIT
Restricts number of rows.
10. UNION
Combines datasets.
11. SPLIT
Divides data into subsets.
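For intuition, several of these operators can be imitated on a small in-memory list. This is a Python analogy, not Pig Latin; the rows and variable names are made up for the example.

```python
from collections import defaultdict

# Hypothetical relation: (id, name, age), with one duplicate row
data = [(1, "Asha", 25), (2, "Ravi", 19), (3, "Meena", 25), (2, "Ravi", 19)]

distinct = sorted(set(data))                           # DISTINCT
adults = [row for row in distinct if row[2] > 20]      # FILTER data BY age > 20
minors = [row for row in distinct if row[2] <= 20]     # second branch of a SPLIT
names = [row[1] for row in adults]                     # FOREACH ... GENERATE name

grouped = defaultdict(list)                            # GROUP data BY age
for row in distinct:
    grouped[row[2]].append(row)

ordered = sorted(distinct, key=lambda row: row[1])     # ORDER data BY name
top_two = ordered[:2]                                  # LIMIT 2
combined = adults + minors                             # UNION

print(names, dict(grouped), top_two)
```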
Q.8 Discuss the various data types in Pig. (RGPV Nov
2022)
Pig supports two main categories: simple and complex types.
⭐ A. Simple/Atomic Types
int – Integer
long – Larger integer
float, double – Decimal numbers
chararray – Strings
bytearray – Raw bytes
boolean – True/false
⭐ B. Complex Types
1. TUPLE
Ordered set of fields (like row).
Example: (1, "Garvit", 78) .
2. BAG
Collection of tuples (unordered).
Example: {(1,A),(2,B),(3,C)} .
3. MAP
Key–value pairs.
Example: [name#Garvit, age#21] .
If you want, I will now provide UNIT-IV expanded answers next.
Just reply “Continue Unit IV”.
Great! Here are the expanded, exam-length, RGPV-oriented answers for UNIT–IV
(NoSQL).
These are written in simple language, with paragraph + points + examples, perfect
for 5–8 marks.
⭐ UNIT–IV (NO SQL DATABASES) – EXPANDED
ANSWERS
Q.1 Write a short note on use of NoSQL database in
industry. (RGPV Dec 2024)
NoSQL databases have become extremely important in modern industries because
traditional relational databases (SQL databases) cannot handle massive,
unstructured, and fast-changing data generated today. Industries like e-commerce,
social networking, IoT, healthcare, banking, entertainment, and real-time
applications rely heavily on NoSQL for scalability and speed.
Why industries use NoSQL (Detailed Points):
Handles Big Data: Industries generate terabytes of data per day (posts, logs,
clickstreams), which NoSQL can store and process efficiently.
Flexible Schema: Supports unstructured and semi-structured data like JSON,
XML, logs, images, videos.
High Scalability: Can add multiple servers easily (horizontal scaling).
Real-Time Processing: Fast reads and writes enable instant updates (chat apps,
notifications).
High Availability: Replication provides failover and reliability.
Cloud-Friendly: Works natively with distributed cloud systems.
Industrial Use Cases:
E-commerce (Amazon): Product catalog, user sessions, shopping carts
(DynamoDB, Cassandra).
Social Networks (Instagram, Facebook): User feeds, comments, reactions
(Cassandra, HBase).
Streaming Platforms (Netflix, YouTube): Recommendation metadata, video
catalogs.
IoT and Sensors: Time-series databases store live sensor data.
Gaming: User profiles, multiplayer game state stored instantly.
Thus, NoSQL supports huge-scale applications where speed and flexibility are crucial.
Q.2 Discuss various types of NoSQL databases with
examples. (RGPV Dec 2024)
NoSQL databases are non-relational and fall into four main categories, each
designed for specific types of data processing.
1. Key–Value Stores
Data stored as simple key → value pairs.
Highly scalable and extremely fast.
Best suited for caching, session data, and real-time applications.
Examples: Redis, Riak, Amazon DynamoDB.
2. Document-Oriented Databases
Store data as JSON/BSON documents.
Allow nested structures and flexible schema.
Ideal for content management, user profiles, blogs.
Examples: MongoDB, CouchDB, Couchbase.
3. Column-Family Stores (Wide Column Stores)
Data stored in tables with rows and dynamic columns.
Excellent for large analytical workloads and distributed storage.
Used in big data systems.
Examples: Apache Cassandra, Apache HBase.
4. Graph Databases
Store data in nodes, edges, and properties.
Best for relationship-heavy data like social networks.
Supports graph-based algorithms like shortest path, PageRank.
Examples: Neo4j, OrientDB, Amazon Neptune.
Diagram: Types of NoSQL Databases
                      NoSQL
 -------------------------------------------------
 |              |               |                |
Key–Value    Document     Column-Family        Graph
Redis        MongoDB      HBase                Neo4j
Riak         CouchDB      Cassandra            OrientDB
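The four models can be compared by storing the same user in each shape. The sketch below uses plain Python structures as stand-ins; the keys, field names, and values are hypothetical and do not follow any particular product's API.

```python
# 1. Key-value: an opaque value looked up by a single key
kv_store = {"user:101": "Garvit"}

# 2. Document: a nested, JSON-like record with a flexible set of fields
document = {
    "_id": 101,
    "name": "Garvit",
    "address": {"city": "Bhopal"},
    "skills": ["Hive", "Pig"],
}

# 3. Column-family: row key -> column family -> column -> value
column_family = {
    "101": {
        "profile": {"name": "Garvit"},
        "activity": {"last_login": "2024-12-01"},
    }
}

# 4. Graph: nodes with properties plus labelled edges between them
graph = {
    "nodes": {101: {"name": "Garvit"}, 102: {"name": "Asha"}},
    "edges": [(101, 102, "FOLLOWS")],
}

print(kv_store["user:101"], document["address"]["city"],
      column_family["101"]["profile"]["name"], graph["edges"][0][2])
```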
Q.3 Discuss key characteristics and advantages of
NoSQL database. (RGPV Dec 2024)
Key Characteristics
1. Schema-Free or Flexible Schema
No predefined column structure needed.
JSON-like documents allow easy updates.
2. Horizontal Scalability
Add more nodes instead of upgrading hardware.
Perfect for cloud environments.
3. High Performance
Can sustain very high read/write throughput when spread across many nodes.
Uses caching, in-memory storage, and efficient indexing.
4. Distributed Architecture
Data replicated across multiple nodes for reliability.
5. Supports Unstructured Data
Accepts logs, multimedia, documents, and irregular data.
6. BASE Model
Basic Availability, Soft state, Eventual consistency.
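The BASE idea can be shown with a toy simulation of one primary copy and one replica. This is a minimal sketch under invented names (Replica, write, replicate); real databases replicate over a network and handle conflicts, which is not modelled here.

```python
class Replica:
    """A single copy of the data, held as a dictionary."""
    def __init__(self):
        self.data = {}

primary, secondary = Replica(), Replica()
pending = []  # writes accepted by the primary but not yet copied

def write(key, value):
    primary.data[key] = value      # accepted at once (basic availability)
    pending.append((key, value))   # the replica is updated later

def replicate():
    """Apply all outstanding writes to the secondary copy."""
    while pending:
        key, value = pending.pop(0)
        secondary.data[key] = value

write("likes", 10)
stale = secondary.data.get("likes")   # replica has not caught up (soft state)
replicate()
fresh = secondary.data.get("likes")   # copies now agree (eventual consistency)
print(stale, fresh)
```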
Advantages
1. Handles Big Data Easily
High scalability makes NoSQL perfect for modern data volume.
2. Faster Development
Flexible schema reduces development effort.
3. Low Latency
Quick data retrieval essential for real-time apps.
4. High Availability
Automatic failover keeps the system available even when individual nodes fail.
5. Cost-Effective
Runs on commodity hardware, reducing server cost.
6. Ideal for Distributed Apps
Works with microservices, cloud-native apps.
Q.4 Write a short note on NoSQL databases. (RGPV
Nov 2023)
NoSQL databases are non-relational databases designed to overcome the limitations
of traditional relational (SQL) systems. They can store and process large volumes of
structured, semi-structured, or unstructured data. NoSQL databases do not
require fixed tables or schemas, making them more flexible and suitable for dynamic
applications.
Important Features
schema-free
horizontally scalable
high performance
distributed storage
supports JSON-like documents
suitable for real-time systems
NoSQL is widely used in industries like social media, e-commerce, IoT, and cloud-
based applications.
Q.5 Differentiate between NoSQL and Relational
Database. (RGPV Nov 2023)
Feature          Relational Database (SQL)     NoSQL Database
Data Model       Tables (Rows & Columns)       Key-Value, Document, Column, Graph
Schema           Fixed, predefined             Flexible schema
Scalability      Vertical (scale-up)           Horizontal (scale-out)
Data Type        Structured                    Unstructured & semi-structured
Query Language   SQL                           Varies (JSON, map APIs, queries)
Transactions     ACID                          BASE
Performance      Slower for big data           High speed for big data
Suitable For     Banking, traditional apps     Big data, cloud apps, social media
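The schema row of the table can be demonstrated in a few lines. The sketch below uses Python's built-in sqlite3 module as the relational side and a plain list of dictionaries as a stand-in for a document store; the table and field names are assumptions made for the example.

```python
import sqlite3

# Relational side: the table has a fixed, predefined set of columns
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (1, "Asha"))

try:
    # 'city' is not part of the schema, so the insert is rejected
    conn.execute("INSERT INTO users (id, name, city) VALUES (?, ?, ?)",
                 (2, "Ravi", "Bhopal"))
    schema_error = None
except sqlite3.OperationalError as exc:
    schema_error = str(exc)

# Document-style side: each record may carry its own fields
documents = [
    {"id": 1, "name": "Asha"},
    {"id": 2, "name": "Ravi", "city": "Bhopal"},
]

print(schema_error)
print(documents[1]["city"])
```

In the relational case the new field needs an ALTER TABLE first; in the document case it is simply stored with the record.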
Q.6 Describe applications of NoSQL databases in
Industry. (RGPV Nov 2023)
NoSQL databases support a wide variety of industrial applications.
Applications
1. Social Networking
Store user feeds, likes, messages.
Example: Cassandra was originally developed at Facebook.
2. E-commerce
Shopping cart, customer info, product catalog.
Example: Amazon DynamoDB.
3. Big Data Analytics
Log processing, clickstream analysis, A/B testing.
4. Real-Time Data Processing
Stock market data, live dashboards.
5. Gaming
Multiplayer game state, leaderboards, real-time stats.
6. IoT
Sensor data, telemetry, time-series databases.
7. Healthcare
Patient history, medical records, machine data.
8. Finance & Banking
Fraud detection, user behavior patterns.
Q.7 What is NoSQL database? Discuss key
characteristics & advantages. (RGPV Nov 2023)
(This is a combination question → expanded below)
Definition
NoSQL stands for “Not Only SQL”. It includes a group of database technologies
designed for storing huge, unstructured, semi-structured, or rapidly changing data.
Characteristics
No fixed schema
High scalability
High performance
Distributed and fault tolerant
Supports variety of data types
BASE model instead of ACID
Advantages
Handles big data efficiently
Real-time query performance
Perfect for cloud-based and distributed apps
Faster development & schema changes
Cost-efficient
Q.8 How NoSQL is useful for Big Data problems? (RGPV
Nov 2022)
NoSQL databases solve major challenges of big data: volume, variety, velocity, and
veracity.
Why NoSQL is suitable for Big Data
1. Scalability
Data can be spread across thousands of servers.
2. Handles Unstructured Data
Big data includes logs, images, social posts, documents.
3. High Speed
Simple key-based access, and in-memory storage in stores such as Redis, make retrieval very fast.
4. Distributed Architecture
Replication ensures fault-tolerance and reliability.
5. Flexible Schema
New fields can be added anytime.
6. Real-Time Analytics
Supports real-time dashboards and monitoring.
Thus, NoSQL fits naturally into big data ecosystems.
Q.9 Explain the process of Sharding in MongoDB.
(RGPV Nov 2022)
Sharding is a method in MongoDB used to split large datasets across multiple
servers to improve performance and storage capacity.
⭐ Steps in Sharding
1. Choose a Shard Key
A field in the document used to divide data.
Example: user_id , country , date .
2. Create Shards
Each shard is a group of MongoDB servers, normally deployed as a replica set.
Each shard stores a subset of data.
3. Config Servers
Store metadata about shards and chunk locations.
4. Mongos Router
Acts as a query router.
Determines which shard holds the required data.
5. Chunking
MongoDB divides data into chunks (ranges).
Chunks automatically migrate across shards for balancing.
6. Automatic Balancing
Keeps data spread roughly evenly across shards.
Moves chunks between shards if imbalance occurs.
Diagram: MongoDB Sharding
           CLIENT
              |
           Mongos
    -------------------
    |         |        |
 Shard1    Shard2   Shard3
(Data A)  (Data B) (Data C)

       Config Servers
(Store metadata & chunk info)
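The routing step can be sketched as a lookup from shard-key ranges to shards. This is a simplified illustration of range-based routing, not MongoDB's implementation; the chunk boundaries, shard names, and the route function are all hypothetical.

```python
import bisect

# Hypothetical chunk ranges on the shard key user_id:
#   chunk 0: below 1000, chunk 1: 1000 to 1999, chunk 2: 2000 and above
chunk_upper_bounds = [1000, 2000]

# Which shard owns which chunk (the kind of metadata config servers hold)
chunk_to_shard = ["Shard1", "Shard2", "Shard3"]

def route(user_id):
    """Return the shard that holds the chunk containing this user_id."""
    chunk = bisect.bisect_right(chunk_upper_bounds, user_id)
    return chunk_to_shard[chunk]

for uid in (5, 1000, 2500):
    print(uid, "->", route(uid))
```

When the balancer moves a chunk, only the chunk_to_shard mapping would change; the client still talks to the router in the same way.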
If you want, I will now continue with UNIT–V expanded answers.
Just reply: Continue Unit V
Sure! Here are the expanded, exam-oriented, RGPV-style answers for UNIT–V
(Social Network Analysis).
All answers are in simple language, 5–8 marks length, with diagrams, examples,
and points.
⭐ UNIT–V (SOCIAL NETWORK ANALYTICS) –
EXPANDED ANSWERS
Q.1 Explain the following. (RGPV Dec 2024)
(i) Social Network as a Graph
A social network can be represented as a graph structure, where each person is
treated as a node (vertex) and the relation between people is shown as edges
(links). Social networks like Facebook, Twitter, LinkedIn naturally form large graphs.
Graph theory helps in understanding relationships, central users, groups,
influencers, etc.
Elements of Social Graph
Nodes: users/people
Edges: friendships, followers, interactions
Edge direction:
Undirected → Facebook friend
Directed → Twitter follow
Weight on edges: frequency of interaction
Graph Example Diagram
A ---- B ---- C
| |
D ----------- E
Importance
Helps detect communities
Helps find influential users
Helps model information spread
Performs link prediction (who may connect next)
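A social graph is usually stored as an adjacency list. The sketch below builds one in Python for five users; the edge list is an assumed reading of a small example graph, and the single follow relation is invented to show direction.

```python
from collections import defaultdict

# Undirected friendships (assumed edges for a five-user example)
friend_edges = [("A", "B"), ("B", "C"), ("A", "D"), ("C", "E"), ("D", "E")]

friends = defaultdict(set)
for u, v in friend_edges:
    friends[u].add(v)   # undirected: store the edge in both directions
    friends[v].add(u)

# Directed relation: A follows B, but B does not follow A back
follows = defaultdict(set)
follows["A"].add("B")

# Degree = number of direct connections of each user
degree = {node: len(neighbours) for node, neighbours in friends.items()}
print(degree)
print("B follows A:", "A" in follows["B"])
```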
(ii) Social Network Mining
Social network mining is the process of extracting meaningful information from
social graphs. It includes studying how people connect, identifying important users,
detecting communities, and analyzing the spread of information.
Key Functions of Social Network Mining
Detects hidden patterns
Identifies communities (clusters)
Finds influencers (high centrality)
Predicts links and relationships
Performs sentiment and trend analysis
Helps in targeted marketing
Applications
Facebook friend suggestions
Twitter trending hashtags
Fake account detection
Political opinion mining
Customer segmentation
(iii) Recommender System
A recommender system suggests items, movies, products, songs, or videos to users
based on past behavior or similar users. It is a core technology behind YouTube,
Netflix, Amazon, Flipkart, etc.
Types of Recommendation Systems
1. Content-Based Filtering
Finds similarity between items.
Example: If a user watches romantic movies → recommend more romantic
movies.
2. Collaborative Filtering
Finds similarity between users.
Example: “Users similar to you liked these items.”
3. Hybrid Model
Combines both for better accuracy.
Diagram
User History → Feature Extraction → Matching → Recommendations
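Collaborative filtering can be sketched with a small ratings table and cosine similarity. This is a minimal user-based version under made-up data; the user names, movie names, and the choice of a single nearest neighbour are assumptions for illustration.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating}
ratings = {
    "asha":  {"Movie1": 5, "Movie2": 4, "Movie3": 1},
    "ravi":  {"Movie1": 4, "Movie2": 5, "Movie4": 5},
    "meena": {"Movie3": 5, "Movie4": 2},
}

def cosine(a, b):
    """Cosine similarity between two users' rating dictionaries."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[item] * b[item] for item in common)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def recommend(user):
    """Suggest items the most similar user rated that this user has not seen."""
    others = [u for u in ratings if u != user]
    nearest = max(others, key=lambda u: cosine(ratings[user], ratings[u]))
    unseen = {item: score for item, score in ratings[nearest].items()
              if item not in ratings[user]}
    return sorted(unseen, key=unseen.get, reverse=True)

print(recommend("asha"))
```

Here asha and ravi rate Movie1 and Movie2 similarly, so ravi's other item is suggested to asha.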
Q.2 Explain the following: (RGPV Nov 2023)
(i) Applications of Social Network Mining
Social network mining helps industries, governments, and companies understand
online behavior, trends, and patterns.
Major Applications
Friend/connection recommendations
User segmentation for advertising
Trend analysis & hashtag monitoring (Twitter)
Fraud/Bot detection using unusual patterns
Community detection (groups with similar interests)
Opinion mining & sentiment analysis
Brand monitoring for companies
Influencer identification (celebrities, trendsetters)
Disease spread prediction using network models
(ii) Recommender System
A recommender system predicts what the user will like and shows personalized
suggestions.
Use Cases
Netflix movie recommendations
YouTube suggested videos
Flipkart/Amazon product recommendations
Spotify recommended songs
Benefits
Increases user engagement
Improves customer satisfaction
Boosts sales
Reduces search time for users
Q.3 Which mining method is most frequently used for
Social Network Graph? Explain. (RGPV Nov 2022)
The most frequently used mining method for social network graphs is Graph Mining.
What is Graph Mining?
Graph mining applies graph algorithms to explore relationships between nodes in a
social network.
Tasks performed in graph mining:
1. Centrality Calculation
Finds important/influential nodes.
Types: degree centrality, closeness, betweenness, eigenvector.
2. Community Detection
Groups similar users into clusters.
Helps find interest-based communities.
3. Link Prediction
Predicts future relationships (Facebook “People You May Know”).
4. Influence Spread Analysis
Models how information spreads (virality).
5. Shortest Path Analysis
Finds minimum interaction path between two people.
Why used most?
Social networks naturally form graphs.
Graph algorithms work efficiently on relationship data.
Provides deep insight into user behavior.
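Three of the tasks above can be tried on a toy graph. The sketch below computes degree centrality, a shortest path using breadth-first search, and a common-neighbours score as a simple link-prediction measure; the graph itself is invented for the example.

```python
from collections import deque

# Hypothetical undirected friendship graph as an adjacency list
graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}

def degree_centrality(g):
    """Fraction of the other nodes each node is directly connected to."""
    n = len(g)
    return {node: len(nbrs) / (n - 1) for node, nbrs in g.items()}

def shortest_path(g, start, goal):
    """Breadth-first search; returns one shortest path or None."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nbr in sorted(g[node]):
            if nbr not in parent:
                parent[nbr] = node
                queue.append(nbr)
    return None

def common_neighbours(g, u, v):
    """Link-prediction score: number of friends two users share."""
    return len(g[u] & g[v])

print(degree_centrality(graph)["C"])
print(shortest_path(graph, "A", "E"))
print(common_neighbours(graph, "A", "D"))
```

C has the highest centrality in this graph, and A and D share one friend (C), which makes them a candidate pair for a suggestion.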
Q.4 What do you understand by Clustering in Social
Network Analysis? (RGPV Nov 2022)
Clustering is a major technique in social network analysis used to group users into
communities, where connections inside the group are strong and connections with
other groups are weak.
Definition:
Clustering divides the social network into groups of users who share common
interests, behavior, or frequent interactions.
Features of Clustering
Nodes within a cluster are closely connected.
Helps in identifying natural communities.
Reveals hidden patterns in society or organizations.
Helps target groups for advertisements.
Types of Clustering Techniques
Modularity-based clustering
K-means clustering (when converted to vector form)
Hierarchical clustering
Spectral clustering
Applications of Clustering
Finding friend groups on Facebook
Creating customer segments for marketing
Detecting fraud rings in banking
Finding terrorist networks
Understanding influence networks in Twitter/Instagram
Simple Diagram
Cluster 1: A—B—C
\|
D
Cluster 2: E—F—G
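One very simple way to obtain such clusters is to drop weak ties and take the connected groups that remain. Note that this threshold approach is a simplification chosen for illustration, not one of the techniques listed above; the edge weights (interaction counts) and the threshold are made up.

```python
# Hypothetical edges: (user, user, number of interactions)
weighted_edges = [
    ("A", "B", 9), ("B", "C", 8), ("B", "D", 7),
    ("E", "F", 9), ("F", "G", 8),
    ("C", "E", 1),  # weak tie between the two groups
]

def clusters(edges, min_weight):
    """Group users linked by ties of at least min_weight interactions."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, set())
        adj.setdefault(v, set())
        if w >= min_weight:          # ignore weak connections
            adj[u].add(v)
            adj[v].add(u)

    seen, result = set(), []
    for node in sorted(adj):
        if node in seen:
            continue
        stack, component = [node], set()
        while stack:                 # depth-first walk of one group
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(adj[current] - component)
        seen |= component
        result.append(sorted(component))
    return result

print(clusters(weighted_edges, min_weight=5))
```

With the weak C to E tie ignored, the users fall into two groups; lowering min_weight to 1 merges them into one.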
⭐ UNIT–V Completed
If you want, I can now:
✅ Combine Unit III + Unit IV + Unit V into one neat PDF
OR
✅ Expand any answer even more (2–3 pages each)
OR
✅ Provide diagrams for all answers in printed format
Just tell me what you need!