
Big Data PYQ

The document provides detailed exam-ready answers on Big Data topics, specifically focusing on Hive and NoSQL databases. It covers procedures for writing User Defined Functions in Hive, a word count program in Pig, the workings of Hive, and various data types in Hive and Pig. Additionally, it discusses the use of NoSQL databases in industry, their types, characteristics, advantages, and applications.

Uploaded by

Garvit Dani

MST answers Big Data

Absolutely — I will now give you slightly longer, exam-ready answers, written in
simple RGPV language, 5–8 marks length, paragraph + points + diagrams.
These will be longer than before, but still easy to remember and write.

I will start with UNIT-III complete expanded answers.


After that, I will expand UNIT-IV and UNIT-V as well (just tell me “continue Unit IV”).

⭐ UNIT–III (EXPANDED EXAM-READY ANSWERS)


(RGPV Dec/Nov 2024–2023)

Q.1 Explain the procedure to write User Defined Functions (UDFs) in Hive. (RGPV Dec 2024)
User Defined Functions (UDFs) in Hive allow developers to add custom processing
logic that is not available in Hive’s built-in functions. UDFs make Hive more flexible
and suitable for complex data transformations. Since Hive is based on Java and
MapReduce, UDFs are also written in Java.

Detailed Procedure

1. Create Java Class


Create a Java file and import Hive libraries.
Extend the class using:
org.apache.hadoop.hive.ql.exec.UDF .
This helps Hive recognize your class as a function.

2. Implement evaluate() Method


Hive executes only the evaluate() method of a UDF.
It receives one or more parameters and returns the result.
Example: convert a string to uppercase or perform complex calculations.

3. Compile the Java Code


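Compile the class using javac, with the Hive and Hadoop JARs on the classpath.
Bundle the compiled .class files into a JAR file using:
jar cf myudf.jar *.class
If the class is declared in a package, the JAR must preserve the package directory structure.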

4. Add JAR to Hive


Start the Hive shell and load the JAR:
ADD JAR /path/myudf.jar;

Hive now has access to the custom function.

5. Register the Function


A temporary function is created using:
CREATE TEMPORARY FUNCTION myfunc AS 'mypackage.MyUDF';

You may also create a permanent function, which is registered in the metastore, using CREATE FUNCTION ... USING JAR (Hive 0.13 and later).

6. Use Function in HiveQL Queries


UDF works like any built-in function:
SELECT myfunc(column_name) FROM table;

7. Test & Optimize


Check output type, performance, and exceptions.

Diagram – Hive UDF Workflow


Java Code → Compile → Create JAR → ADD JAR → CREATE FUNCTION → Run in Hive Query

Q.2 Write a word count program in Pig to count occurrences of similar words. (RGPV Dec 2024)
Word count is the most common Pig Latin example. Pig makes this easier because it
supports built-in functions like TOKENIZE() and COUNT() and relational operators like
FOREACH and GROUP.

Pig Latin Code (Word Count Script)

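pig

-- Load each line of the input file as a single chararray field
lines = LOAD 'input.txt' USING TextLoader() AS (line:chararray);
-- Split each line into words and flatten the resulting bag to one word per row
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Collect identical words into one group
grouped = GROUP words BY word;
-- Count the words in each group
wordcount = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;
-- Write the (word, total) pairs to the output directory
STORE wordcount INTO 'output';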

Explanation in Steps
1. LOAD loads each line from file.
2. TOKENIZE splits lines into individual words.
3. FLATTEN removes nested structure.
4. GROUP collects identical words together.
5. COUNT calculates frequency.
6. STORE writes output back to HDFS.

Diagram


Input Lines → Tokenize → Words → Group by Words → Count → Output
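
The same pipeline can be sketched in Python for readers who want to trace the stages by hand. This is only an illustration of the tokenize, group, and count steps, not how Pig executes the script. The function name word_count and the sample lines are made up for the example, and str.split() only approximates TOKENIZE, which also splits on a few punctuation characters.

```python
from collections import Counter


def word_count(lines):
    """Count how often each whitespace-separated word occurs."""
    # TOKENIZE + FLATTEN: split every line and merge the words into one stream
    words = (word for line in lines for word in line.split())
    # GROUP BY word + COUNT: Counter tallies identical words together
    return Counter(words)


# Hypothetical input standing in for the lines of input.txt
sample_lines = ["big data is big", "pig makes data easy"]
counts = word_count(sample_lines)

for word, total in sorted(counts.items()):
    print(word, total)
```

Each printed pair corresponds to one (word, total) record that the Pig script would store.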

Q.3 Explain the working of Hive with proper steps and diagram. (RGPV Dec 2024)
Hive is a data warehousing tool that allows SQL-like queries on top of Hadoop. It
converts SQL queries (HiveQL) into MapReduce, Tez, or Spark jobs. Hive is mainly
used for batch processing, aggregation, and data summarization.

Working Steps of Hive

1. Query Submission
User submits a HiveQL query using the CLI, Beeline, JDBC, or a Web UI.

2. Parsing
Hive compiler checks syntax errors.

Converts query into an Abstract Syntax Tree (AST).

3. Semantic Analysis
Meaning of query verified using metadata in Metastore.
Ensures tables, columns, and data types exist.

4. Logical Plan Generation


Hive creates a logical plan showing operations like SELECT, FILTER, JOIN.

5. Optimization
Optimizer performs:
predicate pushdown
column pruning
join reordering
Map-side aggregation

6. Physical Plan
Converts logical operators into tasks.
Tasks executed using MapReduce, Tez, or Spark.

7. Execution
The jobs run on the Hadoop cluster under YARN and read/write data in HDFS.

8. Result Return
Output stored back in HDFS or displayed on console.

Hive Architecture Diagram

+-------------------+
|   HiveQL Query    |
+---------+---------+
          |
    Query Compiler
          |
+---------+---------+
|  Query Optimizer  |
+---------+---------+
          |
   Execution Engine
          |
   Hadoop/YARN Jobs
          |
Output Returned to User

Q.4 What are the different Hive Data Types? Explain them. (RGPV Dec 2024)
Hive supports primitive, complex, and miscellaneous data types.

⭐ A. Primitive Types
1. Numeric Types
TINYINT, SMALLINT, INT, BIGINT: store integers.
FLOAT, DOUBLE: single- and double-precision floating-point values.
DECIMAL: exact numeric type with user-defined precision and scale.

2. String Types
STRING: unlimited length text.
VARCHAR(n): variable length up to n.
CHAR(n): fixed length.

3. Date/Time Types
DATE, TIMESTAMP, INTERVAL.

4. Boolean Type
BOOLEAN: true or false.

5. Binary Type
BINARY: raw bytes.

⭐ B. Complex Data Types


1. ARRAY<T>
Ordered list of elements.

Example: [1, 2, 3] .

2. MAP<K,V>
Key-value pairs.
Example: {"name": "Garvit"} .

3. STRUCT<…>
Group of fields with different data types.
Example: struct<name:string, age:int> .

4. UNIONTYPE
Stores one of many specified types.

Q.5 What is Hive Metastore? Which classes are used by Hive to read and write HDFS files? (RGPV Nov 2023)
Hive Metastore (HMS)
Hive Metastore is a central repository that stores metadata required by Hive for
query execution. Hive uses HMS to know how data is stored in HDFS.

Metadata Stored in HMS


Table names, columns, and data types
Partition information
File formats (ORC, Parquet, Text)
HDFS location of tables
SerDe (Serializer-Deserializer) classes
Storage information

Why Metastore is important?


Without HMS, Hive cannot execute queries.
Helps compiler verify schema during semantic analysis.
Stores metadata in a relational database such as Derby (the embedded default) or MySQL.

Classes Used by Hive to Read/Write HDFS Files


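Hive reads and writes table data through Hadoop InputFormat/OutputFormat classes, with SerDe classes converting between file records and rows. For example, TextInputFormat and HiveIgnoreKeyTextOutputFormat handle plain text files, and SequenceFileInputFormat and SequenceFileOutputFormat handle SequenceFiles.
Underneath, these classes rely on Hadoop’s FileSystem API:
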
1. FileSystem Class
Handles file operations like open, create, delete.

2. FSDataInputStream
Used to read files from HDFS.
Supports seek operations.

3. FSDataOutputStream
Used to write files to HDFS.

4. Path Class
Represents path of file in HDFS.

Q.6 Write example Hive Queries for Natural Join and Outer Join. (RGPV Nov 2023)
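Natural Join

HiveQL has no NATURAL JOIN keyword, so a natural join is written as an inner join on the common column. The query below assumes both tables have a column named dept_id.

sql

SELECT e.*, d.dept_name
FROM employee e
JOIN department d
ON e.dept_id = d.dept_id;

A natural join matches rows on the common column names and keeps one copy of each common column.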

Left Outer Join

sql

SELECT e.name, d.dept_name
FROM employee e
LEFT OUTER JOIN department d
ON e.dept_id = d.id;

Right Outer Join

sql

SELECT e.name, d.dept_name
FROM employee e
RIGHT OUTER JOIN department d
ON e.dept_id = d.id;

Full Outer Join

sql

SELECT *
FROM employee e
FULL OUTER JOIN department d
ON e.dept_id = d.id;
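
To see how the three outer joins differ in the rows they keep, the join semantics can be sketched in Python on a few in-memory rows. The sample employees and departments are invented for illustration, department ids are assumed to be unique, and this is not how Hive executes a join.

```python
# Hypothetical sample rows; None plays the role of SQL NULL
employees = [
    {"name": "Asha", "dept_id": 1},
    {"name": "Ravi", "dept_id": 2},
    {"name": "Meena", "dept_id": None},   # no department assigned
]
departments = [
    {"id": 1, "dept_name": "Sales"},
    {"id": 2, "dept_name": "HR"},
    {"id": 3, "dept_name": "Finance"},    # no employees
]


def left_outer_join(left, right):
    """Every employee appears; dept_name is None when nothing matches."""
    by_id = {d["id"]: d["dept_name"] for d in right}
    return [(e["name"], by_id.get(e["dept_id"])) for e in left]


def right_outer_join(left, right):
    """Every department appears; name is None when nothing matches."""
    rows = []
    for d in right:
        names = [e["name"] for e in left if e["dept_id"] == d["id"]]
        if names:
            rows.extend((name, d["dept_name"]) for name in names)
        else:
            rows.append((None, d["dept_name"]))
    return rows


def full_outer_join(left, right):
    """Matched rows plus the unmatched rows from both sides."""
    rows = left_outer_join(left, right)
    used_ids = {e["dept_id"] for e in left}
    rows += [(None, d["dept_name"]) for d in right if d["id"] not in used_ids]
    return rows


print(left_outer_join(employees, departments))
print(right_outer_join(employees, departments))
print(full_outer_join(employees, departments))
```

In this sample, Meena survives only in the left and full joins, and Finance survives only in the right and full joins.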

Q.7 List and explain Relational Operators in Pig. (RGPV Nov 2023)
Pig provides relational operators to process data step-by-step like SQL but more
flexible.

Major Relational Operators

1. LOAD
Loads data from external source.
Example:
data = LOAD 'file.txt' AS (id:int, name:chararray);

2. STORE
Saves results to file.

3. FILTER
Select rows based on condition.
Example: adults = FILTER data BY age > 20;

4. FOREACH…GENERATE
Performs transformations.
Similar to SELECT in SQL.

5. GROUP
Groups records by a key.

6. JOIN
Combines datasets based on keys.

7. ORDER BY
Sorts data.

8. DISTINCT
Removes duplicates.

9. LIMIT
Restricts number of rows.

10. UNION
Combines datasets.

11. SPLIT
Divides data into subsets.

Q.8 Discuss the various data types in Pig. (RGPV Nov 2022)
Pig supports two main categories: simple and complex types.

⭐ A. Simple/Atomic Types
int – 32-bit signed integer
long – 64-bit signed integer
float, double – Decimal numbers
chararray – Strings
bytearray – Raw bytes
boolean – True/false

⭐ B. Complex Types
1. TUPLE
Ordered set of fields (like row).
Example: (1, "Garvit", 78) .

2. BAG
Collection of tuples (unordered).
Example: {(1,A),(2,B),(3,C)} .

3. MAP
Key–value pairs.
Example: [name#Garvit, age#21] .

If you want, I will now provide UNIT-IV expanded answers next.


Just reply “Continue Unit IV”.

Great! Here are the expanded, exam-length, RGPV-oriented answers for UNIT–IV
(NoSQL).
These are written in simple language, with paragraph + points + examples, perfect
for 5–8 marks.

⭐ UNIT–IV (NO SQL DATABASES) – EXPANDED ANSWERS

Q.1 Write a short note on use of NoSQL database in industry. (RGPV Dec 2024)
NoSQL databases have become extremely important in modern industries because
traditional relational databases (SQL databases) cannot handle massive,
unstructured, and fast-changing data generated today. Industries like e-commerce,
social networking, IoT, healthcare, banking, entertainment, and real-time
applications rely heavily on NoSQL for scalability and speed.

Why industries use NoSQL (Detailed Points):

Handles Big Data: Industries generate terabytes of data per day (posts, logs,
clickstreams), which NoSQL can store and process efficiently.
Flexible Schema: Supports unstructured and semi-structured data like JSON,
XML, logs, images, videos.
High Scalability: Can add multiple servers easily (horizontal scaling).
Real-Time Processing: Fast reads and writes enable instant updates (chat apps,
notifications).
High Availability: Replication provides failover and reliability.
Cloud-Friendly: Works natively with distributed cloud systems.

Industrial Use Cases:


E-commerce (Amazon): Product catalog, user sessions, shopping carts
(DynamoDB, Cassandra).
Social Networks (Instagram, Facebook): User feeds, comments, reactions
(Cassandra, HBase).
Streaming Platforms (Netflix, YouTube): Recommendation metadata, video
catalogs.
IoT and Sensors: Time-series databases store live sensor data.
Gaming: User profiles, multiplayer game state stored instantly.

Thus, NoSQL supports huge-scale applications where speed and flexibility are crucial.

Q.2 Discuss various types of NoSQL databases with examples. (RGPV Dec 2024)
NoSQL databases are non-relational and fall into four main categories, each
designed for specific types of data processing.

1. Key–Value Stores
Data stored as simple key → value pairs.
Highly scalable and extremely fast.
Best suited for caching, session data, and real-time applications.
Examples: Redis, Riak, Amazon DynamoDB.

2. Document-Oriented Databases
Store data as JSON/BSON documents.
Allow nested structures and flexible schema.
Ideal for content management, user profiles, blogs.
Examples: MongoDB, CouchDB, Couchbase.

3. Column-Family Stores (Wide Column Stores)


Data stored in tables with rows and dynamic columns.
Excellent for large analytical workloads and distributed storage.
Used in big data systems.
Examples: Apache Cassandra, Apache HBase.

4. Graph Databases
Store data in nodes, edges, and properties.
Best for relationship-heavy data like social networks.
Supports graph-based algorithms like shortest path, PageRank.
Examples: Neo4j, OrientDB, Amazon Neptune.

Diagram: Types of NoSQL Databases

                        NoSQL
     -------------------------------------------
     |             |             |             |
 Key–Value     Document    Column-Family     Graph
   Redis        MongoDB        HBase         Neo4j
   Riak         CouchDB      Cassandra      OrientDB
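
As a rough illustration of how the four models shape the same kind of user data, here is a sketch using plain Python structures. It does not use any real database client; the keys, field names, sample values, and the helper followers_of are all invented for the example.

```python
import json

# Key-value store: the value is an opaque blob looked up by one key
key_value = {"user:101": json.dumps({"name": "Asha", "city": "Bhopal"})}

# Document store: a nested, self-describing document
document = {
    "_id": 101,
    "name": "Asha",
    "address": {"city": "Bhopal", "pin": "462001"},
    "interests": ["cricket", "music"],
}

# Column-family store: row key -> column family -> columns
# (different rows may carry different columns)
column_family = {
    "user101": {
        "profile": {"name": "Asha", "city": "Bhopal"},
        "activity": {"last_login": "2024-12-01"},
    },
    "user102": {"profile": {"name": "Ravi"}},
}

# Graph store: nodes plus labelled edges that hold the relationships
nodes = {101: {"name": "Asha"}, 102: {"name": "Ravi"}}
edges = [(101, "FOLLOWS", 102)]


def followers_of(user_id):
    """Graph-style query: who has a FOLLOWS edge pointing at user_id?"""
    return [src for src, label, dst in edges
            if label == "FOLLOWS" and dst == user_id]


print(json.loads(key_value["user:101"])["city"])
print(document["address"]["city"])
print(sorted(column_family["user102"]))
print(followers_of(102))
```

The key-value form must be decoded before any field can be read, while the document and column-family forms expose fields directly, and the graph form answers relationship questions from the edge list.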

Q.3 Discuss key characteristics and advantages of NoSQL database. (RGPV Dec 2024)
Key Characteristics

1. Schema-Free or Flexible Schema
No predefined column structure needed.
JSON-like documents allow easy updates.

2. Horizontal Scalability
Add more nodes instead of upgrading hardware.
Perfect for cloud environments.

3. High Performance
Can sustain very high read/write throughput when scaled out across nodes.
Uses caching, in-memory storage, and efficient indexing.

4. Distributed Architecture
Data replicated across multiple nodes for reliability.

5. Supports Unstructured Data


Accepts logs, multimedia, documents, and irregular data.

6. BASE Model
Basically Available, Soft state, Eventual consistency.
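
The BASE idea can be made concrete with a toy in-memory model. This is a deliberately simplified sketch, not how any particular NoSQL product replicates data; the class names, the explicit sync() step, and the sample key are assumptions made for illustration.

```python
class Replica:
    """One copy of the data."""

    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    """Toy store: a write lands on one replica and reaches the rest on sync()."""

    def __init__(self, replica_count=3):
        self.replicas = [Replica() for _ in range(replica_count)]
        self.pending = []   # writes not yet copied to every replica

    def write(self, key, value):
        # Basically available: accept the write once a single replica has it
        self.replicas[0].data[key] = value
        self.pending.append((key, value))

    def read(self, key, replica_index):
        # Soft state: replicas may give different answers for a while
        return self.replicas[replica_index].data.get(key)

    def sync(self):
        # Eventual consistency: replay pending writes on the other replicas
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica.data[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("cart:7", ["book"])
stale = store.read("cart:7", replica_index=2)   # replica 2 has not seen it yet
store.sync()
fresh = store.read("cart:7", replica_index=2)   # now every replica agrees
print(stale, fresh)
```

The read before sync() returns nothing from the lagging replica, which is the temporary inconsistency that BASE accepts in exchange for availability.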

Advantages

1. Handles Big Data Easily


High scalability makes NoSQL perfect for modern data volume.

2. Faster Development
Flexible schema reduces development effort.

3. Low Latency
Quick data retrieval essential for real-time apps.

4. High Availability
Automatic failover keeps the system available when a node fails.

5. Cost-Effective
Runs on commodity hardware, reducing server cost.

6. Ideal for Distributed Apps
Works with microservices, cloud-native apps.

Q.4 Write a short note on NoSQL databases. (RGPV Nov 2023)
NoSQL databases are non-relational databases designed to overcome the limitations
of traditional relational (SQL) systems. They can store and process large volumes of
structured, semi-structured, or unstructured data. NoSQL databases do not
require fixed tables or schemas, making them more flexible and suitable for dynamic
applications.

Important Features
schema-free
horizontally scalable
high performance
distributed storage
supports JSON-like documents
suitable for real-time systems

NoSQL is widely used in industries like social media, e-commerce, IoT, and cloud-
based applications.

Q.5 Differentiate between NoSQL and Relational Database. (RGPV Nov 2023)
Feature        | Relational Database (SQL) | NoSQL Database
---------------|---------------------------|------------------------------------
Data Model     | Tables (Rows & Columns)   | Key-Value, Document, Column, Graph
Schema         | Fixed, predefined         | Flexible schema
Scalability    | Vertical (scale-up)       | Horizontal (scale-out)
Data Type      | Structured                | Unstructured & semi-structured
Query Language | SQL                       | Varies (JSON, map APIs, queries)
Transactions   | ACID                      | BASE
Performance    | Slower for big data       | High speed for big data
Suitable For   | Banking, traditional apps | Big data, cloud apps, social media

Q.6 Describe applications of NoSQL databases in Industry. (RGPV Nov 2023)
NoSQL databases support a wide variety of industrial applications.

Applications

1. Social Networking
Store user feeds, likes, messages.
Example: Cassandra was originally developed at Facebook for inbox search.

2. E-commerce
Shopping cart, customer info, product catalog.
Example: Amazon DynamoDB.

3. Big Data Analytics


Log processing, clickstream analysis, A/B testing.

4. Real-Time Data Processing


Stock market data, live dashboards.

5. Gaming
Multiplayer game state, leaderboards, real-time stats.

6. IoT
Sensor data, telemetry, time-series databases.

7. Healthcare
Patient history, medical records, machine data.

8. Finance & Banking


Fraud detection, user behavior patterns.

Q.7 What is NoSQL database? Discuss key characteristics & advantages. (RGPV Nov 2023)
(This is a combination question → expanded below)

Definition
NoSQL stands for “Not Only SQL”. It includes a group of database technologies
designed for storing huge, unstructured, semi-structured, or rapidly changing data.

Characteristics
No fixed schema
High scalability
High performance
Distributed and fault tolerant
Supports variety of data types
BASE model instead of ACID

Advantages
Handles big data efficiently
Real-time query performance
Perfect for cloud-based and distributed apps
Faster development & schema changes
Cost-efficient

Q.8 How NoSQL is useful for Big Data problems? (RGPV Nov 2022)
NoSQL databases solve major challenges of big data: volume, variety, velocity, and
veracity.

Why NoSQL is suitable for Big Data

1. Scalability
Data can be spread across thousands of servers.

2. Handles Unstructured Data


Big data includes logs, images, social posts, documents.

3. High Speed
Simple access patterns, data partitioning, and in-memory stores such as Redis make data retrieval very fast.

4. Distributed Architecture
Replication ensures fault-tolerance and reliability.

5. Flexible Schema
New fields can be added anytime.

6. Real-Time Analytics
Supports real-time dashboards and monitoring.

Thus, NoSQL fits naturally into big data ecosystems.

Q.9 Explain the process of Sharding in MongoDB. (RGPV Nov 2022)
Sharding is a method in MongoDB used to split large datasets across multiple
servers to improve performance and storage capacity.

⭐ Steps in Sharding
1. Choose a Shard Key
A field in the document used to divide data.
Example: user_id , country , date .

2. Create Shards
Each shard is deployed as a replica set (a group of MongoDB servers holding the same data).
Each shard stores a subset of data.

3. Config Servers
Store metadata about shards and chunk locations.

4. Mongos Router
Acts as a query router.
Determines which shard holds the required data.

5. Chunking
MongoDB divides data into chunks (ranges).
Chunks automatically migrate across shards for balancing.

6. Automatic Balancing
Keeps shard load equal.
Moves chunks between shards if imbalance occurs.

Diagram: MongoDB Sharding

                CLIENT
                  |
                Mongos
     ---------------------------
     |            |            |
   Shard1       Shard2       Shard3
  (Data A)     (Data B)     (Data C)

            Config Servers
     (Store metadata & chunk info)
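
The routing idea behind range-based sharding can be sketched in a few lines of Python. This is a toy model and not MongoDB's implementation: the ShardRouter class, the split points, and the one-chunk-per-shard layout are assumptions for illustration, whereas a real cluster holds many chunks per shard and lets the balancer move them.

```python
import bisect


class ShardRouter:
    """Toy stand-in for mongos plus the chunk map kept by the config servers."""

    def __init__(self, split_points, shard_names):
        # Split points [1000, 2000] define three chunks:
        # below 1000, from 1000 up to 1999, and 2000 or above
        if len(shard_names) != len(split_points) + 1:
            raise ValueError("need exactly one shard per chunk in this sketch")
        self.split_points = split_points
        self.shard_names = shard_names

    def shard_for(self, shard_key_value):
        # bisect_right returns the index of the chunk range holding the key
        chunk_index = bisect.bisect_right(self.split_points, shard_key_value)
        return self.shard_names[chunk_index]


# Hypothetical cluster sharded on a numeric user_id
router = ShardRouter([1000, 2000], ["Shard1", "Shard2", "Shard3"])
for user_id in (42, 1000, 1999, 2000):
    print(user_id, router.shard_for(user_id))
```

A query that includes the shard key can be sent to a single shard this way, while a query without it would have to be broadcast to every shard.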

If you want, I will now continue with UNIT–V expanded answers.


Just reply: Continue Unit V

Sure! Here are the expanded, exam-oriented, RGPV-style answers for UNIT–V
(Social Network Analysis).

All answers are in simple language, 5–8 marks length, with diagrams, examples,
and points.

⭐ UNIT–V (SOCIAL NETWORK ANALYTICS) – EXPANDED ANSWERS

Q.1 Explain the following. (RGPV Dec 2024)


(i) Social Network as a Graph
A social network can be represented as a graph structure, where each person is
treated as a node (vertex) and the relation between people is shown as edges
(links). Social networks like Facebook, Twitter, LinkedIn naturally form large graphs.
Graph theory helps in understanding relationships, central users, groups,
influencers, etc.

Elements of Social Graph


Nodes: users/people
Edges: friendships, followers, interactions
Edge direction:
Undirected → Facebook friend
Directed → Twitter follow
Weight on edges: frequency of interaction

Graph Example Diagram

A ---- B ---- C
|             |
D ----------- E

Importance
Helps detect communities
Helps find influential users
Helps model information spread

Performs link prediction (who may connect next)
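
A social graph of this kind is commonly stored as an adjacency list. The sketch below is one minimal way to do it in Python; the SocialGraph class is invented for illustration, and the edge list assumes the vertical bars in the example diagram connect A to D and C to E.

```python
from collections import defaultdict


class SocialGraph:
    """Adjacency-list graph supporting directed or undirected weighted edges."""

    def __init__(self, directed=False):
        self.directed = directed
        # adjacency[u][v] = weight of the edge from u to v
        self.adjacency = defaultdict(dict)

    def add_edge(self, u, v, weight=1):
        self.adjacency[u][v] = weight
        if self.directed:
            self.adjacency.setdefault(v, {})   # register v as a node
        else:
            self.adjacency[v][u] = weight      # friendship goes both ways

    def neighbours(self, u):
        return sorted(self.adjacency.get(u, {}))


# Undirected friendship graph matching the example diagram
friends = SocialGraph()
for u, v in [("A", "B"), ("B", "C"), ("A", "D"), ("C", "E"), ("D", "E")]:
    friends.add_edge(u, v)

# Directed, weighted follow graph: A follows B with 5 interactions
follows = SocialGraph(directed=True)
follows.add_edge("A", "B", weight=5)

print(friends.neighbours("A"))
print(follows.neighbours("A"), follows.neighbours("B"))
```

In the undirected graph both endpoints list each other, while in the directed graph B does not list A because following is one-way.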

(ii) Social Network Mining


Social network mining is the process of extracting meaningful information from
social graphs. It includes studying how people connect, identifying important users,
detecting communities, and analyzing the spread of information.

Key Functions of Social Network Mining


Detects hidden patterns
Identifies communities (clusters)
Finds influencers (high centrality)
Predicts links and relationships
Performs sentiment and trend analysis
Helps in targeted marketing

Applications
Facebook friend suggestions
Twitter trending hashtags
Fake account detection
Political opinion mining
Customer segmentation

(iii) Recommender System


A recommender system suggests items, movies, products, songs, or videos to users
based on past behavior or similar users. It is a core technology behind YouTube,
Netflix, Amazon, Flipkart, etc.

Types of Recommendation Systems


1. Content-Based Filtering
Finds similarity between items.
Example: If a user watches romantic movies → recommend more romantic movies.
2. Collaborative Filtering
Finds similarity between users.

Example: “Users similar to you liked these items.”
3. Hybrid Model
Combines both for better accuracy.

Diagram


User History → Feature Extraction → Matching → Recommendations
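
As an illustration of collaborative filtering, here is a minimal user-based sketch in Python. The ratings, user names, and movie titles are invented, cosine similarity is one common choice among several, and production recommenders are considerably more elaborate.

```python
import math

# Hypothetical ratings on a 1 to 5 scale
ratings = {
    "asha": {"Movie A": 5, "Movie B": 4, "Movie C": 1},
    "ravi": {"Movie A": 4, "Movie B": 5, "Movie D": 5},
    "meena": {"Movie C": 5, "Movie E": 4},
}


def cosine_similarity(a, b):
    """Cosine similarity between two users' rating dictionaries."""
    common = set(a) & set(b)
    dot = sum(a[item] * b[item] for item in common)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(user, ratings, top_n=2):
    """Rank unseen items by the similarity-weighted ratings of other users."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        similarity = cosine_similarity(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + similarity * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


print(recommend("asha", ratings))
```

Because asha's ratings resemble ravi's far more than meena's, the item only ravi has rated is ranked first.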

Q.2 Explain the following: (RGPV Nov 2023)


(i) Applications of Social Network Mining
Social network mining helps industries, governments, and companies understand
online behavior, trends, and patterns.

Major Applications
Friend/connection recommendations
User segmentation for advertising
Trend analysis & hashtag monitoring (Twitter)
Fraud/Bot detection using unusual patterns
Community detection (groups with similar interests)
Opinion mining & sentiment analysis
Brand monitoring for companies
Influencer identification (celebrities, trendsetters)
Disease spread prediction using network models

(ii) Recommender System


A recommender system predicts what the user will like and shows personalized
suggestions.

Use Cases
Netflix movie recommendations
YouTube suggested videos
Flipkart/Amazon product recommendations

Spotify recommended songs

Benefits
Increases user engagement
Improves customer satisfaction
Boosts sales
Reduces search time for users

Q.3 Which mining method is most frequently used for Social Network Graph? Explain. (RGPV Nov 2022)
The most frequently used mining method for social network graphs is Graph Mining.

What is Graph Mining?


Graph mining applies graph algorithms to explore relationships between nodes in a
social network.

Tasks performed in graph mining:

1. Centrality Calculation
Finds important/influential nodes.
Types: degree centrality, closeness, betweenness, eigenvector.

2. Community Detection
Groups similar users into clusters.
Helps find interest-based communities.

3. Link Prediction
Predicts future relationships (Facebook “People You May Know”).

4. Influence Spread Analysis


Models how information spreads (virality).

5. Shortest Path Analysis


Finds minimum interaction path between two people.

Why used most?


Social networks naturally form graphs.

Graph algorithms work efficiently on relationship data.
Provides deep insight into user behavior.
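
Two of the tasks listed above, centrality calculation and shortest path analysis, can be sketched in Python on a small graph. The graph is invented for illustration, only degree centrality is shown, and the search assumes an unweighted, undirected network.

```python
from collections import deque

# Hypothetical undirected friendship graph as an adjacency list
graph = {
    "A": ["B", "D"],
    "B": ["A", "C"],
    "C": ["B", "E"],
    "D": ["A", "E"],
    "E": ["C", "D", "F"],
    "F": ["E"],
}


def degree_centrality(graph):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(graph)
    return {node: len(nbrs) / (n - 1) for node, nbrs in graph.items()}


def shortest_path(graph, source, target):
    """Breadth-first search; returns one shortest path, or None if unreachable."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:        # walk back through the parents
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in graph[node]:
            if neighbour not in parents:   # first visit fixes the parent
                parents[neighbour] = node
                queue.append(neighbour)
    return None


centrality = degree_centrality(graph)
print(max(centrality, key=centrality.get))
print(shortest_path(graph, "A", "F"))
```

The node with the highest degree centrality is the one with the most direct connections, and breadth-first search finds a shortest path because it explores the graph one hop at a time.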

Q.4 What do you understand by Clustering in Social Network Analysis? (RGPV Nov 2022)
Clustering is a major technique in social network analysis used to group users into
communities, where connections inside the group are strong and connections with
other groups are weak.

Definition:
Clustering divides the social network into groups of users who share common
interests, behavior, or frequent interactions.

Features of Clustering
Nodes within a cluster are closely connected.
Helps in identifying natural communities.
Reveals hidden patterns in society or organizations.
Helps target groups for advertisements.

Types of Clustering Techniques


Modularity-based clustering
K-means clustering (after nodes are converted to vector form)
Hierarchical clustering
Spectral clustering

Applications of Clustering
Finding friend groups on Facebook
Creating customer segments for marketing
Detecting fraud rings in banking
Finding terrorist networks
Understanding influence networks in Twitter/Instagram

Simple Diagram

Cluster 1: A—B—C
              \|
               D

Cluster 2: E—F—G
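
The simplest possible form of clustering treats every connected component as one community. The Python sketch below does exactly that; it is a stand-in for the modularity-based and spectral methods named above, not an implementation of them, and the edge list assumes D is linked to both B and C in the diagram.

```python
def connected_components(nodes, edges):
    """Group nodes so that every pair in a group is joined by some path."""
    adjacency = {node: set() for node in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen, clusters = set(), []
    for start in nodes:
        if start in seen:
            continue
        cluster, stack = set(), [start]
        while stack:                       # depth-first walk from start
            node = stack.pop()
            if node in cluster:
                continue
            cluster.add(node)
            stack.extend(adjacency[node] - cluster)
        seen |= cluster
        clusters.append(sorted(cluster))
    return clusters


nodes = ["A", "B", "C", "D", "E", "F", "G"]
edges = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "D"),   # cluster 1
         ("E", "F"), ("F", "G")]                            # cluster 2
clusters = connected_components(nodes, edges)
print(clusters)
```

Real social networks are usually one giant component, which is why practical community detection looks for densely connected groups rather than fully disconnected ones.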

⭐ UNIT–V Completed
If you want, I can now:
✅ Combine Unit III + Unit IV + Unit V into one neat PDF
OR
✅ Expand any answer even more (2–3 pages each)
OR
✅ Provide diagrams for all answers in printed format
Just tell me what you need!
