Big Data Analytics Practical Guide

The document is a practical file for a Big Data Analytics course, detailing various aspects of Big Data, Hadoop architecture, and related tools. It covers concepts such as the characteristics of Big Data, types of data, and the implementation of MapReduce, alongside practical exercises on Hadoop ecosystem tools and NoSQL databases. Additionally, it discusses document similarity measures and nearest neighbor search algorithms.
Swarrnim Startup & Innovation University

Swarrnim School of Computing & IT

Practical File

Course Name: MCA

Subject Code: 16110301

Student Name: Brahmbhatt Dhruv Y.

Enrollment No: 2414607025

Subject Name: Big Data Analytics

Academic Year: 2025-2026

Faculty Sign:
1) Practical 1:- Study of Big Data Concepts :

Introduction

Big Data refers to extremely large and complex datasets that cannot be
processed efficiently using traditional data processing techniques. With the
growth of social media, IoT, mobile devices, and digital services, massive
amounts of data are generated every second. Big Data technologies help in
storing, processing, and analyzing this data to extract valuable insights.

1. Characteristics of Big Data (5 V’s)

1. Volume

● Refers to the huge amount of data generated.


● Data size ranges from terabytes to petabytes.
● Example: Facebook generates terabytes of data daily.

2. Velocity

● Speed at which data is generated, processed, and analyzed.


● Real-time or near real-time processing is required.
● Example: Stock market data, live streaming data.

3. Variety

● Different types and formats of data.


● Includes:
o Structured (tables, databases)
o Semi-structured (XML, JSON)
o Unstructured (images, videos, text)
● Example: Emails, tweets, images, sensor data.

4. Veracity

● Refers to the quality, accuracy, and reliability of data.


● Data may be noisy, incomplete, or inconsistent.
● Example: Fake reviews or incorrect sensor data.

5. Value
● Ability to extract meaningful insights from data.
● Data is useful only if it provides business or analytical value.
● Example: Customer behavior analysis for marketing.

2. Types of Big Data

1. Structured Data

● Data organized in rows and columns.


● Stored in relational databases.
● Easy to search and analyze.
● Example: Student records, bank transactions.

2. Semi-Structured Data

● Data does not follow a strict table structure.


● Uses tags or markers.
● Example: XML files, JSON data, emails.

3. Unstructured Data

● No predefined format.
● Difficult to process using traditional tools.
● Example: Images, videos, audio files, social media posts.

3. Traditional Data vs Big Data

Feature     | Traditional Data | Big Data
Data Size   | Small to medium  | Very large
Data Type   | Structured       | Structured, Semi-structured & Unstructured
Storage     | RDBMS            | HDFS, NoSQL
Processing  | Batch processing | Real-time & Batch
Scalability | Limited          | Highly scalable
Tools       | SQL, Excel       | Hadoop, Spark
Cost        | High             | Cost-effective (commodity hardware)
2) Practical 2 :- Study of Hadoop Architecture :
Introduction

Apache Hadoop is an open-source framework used for storing and processing
Big Data in a distributed computing environment. It is designed to run on
commodity hardware and provides high scalability, fault tolerance, and
reliability. Hadoop is widely used for processing large volumes of
structured and unstructured data.

1. Hadoop Distributed File System (HDFS)

● Storage layer of Hadoop.


● Stores large files by dividing them into blocks.
● Provides fault tolerance through data replication.

2. YARN (Yet Another Resource Negotiator)

● Resource management layer.


● Manages cluster resources like CPU and memory.
● Schedules and executes applications.

3. MapReduce

● Data processing layer.


● Uses Map and Reduce functions to process large datasets in parallel.
● Suitable for batch processing.

4. Hadoop Common

● Contains shared libraries and utilities.


● Supports all Hadoop modules.

2. HDFS Architecture

1. NameNode (Master Node)

● Manages metadata (file name, size, block location).


● Does not store actual data.
● Maintains file system namespace.
2. DataNode (Slave Node)

● Stores actual data blocks.


● Performs read/write operations.
● Sends heartbeat signals to NameNode.

3. Secondary NameNode

● Performs checkpointing.
● Helps in recovery by merging metadata.
● Not a backup of NameNode.

3. Advantages of Hadoop

1. Scalable – Can add nodes easily.


2. Cost-effective – Uses commodity hardware.
3. Fault tolerant – Data replication ensures reliability.
4. High throughput – Suitable for batch processing.
5. Handles all data types – Structured and unstructured data.
6. Open source – No licensing cost.

4. Limitations of Hadoop

1. Not suitable for real-time processing.


2. Complex setup and maintenance.
3. High latency due to batch processing.
4. NameNode is a single point of failure (in older versions).
5. Requires skilled professionals.
6. Inefficient for small datasets.

3) Practical 3 :- Study of Hadoop Ecosystem Tools :

Introduction

The Hadoop ecosystem consists of various tools that work together to provide
storage, processing, analysis, coordination, and management of Big Data. These
tools enhance Hadoop’s capability by supporting data querying, machine
learning, workflow scheduling, and data transfer.
1. Apache Hive

Description

● Data warehouse tool built on Hadoop.


● Provides SQL-like query language (HiveQL).
● Converts queries into MapReduce jobs.

Use Cases

● Data analysis and reporting.


● Querying large datasets stored in HDFS.
● Used by analysts familiar with SQL.
● Example: Sales data analysis.

2. Apache Pig

Description

● High-level scripting platform.


● Uses Pig Latin language.
● Simplifies complex data processing tasks.

Use Cases

● ETL (Extract, Transform, Load) operations.


● Data cleansing and transformation.
● Rapid prototyping of data pipelines.
● Example: Log file processing.

3. Apache HBase

Description

● Distributed, column-oriented NoSQL database.


● Built on top of HDFS.
● Supports real-time read/write access.
Use Cases

● Storing large sparse datasets.


● Real-time applications like messaging systems.
● Time-series data storage.
● Example: Facebook message storage.

4. Apache Sqoop

Description

● Tool for transferring data between Hadoop and RDBMS.


● Supports import and export operations.

Use Cases

● Importing data from MySQL/Oracle to HDFS.


● Exporting processed data back to databases.
● Data migration and backup.
● Example: Import customer data into Hadoop.

5. Apache Oozie

Description

● Workflow scheduling system for Hadoop jobs.


● Manages and coordinates Hadoop tasks.

Use Cases

● Automating Hadoop workflows.


● Scheduling MapReduce, Hive, Pig jobs.
● Managing complex data pipelines.
● Example: Nightly batch processing.
6. Apache Mahout

Description

● Machine learning library for Hadoop.


● Provides scalable algorithms.

Use Cases

● Recommendation systems.
● Clustering and classification.
● Collaborative filtering.
● Example: Product recommendation engine.

7. Apache ZooKeeper

Description

● Distributed coordination service.


● Manages configuration and synchronization.

Use Cases

● Maintaining configuration information.


● Leader election in distributed systems.
● Ensuring high availability.
● Example: Coordination of HBase services.

4) Practical 4 :- Implementation of MapReduce – Word Count Program :

Introduction

MapReduce is a programming model used in Hadoop for processing large


datasets in a
distributed and parallel manner. It divides the task into two main phases:

● Map Phase – Processes input data and generates key-value pairs.
● Reduce Phase – Aggregates and summarizes the output from the
mapper.
The Word Count program is the most basic and commonly used example to
understand MapReduce.
Algorithm (Word Count)
1. Read input text file from HDFS.
2. Mapper reads each line and splits it into words.
3. Mapper emits (word, 1) for each word.
4. Reducer receives (word, [ 1,1,1… ] ).
5. Reducer sums values and outputs (word, total count).

Flow Diagram: Word Count using MapReduce

Input Text File
       |
    Mapper
  (word, 1)
       |
Shuffle & Sort
       |
   Reducer
(word, count)
       |
  Output File
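The algorithm and flow above can be simulated with a short Python sketch (illustrative only; on a real cluster the mapper and reducer run as separate Hadoop tasks, for example via Hadoop Streaming):

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit (word, 1) for every word in the line.
    for word in line.lower().split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle & sort phase: group all emitted values by key.
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped

def reducer(word, counts):
    # Reduce phase: sum the list of 1s for each word.
    return (word, sum(counts))

lines = ["big data needs big tools", "hadoop processes big data"]
pairs = [pair for line in lines for pair in mapper(line)]
result = dict(reducer(w, c) for w, c in shuffle(pairs).items())
print(result)
# {'big': 3, 'data': 2, 'needs': 1, 'tools': 1, 'hadoop': 1, 'processes': 1}
```

Each phase here mirrors one stage of the flow diagram: the mapper emits (word, 1) pairs, the shuffle groups them by word, and the reducer sums each group.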

5) Practical 5 :- Study of NoSQL Databases:

Introduction

NoSQL (Not Only SQL) databases are designed to handle large-scale,
unstructured, and semi-structured data. Unlike traditional relational databases,
NoSQL provides high scalability, flexibility, and performance for Big Data
applications.

1. Types of NoSQL Databases

1. Key-Value Store

● Stores data as key-value pairs.


● Extremely fast for simple lookups.
● Examples: Redis, Riak, DynamoDB.
● Use Case: Caching, session management.
Example:
Key: user123
Value: {"name":"John", "age":25}

2. Document Store

● Stores data in documents (JSON, XML, BSON).


● Each document is self-describing and flexible.
● Examples: MongoDB, CouchDB.
● Use Case: Content management, user profiles, blogging platforms.

Example:

{
"id": "101",
"name": "Alice",
"skills": [ "Python", "Hadoop"]
}

3. Column-Family Store

● Stores data in columns instead of rows.


● Optimized for analytical queries and big data operations.
● Examples: Apache HBase, Cassandra.
● Use Case: Time-series data, event logging.

Example:

Row Key: user1


Columns: name=John, age=30, city=Delhi

4. Graph Database

● Represents data as nodes (entities) and edges (relationships).


● Efficient for relationship-heavy data.
● Examples: Neo4j, ArangoDB.
● Use Case: Social networks, recommendation systems, fraud detection.

Example:

Node: Alice
Node: Bob
Edge: Alice -> Friend -> Bob
2. Comparison of SQL vs NoSQL

Feature        | SQL (Relational DB)       | NoSQL (Non-relational DB)
Data Model     | Tables (Rows & Columns)   | Key-Value, Document, Column, Graph
Schema         | Fixed schema              | Dynamic / Flexible schema
Query Language | SQL                       | Proprietary / API-based
Scalability    | Vertical (scale-up)       | Horizontal (scale-out)
Transactions   | ACID compliant            | BASE compliant (eventual consistency)
Best For       | Structured data           | Unstructured / Semi-structured data
Examples       | MySQL, Oracle, PostgreSQL | MongoDB, Cassandra, Redis, Neo4j

6) Practical 6 :- Document Similarity using Distance Measures:

Introduction

Document similarity measures are used in text mining and information retrieval
to quantify how similar two documents are.

● Jaccard Distance compares the overlap of words between two


documents.
● Cosine Similarity measures the angle between document vectors in a
multi-dimensional space.

1. Jaccard Distance

Definition

Jaccard similarity coefficient measures similarity between two sets:

J(A, B) = |A ∩ B| / |A ∪ B|

Jaccard Distance:

D_J(A, B) = 1 − J(A, B)


Algorithm

1. Tokenize both documents into sets of words.


2. Find intersection and union of the two sets.
3. Calculate Jaccard similarity and distance.

Example

● Doc1 = {Big, Data, Hadoop, Spark}


● Doc2 = {Big, Data, NoSQL, Hive}

|A ∩ B| = 2, |A ∪ B| = 6

Jaccard Similarity = 2 / 6 = 0.33
Jaccard Distance = 1 − 0.33 = 0.67
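A short Python check of this calculation (the function and variable names are illustrative):

```python
def jaccard(doc1, doc2):
    # Jaccard similarity = |A ∩ B| / |A ∪ B|; distance = 1 − similarity.
    a, b = set(doc1), set(doc2)
    similarity = len(a & b) / len(a | b)
    return similarity, 1 - similarity

doc1 = {"Big", "Data", "Hadoop", "Spark"}
doc2 = {"Big", "Data", "NoSQL", "Hive"}
sim, dist = jaccard(doc1, doc2)
print(round(sim, 2), round(dist, 2))  # 0.33 0.67
```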

2. Cosine Similarity

Definition

Cosine similarity measures the cosine of the angle between two document
vectors:

Cosine Similarity = cos(θ) = (A · B) / (||A|| × ||B||)

● Values range from 0 (no similarity) to 1 (identical).

Algorithm

1. Represent each document as a vector of term frequencies.


2. Compute dot product of the vectors.
3. Divide by the product of their magnitudes.

Example

● Vocabulary: (Big, Data, Hadoop, Spark, NoSQL, Hive)
● Doc1 vector = [1, 1, 1, 1, 0, 0]
● Doc2 vector = [1, 1, 0, 0, 1, 1]

Dot Product = 1 + 1 + 0 + 0 + 0 + 0 = 2
||A|| = √(1 + 1 + 1 + 1 + 0 + 0) = 2
||B|| = √(1 + 1 + 0 + 0 + 1 + 1) = 2
Cosine Similarity = 2 / (2 × 2) = 0.5
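As a cross-check, cosine similarity can be computed in Python over the full six-word vocabulary of the two documents (Big, Data, Hadoop, Spark, NoSQL, Hive):

```python
import math

def cosine_similarity(a, b):
    # cos(θ) = (A · B) / (||A|| × ||B||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Term-frequency vectors over (Big, Data, Hadoop, Spark, NoSQL, Hive)
doc1 = [1, 1, 1, 1, 0, 0]
doc2 = [1, 1, 0, 0, 1, 1]
print(cosine_similarity(doc1, doc2))  # 0.5
```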

3. Application in Document Comparison

● Plagiarism Detection: Identify copied or similar text.


● Search Engines: Retrieve documents most similar to a query.
● Recommender Systems: Suggest articles or papers based on similarity.
● Text Clustering: Group similar documents together.

7) Practical 7 :- Nearest Neighbor Search :

Introduction

The Nearest Neighbor (NN) algorithm is a distance-based method used to find
the closest data point(s) to a given query point.
It is widely used in pattern recognition, recommendation systems, and
classification tasks.

1. Algorithm: Basic Nearest Neighbor

Steps

1. Prepare a dataset of points (users, items, or feature vectors).


2. Choose a query point for which you want to find the nearest neighbor(s).
3. Compute the distance between the query point and all points in the
dataset.
4. Select the point(s) with the smallest distance.

Distance Formula (Euclidean Distance)

For two points P = (p1, p2, ..., pn) and Q = (q1, q2, ..., qn):

d(P, Q) = √( (p1 − q1)² + (p2 − q2)² + ... + (pn − qn)² )
2. Example

Dataset (2D points)

Point | x | y
A     | 2 | 3
B     | 5 | 4
C     | 1 | 2

Query Point

● Q = (3, 3)

Distances

● d(Q, A) = √((3−2)² + (3−3)²) = √1 = 1
● d(Q, B) = √((3−5)² + (3−4)²) = √5 ≈ 2.236
● d(Q, C) = √((3−1)² + (3−2)²) = √5 ≈ 2.236

Nearest Neighbor: A (distance = 1)
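The same search can be sketched in Python (a brute-force scan, matching the basic algorithm above; the point labels are taken from the example dataset):

```python
import math

def nearest_neighbor(query, points):
    # Return the label of the point with the smallest Euclidean
    # distance to the query point (brute-force scan of all points).
    def dist(p):
        return math.sqrt(sum((qi - pi) ** 2 for qi, pi in zip(query, p)))
    return min(points, key=lambda label: dist(points[label]))

points = {"A": (2, 3), "B": (5, 4), "C": (1, 2)}
print(nearest_neighbor((3, 3), points))  # A
```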

3. Applications in Recommendation Systems

1. User-based Collaborative Filtering
a. Recommends items based on similar users.
b. Find nearest neighbors (users) with similar preferences.
2. Item-based Collaborative Filtering
a. Recommends items similar to the item a user likes.
b. Uses nearest neighbor search on item feature vectors.
3. Content Recommendation
a. Suggest movies, products, or articles based on similarity with previous items.
4. Real-time Applications
a. Music recommendations (Spotify), product suggestions (Amazon), social media friend suggestions.

4. Advantages
● Simple and intuitive.
● Works well with small datasets.
● Can be used for both classification and recommendation.

5. Limitations

● Inefficient for very large datasets (requires computing distance to all


points).
● Sensitive to irrelevant features (feature scaling needed).
● Performance decreases in high-dimensional spaces (curse of
dimensionality).

8) Practical 8 : - Stream Data Analysis :

Introduction

A data stream is a continuous flow of data generated over time, often too
large to store entirely in memory.
Stream data analysis is used in real-time applications like network monitoring,
sensor data processing, financial transactions, and social media analytics.

1. Data Stream Model

Key Concepts

1. Stream Elements: Individual data points in the stream (e.g., sensor


readings, tweets).
2. Stream Processing: Analyze elements as they arrive.
3. Sliding Window: Focus on the most recent subset of the stream.
4. Single-pass algorithms: Cannot store the entire stream, so
approximate methods are used.

2. Counting Distinct Elements

Problem

Given a data stream, count the number of distinct elements efficiently without
storing all elements.

Example Stream: [A, B, A, C, B, D]
Distinct Elements: A, B, C, D → Count = 4
Naive Approach

● Store all elements in a set and count its size.


● Works for small streams, but memory intensive for large streams.

Optimized Approach

● Use Hashing / Probabilistic Algorithms like HyperLogLog or Flajolet-Martin
for large streams.
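Both approaches can be sketched in Python. The single-hash Flajolet-Martin sketch below is illustrative only: one hash function gives a very rough estimate, and production systems such as HyperLogLog average many hashes to reduce the error.

```python
import hashlib

def trailing_zeros(n):
    # Count trailing zero bits in n (convention: 0 maps to 0 here).
    if n == 0:
        return 0
    count = 0
    while n & 1 == 0:
        n >>= 1
        count += 1
    return count

def fm_estimate(stream):
    # Flajolet-Martin sketch: track the maximum number of trailing
    # zero bits R over all element hashes; estimate distinct count as 2^R.
    max_r = 0
    for item in stream:
        h = int(hashlib.md5(item.encode()).hexdigest(), 16)
        max_r = max(max_r, trailing_zeros(h))
    return 2 ** max_r

stream = ["A", "B", "A", "C", "B", "D"]
print(len(set(stream)))     # naive exact answer: 4
print(fm_estimate(stream))  # rough probabilistic estimate (a power of 2)
```

The naive set-based count is exact but stores every distinct element; the sketch uses constant memory regardless of stream length.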

3. Applications of Counting Distinct Elements in Data Streams

1. Network Traffic Monitoring: Count unique IP addresses.
2. Social Media Analytics: Count unique hashtags or users in real-time.
3. Recommendation Systems: Count unique items clicked by users.
4. Fraud Detection: Detect unusual activity by counting distinct transactions.

4. Advantages of Stream Data Analysis

● Enables real- time analytics.


● Works with large-scale, continuous data.
● Reduces memory usage by approximate algorithms.

5. Limitations

● Cannot store all historical data.


● Accuracy depends on approximation algorithms.
● Processing high-velocity streams can be challenging.

9) Practical 9 :- PageRank Algorithm :

Introduction

PageRank is an algorithm used by Google Search to rank web pages based on


their importance.
It measures the influence of a web page based on the number and quality
of links pointing to it.
Key Idea:

● A page is important if many important pages link to it.


● Each page distributes its rank evenly among its outgoing links.

PageRank is widely used in:

● Search engine ranking


● Social network analysis
● Citation analysis

1. PageRank Formula

The PageRank of page P is calculated as:

PR(P) = (1 − d)/N + d × Σ [ PR(i) / L(i) ],  summed over all pages i in M(P)

Where:

● PR(P) = PageRank of page P
● d = damping factor (usually 0.85)
● N = total number of pages
● M(P) = set of pages linking to P
● L(i) = number of outbound links from page i

2. Example: Simple Web Graph

Graph Structure

● Pages: A, B, C, D
● Links:
o A → B, C
o B → C
o C → A
o D → C

Step 1: Initialize PageRank

● Total pages = 4
● Initial PR for each page = 1 / 4 = 0.25
Step 2: Iterative Calculation

Using damping factor d = 0.85:

Iteration 1:

PR(A) = (1 − 0.85)/4 + 0.85 × (PR(C)/1) = 0.0375 + 0.85 × 0.25 = 0.25
PR(B) = 0.0375 + 0.85 × (PR(A)/2) = 0.0375 + 0.85 × 0.125 = 0.14375
PR(C) = 0.0375 + 0.85 × (PR(A)/2 + PR(B)/1 + PR(D)/1)
      = 0.0375 + 0.85 × (0.125 + 0.25 + 0.25) = 0.0375 + 0.85 × 0.625 = 0.56875
PR(D) = 0.0375 + 0.85 × 0 = 0.0375

Iteration 2:

● Repeat until PageRank values converge.
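The full iteration can be sketched in Python for the four-page graph (using the (1 − d)/N form of the update, as in the hand calculation; 50 iterations is more than enough for this small graph to converge):

```python
def pagerank(links, d=0.85, iterations=50):
    # links maps each page to the list of pages it links to.
    pages = list(links)
    n = len(pages)
    pr = {p: 1 / n for p in pages}  # initialize every page to 1/N
    for _ in range(iterations):
        new = {}
        for p in pages:
            # Sum the rank passed to p by every page i that links to it,
            # each contributing PR(i) divided by its outbound link count.
            incoming = sum(pr[i] / len(links[i])
                           for i in pages if p in links[i])
            new[p] = (1 - d) / n + d * incoming
        pr = new
    return pr

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
print(sorted(ranks, key=ranks.get, reverse=True))  # ['C', 'A', 'B', 'D']
```

C ends up with the highest rank, as expected: it receives links from A, B, and D, while D receives none and keeps only the (1 − d)/N baseline.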

3. Applications of PageRank

1. Search Engines: Rank web pages by importance.


2. Social Networks: Identify influential users.
3. Recommender Systems: Suggest items or content based on link analysis.
4. Scientific Citations: Rank research papers by citation importance.

10) Practical 10 :- Frequent Itemset Mining :

Introduction

Frequent Itemset Mining is a technique in data mining to find commonly


occurring sets of items in transaction databases.
Market Basket Analysis helps retailers understand customer buying patterns.

Applications:

● Product recommendations
● Store layout optimization
● Cross-selling strategies

1. Simple Dataset

Transaction ID | Items Purchased
1              | Bread, Milk
2              | Bread, Diaper, Beer, Eggs
3              | Milk, Diaper, Beer, Cola
4              | Bread, Milk, Diaper, Beer
5              | Bread, Milk, Diaper, Cola

2. Algorithm: Apriori (Simplified)

1. Identify all items in transactions.


2. Generate candidate itemsets (1-item, 2-item, 3-item...).
3. Count support (frequency) of each itemset.
4. Keep only itemsets whose support ≥ minimum support threshold.
5. Repeat for larger itemsets until no frequent itemsets are found.

3. Manual Calculation (Example)

Step 1: Count 1-itemsets (support ≥ 3)

Item   | Count
Bread  | 4
Milk   | 4
Diaper | 4
Beer   | 3
Cola   | 2
Eggs   | 1

Frequent 1-itemsets: Bread, Milk, Diaper, Beer

Step 2: Count 2-itemsets

Itemset       | Count
Bread, Milk   | 3
Bread, Diaper | 3
Bread, Beer   | 2
Milk, Diaper  | 3
Milk, Beer    | 2
Diaper, Beer  | 3

Frequent 2-itemsets: Bread+Milk, Bread+Diaper, Milk+Diaper, Diaper+Beer


Step 3: Count 3-itemsets

Itemset             | Count
Bread, Milk, Diaper | 2
Milk, Diaper, Beer  | 2
Bread, Diaper, Beer | 2

Frequent 3-itemsets: None (threshold = 3)
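The manual counting above can be reproduced with a short brute-force miner (a sketch: it counts every candidate k-itemset directly, whereas real Apriori additionally prunes candidates using the frequent (k−1)-itemsets):

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    # Count the support of every candidate k-itemset and keep those
    # meeting the minimum support; stop when a level yields nothing
    # (frequent itemsets are downward closed, so this is safe).
    items = sorted({i for t in transactions for i in t})
    result = {}
    k = 1
    while True:
        frequent = {}
        for cand in combinations(items, k):
            support = sum(1 for t in transactions if set(cand) <= t)
            if support >= min_support:
                frequent[cand] = support
        if not frequent:
            break
        result.update(frequent)
        k += 1
    return result

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Cola"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Cola"},
]
for itemset, support in sorted(frequent_itemsets(transactions, 3).items()):
    print(itemset, support)
```

With a support threshold of 3, the output contains the four frequent 1-itemsets and four frequent 2-itemsets, and no 3-itemsets, matching the hand calculation.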

4. Applications of Frequent Itemset Mining

1. Product Recommendations: Suggest products frequently bought together.
2. Store Layout Planning: Place items together to increase sales.
3. Promotions: Bundle frequently purchased items.
4. Inventory Management: Predict high-demand item combinations.
