Unit 1

The document provides an overview of Big Data and its analytics, discussing its characteristics, types of data (structured, semi-structured, and unstructured), and the evolution of Big Data technologies. It highlights the challenges organizations face in managing Big Data, the differences between traditional business intelligence and Big Data, and the role of Hadoop and data warehousing. Additionally, it emphasizes the importance of real-time data processing and the various tools used for Big Data analytics.


Big Data Analytics

UNIT-I: INTRODUCTION TO BIG DATA AND ANALYTICS

Types of Digital Data, Classification of Digital Data

Introduction to Big Data:

Characteristics – Evolution of Big Data – Definition of Big Data - Challenges with Big Data - Other Characteristics of
Data - Why Big Data - Traditional Business Intelligence versus Big Data - Data Warehouse and Hadoop Environment

Big Data Analytics: Classification of Analytics – What Big Data Analytics Isn't – Sudden Hype around Big Data
Analytics – Greatest Challenges that Prevent Businesses from Capitalizing on Business Data – Top Challenges
Facing Big Data – Why Big Data Analytics Is Important – Data Science – Terminology Used around Big Data.
Data Classification
• Data: raw, unprocessed facts.
• Information: the processed, meaningful form of data.

• Data classification means organizing information into different categories so it can be
easily found and used. This process helps with security, compliance, and achieving
business or personal goals. It also ensures that data is available when needed.

• Example:
Imagine a library where books are grouped into sections like fiction, science, history,
and technology. This makes it easier to find a book than searching through every
shelf. Similarly, data classification helps in retrieving important information quickly.


Types of Classification
Structured Data with examples
• Structured data is created using a fixed schema and is maintained in a
tabular format.
• The elements in structured data are addressable, which makes analysis
effective.
• It includes all data that can be stored in an SQL database in tabular form.
• Because the schema is fixed, structured data is the simplest type of data
to manage and process.

Scenario:
Consider an example of relational data: you have to maintain a record of
students for a university, such as each student's name, ID, address, and
email. To store the student records, the following relational schema and
table are used.
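The scenario above can be sketched in Python with the built-in sqlite3 module (the table and column names here are illustrative assumptions, not a prescribed schema):

```python
import sqlite3

# In-memory database holding the student records from the scenario
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE student (
        student_id INTEGER PRIMARY KEY,  -- fixed schema: every row has these columns
        name       TEXT NOT NULL,
        address    TEXT,
        email      TEXT
    )
""")
conn.execute(
    "INSERT INTO student VALUES (?, ?, ?, ?)",
    (1, "Asha Rao", "12 MG Road, Pune", "asha@univ.edu"),
)

# Every element is addressable by column name, which is what makes
# structured data easy to query and analyze.
row = conn.execute(
    "SELECT name, email FROM student WHERE student_id = 1"
).fetchone()
print(row)  # ('Asha Rao', 'asha@univ.edu')
```

Because every row conforms to the same schema, standard SQL queries can filter, join, and aggregate the data directly.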
Semi-Structured Data with examples
• Semi-structured data is information that does not reside in a relational database but
has some organizational properties that make it easier to analyze.

• With some processing it can be stored in a relational database, though this is hard for
some semi-structured data; the format exists precisely to ease such storage.

• Semi-structured data is partially organized: it has some structure, but not the rigid
structure of a database, so some parts of the data are easy to interpret while others are not.

Example 1: Email
• Structured part: sender, receiver, subject, date, time
• Unstructured part: email body (free text), attachments (PDFs, images, etc.)

Example 2: Social media posts (📱 Facebook, Twitter, Instagram)
• Structured part: username, post ID, date, likes, comment count
• Unstructured part: post content (text, emojis, images, videos)
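The social media example above can be sketched as a JSON record (the field names and values are illustrative assumptions): structured metadata sits beside an unstructured free-text body.

```python
import json

# A social media post: structured fields plus an unstructured body
post = json.loads("""
{
    "username": "sample_user",
    "post_id": 101,
    "date": "2024-01-15",
    "likes": 42,
    "comments_count": 3,
    "content": "Loved the sunset today! 🌅 #nofilter"
}
""")

# The structured part is directly addressable by key...
print(post["username"], post["likes"])
# ...while "content" is free text that needs further analysis
# (text mining, sentiment analysis, etc.) to be useful.
print(post["content"])
```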
Unstructured Data with examples
• Unstructured data is data that does not follow a pre-defined standard; that is, it does
not follow any organized format.

• This kind of data does not fit a relational database, because a relational database
expects data in a predefined, organized form.

• Unstructured data is very important in the Big Data domain, and there are many
platforms for managing and storing it, such as NoSQL databases.

• Examples: audio and video files, medical reports, customer support chats,
satellite and drone images, etc.
Features of Classification:
The main goal of organizing data is to arrange it in such a form that it becomes
readily available to users. Its basic features are as follows:

• Homogeneity – The data items in a particular group should be similar to each
other.

• Clarity – There must be no confusion in positioning any data item in a particular
group.

• Stability – The classification must be stable, i.e., an investigation should not
require changing the classification itself.

• Elasticity – One should be able to change the basis of classification as the purpose of
classification changes.
What is Big Data?
• The term "big data" refers to the vast amounts of structured,
semi-structured, and unstructured data that organizations and
businesses collect from various sources such as social media,
sensors, mobile devices, transaction records, and more.

• This data is typically characterized by its volume, velocity,
variety, veracity, and value (the 5 V's), which makes it
challenging to manage and analyze using traditional data
processing methods.
5 V's or Characteristics of Big Data

1️⃣ Volume – The amount of data being generated, stored, and processed.
• Key aspects: large-scale data from multiple sources (documents, media, IoT devices,
etc.); need for modern storage and processing tools (Big Data, Cloud, Hadoop).
• Interesting facts: 🌍 2.7 zettabytes of data exist in the digital world. 🏪 Walmart
handles 1M+ transactions/hour, storing 2.5 petabytes of data.

2️⃣ Value – The importance of extracting meaningful insights from data.
• Key aspects: raw data alone is not useful; insights matter. Requires Big Data
analytics for decision-making. Cost-vs-benefit analysis ensures ROI (return on investment).
• Interesting facts: 📊 Facebook processes 30+ petabytes of user data. 📈 McKinsey
reports that retailers using Big Data effectively can increase operating margin by 60%.

3️⃣ Velocity – The speed at which data is generated, collected, and analyzed.
• Key aspects: data flows continuously from various sources (social media, IoT,
transactions). Real-time processing is crucial for timely business decisions. Fast
data access improves decision-making.
• Interesting facts: 🚀 Data growth is unprecedented and torrential. ⏳ Real-time
processing leads to better business results than delayed analysis.

4️⃣ Variety – The different types and sources of data in Big Data.
• Key aspects: structured data (databases, Excel); semi-structured data (JSON, XML,
emails); unstructured data (videos, images, social media).
• Interesting facts: 🔄 Businesses must handle diverse formats to gain better insights.

5️⃣ Veracity – The quality, accuracy, and trustworthiness of data.
• Key aspects: ensuring data consistency and reliability; handling incomplete, noisy,
or misleading data; using data cleansing.
• Interesting facts: 📉 Poor data quality costs businesses millions annually.
🔍 AI & machine learning help improve data accuracy.
Summary of 5 V's
• Volume refers to the massive amount of data being generated.
• Value focuses on deriving meaningful insights rather than just storing data.
• Velocity highlights the need for real-time data processing.
• Variety emphasizes the diverse formats of data (structured, semi-structured,
unstructured).
• Veracity ensures data is accurate, reliable, and free of errors.
Evolution of Big Data

1. Emergence of Databases (1960s) – Structured Data Storage
• Example: IBM's System R and Oracle's RDBMS laid the foundation for
structured data storage.
• Use Case: Banks started using relational databases for customer account
management; airline reservation systems stored ticket booking records
efficiently.
2. Data Warehousing (1980s) – Centralized Data Analysis
Example: Data warehouses consolidated data from multiple
sources for better decision-making.
Use Case: Walmart used data warehousing to analyze customer
purchases and optimize inventory.
Healthcare organizations stored patient records for analysis and
diagnosis.
Contd.
3. Internet & Web 1.0 (1990s) – Unstructured Data Growth
• Example: The World Wide Web generated a surge in digital data.
• Use Case: Google and Yahoo indexed and categorized web
pages for better search results.
• E-commerce platforms like Amazon started collecting customer
browsing data.
4. Digital Transformation & Enterprise Systems (2000s) –
Business Process Automation
• Example: ERP and CRM systems automated business
operations.
• Use Case: Salesforce helped companies track customer
interactions to improve services.
• SAP enabled businesses to integrate finance, HR, and supply
chain management data.
Contd.
5. Hadoop & Distributed Computing (2005) – Processing Big Data
• Example: Hadoop's distributed computing model revolutionized
large-scale data processing.
• Use Case: Facebook used Hadoop to analyze user interactions for
targeted ads; telecom companies processed billions of call records
to detect network issues.
6. The Rise of NoSQL (Late 2000s) – Handling Unstructured Data
• Example: NoSQL databases like MongoDB and Cassandra became popular.
• Use Case: Twitter used NoSQL to handle millions of tweets per
second; LinkedIn used NoSQL to store and retrieve massive user
profile data.
Contd.
7. Advanced Analytics & Machine Learning (2010s) –
Predictive Insights
Example: AI-powered analytics helped businesses extract insights
from data.
Use Case: Netflix used machine learning to recommend shows
based on viewing history.
Banks used AI models to detect fraudulent transactions in real time.
8. Cloud Computing (Late 2010s) – Scalable Data Storage &
Processing
Example: AWS, Google Cloud, and Microsoft Azure provided cloud-
based data solutions.
Use Case: Spotify stored user music preferences on the cloud for
personalized playlists.
Uber processed ride requests and traffic data on cloud servers.
Contd.
9. Internet of Things (IoT) (2020s) – Real-Time Data Streaming
• Example: IoT devices generated massive real-time data for analysis.
• Use Case: Tesla's self-driving cars collected sensor data for
improved navigation; smart home assistants like Alexa processed
voice commands and provided instant responses.
10. Current Trends – AI, Ethics, and Real-Time Big Data
• Example: AI-driven analytics, real-time processing, and ethical
concerns over data privacy.
• Use Case: AI models analyze real-time patient data to predict
potential health issues before they occur.
Challenges in Big Data
1. Incomplete Understanding of Big Data
• Organizations may not understand the importance of data.
• Difficulties in integrating algorithms for data analysis.
• Example: If employees don't realize the importance of
backing up sensitive data, they might fail to store it
correctly. This could result in losing critical information
when it's needed.
2. Exponential Data Growth
• Data grows exponentially over time.
• Much of the data is unstructured (e.g., videos,
documents).
• Example: Companies may struggle to manage vast
amounts of data from social media, emails, and videos,
leading to storage and analysis difficulties.
3. Security of Data:
• While companies focus on collecting and analyzing data, they
sometimes overlook securing it, leaving them vulnerable to
data breaches.
• Solution: Hiring cybersecurity experts to ensure the data is
protected from unauthorized access.
• Example: If sensitive customer data isn't secured, hackers
could steal it, leading to significant financial and reputational
damage.
4. Data Integration:
• Data comes in many forms (structured like phone numbers or
unstructured like videos), making it difficult to integrate for
analysis.
• Solution: Using technologies like IBM Infosphere or Microsoft
SQL to integrate data seamlessly.
• Example: Merging data from different sources (sales reports,
customer records, etc.) into one consistent view for analysis.
5. Confusion in Tool Selection
• Companies may struggle to select the right tools (e.g.,
HBase, Cassandra, Hadoop, Spark).
• Poor tool choices can waste time, money, and resources.
• Example: Choosing the wrong technology can hinder
data analysis.
6. Lack of Data Professionals
• Skilled professionals (data scientists, analysts, engineers)
are essential for Big Data success.
• Without the right expertise, technologies may not be
used effectively.
• Example: Lack of professionals may lead to poor data
management.
Traditional Business Intelligence (BI) vs Big Data:

Data Sources
• Traditional BI: Primarily structured data from internal sources (e.g.,
databases, spreadsheets, ERP systems).
• Big Data: Structured, semi-structured, and unstructured data from internal
and external sources (e.g., social media, sensors, multimedia content).

Data Volume and Velocity
• Traditional BI: Deals with smaller datasets (gigabytes to terabytes); data
is processed in batches, focusing on historical analysis.
• Big Data: Handles massive datasets (terabytes to petabytes or more); data
is processed in real time or near real time, enabling quicker insights.

Processing Methods
• Traditional BI: Uses structured query language (SQL) for data analysis;
primarily relies on relational databases and data warehouses.
• Big Data: Uses distributed computing tools like Apache Hadoop and Spark;
supports both batch and real-time processing for more complex analysis.

Scalability
• Traditional BI: Operates on fixed infrastructure and can struggle with
rapidly growing data.
• Big Data: Offers horizontal scalability (add/remove resources as needed);
can handle large and growing datasets across diverse sources.

Decision-Making Speed
• Traditional BI: Slower decision-making based on historical data (periodic
reports).
• Big Data: Faster decision-making with real-time insights and data.
Data Warehouse vs Hadoop Environment

Two different Concepts and Technologies used in


managing and processing large amounts of data
1. Data Warehouse: Stores structured, historical
data for business intelligence.
2. Hadoop: Distributed framework for storing and
processing large datasets.
Contd.

What?
• Data Warehouse: Central repository for structured, historical data;
supports Business Intelligence (BI) and reporting.
• Hadoop: Open-source framework for distributed storage and processing
of Big Data; uses commodity hardware for cost-effective scaling.

Key Features
• Data Warehouse:
  - Structure: uses RDBMS, organized in tables (fact and dimension tables).
  - ETL Process: Extract, Transform, Load (ETL) for data population.
  - Historical Data: stores long-term data for trend analysis.
  - BI & Reporting: supports ad hoc queries, fast reporting, and analysis.
• Hadoop:
  - Distributed Storage (HDFS): data stored across multiple nodes in a cluster.
  - Distributed Processing: MapReduce framework for parallel data processing.
  - Scalability: scales horizontally with more nodes.
  - Flexibility: handles structured, semi-structured, and unstructured data.

Tools
• Data Warehouse: BI tools (e.g., Tableau, Power BI).
• Hadoop: Ecosystem tools like Apache Spark, Hive, Pig.
Contd.

Feature      | Data Warehouse                  | Hadoop
Data Type    | Structured data                 | Structured, unstructured, semi-structured
Storage      | RDBMS, predefined schema        | Hadoop Distributed File System (HDFS)
Processing   | ETL for data population         | MapReduce for parallel processing
Data Use     | Historical data, BI, reporting  | Big Data, real-time, batch processing
Scalability  | Limited scalability             | Scales horizontally
Data Warehouse Process Flow
Hadoop
Hadoop Architecture:
Hadoop is an open-source framework for distributed storage &
processing of Big Data across clusters of commodity hardware.
Provides a scalable, cost-effective, and fault-tolerant
solution for handling large datasets.
Distributed Storage (HDFS)
Uses the Hadoop Distributed File System (HDFS) to store data
across multiple nodes.
Key Features:
• Data Splitting: large files are divided into blocks and distributed.
• Fault Tolerance: replication of data ensures availability.
Example: A 300MB file split into 3 blocks (100MB each) stored
across different nodes.
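The splitting step above can be sketched as follows (a toy illustration: the 100MB block size comes from the example, not from HDFS, whose actual default block size is larger):

```python
def split_into_blocks(file_size_mb, block_size_mb=100):
    """Return the sizes (in MB) of the blocks a file is split into."""
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        # Each block is at most block_size_mb; the last one may be smaller
        blocks.append(min(block_size_mb, remaining))
        remaining -= block_size_mb
    return blocks

# The 300MB file from the example becomes 3 blocks of 100MB each.
# HDFS would then replicate each block across different nodes so the
# file survives a node failure (fault tolerance).
print(split_into_blocks(300))  # [100, 100, 100]
```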
Distributed Processing (MapReduce)
MapReduce processes data in parallel across multiple nodes.
How it works:
• Map step: splits the data into smaller chunks and processes them
independently.
• Reduce step: aggregates results from different nodes.
Example: Counting word frequency in a large dataset (e.g., log files).
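The word-frequency example above can be sketched in Python as a toy, single-machine analogue of the Map and Reduce steps (not actual Hadoop code; in a real cluster each chunk would be processed on a different node):

```python
from collections import Counter
from functools import reduce

def map_count(chunk):
    """Map step: count words in one chunk, independently of other chunks."""
    return Counter(chunk.split())

def reduce_counts(a, b):
    """Reduce step: merge two partial counts (Counter addition sums tallies)."""
    return a + b

# Two chunks of a log file, as if stored on two different nodes
chunks = [
    "error warning error",
    "info error warning",
]

partials = [map_count(c) for c in chunks]   # runs in parallel on a real cluster
totals = reduce(reduce_counts, partials)    # aggregate the partial results
print(totals["error"])  # 3
```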
Scalability
Scales horizontally by adding more nodes.
Handles petabytes of data efficiently.
Example: A social media platform storing user posts and
logs.
Flexibility & Data Variety
Supports structured, semi-structured, and unstructured data.
Handles formats like text, logs, images, videos, JSON, XML.
Example: Analyzing user comments (text) and images (unstructured
data) on an e-commerce site.
Hadoop Ecosystem & Data Processing Frameworks
Apache Spark – Faster in-memory processing.
Apache Hive – SQL-like querying for Big Data.
Apache Pig – High-level scripting for processing.

Example: Using Hive for analyzing customer purchase trends.


Batch vs. Real-time Processing
Batch Processing: Hadoop MapReduce for large-scale
data analysis.
Real-time Processing: Apache Spark for live analytics.
Example:
Batch: Monthly sales report generation.
Real-time: Fraud detection in credit card transactions.
Big Data Analytics
Big Data Analytics is the process of analyzing large
amounts of data to find useful patterns and trends
that help businesses make better decisions.
Example:
• Imagine an online shopping website like Amazon:
• It collects data on what people buy, search for, and
review.
• Using Big Data Analytics, it can:
• Suggest products you might like.
• Offer discounts on popular items.
• Identify trends, like more people buying jackets in winter.
Data analytics helps in understanding and making decisions based on
data. There are four types:
1. Descriptive Analytics (What happened?)
Looks at past data to summarize trends.
Example: A school checks students' past test scores to see the
average marks in Math.
2. Diagnostic Analytics (Why did it happen?)
Finds reasons behind a trend or event.
Example: If many students failed in Math, the school finds out why –
maybe the exam was too hard or students didn't study well.
3. Predictive Analytics (What might happen in the future?)
Uses past data to predict future trends.
Example: If a student always scores above 90%, the teacher predicts
they will do well in the next exam.
4. Prescriptive Analytics (What should we do next?)
Gives solutions and recommendations; suggests actions using AI and
machine learning.
Example: If students are failing in Math, the school suggests extra
classes to improve results.
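The first two types can be sketched with the school example (the scores below are made-up illustrative data):

```python
from statistics import mean

# Descriptive analytics: summarize what happened
math_scores = [45, 62, 38, 91, 55, 40]   # past test scores (illustrative)
average = mean(math_scores)
print(f"Average Math score: {average:.1f}")

# Diagnostic-style drill-down: why is the average low?
failed = [s for s in math_scores if s < 50]
print(f"{len(failed)} of {len(math_scores)} students scored below 50")
```

Predictive and prescriptive analytics build on this: a predictive model would forecast the next exam's scores from this history, and a prescriptive system would recommend interventions such as extra classes.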
Big Data analytics challenges
1. Data Quality – Poor-quality data (incomplete, inconsistent, or
inaccurate) leads to incorrect analysis.
Example: A hospital's patient records have missing entries, causing
incorrect diagnosis trends.
2. Data Integration – Combining data from different sources with
varied formats is complex.
Example: A bank merges customer data from online banking, mobile
apps, and branch systems.
3. Data Privacy & Security – Protecting sensitive
information while following regulations is essential.
Example: Companies must secure customer payment
details to comply with GDPR.
4. Skill Gap & Talent Acquisition – Shortage of skilled
data professionals hinders analytics projects.
Example: A company struggles to find a data scientist to
optimize its sales strategy.
5. Technology Selection & Integration – Choosing the
right tools and integrating them with existing systems is
challenging.
Example: An e-commerce company integrates AI-based
analytics tools with its old database.
6. Scalability & Performance – Handling increasing
data volumes efficiently requires robust infrastructure.
Example: A social media platform processes millions of
daily posts and must scale accordingly.
7. Cost Management – High costs of hardware,
software, and maintenance can strain budgets.
Example: A startup struggles to afford cloud-based data
analytics solutions.
Why is Big Data analytics important?

Improved Decision Making
• Meaning: Helps businesses find useful patterns and make data-
driven decisions.
• Example: E-commerce websites like Amazon analyze
customer purchases to suggest products and improve sales.
Enhanced Operational Efficiency
• Meaning: Identifies inefficiencies and helps optimize processes.
• Example: Ride-sharing apps like Uber analyze traffic
patterns to reduce wait times and fuel costs.
Personalized Customer Experience
• Meaning: Understands customer preferences to offer
customized services.
• Example: Netflix and YouTube recommend movies and
videos based on user watch history.
Why is Big Data analytics important?

Fraud Detection & Security
• Meaning: Detects suspicious activities to prevent fraud.
• Example: Banks like SBI and ICICI use AI to detect unusual
transactions and prevent fraud.
Product Development & Innovation
• Meaning: Helps companies design better products based on
customer feedback.
• Example: Smartphone brands like Apple & Samsung
analyze user feedback to improve features in new models.
Predictive Analytics
• Meaning: Uses past data to predict future trends and
demands.
• Example: Weather forecasting apps predict storms using
past climate data and real-time satellite images.
Scalability & Agility
• Meaning: Enables businesses to quickly process large
datasets and adapt to changes.
• Example: Google Search processes millions of queries
per second and provides instant results.
Conclusion:
• Big Data Analytics helps organizations make smarter
decisions, improve efficiency, detect fraud,
personalize services, innovate new products, and
predict future trends—making it an essential
technology in today’s world.
Top analytical tools
1.Python
• What it does: A programming language used for data analysis, machine
learning, and automation.
• Example: Netflix uses Python to analyze what users watch and
recommend movies based on their preferences.
2. R
• What it does: A statistical programming language used for data
visualization and analysis.
• Example: Healthcare companies use R to analyze patient data and
predict disease trends.
3. Apache Spark
• What it does: A fast, distributed computing system for processing big
data in real-time.
• Example: Banking systems use Spark to detect fraudulent
transactions instantly.
Top analytical tools
4. Apache Hadoop
• What it does: A framework used to store and process massive amounts
of data.
• Example: Social media platforms like Facebook use Hadoop to
store billions of user posts and images.
5. Cassandra
• What it does: A NoSQL database for handling large-scale data with
minimal downtime.
• Example: E-commerce websites like Amazon use Cassandra to store
product details and customer reviews efficiently.
6. MongoDB
• What it does: A NoSQL database that stores data in flexible formats
like JSON.
• Example: Gaming companies use MongoDB to store player profiles,
game progress, and in-game purchases.
7. Tableau
• What it does: A powerful data visualization tool for
creating interactive charts and dashboards.
• Example: Sales teams use Tableau to track customer
buying trends and improve marketing strategies.
8. Microsoft Power BI
• What it does: A business intelligence tool that helps
companies analyze and visualize data.
• Example: Retail stores use Power BI to monitor daily
sales and predict future stock requirements.
9. SAS (Statistical Analysis System)
• What it does: An analytics tool used for data
management and advanced reporting.
• Example: Insurance companies use SAS to predict
risks and determine policy pricing.
10. QlikView/Qlik Sense
• What it does: Business intelligence tools that
help users analyze data visually.
• Example: Hospital management uses
QlikView to track patient records and improve
healthcare services.
11. MATLAB
• What it does: A computing tool used for
mathematical modeling, simulations, and AI
development.
• Example: Engineers use MATLAB to simulate
car crash tests before manufacturing.
Important Questions
1. What is Big Data? Explain the characteristics of Big Data.
2. What is Big Data analytics? Explain the five V's of Big Data.
Briefly discuss applications of Big Data.
3. What are the benefits of Big Data? Discuss the challenges of
Big Data.
4. Explain the difference between structured and unstructured data.
5. What is Big Data analytics, and what are the different types of
Big Data analytics?
6. Differentiate between traditional BI and Big Data.
