
Ch1

Introduction to Data Analytics


Data Analytics: An Overview
● Definition: Data Analytics is the practice of extracting actionable insights from raw
data using statistical techniques, algorithms, and various tools.
● Core Processes:
○ Collection: Gathering data from diverse sources such as databases, web
platforms, sensors, or transactions.
○ Cleaning & Organization: Removing errors, sorting, and formatting data to
ensure reliability.
○ Analysis: Applying statistical models, machine learning, and exploratory
methods to find patterns, trends, and relationships.
○ Visualization: Presenting results via dashboards, charts, or reports for easier
understanding and sharing.
● Applications:
○ Business intelligence, market research, operations optimization, healthcare,
finance, retail, and entertainment.
● Roles & Careers:
○ Data Analyst, Data Scientist, Business Intelligence Analyst, Market Analyst,
and Operations Research Analyst—all integral for modern organizations.
● Job Outlook:
○ High demand, with projected growth rates much faster than
average—reflecting the shift toward data-driven decision-making in every
industry.
● Example:
○ A retailer collects customer purchase data and analyzes it to optimize
inventory, adjust marketing, and improve sales.

Importance of Data Analytics
● Informed Decision Making:
○ Enables leaders to base strategies and choices on facts and trends rather
than intuition, reducing risk and increasing accuracy.
● Operational Efficiency and Cost Savings:
○ Identifies bottlenecks and streamlines processes, leading to better resource
allocation, automation, and reduced operational costs.
○ Example: Manufacturers schedule predictive maintenance to minimize
downtime.
● Enhanced Customer Experience:
○ Personalizes services and products by analyzing purchasing behavior and
feedback, driving loyalty and satisfaction.
○ Example: E-commerce sites recommend products based on browsing and buying history.
● Competitive Advantage:
○ Harnesses deep insights about market trends, allowing organizations to adapt
quickly and stand out from competitors.

● Innovation and Growth:
○ Reveals customer pain points and market gaps, enabling new products and
creative business models.
○ Example: Companies launch innovative digital platforms by understanding user
habits via analytics.
● Risk Management:
○ Detects fraud, forecasts potential disruptions, and supports compliance by
monitoring relevant data streams.
● Revenue Optimization:
○ Leverages data to refine pricing strategies and target profitable customer
segments.
● Data-Driven Culture:
○ Empowers all levels of an organization to use insights for continual
improvement and learning.
Types of Data Analytics
Descriptive Analysis
● Definition: Descriptive analytics summarizes historical data to reveal trends,
patterns, and insights, answering "What happened?"
● Uses dashboards, reports, and descriptive statistics; provides a snapshot of
performance over time.
● Key Features:
○ Summarizes Data: Converts raw data into meaningful reports and dashboards.
○ Historical Perspective: Focuses only on past events and outcomes.
○ Easy to Understand: Uses charts, tables, and graphs for quick insights.
○ Trend Identification: Highlights sales patterns, seasonal demand, growth/decline.
○ Business Reporting: Used in monthly sales reports, performance reviews,
customer statistics.
○ Foundation for Further Analytics: Acts as a base for diagnostic, predictive, and
prescriptive analysis.
● Examples:
○ Retail: Quarterly sales reports highlighting regional trends.
○ Media: Netflix tracking which shows users watch most to display trending titles
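To make this concrete, here is a minimal sketch of descriptive analytics in code, assuming a hypothetical sales table and the pandas library:

import pandas as pd

# Hypothetical quarterly sales data (assumed for illustration only)
sales = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "revenue": [120_000, 95_000, 134_000, 101_000],
})

# Summary statistics answer "What happened?"
print(sales["revenue"].describe())               # count, mean, std, min, quartiles, max
print(sales.groupby("region")["revenue"].sum())  # regional totals for trend reporting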
Diagnostic Analysis
● Definition: Diagnostic analytics investigates the reasons for past outcomes, answering
"Why did it happen?"
● Explores relationships and root causes using techniques like regression, correlation,
and cohort analysis; supports hypotheses and identifies anomalies.
● Key Features:
○ Explains "Why it Happened": Goes beyond describing data to find the root cause
of outcomes.
○ Drill-Down Approach: Breaks data into smaller parts to identify problem areas.
○ Correlation and Causation: Finds relationships between different factors
influencing results.
○ Root Cause Identification: Helps uncover reasons behind successes or failures.
○ Comparative Analysis: Compares data across time periods, regions, or groups.
○ Problem-Solving Oriented: Provides clarity on issues so corrective actions can be
taken.
○ Data Mining & Statistical Tools: Uses methods like correlation, regression, and
clustering.

● Examples:
○ E-commerce: Analyzing why web traffic spiked after a marketing campaign
by comparing before-and-after data and testing variables like ad placement.
○ Logistics: Investigating delivery delays by examining route history and driver
performance data.

Predictive Analysis
● Definition: Predictive analytics utilizes statistical models and Machine Learning to
forecast future events, answering "What might happen next?".
○ Employs regression, decision trees, and neural networks.
○ Anticipates outcomes to guide proactive actions.
● Key Features:
○ Answers "What Will Happen?": Uses past and present data to forecast future
outcomes.
○ Data-Driven Forecasting: Predicts demand, sales, risks, or customer behavior.
○ Statistical & Machine Learning Models: Uses regression, classification, and AI
algorithms for predictions.
○ Risk Assessment: Identifies potential threats or failures in advance.
○ Customer Behavior Prediction: Anticipates churn, buying patterns, or
preferences.
○ Supports Planning & Strategy: Helps businesses prepare for future opportunities
or challenges.
○ Example-Oriented Insights: E.g., predicting which customers are likely to stop
using a service.
● Examples:
○ Sales Forecasting: Using past sales data and seasonal patterns to predict next
quarter’s revenue.
○ Predictive Maintenance: Predicting vehicle maintenance timing
Prescriptive Analysis
● Definition: Prescriptive analytics recommends possible actions based on data,
answering "What should we do?"
● Suggests optimal decisions using simulations, optimization, and advanced
algorithms; evaluates multiple scenarios and provides actionable steps.
● Key Features:
○ Answers "What Should Be Done?": Goes beyond predicting outcomes to suggest
the best actions.
○ Decision-Making Focused: Provides recommendations for solving problems or
improving performance.
○ Optimization Techniques: Suggests the most efficient use of resources (time,
money, manpower).
○ Simulation & Scenario Testing: Evaluates different "what-if" scenarios to choose
the best option.
○ Action-Oriented Insights: Offers clear strategies, not just raw predictions.
○ Supports Automation: Can power self-driving systems, dynamic pricing, and
supply chain management.
○ Competitive Edge: Helps organizations act quickly and effectively in changing
environments.
● Examples:
○ Healthcare: Recommending personalized treatment plans to improve patient
outcomes and reduce readmissions.
○ Airlines: Automatically adjusting ticket prices based on demand forecasts, oil
prices, and weather using algorithms.
○ Supply Chain: Optimizing delivery routes each day in response to real-time
traffic and weather data.
Visual Analytics
● Definition: Visual analytics transforms large datasets into interactive visuals that
speed up insight generation and communication.
● Integrates data visualization with analytics for deep exploration and storytelling.
● Uses dashboards, scatter plots, heat maps, and other visual tools.
● Key Features:
○ Combines Data + Visualization: Merges advanced analytics with interactive visuals
for easy understanding.
○ Answers "How to See and Understand Data?": Helps decision-makers quickly
grasp complex information.
○ Interactive Dashboards: Users can filter, drill down, and explore data visually.
○ Pattern & Trend Detection: Makes hidden insights visible through charts, graphs,
and maps.
○ Supports Real-Time Analysis: Live dashboards show updated information instantly.
○ Improves Communication: Visual reports are easier to share and explain to
non-technical users.
○ Tools & Platforms: Uses Tableau, Power BI, QlikView, and similar visualization
tools.
● Examples:
○ Business Dashboards: Interactive KPI dashboards that allow managers to drill
down into sales, finance, and operations metrics in real time
○ Marketing: Heat maps to visualize website visitor hotspots and click paths,
highlighting areas for optimization.
○ Scientific Research: Scatter plots to reveal correlations and outliers in
experimental data.
Type of Analytics | Key Question | Purpose | Techniques Used | Example
Descriptive Analytics | What happened? | Summarizes past data to understand trends & patterns | Reports, dashboards, summary statistics | A company reviews last year's sales report
Diagnostic Analytics | Why did it happen? | Finds reasons or root causes behind outcomes | Drill-down, data mining, correlation analysis | A hospital checks why patient readmissions increased
Predictive Analytics | What will happen? | Uses historical data to forecast future outcomes | Regression, machine learning, time-series forecasting | An e-commerce site predicts which customers may stop shopping
Prescriptive Analytics | What should be done? | Suggests best actions to optimize results | Optimization, simulations, scenario analysis | An airline recommends best ticket prices & routes
Life Cycle of Data Analytics
● Definition: The Data Analytics Life Cycle is a structured, often iterative process
that guides data projects from business objective definition to actionable insights
and implementation.
● Key Phases:
○ Discovery/Problem Definition: Identify business objectives and questions;
determine data needs.
○ Data Collection/Access: Gather necessary data from multiple sources
(databases, APIs, sensors, etc.).
○ Data Cleaning & Preparation: Remove errors, handle missing values, and
format data for analysis.
○ Exploratory Data Analysis: Visualize, summarize, and find trends or
anomalies in the data.
○ Model Planning & Building: Choose suitable models and analytical methods;
train and validate models.
○ Result Visualization & Communication: Present insights using charts,
dashboards, or reports for stakeholders.
○ Implementation & Monitoring: Deploy the solution, track results over time, and
refine as needed.
Quality and Quantity of Data
● Data Quantity: Refers to the amount of data collected—the more, the better for
robust models, as long as the data remains relevant and representative.
○ Example: Machine learning systems improve when trained on thousands rather
than hundreds of examples.
● Data Quality: Refers to how accurate, complete, consistent, and relevant data is for
its intended analysis.
○ Dimensions of Data Quality:
■ Accuracy (data is correct and matches reality)
■ Completeness (no missing or incomplete fields)
■ Consistency (no contradictions across datasets)
■ Timeliness (data is up-to-date and available when needed)
■ Reliability (trusted sources and methods)
○ Example: Duplicate customer records decrease quality
● Balancing Act:
○ High quantity is only useful if matched by high quality; poor quality in a large dataset
leads to flawed outcomes.
○ In data analytics, a smaller but cleaner dataset sometimes provides better results
than a massive, noisy one.
Quantitative Insights:
•Focus: Measurable, numerical data, such as sales figures, website traffic, or survey results
with scaled ratings.
•Nature: Objective, precise, and often used for statistical analysis.
•Purpose: To identify trends, patterns, and make predictions based on quantifiable data.
• Example: Tracking daily active users (DAU) to understand user engagement.

Qualitative Insights:
• Focus: Non-numerical data, including text, audio, video, and open-ended survey responses.
• Nature: Subjective, descriptive, and focused on understanding user experiences and
motivations.
• Purpose: To provide context, explore the "why" behind user behavior, and uncover deeper
insights.
• Example: Analyzing customer feedback from interviews or social media comments.
Combining Qualitative and Quantitative Insights:
Benefits:
•Provides a more holistic view of the subject by combining the strengths of both data types.
Examples:
•Using quantitative data to identify a drop in website traffic and qualitative data from user
feedback to understand the reasons behind the drop.
•Analyzing survey results with quantitative data (like satisfaction scores) and qualitative data
(open-ended comments) to get a complete picture.
•Using quantitative data to identify customer segments and qualitative data to understand their
specific needs and preferences.
•By combining both qualitative and quantitative insights, businesses can make more informed
decisions, develop effective strategies, and gain a deeper understanding of their customers.
Measurement in Data Analytics
● Definition: Measurement in analytics refers to the ways variables are quantified,
categorized, or ordered, impacting the types of analysis possible.
● Levels (Scales) of Measurement:
○ Nominal: Categories without order (e.g., gender, country).
○ Ordinal: Ordered categories, but without precise quantifiable differences (e.g.,
customer satisfaction: poor, fair, good).
○ Interval: Ordered and evenly spaced values, but zero does not mean absence
(e.g., temperature in Celsius).
○ Ratio: Ordered, evenly spaced values with an absolute zero (e.g., age, income,
weight).
● Why It Matters:
○ The level determines which statistical methods and visuals are valid (e.g., the mean
can be calculated for interval/ratio data but not for nominal/ordinal data).
○ Improper measurement can lead to invalid conclusions and poor
decision-making.
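As a minimal sketch (with assumed example values) of which summaries are valid at each level, using Python's statistics module:

import statistics

nominal = ["India", "US", "India", "UK"]    # categories, no order → only the mode is valid
ordinal = ["poor", "fair", "good", "good"]  # ordered, uneven gaps → median/mode make sense
ratio = [20, 21, 22, 22, 23, 24, 25]        # absolute zero → the mean is meaningful

print(statistics.mode(nominal))                     # India (most frequent category)
rank = {"poor": 1, "fair": 2, "good": 3}            # map ordinal labels to ranks
print(statistics.median(rank[x] for x in ordinal))  # 2.5 (between fair and good)
print(statistics.mean(ratio))                       # 22.428... (valid for ratio data)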
Measures of Central Tendency and Dispersion
What is Central Tendency?
● Central tendency describes the tendency of data to cluster around a single typical or
central value.
● It summarizes a dataset with a single representative value.
● Common measures include:
○ Mean
○ Median
○ Mode
Mean
● Also called the arithmetic average.
● Calculated by summing all values and dividing by the number of observations.
● Sensitive to extreme values (outliers), which can skew the mean.
● Example: For temperatures (22, 23, 21, 25, 22, 24, 20):
Mean = (22 + 23 + 21 + 25 + 22 + 24 + 20) / 7 = 157 / 7 ≈ 22.43

Median
● Middle value when data is sorted in ascending or descending order.
● For even number of values, median is the average of the two middle values.
● Less affected by outliers compared to the mean.

● Example: Sorted temperatures: 20, 21, 22, 22, 23, 24, 25 → Median = 22

Mode
● The value that occurs most frequently in the dataset.
● Useful for categorical data.
● Can be unimodal, bimodal, or multimodal.

● Example: Temperatures: 20, 21, 22, 22, 23, 24, 25 → Mode = 22 (occurs twice)
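A quick check of all three measures on the temperature data above, using Python's statistics module:

import statistics

temps = [22, 23, 21, 25, 22, 24, 20]
print(statistics.mean(temps))    # 22.428... ≈ 22.43
print(statistics.median(temps))  # 22
print(statistics.mode(temps))    # 22 (occurs twice)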
What is Dispersion?
● Dispersion measures the spread or variability of data points around the central
value.
● Provides insight into how much data values differ from each other.
● Common measures of dispersion include:
○ Range
○ Variance
○ Standard Deviation
○ Interquartile Range (IQR)

Range
● The simplest measure of dispersion.
● Calculated as the difference between the maximum and minimum values.
● Range=Maximum Value−Minimum Value
● Example: If data = {5, 8, 12, 20}, then Range = 20 − 5 = 15
Variance (Population)
● Variance is a number that tells us how spread out the values in a data set are from
the mean.
● Formula: σ² = Σ(xᵢ − μ)² / N, where
○ σ² is the population variance,
○ μ is the mean (average) of all data values,
○ N is the total number of data values.
● Steps to calculate:
i. Find the mean.
ii. Subtract the mean from each data point.
iii. Square each result.
iv. Calculate the average of these squared differences.
● Example: Find the population variance of the data [5, 7, 9, 10, 14, 15].
Step 1: Mean = (5 + 7 + 9 + 10 + 14 + 15) / 6 = 60 / 6 = 10
Step 2: Squared deviations: (5 − 10)² = 25, (7 − 10)² = 9, (9 − 10)² = 1,
(10 − 10)² = 0, (14 − 10)² = 16, (15 − 10)² = 25
Step 3: Sum of squared deviations = 25 + 9 + 1 + 0 + 16 + 25 = 76
Step 4: Variance = 76 / 6 ≈ 12.67
● If the variance is small, most numbers are close to the mean; if it is large, the numbers
are spread out more widely.
● A higher variance indicates greater variability (the data is more spread out), while a
lower variance suggests the data points are closer to the mean.
Standard Deviation
● A statistical measure that describes how much variation or dispersion there is in a set
of data points; it helps us understand how spread out the values in a dataset are
compared to the mean (average).
● The square root of the variance: σ = √σ².
● Expressed in the same units as the data.
● Describes how much the data varies from the mean on average.
● Example: Standard deviation = √12.67 ≈ 3.56
A small SD → data values are close to the mean (less variation), making future values
easier to predict.
Example: Students in a class scored between 68–72 when the mean = 70; all students
performed almost equally.
A large SD → data values are spread out widely from the mean (more variation),
indicating inconsistency and fluctuations and making future values harder to predict.
Example: Students scored between 40–95 when the mean = 70; some students did very
well while others struggled.
Problem Statement:
Let there be 5 objects with heights of 1 m, 2 m, 3 m, 4 m, and 5 m.
Calculate the standard deviation.
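A minimal worked sketch, treating the five heights as a complete population: mean = 3, squared deviations sum to 4 + 1 + 0 + 1 + 4 = 10, variance = 10/5 = 2, so SD = √2 ≈ 1.41. In Python:

import statistics

heights = [1, 2, 3, 4, 5]                 # metres
mean = statistics.mean(heights)           # (1+2+3+4+5)/5 = 3
variance = statistics.pvariance(heights)  # (4+1+0+1+4)/5 = 2 (population variance)
sd = statistics.pstdev(heights)           # sqrt(2) ≈ 1.414
print(mean, variance, sd)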
Quartiles: Quartiles divide the set into 4 equal parts.
● Quartiles are values that divide your data into 4 equal parts after sorting in
ascending order.
● They tell you about the spread and distribution of the data.
Types of Quartiles:
1. Q1 (First Quartile / Lower Quartile)
○ 25% of the data lies below Q1.
○ It is the median of the lower half of data.
2. Q2 (Second Quartile / Median)
○ 50% of the data lies below Q2.
○ It is simply the median of the dataset.
3. Q3 (Third Quartile / Upper Quartile)
○ 75% of the data lies below Q3.
○ It is the median of the upper half of data.
Use of Quartile:
● Summarize Data Distribution → Quartiles divide data into four equal parts
(like checkpoints). This helps understand how data is spread.
● Identify Spread Around Median → Unlike mean, quartiles focus on
median-based distribution, which is less affected by extreme values.
● Locate Outliers → Quartiles are the basis for outlier detection (using IQR).
● Compare Groups → Quartiles are widely used in box plots for comparing
datasets (e.g., exam scores across classes).
• Interquartile Range: The interquartile range is defined as the range between the
75th percentile (Q3) and the 25th percentile (Q1).
IQR = Q3 − Q1
Example: Data = {5, 7, 8, 12, 15, 18, 20, 22}

● Q1 = 7.5 (25th percentile; median of the lower half {5, 7, 8, 12})

● Q2 = 13.5 (median, 50th percentile)

● Q3 = 19 (75th percentile; median of the upper half {15, 18, 20, 22})

● IQR = Q3 − Q1 = 19 − 7.5 = 11.5
Application of IQR:
Robust Measure of Spread
● IQR uses the middle 50% of data → ignores outliers and extreme values.
● More reliable than Range (which depends only on min & max).
Outlier Detection
● Outliers are usually defined as values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR.
● This is the standard rule for finding outliers.
Better than Standard Deviation (sometimes)
● Standard deviation is most informative when data is roughly normally distributed.
● IQR works well even when data is skewed or contains outliers.
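A minimal sketch of quartile, IQR, and outlier calculations on the dataset above using NumPy; note that NumPy's default linear-interpolation method gives slightly different quartiles (Q1 = 7.75, Q3 = 18.5) than the median-of-halves method used in the text:

import numpy as np

data = np.array([5, 7, 8, 12, 15, 18, 20, 22])
q1, q3 = np.percentile(data, [25, 75])         # 7.75, 18.5 with NumPy's default method
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # standard outlier fences
print(iqr, data[(data < lower) | (data > upper)])  # IQR and any outliers (none here)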
Aspect | Central Tendency | Dispersion
Definition | Typical or central value | Spread or variability of the data
Purpose | Summarizes the dataset with a single value | Describes how far data points spread out
Examples | Mean, median, mode | Range, variance, standard deviation, IQR
Calculation | Uses data values directly | Uses deviations from the central value
Interpretation | Data's center or midpoint | Data's spread or consistency

Population and Sample

● Population
○ Entire set of individuals/items under study
○ Represents all possible data points
○ Example: All students in a university

● Sample
○ A subset of the population
○ Used when studying the whole population is impractical
○ Example: 200 students selected from the university
Sampling Funnel

•The process of going from a large population to a manageable sample for analysis.
Population: The entire group you want to study (e.g., all customers).
Sampling Frame: The list of individuals from which the sample is drawn (e.g., a customer
database).
Sample: The subset of the population that is actually studied (e.g., 1000 surveyed customers).
Benefits:
•Cost and time efficiency: Sampling reduces the time and resources needed for analysis.
•Feasibility for large populations: It allows analysis when studying the entire population is
impractical.
•Reduced risk of error: Proper sampling methods can minimize biases and the influence of
outliers.
•Accuracy: A well-chosen sample can accurately represent the larger population.
Central Limit Theorem
● It states that as you take a sufficiently large number of random samples from any
population, regardless of the population's original distribution, the distribution of the
sample means will approximate a normal distribution (also known as a bell curve).
● The larger the sample size, the more closely the distribution of the sample means
will resemble a normal distribution.
Conditions for the CLT:
● Random Sampling: Samples must be chosen randomly to be representative of the
population.
● Independence: Each sample must be independent of the others.
● Sufficiently Large Sample Size (n ≥ 30): This is a general rule of thumb. If the
original population is very skewed, a larger sample size may be needed.
Example

Imagine you want to find the average height of all students at a large university, but you
can't measure everyone.

1. Original Population: The height of all students at the university. This population's
distribution might not be a perfect bell curve; it could be slightly skewed.
2. Sampling: You randomly select 40 students and calculate their average height.
You repeat this process many times, creating hundreds of different samples, each
with 40 students.
3. Applying the CLT: Now, you plot a histogram of all the average heights you
calculated from your hundreds of samples. According to the Central Limit Theorem,
this histogram will form a normal, bell-shaped distribution.

Conclusion: The peak of this new bell-shaped curve will be very close to the true
average height of all students at the university. The spread of the curve (its standard
error) tells you how much the sample means typically vary from the true population mean.
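A minimal simulation sketch of this experiment using NumPy, with an assumed (deliberately skewed) height population:

import numpy as np

rng = np.random.default_rng(0)
# Assumed skewed "population" of heights — deliberately not normal
population = 150 + rng.exponential(scale=10, size=100_000)

# Repeatedly sample 40 students and record each sample's mean height
sample_means = [rng.choice(population, size=40).mean() for _ in range(1_000)]

print(population.mean())      # true population mean
print(np.mean(sample_means))  # ≈ population mean (peak of the bell curve)
print(np.std(sample_means))   # standard error ≈ population.std() / sqrt(40)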
Advantages of CLT

1. Allows us to use the normal distribution for making inferences about population
parameters.
2. Simplifies complex probability calculations.
3. Forms the basis for confidence intervals and hypothesis testing.
4. Works even when the population distribution is not normal.
5. Widely applicable in real-world data (finance, quality control, healthcare, etc.).
Confidence Interval (CI)

● In statistics, a Confidence Interval (CI) is a range of values that is likely to
contain the population parameter (like a mean or proportion).
● It provides an estimate rather than an exact value.

Key Features of Confidence Interval

1. Based on sample data but used to estimate a population parameter.
2. Defined by a confidence level (commonly 90%, 95%, or 99%).
3. Wider CI → more confidence but less precision.
4. Narrower CI → less confidence but more precision.
5. Depends on sample size and data variability.
Formula for Confidence Interval (Mean)

CI = x̄ ± Z × (σ / √n)

Where,

● x̄ = sample mean
● Z = Z-value for the confidence level (e.g., 1.96 for 95%)
● σ = population standard deviation (or the sample SD, s, if σ is unknown)
● n = sample size
Example (Simple)

● Suppose a sample of 100 students has average height = 160 cm, with
σ = 10 cm.
● Confidence Level = 95% → Z = 1.96
● CI = 160 ± 1.96 × (10 / √100) = 160 ± 1.96

✅ Interpretation: We are 95% confident that the true mean height of all
students lies between 158.04 cm and 161.96 cm.
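A minimal sketch of this calculation in Python:

import math

n, xbar, sigma, z = 100, 160, 10, 1.96  # sample size, mean, population SD, 95% Z-value
margin = z * sigma / math.sqrt(n)       # 1.96 * 10 / 10 = 1.96
print(xbar - margin, xbar + margin)     # 158.04 161.96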
Advantages of Confidence Intervals

1. Provide a range of plausible values instead of a single number.
2. More informative than just reporting a mean or proportion.
3. Help in understanding the accuracy and reliability of an estimate.
4. Useful for decision-making in research, business, medicine, etc.
5. Show how sample size and variability affect precision.
Sampling Variation

Definition: Sampling variation refers to the natural differences that occur in results
(statistics like mean, proportion, etc.) when different random samples are taken from the
same population.
Each sample gives slightly different results because no two samples are exactly alike.
Key Points

1. Occurs because we study only a part of the population, not the whole.
2. Sample statistics (mean, median, proportion) vary from sample to sample.
3. Larger samples → less sampling variation.
4. Basis for concepts like standard error, confidence intervals, and hypothesis
testing.
Example
Population: Average marks of 10,000 students = 60.
● If we take different random samples of size 50:
○ Sample 1 mean = 58
○ Sample 2 mean = 62
○ Sample 3 mean = 59.5
● The means are close to 60, but not the same → this difference is sampling
variation.
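A minimal simulation sketch of this example using NumPy, assuming a population of marks with mean about 60:

import numpy as np

rng = np.random.default_rng(1)
marks = rng.normal(loc=60, scale=12, size=10_000)  # assumed population, mean ≈ 60

for i in range(3):
    sample = rng.choice(marks, size=50, replace=False)  # a fresh random sample of 50
    print(f"Sample {i + 1} mean = {sample.mean():.1f}")
# Each sample mean differs slightly from 60 — that difference is sampling variation.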
Uses of Sampling Variation
1. Helps in understanding reliability of sample results.
2. Explains why repeated surveys (like election polls) give slightly different outcomes.
3. Provides the foundation for standard error calculation.
4. Essential for constructing confidence intervals.
5. Important in hypothesis testing to judge whether differences are due to chance or
real effects.