CH 1
Types of Data Analytics
Descriptive Analytics
● Question: What happened?
● What it does: Summarizes past data to understand trends & patterns
● Techniques: Reports, dashboards, summary statistics
● Example: A company reviews last year’s sales report
Diagnostic Analytics
● Question: Why did it happen?
● What it does: Finds reasons or root causes behind outcomes
● Techniques: Drill-down, data mining, correlation analysis
● Example: A hospital checks why patient readmissions increased
Predictive Analytics
● Question: What will happen?
● What it does: Uses historical data to forecast future outcomes
● Techniques: Regression, machine learning, time-series forecasting
● Example: An e-commerce site predicts which customers may stop shopping
Prescriptive Analytics
● Question: What should be done?
● What it does: Suggests best actions to optimize results
● Techniques: Optimization, simulations, scenario analysis
● Example: An airline recommends best ticket prices & routes
Life Cycle of Data Analytics
● Definition: The Data Analytics Life Cycle is a structured, often iterative process
that guides data projects from business objective definition to actionable insights
and implementation.
● Key Phases:
○ Discovery/Problem Definition: Identify business objectives and questions;
determine data needs.
○ Data Collection/Access: Gather necessary data from multiple sources
(databases, APIs, sensors, etc.).
○ Data Cleaning & Preparation: Remove errors, handle missing values, and
format data for analysis.
○ Exploratory Data Analysis: Visualize, summarize, and find trends or
anomalies in the data.
○ Model Planning & Building: Choose suitable models and analytical methods;
train and validate models.
○ Result Visualization & Communication: Present insights using charts,
dashboards, or reports for stakeholders.
○ Implementation & Monitoring: Deploy the solution and track results over time.
Quality and Quantity of Data
● Data Quantity: Refers to the amount of data collected—the more, the better for
robust models, as long as the data remains relevant and representative.
○ Example: Machine learning systems improve when trained on thousands rather
than hundreds of examples.
● Data Quality: Refers to how accurate, complete, consistent, and relevant data is for
its intended analysis.
○ Dimensions of Data Quality:
■ Accuracy (data is correct and matches reality)
■ Completeness (no missing or incomplete fields)
■ Consistency (no contradictions across datasets)
■ Timeliness (data is up-to-date and available when needed)
■ Reliability (trusted sources and methods)
○ Example: Duplicate customer records decrease quality
● Balancing Act:
○ High quantity is only useful if matched by high quality; poor quality in a large dataset leads to flawed outcomes.
○ In data analytics, a smaller but cleaner dataset sometimes provides better results than a massive, noisy one (see the sketch below).
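A minimal sketch of such quality checks in Python, assuming a hypothetical pandas customer table (the column names and values are illustrative, not from the source):

```python
import pandas as pd

# Hypothetical customer records with deliberate quality problems.
df = pd.DataFrame({
    "customer_id": [101, 102, 102, 103],
    "email": ["a@x.com", "b@x.com", "b@x.com", None],
    "age": [34, 29, 29, -5],
})

print(df.duplicated().sum())      # consistency: 1 exact duplicate row
print(df["email"].isna().sum())   # completeness: 1 missing email
print((df["age"] < 0).sum())      # accuracy: 1 impossible age value

# A cleaned version: drop duplicates, keep only plausible ages.
df_clean = df.drop_duplicates()
df_clean = df_clean[df_clean["age"] >= 0]
```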
Quantitative Insights:
•Focus: Measurable, numerical data, such as sales figures, website traffic, or survey results
with scaled ratings.
•Nature: Objective, precise, and often used for statistical analysis.
•Purpose: To identify trends, patterns, and make predictions based on quantifiable data.
•Example: Tracking daily active users (DAU) to understand user engagement.
Qualitative Insights:
•Focus: Non-numerical data, including text, audio, video, and open-ended survey responses.
•Nature: Subjective, descriptive, and focused on understanding user experiences and motivations.
•Purpose: To provide context, explore the "why" behind user behavior, and uncover deeper insights.
•Example: Analyzing customer feedback from interviews or social media comments.
Combining Qualitative and Quantitative Insights:
Benefits:
•Provides a more holistic view of the subject by combining the strengths of both data types.
Examples:
•Using quantitative data to identify a drop in website traffic and qualitative data from user
feedback to understand the reasons behind the drop.
•Analyzing survey results with quantitative data (like satisfaction scores) and qualitative data
(open-ended comments) to get a complete picture.
•Using quantitative data to identify customer segments and qualitative data to understand their
specific needs and preferences.
By combining both qualitative and quantitative insights, businesses can make more informed decisions, develop effective strategies, and gain a deeper understanding of their customers.
Measurement in Data Analytics
● Definition: Measurement in analytics refers to the ways variables are quantified,
categorized, or ordered, impacting the types of analysis possible.
● Levels (Scales) of Measurement:
○ Nominal: Categories without order (e.g., gender, country).
○ Ordinal: Ordered categories, but without precise quantifiable differences (e.g.,
customer satisfaction: poor, fair, good).
○ Interval: Ordered and evenly spaced values, but zero does not mean absence
(e.g., temperature in Celsius).
○ Ratio: Ordered, evenly spaced values with an absolute zero (e.g., age, income,
weight).
● Why It Matters:
○ The level determines which statistical methods and visuals are valid (e.g., the mean can only be calculated for interval/ratio data, not for nominal/ordinal data), as the sketch below illustrates.
○ Improper measurement can lead to invalid conclusions and poor
decision-making.
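To make the connection concrete, here is a small Python sketch (all variable names and values hypothetical) of which summary statistic is valid at each level:

```python
import statistics

country = ["IN", "US", "IN", "UK"]         # nominal: only counts / mode are meaningful
rating = ["poor", "fair", "good", "good"]  # ordinal: order matters, differences do not
temp_c = [22.0, 25.5, 19.0, 30.0]          # interval: differences meaningful, no true zero
income = [30000, 45000, 52000, 61000]      # ratio: true zero, ratios meaningful

print(statistics.mode(country))            # mode is valid for nominal data
order = ["poor", "fair", "good"]
ranked = sorted(rating, key=order.index)
print(ranked[len(ranked) // 2])            # median (middle of ordered values) for ordinal data
print(statistics.mean(temp_c))             # mean is valid for interval data
print(statistics.mean(income))             # mean (and ratios) valid for ratio data
```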
Measures of Central Tendency and Dispersion
What is Central Tendency?
● Central tendency describes the tendency of data to cluster around a single typical or
central value.
● It summarizes a dataset with a single representative value.
● Common measures include:
○ Mean
○ Median
○ Mode
Mean
● Also called the arithmetic average.
● Calculated by summing all values and
dividing by the number of observations.
● Sensitive to extreme values (outliers), which can skew the mean.
● Example: For temperatures (22, 23, 21, 25, 22, 24, 20),
Mean = (22 + 23 + 21 + 25 + 22 + 24 + 20) / 7 = 157 / 7 ≈ 22.43
Median
● Middle value when data is sorted in ascending or descending order.
● For an even number of values, the median is the average of the two middle values.
● Less affected by outliers compared to the mean.
Mode
● The value that occurs most frequently in the dataset.
● Useful for categorical data.
● Can be unimodal, bimodal, or multimodal.
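All three measures are available in Python's standard statistics module; a quick check using the temperature example above:

```python
import statistics

temps = [22, 23, 21, 25, 22, 24, 20]

print(statistics.mean(temps))       # 157 / 7 ≈ 22.43
print(statistics.median(temps))     # 22, the middle of the sorted values
print(statistics.multimode(temps))  # [22]; 22 occurs twice, so the data is unimodal
```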
Range
● The simplest measure of dispersion.
● Calculated as the difference between the maximum and minimum values.
● Range=Maximum Value−Minimum Value
● Example: If data = {5, 8, 12, 20}, then Range = 20 − 5 = 15.
Variance (Population)
● Variance is a number that tells us how spread out the values in a data set are from the mean.
● Formula: σ² = Σ(xᵢ − μ)² / N, where:
○ σ² is the population variance,
○ μ is the mean (average) of all data values,
○ N is the total number of data values.
● Example (with μ = 10), squared deviations (xᵢ − μ)²:
○ (5 − 10)² = 25
○ (7 − 10)² = 9
○ (9 − 10)² = 1
○ (10 − 10)² = 0
○ (14 − 10)² = 16
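The formula can be verified with Python's statistics.pvariance, which computes exactly Σ(xᵢ − μ)² / N. A short sketch using the Range example's data:

```python
import statistics

data = [5, 8, 12, 20]

print(max(data) - min(data))       # Range = 15
print(statistics.mean(data))       # mu = 11.25
print(statistics.pvariance(data))  # population variance = 126.75 / 4 = 31.6875
```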
Interquartile Range (IQR)
● Measures the spread of the middle 50% of the data.
● Example: Q1 = 8 (25th percentile), Q3 = 19 (75th percentile)
● IQR = Q3 − Q1 = 19 − 8 = 11
Application of IQR:
Robust Measure of Spread
● IQR uses the middle 50% of data → ignores outliers and extreme values.
● More reliable than Range (which depends only on min & max).
Outlier Detection
● Outliers are usually defined as values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR.
● This is the standard rule for finding outliers.
Better than Standard Deviation (sometimes)
● Standard deviation is most informative when data is roughly normally distributed.
● IQR works well even when data is skewed.
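A minimal NumPy sketch of the Q1/Q3 calculation and the 1.5 × IQR fence rule (the data is made up, with one extreme value):

```python
import numpy as np

data = np.array([4, 7, 8, 10, 12, 15, 19, 95])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                                 # 16.0 - 7.75 = 8.25
lower = q1 - 1.5 * iqr                        # values below this fence are outliers
upper = q3 + 1.5 * iqr                        # values above this fence are outliers
print(data[(data < lower) | (data > upper)])  # [95]
```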
Central Tendency vs. Dispersion
● Purpose: Central tendency summarizes the dataset with a single value; dispersion describes how far data points spread out.
● Calculation: Central tendency uses data values directly; dispersion uses deviations from the central value.
Population and Sample
● Population
○ Entire set of individuals/items under study
○ Represents all possible data points
○ Example: All students in a university
● Sample
○ A subset of the population
○ Used when studying the whole population is impractical
○ Example: 200 students selected from the university
Sampling Funnel
•The process of going from a large population to a manageable sample for analysis.
Population: The entire group you want to study (e.g., all customers).
Sampling Frame: The list of individuals from which the sample is drawn (e.g., a customer
database).
Sample: The subset of the population that is actually studied (e.g., 1000 surveyed customers).
Benefits:
•Cost and time efficiency: Sampling reduces the time and resources needed for analysis.
•Feasibility for large populations: It allows analysis when studying the entire population is
impractical.
•Reduced risk of error: Proper sampling methods can minimize biases and the influence of
outliers.
•Accuracy: A well-chosen sample can accurately represent the larger population.
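The funnel maps directly onto a few lines of Python (the population size, frame, and sample size here are hypothetical):

```python
import random

population = [f"customer_{i}" for i in range(100_000)]  # all customers
sampling_frame = population[:80_000]                    # e.g., those in the customer database
sample = random.sample(sampling_frame, 1000)            # 1000 randomly surveyed customers
print(len(sample))                                      # 1000
```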
Central Limit Theorem
● It states that as you take a sufficiently large number of random samples from any
population, regardless of the population's original distribution, the distribution of the
sample means will approximate a normal distribution (also known as a bell curve).
● The larger the sample size, the more closely the distribution of the sample means
will resemble a normal distribution.
Conditions for the CLT:
● Random Sampling: Samples must be chosen randomly to be representative of the
population.
● Independence: Each sample must be independent of the others.
● Sufficiently Large Sample Size (n ≥ 30): This is a general rule of thumb. If the
original population is very skewed, a larger sample size may be needed.
Example
Imagine you want to find the average height of all students at a large university, but you
can't measure everyone.
1. Original Population: The height of all students at the university. This population's
distribution might not be a perfect bell curve; it could be slightly skewed.
2. Sampling: You randomly select 40 students and calculate their average height.
You repeat this process many times, creating hundreds of different samples, each
with 40 students.
3. Applying the CLT: Now, you plot a histogram of all the average heights you
calculated from your hundreds of samples. According to the Central Limit Theorem,
this histogram will form a normal, bell-shaped distribution.
Conclusion: The peak of this new bell-shaped curve will be very close to the true
average height of all students at the university. The spread of the curve (its standard
error) tells you how much the sample means typically vary from the true population mean.
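The theorem is easy to reproduce in a simulation. The sketch below uses a skewed, hypothetical population (standing in for the student heights), draws 1000 samples of size 40, and inspects the distribution of the sample means:

```python
import numpy as np

rng = np.random.default_rng(0)

# A right-skewed population: mean ~10, clearly not bell-shaped.
population = rng.exponential(scale=10, size=100_000)

# Repeat the sampling step many times, recording each sample's mean.
sample_means = [rng.choice(population, size=40).mean() for _ in range(1000)]

print(np.mean(sample_means))  # close to the true population mean (~10)
print(np.std(sample_means))   # close to sigma / sqrt(40), the standard error
```

A histogram of sample_means would show the bell shape described above.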
Confidence Interval
● Formula: CI = x̄ ± Z × (σ / √n)
Where,
● x̄ = Sample mean
● Z = Z-value for the confidence level (e.g., 1.96 for 95%)
● σ = Population standard deviation (or the sample SD, s, if σ is unknown)
● n = Sample size
Example (Simple)
● Suppose a sample of 100 students has average height = 160 cm, with σ = 10 cm.
● Confidence Level = 95% → Z = 1.96
● CI = 160 ± 1.96 × (10 / √100) = 160 ± 1.96 → (158.04 cm, 161.96 cm)
✅ Interpretation: We are 95% confident that the true mean height of all
students lies between 158.04 cm and 161.96 cm.
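Plugging the example's numbers into the formula confirms the interval:

```python
import math

x_bar, sigma, n, z = 160, 10, 100, 1.96

margin = z * sigma / math.sqrt(n)      # 1.96 * 10 / 10 = 1.96
print(x_bar - margin, x_bar + margin)  # ≈ 158.04 and 161.96
```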
Sampling Variation
Definition: Sampling variation refers to the natural differences that occur in results
(statistics like mean, proportion, etc.) when different random samples are taken from the
same population.
Each sample gives slightly different results because no two samples are exactly alike.
Key Points
1. Occurs because we study only a part of the population, not the whole.
2. Sample statistics (mean, median, proportion) vary from sample to sample.
3. Larger samples → less sampling variation.
4. Basis for concepts like standard error, confidence intervals, and hypothesis
testing.
Example
Population: Average marks of 10,000 students = 60.
● If we take different random samples of size 50:
○ Sample 1 mean = 58
○ Sample 2 mean = 62
○ Sample 3 mean = 59.5
● The means are close to 60, but not the same → this difference is sampling
variation.
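This behavior is easy to simulate; the sketch below builds a hypothetical population of 10,000 students with mean marks near 60 and draws three samples of size 50:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population: 10,000 students, mean marks ~60.
population = rng.normal(loc=60, scale=12, size=10_000)

for i in range(3):
    sample = rng.choice(population, size=50, replace=False)
    print(f"Sample {i + 1} mean = {sample.mean():.1f}")  # each mean differs slightly, near 60
```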
Uses of Sampling Variation
1. Helps in understanding reliability of sample results.
2. Explains why repeated surveys (like election polls) give slightly different outcomes.
3. Provides the foundation for standard error calculation.
4. Essential for constructing confidence intervals.
5. Important in hypothesis testing to judge whether differences are due to chance or
real effects.