EDA Techniques in Python for Data Insights

Unit 5 focuses on Data Analytics and Real-World Applications, emphasizing Exploratory Data Analysis (EDA) as a critical step in understanding data patterns using Python libraries like pandas and matplotlib. It covers various data visualization techniques such as bar charts, line charts, and scatter plots, along with statistical analysis methods including descriptive, inferential, and predictive analytics. The document also discusses hypothesis testing, its types, and the importance of identifying patterns and insights in data for effective decision-making.

Unit 3 Machine Learning and Data analytics Using Python

Unit 5
Data Analytics and Real-World Applications
Exploratory Data Analysis (EDA) is an important step in data analysis that focuses on
understanding patterns, trends, and relationships through statistical tools and
visualizations. Python offers libraries such as pandas, NumPy, matplotlib, seaborn, and
plotly that enable effective exploration and insight generation to support further
modeling and analysis. This unit covers how to perform EDA using Python.

5.1 Data Visualization Techniques

• Data visualization represents textual or numerical data in a visual format, which
makes the information the data expresses easy to grasp. Humans remember
pictures more easily than readable text, so Python provides various libraries
for data visualization such as matplotlib, seaborn, and plotly.
• Data visualization includes a variety of charts, each designed to present data in a
clear and meaningful way. From simple bar and line charts to advanced visuals like
heatmaps and scatter plots, the right chart helps turn raw data into useful insights.
1. Bar Charts
• Bar charts are used to compare values across different categories using
rectangular bars. The x-axis shows categories while the y-axis represents values.
Common types include horizontal, stacked, and grouped bar charts.
• When to Use:
o To compare different categories
o To rank values from highest to lowest
o To show relationships between multiple variables
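A minimal matplotlib sketch of a bar chart, using made-up category data:

```python
import matplotlib.pyplot as plt

# Hypothetical sales figures for four product categories
categories = ['Electronics', 'Clothing', 'Groceries', 'Toys']
sales = [250, 180, 320, 90]

plt.bar(categories, sales, color='steelblue')
plt.xlabel('Category')
plt.ylabel('Sales')
plt.title('Sales by Category')
plt.show()
```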

2. Line Charts
• Line charts show how values change over time by connecting data points with
lines. They help visualize trends like increases, decreases or stability.
• When to Use:
o To track changes over time
o To compare trends across multiple data series
o For time series analysis
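A minimal sketch of a line chart with matplotlib, on hypothetical monthly traffic data:

```python
import matplotlib.pyplot as plt

# Hypothetical monthly website visits
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
visits = [1200, 1350, 1280, 1500, 1620, 1750]

plt.plot(months, visits, marker='o')
plt.xlabel('Month')
plt.ylabel('Visits')
plt.title('Website Visits Over Time')
plt.show()
```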
Unit 3 Machine Learning and Data analytics Using Python

3. Pie Charts
Pie charts are round charts divided into slices, where each slice shows a part of the
whole. The size of each slice represents its percentage.
When to Use:
• To show how different parts contribute to a whole
• To highlight a dominant category
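A minimal sketch of a pie chart with matplotlib, using made-up market-share numbers:

```python
import matplotlib.pyplot as plt

# Hypothetical browser market share (percentages)
labels = ['Chrome', 'Safari', 'Edge', 'Other']
share = [65, 20, 10, 5]

plt.pie(share, labels=labels, autopct='%1.1f%%')
plt.title('Browser Market Share')
plt.show()
```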

4. Scatter Plots


Scatter charts use dots to show the relationship between two numerical variables.
The x-axis shows the independent variable and the y-axis shows the dependent variable.
When to Use:
• To observe relationships between two variables
• To detect patterns, clusters or outliers in data
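A minimal sketch of a scatter plot on synthetic data (the study-hours relationship is made up for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data: hours studied (independent) vs exam score (dependent)
rng = np.random.default_rng(42)
hours = rng.uniform(0, 10, 50)
scores = 40 + 5 * hours + rng.normal(0, 5, 50)

plt.scatter(hours, scores, alpha=0.7)
plt.xlabel('Hours Studied')
plt.ylabel('Exam Score')
plt.title('Hours Studied vs Exam Score')
plt.show()
```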
Unit 3 Machine Learning and Data analytics Using Python

5. Histogram
A histogram displays the distribution of numerical data by grouping values into
intervals (bins) and showing their frequency as bars. It helps reveal the shape,
spread and patterns in the data.
When to Use:
• To visualize the distribution of numerical data
• To explore patterns, trends and outliers
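A minimal histogram sketch on synthetic, normally distributed data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic ages drawn from a normal distribution
rng = np.random.default_rng(0)
ages = rng.normal(35, 10, 1000)

plt.hist(ages, bins=20, edgecolor='black')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Age Distribution')
plt.show()
```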

6. Heatmap
A heatmap displays data in a matrix format using color to represent values. It's
ideal for spotting patterns, correlations and variations in large datasets.
When to Use:
• To identify clusters or groupings in data
• To visualize correlations between variables
• For risk analysis in fields like finance or network security
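A dependency-light sketch of a correlation heatmap with plain matplotlib (seaborn's heatmap function is a common alternative); the dataset is synthetic:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic dataset: 'y' is strongly related to 'x', 'z' is independent
rng = np.random.default_rng(1)
df = pd.DataFrame({'x': rng.normal(size=100)})
df['y'] = 2 * df['x'] + rng.normal(0, 0.5, 100)
df['z'] = rng.normal(size=100)

corr = df.corr()  # pairwise correlation matrix

plt.imshow(corr, cmap='coolwarm', vmin=-1, vmax=1)
plt.colorbar(label='Correlation')
plt.xticks(range(len(corr)), corr.columns)
plt.yticks(range(len(corr)), corr.columns)
plt.title('Correlation Heatmap')
plt.show()
```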
Unit 3 Machine Learning and Data analytics Using Python

7. Box Plot (Box-and-Whisker Plot)


A box plot displays the distribution of numerical data, showing the median, quartiles,
and outliers. It's useful for understanding variability and detecting unusual values.
When to Use:
• To detect outliers in data
• To compare distributions across groups
• To visualize spread and variability
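A minimal box-plot sketch comparing two synthetic groups, with a couple of injected outliers:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic scores; Group A has two injected high outliers
rng = np.random.default_rng(7)
group_a = np.concatenate([rng.normal(50, 5, 100), [90, 95]])
group_b = rng.normal(60, 8, 100)

plt.boxplot([group_a, group_b])
plt.xticks([1, 2], ['Group A', 'Group B'])
plt.ylabel('Score')
plt.title('Score Distribution by Group')
plt.show()
```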
Unit 3 Machine Learning and Data analytics Using Python

Statistical Analysis
Statistical analysis forms the backbone of data science, enabling us to derive
meaningful insights from complex and diverse datasets. It involves the systematic
process of collecting, organizing, interpreting, and presenting data to identify
patterns, trends, and relationships. Whether working with numerical, categorical,
or qualitative data, statistical methods help simplify complexity and guide data-
driven decision-making.
By leveraging these techniques, we can uncover trends, assess risks, and make
predictions—transforming raw data into actionable insights that support
informed strategies and innovations.

Types of Statistical Analysis


1. Descriptive Statistical Analysis
Descriptive analysis helps simplify and summarize data in an understandable
format, often through tables, charts, and numerical summaries.
Key Elements:
• Measures of Frequency:
o Count: Number of occurrences of each value.
o Frequency Distribution: Visual representation via histograms/bar
charts.
o Relative Frequency: Proportion of each value to the total.
• Measures of Central Tendency:
o Mean: Arithmetic average.
o Median: Middle value in sorted data.
o Mode: Most frequently occurring value.
• Measures of Dispersion:
o Range: Difference between maximum and minimum.
o Variance & Standard Deviation: Indicate spread or variability in
data.
Descriptive statistics provide a snapshot of data distribution and help reveal its
underlying structure.
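These summaries can be computed directly with pandas; a minimal sketch on made-up exam scores:

```python
import pandas as pd

# Hypothetical exam scores
scores = pd.Series([55, 60, 60, 65, 70, 70, 70, 80, 85, 95])

print("Mean:  ", scores.mean())     # arithmetic average
print("Median:", scores.median())   # middle value in sorted data
print("Mode:  ", scores.mode()[0])  # most frequent value
print("Range: ", scores.max() - scores.min())
print("Std:   ", scores.std())      # sample standard deviation
print(scores.describe())            # full numerical summary at once
```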

2. Inferential Statistical Analysis


Inferential analysis draws conclusions about a population based on a sample,
allowing for hypothesis testing and predictions.
Common Techniques:
• Hypothesis Testing: Validates assumptions with statistical significance (p-
values).
• t-tests: Compares means between two groups.
• Chi-Square Test: Assesses relationships between categorical variables.
• ANOVA (Analysis of Variance): Compares means across multiple groups.
• Non-parametric Tests: For data that doesn’t meet parametric assumptions
(e.g., Kruskal-Wallis, Wilcoxon test).
This type of analysis enables generalization beyond observed data, enhancing the
reliability of findings.
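A short sketch of one such technique, an independent two-sample t-test with scipy, on hypothetical conversion-rate data (all numbers are made up):

```python
from scipy import stats

# Hypothetical conversion rates (%) for two website designs
design_a = [2.1, 2.5, 2.3, 2.7, 2.2, 2.6, 2.4, 2.8]
design_b = [3.0, 3.2, 2.9, 3.4, 3.1, 3.3, 2.8, 3.5]

# Independent two-sample t-test: are the group means different?
t_stat, p_value = stats.ttest_ind(design_a, design_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

if p_value < 0.05:
    print("Reject H0: the designs differ significantly.")
else:
    print("Fail to reject H0.")
```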
3. Predictive Statistical Analysis
Predictive analytics uses historical data to make forecasts about future outcomes,
trends, or behaviors.
Key Components:
• Data Preparation: Cleaning and formatting historical data.
• Modeling: Applying algorithms to identify trends (e.g., linear regression,
decision trees).
• Forecasting: Applications like sales prediction, churn analysis, and
demand forecasting.
4. Prescriptive Statistical Analysis
Prescriptive analytics suggests optimal actions based on predictive insights,
enabling proactive decision-making.
Core Functions:
• Optimization Models: Identify best-case scenarios or resource allocations.
• Decision Engines: Recommend actions to achieve specific goals using
historical data and simulations.
Prescriptive analysis is valuable in fields like logistics, finance, and operations for
strategy formulation.

5. Causal Analysis
Causal analysis goes beyond correlation by establishing cause-and-effect relationships.
Importance:
• Identifies the root causes of events or outcomes.
• Enables intervention-based strategies.
• Used in experimental designs, A/B testing, and root cause analysis.
Statistical Analysis Process
1. Understanding the Data
Explore the dataset’s type, context, and attributes for proper analysis.
2. Sampling and Representativeness
Ensure that the sample reflects the population accurately to support valid
generalizations.
3. Modeling Relationships
Develop models (e.g., regression, classification) to capture patterns and
relationships among variables.
4. Model Validation
Test the reliability of models using evaluation metrics and real-world data.
5. Predictive Application
Use validated models for future forecasting and strategy planning.
Importance of Statistical Analysis
• Pattern Recognition: Identifies trends, correlations, and key variables.
• Data Cleaning: Helps detect outliers, missing values, and anomalies.
• Feature Engineering: Aids in selecting and transforming variables for ML models.
• Risk Evaluation: Supports risk prediction in sectors like finance, insurance, and
healthcare.
• Process Optimization: Drives operational efficiency and cost savings.
• Model Assessment: Uses performance metrics (e.g., accuracy, precision, recall,
F1-score) for model evaluation.
Risks and Limitations of Statistical Analysis
• Misinterpretation of Results: Correlation ≠ Causation. Misleading insights may
arise if relationships are misunderstood.
• Sampling Bias: Poor sample selection can distort outcomes and reduce
generalizability.
• Overreliance on Models: Models are simplifications; real-world decisions should
account for complexity.
• Uncertainty Ignorance: Failure to communicate confidence intervals or error
margins can lead to overconfidence in results.
Hypothesis Testing
Hypothesis testing compares two opposing claims about a population and uses data
from a small part of that population (a sample) to decide which claim is more
likely true. We collect and study the sample data to check whether the claim
is correct.

For example, if a company says its website gets 50 visitors each day on average,
we use hypothesis testing to look at past visitor data and see if this claim is true
or if the actual number is different.
Defining Hypotheses
• Null Hypothesis (H₀): The starting assumption. For example, "The
average visits are 50."
• Alternative Hypothesis (H₁): The opposite, saying there is a difference.
For example, "The average visits are not 50."
Key Terms of Hypothesis Testing
To understand hypothesis testing, we first need to understand the key terms, which
are given below:
• Significance Level (α): How sure we want to be before saying the claim is false.
Usually, we choose 0.05 (5%).
• p-value: The chance of seeing the data if the null hypothesis is true. If this is less
than α, we say the claim is probably false.

• Test Statistic: A number that helps us decide if the data supports or rejects the
claim.
• Critical Value: The cutoff point to compare with the test statistic.
• Degrees of freedom: A number that depends on the data size and helps find the
critical value.
Types of Hypothesis Testing
1. One-Tailed Test
Used when we expect a change in only one direction, either up or down, but not
both. For example, when testing whether a new algorithm improves accuracy, we
only check whether accuracy increases.
There are two types of one-tailed test:
• Left-Tailed (Left-Sided) Test: Checks if the value is less than expected.
Example: H₀: μ ≥ 50 and H₁: μ < 50
• Right-Tailed (Right-Sided) Test: Checks if the value is greater than
expected. Example: H₀: μ ≤ 50 and H₁: μ > 50
2. Two-Tailed Test
Used when we want to see if there is a difference in either direction, higher or
lower. For example, testing whether a marketing strategy affects sales, whether
they go up or down.
• Example: H₀: μ = 50 and H₁: μ ≠ 50
What are Type 1 and Type 2 errors in Hypothesis Testing?
In hypothesis testing, Type I and Type II errors are two possible errors that can
happen when we draw conclusions about a population based on a sample of
data. These errors are associated with the decisions we make regarding the null
hypothesis and the alternative hypothesis.
• Type I error: Rejecting the null hypothesis although it is true. A Type I
error is denoted by alpha (α).
• Type II error: Failing to reject the null hypothesis although it is false. A
Type II error is denoted by beta (β).
Working of Hypothesis Testing
Step 1: Define Hypotheses:
• Null hypothesis (H₀): Assumes no effect or difference.
• Alternative hypothesis (H₁): Assumes there is an effect or difference.
Example: Test if a new algorithm improves user engagement.

Step 2: Choose significance level


We select a significance level (usually 0.05). This is the maximum chance we accept
of wrongly rejecting the null hypothesis (Type I error). It also sets the confidence
needed to accept results.
Step 3: Collect and Analyze data
• Now we gather data; this could come from user observations or an experiment.
Once collected, we analyze the data using appropriate statistical methods to
calculate the test statistic.
• Example: We collect data on user engagement before and after implementing the
algorithm. We can also find the mean engagement scores for each group.
Step 4: Calculate Test Statistic
The test statistic measures how much the sample data deviates from what we would
expect if the null hypothesis were true. Different tests use different statistics:
• Z-test: Used when population variance is known and sample size is large.
• T-test: Used when sample size is small or population variance unknown.
• Chi-square test: Used for categorical data to compare observed vs. expected
counts.
Step 5: Make a Decision
We compare the test statistic to a critical value from a statistical table or use the p-
value:
1. Using Critical Value:
• If test statistic > critical value → reject H0.
• If test statistic ≤ critical value → fail to reject H0.
2. Using P-value:
• If p-value ≤ α → reject H0.
• If p-value > α → fail to reject H0.
Example: If p-value is 0.03 and α is 0.05, we reject the null hypothesis because 0.03
< 0.05.
Step 6: Interpret the Results
Based on the decision, we conclude whether there is enough evidence to support
the alternative hypothesis or if we should keep the null hypothesis.

Real life Examples of Hypothesis Testing


A pharmaceutical company tests a new drug to see if it lowers blood pressure in
patients.
Data:
• Before Treatment: 120, 122, 118, 130, 125, 128, 115, 121, 123, 119
• After Treatment: 115, 120, 112, 128, 122, 125, 110, 117, 119, 114
Step 1: Define the Hypothesis
• Null Hypothesis: (H0)The new drug has no effect on blood pressure.
• Alternate Hypothesis: (H1)The new drug has an effect on blood
pressure.
Step 2: Define the Significance level
Usually 0.05, meaning we accept less than a 5% chance that the results are due to
random chance.
Step 3: Compute the test statistic
Using a paired t-test, analyze the data to obtain a test statistic and a p-value. The
test statistic is calculated based on the differences between blood pressure
measurements before and after treatment.
t = m / (s / √n)
Where:
• m = mean of the differences dᵢ = X_after,i − X_before,i
• s = standard deviation of the differences dᵢ
• n = sample size
Here m = −3.9, s ≈ 1.37 and n = 10, giving a t-statistic of −9.0 from the paired
t-test formula.
Step 4: Find the p-value
With degrees of freedom = 9, p-value ≈ 0.0000085 (very small).
Step 5: Result
Since the p-value (≈ 8.5 × 10⁻⁶) is less than the significance level (0.05),
the researchers reject the null hypothesis. There is statistically significant
evidence that the average blood pressure before and after treatment with the new
drug is different.
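The computation above can be reproduced with scipy's paired t-test:

```python
from scipy import stats

# Blood pressure measurements from the example above
before = [120, 122, 118, 130, 125, 128, 115, 121, 123, 119]
after = [115, 120, 112, 128, 122, 125, 110, 117, 119, 114]

# Paired t-test on the before/after differences
t_stat, p_value = stats.ttest_rel(after, before)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")
```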

Identifying Patterns and Insights from Data in EDA


In Exploratory Data Analysis (EDA), identifying patterns and extracting insights is a
crucial step that helps analysts understand the structure, behavior, and hidden
relationships in the data before applying any machine learning models or statistical
tests.
Purpose of Identifying Patterns in EDA
• To understand the distribution and nature of the data
• To discover relationships between variables
• To detect anomalies, outliers, or data quality issues
• To generate hypotheses and guide feature selection for predictive modelling
Common Techniques to Identify Patterns and Insights
1. Data Visualization
• Histograms: Reveal data distribution and skewness.
• Box Plots: Highlight spread and detect outliers.
• Scatter Plots: Show relationships or trends between two variables.
• Heatmaps: Display correlation between variables using color gradients.
• Pair Plots: Offer a matrix of scatter plots for multi-variable analysis.
2. Summary Statistics
• Mean, Median, Mode: Help identify central tendency.
• Standard Deviation & Variance: Indicate data spread.
• Skewness & Kurtosis: Describe the shape and tail behavior of data distributions.
3. Correlation Analysis
• Measures the strength and direction of relationships between numerical
variables.
• Tools: Pearson’s correlation (linear), Spearman’s rank (non-linear/ordinal).
4. Grouping and Aggregation
• Grouping data by categories and applying aggregation functions (mean, count,
sum) helps spot trends and patterns.
o Example: Average purchase by customer segment, total sales by region.
5. Trend Analysis
• Useful for time series data to detect increasing or decreasing trends over time.
6. Clustering (optional advanced EDA)
• Algorithms like K-Means can be used to identify natural groupings in the data.
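Grouping and aggregation (technique 4 above) can be sketched with pandas on made-up transaction records:

```python
import pandas as pd

# Hypothetical transaction records
sales = pd.DataFrame({
    'region': ['North', 'South', 'North', 'South', 'North'],
    'amount': [100, 200, 150, 250, 120],
})

# Total and average sales per region
summary = sales.groupby('region')['amount'].agg(['sum', 'mean'])
print(summary)
```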

Examples of Insights from EDA


• High correlation between education level and income
• Sales peaks during holidays or weekends
• Outliers in age or income that might skew model predictions
• Missing values mostly concentrated in a specific column or group
Benefits of Pattern Detection in EDA
• Informs data preprocessing decisions (e.g., removing outliers)
• Helps in feature selection and dimensionality reduction
• Provides business-relevant insights early in the analysis
• Improves model interpretability and performance

5.2 Time Series Analysis


Time series analysis is a robust statistical approach used to study data points gathered at
consistent time intervals, allowing analysts to identify patterns and trends within the
data. This method is widely applied across numerous industries because it supports
informed decision-making and enhances forecasting accuracy by leveraging historical
records. By uncovering insights from the past to anticipate future outcomes, time series
analysis is essential in areas such as finance, healthcare, energy, supply chain operations,
weather prediction, marketing, and many other fields.
5.2.1 Introduction to Time Series Data
• Time series data in the context of machine learning refers to a sequence of data
points collected or recorded at successive, evenly spaced points in time. Unlike
typical datasets where observations may be independent, time series data has a
natural temporal order and dependencies where past values influence future ones.
• This ordered structure allows machine learning and statistical models to analyze
patterns such as trends, seasonality, and autocorrelation in the data to forecast
future events or understand underlying temporal behavior.
• In machine learning, time series data is often univariate, where time is the only
independent variable, and the goal is to predict future values based on previous
observations.
• It can also be multivariate, including additional variables like weather or
demographics that may affect the series.
• Time series analysis techniques are essential for many applications—such as
financial market prediction, weather forecasting, healthcare monitoring, energy
demand analysis, and supply chain management—because they provide a way to
learn from historical sequential data and generate accurate predictions.

Why Do Organizations Use Time Series Analysis?


• Organizations use time series analysis because it is an essential tool for making
data-driven decisions that improve business outcomes. By examining patterns and
trends over time, companies gain meaningful insights into their past performance
and can reliably forecast future results in a way that is both relevant and
actionable. This process transforms raw data into valuable information that helps
businesses enhance their operations and monitor historical progress.
• For instance, retailers analyze seasonal sales trends to adjust inventory levels and
design targeted marketing strategies. Energy providers study consumption
patterns to optimize their production schedules efficiently. Time series analysis
also aids in detecting anomalies—such as sudden drops in website traffic—that
might indicate underlying problems or reveal new opportunities. Financial
institutions rely on it to react swiftly to stock market fluctuations, while healthcare
organizations use it to evaluate patient risk in real time.
• Unlike isolated statistics, time series analysis presents a continuous narrative of
evolving business conditions. This dynamic perspective enables companies to plan
proactively, identify issues early, and seize emerging opportunities promptly.

Components of Time Series Data


Time series data generally comprises different components that characterize the
patterns and behavior of the data over time. By analyzing these components, we can
better understand the dynamics of the time series and create more accurate models. Four
main elements make up a time series dataset:
• Trends
• Seasonality
• Cycles
• Noise
1. Trend
• Represents the overall direction of the data over a long period.
• Indicates whether data values are generally increasing, decreasing, or
remaining stable.
• Reflects long-term growth or decline in the observed variable.
• Example: E-commerce sales showing steady growth over five years.
2. Seasonality
• Refers to recurring, predictable patterns within fixed time intervals.
• These patterns repeat regularly with consistent timing, direction, and
magnitude.
• Example: Increased electricity consumption every summer due to air
conditioning use.
3. Cycles
• Fluctuations that occur without a fixed or regular period.
• Typically last longer than a year and vary in length and intensity.
• Often linked to economic or business conditions, such as expansions and
recessions.
• Example: Business cycles alternating between growth and decline phases.
4. Noise
• Represents the random, unpredictable variation in the data not explained
by trend, seasonality, or cycles.
• Consists of erratic deviations that appear as residuals after removing the
other components.
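As an illustrative sketch, a synthetic monthly series can be built as the sum of these components (all numbers made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
t = np.arange(120)  # 120 monthly observations

trend = 0.5 * t                                # steady long-term growth
seasonality = 10 * np.sin(2 * np.pi * t / 12)  # repeats every 12 months
noise = rng.normal(0, 2, t.size)               # random, unexplained variation

series = pd.Series(100 + trend + seasonality + noise)
print(series.head())
```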
ARIMA Model for Time Series Forecasting
• ARIMA stands for Autoregressive Integrated Moving Average, a popular and
powerful statistical method used for forecasting time series data.
• It combines three components:
• AR (AutoRegressive part, order p): Uses the relationship between an
observation and a number of lagged observations (past values).
• I (Integrated part, order d): Differencing the time series data to make it
stationary (i.e., mean and variance constant over time).
• MA (Moving Average part, order q): Uses past forecast errors in a
regression-like model.
• The ARIMA model is denoted as ARIMA(p, d, q), where p, d, q refer to the orders of
AR, I, and MA respectively.
• Suitable for univariate time series data that can be made stationary by
differencing.
Key Concepts
• Stationarity: A stationary time series has statistical properties like mean,
variance, and autocorrelation that remain constant over time.
• Differencing: A method to stabilize the mean of a time series by removing
changes in the level of a time series, eliminating trend and seasonality.
• Autocorrelation Function (ACF): Measures correlation between observations
of a time series separated by k time units; used to identify q.
• Partial Autocorrelation Function (PACF): Measures correlation between
observations separated by lag k, accounting for the effects of intermediate lags;
used to identify p.
Steps to Build an ARIMA Model
1. Visualize the time series data for trends, seasonality, and outliers:
Start by plotting the data against time to identify overall upward or downward
movements (trends), repeating seasonal patterns (like yearly sales spikes), and
unusual data points (outliers) that deviate significantly from typical behavior.
Visualization helps understand the data’s structure and guides further analysis.
2. Check for stationarity using statistical tests like Augmented Dickey-Fuller
(ADF) test:
Stationarity means the statistical properties of the series — like mean and
variance — do not change over time. Many forecasting models, including ARIMA,
assume stationarity. The ADF test assesses whether the series is stationary or
not, based on hypothesis testing.
3. If non-stationary, difference the data (d times) until it becomes stationary:
If the series is non-stationary (shows trends or changing variance), apply
differencing — subtracting the current observation from the previous one — one
or more times until the data stabilizes. This step removes trends and helps meet
the model’s assumptions.
4. Identify p and q using ACF and PACF plots:
The Autocorrelation Function (ACF) plot helps determine the order of the
moving average component (q), showing correlation of the data with its lagged
values. The Partial Autocorrelation Function (PACF) plot helps find the order of
the autoregressive component (p), showing the correlation with lagged values
after removing intermediate correlations. These plots guide selecting optimal
values of p and q.
5. Fit the ARIMA(p, d, q) model using historical data:
Using the chosen orders of autoregression (p), differencing (d), and moving
average (q), train the ARIMA model on the historical dataset to learn the
underlying patterns and relationships.
6. Evaluate model residuals for randomness:
After fitting, analyze the residuals (differences between actual and predicted
values). They should behave like white noise — random with no patterns. Non-
random residuals indicate the model may be missing important information.
7. Make forecasts on future data points:
Use the trained ARIMA model to predict future values based on learned
relationships, providing estimates and confidence intervals for upcoming time
steps.
8. Optionally, tune parameters to improve model accuracy:
Adjust p, d, q values or use model selection criteria like AIC/BIC to find the model
with the best forecasting performance. This step helps refine the model.
Python Implementation Example
The example uses a sample dataset ("Shampoo Sales") to demonstrate ARIMA model
building.

pip install pandas numpy matplotlib statsmodels


Step 1: Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

Step 2: Load and visualize data

# Load Shampoo Sales dataset (URL elided in source; replace with your file)
url = '...'
data = pd.read_csv(url, header=0, index_col=0, parse_dates=True)

# Plot time series
data.plot()
plt.title('Shampoo Sales Over Time')
plt.show()

Step 3: Check stationarity with ADF test


def test_stationarity(timeseries):
    result = adfuller(timeseries)
    print('ADF Statistic:', result[0])
    print('p-value:', result[1])
    if result[1] > 0.05:
        print("Series is non-stationary; consider differencing.")
    else:
        print("Series is stationary.")

test_stationarity(data['Sales'])

Step 4: Differencing to make data stationary (if needed)

data_diff = data['Sales'].diff().dropna()

# Test again after differencing
test_stationarity(data_diff)

# Plot differenced data
data_diff.plot()
plt.title('Differenced Series')
plt.show()

Step 5: Plot ACF and PACF to identify p and q

fig, ax = plt.subplots(2, 1, figsize=(12, 8))
plot_acf(data_diff, ax=ax[0], lags=20)
plot_pacf(data_diff, ax=ax[1], lags=20)
plt.show()

• From ACF plot, identify q (lag after which autocorrelation drops off).
• From PACF plot, identify p (lag after which partial autocorrelation drops off).

Step 6: Fit ARIMA model


Assume from plots or domain knowledge, parameters chosen are p=2, d=1, q=2
(example).

model = ARIMA(data['Sales'], order=(2, 1, 2))
model_fit = model.fit()
print(model_fit.summary())

Step 7: Diagnostics of residuals


residuals = model_fit.resid
plt.figure()
plt.subplot(211)
residuals.plot(title='Residuals')
plt.subplot(212)
residuals.plot(kind='kde', title='Density of Residuals')
plt.show()

print(residuals.describe())

Residuals should look like white noise, normally distributed around zero.
Step 8: Make predictions (forecasting)

# Forecast next 5 time points
forecast = model_fit.forecast(steps=5)
print("Forecasted values:\n", forecast)

# Plot observed and forecast
plt.figure(figsize=(10, 6))
plt.plot(data.index, data['Sales'], label='Observed')
forecast_index = pd.date_range(start=data.index[-1], periods=6, freq='M')[1:]
plt.plot(forecast_index, forecast, label='Forecast', color='red')
plt.legend()
plt.show()

Prophet Model for Time Series Forecasting


• Prophet is an open-source forecasting tool developed by Facebook designed to
handle univariate time series data.
• It uses an additive model where non-linear trends are combined with seasonality
and holiday effects.
• The model is especially useful for time series with strong seasonal effects, multiple
seasonality (daily, weekly, yearly), and known events or holidays.
• Prophet is robust to missing data and outliers and requires less manual tuning
than classical methods like ARIMA.
• It automatically detects trend changes by estimating changepoints in the time
series.
Components of the Prophet Model
• Trend: Models the long-term increase or decrease in the data, either linearly or
with logistic growth.
• Seasonality: Models repeating cycles such as daily, weekly, and yearly patterns.
• Holidays/Events: Allows inclusion of known events and holiday effects that cause
irregular but recurring shifts.
• Error Term: Captures irregular fluctuations not modeled by the above
components.
The general form of Prophet's additive model is:
y(t) = g(t) + s(t) + h(t) + εₜ
• g(t) = trend function
• s(t) = seasonality function(s)
• h(t) = holiday effects
• εₜ = error/noise term

Steps to Use the Prophet Model


1. Prepare the dataset: Dataframe with two columns:
• ds for date (datetime format)
• y for the numeric value to forecast
2. Create and fit the Prophet model: Initialize the model, fit it on training data.
3. Make a future dataframe: Add periods for which you want to predict.
4. Generate forecast: Use predict() on the future dataframe to get predicted values
and uncertainty intervals.
5. Visualize results: Plot forecast components (trend, seasonality) and the overall
forecast.
6. Evaluate model performance: Using error metrics like MAE, MSE if true future
data available.

Python Implementation Example


Step 1: Import necessary libraries
import pandas as pd
from prophet import Prophet
from matplotlib import pyplot as plt
from sklearn.metrics import mean_absolute_error

Step 2: Load and prepare data


We will use the monthly car sales dataset as an example.
# Load dataset
url = '...'  # URL or local path of the monthly car sales CSV dataset
df = pd.read_csv(url, header=0)

# Rename columns as required by Prophet


df.columns = ['ds', 'y']

# Convert 'ds' to datetime


df['ds'] = pd.to_datetime(df['ds'])

print(df.head())

Step 3: Create training and test set (e.g., last 12 months as test)
train = df[:-12]
test = df[-12:]

Step 4: Initialize and fit Prophet model


model = Prophet()
model.fit(train)

Step 5: Make a future dataframe and forecast


# Create future dataframe for 12 months ahead
future = model.make_future_dataframe(periods=12, freq='M')

# Predict the future


forecast = model.predict(future)

# Show forecasted values


print(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail())

Step 6: Evaluate the model


# Extract predictions corresponding to the test period
y_true = test['y'].values
y_pred = forecast['yhat'][-12:].values

# Calculate Mean Absolute Error


mae = mean_absolute_error(y_true, y_pred)
print(f'Mean Absolute Error: {mae:.2f}')

Step 7: Visualize the forecast


# Plot actual vs predicted
plt.figure(figsize=(10, 6))
plt.plot(train['ds'], train['y'], label='Training Data')
plt.plot(test['ds'], y_true, label='Actual')
plt.plot(test['ds'], y_pred, label='Forecast')
plt.fill_between(test['ds'],
                 forecast['yhat_lower'][-12:],
                 forecast['yhat_upper'][-12:], color='gray', alpha=0.2)
plt.legend()
plt.show()

Step 8 (Optional): Visualize forecast components (trend, seasonality)


model.plot_components(forecast)
plt.show()

Key Advantages of Prophet


• Handles multiple seasonalities and holidays/events easily.
• Robust to missing data and outliers.
• Minimal parameter tuning needed; automatic changepoint detection.
• Well documented with visualization tools.
• Suitable for business forecasting and operational planning.

Evaluating time series models


1. Mean Absolute Error (MAE): The average of the absolute differences between
predicted and actual values. MAE is easy to interpret since it is in the same units
as the data and less sensitive to outliers.
2. Mean Squared Error (MSE): The average of the squared errors, which
penalizes larger errors more heavily than MAE. It is sensitive to outliers and useful
when large errors are particularly undesirable.
3. Root Mean Squared Error (RMSE): The square root of MSE, bringing the metric
back to the data's original units. RMSE emphasizes larger errors and is commonly
used to compare models.
4. Mean Absolute Percentage Error (MAPE): Measures the average absolute
percent difference between predicted and actual values, useful when you want a
scale-independent error measurement. However, it can be problematic when
actual values are near zero.
5. Mean Absolute Scaled Error (MASE): A scaled error metric comparing the
forecast performance to a naïve baseline forecast. A value less than 1 indicates
better performance than the baseline.
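These metrics are straightforward to compute directly. The sketch below is an illustrative NumPy implementation (the function names are our own; scikit-learn also ships `mean_absolute_error` and `mean_squared_error`):

```python
import numpy as np

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mape(y_true, y_pred):
    # Undefined when y_true contains zeros
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

def mase(y_true, y_pred, y_train):
    # Scale by the in-sample MAE of a naive one-step-ahead forecast
    naive_mae = np.mean(np.abs(np.diff(y_train)))
    return float(np.mean(np.abs(y_true - y_pred)) / naive_mae)

y_train = np.array([100.0, 110.0, 105.0, 120.0])
y_true = np.array([125.0, 130.0])
y_pred = np.array([120.0, 135.0])
print(mae(y_true, y_pred))            # 5.0
print(rmse(y_true, y_pred))           # 5.0
print(mase(y_true, y_pred, y_train))  # 0.5 — better than the naive baseline
```

A MASE of 0.5 here means the forecast's errors are half those of the naive "repeat the last value" baseline.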
In addition to these, evaluating a time series model also involves:
• Visual diagnostics of residuals: Residuals (prediction errors) should behave
like white noise with no obvious patterns. Plotting residuals and using
autocorrelation plots (ACF) helps in assessing model fit.
• Cross-validation adapted for time series: Techniques like rolling forecast
origin or walk-forward validation help ensure the model’s robustness on
unseen data respecting time order.
• Choice of metric depends on the problem context: For example, MAE is
straightforward and robust, RMSE penalizes large errors, and MAPE allows
percentage error interpretation. Model evaluation should align with the
business or research priorities, considering which types of errors are more
costly.
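Walk-forward (rolling-origin) validation can be sketched in a few lines; here a naïve last-value forecast stands in for the model being evaluated:

```python
import numpy as np

def walk_forward_validation(series, n_test, forecast_fn):
    """Rolling-origin evaluation: at each step, forecast the next point
    from all history observed so far, respecting time order."""
    errors = []
    for i in range(len(series) - n_test, len(series)):
        history, actual = series[:i], series[i]
        pred = forecast_fn(history)   # refit/forecast on past data only
        errors.append(abs(actual - pred))
    return float(np.mean(errors))

def naive(history):
    # Naive baseline: predict the last observed value
    return history[-1]

series = np.array([10.0, 12.0, 11.0, 13.0, 14.0, 15.0])
print(walk_forward_validation(series, 3, naive))
```

In practice `forecast_fn` would refit an ARIMA or Prophet model on `history` at each step; the structure of the loop stays the same.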
Thus, evaluating time series forecasting models is a mix of quantitative error metrics and
diagnostic checks to confirm that the model accurately captures data patterns and
produces reliable future predictions.

5.3 Integrating Machine Learning Models


"Integrating machine learning models" refers to embedding trained ML models into
applications, systems, or pipelines so they can make predictions or decisions on new
data — in real time or in batch — within real-world environments such as web apps,
mobile apps, IoT systems, and business dashboards.

Deployment of machine learning models:


Machine learning deployment refers to the process of integrating a trained ML model into
a real-world environment where it can make predictions or decisions based on incoming
data. This step transforms your model from a research or development artifact into a
practical tool that influences business processes, user experiences, or automated systems.

Why is deployment important?


• It makes the model’s insights and predictions actionable in live environments.
• Enables automation, like real-time fraud detection, personalized
recommendations, or dynamic pricing.
• Allows continuous improvement through feedback loops and monitoring.
For example, a fraud detection model deployed on a payment platform can instantly
flag suspicious transactions, cutting losses and enhancing security.
Detailed Step-by-Step Process to Deploy ML Models
1. Develop and Train Models in a Controlled Environment
• This is where data scientists build models using historical or simulated
data.
• Multiple model versions or types (like decision trees, neural networks,
etc.) might be tested.
• The goal is to select a high-performing model based on accuracy,
precision, recall, or other relevant metrics.
• Tools such as Jupyter notebooks, Python/R environments, and ML
libraries like TensorFlow or Scikit-learn are commonly used.
2. Optimize and Test the Model Code
• Once the best model is identified, the accompanying code is cleaned and
optimized for production.
• This may involve rewriting code for efficiency, removing debug logs, and
ensuring scalability.
• Testing includes unit tests (checking small pieces of functionality),
integration tests (ensuring components work together), and load tests
(seeing how the system behaves under stress).
• Automated testing pipelines (CI/CD — Continuous
Integration/Continuous Deployment) are often set up so updates can be
tested and released reliably.
3. Preparing for Containerization
• Containerization packages the model and all its dependencies into a
portable container, using technologies like Docker.
• Containers ensure consistency — the model behaves the same across
different environments (development, staging, production).
• Containers facilitate easy deployment, upgrades, scaling, and rollbacks
since the environment is isolated and reproducible.
• In large-scale systems, orchestration tools like Kubernetes manage these
containers, handling load balancing, scaling, and fault tolerance.
4. Continuous Monitoring and Maintenance
• Models deployed in production must be continuously monitored to make
sure they maintain quality.
• Data drift (changes in input data patterns) can degrade model
performance, so feedback loops and alerts are necessary.
• Monitoring tracks metrics like prediction accuracy, latency, throughput,
and failure rates.
• Retraining or fine-tuning models with fresh data is often required to
respond to changes in underlying patterns.
• Maintenance also includes updating dependencies, patching security
vulnerabilities, and ensuring compliance with regulations.
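As one illustration of the drift monitoring described in step 4, the sketch below flags a mean shift in a single feature. This is a crude heuristic for illustration only — a real deployment would use a formal statistical test or a dedicated monitoring library:

```python
import numpy as np

def detect_mean_drift(reference, live, threshold=3.0):
    """Flag drift when the live feature mean shifts more than `threshold`
    standard errors from the reference (training-time) mean."""
    se = reference.std(ddof=1) / np.sqrt(len(live))
    z = abs(live.mean() - reference.mean()) / se
    return z > threshold

rng = np.random.default_rng(42)
reference = rng.normal(0.0, 1.0, size=1000)    # feature values seen at training
live_shifted = rng.normal(1.0, 1.0, size=200)  # incoming data, mean has drifted
print(detect_mean_drift(reference, live_shifted))  # True: drift detected
```

A monitor like this would run on a schedule over each input feature, triggering an alert (and possibly retraining) when drift is flagged.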
In-Depth Look at Deployment Strategies
• Shadow Deployment
• The new model runs alongside the live (old) model without influencing
outcomes.
• Input data is fed to both models simultaneously; only the existing model’s
predictions affect users.
• The new model’s output is logged and analyzed offline to identify any
issues before switching.
• This helps avoid deployment risks and uncovers subtle performance
problems.
• Canary Deployment
• The new model is rolled out to a small subset of users first (e.g., 5-10%).
• This controlled exposure allows observing system behavior and user
impact in a limited fashion.
• If the new model performs well, its user coverage is gradually increased.
• This strategy supports quick rollbacks if problems are detected, mitigating
risk.
• A/B Testing
• Different user groups receive different model variants simultaneously.
• Their performance is compared based on key metrics (conversion rate,
accuracy, engagement).
• This experimental approach provides empirical evidence of which model
works best.
• It’s valuable for testing new algorithms, user interface changes, or feature
sets.
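Both canary and A/B rollouts need a stable way to split traffic. One common approach — sketched here with a hypothetical `route_to_canary` helper — hashes a user identifier so each user consistently sees the same model variant across requests:

```python
import hashlib

def route_to_canary(user_id, canary_fraction=0.1):
    """Deterministically assign a stable slice of users to the new model.
    Hashing keeps each user's assignment consistent across requests."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < canary_fraction * 100

users = [f"user-{i}" for i in range(1000)]
share = sum(route_to_canary(u) for u in users) / len(users)
print(f"canary share: {share:.2%}")  # roughly 10% of users
```

Increasing `canary_fraction` gradually widens the rollout without reshuffling users who are already on the new model.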
Tools and Platforms: Their Roles and Strengths
• Kubernetes
• An open-source platform for automating container deployment, scaling,
and management.
• It abstracts the complexity of running containers across many machines.
• Kubernetes handles automatic load balancing, failover, and resource
optimization.
• For ML, it ensures your model-serving services stay available and
responsive under varying loads.
• Kubeflow
• A machine learning toolkit built on top of Kubernetes.
• It simplifies ML workflows — from data ingestion, training, to
deployment.
• Provides components for hyperparameter tuning, experiment tracking,
and pipeline automation.
• Ideal for organizations scaling up ML production at enterprise level.
• MLflow
• Focuses on the ML lifecycle management.
• Tracks experiments to compare different model runs.
• Stores models and dependencies for easy deployment to various platforms.
• Supports versioning and reproducibility crucial for collaboration and
compliance.
• TensorFlow Serving
• Specifically optimized for serving TensorFlow models.
• Built to handle high throughput and low latency requests.
• Supports model version management, enabling seamless upgrades
without downtime.
• Integrates easily with TensorFlow’s ecosystem.
Best Practices for Successful Deployment
• Automated Testing
• Automate tests to catch issues early — reduces manual errors.
• Include tests for data validation, model logic, integration with other
services.
• Version Control
• Track changes in code, data, and models rigorously.
• Enables reproducibility, collaborative development, and traceability.
• Facilitates rollback to previous stable versions if needed.
• Security
• Protect models from unauthorized access, tampering, or data leakage.
• Secure data pipelines, endpoints, and storage.
• Implement role-based access controls, encryption, and regular audits.

Building web applications with flask and Django


When building web applications to deploy machine learning models, Flask and Django
are the two most popular Python web frameworks you can consider. Both let you build
web APIs or user-facing apps where your ML models can make predictions dynamically.
Flask and Django in ML Web Apps: Key Points
• Flask is a lightweight, micro-framework that is easy to learn and flexible. It’s well-
suited for simpler or smaller machine learning applications where you want full
control over components like routing, database, and authentication. Flask doesn’t
include built-in database ORM or user management, so you add extensions as
needed. Its simplicity makes Flask ideal for beginners or when you want to quickly
create a REST API to serve your ML model predictions. Flask is popular with
companies like Netflix and Reddit for these reasons.
• Django is a more full-featured, batteries-included web framework that follows the
Model-View-Controller (MVC) pattern. It comes with built-in ORM, authentication,
form handling, and admin interface. This makes Django a great choice for larger
or more complex applications that require features like user management, admin
dashboards, form validation, or integrating an ML model into a broader system.
Django is favored by companies like Instagram and Pinterest.

Steps to Build ML Web App (Common for Both)


1. Train and Save the ML Model
o Use sklearn, xgboost, tensorflow, etc.
o Save model:
import joblib
joblib.dump(model, "model.pkl")
2. Set Up Web App (Flask or Django)
3. Load Model in Backend & Create Predict View
4. Build HTML Templates or APIs for Input & Output
5. Deploy the App
• Options: Heroku, Render, AWS, PythonAnywhere, Docker

Example 1: Flask + ML (Iris Flower Predictor)


app.py – Flask backend

from flask import Flask, request, render_template
import joblib
import numpy as np

app = Flask(__name__)
model = joblib.load("iris_model.pkl")

@app.route('/')
def home():
    return render_template('index.html')

@app.route('/predict', methods=['POST'])
def predict():
    # Assumes form fields arrive in the feature order the model expects
    features = [float(x) for x in request.form.values()]
    prediction = model.predict([features])
    return render_template('index.html', prediction_text=f'Predicted: {prediction[0]}')

if __name__ == "__main__":
    app.run(debug=True)

templates/index.html
<form method="POST" action="/predict">
Sepal Length: <input name="sepal_length"><br>
Sepal Width: <input name="sepal_width"><br>
Petal Length: <input name="petal_length"><br>
Petal Width: <input name="petal_width"><br>
<input type="submit" value="Predict">
</form>
<p>{{ prediction_text }}</p>
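The predict view above converts each form value with `float(x)`, which raises a `ValueError` on malformed input. A small defensive parser — a sketch with a hypothetical `parse_features` helper, framework-agnostic because `request.form` behaves like a mapping — might look like:

```python
def parse_features(form, fields):
    """Convert form values to floats, reporting which fields are invalid.
    `form` is any mapping of field name -> string (e.g. request.form)."""
    features, errors = [], []
    for name in fields:
        raw = form.get(name, '').strip()
        try:
            features.append(float(raw))
        except ValueError:
            errors.append(name)
    return features, errors

fields = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
ok, bad = parse_features({'sepal_length': '5.1', 'sepal_width': '3.5',
                          'petal_length': '1.4', 'petal_width': '0.2'}, fields)
print(ok)   # [5.1, 3.5, 1.4, 0.2]
print(bad)  # []
```

Inside the view, a non-empty `errors` list would be rendered back to the user instead of calling `model.predict`.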

Example 2: Django + ML (House Price Predictor)


Setup
django-admin startproject mlapp
cd mlapp
python manage.py startapp predictor

In predictor/views.py:
from django.shortcuts import render
import joblib

model = joblib.load("price_model.pkl")

def home(request):
    if request.method == 'POST':
        sqft = float(request.POST['sqft'])
        rooms = int(request.POST['rooms'])
        result = model.predict([[sqft, rooms]])
        return render(request, 'index.html', {'result': result[0]})
    return render(request, 'index.html')

In urls.py:
from django.urls import path
from predictor import views

urlpatterns = [
    path('', views.home),
]

Template: index.html
<form method="POST">
{% csrf_token %}
Area in Sqft: <input name="sqft"><br>
Number of Rooms: <input name="rooms"><br>
<input type="submit" value="Predict Price"><br>
</form>
{% if result %}
<h3>Predicted Price: ₹{{ result }}</h3>
{% endif %}

Case Studies on Real-World Applications of Machine Learning


Machine Learning (ML) is widely applied in real-world systems to make
intelligent decisions, automate processes, and improve efficiency based on
historical data. From predicting diseases to driving cars autonomously, ML is
changing industries.

1. Healthcare: Diabetic Retinopathy Detection by Google DeepMind


• Problem: Early diagnosis of diabetic retinopathy (an eye disease) requires expert
analysis of retinal images, a time-consuming and specialist-dependent process.
• Solution: DeepMind developed a deep learning model that analyzes optical
coherence tomography (OCT) and fundus images to detect diabetic retinopathy
automatically.
• Implementation: The model was trained on a large, labeled dataset of eye scans
from clinical partners, learning to identify subtle disease markers that even
specialists find challenging.
• Results: The model achieved diagnostic accuracy on par with ophthalmologists,
accelerating diagnosis and enabling early interventions. This is especially critical
in underserved regions lacking specialists, expanding scalable access to eye care.
2. Financial Services: Fraud Detection at PayPal
• Problem: Online payment platforms like PayPal are vulnerable to real-time fraud
including identity theft and unauthorized transactions.
• Solution: PayPal uses ML models that analyze transaction data and user behavior
patterns to detect fraudulent activities dynamically.
• Implementation: The system processes millions of transactions in real-time,
adapting continuously to new fraud patterns by retraining and updating the
model.
• Results: Fraud detection rates improved by up to 30-50%, significantly lowering
losses and false positives, which improves legitimate user experience without
unnecessary transaction blocks.
3. E-Commerce: Personalized Recommendations at Amazon
• Problem: With thousands of products, personalizing recommendations improves
customer engagement and sales.
• Solution: Amazon deploys machine learning algorithms, including collaborative
filtering and deep learning, to analyze user behavior such as purchase history,
search queries, and browsing patterns.
• Implementation: These recommendations appear throughout the platform—in
search results, promotional emails, and advertisements—making the shopping
experience tailored to individual preferences.
• Results: Personalized recommendations increased conversion and sales by
around 15-30%, driving customer retention and satisfaction.
4. Manufacturing & Infrastructure: Predictive Maintenance at General Electric (GE)
• Problem: Unplanned machinery failures cause costly downtime on factory floors.
• Solution: ML models analyze sensor data streams to predict equipment failures
before they occur.
• Implementation: Continuous monitoring of operational metrics feeds into
predictive algorithms that schedule maintenance proactively.
• Results: This approach reduced machine downtime by about 25%, optimizing
maintenance schedules and saving millions in operational costs.
5. Autonomous Driving: Tesla Autopilot
• Problem: Enhance vehicle safety and convenience by automating driving tasks.
• Solution: Tesla’s Autopilot uses ML models to process data from cameras, radar,
and sensors, enabling functions like auto-steering, lane keeping, and adaptive
cruise control.
• Implementation: The model improves continuously using fleet data collected
from millions of vehicles, delivered as over-the-air updates.
• Results: While requiring driver supervision, Autopilot reduces accidents caused
by human error and improves driving comfort.
6. Entertainment: Content Recommendations by Netflix
• Problem: Keeping subscribers engaged by recommending personalized content.
• Solution: Netflix applies ML to analyze vast user behavior data—viewing history,
ratings, and search patterns—to predict what shows or movies users might like.
• Implementation: Recommendations are dynamically integrated into the user
interface, influencing browsing and viewing habits.
• Results: These tailored experiences minimize churn, boost viewing time, and
improve overall user satisfaction.
7. Data Center Efficiency: Cooling Optimization by Google DeepMind
• Problem: Data centers consume enormous energy for cooling, incurring high
costs and environmental impact.
• Solution: ML models forecast cooling demand using historical data and
environmental conditions to optimize system settings dynamically.
• Implementation: The model is integrated with data center controls to adjust
cooling in real time for maximum efficiency.
• Results: Google achieved up to 40% reduction in energy used for cooling,
significantly lowering operational expenses and carbon footprint.
8. Real Estate: Automated Valuation by Zillow
• Problem: Buyers and sellers need accurate, rapid home price estimates.
• Solution: Zillow’s “Zestimate” uses ML models that analyze property features,
sales history, geographic data, and market trends to estimate home values.
• Implementation: Continuous data ingestion and model retraining ensure
valuations stay current and relevant.
• Results: Users get reliable price guidance, improving decision-making in real
estate transactions.
9. Music Streaming: Personalized Playlists at Spotify
• Problem: Enhance user retention by recommending music aligned with
individual tastes.
• Solution: ML algorithms analyze listening behaviors and preferences using
collaborative filtering and content-based methods.
• Implementation: Recommendations update dynamically, driving playlist
creation and music discovery.
• Results: Increased user engagement and more frequent listening sessions.

10. Ride-Sharing: Demand Prediction and Driver Allocation at Uber


• Problem: Reduce rider wait times and improve driver utilization.
• Solution: Uber uses ML to forecast rider demand by region and time, adjusting
driver incentives and positioning accordingly.
• Implementation: The model factors in historical ride data, weather, events, and
traffic patterns for predictive accuracy.
• Results: Average wait times decreased by 15%, driver earnings increased, and
rider experience improved noticeably.
11. Agriculture: Crop Yield Improvement by Bayer
• Problem: Optimize agricultural practices to increase yields and sustainability.
• Solution: ML platforms analyze satellite imagery, soil data, and weather forecasts
to provide specific planting, fertilization, and irrigation advice.
• Implementation: Farmers receive customized recommendations tailored to local
conditions and crop types.
• Results: Crop yields improved by up to 20%, water and chemical use were
reduced, supporting sustainable farming practices.

Additional Industry Use Cases


• Retail Inventory Management: ML predicts demand patterns to avoid
overstocking and stockouts, reducing costs and improving sales.
• Telecommunications Customer Feedback: Natural language processing (NLP)
models classify and summarize customer issues, aiding quicker service
resolutions.
• Traffic Congestion Prediction: ML models forecast real-time traffic to enable
dynamic rerouting and improved flow.
• Food Delivery Route Optimization: Genetic algorithms optimize delivery
schedules lowering costs and times.
