
UNIT-3: Model Evaluation, Data Visualization and Management

1. Accuracy
Accuracy is one of the most commonly used metrics to evaluate the performance
of a classification model. It tells us how many predictions the model got correct
out of all the predictions it made. Accuracy is especially useful when the dataset is
balanced (i.e., when the number of samples in each class is roughly equal).

How Accuracy is calculated

Accuracy is calculated using the formula:

Accuracy= (Number of Correct Predictions / Total Number of Predictions) × 100

Example of Accuracy Calculation

Imagine you have a model that classifies emails as either "Spam" or "Not Spam."
You test the model on 100 emails, and it correctly classifies 90 of them.

Using the formula: Accuracy = (90 / 100) × 100 = 90%

So, the model has 90% accuracy, meaning it correctly predicted 90% of the
emails.

Confusion Matrix and Accuracy

To understand accuracy better, let’s look at the confusion matrix, which breaks
down predictions into four categories:

 True Positive (TP): Correctly predicted spam emails


 True Negative (TN): Correctly predicted not spam emails
 False Positive (FP): Emails wrongly classified as spam (but they are not)
 False Negative (FN): Emails wrongly classified as not spam (but they are
spam)

Now, accuracy can be written as: Accuracy = (TP + TN) / (TP + TN + FP + FN)

This means accuracy is the total number of correct predictions (TP + TN) divided
by the total number of predictions made.
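
As a quick illustration, here is a minimal Python sketch of this formula. The individual TP/TN/FP/FN counts below are assumed values chosen so the totals match the 100-email example (90 correct):

# Assumed confusion-matrix counts for the spam example (illustrative only)
TP, TN, FP, FN = 45, 45, 5, 5

accuracy = (TP + TN) / (TP + TN + FP + FN)
print(f"Accuracy: {accuracy:.0%}")  # prints 90%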

When Accuracy is misleading: Accuracy works well when the dataset is balanced, meaning the number of positive and negative samples is nearly equal. However, it can be misleading when the dataset is imbalanced (one class is much larger than the other).
Example of Accuracy Being Misleading

Imagine a dataset where 95% of emails are "Not Spam" and only 5% are
"Spam". If a model predicts "Not Spam" for every email, its accuracy will be
95%, even though it never actually identified any spam emails.

2. Precision

Precision is a classification metric that tells us how many of the positive predictions made by a model are actually correct. It is especially important when false positives (incorrectly predicting something as positive) need to be minimized.

Understanding Precision with an Example


Imagine you have a model that predicts whether an email is Spam or Not Spam. If the model predicts an email as spam, but it’s actually a normal email, that’s a False Positive (FP). Precision helps answer the question: of all the emails the model predicted as spam, how many were actually spam?
Precision Formula

Precision is calculated using the formula: Precision = TP / (TP + FP)

Where:

 TP (True Positives) = Correctly predicted spam emails


 FP (False Positives) = Normal emails that were incorrectly classified as
spam

Example Calculation

Suppose a model predicts 20 emails as spam, but after checking manually:

 15 are actually spam (TP)


 5 are wrongly classified as spam (FP)

Using the formula: Precision = 15 / (15 + 5) = 15 / 20 = 0.75, or 75%

This means when the model predicts an email as spam, it is correct 75% of
the time.
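
The same calculation can be written as a short Python sketch using the counts from this example:

TP = 15  # emails correctly predicted as spam
FP = 5   # normal emails wrongly flagged as spam

precision = TP / (TP + FP)
print(f"Precision: {precision:.0%}")  # prints 75%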

Why is Precision Important?

Precision is useful when false positives are costly. For example:

 Medical Diagnosis: If a test predicts cancer, but it’s a false positive, it may
cause unnecessary stress and treatments.
 Spam Detection: If precision is low, important emails (not spam) might be
incorrectly classified as spam and lost.
 Fraud Detection: If a bank fraud detection system has low precision, many
normal transactions might be wrongly flagged, causing inconvenience to
users.

3. Recall

Recall is a classification metric that tells us how well a model identifies all the
actual positive cases in the dataset. It is important when false negatives
(missed positive cases) need to be minimized.

Understanding Recall with an Example

Imagine you have a model that predicts whether an email is Spam or Not Spam.
If the model fails to detect a spam email and classifies it as "Not Spam," that’s a
False Negative (FN).

Formula for Recall

Recall is calculated using the formula: Recall = TP / (TP + FN)

Where:

 TP (True Positives) = Correctly predicted spam emails


 FN (False Negatives) = Spam emails that were incorrectly classified as
"Not Spam"

Example Calculation

Let’s say there are 30 spam emails in total, but the model only identifies 20
correctly and misses 10.

Using the formula: Recall = 20 / (20 + 10) = 20 / 30 ≈ 0.67, or 67%

This means the model correctly identifies 67% of all actual spam emails, but
33% are missed.
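
A matching Python sketch with the counts from this example:

TP = 20  # spam emails the model caught
FN = 10  # spam emails the model missed (classified as Not Spam)

recall = TP / (TP + FN)
print(f"Recall: {recall:.0%}")  # prints 67%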

Why is Recall Important?

Recall is useful when missing a positive case is more dangerous than having
a false positive. Some examples include:

 Medical Diagnosis: If a test fails to detect a person with cancer (false negative), the disease might go untreated, leading to severe consequences.
 Fraud Detection: If a fraud detection system misses fraudulent transactions (false negatives), banks may lose money.
 Security Systems: If a face recognition system fails to detect an intruder (false negative), security is compromised.
4. F1-Score

F1-score is a classification metric that balances precision and recall into a single
number. It is especially useful when you need a trade-off between correctly
identifying positive cases (recall) and ensuring that the predicted positive cases
are actually correct (precision).

Why Do We Need F1-Score?

Sometimes, focusing only on precision or recall is not enough.

 If precision is high but recall is low, the model is very careful in predicting positives but misses many real positive cases.
 If recall is high but precision is low, the model detects most positive cases but also includes many false positives.

F1-score helps by finding a balance between the two.

Formula for F1-Score

The F1-score is the harmonic mean of precision and recall, calculated as:

F1=2× (Precision × Recall) / (Precision + Recall)

Where:

 Precision = How many of the predicted positive cases were actually correct
 Recall = How many of the actual positive cases were correctly identified

Example Calculation

Let’s say we have:

 Precision = 80% (0.8) (80% of predicted spam emails are actually spam)
 Recall = 60% (0.6) (60% of actual spam emails were detected)

Using the formula: F1 = 2 × (0.8 × 0.6) / (0.8 + 0.6) = 0.96 / 1.4 ≈ 0.6857

So, the F1-score is about 68.57%, which gives a single number balancing both precision and recall.
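
The same harmonic-mean calculation as a small Python sketch:

precision = 0.8
recall = 0.6

f1 = 2 * (precision * recall) / (precision + recall)
print(f"F1-score: {f1:.2%}")  # prints 68.57%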

When to Use F1-Score?

Use the F1-score when:

 The dataset is imbalanced (e.g., detecting rare diseases, fraud detection).
 Both false positives and false negatives are important to minimize.
 You need a single metric to evaluate a model instead of looking at precision and recall separately.
5. Area Under the Curve (AUC) in Classification Metrics

The Area under the Curve (AUC) is a metric used to evaluate the performance
of a classification model, especially in imbalanced datasets (where one class is
much smaller than the other). It measures how well the model can distinguish
between different classes (e.g., "Spam" vs. "Not Spam" or "Disease" vs. "No
Disease").

AUC is often used with the Receiver Operating Characteristic (ROC) curve, so
you will commonly see AUC-ROC as a term.

Understanding the ROC Curve

To understand AUC, we first need to know about the ROC curve (Receiver
Operating Characteristic curve).

A ROC curve is a graph that shows the trade-off between:

 True Positive Rate (Recall) – How many actual positive cases were
correctly predicted?
 False Positive Rate – How many negative cases were incorrectly predicted
as positive?

The ROC curve is drawn by plotting the True Positive Rate (y-axis) against the
False Positive Rate (x-axis) for different threshold values.

What Does AUC Measure?

 AUC (Area Under the Curve) represents the total area under the ROC curve.
 It tells us how good the model is at separating positive and negative cases.
 The higher the AUC, the better the model is at distinguishing between classes.
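
In practice, AUC is rarely computed by hand. A minimal scikit-learn sketch is shown below; the labels and scores are made-up values for illustration, not data from this unit:

from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels (1 = spam) and model scores (probability of spam)
y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90, 0.05, 0.70]

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points on the ROC curve
auc_value = roc_auc_score(y_true, y_score)         # area under that curve
print(f"AUC: {auc_value:.2f}")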

Q.2 Data Visualization and Communication

Data visualization and communication play crucial roles in the field of machine
learning. Effective visualization not only helps in understanding and exploring the
data but also aids in communicating the results and insights derived from machine
learning models. Data visualization is the representation of data through the use of common graphics, such as charts, plots, infographics, and even animations. These visual displays of information communicate complex data relationships and data-driven insights in a way that is easy to understand.

Principles of Effective Data Visualization

1. Understand the Problem Domain: Gain a deep understanding of the problem domain and the goals of your machine learning project to create visualizations that are relevant and impactful.

2. Visualize Model Performance: Use visualizations to communicate the performance of machine learning models. ROC curves, precision-recall curves, and confusion matrices are particularly useful for classification models (a small sketch appears after this list).

3. Feature Importance: Visualize feature importance to communicate the contribution of each feature to the model's predictions. This can help stakeholders understand the factors driving the model's decisions.

4. Model Interpretability: Employ visualizations that enhance the interpretability of complex models. Techniques like partial dependence plots, SHAP (SHapley Additive exPlanations), and LIME (Local Interpretable Model-agnostic Explanations) can be effective.

5. Evaluate Bias and Fairness: Use visualizations to assess and communicate potential biases in the model predictions. Visualize performance metrics across different demographic groups to identify and address fairness concerns.

6. Dynamic Model Exploration: Create interactive visualizations that allow users to explore model predictions, understand decision boundaries, and investigate how changes in input features impact the output.

7. Visualizing Time Series and Temporal Patterns: If working with time-series data,
use visualizations such as line charts, stacked area charts, or heat maps to
highlight temporal patterns and trends.

8. Ensemble Model Visualization: When using ensemble models, visualize the combined output of multiple models. Understanding the decision-making process of an ensemble can be valuable.

9. Uncertainty Visualization: If your model provides uncertainty estimates (e.g., Bayesian models), visualize the uncertainty to convey the level of confidence in predictions.

10. Data Quality and Preprocessing: Use visualizations to explore the distribution
of your input features, identify outliers, and assess the effectiveness of
preprocessing steps.

11. Interactive Reporting for Stakeholders: Create interactive reports and dashboards to present machine learning results to stakeholders. Allow them to interact with the data and model outputs for deeper insights.

12. Avoid Overfitting: Visualize learning curves to check for overfitting or underfitting. Understanding the model's behavior during training can inform adjustments to hyperparameters.

13. Communication with Domain Experts: Collaborate with domain experts to create visualizations that are meaningful and align with the domain-specific understanding of the problem.
14. Version Control for Visualizations: If your machine learning project involves
multiple iterations or models, consider version control for visualizations to track
changes and improvements over time.
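
As noted in point 2, ROC curves and confusion matrices are the workhorse performance visualizations for classifiers. A minimal sketch with matplotlib and scikit-learn follows; the labels, scores, and 0.5 threshold are assumptions for illustration:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc, confusion_matrix

# Hypothetical labels and scores for illustration only
y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.2, 0.4, 0.6, 0.8, 0.1, 0.9, 0.3, 0.7]
y_pred  = [1 if s >= 0.5 else 0 for s in y_score]  # hard predictions at a 0.5 threshold

# ROC curve
fpr, tpr, _ = roc_curve(y_true, y_score)
plt.plot(fpr, tpr, label=f"ROC curve (AUC = {auc(fpr, tpr):.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="Random guess")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (Recall)")
plt.legend()
plt.show()

# Confusion matrix drawn as a simple colored grid
cm = confusion_matrix(y_true, y_pred)
plt.imshow(cm, cmap="Blues")
plt.colorbar()
plt.xlabel("Predicted class")
plt.ylabel("Actual class")
plt.show()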

Q.3 Types of Visualizations – Explained Simply

Data visualization helps present complex data in an easy-to-understand way using different types of
charts and graphs. Choosing the right type of visualization depends on what you want to communicate.
Below are some common types of visualizations, explained simply.

1. Bar Chart

A bar chart is used to compare different categories. It consists of rectangular bars, where the length of each bar represents the value of a category. The bars can be arranged either vertically or horizontally.

Example: Imagine you run a store and want to compare sales of different
products. A bar chart can show which product sold the most and which one sold
the least.
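
A minimal matplotlib sketch of the store example; the product names and sales figures are invented:

import matplotlib.pyplot as plt

products = ["Shirts", "Shoes", "Hats", "Bags"]  # hypothetical product categories
sales    = [120, 95, 40, 60]                    # hypothetical units sold

plt.bar(products, sales)
plt.xlabel("Product")
plt.ylabel("Units sold")
plt.title("Sales by product")
plt.show()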

2. Line Chart

A line chart is used to show trends over time. It connects data points with a line,
making it easy to see how values increase or decrease over a period.

Example: A company wants to track its monthly revenue for the past year. A line
chart will clearly show whether the revenue is going up, down, or staying the
same.

3. Pie Chart

A pie chart is used to show proportions or percentages. It divides data into slices,
where each slice represents a portion of the whole.

Example: If you survey 100 people about their favourite fruit and 50% say apples,
30% say bananas, and 20% say oranges, a pie chart will show how big each group
is in relation to the whole.

4. Scatter Plot

A scatter plot is used to show the relationship between two variables. Each data
point is plotted on a graph, and the pattern of the points helps identify
correlations.

Example: A researcher wants to see if there is a relationship between hours of study and test scores. If students who study more tend to get higher scores, the scatter plot will show a positive correlation.
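
A minimal matplotlib sketch of the study-hours example, with invented data points:

import matplotlib.pyplot as plt

hours_studied = [1, 2, 3, 4, 5, 6, 7, 8]          # hypothetical study hours
test_scores   = [52, 55, 61, 64, 70, 74, 78, 85]  # hypothetical test scores

plt.scatter(hours_studied, test_scores)
plt.xlabel("Hours of study")
plt.ylabel("Test score")
plt.title("Study time vs. test score")
plt.show()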

5. Histogram

A histogram is similar to a bar chart, but it is used to show the distribution of numerical data. It groups data into ranges (or bins) and shows how many values fall into each range.

Example: A teacher wants to analyse students' test scores. A histogram can show
how many students scored between 50-60, 60-70, 70-80, and so on.
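
A minimal matplotlib sketch of the test-score example; the scores and bin edges are assumed:

import matplotlib.pyplot as plt

scores = [48, 55, 57, 62, 64, 66, 71, 73, 75, 78, 82, 88, 91]  # hypothetical scores

# The bins define the score ranges: 40-50, 50-60, ..., 90-100
plt.hist(scores, bins=[40, 50, 60, 70, 80, 90, 100], edgecolor="black")
plt.xlabel("Score range")
plt.ylabel("Number of students")
plt.title("Distribution of test scores")
plt.show()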
6. Heatmap

A heatmap uses colors to represent data values. Darker or brighter colors usually
indicate higher values, while lighter colors indicate lower values.

Example: A website owner wants to see where visitors click the most. A heatmap
can show which parts of the website get the most interaction based on color
intensity.
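
A minimal matplotlib sketch of a heatmap; the click counts for the page regions are invented:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical click counts for a 4 x 3 grid of page regions
clicks = np.array([[120, 30, 10],
                   [200, 80, 25],
                   [ 90, 40, 15],
                   [ 30, 10,  5]])

plt.imshow(clicks, cmap="hot")
plt.colorbar(label="Number of clicks")
plt.title("Website click heatmap")
plt.show()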

7. Box Plot (Box-and-Whisker Plot)

A box plot is used to show the distribution and variability of data. It highlights the
median, quartiles, and any outliers.

Example: A scientist studying rainfall over several years can use a box plot to see
the spread of data, the average rainfall, and any extreme values.

8. Bubble Chart

A bubble chart is like a scatter plot but adds a third dimension by using the size of
the bubbles to represent another variable.

Example: A business wants to analyse sales revenue (X-axis), customer satisfaction (Y-axis), and number of stores (bubble size). A bubble chart can represent all three factors in one graph.

9. Tree Map

A tree map is used to display hierarchical data using nested rectangles. The size of
each rectangle represents a proportion of the whole.

Example: A company’s budget can be displayed using a tree map to show how
much money is allocated to different departments like marketing, sales, and IT.

10. Gantt Chart

A Gantt chart is used to visualize project schedules. It shows tasks along a timeline, indicating start and end dates.

Example: A project manager plans a software development project. A Gantt chart shows when different tasks will start, how long they will take, and if there are any overlaps.

Q.4 Data Management Activities

Data management is the process of collecting, storing, organizing, and maintaining data so that it can be used effectively. Proper data management ensures that data is accurate, secure, and accessible when needed.

There are several key activities involved in managing data, each playing an
important role in handling and utilizing data efficiently. Let’s go through them one
by one.

1. Data Collection
Data collection is the process of gathering data from different sources. This data
can come from surveys, databases, social media, sensors, or business
transactions. The goal is to collect reliable and relevant data that can be used
for decision-making.

For example, an e-commerce company collects data on customer purchases, website visits, and feedback to understand buying patterns.

2. Data Storage

Once data is collected, it needs to be stored securely. Storage methods depend on the type and size of data. Some common storage systems include:

 Databases (SQL, NoSQL) for structured data.


 Cloud storage (Google Drive, AWS, Azure) for large-scale access.
 Data warehouses for business analytics.

For example, banks store customer transactions in databases to track spending and detect fraud.

3. Data Organization

Data must be properly structured and categorized so that it can be easily retrieved
and analyzed. This involves:

 Creating consistent formats (dates, names, currencies).


 Using metadata (descriptions, tags) for easy searching.
 Grouping related data into tables or files.

For example, in a hospital, patient records are categorized by name, age, medical
history, and doctor’s notes, making it easy to find information when needed.

4. Data Processing and Integration

Raw data is often messy and unstructured, so it needs to be processed before use.
This involves:

 Cleaning data (removing duplicates, fixing errors).


 Merging data from different sources.
 Standardizing formats (e.g., all dates in YYYY-MM-DD).

For example, a company might collect customer details from online forms and
phone calls. To analyze trends, they need to merge this data into one system
and remove errors.
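
A small pandas sketch of this kind of cleanup and merging; the file names and column names are hypothetical:

import pandas as pd

# Hypothetical exports from two collection channels
web_df   = pd.read_csv("web_form_customers.csv")
phone_df = pd.read_csv("phone_call_customers.csv")

# Merge the two sources into one table
customers = pd.concat([web_df, phone_df], ignore_index=True)

# Clean: drop duplicate records and standardize dates to YYYY-MM-DD
customers = customers.drop_duplicates(subset="email")
customers["signup_date"] = pd.to_datetime(customers["signup_date"]).dt.strftime("%Y-%m-%d")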

5. Data Security and Privacy


Data security is critical to protect information from theft, leaks, or
unauthorized access. Companies must follow data protection laws like GDPR
(Europe) and CCPA (California).

Security measures include:

 Encryption (converting data into unreadable format for security).


 Access control (only authorized users can access certain data).
 Regular security audits to detect weaknesses.

For example, online payment systems use encryption to protect customer credit
card details from hackers.

6. Data Backup and Recovery

Data loss can happen due to system crashes, cyberattacks, or human errors.
Having backup copies ensures that data can be recovered if something goes
wrong.

Common backup strategies include:

 Cloud backups (remote storage for easy recovery).


 On-site backups (physical storage like external hard drives).
 Automated backups (scheduled backups to prevent data loss).

For example, banks regularly back up customer data to avoid losing transaction
history in case of a system failure.

7. Data Analysis and Reporting

Once data is processed, it needs to be analyzed to extract useful insights. Companies use data analytics to identify patterns, trends, and opportunities.

Methods of data analysis include:

 Descriptive analytics (summarizing past data).


 Predictive analytics (forecasting future trends).
 Business intelligence (BI) tools (Tableau, Power BI).

For example, an online store analyzes customer purchase history to recommend similar products, improving sales.

8. Data Governance and Compliance

Data governance ensures that data is managed according to rules and policies
to maintain accuracy, security, and ethical use.

This involves:
 Setting standards for data quality.
 Ensuring compliance with legal regulations.
 Defining roles and responsibilities for data management.

For example, hospitals must follow HIPAA regulations to protect patient medical
records and ensure privacy.

Q.5 Data Pipelines – Explained in Detail

A data pipeline is a system that moves data from one place to another,
transforming and processing it along the way. It helps organizations automate the
flow of data so it can be collected, cleaned, analysed, and stored efficiently.

Imagine a water pipeline that carries water from a source (like a river) to a city’s
water supply. Similarly, a data pipeline carries data from different sources
(databases, APIs, sensors) to a storage system (data warehouse, data lake) and
processes it for analysis.

Why Are Data Pipelines Important?

Modern businesses generate massive amounts of data from different sources. Without a structured pipeline, managing this data manually would be inefficient and error-prone. Data pipelines help by:

 Automating data movement from sources to destinations.


 Ensuring data quality by cleaning and filtering errors.
 Improving efficiency by handling large amounts of data quickly.
 Enabling real-time analytics for fast decision-making.

For example, an e-commerce company uses a data pipeline to collect customer orders, clean the data, and send it to a dashboard for sales analysis.

Components of a Data Pipeline

A data pipeline consists of multiple stages, each playing a crucial role in processing data. Let’s break down these stages:

1. Data Ingestion (Collecting Data)

This is the first step, where data is collected from different sources. Common data
sources include:

 Databases (SQL, NoSQL)


 APIs (External data services)
 IoT sensors (Weather stations, smart devices)
 Logs (Website traffic, application logs)
 Streaming data (Social media, stock market feeds)

Example: An online retailer collects product sales data from its website, customer
app, and point-of-sale (POS) system.

2. Data Processing (Cleaning and Transforming Data)

Once data is collected, it often contains errors, duplicates, or missing values. This
stage cleans and transforms the data into a structured format for analysis.

 Cleaning: Removing duplicates, correcting errors, handling missing values.


 Transformation: Converting data into a usable format (e.g., changing text
to numerical values).
 Filtering: Keeping only relevant data (e.g., removing spam emails).

Example: A healthcare system collects patient data from different hospitals. The
pipeline standardizes patient names and formats dates correctly before storing the
data.

3. Data Storage (Saving Processed Data)

After processing, the cleaned data needs to be stored for future use. The choice of
storage depends on the use case:

 Data warehouses (Google BigQuery, Amazon Redshift) – For structured business analytics.
 Data lakes (Amazon S3, Azure Data Lake) – For raw and unstructured data.
 Databases (PostgreSQL, MySQL) – For transactional data storage.

Example: A streaming platform stores user watch history in a data warehouse to recommend movies based on past preferences.

4. Data Orchestration (Managing the Workflow)

Orchestration tools control the flow of data between different pipeline stages.
They schedule tasks, monitor errors, and automate processes.

Popular orchestration tools:

 Apache Airflow – For complex data workflows.


 AWS Step Functions – For cloud-based automation.
 Prefect – For modern, scalable pipelines.

Example: A marketing team uses Apache Airflow to automate daily reports by collecting ad performance data, cleaning it, and sending summaries via email.
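
For a sense of what this looks like, here is a bare-bones Airflow sketch (assuming Airflow 2.x; the DAG name, task names, and placeholder functions are invented, and the schedule argument may differ between Airflow versions):

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def collect_ad_data():
    ...  # placeholder: pull ad performance data from the ad platforms

def clean_data():
    ...  # placeholder: remove errors and duplicates

def send_summary_email():
    ...  # placeholder: email the daily summary

with DAG("daily_ad_report", start_date=datetime(2024, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    collect = PythonOperator(task_id="collect", python_callable=collect_ad_data)
    clean   = PythonOperator(task_id="clean", python_callable=clean_data)
    report  = PythonOperator(task_id="report", python_callable=send_summary_email)

    collect >> clean >> report  # run the three tasks in order, once per day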

5. Data Analysis and Consumption


The final stage is where users access and analyse the processed data. This can
be done through:

 Dashboards (Tableau, Power BI) – For business intelligence.


 Machine Learning Models – For predictive analytics.
 APIs – To serve data to applications.

Example: A ride-sharing app analyses user demand data to adjust pricing during
peak hours.
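
Putting the stages together, a deliberately tiny end-to-end pipeline can be sketched in plain Python with pandas and SQLite. The file name, table name, and column names below are all assumptions:

import sqlite3
import pandas as pd

# 1. Ingestion: read raw orders from a hypothetical CSV export
orders = pd.read_csv("raw_orders.csv")

# 2. Processing: drop duplicates and rows with a missing order amount
orders = orders.drop_duplicates().dropna(subset=["amount"])

# 3. Storage: write the cleaned data to a local SQLite database
conn = sqlite3.connect("analytics.db")
orders.to_sql("orders", conn, if_exists="replace", index=False)

# 4. Consumption: a simple daily-revenue summary for a dashboard or report
daily_revenue = pd.read_sql(
    "SELECT order_date, SUM(amount) AS revenue FROM orders GROUP BY order_date",
    conn,
)
print(daily_revenue.head())

Real pipelines add scheduling, monitoring, and error handling on top of these same steps, usually with an orchestration tool such as those listed above.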
