Lecture 3: Design of a ML System
1/6/2025
Machine Learning (CSC601B), By: Prof. (Dr.) Vineet Mehan
1. Problem Definition
• Clearly identify the problem to solve and its scope.
• Specify the input, output, and type of ML task (e.g., classification, regression, clustering).
• Objective: Predict whether a customer will churn (stop using a service) based on their usage patterns and demographics.
• Inputs: Customer attributes (age, location, subscription type) and behavioural data (session duration, payment history).
• Type of ML Task: Supervised binary classification.

2. Data Collection and Splitting
• Gather data relevant to the problem.
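A single labelled training example for this framing can be sketched as follows. This is a minimal sketch: the attribute names and values are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass

@dataclass
class CustomerRecord:
    # Input features: customer attributes and behavioural data
    # (hypothetical field names for illustration).
    age: int
    location: str
    subscription_type: str
    avg_session_minutes: float
    late_payments: int
    # Output label for supervised binary classification:
    # 1 = customer churned, 0 = customer stayed.
    churned: int

example = CustomerRecord(age=34, location="Delhi", subscription_type="basic",
                         avg_session_minutes=12.5, late_payments=3, churned=1)
print(example.churned)  # 1
```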
• Splitting: Split the dataset into 70% training, 15% validation, and 15% testing subsets.

Training Set
• Purpose: The training set is used to teach the model to identify patterns and learn from the data. It forms the foundation of the model's understanding of the problem.
• Size: Allocating 70% of the dataset ensures that the model has a sufficient amount of data to learn from, reducing the risk of underfitting (where the model doesn't learn enough).

Validation Set
• Purpose: The validation set is used to fine-tune the model. This subset helps:
  • Monitor the model's performance during training.
  • Tune hyperparameters (e.g., learning rate, number of layers).
  • Detect overfitting, which occurs when the model performs well on the training data but poorly on unseen data.
• Size: A 15% allocation provides a good balance to evaluate the model during training without sacrificing too much data from the training subset.

Testing Set
• Purpose: The testing set evaluates the model's performance on unseen data after training is complete. It gives an unbiased estimate of how the model will perform in real-world scenarios.
• Size: Reserving 15% ensures enough data to reliably assess the model's generalization capability.
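The 70/15/15 split described above can be sketched in plain Python as a shuffled index split. This is a minimal sketch; in practice a library helper such as scikit-learn's `train_test_split` would typically be used instead.

```python
import random

def split_70_15_15(rows, seed=42):
    """Shuffle the rows, then carve out 70% train, 15% validation, 15% test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n = len(shuffled)
    n_train = int(0.70 * n)
    n_val = int(0.15 * n)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# 100 dummy rows -> 70 / 15 / 15 examples.
train, val, test = split_70_15_15(range(100))
print(len(train), len(val), len(test))  # 70 15 15
```

Shuffling before slicing matters: if the raw data is ordered (e.g., by signup date), an unshuffled split would give the model a biased view of each subset.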
3. Model Selection
• Choose an algorithm suitable for the task and data type.
• Compare traditional ML models (e.g., Random Forest, SVM) and deep learning models (e.g., CNNs, RNNs).
• Select a baseline model for benchmarking.
• Algorithm: Start with Logistic Regression as a baseline due to its simplicity. Then move to Random Forest for better handling of mixed data types and non-linearity.
• Baseline Model: Logistic Regression to establish a minimum expected accuracy.
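A minimal baseline sketch with scikit-learn, assuming numeric, already-encoded features. The toy data here is synthetic and stands in for the real churn dataset.

```python
from sklearn.linear_model import LogisticRegression

# Synthetic, linearly separable toy data: one feature, churn if feature >= 5.
X = [[float(i)] for i in range(10)]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

baseline = LogisticRegression()
baseline.fit(X, y)

# The baseline's accuracy sets the minimum bar that a more complex
# model (e.g. Random Forest) must beat to justify its extra cost.
baseline_accuracy = baseline.score(X, y)
print(f"baseline accuracy: {baseline_accuracy:.2f}")
```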
4. Model Training
• Define the architecture or configuration of the selected model.
• Train the model using the training dataset and tune hyperparameters.
• Monitor metrics during training to avoid overfitting or underfitting.
• Model: Random Forest with the following hyperparameters:
  • Number of trees: 100.
  • Max depth: 10.
  • Minimum samples per leaf: 2.
• Training Process: Train on the 70% training set.
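The hyperparameters above map directly onto scikit-learn's `RandomForestClassifier`. The training data below is a synthetic stand-in for the 70% training split.

```python
from sklearn.ensemble import RandomForestClassifier

# Hyperparameters from the configuration above:
model = RandomForestClassifier(
    n_estimators=100,      # number of trees: 100
    max_depth=10,          # max depth: 10
    min_samples_leaf=2,    # minimum samples per leaf: 2
    random_state=42,       # fixed seed so the fit is reproducible
)

# Synthetic stand-in for the 70% training split (two numeric features).
X_train = [[float(i), float(i % 3)] for i in range(30)]
y_train = [1 if i >= 15 else 0 for i in range(30)]

model.fit(X_train, y_train)
print(f"training accuracy: {model.score(X_train, y_train):.2f}")
```

Capping `max_depth` and requiring at least 2 samples per leaf are both regularizers: they stop individual trees from memorizing the training data, which helps against overfitting.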
5. Evaluation
• Theory:
  • Use appropriate metrics to measure the model's performance.
  • Conduct cross-validation to ensure robustness.
  • Perform error analysis to identify areas for improvement.
• Let's use a simple example to explain cross-validation, specifically 3-fold cross-validation, with a small dataset.
• Dataset:
  • Imagine we have a dataset of 6 data points:
  • Data: [A, B, C, D, E, F]
  • Labels: [1, 1, 0, 0, 1, 0]
• Goal:
  • We want to evaluate a model's performance using cross-validation. We'll use 3-fold cross-validation, which means:
    1. The dataset will be split into 3 equal parts (folds).
    2. Each fold will take turns as the test set, while the other two are used as the training set.
• Step-by-Step Process:
  • Step 1: Split Data into 3 Folds
  • We divide the dataset into 3 parts (folds):
    • Fold 1: [A, B]
    • Fold 2: [C, D]
    • Fold 3: [E, F]
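The fold rotation for this 6-point example can be written out directly: each pass holds one fold out for testing and trains on the other two.

```python
data = ["A", "B", "C", "D", "E", "F"]
labels = [1, 1, 0, 0, 1, 0]

k = 3
fold_size = len(data) // k  # 6 points / 3 folds = 2 points per fold
folds = [list(range(i * fold_size, (i + 1) * fold_size)) for i in range(k)]
# folds -> [[0, 1], [2, 3], [4, 5]], i.e. [A,B], [C,D], [E,F]

splits = []
for held_out in range(k):
    test_idx = folds[held_out]
    # Training indices: every index not in the held-out fold.
    train_idx = [i for f in range(k) if f != held_out for i in folds[f]]
    splits.append((train_idx, test_idx))
    print(f"Fold {held_out + 1}: test={[data[i] for i in test_idx]}, "
          f"train={[data[i] for i in train_idx]}")
```

In a real pipeline the model would be re-trained from scratch on each `train_idx` subset and scored on the corresponding `test_idx` subset; the per-fold scores are then averaged.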
• Final Result:
  • The cross-validation process tells us the model's average accuracy is 50%. This is a more reliable estimate of the model's performance than using a single train-test split, as it tests the model on all parts of the dataset.
• Results:
  • Accuracy: 92%.
  • Precision: 85%.
  • Recall: 78%.
  • F1-score: 81%.

6. Deployment
• Theory:
  • Deploy the trained model into a production environment.
  • Make predictions accessible via APIs or integrated systems.
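Accuracy, precision, recall, and F1 all derive from confusion-matrix counts. A plain-Python sketch (the toy labels and predictions below are illustrative, not the churn model's actual output):

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
acc, prec, rec, f1 = classification_metrics(y_true, y_pred)
print(acc, prec, rec, f1)  # 0.75 0.75 0.75 0.75
```

For churn, recall matters particularly: a low recall (78% above) means some customers who will churn are missed, so no retention action is taken for them.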
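A deployment endpoint ultimately wraps the trained model behind a request/response boundary. A minimal JSON handler sketch: `churn_score` is a hypothetical stub standing in for a real trained model, and in production this handler would sit behind an HTTP framework such as Flask or FastAPI.

```python
import json

def churn_score(features: dict) -> float:
    # Hypothetical stand-in for calling predict_proba on a trained model.
    return 0.9 if features.get("late_payments", 0) > 2 else 0.1

def handle_predict(request_body: str) -> str:
    """Parse a JSON request body, score it, and return a JSON response."""
    features = json.loads(request_body)
    score = churn_score(features)
    return json.dumps({"churn_probability": score,
                       "will_churn": score >= 0.5})

response = handle_predict('{"age": 34, "late_payments": 3}')
print(response)
```

Keeping the parsing/serialization logic separate from the model call makes the handler easy to test without starting a web server.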
7. Scalability
• Design the system to handle growing amounts of data and users.
• Use techniques like caching, parallel processing, and distributed systems for scalability.

8. Ethical Considerations
• Bias Mitigation:
  • Check if the model unfairly predicts churn for specific demographics (e.g., age or location).
• Compliance:
  • Follow government regulations by anonymizing customer data.
  • Provide explanations for churn predictions using SHAP (SHapley Additive exPlanations) values to stakeholders.
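Of the scalability techniques listed above, caching is the simplest to sketch: repeated requests with identical features can be served without re-running the model. This uses Python's built-in `functools.lru_cache` as an in-process illustration; a production system might use an external cache such as Redis instead.

```python
from functools import lru_cache

@lru_cache(maxsize=10_000)
def cached_churn_score(age: int, subscription_type: str) -> float:
    # Stand-in for an expensive model inference call. Only hashable
    # feature tuples can be cached this way.
    return 0.8 if subscription_type == "basic" else 0.2

cached_churn_score(34, "basic")   # first call: computed (a cache miss)
cached_churn_score(34, "basic")   # second call: served from the cache (a hit)
print(cached_churn_score.cache_info())
```

Caching only helps when the same inputs recur and the model is static; after retraining and redeploying the model, the cache must be invalidated.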
Task
• Explain the steps involved in preprocessing data for a machine learning model. How would you handle missing values, categorical variables, and scaling for numerical features in a churn prediction system?
THANK YOU