0% found this document useful (0 votes)
504 views49 pages

AWS ML Engineer Associate Master Cheat Sheet

Uploaded by

filiphpop
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
504 views49 pages

AWS ML Engineer Associate Master Cheat Sheet

Uploaded by

filiphpop
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

SKILLCERTPRO

AWS ML Engineer Associate Master Cheat Sheet


Domain 1: Data Engineering
1.1: Create data repositories for ML.

Identify data sources


• Content and location: This step focuses on understanding the nature of your data
(e.g., text, images, numerical data) and where it originates. Data can come from
various sources, both internal and external to your organization.
• Primary sources (such as user data): Primary data sources are those that you collect
directly. Examples include:
o User data: Data generated by user interactions with your applications or
services (e.g., website clicks, app usage, purchase history).
o Sensor data: Data collected from physical devices (e.g., IoT sensors, industrial
equipment).
o Transactional data: Records of business transactions (e.g., sales, orders,
payments).
o Log files: Records of system events and activities.
Determine storage mediums
Choosing the right storage medium is essential for performance, cost-effectiveness, and
scalability. AWS offers a variety of storage services suitable for different ML data needs:
• Databases:
o Amazon RDS: A managed relational database service that supports various
database engines (e.g., MySQL, PostgreSQL, SQL Server). Suitable for
structured data with well-defined schemas.
o Amazon DynamoDB: A NoSQL database service that provides high
performance and scalability. Ideal for key-value and document data.
• Amazon S3: An object storage service that offers high scalability, durability, and
availability. Well-suited for storing large datasets, including raw data, processed data,
and model artifacts.
• Amazon EFS: A scalable file storage service that can be mounted on multiple EC2
instances. Useful for sharing data between ML training instances.
• Amazon EBS: Block storage volumes that can be attached to EC2 instances. Provides
high performance for workloads that require low latency access to data.
Key considerations for choosing storage mediums:
• Data type: Structured, semi-structured, or unstructured.
• Data volume: The amount of data you need to store.
• Access patterns: How frequently and how quickly you need to access the data.
• Cost: The cost of storage and data transfer.

pg. 1
SKILLCERTPRO

• Performance: The required throughput and latency.


• Scalability: The ability to scale storage capacity as needed.
• Security: Protecting sensitive data with appropriate security measures.

1.2: Identify and implement a data ingestion solution.

Data Job Styles


Data job styles refer to the overall approach and methodology used for processing and
managing data. In the context of machine learning, two primary styles are relevant:
• Batch Processing:
o Concept: Batch processing involves collecting data over a period,
accumulating it into a batch, and then processing the entire batch at once.
This is suitable for large datasets where real-time processing is not required.
o Characteristics:
▪ High throughput: Can process large volumes of data efficiently.
▪ Latency: There's a delay between data arrival and processing.
▪ Cost-effective for large datasets.
o Use Cases in ML:
▪ Training machine learning models on historical data.
▪ Generating reports and performing data analysis on large datasets.
▪ Data preprocessing and feature engineering on static datasets.
o AWS Services:
▪ Amazon S3: For storing batch data.
▪ AWS Glue: For ETL (Extract, Transform, Load) operations.
▪ Amazon EMR: For big data processing using Hadoop and Spark.
▪ AWS Batch: For running batch computing jobs.
• Streaming Processing:
o Concept: Streaming processing involves continuously processing data as it
arrives in real-time or near real-time. This is crucial for applications that
require immediate insights and actions based on incoming data.
o Characteristics:
▪ Low latency: Processes data with minimal delay.
▪ Continuous processing: Handles data streams without waiting for
batches.
▪ Suitable for real-time applications.
o Use Cases in ML:
▪ Real-time fraud detection.
▪ Monitoring sensor data for anomalies.

pg. 2
SKILLCERTPRO

▪ Personalized recommendations in real-time.


▪ Real-time analytics and dashboards.
o AWS Services:
▪ Amazon Kinesis: For ingesting and processing streaming data.
▪ Amazon Kinesis Data Analytics: For running real-time analytics on
streaming data.
▪ Amazon Kinesis Data Firehose: For loading streaming data into data
stores.
Data Job Types
Data job types refer to the specific tasks performed on data within a particular job style.
Here are some common types:
• Data Ingestion: The process of collecting data from various sources and bringing it
into a system for processing.
• Data Transformation: The process of cleaning, transforming, and preparing data for
analysis or machine learning. This includes tasks like data cleaning, data integration,
data normalization, and feature engineering.
• Data Analysis: The process of examining data to extract insights, patterns, and
trends.
• Model Training: The process of using data to train machine learning models.
• Model Inference: The process of using trained models to make predictions on new
data.
Why Data Ingestion Pipelines are Crucial for ML
Machine learning models are data-hungry. Efficient and reliable data ingestion pipelines are
essential for:
• Continuous Training: Regularly feeding new data to models to maintain accuracy and
adapt to changes.
• Real-time Predictions: Enabling models to make predictions on streaming data with
minimal latency.
• Scalability: Handling large volumes of data from various sources.
• Data Variety: Processing diverse data formats (structured, semi-structured,
unstructured).
AWS Services for Data Ingestion
Here's a detailed look at the AWS services you listed:
1. Amazon Kinesis
o Purpose: A family of services for real-time processing of streaming data at
massive scale.
o Key Services:
▪ Kinesis Data Streams: For capturing and storing streams of data. Think
of it as a continuous flow of data records.

pg. 3
SKILLCERTPRO

▪ Kinesis Data Firehose: For reliably loading streaming data into data
lakes, data stores, and analytics tools.
▪ Kinesis Data Analytics: For processing and analyzing streaming data
using SQL or Apache Flink.
o Use Cases for ML:
▪ Ingesting real-time sensor data for predictive maintenance.
▪ Capturing website clickstreams for personalized recommendations.
▪ Processing log data for anomaly detection.
2. Amazon Data Firehose
o Purpose: The easiest way to load streaming data into data lakes and data
stores.
o Key Features:
▪ Fully managed: No infrastructure to manage.
▪ Automatic scaling: Scales to match data volume.
▪ Data transformation: Can convert data formats (e.g., JSON to
Parquet).
▪ Data compression: Reduces storage costs.
o Use Cases for ML:
▪ Loading streaming data into S3 for batch processing with Amazon
SageMaker.
▪ Delivering data to Amazon Redshift for real-time analytics and model
training.
3. Amazon EMR (Elastic MapReduce)
o Purpose: A managed Hadoop framework that makes it easy to process vast
amounts of data using open-source tools like Apache Spark, Hive, and
Hadoop.
o Key Features:
▪ Cost-effective: Pay-as-you-go pricing.
▪ Scalable: Easily scale clusters up or down.
▪ Flexible: Supports various data processing frameworks.
o Use Cases for ML:
▪ Preprocessing large datasets for model training.
▪ Running distributed machine learning algorithms.
▪ Performing ETL (Extract, Transform, Load) operations.
4. AWS Glue
o Purpose: A fully managed ETL service that makes it easy to prepare and load
data for analytics and machine learning.
o Key Features:

pg. 4
SKILLCERTPRO

▪ Automated data discovery: Crawls data sources and infers schemas.


▪ Code generation: Generates ETL code in Python or Scala.
▪ Job scheduling: Schedules and monitors ETL jobs.
o Use Cases for ML:
▪ Cleaning and transforming data for model training.
▪ Creating data catalogs for ML workflows.
▪ Orchestrating data pipelines.
5. Amazon Managed Service for Apache Flink
o Purpose: A fully managed service for Apache Flink, a popular open-source
framework for processing streaming data.
o Key Features:
▪ Real-time processing: Processes data with low latency.
▪ Fault tolerance: Ensures data is processed reliably.
▪ Scalability: Scales to handle high data volumes.
o Use Cases for ML:
▪ Real-time feature engineering: Generating features from streaming
data for model predictions.
▪ Online model training: Continuously updating models with new data.
▪ Complex event processing: Detecting patterns and anomalies in real
time.
Choosing the Right Service
The choice of service depends on your specific needs:
• Real-time ingestion and processing: Amazon Kinesis, Amazon Managed Service for
Apache Flink
• Simple streaming data loading: Amazon Data Firehose
• Batch processing and ETL: Amazon EMR, AWS Glue
Schedule Jobs
• Why Schedule Jobs? Many data ingestion and machine learning tasks need to be
performed regularly or at specific times. Scheduling automates these tasks, ensuring
they run without manual intervention.
• AWS Services for Scheduling:
o AWS Lambda: Serverless compute service that can be triggered on a schedule
using CloudWatch Events (now Amazon EventBridge).
o Amazon EventBridge: A serverless event bus that allows you to schedule
events and trigger various AWS services.
o AWS Step Functions: For orchestrating complex workflows with multiple
steps, including scheduled tasks.

pg. 5
SKILLCERTPRO

o Amazon Managed Workflows for Apache Airflow (MWAA): A managed


service for Apache Airflow, an open-source platform for orchestrating
complex workflows.
• Key Considerations:
o Frequency: How often the job needs to run (hourly, daily, weekly, etc.).
o Dependencies: If a job depends on the completion of another job.
o Error Handling: How to handle failures and retries.
o Monitoring: Tracking job execution and performance.
Example Scenario
Imagine you're building a machine learning model to predict customer churn. You might:
1. Ingest customer data daily from a database into Amazon S3 using AWS Glue. This
involves extracting the data, transforming it into a suitable format, and loading it into
S3.
2. Schedule a daily job using Amazon EventBridge to trigger an AWS Lambda function.
This function could then:
o Preprocess the data in S3.
o Train your machine learning model using Amazon SageMaker.
o Deploy the updated model for predictions.

1.3: Identify and implement a data transformation solution.

Transform data in transit (ETL, AWS Glue, Amazon EMR, AWS Batch).
This section focuses on transforming data as it's being moved between storage locations or
processed in real-time. Here's a closer look at the services mentioned:
• ETL (Extract, Transform, Load): ETL is a traditional data integration process that
involves extracting data from various sources, transforming it to meet specific
requirements, and loading it into a target 1 system. On AWS, ETL can be implemented
using a combination of services like AWS Glue, Amazon EMR, and AWS Data
Pipeline.
• AWS Glue: AWS Glue is a fully managed ETL service that makes it easy to prepare and
load data for analytics. It provides features like automated data discovery, schema
generation, and code generation for data transformation. Glue is particularly useful
for serverless ETL and for building data lakes.
• Amazon EMR (Elastic MapReduce): Amazon EMR is a managed Hadoop framework
that can be used for large-scale data processing and transformation. It supports
various processing frameworks like Apache Spark, Hive, and Pig. EMR is suitable for
complex data transformations and for running machine learning algorithms on large
datasets.
• AWS Batch: AWS Batch is a batch processing service that enables you to run batch
computing workloads on the AWS Cloud. It can be used for data transformation tasks

pg. 6
SKILLCERTPRO

that can be broken down into independent jobs. Batch is useful for large-scale data
processing and for running machine learning training jobs.

Key Considerations for Data Transformation in Transit:


• Data format: Consider the format of the data being transferred (e.g., CSV, JSON,
Parquet) and whether it needs to be converted.
• Data volume and velocity: Choose the appropriate service based on the volume and
velocity of the data. For high-volume streaming data, consider using Amazon Kinesis.
• Transformation logic: Implement the necessary transformations, such as data
cleaning, filtering, aggregation, and feature engineering.
• Performance and cost: Optimize the transformation process for performance and
cost-efficiency.
Data Transformation
Data transformation is the process of converting data from one format or structure into
another. In the context of machine learning, this is a crucial step to ensure that the data is
suitable for training and evaluating models. Raw data is often messy, incomplete, and
inconsistent, which can negatively impact model performance. Data transformation
techniques help to clean, normalize, and enrich the data, making it more effective for
machine learning tasks.
Key Data Transformation Techniques:
• Cleaning: Handling missing values (imputation), removing duplicates, and correcting
inconsistencies.
• Integration: Combining data from multiple sources into a unified dataset.
• Reduction: Reducing the volume of data by feature selection, dimensionality
reduction, or sampling.
• Transformation: Converting data into a suitable format, such as scaling numerical
features, encoding categorical variables, or creating new features.
ML-Specific Data Handling
Machine learning often involves dealing with large datasets that may not fit into the
memory of a single machine. To process such data efficiently, distributed computing
frameworks are used. MapReduce is a popular programming model for processing large
datasets in parallel across a cluster of machines.
MapReduce
MapReduce is a programming model and an associated implementation for processing and
generating big datasets with a parallel, distributed algorithm on a cluster. A MapReduce
program is composed of two main functions:
• Map: Takes a set of data and converts it into another set of data, where individual
elements are broken down into key/value pairs.
• Reduce: Takes the output from the map as input and combines those data tuples into
a smaller set of tuples.

pg. 7
SKILLCERTPRO

AWS Services for Data Transformation


AWS provides several services that can be used for data transformation in machine learning:
• Amazon EMR: A managed Hadoop framework that can be used to process large
datasets using MapReduce, Apache Spark, and other big data tools.
• AWS Glue: A fully managed extract, transform, and load (ETL) service that makes it
easy to prepare and transform data for machine learning.
• Amazon SageMaker Data Wrangler: A service that simplifies the process of data
preparation and feature engineering for machine learning.
Apache Hadoop
Apache Hadoop is an open-source framework that provides distributed storage and
processing of large datasets. It includes the Hadoop Distributed File System (HDFS) for
storage and MapReduce for processing.
Apache Spark
Apache Spark is a fast and general-purpose cluster computing system. It provides high-level
APIs in Java, Scala, Python, and R, as well as an optimized engine that supports general
execution graphs. Spark can be used for batch processing, stream processing, machine
learning, and graph processing.
Apache Hive
Apache Hive is a data warehouse system built on top of Hadoop for providing data query and
analysis. Hive provides an SQL-like language called HiveQL for querying data stored in
Hadoop.
Implementing a Data Transformation Solution
To implement a data transformation solution for machine learning on AWS, you would
typically follow these steps:
1. Identify the data sources: Determine the sources of data that will be used for
machine learning.
2. Choose the appropriate AWS services: Select the AWS services that are best suited
for the data transformation tasks.
3. Implement the data transformation logic: Write the code to perform the necessary
data cleaning, integration, reduction, and transformation.
4. Test and validate the data transformation: Ensure that the transformed data is
accurate and suitable for machine learning.
5. Deploy the data transformation pipeline: Automate the data transformation process
so that it can be run on a regular schedule.

pg. 8
SKILLCERTPRO

• For a full set of 355 questions. Go to


[Link]
exam-questions/
• SkillCertPro offers detailed explanations to each question which helps to understand
the concepts better.
• It is recommended to score above 85% in SkillCertPro exams before attempting a real
exam.
• SkillCertPro updates exam questions every 2 weeks.
• You will get life time access and life time free updates
• SkillCertPro assures 100% pass guarantee in first attempt.

Domain 2: Exploratory Data Analysis


2.1: Sanitize and prepare data for modeling.

Identify and handle missing data, corrupt data, and stop words
• Missing Data: Real-world datasets often have missing values. These can occur for
various reasons, such as data entry errors, sensor malfunctions, or incomplete
surveys. Handling missing data is essential to avoid bias and errors in your model.
Common techniques include:
o Imputation: Filling in missing values with estimated values. This can be done
using simple methods like mean or median imputation, or more sophisticated
techniques like k-nearest neighbors imputation or regression imputation.
o Removal: Removing rows or columns with missing values. This is a simpler
approach but can lead to loss of valuable data if missing values are prevalent.
• Corrupt Data: This refers to data that is inaccurate, inconsistent, or invalid. Examples
include:
o Outliers: Data points that are significantly different from the rest of the data.
o Inconsistent formatting: Dates in different formats, inconsistent
capitalization, etc.
o Duplicate records: Identical or near-identical data entries.
Handling corrupt data involves identifying and correcting or removing these errors.
Techniques include:
o Data validation: Implementing checks to ensure data conforms to expected
formats and ranges.
o Outlier detection and removal: Using statistical methods or visualization
techniques to identify and remove outliers.
o Deduplication: Identifying and removing duplicate records.

pg. 9
SKILLCERTPRO

• Stop Words: In natural language processing (NLP), stop words are common words
that are often removed from text data because they don't carry much meaning.
Examples include "the," "a," "is," and "are." Removing stop words can help to
improve the efficiency and accuracy of NLP models.
Format, normalize, augment, and scale data
• Format: This involves structuring the data into a consistent and usable format. This
may include:
o Data type conversion: Converting data to the appropriate data type (e.g.,
numeric, categorical, date).
o Data transformation: Restructuring data, such as pivoting tables or
converting between wide and long formats.
• Normalize: Normalization is the process of scaling data to a standard range, typically
between 0 and 1. This is important because features with larger values can dominate
those with smaller values, which can negatively impact model performance.
Common normalization techniques include:
o Min-Max scaling: Scales data to a range between 0 and 1.
o Z-score standardization: Scales data to have a mean of 0 and a standard
deviation of 1.
• Augment: Data augmentation involves creating new data points from existing data by
applying various transformations. This is particularly useful when you have limited
data, as it can help to improve model generalization and prevent overfitting.
Examples of data augmentation techniques include:
o Image augmentation: Rotating, flipping, cropping, and adjusting the
brightness of images.
o Text augmentation: Back translation, synonym replacement, and random
insertion.
• Scale: Scaling is a more general term that refers to transforming data to a different
range or distribution. Normalization is a specific type of scaling. Other scaling
techniques include:
o Robust scaling: Scales data using the median and interquartile range, which is
less sensitive to outliers.
o Log transformation: Compresses the range of data by taking the logarithm of
the values.
Determine whether there is sufficient labeled data
• Importance of Labeled Data: Supervised learning algorithms rely on labeled data to
learn patterns and make predictions. Labeled data means that each data point has a
corresponding "answer" or "label" associated with it. For example, in an image
classification task, labeled data would consist of images with labels indicating what
objects are present in each image (e.g., "cat," "dog," "car").

pg. 10
SKILLCERTPRO

• Assessing Sufficiency: Determining whether you have enough labeled data depends
on several factors:
o Complexity of the problem: More complex problems generally require more
data.
o Complexity of the model: More complex models (e.g., deep neural networks)
typically need more data to train effectively.
o Variability in the data: If your data has a lot of variability, you'll need more
data to capture all the different patterns.
• Consequences of Insufficient Data: Insufficient labeled data can lead to:
o Underfitting: The model is too simple to capture the underlying patterns in
the data.
o Overfitting: The model learns the training data too well, including noise and
outliers, and performs poorly on new data.
Mitigation strategies
If you determine that you don't have enough labeled data, several mitigation strategies can
be employed:
• Data Augmentation: Create new training data by applying transformations to existing
data (e.g., rotating, cropping, or flipping images).
• Transfer Learning: Use a pre-trained model that has been trained on a large dataset
and fine-tune it on your smaller dataset.
• Active Learning: Select the most informative data points to be labeled by a human,
iteratively improving the model with less overall labeling effort.
• Semi-Supervised Learning: Combine a small amount of labeled data with a large
amount of unlabeled data to train a model.
• Synthetic Data Generation: Generate synthetic data that resembles your real data
using techniques like Generative Adversarial Networks (GANs).
Use data labeling tools (for example, Amazon Mechanical Turk)
• Data Labeling Tools: These tools help streamline the process of labeling data. They
provide interfaces for humans to label data points and can also automate some
aspects of the labeling process.
• Amazon Mechanical Turk (MTurk): MTurk is a crowdsourcing marketplace that can
be used to label data. It provides access to a large pool of human workers who can
perform tasks like image annotation, text categorization, and data transcription.
• Other AWS Labeling Services:
o Amazon SageMaker Ground Truth: A fully managed data labeling service that
makes it easy to build highly accurate training datasets for machine learning. 1
Ground Truth can use MTurk workers, your own private workforce, or vendor-
managed workforces. It also supports automated labeling to reduce labeling
costs.
2.2: Perform feature engineering.

pg. 11
SKILLCERTPRO

Identifying and Extracting Features from Various Data Sources:

o Text:

▪ Tokenization: Breaking down text into individual words or phrases


(tokens).

▪ Stop word removal: Removing common words (e.g., "the," "a," "is")
that don't carry much meaning.

▪ Stemming/Lemmatization: Reducing words to their root form (e.g.,


"running" to "run").

▪ TF-IDF: Measuring the importance of words in a document relative to


a collection of documents.

▪ Word embeddings (Word2Vec, GloVe): Representing words as dense


vectors that capture semantic relationships.

o Speech:

▪ Audio features: Extracting features like Mel-Frequency Cepstral


Coefficients (MFCCs), spectral features, and energy.

▪ Speech-to-text conversion: Using services like Amazon Transcribe to


convert speech to text and then applying text feature engineering
techniques.

o Images:

▪ Pixel values: Using raw pixel data as features.

▪ Edge detection: Identifying edges and boundaries in images.

▪ Feature extraction using pre-trained models: Leveraging models like


ResNet or Inception to extract high-level features.

o Public Datasets:

▪ Understanding the structure and content of public datasets.

▪ Identifying relevant features based on the problem you're trying to


solve.

Feature Engineering Techniques:

o Binning: Grouping continuous values into discrete bins (e.g., age ranges).

pg. 12
SKILLCERTPRO

o One-Hot Encoding: Converting categorical variables into numerical


representations.

o Handling Outliers: Identifying and treating extreme values that can skew
model training.

o Creating Synthetic Features: Generating new features by combining or


transforming existing ones (e.g., creating interaction terms).

o Reducing Data Dimensionality: Techniques like Principal Component Analysis


(PCA) to reduce the number of features while retaining important
information.

AWS Services for Feature Engineering

• Amazon SageMaker: Provides built-in algorithms and tools for feature engineering,
including processing, transforming, and analyzing data.

• AWS Glue: A fully managed ETL (extract, transform, load) service that can be used for
data preparation and feature engineering.

• Amazon Athena: An interactive query service that makes it easy to analyze data in
Amazon S3 using SQL, which can be used for feature exploration and creation.

Analyze and evaluate feature engineering concepts (for example, binning, tokenization,
outliers, synthetic features, one-hot encoding, reducing dimensionality of data).

1. Binning

• Concept: Binning, also known as discretization or bucketing, is the process of


converting continuous numerical features into discrete categorical features by
grouping values into bins or intervals.

• Use Cases:

o Handling outliers by grouping extreme values into a single bin.

o Capturing non-linear relationships between features and the target variable.

o Simplifying complex models by reducing the number of distinct values.

• Example: Converting age into age groups (e.g., 0-18, 19-35, 36-50, 50+).

2. Tokenization

pg. 13
SKILLCERTPRO

• Concept: Tokenization is the process of breaking down text data into individual units
called tokens. Tokens can be words, phrases, characters, or subwords.

• Use Cases:

o Preparing text data for natural language processing (NLP) tasks like text
classification, sentiment analysis, and machine translation.

o Creating a vocabulary of unique tokens for representing text data numerically.

• Example: Tokenizing the sentence "The quick brown fox" into the tokens "The",
"quick", "brown", and "fox".

3. Outliers

• Concept: Outliers are data points that significantly deviate from the rest of the data.
They can be caused by errors in data collection, rare events, or genuine extreme
values.

• Handling Outliers:

o Detection: Visualizing data (e.g., box plots, scatter plots) or using statistical
methods (e.g., Z-score, IQR).

o Treatment: Removing outliers, capping them at a certain value, or using


robust statistical methods that are less sensitive to outliers.

• Impact: Outliers can negatively impact the performance of some machine learning
models by skewing the data distribution and affecting model training.

4. Synthetic Features

• Concept: Synthetic features are new features created from existing features using
mathematical operations or domain knowledge.

• Use Cases:

o Capturing complex relationships between features.

o Improving model accuracy by providing additional information.

• Example: Creating a feature "BMI" (Body Mass Index) from "weight" and "height"
features.

5. One-Hot Encoding

pg. 14
SKILLCERTPRO

• Concept: One-hot encoding is a technique for converting categorical features into


numerical features. Each category is represented by a binary vector, where only one
element is 1 (hot) and the rest are 0.

• Use Cases:

o Preparing categorical data for machine learning models that require


numerical input.

o Avoiding giving artificial ordinal relationships to categories.

• Example: Encoding the colors "red", "green", and "blue" as [1, 0, 0], [0, 1, 0], and [0,
0, 1], respectively.

6. Reducing Dimensionality of Data

• Concept: Dimensionality reduction is the process of reducing the number of features


in a dataset while retaining important information.

• Use Cases:

o Improving model performance by reducing overfitting and computational


complexity.

o Visualizing high-dimensional data in lower dimensions.

• Techniques:

o Feature Selection: Selecting a subset of the most relevant features.

o Feature Extraction: Creating new features that are combinations of the


original features (e.g., Principal Component Analysis (PCA)).

2.3: Analyze and visualize data for ML.

Create Graphs

• Why Graphs Matter: Graphs help you quickly grasp patterns, relationships, and
anomalies in your data that might be hard to spot in raw numbers.

• Types of Graphs

o Scatter Plots: Show the relationship between two continuous variables.


Useful for identifying correlations.

▪ Example: Plotting advertising spend vs. sales revenue.

pg. 15
SKILLCERTPRO

o Time Series: Display data points collected over time. Essential for forecasting
and trend analysis.

▪ Example: Website traffic over the past year.

o Histograms: Show the distribution of a single numerical variable by grouping


data into bins.

▪ Example: Distribution of customer ages.

o Box Plots: Display the distribution of a numerical variable through quartiles,


highlighting the median, potential outliers, and spread.

▪ Example: Comparing the distribution of test scores across different


classes.

Interpret Descriptive Statistics

Descriptive statistics provide a numerical summary of your data's key features.

• Correlation

o Measures the strength and direction of a linear relationship between two


variables.

▪ Range: -1 (perfect negative correlation) to +1 (perfect positive


correlation). 0 means no linear correlation.

▪ Example: A correlation of 0.8 between study time and exam scores


suggests a strong positive relationship.

• Summary Statistics

o Provide a concise overview of your data's distribution and central tendency.

▪ Mean: The average value.

▪ Median: The middle value when data is ordered.

▪ Mode: The most frequent value.

▪ Standard Deviation: Measures the spread or dispersion of data


around the mean.

▪ Variance: The average of the squared differences from the mean.

• P-value

pg. 16
SKILLCERTPRO

o In hypothesis testing, the p-value helps determine the statistical significance


of results.

▪ It indicates the probability of observing the results (or more extreme


results) if there were actually no effect (the null hypothesis is true).

▪ A small p-value (typically ≤ 0.05) suggests strong evidence against the


null hypothesis.

▪ Example: In an A/B test, a p-value of 0.03 for the difference in


conversion rates indicates that the difference is statistically significant.

Perform cluster analysis

Cluster analysis is an unsupervised learning technique used to group similar data points
together. The exam focuses on understanding and applying different clustering algorithms:

• Hierarchical Clustering: This method builds a hierarchy of clusters.


o Agglomerative (bottom-up): Starts with each data point as its own cluster
and successively merges the closest clusters.
o Divisive (top-down): Starts with one cluster containing all data points and
recursively splits it into smaller clusters.
o Dendrogram: A tree-like diagram that visualizes the hierarchy of clusters. It
helps in understanding the relationships between clusters and choosing the
number of clusters.
• k-means Clustering: This algorithm aims to partition n observations into k clusters in
which each observation belongs to the cluster with the nearest mean (cluster 1
center).
o Diagnosis: Assessing the quality of the clusters formed. This can involve
analyzing cluster sizes, distances between cluster centers, and using metrics
like silhouette score.
o Elbow Plot: A technique to determine the optimal number of clusters (k) in k-
means. It plots the within-cluster sum of squares (WCSS) against different
values of k. The "elbow point" of the plot, where the rate of decrease in WCSS
sharply changes, suggests a good value for k.
• Cluster Size: Analyzing the number of data points in each cluster. This can reveal
insights about the data, such as the presence of dominant groups or outliers.

pg. 17
SKILLCERTPRO

o For a full set of questions. Go to


[Link]
mla-c01-exam-questions/
o SkillCertPro offers detailed explanations to each question which helps to
understand the concepts better.
o It is recommended to score above 85% in SkillCertPro exams before attempting
a real exam.
o SkillCertPro updates exam questions every 2 weeks.
o You will get life time access and life time free updates
o SkillCertPro assures 100% pass guarantee in first attempt.

Domain 3: Modeling
3.1: Frame business problems as ML problems.

Determine when to use and when not to use ML.


• When to use ML:
o Complex problems with no clear rules: ML excels at tasks where traditional
programming is difficult, like image recognition, natural language processing,
and fraud detection.
o Large datasets: ML algorithms thrive on data, finding patterns and insights
that humans might miss.
o Need for predictions or insights: ML can forecast future trends, classify data,
and provide recommendations.
o Continuous improvement: ML models can learn and adapt as new data
becomes available.
• When not to use ML:
o Simple problems with clear rules: If a problem can be solved with a
straightforward rule-based system, ML might be overkill.
o Small datasets: ML models need sufficient data to train effectively.
o Need for explainability: Some ML models (like deep neural networks) are
"black boxes," making it hard to understand their decision-making process. If
explainability is crucial, simpler models or rule-based systems might be better.
o Real-time, low-latency requirements: Training and deploying complex ML
models can introduce latency, which might be unacceptable for certain
applications.
Know the difference between supervised and unsupervised learning.

pg. 18
SKILLCERTPRO

• Supervised learning:
o The algorithm learns from labeled data, where the input and the desired
output are provided.
o Examples:
▪ Classification: Predicting a category (e.g., spam or not spam).
▪ Regression: Predicting a continuous value (e.g., house prices).
o Common algorithms: Linear Regression, Logistic Regression, Decision Trees,
Random Forests, Support Vector Machines, Neural Networks.
• Unsupervised learning:
o The algorithm learns from unlabeled data, finding patterns and structures on
its own.
o Examples:
▪ Clustering: Grouping similar data points together (e.g., customer
segmentation).
▪ Dimensionality reduction: Reducing the number of variables while
preserving important information.
o Common algorithms: K-Means Clustering, Principal Component Analysis
(PCA).

Select from among classification, regression, forecasting, clustering, recommendation, and


foundation models.
• Classification: Predicting a category or class.
o Example: Identifying whether an email is spam or not.
o Algorithms: Logistic Regression, Support Vector Machines, Decision Trees,
Random Forests, Naive Bayes.
• Regression: Predicting a continuous numerical value.
o Example: Predicting house prices based on features like size and location.
o Algorithms: Linear Regression, Polynomial Regression, Support Vector
Regression.
• Forecasting: Predicting future values based on historical time-series data.
o Example: Predicting sales for the next quarter.
o Algorithms: ARIMA, Prophet.
• Clustering: Grouping similar data points together.
o Example: Segmenting customers based on their purchasing behavior.
o Algorithms: K-Means Clustering, DBSCAN.
• Recommendation: Suggesting items or content to users based on their preferences.
o Example: Recommending products to customers on an e-commerce website.
o Algorithms: Collaborative Filtering, Content-Based Filtering.

pg. 19
SKILLCERTPRO

• Foundation Models: Large, pre-trained models that can be adapted to a wide range
of tasks.
o Example: Using a large language model for text generation, translation, or
question answering.
o Examples: BERT, GPT, Gemini.

3.2: Select the appropriate model(s) for a given ML problem.

Regression Models (Predicting Continuous Values):


o Linear Regression: Used for predicting a continuous target variable based on
a linear relationship with predictor variables. Simple, interpretable, but
assumes linearity. Suitable for problems where a linear relationship is
expected, like predicting house prices based on size.
o XGBoost (Extreme Gradient Boosting): A powerful ensemble method that
combines multiple decision trees. Handles complex relationships, missing
data, and is robust to outliers. Often used for tabular data in classification and
regression tasks where high accuracy is important.
• Classification Models (Predicting Categories):
o Logistic Regression: Used for binary classification problems (two classes).
Predicts the probability of belonging to a certain class. Simple and
interpretable, but assumes a linear decision boundary. Suitable for problems
like spam detection.
o Decision Trees: Tree-like structures that make decisions based on a series of
if-else conditions. Easy to understand and visualize, but prone to overfitting.
Can be used for both classification and regression.
o Random Forests: An ensemble method that combines multiple decision
trees. Reduces overfitting and improves accuracy compared to single decision
trees. Effective for various classification and regression tasks.
o XGBoost (also used for classification): As mentioned earlier, XGBoost is very
versatile and performs well in classification tasks, often outperforming
Random Forests.
• Clustering Models (Grouping Similar Data Points):
o K-means: Partitions data into k clusters based on distance to cluster centers.
Simple and efficient, but requires specifying the number of clusters (k)
beforehand. Suitable for tasks like customer segmentation.
• Neural Networks (Complex Models for Various Tasks):
o RNN (Recurrent Neural Networks): Designed for sequential data, such as
time series or text. Have memory of past inputs, making them suitable for
natural language processing (NLP) and time series forecasting.

pg. 20
SKILLCERTPRO

o CNN (Convolutional Neural Networks): Primarily used for image recognition


and computer vision tasks. Utilize convolutional layers to extract features
from images. Also used in NLP for tasks like text classification.
• Advanced Techniques:
o Ensemble Methods: Combine multiple models to improve performance.
Examples include Random Forests, XGBoost, and stacking.
o Transfer Learning: Reusing a pre-trained model on a new task with similar
characteristics. Saves training time and resources, particularly useful when
limited data is available. Common in image classification and NLP.
o Large Language Models (LLMs): Very large neural networks trained on
massive text datasets. Capable of generating human-like text, translating
languages, writing different kinds of creative content, and answering your
questions in an informative way.
Intuition Behind Models:
Understanding the intuition behind the models is crucial for effective model selection. Here
are some key intuitive explanations:
• Linear Regression: Imagine trying to fit a straight line through a set of data points.
Linear regression finds the line that minimizes the sum of squared errors between
the line and the points.
• Logistic Regression: Instead of fitting a line, logistic regression fits an "S"-shaped
curve (sigmoid function) to the data. This curve represents the probability of
belonging to a certain class.
• Decision Trees: Imagine asking a series of questions to classify an object. Each
question corresponds to a node in the tree, and the answers lead to different
branches.
• Random Forests: Imagine having multiple decision trees, each trained on a random
subset of the data and features. The final prediction is made by aggregating the
predictions of all trees. This reduces overfitting and improves accuracy.
• K-means: Imagine placing k random points (centroids) in the data space. Then, each
data point is assigned to the nearest centroid. The centroids are then recalculated as
the mean of the assigned points. This process is repeated until the centroids no
longer move significantly.
• RNN: Imagine a loop that allows information to persist from one step to the next.
This loop allows the network to remember past inputs and use them to process
current inputs.
• CNN: Imagine sliding a small window (filter) over an image. This window extracts
features from the image, such as edges and textures. Multiple filters are used to
extract different features.
• Transfer Learning: Imagine learning to ride a motorcycle after already knowing how
to ride a bicycle. You can transfer your knowledge of balance and steering to the new
task.

pg. 21
SKILLCERTPRO

Key Considerations for Model Selection:


• Type of Data: Numerical, categorical, text, images, time series.
• Problem Type: Regression, classification, clustering.
• Data Size: Small datasets may benefit from simpler models or transfer learning. Large
datasets can support more complex models.
• Interpretability: Some models (e.g., linear regression, decision trees) are easier to
interpret than others (e.g., neural networks).
• Performance Metrics: Accuracy, precision, recall, F1-score, RMSE, etc.
• Computational Resources: Some models require more computational resources than
others.

3.3: Train ML models.

Split data between training and validation (for example, cross-validation)


o Data Splitting: In machine learning, you typically split your dataset into three
parts:
▪ Training set: Used to train the model.
▪ Validation set: Used to tune hyperparameters and evaluate the
model's performance during training.
▪ Test set: Used to provide a final, unbiased evaluation of the model's
performance after training.
o Cross-validation: A technique used to assess how well a model generalizes to
independent data. It helps to avoid overfitting and provides a more robust
estimate of model performance. Common types include:
▪ k-fold cross-validation: The dataset is divided into k subsets (folds).
The model is trained k times, each time using a different fold as the
validation set and the remaining k-1 folds as the training set.
▪ Stratified k-fold cross-validation: Similar to k-fold, but ensures that
each fold has the same proportion of target classes as the original
dataset. This is important for imbalanced datasets.
Understand optimization techniques for ML training (for example, gradient descent, loss
functions, convergence)
o Loss functions: A loss function measures how well the model is performing
on the training data. It quantifies the difference between the predicted values
and the actual values. Common loss functions include:
▪ Mean Squared Error (MSE): Used for regression problems.
▪ Binary Cross-Entropy: Used for binary classification problems.
▪ Categorical Cross-Entropy: Used for multi-class classification
problems.

pg. 22
SKILLCERTPRO

o Gradient descent: An iterative optimization algorithm used to find the


minimum of a function (in this case, the loss function). It works by repeatedly
adjusting the model's parameters in the direction of the steepest descent of
the loss function.
▪ Learning rate: A hyperparameter that controls the step size in each
iteration of gradient descent.
▪ Batch size: The number of training examples used in each iteration of
gradient descent.
o Convergence: Refers to the point where the model's performance on the
training data stops improving significantly. It indicates that the optimization
algorithm has found a minimum of the loss function.
AWS Services and Considerations
When working with AWS for machine learning model training, you'll likely use services like:
• Amazon SageMaker: A fully managed service that provides tools for building,
training, and deploying machine learning models. It supports various training
options, including distributed training and hyperparameter tuning.
• AWS Deep Learning AMIs: Amazon Machine Images that are pre-configured with
popular deep learning frameworks like TensorFlow and PyTorch.
• Amazon EC2: Provides virtual servers for training models, giving you more control
over the hardware and software environment.
Choosing Appropriate Compute Resources
The choice of compute resources significantly impacts training time and cost. Here's a
breakdown:
• CPU vs. GPU:
o CPUs (Central Processing Units): General-purpose processors suitable for a
wide range of tasks. They are cost-effective for smaller datasets, simpler
models, and tasks that don't involve heavy matrix operations.
o GPUs (Graphics Processing Units): Specialized processors designed for
parallel processing, particularly effective for matrix operations common in
deep learning. GPUs significantly accelerate training for complex models and
large datasets.
• Distributed vs. Non-Distributed:
o Non-Distributed (Single Instance): Training happens on a single machine. This
is suitable for smaller datasets and models that fit in the memory of a single
instance.
o Distributed (Multiple Instances): Training is distributed across multiple
machines working together. This is essential for very large datasets and
complex models that require more memory and processing power than a

pg. 23
SKILLCERTPRO

single instance can provide. Frameworks like TensorFlow and PyTorch support
distributed training.
Choosing Appropriate Compute Platforms
AWS offers various platforms for machine learning. The choice depends on the scale and
complexity of your project:
• Non-Spark Platforms (e.g., Amazon SageMaker):
o Amazon SageMaker: A fully managed machine learning service. It provides
tools for building, training, and deploying ML models. SageMaker supports
various instance types (CPU and GPU) and offers features like automatic
model tuning, distributed training, and model monitoring. It's a good choice
for most ML projects, especially those that benefit from a managed
environment.
• Spark Platforms (e.g., Amazon EMR):
o Amazon EMR (Elastic MapReduce): A managed Hadoop framework that can
be used for large-scale data processing and machine learning. Apache Spark,
a popular distributed computing framework, runs on EMR. Spark is well-
suited for processing and transforming very large datasets before training ML
models. It also has its own MLlib library for machine learning. Use EMR when
you need to process massive datasets or already have a Spark-based data
processing pipeline.
Updating and Retraining Models
Models need to be retrained periodically to maintain accuracy as new data becomes
available or data patterns change. There are two main retraining strategies:
• Batch Retraining:
o The model is retrained on a batch of new data, typically at scheduled intervals
(e.g., daily, weekly). This is simpler to implement but may not be suitable for
applications that require immediate updates. The provided code example
demonstrates a basic batch retraining scenario.
• Real-time/Online Retraining:
o The model is updated continuously as new data arrives. This is more complex
to implement but allows the model to adapt quickly to changing conditions.
This approach is suitable for applications where data changes rapidly and
timely updates are crucial.

3.4: Perform hyperparameter optimization.

Perform Regularization
Regularization is a technique used to prevent overfitting in machine learning models.
Overfitting occurs when a model learns the training data too well, including its noise and
outliers, leading to poor performance on new, unseen data. Regularization methods add a

pg. 24
SKILLCERTPRO

penalty to the model's complexity, encouraging it to learn simpler, more generalizable


patterns.
Here are two common regularization techniques:
• Dropout: This technique is primarily used in neural networks. During training,
dropout randomly "drops out" (ignores) a fraction of neurons in a layer. This prevents
the network from relying too heavily on any single neuron and encourages it to learn
more robust features. Dropout can be seen as training multiple networks with
different architectures simultaneously.
• L1/L2 Regularization: These techniques add a penalty term to the loss function that
the model tries to minimize during training.
o L1 Regularization (Lasso): Adds the sum of the absolute values of the model's
weights to the loss function. This encourages the model to have sparse
weights, meaning some weights become exactly zero. This effectively
performs feature selection, as features with zero weights are effectively
ignored by the model.
o L2 Regularization (Ridge): Adds the sum of the squares of the model's
weights to the loss function. This encourages the model to have small
weights, but not necessarily zero. This helps to prevent any single feature
from having too much influence on the model.
Perform Cross-Validation
Cross-validation is a technique used to evaluate the performance of a machine learning
model on unseen data. It helps to assess how well the model generalizes to new data and to
avoid overfitting.
Here's how it works:
1. The data is divided into k subsets or "folds" of roughly equal size.
2. The model is trained k times. In each iteration, one fold is held out as the validation
set, and the remaining k-1 folds are used for training.
3. The performance of the model is evaluated on the validation set in each iteration.
4. The average performance across all k iterations is calculated to give an overall
estimate of the model's performance.
A common type of cross-validation is k-fold cross-validation, where k is typically 5 or 10.
Cross-validation helps to:
• Get a more reliable estimate of model performance than a single train-test split.
• Detect overfitting.
• Tune hyperparameters.
Initialize Models

pg. 25
SKILLCERTPRO

Initializing a model refers to setting the initial values of its parameters (e.g., weights in a
neural network) before training begins. Proper initialization is crucial for effective training, as
it can affect the model's convergence speed and final performance.
Here are some common initialization techniques:
• Zero Initialization: Setting all parameters to zero. This is generally not a good idea,
especially in neural networks, as it can lead to symmetry issues where all neurons in
a layer learn the same thing.
• Random Initialization: Setting parameters to small random values. This helps to
break symmetry and allows different neurons to learn different features.
• Xavier/Glorot Initialization: This method sets the initial weights based on the
number of input and output neurons in a layer. It aims to keep the variance of the
activations consistent across layers, which can help with training deep networks.
• He Initialization: This is a variant of Xavier initialization that is specifically designed
for ReLU activation functions.
Neural Network Architecture
• Layers and Nodes:
o Layers: Neural networks consist of interconnected layers of nodes (neurons).
The main types of layers are:
▪ Input Layer: Receives the initial data.
▪ Hidden Layers: Perform computations on the input data. A network
can have multiple hidden layers (deep learning).
▪ Output Layer: Produces the final result.
o Nodes: Each node in a layer receives input from the nodes in the previous
layer, applies a weight to each input, sums them up, adds a bias, and then
passes the result through an activation function.
• Learning Rate:
o The learning rate is a hyperparameter that controls the step size during the
optimization process. It determines how much the weights of the network are
adjusted in response to the error calculated during training.
o A high learning rate can lead to overshooting the optimal solution, while a
low learning rate can result in slow convergence.
• Activation Functions:
o Activation functions introduce non-linearity to the network, allowing it to
learn complex patterns. Common activation functions include:
▪ Sigmoid: Outputs values between 0 and 1.
▪ ReLU (Rectified Linear Unit): Outputs 0 for negative inputs and the
input value for positive inputs.
▪ Tanh (Hyperbolic Tangent): Outputs values between -1 and 1.

pg. 26
SKILLCERTPRO

Tree-Based Models
• Number of Trees:
o Tree-based models like Random Forests and Gradient Boosting Machines
(GBMs) use an ensemble of decision trees.
o Increasing the number of trees generally improves the model's accuracy, but
also increases computational cost and can lead to overfitting.
• Number of Levels (Tree Depth):
o The depth of a decision tree determines the complexity of the model.
o A deeper tree can capture more intricate relationships in the data but is also
more prone to overfitting.
o Limiting the tree depth can help to prevent overfitting and improve
generalization.
Linear Models
• Learning Rate:
o Linear models, such as linear regression and logistic regression, also use a
learning rate during the optimization process (e.g., gradient descent).
o Similar to neural networks, the learning rate in linear models controls the
step size taken towards the optimal solution.
o Choosing an appropriate learning rate is crucial for efficient convergence and
avoiding oscillations or slow progress.

3.5: Evaluate ML models.

Avoid Overfitting or Underfitting


• Overfitting: This occurs when a model learns the training data too well, capturing
noise and specific details that don't generalize. It's like memorizing answers to a test
instead of understanding the underlying concepts. Overfit models have high variance
and low bias.
o Symptoms: High accuracy on training data, low accuracy on validation/test
data. Complex models (e.g., deep decision trees) are more prone to
overfitting.
o Detection: Observing a significant gap between training and validation/test
performance.
o Handling:
▪ Regularization: Techniques like L1 (LASSO) and L2 (Ridge)
regularization add penalties to the model's complexity, discouraging it
from fitting noise.

pg. 27
SKILLCERTPRO

▪ Cross-validation: Techniques like k-fold cross-validation provide more


robust performance estimates by training and evaluating the model
on multiple subsets of the data.
▪ Pruning (for decision trees): Reducing the size of the tree by removing
branches that don't contribute significantly to performance.
▪ Data augmentation: Increasing the size and diversity of the training
data by applying transformations (e.g., rotations, flips) to existing
data.
▪ Early stopping: Halting the training process before the model starts to
overfit, based on performance on a validation set.
• Underfitting: This happens when a model is too simple to capture the underlying
patterns in the data. It's like trying to fit a straight line to a highly curved dataset.
Underfit models have high bias and low variance.
o Symptoms: Low accuracy on both training and validation/test data. Simple
models (e.g., linear regression on complex data) are more prone to
underfitting.
o Detection: Consistently poor performance across all datasets.
o Handling:
▪ Using a more complex model: Switching to a more powerful
algorithm (e.g., from linear regression to a neural network).
▪ Feature engineering: Creating new features that better represent the
data.
▪ Removing constraints on the model: For example, allowing a decision
tree to grow deeper.
Detect and Handle Bias and Variance
• Bias: Represents the error introduced by approximating a real-world problem, which
may be complex, by a simplified model. High bias implies strong assumptions about
the data. A high-bias model is more likely to underfit.
• Variance: Represents the model's sensitivity to small fluctuations in the training data.
High variance means the model is learning noise in the training data. A high-variance
model is more likely to overfit.
• Bias-Variance Tradeoff: The goal is to find a balance between bias and variance.
Reducing bias often increases variance, and vice-versa. The optimal model minimizes
the total error, which is the sum of bias, variance, and irreducible error (inherent
noise in the data).
Evaluate Metrics
Choosing the right evaluation metric depends on the specific problem and the type of
machine learning task (classification, regression, etc.).
• Classification Metrics:

pg. 28
SKILLCERTPRO

o Accuracy: The overall proportion of correctly classified instances. Can be


misleading with imbalanced datasets (where one class has many more
instances than others).
o Precision: Out of all the instances predicted as positive, how many were
actually positive? (True Positives / (True Positives + False Positives)).
Important when minimizing false positives is crucial.
o Recall (Sensitivity): Out of all the actual positive instances, how many were
correctly predicted? (True Positives / (True Positives + False Negatives)).
Important when minimizing false negatives is crucial.
o F1 Score: The harmonic mean of precision and recall. Provides a balance
between precision and recall, especially useful for imbalanced datasets. F1 =
2 * (Precision * Recall) / (Precision + Recall)
o AUC-ROC (Area Under the Receiver Operating Characteristic Curve): The
ROC curve plots the true positive rate (recall) against the false positive rate at
various threshold settings. AUC measures the ability of the classifier to
distinguish between classes. A higher AUC (closer to 1) indicates better
performance.
Regression Metrics:
o Root Mean Square Error (RMSE): The square root of the average of the
squared differences between predicted and actual values. Sensitive to
outliers.
o Mean Absolute Error (MAE): The average of the absolute differences
between predicted and actual values. Less sensitive to outliers than RMSE.
o R-squared (Coefficient of Determination): Represents the proportion of
variance in the dependent variable that is predictable from the independent
variables. A higher R-squared (closer to 1) indicates a better fit.
Confusion Matrices
A confusion matrix is a table that visualizes the performance of a classification model by
summarizing the counts of correct and incorrect predictions. It helps you understand where
the model is making mistakes and provides insights beyond simple accuracy metrics.
Key elements of a confusion matrix:
• True Positive (TP): The model correctly predicts the positive class.
• True Negative (TN): The model correctly predicts the negative class.
• False Positive (FP): The model incorrectly predicts the positive class (Type I error).
• False Negative (FN): The model incorrectly predicts the negative class (Type II error).
Example:
Consider a binary classification model that predicts whether an email is spam or not. A
confusion matrix might look like this:
Predicted: Spam Predicted: Not Spam

pg. 29
SKILLCERTPRO

Actual: Spam TP = 100 FN = 20


Actual: Not Spam FP = 10 TN = 970
Export to Sheets
Metrics derived from the confusion matrix:
• Accuracy: The overall correctness of the model's predictions. Calculated as (TP + TN)
/ (TP + TN + FP + FN).
• Precision: The proportion of true positives among all predicted positives. Calculated
as TP / (TP + FP).
• Recall (Sensitivity): The proportion of true positives among all actual positives.
Calculated as TP / (TP + FN).
• F1-score: The harmonic mean of precision and recall, balancing both metrics.
Calculated as 2 * (Precision * Recall) / (Precision + Recall).
Interpreting the confusion matrix and these metrics helps you understand the specific
strengths and weaknesses of your model, such as whether it tends to produce more false
positives or false negatives.
Offline Model Evaluation
Offline evaluation involves assessing the model's performance on a held-out dataset that
was not used during training. This provides an estimate of how the model is likely to perform
on unseen data.
Common offline evaluation techniques:
• Train/Test Split: Divide the data into training and testing sets. Train the model on the
training set and evaluate its performance on the testing set.
• K-Fold Cross-Validation: Divide the data into k equal folds. Train the model k times,
each time using a different fold as the testing set and the remaining folds as the
training set. Average the performance across all folds to get a more robust estimate.
Key offline evaluation metrics:
• For classification: Accuracy, precision, recall, F1-score, AUC-ROC.
• For regression: Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean
Absolute Error (MAE).
Online Model Evaluation (A/B Testing)
Online evaluation assesses the model's performance in a real-world production environment
by deploying it and observing its impact on actual users or systems. A/B testing is a common
technique for online evaluation.
A/B testing:
• Create two versions of your system or application: one with the new model (version
B) and one with the existing model or no model at all (version A).
• Divide your users or traffic into two groups and expose each group to one of the
versions.

pg. 30
SKILLCERTPRO

• Measure and compare the performance of the two versions based on relevant
metrics, such as conversion rates, click-through rates, or user engagement.
• Use statistical analysis to determine if there is a significant difference between the
two versions.
Benefits of online evaluation:
• Provides a more realistic assessment of the model's performance in real-world
conditions.
• Captures user behavior and feedback, which may not be reflected in offline
evaluation.
• Allows you to measure the actual business impact of the model.
Considerations for online evaluation:
• Requires careful experimental design and statistical analysis.
• May be more complex and time-consuming than offline evaluation.
• Need to consider potential risks and impact on users during the experiment.
Comparing Models Using Metrics
When you train multiple machine learning models on the same dataset, you need a way to
compare their performance and select the best one. This is where metrics come in. Different
metrics are suitable for different types of machine learning problems. Here's a breakdown of
common metrics and considerations:
• Time to Train a Model: This metric measures the computational time required to
train a model. It's crucial in scenarios where rapid model development or frequent
retraining is necessary. Complex models or large datasets can lead to longer training
times.
• Quality of Model (Performance Metrics): This is the core of model evaluation. The
specific metrics used depend on the type of machine learning task:
o Regression (predicting continuous values):
▪ Mean Squared Error (MSE): Average of the squared differences
between predicted and actual values. Lower MSE indicates better
performance. Sensitive to outliers.
▪ Root Mean Squared Error (RMSE): Square root of MSE. Easier to
interpret as it's in the same units as the target variable.
▪ Mean Absolute Error (MAE): Average of the absolute differences
between predicted and actual values. Less sensitive to outliers than
MSE.
▪ R-squared (Coefficient of Determination): Measures the proportion
of variance in the dependent variable that is predictable from the
independent variables. Higher R-squared (closer to 1) indicates a
better fit.
o Classification (predicting categories):

pg. 31
SKILLCERTPRO

▪ Accuracy: Proportion of correctly classified instances. Can be


misleading with imbalanced datasets.
▪ Precision: Proportion of true positives among all predicted positives.
Measures how many of the predicted positive cases were actually
positive.
▪ Recall (Sensitivity): Proportion of true positives among all actual
positives. Measures how many of the actual positive cases were
correctly identified.
▪ F1-score: Harmonic mean of precision and recall. Balances precision
and recall, especially useful with imbalanced datasets.
▪ Area Under the ROC Curve (AUC): Measures the ability of the model
to distinguish between classes. Higher AUC (closer to 1) indicates
better performance.
▪ Confusion Matrix: A table that summarizes the performance of a
classification model by showing the counts of true positives, true
negatives, false positives, and false negatives.
• Engineering Costs: This includes factors like:
o Model Complexity: More complex models may require more computational
resources for training and inference.
o Deployment Costs: Deploying large models can be more expensive due to
infrastructure requirements.
o Maintenance Costs: Complex models may be harder to maintain and debug.
o Inference Time (Latency): The time it takes for a model to make a prediction.
This is critical for real-time applications.
Example Comparison:
Imagine you're building a fraud detection system. You train two models:
• Model A: High accuracy (99%), but low recall (50%). This means it correctly classifies
most legitimate transactions but misses many fraudulent ones.
• Model B: Lower accuracy (95%), but high recall (90%). This means it catches most
fraudulent transactions but also flags some legitimate ones as suspicious (false
positives).
Depending on the cost of false positives (e.g., customer inconvenience), you might choose
Model B despite its lower accuracy because catching fraud is the priority.
Performing Cross-Validation
Cross-validation is a technique used to assess how well a model generalizes to unseen data.
It helps prevent overfitting, where a model performs well on the training data but poorly on
new data. The most common type is k-fold cross-validation:
1. Divide the dataset: Split the dataset into k equal-sized folds.
2. Iterate: For each fold:

pg. 32
SKILLCERTPRO

o Use the current fold as the validation set.


o Use the remaining k-1 folds as the training set.
o Train the model on the training set.
o Evaluate the model on the validation set and record the performance metrics.
3. Average: Calculate the average of the performance metrics across all k folds.
This provides a more robust estimate of the model's performance than a single train-test
split. Common values for k are 5 or 10.
Benefits of Cross-Validation:
• Reduced overfitting: Provides a more realistic estimate of model performance on
unseen data.
• Better use of data: Uses all data for both training and validation.
• More robust performance estimate: Averaging results across multiple folds reduces
the impact of random data splits.

• For a full set of 355 questions. Go to


[Link]
exam-questions/
• SkillCertPro offers detailed explanations to each question which helps to understand
the concepts better.
• It is recommended to score above 85% in SkillCertPro exams before attempting a real
exam.
• SkillCertPro updates exam questions every 2 weeks.
• You will get life time access and life time free updates
• SkillCertPro assures 100% pass guarantee in first attempt.

Domain 4: Machine Learning Implementation and Operations


4.1: Build ML solutions for performance, availability, scalability, resiliency, and fault
tolerance.

Log and Monitor AWS Environments


Effective monitoring and logging are crucial for maintaining the health and performance of
your ML systems. AWS provides powerful tools for this:
• AWS CloudTrail: This service logs API calls made to your AWS account. It provides a
detailed audit trail of who did what, when, and from where. This is essential for
security analysis, compliance auditing, and troubleshooting.

pg. 33
SKILLCERTPRO

o Key Use Cases for ML:


▪ Tracking changes to SageMaker endpoints, training jobs, and model
deployments.
▪ Auditing access to your ML models and data.
▪ Identifying unauthorized access or suspicious activity.
• Amazon CloudWatch: This service provides monitoring and observability for your
AWS resources and applications. It collects metrics, logs, and events, allowing you to
visualize performance, set alarms, and automate actions.
o Key Use Cases for ML:
▪ Monitoring CPU utilization, memory usage, and network traffic of your
ML instances.
▪ Tracking model performance metrics like accuracy, precision, and
recall.
▪ Setting alarms for anomalies or performance degradation.
▪ Collecting logs from your ML applications and services.
Building Error Monitoring Solutions
Going beyond general monitoring, you need specific error monitoring for your ML
applications. This involves:
• Logging Application Errors: Implement robust logging within your ML code to
capture exceptions, errors, and warnings. Include relevant context like timestamps,
input data, and stack traces.
• Centralized Log Management: Use CloudWatch Logs to centralize logs from all your
ML components. This makes it easier to search, analyze, and correlate errors.
• Error Rate Monitoring: Track the frequency of different types of errors. Set alarms
for increasing error rates or critical errors.
• Alerting and Notifications: Configure CloudWatch alarms to notify you when errors
occur. Integrate with notification services like SNS for email or SMS alerts.
• Automated Remediation: Where possible, automate responses to common errors.
For example, you could automatically restart a failed instance or redeploy a model.
Example Scenario:
Imagine you have a SageMaker endpoint serving a fraud detection model. To ensure
performance, availability, and fault tolerance, you would:
• Deploy the endpoint across multiple AZs.
• Use autoscaling to adjust the number of instances based on request volume.
• Monitor endpoint latency and error rates using CloudWatch.
• Set CloudWatch alarms to trigger if latency exceeds a threshold or if the error rate
increases.
• Use CloudTrail to log all API calls to the endpoint for auditing purposes.

pg. 34
SKILLCERTPRO

• Implement detailed logging within your model's inference code to capture any errors
during prediction.
• Use CloudWatch Logs to aggregate these logs and analyze error patterns.
Deploy to multiple AWS Regions and multiple Availability Zones
• AWS Regions: These are geographically distinct locations around the world where
AWS has data centers. Each region is completely independent. Deploying to multiple
regions provides:
o Disaster recovery: If one region experiences an outage, your application can
continue running in another.
o Reduced latency: Users closer to a specific region experience faster response
times.
o Compliance: Some data residency regulations require data to be stored in
specific geographic locations.
• Availability Zones (AZs): These are distinct locations within a region that are
engineered to be isolated from failures in other AZs. Deploying to multiple AZs
provides:
o High availability: If one AZ fails, your application can continue running in
other AZs within the same region.
o Fault tolerance: Protects against failures of individual data centers within a
region.
How it relates to Machine Learning:
• Model training: You might distribute your training jobs across multiple AZs or regions
to speed up the process or handle large datasets.
• Model serving: Deploying your trained models to multiple AZs and regions ensures
high availability and low latency for your applications that use the models.
Create AMIs and golden images
• Amazon Machine Images (AMIs): These are templates that contain a software
configuration (operating system, application server, applications) required to launch
an EC2 instance. You can create your own AMIs with pre-installed machine learning
frameworks (like TensorFlow or PyTorch), libraries, and dependencies.
• Golden Images: These are a type of AMI that is standardized and pre-configured with
all the necessary software and settings for a specific purpose. They promote
consistency and reduce deployment time.
How it relates to Machine Learning:
• Reproducible environments: AMIs and golden images ensure that your ML models
are trained and deployed in consistent environments, minimizing the risk of errors
due to software discrepancies.
• Faster deployments: Pre-configured AMIs with ML frameworks and dependencies
speed up the setup of training and inference instances.
Create Docker containers

pg. 35
SKILLCERTPRO

• Docker containers: These are lightweight, portable, and self-sufficient packages that
contain everything needed to run an application, including code, runtime, system
tools, system libraries, and settings.
How it relates to Machine Learning:
• Consistent environments: Containers provide a consistent runtime environment for
your ML models, regardless of where they are deployed (cloud, on-premises, or edge
devices).
• Simplified deployments: Containers make it easier to package and deploy ML
models, along with their dependencies.
• Microservices architecture: Containers enable you to build modular and scalable ML
applications using a microservices architecture.
Deploy Auto Scaling groups
• Auto Scaling groups: These allow you to automatically adjust the number of EC2
instances in your application based on demand. This ensures that you have enough
resources to handle traffic spikes and reduces costs during periods of low demand.
How it relates to Machine Learning:
• Scalable model serving: Auto Scaling can be used to automatically scale the number
of instances serving your ML models based on the number of incoming requests.
• Cost optimization: By automatically scaling down the number of instances during
periods of low demand, you can reduce your infrastructure costs.
• High availability: If an instance serving your ML model fails, Auto Scaling can
automatically replace it with a new one.
Rightsize Resources
• What it means: Choosing the most cost-effective and performant AWS resources for
your machine learning workloads. This involves selecting the appropriate instance
types, storage options, and network configurations based on your specific needs.
• Why it's important: Over-provisioning resources leads to unnecessary costs, while
under-provisioning can result in performance bottlenecks and slow down your
machine learning tasks.
• Key considerations:
o Instance type: Select instances with the right balance of CPU, GPU, memory,
and network capacity for your training and inference workloads. Consider
using EC2 instance types optimized for machine learning, such as P4d, P3,
Inf1, and G4dn instances.
o Storage: Choose the appropriate storage solution based on your data size,
access patterns, and performance requirements. Options include Amazon S3
for large datasets, Amazon EBS for persistent storage for your instances, and
Amazon EFS for shared file storage.
o Provisioned IOPS: If you're using Amazon EBS, you can provision IOPS to
guarantee a certain level of I/O performance for your workloads. This is

pg. 36
SKILLCERTPRO

important for applications that require consistent and low-latency access to


storage.
• Tools and services:
o AWS Compute Optimizer: Provides recommendations for optimal EC2
instance types based on your workload's historical utilization.
o Amazon CloudWatch: Monitor the utilization of your resources and identify
potential bottlenecks.
o AWS Cost Explorer: Analyze your AWS spending and identify opportunities to
reduce costs.
Perform Load Balancing
• What it means: Distributing incoming traffic across multiple instances to ensure high
availability, fault tolerance, and optimal performance for your machine learning
applications.
• Why it's important: Load balancing prevents any single instance from becoming
overloaded, which can lead to performance degradation or application downtime.
• Types of load balancers:
o Application Load Balancer (ALB): Best for HTTP/HTTPS traffic and provides
advanced features like content-based routing and host-based routing.
o Network Load Balancer (NLB): Best for TCP/UDP traffic and provides high
throughput and low latency.
o Classic Load Balancer (CLB): Older generation load balancer that supports
both HTTP/HTTPS and TCP/UDP traffic.
• Key considerations:
o Traffic patterns: Understand the expected traffic patterns for your machine
learning applications and choose the appropriate load balancer type.
o Health checks: Configure health checks to ensure that the load balancer only
sends traffic to healthy instances.
o Scaling: Use Auto Scaling to automatically adjust the number of instances
behind the load balancer based on traffic demand.
Follow AWS Best Practices
• What it means: Adhering to AWS's recommended guidelines for building and
operating secure, reliable, and cost-effective machine learning solutions.
• Why it's important: Following best practices helps you avoid common pitfalls,
improve the performance of your applications, and reduce your overall costs.
• Key best practices:
o Security: Implement strong security measures to protect your data and
applications. This includes using IAM roles and policies to control access to
your resources, encrypting data at rest and in transit, and regularly patching
your systems.

pg. 37
SKILLCERTPRO

o Reliability: Design your applications to be fault-tolerant and highly available.


This includes using multiple Availability Zones, implementing redundancy, and
using Auto Scaling.
o Performance: Optimize the performance of your machine learning workloads
by choosing the right instance types, using efficient algorithms, and tuning
your models.
o Cost optimization: Rightsize your resources, use reserved instances or savings
plans, and monitor your spending to minimize costs.
o Operational excellence: Automate your deployments, monitor your
applications, and implement robust logging and alerting.

4.2: Recommend and implement the appropriate ML services and features for a given
problem.

ML on AWS (application services), for example:


Amazon Polly
• What it is: Amazon Polly is a text-to-speech (TTS) service that uses advanced deep
learning technologies to synthesize natural-sounding human speech. It can convert
text into lifelike speech in a variety of voices, languages, and accents.
• Key features:
o Wide selection of voices and languages: Offers a broad portfolio of voices
across various languages.
o Customizable speech: Adjust speech rate, pitch, and volume. Add pauses and
other speech effects.
o Streaming audio: Enables real-time streaming of audio.
o SSML support: Supports Speech Synthesis Markup Language (SSML) for fine-
grained control over speech output.
• Use cases:
o Building voice-enabled applications (e.g., interactive voice response systems,
chatbots).
o Creating audio content (e.g., audiobooks, podcasts, news readers).
o Improving accessibility for users with visual impairments.
Amazon Lex
• What it is: Amazon Lex is a service for building conversational interfaces (chatbots)
into any application using voice and text. It's powered by the same conversational
engine that drives Amazon Alexa.
• Key features:
o Automatic speech recognition (ASR): Converts speech to text.
o Natural language understanding (NLU): Understands the intent of user
input.

pg. 38
SKILLCERTPRO

o Context management: Maintains context throughout a conversation.


o Integration with other services: Easily integrates with AWS Lambda, Amazon
Connect, and other services.
• Use cases:
o Building chatbots for customer service, information retrieval, and task
automation.
o Creating voice interfaces for mobile apps and IoT devices.
Amazon Transcribe
• What it is: Amazon Transcribe is an automatic speech recognition (ASR) service that
makes it easy to convert speech to text. It can analyze audio and video files and
provide high-quality transcriptions.
• Key features:
o Accurate transcription: Uses deep learning models to provide accurate
transcriptions.
o Support for multiple languages: Transcribes audio in various languages.
o Speaker identification: Identifies different speakers in a conversation.
o Punctuation and formatting: Automatically adds punctuation and formatting
to transcriptions.
o Customizable vocabulary: Improves accuracy for domain-specific terms.
• Use cases:
o Transcribing meeting recordings, customer service calls, and video content.
o Generating subtitles and captions for videos.
o Analyzing audio data for sentiment analysis and other insights.
Amazon Q
• What it is: Amazon Q is a generative AI powered assistant designed for work that can
be tailored to your business. Connect Q to your company's information repositories,
code, and enterprise systems to instantly get answers to questions, summarize
information, and generate content.
• Key features:
o Generative AI: Leverages large language models (LLMs) to generate human-
like text.
o Contextual awareness: Understands the context of conversations and user
queries.
o Multi-turn conversations: Supports complex, multi-turn conversations.
o Integration with enterprise data: Connects to internal data sources for
relevant responses.
• Use cases:
o Building intelligent chatbots for customer support and internal help desks.

pg. 39
SKILLCERTPRO

o Generating creative content, such as marketing copy and product


descriptions.
o Automating tasks, such as summarizing documents and answering
questions.
Understanding AWS Service Quotas:
AWS service quotas, formerly known as limits, are the maximum values for resources,
actions, and items in your AWS account. These quotas help prevent accidental overspending
and ensure fair usage of AWS resources. In the context of machine learning with Amazon
SageMaker, understanding service quotas is crucial for planning and executing your ML
projects effectively.
Key aspects of AWS service quotas for machine learning:
• Types of quotas: AWS imposes quotas on various aspects of SageMaker, including:
o Training jobs: Maximum training time, number of concurrent training jobs,
size of training data.
o Endpoints: Number of endpoints, requests per second, instance types.
o Models: Model size, number of models.
o Data processing: Size of data processed, number of processing jobs.
• Default quotas: AWS provides default quotas for each service. These defaults are
usually sufficient for initial development and small-scale projects.
• Adjustable quotas: Many quotas can be increased by submitting a quota increase
request through the AWS Management Console. You'll need to provide justification
for the increase.
• Monitoring quotas: You can monitor your quota usage using the AWS Management
Console, AWS CLI, or AWS SDKs.
• Impact of exceeding quotas: Exceeding a quota can lead to job failures, throttling, or
inability to create new resources.
Why understanding quotas is important for the MLS-C01 exam:
• Exam questions: The exam may present scenarios where you need to consider
service quotas when designing or troubleshooting ML solutions.
• Real-world relevance: In practical applications, you need to be aware of quotas to
avoid unexpected issues and ensure your ML workloads can scale as needed.
Determining When to Build Custom Models and When to Use Amazon SageMaker Built-in
Algorithms:
Amazon SageMaker provides both built-in algorithms and the flexibility to build custom
models. Choosing the right approach depends on several factors:
Amazon SageMaker Built-in Algorithms:
• Advantages:
o Ease of use: Built-in algorithms are pre-optimized and readily available,
requiring minimal coding.

pg. 40
SKILLCERTPRO

o Performance: These algorithms are often highly optimized for performance


on AWS infrastructure.
o Cost-effective: Using built-in algorithms can be more cost-effective for
common ML tasks.
• Use cases:
o Standard ML tasks like classification, regression, and clustering.
o When you need a quick solution or have limited ML expertise.
o When performance is critical and you want to leverage AWS optimizations.
Building Custom Models:
• Advantages:
o Flexibility: You have complete control over the model architecture, training
process, and hyperparameters.
o Customization: You can implement cutting-edge research or tailor models to
specific requirements.
o Advanced techniques: You can use deep learning frameworks like TensorFlow
or PyTorch for complex tasks.
• Use cases:
o When built-in algorithms don't meet your specific needs.
o When you need to implement custom loss functions or evaluation metrics.
o When you're working on research or developing novel ML solutions.
Factors to consider when choosing:
• Problem complexity: Simple problems may be solved effectively with built-in
algorithms, while complex problems may require custom models.
• Data characteristics: The nature of your data may influence the choice of algorithm
or model.
• Performance requirements: If you have strict performance requirements, you may
need to fine-tune a custom model.
• ML expertise: Building custom models requires more ML expertise than using built-in
algorithms.
• Time and resources: Developing and optimizing custom models can be more time-
consuming and resource-intensive.
Understanding AWS Infrastructure and Cost Considerations
When working with machine learning on AWS, it's crucial to be aware of the underlying
infrastructure and how it impacts costs. Here's a quick overview:
• Instance Types: AWS offers a wide variety of instance types optimized for different
workloads. For machine learning, you'll often encounter instances with powerful
GPUs (like those in the P3, P4, and G4 families) or specialized chips like AWS Trainium
(Trn1 instances) for deep learning training.

pg. 41
SKILLCERTPRO

• Cost Considerations: Running these powerful instances can be expensive. To


optimize costs, you should consider:
o Right-sizing: Choose the instance type that meets your performance needs
without overspending on unnecessary resources.
o Spot Instances: Take advantage of spare EC2 capacity offered at significantly
reduced prices.
o Reserved Instances or Savings Plans: For long-term workloads, commit to
using specific instance types in exchange for lower hourly costs.
Using Spot Instances to Train Deep Learning Models with AWS Batch
Now, let's dive into the core topic:
• Spot Instances: These are spare EC2 instances that AWS offers at steep discounts (up
to 90% off on-demand prices). The catch is that AWS can reclaim these instances
with a short notice (2 minutes) if the capacity is needed elsewhere.
• AWS Batch: This is a fully managed batch processing service that enables you to run
batch computing workloads on AWS. It dynamically provisions the optimal quantity
and type of compute resources (like EC2 instances) based on your job requirements.
How to Combine Spot Instances and AWS Batch for Deep Learning Training
1. Define your training job: Package your deep learning training script and
dependencies into a Docker container.
2. Create a Batch compute environment: Specify the instance types you want to use
(including Spot Instances), the maximum number of vCPUs, and other settings.
3. Submit your job to AWS Batch: Batch will automatically provision Spot Instances to
run your training job.
4. Handle interruptions: Since Spot Instances can be interrupted, you need to
implement fault tolerance in your training process. This can include:
o Checkpointing: Regularly save your model's progress so that you can resume
training from the last saved checkpoint if an instance is interrupted.
o Using instance interruption notifications: AWS provides notifications when a
Spot Instance is about to be reclaimed, giving you a short window to save
your work.
Benefits of this approach
• Significant cost savings: Spot Instances can drastically reduce your training costs.
• Scalability: AWS Batch makes it easy to scale your training jobs by automatically
provisioning and managing Spot Instances.
• Fault tolerance: By implementing checkpointing and using interruption notifications,
you can ensure that your training jobs complete successfully even with Spot Instance
interruptions.

pg. 42
SKILLCERTPRO

4.3: Apply basic AWS security practices to ML solutions.

AWS Identity and Access Management (IAM)


• Core Concept: IAM enables you to manage access to AWS services and resources
securely. You control who (users) and what (applications) can access your AWS
resources, including those used in your ML workflows.
• Key Components:
o Users: Represent individuals or entities that need access to AWS.
o Groups: Collections of users, making it easier to manage permissions for
multiple users at once.
o Roles: Define a set of permissions that can be assumed by anyone who needs
them, without needing long-term credentials. This is crucial for granting
permissions to EC2 instances, Lambda functions, and other AWS services that
need to access your ML resources.
o Policies: Define permissions in JSON format. You attach policies to users,
groups, or roles to grant specific permissions.
• Best Practices for ML:
o Principle of Least Privilege: Grant only the permissions necessary to perform
a task. For example, a data scientist might need read access to S3 buckets
containing training data but not delete access.
o Use Roles for Services: When an EC2 instance running your ML training needs
to access S3, assign an IAM role to the instance instead of embedding access
keys.
o Regularly Review and Rotate Credentials: Ensure that access keys are rotated
regularly and that any unused users or roles are removed.

S3 Bucket Policies
• Core Concept: S3 bucket policies are access control policies that you attach directly
to S3 buckets. They define who can access the bucket and what actions they can
perform.
• Key Use Cases for ML:
o Controlling Access to Training Data: You can use bucket policies to restrict
access to your training data to specific IAM users, roles, or even AWS
accounts.
o Granting Access to SageMaker: You can use bucket policies to allow
SageMaker to read data from your input buckets and write model artifacts to
output buckets.
• Example: A bucket policy can allow only a specific IAM role used by your SageMaker
training job to read objects from the training data bucket.

pg. 43
SKILLCERTPRO

Security Groups
• Core Concept: Security groups act as virtual firewalls for your EC2 instances (and
other resources like Lambda functions in VPCs). They control inbound and outbound
traffic at the instance level.
• Relevance to ML: If you're using EC2 instances for training or inference, security
groups are essential for controlling network access.
• Key Considerations:
o Inbound Rules: Define which traffic is allowed to reach your instance (e.g.,
SSH for administration, specific ports for your ML application).
o Outbound Rules: Define which traffic your instance is allowed to send (e.g.,
access to external data sources, communication with other AWS services).
o Stateful Inspection: Security groups track the state of connections and
automatically allow return traffic.
• Example: You might create a security group that allows SSH access from your IP
address and allows your ML application to communicate on specific ports.
VPCs (Virtual Private Clouds)
• Core Concept: A VPC is a logically isolated section of the AWS Cloud where you can
launch AWS resources in a virtual network that you define.
• Importance for ML: VPCs provide network isolation and security for your ML
infrastructure.
• Key Benefits:
o Network Segmentation: You can divide your VPC into subnets to isolate
different parts of your ML environment (e.g., training, inference).
o Private Subnets: You can create subnets without internet access for sensitive
operations like model training.
o Network Access Control Lists (NACLs): Provide an additional layer of network
security at the subnet level.
• Example: You can launch your SageMaker training jobs in a private subnet within
your VPC, ensuring that they are not directly accessible from the internet.
Encryption and Anonymization
• Encryption: Protects data at rest and in transit.
o At Rest: Encrypting data stored in S3 buckets, EBS volumes, and other storage
services. AWS Key Management Service (KMS) is commonly used to manage
encryption keys.
o In Transit: Using HTTPS for communication between your applications and
AWS services.
• Anonymization: Techniques used to remove personally identifiable information (PII)
from your data.
o Data Masking: Replacing sensitive data with placeholder values.

pg. 44
SKILLCERTPRO

o Tokenization: Replacing sensitive data with unique tokens.


o Differential Privacy: Adding noise to data to protect individual privacy while
still allowing for aggregate analysis.
• Relevance to ML: Essential for protecting sensitive data used in ML models.
• Example: Encrypting training data stored in S3 using KMS and anonymizing customer
data before using it to train a model.

4.4: Deploy and operationalize ML solutions.

Expose endpoints and interact with them:


• What it means: This involves creating a way for other applications or users to access
your trained model. This is typically done through an API endpoint (often RESTful).
When a request is sent to the endpoint, it's processed by the model, and a prediction
or inference is returned.
• AWS Services:
o Amazon SageMaker Endpoints: SageMaker provides managed endpoints for
hosting models. You deploy your model to an endpoint, and SageMaker
handles the infrastructure, scaling, and monitoring. You can choose different
instance types based on your performance and cost requirements.
SageMaker also supports real-time and batch inference.
o AWS Lambda: For simpler models or event-driven inference, you can deploy
your model as a Lambda function. This is a serverless approach where you
only pay for the compute time used during inference.
o Amazon API Gateway: Often used in conjunction with SageMaker Endpoints
or Lambda, API Gateway allows you to create, publish, maintain, monitor, and
secure APIs for your models. It provides features like authorization, rate
limiting, and request transformation.
• Interaction: Interacting with endpoints involves sending requests (typically HTTP
requests containing input data) and receiving responses (containing predictions). This
can be done programmatically using SDKs (AWS SDK for Python (Boto3), AWS SDK for
Java, etc.) or through tools like curl or Postman.
• Example: Imagine you have a model that predicts house prices. You would create an
endpoint. An application could then send a request to the endpoint with features of
a house (e.g., size, number of bedrooms, location), and the endpoint would return
the predicted price.
Understand ML models:
• What it means: While you don't need to be a deep learning expert for the MLS-C01
exam, you should have a solid understanding of different model types and their
characteristics, especially in the context of deployment and operationalization.
• Key Concepts:

pg. 45
SKILLCERTPRO

o Model Formats: Understand common model formats like ONNX, TensorFlow


SavedModel, PyTorch models, and how these are used in different
deployment scenarios. SageMaker supports various formats.
o Model Size and Performance: Model size impacts memory requirements and
inference latency. Larger models may require more powerful instances.
o Inference Latency: This is the time it takes for a model to generate a
prediction. It's a crucial metric for real-time applications.
o Cold Starts: In serverless environments (like Lambda), the first invocation of a
function can take longer due to initialization. This is known as a cold start.
o Model Explainability: Understanding how a model arrives at its predictions is
important for debugging, trust, and compliance. Tools like SHAP (SHapley
Additive exPlanations) can help.
• Relevance to Deployment: Understanding these aspects helps you choose the right
deployment strategy, instance types, and optimization techniques.
Perform A/B testing:
• What it means: A/B testing (also known as split testing) is a method of comparing
two versions of a model (or any other component) to see which performs better. In
the context of ML, you might compare a new model version with the current
production model.
• How it works: You direct a portion of your incoming traffic to each version of the
model and then measure key metrics (e.g., accuracy, conversion rate, click-through
rate) to determine which performs best.
• AWS Services and Techniques:
o SageMaker Model Monitor: Can be used to monitor model performance and
detect drift, which can inform A/B testing decisions.
o Canary Deployments: A type of deployment where you gradually roll out a
new version to a small subset of users before fully deploying it. This can be
used for A/B testing.
o Custom Solutions: You can implement A/B testing yourself by routing traffic
based on a percentage or using feature flags.
• Metrics: Define clear metrics for comparison. These should align with your business
goals.
• Statistical Significance: Ensure your results are statistically significant before making
decisions.
• Example: You have a model that recommends products. You train a new version of
the model and want to see if it improves click-through rates. You perform an A/B test
by directing 50% of users to the old model and 50% to the new model. You then
compare the click-through rates of each group to see which model performs better.
Retrain Pipelines
• Why Retrain? Machine learning models are trained on specific data. Over time, the
real-world data they encounter can change (a phenomenon known as data drift or

pg. 46
SKILLCERTPRO

concept drift). This can lead to a decline in model performance. Retraining is the
process of re-training your model on updated data to maintain its accuracy and
relevance.
• How to Retrain:
o Data Collection: Gather new data that reflects the current reality.
o Data Preparation: Clean, transform, and prepare the new data, ensuring
consistency with the original training data.
o Model Retraining: Use the updated data to retrain your ML model. You might
use the same model architecture or experiment with new ones.
o Evaluation: Evaluate the retrained model's performance on a held-out
dataset to ensure it generalizes well.
o Deployment: Deploy the retrained model to replace the old one in your
production environment.
• Retraining Strategies:
o Periodic Retraining: Retrain the model at fixed intervals (e.g., weekly,
monthly).
o Trigger-Based Retraining: Retrain the model when a specific trigger is
activated, such as a significant drop in performance or the availability of a
substantial amount of new data.
Debug and Troubleshoot ML Models
• Debugging ML Models: Debugging ML models is more complex than debugging
traditional software. It involves understanding not just code errors but also issues
related to data, model architecture, and training process.
• Common Issues:
o Data Issues: Incorrect or missing data, data leakage, data drift.
o Model Issues: Overfitting, underfitting, incorrect model selection.
o Training Issues: Insufficient training data, improper hyperparameter tuning.
• Debugging Techniques:
o Data Analysis: Thoroughly analyze your data for inconsistencies, biases, and
other issues.
o Visualization: Visualize data and model behavior to identify patterns and
anomalies.
o Experimentation: Conduct controlled experiments to test different
hypotheses about the cause of errors.
Detect and Mitigate Drops in Performance
• Detecting Performance Drops:
o Monitoring Metrics: Continuously monitor key performance metrics such as
accuracy, precision, recall, F1-score, AUC, etc.
o Alerting: Set up alerts to notify you when performance metrics fall below a
predefined threshold.

pg. 47
SKILLCERTPRO

o Statistical Process Control: Use statistical methods to detect significant


deviations from expected performance.
• Mitigating Performance Drops:
o Retraining: As discussed earlier, retraining is a primary method for mitigating
performance drops caused by data drift.
o Model Tuning: Adjust model hyperparameters or try different model
architectures.
o Data Augmentation: Increase the diversity of your training data by applying
transformations or generating synthetic data.
o A/B Testing: Compare the performance of the current model with a new or
retrained model in a live environment.
Monitor Performance of the Model
• Importance of Monitoring: Continuous monitoring is crucial for maintaining the
reliability and effectiveness of ML models in production.
• What to Monitor:
o Performance Metrics: Track relevant metrics to assess model accuracy and
effectiveness.
o Data Quality: Monitor data for changes in distribution, missing values, and
other anomalies.
o Resource Utilization: Monitor CPU usage, memory consumption, and other
resource metrics to ensure efficient operation.
• Monitoring Tools and Techniques:
o Cloud Monitoring Services: AWS CloudWatch, Amazon SageMaker Model
Monitor.
o Custom Monitoring Solutions: Implement custom monitoring using logging,
metrics collection, and visualization tools.
o Logging: Log model predictions, input data, and other relevant information
for analysis and debugging.

o For a full set of 355 questions. Go to


[Link]
mla-c01-exam-questions/
o SkillCertPro offers detailed explanations to each question which helps to
understand the concepts better.
o It is recommended to score above 85% in SkillCertPro exams before attempting
a real exam.
o SkillCertPro updates exam questions every 2 weeks.
o You will get life time access and life time free updates
o SkillCertPro assures 100% pass guarantee in first attempt.

pg. 48
SKILLCERTPRO

Disclaimer: All data and information provided on this site is for informational purposes only.
This site makes no representations as to accuracy, completeness, correctness, suitability, or
validity of any information on this site & will not be liable for any errors, omissions, or delays
in this information or any losses, injuries, or damages arising from its display or use. All
information is provided on an as-is basis.

pg. 49

You might also like