Machine Learning
Machine Learning is a technique for building models from data for tasks such as image and
speech recognition. However, it faces challenges such as data quality, overfitting and
underfitting, computational resource demands, interpretability, generalization, ethical concerns,
and security. The learning process involves training a model on data and then applying it to real-
world data, as shown in the vertical and horizontal flows of the figure.
Machine Learning faces a fundamental challenge because the training data and the input data
the deployed model encounters are distinct. A core issue is that a model trained on data from one
source may fail to perform well on data from another source, such as different handwriting styles.
Achieving the desired results requires unbiased training data that accurately reflects real-world
data. The process of ensuring that a model's performance stays consistent across varied data is
called generalization, and the success of Machine Learning hinges on effective generalization.
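To make the generalization gap concrete, here is a minimal Python sketch (assuming scikit-learn is available; the synthetic dataset and the decision-tree model are illustrative choices, not from the text) that compares a model's accuracy on its own training data with its accuracy on held-out data:

```python
# Minimal sketch: measuring generalization with a held-out set.
# Dataset and model are illustrative assumptions, not from the text.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for "real-world" inputs.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out data the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# A large gap between the two scores signals poor generalization.
print("training accuracy:", model.score(X_train, y_train))
print("held-out accuracy:", model.score(X_test, y_test))
```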
Overfitting
Overfitting occurs when a model is fitted too closely to the training data, causing it to perform
well on that data but poorly on new, unseen data. The term describes the model's inability to
generalize beyond the training data. The idea is easier to grasp with a case study:
consider a classification problem in which we need to divide position (or coordinate) data into two
groups based on training data points. The goal is to find a curve that accurately defines the
border between the two groups using the given training data.
Judged against this curve, some points are not correctly classified according to the border.
What about grouping the points perfectly with a complex curve, as shown in Figure 1-8?
This model groups the training data perfectly. How does it look? Do you like this model better?
Does it seem to correctly reflect the general behavior? Now, let’s use this model in the real
world. The new input to the model is indicated with the symbol ■, as shown in Figure 1-9.
The previously error-free model now classifies the new data point as class ∆. However, given
the general trend of the training data, this classification is doubtful; grouping it with class •
seems more reasonable. This casts doubt on the model's performance, despite its earlier 100%
accuracy on the training data.
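The same effect can be reproduced in code. The following sketch is an illustration under assumed data (scikit-learn's make_moons stands in for the figure's points, which are not given in the text): it fits a gentle boundary and a very complex one, then scores both on fresh samples.

```python
# Sketch: a flexible boundary can score near 100% on training points
# yet misjudge new data. Data here is synthetic, for illustration only.
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X_train, y_train = make_moons(n_samples=40, noise=0.3, random_state=1)
X_new, y_new = make_moons(n_samples=200, noise=0.3, random_state=2)

for degree in (1, 15):  # a gentle curve vs. a very complex one
    model = make_pipeline(
        PolynomialFeatures(degree),
        LogisticRegression(C=1e6, max_iter=10000),  # nearly unregularized
    ).fit(X_train, y_train)
    print(f"degree {degree:2d}: "
          f"train={model.score(X_train, y_train):.2f} "
          f"new={model.score(X_new, y_new):.2f}")
```

The complex boundary typically scores higher on the training points but lower on the new samples, mirroring the behavior of Figures 1-8 and 1-9.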
Noisy Data: The data contains outliers and noise that blur the boundary between groups.
Overfitting: A model that accounts for every data point, noise included, becomes overfitted: it
fits the training data too closely but lacks generalizability to new, unseen data.
Balancing Accuracy: The goal is to derive an accurate model from the training data without
deliberately making it less accurate.
Dilemma: Driving the training error toward zero invites overfitting, which harms generalizability.
Techniques to prevent overfitting are introduced in the following section.
The central theme is finding the balance between fitting the training data well and maintaining
the model's ability to generalize to new data.
Confronting Overfitting
In the world of machine learning, overfitting is a common problem that can significantly affect
model performance. Tackling this issue effectively separates the pros from the amateurs. The
following discusses two typical methods used to confront overfitting: regularization and
validation.
Regularization suppresses overfitting by keeping the model simple, even if this means sacrificing
some performance on the training data. Numerically, it penalizes overly complex model structures,
so the resulting model reflects the overall characteristics of the data rather than its noise.
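As one concrete instance (ridge regression with an L2 penalty is a common regularization technique, used here as an assumed example rather than the specific method the text has in mind), the sketch below shows the penalty shrinking the coefficients of a high-degree polynomial fit:

```python
# Sketch: L2 regularization (ridge) penalizes large coefficients,
# simplifying the fitted curve. Data is synthetic, for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(30, 1))
y = np.sin(3 * x).ravel() + rng.normal(scale=0.3, size=30)  # noisy samples

unregularized = make_pipeline(PolynomialFeatures(12), LinearRegression())
regularized = make_pipeline(PolynomialFeatures(12), Ridge(alpha=1.0))

for name, model in [("plain", unregularized), ("ridge", regularized)]:
    model.fit(x, y)
    coefs = model[-1].coef_  # coefficients of the final estimator
    print(f"{name}: train R^2={model.score(x, y):.3f}, "
          f"max |coef|={np.abs(coefs).max():.1f}")
```

The regularized model gives up a little training accuracy, but its small coefficients produce a smoother curve that follows the trend rather than the noise.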
Validation is a process that uses a reserved part of the training data to monitor the model's
performance. This reserved data, called the validation set, is not used during the training
process. By checking the model's performance on this validation set, we can determine if the
model is overfitted and make necessary adjustments.
In essence, these methods help in creating models that generalize better to new, unseen data.
When validation is involved, the training process of Machine Learning proceeds by the following
steps:
1. Divide the training data into two groups: one for training and the other for validation. As a
rule of thumb, the ratio of the training set to the validation set is 8:2.
2. Train the model with the training set.
3. Evaluate the performance of the model using the validation set.
   a. If the model yields satisfactory performance, finish the training.
   b. If the performance does not produce sufficient results, modify the model and repeat the
process from Step 2.
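A minimal sketch of these steps in Python follows (the 8:2 split matches the rule of thumb above; the decision-tree model, the growing-depth "modification", and the 0.90 threshold are illustrative assumptions):

```python
# Sketch of the validation loop: hold out 20% of the training data,
# train, check performance, and modify the model until it suffices.
# The model family and stopping threshold are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Step 1: split 8:2 into a training set and a validation set.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2,
                                            random_state=0)

for depth in range(1, 11):              # "modify the model and repeat"
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_tr, y_tr)               # Step 2: train on the training set
    score = model.score(X_val, y_val)   # Step 3: evaluate on validation set
    print(f"depth={depth}: validation accuracy={score:.3f}")
    if score >= 0.90:                   # Step 3a: satisfactory -> stop
        break
```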
Cross-validation is a slight variation of the validation process. It still divides the training
data into groups for training and validation, but it keeps changing the datasets: instead of
retaining the initially divided sets, cross-validation repeats the division of the data. The
reason is that a model can become overfitted even to the validation set when that set is fixed.
Because cross-validation maintains the randomness of the validation dataset, it can better detect
overfitting of the model. Figure 1-11 describes the concept of cross-validation. The dark shades
indicate the validation data, which is randomly selected throughout the training process.
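The repeated re-division that Figure 1-11 depicts corresponds to what is commonly called k-fold cross-validation; the sketch below uses scikit-learn's KFold as an assumed concrete implementation, rotating which slice of the data serves as the validation set:

```python
# Sketch of cross-validation: the validation fold keeps changing,
# so no single fixed subset can be silently overfitted.
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

scores = []
# Each iteration uses a different slice as the validation set
# (the "dark shade" of Figure 1-11), and the rest for training.
for train_idx, val_idx in KFold(n_splits=5, shuffle=True,
                                random_state=0).split(X):
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[val_idx], y[val_idx]))

print("per-fold validation accuracy:", [f"{s:.2f}" for s in scores])
print("mean:", sum(scores) / len(scores))
```

Averaging the score over rotating folds gives a more trustworthy estimate of generalization than any single fixed validation set.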