Machine Learning Theory
“A subfield of computer science that gives computers the ability to learn without being explicitly
programmed”
SUPERVISED LEARNING
Supervised learning trains a model on known examples so that it can predict the outcome for future
instances. For supervised learning the data must be labelled.
Advantages of Regression:
- Very fast to analyze.
- It does not require tuning parameters.
- It is easy to understand.
- It is highly interpretable.
- Test on a portion of the train set: The test set is a portion of the train set. The result is
high training accuracy but low out-of-sample accuracy, because the model has already seen the test data.
- Train/Test Split: The train and test sets are mutually exclusive, which gives a more accurate
evaluation of out-of-sample accuracy, but the result is highly dependent on which data ends up in each set.
- K-Fold Cross Validation: Repeats the train/test split over K folds and averages the results to
produce a more consistent out-of-sample accuracy.
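The K-fold idea above can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: the helper names `k_fold_indices` and `cross_validate`, and the `score_fn(train_idx, test_idx)` callback, are assumptions made for this example.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k mutually exclusive folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at i,
    # so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_validate(score_fn, n_samples, k=4):
    """Average a user-supplied score over the k train/test splits.

    score_fn(train_idx, test_idx) is a hypothetical callback that trains on
    the first index list and returns an accuracy measured on the second.
    """
    scores = [score_fn(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)
```

Every sample is used for testing exactly once and for training K-1 times, which is why the averaged score depends less on one particular split.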
- Multiple Linear Regression: A model that uses several independent variables to predict a dependent
variable.
- Ordinary Least Squares: Estimates the coefficients using linear algebra operations; suited to datasets
with fewer than 10k values.
- Optimization Algorithms: Estimate the coefficients iteratively, for example with Gradient Descent;
suited to datasets of 10k values or more.
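To make the two estimation approaches concrete, here is a hedged sketch that fits the same line both ways. It uses a single feature to stay short (with several features the closed form becomes the normal equations). The function names, learning rate and iteration count are assumptions chosen for this example.

```python
def ols_fit(xs, ys):
    """Closed-form least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def gradient_descent_fit(xs, ys, learning_rate=0.05, n_iters=5000):
    """Minimize the mean squared error iteratively instead of solving it directly."""
    n = len(xs)
    intercept, slope = 0.0, 0.0
    for _ in range(n_iters):
        errors = [intercept + slope * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error.
        grad_intercept = 2 * sum(errors) / n
        grad_slope = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        intercept -= learning_rate * grad_intercept
        slope -= learning_rate * grad_slope
    return intercept, slope
```

On a small, well-scaled dataset both functions should agree closely. The iterative version trades exactness for the ability to process data in cheap passes, which is the motivation for using it on larger datasets.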
- Decision Trees: Used to go from observations about an item (represented in the branches)
to conclusions about the item's target value (represented in the leaves). Building the tree is all about
finding, at each split, the attribute with the highest information gain, that is, the lowest weighted
entropy after the split.
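The split criterion can be written down directly. The sketch below assumes Shannon entropy in bits and class labels given as plain Python lists; the function names are illustrative only.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Entropy before the split minus the weighted entropy after it.

    `children` is a list of label lists, one per branch of the split.
    """
    n = len(parent)
    weighted_entropy = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted_entropy
```

A split that separates the classes perfectly leaves every branch with zero entropy, so its gain equals the entropy of the parent node. A split that leaves the class mix unchanged has zero gain.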
- Support Vector Machine (SVM): A supervised algorithm that classifies data by finding a
separator. It first maps the data to a high-dimensional feature space using different mathematical
functions, known as kernels (Kernelling). It is used for image recognition, text category assignment,
spam detection, sentiment analysis and gene expression classification.
Advantages (A) and disadvantages (D) of using SVM:
1. A. Accurate in high-dimensional spaces.
2. A. Memory efficient.
3. D. Prone to over-fitting.
4. D. No probability estimation.
5. D. Best suited to small datasets; not computationally efficient on large ones.
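As an illustration of the kernelling step mentioned in the SVM description, the sketch below uses a degree-2 polynomial kernel on 2-D points. That kernel is only one possible choice, picked here as an assumption for the example. It shows that the kernel value equals a dot product in a higher-dimensional feature space, without that space ever being built explicitly.

```python
from math import sqrt


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def feature_map(x):
    """Explicit map from 2-D input space to a 3-D feature space."""
    x1, x2 = x
    return (x1 * x1, sqrt(2) * x1 * x2, x2 * x2)


def poly_kernel(x, y):
    """Degree-2 polynomial kernel: same value as dot(feature_map(x), feature_map(y))."""
    return dot(x, y) ** 2
```

Evaluating `poly_kernel` costs one 2-D dot product, while the explicit route needs the mapped 3-D vectors first. The saving grows with the dimension of the feature space.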
- Confusion Matrix: Each value of the matrix is the number of correct or wrong predictions for a
class. It is used to calculate precision and recall, and from them the F-score. The closer the F-score
is to 1, the better the classifier.
- Log Loss: Used when the classifier outputs the probability (between 0 and 1) of a class label
instead of the label itself. The closer the value is to 0, the better the accuracy.
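Both metrics can be computed by hand for a binary problem. The following is a minimal sketch, assuming labels encoded as 0/1 and the F1 variant of the F-score; the function names and the clipping constant `eps` are assumptions for this example.

```python
from math import log


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn), the four cells of a binary confusion matrix."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall (1 is best, 0 is worst)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of 0/1 labels given predicted P(label = 1)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += y * log(p) + (1 - y) * log(1 - p)
    return -total / len(y_true)
```

The F-score only looks at the predicted labels, whereas log loss also penalizes a prediction for being confidently wrong, which is why it needs the probabilities.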
- K-means Algorithm: Used for partitioning clustering, dividing the data into non-overlapping
subsets without any cluster-internal structure. The examples within a cluster are very similar to each
other and very different from the examples in other clusters. K-means is used for medium and large
sized databases, produces sphere-like clusters and needs the number of clusters to be specified. The
objectives of this algorithm are:
Intra-Cluster: Distances between examples inside a cluster (minimized).
Inter-Cluster: Distances between examples in different clusters (maximized).
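A compact sketch of the usual iterative procedure (assign each point to its nearest centroid, then move each centroid to the mean of its points) is given below. It assumes points stored as tuples, squared Euclidean distance and random initial centroids drawn from the data; the function names are illustrative.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, n_iters=100, seed=0):
    """Partition `points` into k clusters; returns (centroids, clusters)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    clusters = [[] for _ in range(k)]
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters
```

Each update step can only lower the total intra-cluster distance. Inter-cluster separation is not optimized directly, and the result can depend on the initial centroids, so in practice the algorithm is often run several times.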
Advantages of DBSCAN:
RECOMMENDER SYSTEMS
A recommender system captures the patterns in people’s behavior and uses them to predict what else they
might want or like. Applications include what to buy, where to eat, which job to apply for, who you should
be friends with, and personalizing your experience on the web. The advantages are broader exposure, the
possibility of continual usage or purchase of products, and a better experience.
- Memory-Based: Uses the entire user-item dataset to generate a recommendation. Uses statistical
techniques to approximate users or items (Pearson correlation, cosine similarity, Euclidean
distance, etc.).
- Model-Based: Develops a model of users in an attempt to learn their preferences. Models can
be created using machine learning techniques like regression, clustering and classification.
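The three similarity measures named for the memory-based approach can be sketched as follows. The example assumes each user (or item) is represented by an equal-length list of numeric ratings; that representation and the function names are assumptions for illustration.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two rating vectors (1 means same direction)."""
    dot_product = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot_product / (norm_u * norm_v)


def pearson_correlation(u, v):
    """Cosine similarity of the mean-centered vectors, in the range [-1, 1]."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    centered_u = [a - mean_u for a in u]
    centered_v = [b - mean_v for b in v]
    return cosine_similarity(centered_u, centered_v)


def euclidean_distance(u, v):
    """Straight-line distance between two rating vectors (0 means identical)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Pearson correlation subtracts each user's mean rating first, so two users who rank items the same way still score as similar even if one rates everything higher than the other. These helpers do not handle constant or all-zero vectors, where the norms are zero.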