Midterm Notes P1
- Why use data mining? (1) Volume of data being collected, (2) Computers are cheaper and more powerful, (3) Provide better, customized services for an edge, (4) Data collected and stored at enormous speeds, (5) Traditional techniques infeasible for raw data, (6) Data mining may help scientists (in classification, segmentation, hypothesis formation).
- Data Mining Definitions: (1) Non-trivial extraction of implicit, previously unknown, and potentially useful information from data, (2) Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns.
- NOT data mining: (1) Looking up a phone number in a phone directory, (2) Querying a search engine for information about a topic, (3) Discovering a student name in the class roster.
- Data mining: (1) Which names are more prevalent in certain US locations? (2) Which documents returned by a search engine can be grouped together by similarity according to their context? (3) Detecting anomalous credit card transactions.
- Data mining can be performed on BOTH small data and big data.
- Fields related to data mining: (1) Database systems, (2) Machine learning / pattern recognition, (3) Statistics / artificial intelligence.
- Data Mining Tasks: (1) Predictive: Learn patterns from existing data and classify unknown data using the learned pattern. (2) Descriptive: Find human-interpretable patterns that describe the data.
- Predictive Types: Classification, Regression, Deviation Detection.
- For predictive methods, a set of labeled examples is required.
- Descriptive Types: Clustering, Association Rule Discovery.
- For a classification task, what are the commonly used data elements? (1) Data samples, (2) Attributes, (3) Label vectors.
- Why do we extract patterns from data? (1) To show investors which part of the data is important, (2) To make knowledge extraction algorithms faster, (3) To avoid getting stuck with unimportant information.
- Examples of features that can be generated from a data sample: (1) Mean, standard deviation, (2) Coefficients of the Fast Fourier Transform, (3) Raw data.
- Label vectors can be any attribute as long as it is not the ID.
- If a classification application has D-dimensional data samples and N data samples, what is the dimension of the data matrix? N*(D+1).
- Primary characteristics of Big Data: (1) Velocity: analysis of the database must be performed fast, (2) Veracity: high information content of the data, (3) Volume: a large amount of data needs to be analyzed at a given time.
- Difference between "attributes" and "attribute values"? Attributes are categories that express properties of data, while attribute values are actual numerical instances of those categories.
- The same attribute can be mapped to different attribute values.
- An object is a collection of attributes that describe it.
- An object is also known as a record, point, case, sample, entity, or instance.
- Types of attributes: Nominal, Ordinal, Interval, Ratio.
- Properties of Attribute Values: Distinctness (=, ≠), Order (<, >), Addition (+, −), Multiplication (*, /).
- Properties of Nominal: Distinctness.
- Properties of Ordinal: Distinctness, Order.
- Properties of Interval: Distinctness, Order, Addition.
- Properties of Ratio: Distinctness, Order, Addition, Multiplication.
- Examples of Nominal: Zip codes, Employee ID, Eye color, Sex, Hair color.
- Examples of Ordinal: Hardness of minerals, {good, better, best}, Grades, Street numbers.
- Examples of Interval: Calendar dates, Temperature on the Celsius (centigrade) scale.
- Examples of Ratio: Monetary quantities, Counts, Age, Mass, Temperature in Kelvin.
- Discrete Attribute: (1) Has only a finite or countably infinite set of values, (2) Often represented as integer variables, (3) Binary attributes are a special case.
- Continuous Attribute: (1) Has real numbers as attribute values, (2) Real values can only be measured and represented using a finite number of digits, (3) Often represented as floating-point variables.
- Symmetric Attribute: (1) A binary variable with two possible outcomes (1, 0), (2) Both outcomes are equally valuable and important.
- Asymmetric Attribute: (1) The outcomes of the binary variable are not equally important, (2) Only presence (a non-zero attribute value) is important.
- Types of Data Sets: Record, Graph, Ordered.
- Record Types: Data Matrix, Document Data, Transaction Data.
- Graph Types: World Wide Web, Molecular Structures.
- Ordered Types: Spatial Data, Temporal Data, Sequential Data, Genetic Sequence Data / Sequences of transactions.
- Important Characteristics of Structured Data: Dimensionality, Sparsity (only presence counts), Resolution.
- The Curse of Dimensionality refers to the idea that error increases with the number of features.
- Sparsity is a characteristic of data and is not a measure of the usefulness of data. Sparsity can be exploited to increase the speed of analysis and extract meaningful patterns.
- Sparsity does not imply that a sparse dimension can be ignored for classification. Sparsity gives us options for representation and also affects which classification algorithm should be used.
- Data matrix: (1) If data objects have the same fixed set of numeric attributes, the data objects can be thought of as points in a multi-dimensional space, (2) Each dimension represents a distinct attribute, (3) The data set can be represented by an m-by-n matrix, with m rows, one for each object, and n columns, one for each attribute.
- Transaction data is a special type of record data, where each record (transaction) involves a set of items.
- Examples of data quality problems: (1) Noise and outliers, (2) Missing values, (3) Duplicate data.
- Noise refers to the modification of original values (meaningless data).
- Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set.
- Noise consists of abnormalities in the data for which there is a known reason, while outliers are abnormalities in the data for which the reason is unknown and needs to be investigated.
- Reasons for missing data: (1) Information is not collected, (2) Attributes may not be applicable to all cases.
- Handling missing data: (1) Eliminate data objects, (2) Estimate missing values, (3) Ignore the missing values during analysis.
- Duplicate Data: A data set may include data objects that are duplicates or almost duplicates of one another. It is a major issue when merging data from heterogeneous sources. Use data cleaning to solve the issue.
- Data cleaning is the process of preparing data for analysis by removing or modifying data that is duplicated.
- Types of data preprocessing: (1) Aggregation, (2) Sampling, (3) Dimensionality reduction, (4) Feature subset selection, (5) Feature creation, (6) Discretization and binarization, (7) Attribute transformation.
- Aggregation combines two or more attributes (or objects) into a single attribute (or object).
- Purpose of Aggregation: (1) Data reduction, (2) Change of scale, (3) Provide more "stable" data.
- Sampling is the main technique employed for data selection.
- Sampling is used because it is too expensive to obtain the entire data set and too expensive to process the entire database.
- Effective Sampling: (1) Using a sample will work almost as well as using the entire data set if the sample is representative, (2) A sample is representative if it has approximately the same property (of interest) as the original set of data.
- Types of Sampling: (1) Simple random sampling, (2) Sampling without replacement, (3) Sampling with replacement, (4) Stratified sampling.
- Curse of Dimensionality: (1) Small intrinsic dimension, (2) Computationally expensive, (3) Difficult to visualize past 3 dimensions, (4) Difficult to store.
- True about the curse of dimensionality? A large number of dimensions makes knowledge extraction slower and can confuse knowledge extraction engines.
- Purpose of Dimensionality Reduction: (1) Avoid the curse of dimensionality, (2) Reduce the time and memory required by algorithms, (3) Allow easier visualization, (4) Eliminate irrelevant features or reduce noise.
- Techniques of Dimensionality Reduction: (1) Principal Component Analysis (PCA), (2) Singular Value Decomposition (SVD), (3) Supervised techniques, (4) Non-linear techniques.
- Dimensionality reduction is an unsupervised learning and feature extraction technique.
- PCA: (Goal) Find a projection that captures the largest amount of variation in the data. (Steps) Find the eigenvectors of the covariance matrix of the feature matrix; the eigenvectors define the new space. (Order) projection matrix → original data matrix → reduced data matrix → projection matrix → reconstructed data matrix.
- True about PCA? The updated feature matrix is obtained by multiplying the eigenvector matrix by the old matrix.
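The following is a minimal NumPy sketch of the PCA steps above (center the data, eigen-decompose the covariance matrix, project onto the top eigenvectors, reconstruct); function and variable names are illustrative, not from the course material.

import numpy as np

def pca_reduce(X, k):
    """X: (n_samples, n_features) data matrix; k: number of components to keep."""
    mean = X.mean(axis=0)
    Xc = X - mean                               # center the data
    cov = np.cov(Xc, rowvar=False)              # covariance matrix of the features
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]           # sort by decreasing variance
    W = eigvecs[:, order[:k]]                   # projection matrix (d x k)
    X_reduced = Xc @ W                          # reduced data matrix (n x k)
    X_reconstructed = X_reduced @ W.T + mean    # back to the original space
    return X_reduced, X_reconstructed, W

X = np.random.default_rng(0).normal(size=(100, 5))
X_red, X_rec, W = pca_reduce(X, k=2)
print(X_red.shape, X_rec.shape)                 # (100, 2) (100, 5)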
- Feature Subset Selection removes: (1) Redundant features, (2) Irrelevant features (identifying them may require domain knowledge).
- Feature Subset Selection Techniques: (1) Brute-force approach: try all possible feature subsets as input to the data mining algorithm; infeasible, since there are 2^n possibilities for n attributes, (2) Embedded approach: feature selection occurs naturally as part of the data mining algorithm, (3) Filter approaches: features are selected before the data mining algorithm is run, (4) Wrapper approach: use the data mining algorithm as a black box to find the best subset of attributes, typically without enumerating all possible subsets.
- Feature Creation Methods: (1) Feature extraction (domain specific), (2) Mapping data to a new space, (3) Feature construction (combining features).
- Mapping Data to a New Space: (1) Fourier transform, (2) Discrete wavelet transform.
- Attribute Transformation: A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values. Examples are pow(), log(), natural log(), abs(), standardization, and normalization.
- Possible methods of feature extraction: (1) Fast Fourier Transform, (2) Peak detection, (3) Discrete Cosine Transform, (4) Using raw data.
- Feature extraction is getting features from raw data; feature selection is obtaining the right features from a set of features.
- Similarity: (1) A numerical measure of how alike two data objects are, (2) Higher when objects are more alike, (3) Often falls in the range [0, 1].
- Dissimilarity: (1) A numerical measure of how different two data objects are, (2) Lower when objects are more alike, (3) Minimum dissimilarity is 0, (4) The upper limit varies.
- Proximity refers to a similarity or dissimilarity.
- Euclidean Distance: d(x,y) = sqrt( Σ_k (x_k − y_k)^2 )
- Minkowski Distance: d(x,y) = ( Σ_k |x_k − y_k|^r )^(1/r), a generalization of Euclidean distance (r = 1: Manhattan, r = 2: Euclidean).
- Common properties of a distance: (1) d(p,q) ≥ 0 for all p and q, and d(p,q) = 0 only if p = q [positive definiteness], (2) d(p,q) = d(q,p) for all p and q [symmetry], (3) d(p,r) ≤ d(p,q) + d(q,r) for all points p, q, and r [triangle inequality].
- Common properties of a similarity: (1) s(p,q) = 1 (or maximum similarity) only if p = q, (2) s(p,q) = s(q,p) for all p and q [symmetry].
- Simple Matching Coefficient (SMC) = # of matches / # of attributes = (M11 + M00) / (M01 + M10 + M11 + M00)
- Jaccard Coefficient (J) = # of 1-1 matches / # of not-both-zero attributes = M11 / (M01 + M10 + M11)
- Cosine: If d1 and d2 are two vectors, cos(d1,d2) = (d1 · d2) / (||d1|| ||d2||)
- Range of values for the cosine similarity metric: [-1, 1].
- Example: d1 = (3,2,0,5,0,0,0,2,0,0), d2 = (1,0,0,0,0,0,0,1,0,2); d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5; ||d1|| = (3^2+2^2+5^2+2^2)^0.5 = 42^0.5 = 6.481; ||d2|| = (1^2+1^2+2^2)^0.5 = 6^0.5 = 2.449; cos(d1,d2) = 5 / (6.481 * 2.449) ≈ 0.315.
- Extended Jaccard (Tanimoto) Coefficient: a variation of Jaccard for continuous or count attributes, T(p,q) = (p · q) / ( ||p||^2 + ||q||^2 − p · q )
- Cosine and the Jaccard coefficient are typical similarity measurement methods.
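A small NumPy sketch of the similarity measures above; it reproduces the cosine worked example (≈0.315), while the binary vectors p and q used for SMC/Jaccard are made up for illustration.

import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def smc(p, q):        # simple matching: (M11 + M00) / total # of attributes
    return np.mean(p == q)

def jaccard(p, q):    # M11 / (M01 + M10 + M11)
    m11 = np.sum((p == 1) & (q == 1))
    return m11 / np.sum((p == 1) | (q == 1))

def tanimoto(a, b):   # extended Jaccard for continuous / count attributes
    dot = a @ b
    return dot / (a @ a + b @ b - dot)

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])
print(round(cosine(d1, d2), 3))    # 0.315, matching the worked example
print(round(tanimoto(d1, d2), 3))  # extended Jaccard on the same vectors

p = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
q = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])
print(smc(p, q), jaccard(p, q))    # 0.7 0.0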
- Correlation(x,y) = Cov(x,y) / (s_x * s_y) = s_xy / (s_x * s_y)
- Correlation is always in the range −1 to 1.
- A correlation of 1 (−1) means that the vectors have a perfectly positive (negative) linear relationship, that is, x_k = a*y_k + b, where a and b are constants.
- Example: x = [−3,−2,−1,0,1,2,3], y = [9,4,1,0,1,4,9]. There is a perfect non-linear relationship y_k = x_k^2, but corr(x,y) = 0.
- Density-based clustering requires a notion of density. Examples are Euclidean density, probability density, and graph-based density.
- Euclidean density is the # of points within a specified radius of the point.
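A quick check of the correlation example above (assuming NumPy): y = x^2 is a perfect non-linear relationship, yet the linear correlation is 0.

import numpy as np

x = np.array([-3, -2, -1, 0, 1, 2, 3], dtype=float)
y = x ** 2                              # [9, 4, 1, 0, 1, 4, 9]
corr = np.corrcoef(x, y)[0, 1]          # Cov(x, y) / (s_x * s_y)
print(round(corr, 6))                   # 0.0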
Week 2
- Classification: (1) The training set is a given collection of records. (2) Each record contains a set of attributes; one attribute is the class. (3) Find a model for the class attribute as a function of the values of the other attributes. (4) Assign previously unseen records to a class. A test set is used to determine the accuracy of the model.
- Classification Techniques: (1) Decision tree-based methods, (2) Rule-based methods, (3) Memory-based reasoning, (4) Neural networks, (5) Naïve Bayes and Bayesian belief networks, (6) Support vector machines.
- Typical Assumptions: (1) The training set has the same distribution of classes as the test set, (2) The training set has the same statistical features as the test set.
- Root Node: 0 incoming edges and 0 or more outgoing edges.
- Internal Node: Exactly 1 incoming edge and 2 or more outgoing edges.
- Leaf Node: Exactly 1 incoming edge and no outgoing edges (represents a class).
- Decision Tree Induction algorithms: (1) Hunt's Algorithm, (2) CART, (3) ID3, C4.5, (4) SLIQ, SPRINT.
- Greedy Strategy: Split the records on the attribute that optimizes a certain criterion.
- Tree Induction Issues: (1) Determine how to split the records: how to specify the attribute test condition, and how to determine the best split? (2) Determine when to stop splitting.
- Decision trees apply to both discrete classes and continuous data (by applying thresholding).
- Specifying the test condition: (1) Depends on the attribute type (nominal, ordinal, continuous), (2) Depends on the # of ways to split (2-way split, multi-way split).
- Bad split [C0:5, C1:5]: non-homogeneous, high degree of impurity.
- Good split [C0:9, C1:1]: homogeneous, low degree of impurity.
- Measures of Node Impurity: Gini index, Entropy, Misclassification error.
- Gini Index for a given node t: Gini(t) = 1 − Σ_j [p(j|t)]^2
- Maximum (1 − 1/n_c) when records are equally distributed among all classes, implying the least interesting information.
- Minimum (0.0) when all records belong to one class, implying the most interesting information.
- Gini of a split: GINI_split = Σ_{i=1..k} (n_i / n) * GINI(i), where n_i = # of records at child i and n = # of records at the parent node.
- Entropy(t) = −Σ_j p(j|t) log p(j|t)
- Splitting based on entropy uses information gain.
- Classification Error(t) = 1 − max_i P(i|t)
- Gain: GAIN_split = Entropy(parent) − Σ_{i=1..k} (n_i / n) * Entropy(i)
- Gain is the information contained in the parent minus the weighted information contained in the child nodes.
- Impurity measures like the Gini index and entropy tend to favor attributes that have a large # of distinct values.
- Such splits result in a large # of partitions, which are small but pure. If the # of records associated with each partition is small, it is difficult to make reliable predictions.
- Gain Ratio: GainRatio_split = GAIN_split / SplitINFO, where SplitINFO = −Σ_{i=1..k} (n_i / n) log(n_i / n); it penalizes splits with many small partitions.
- Example: A parent node has 10 records.
- Case 1: Consider 10 partitions with 1 record each. (n_i / n) = 0.1 for each partition, so SplitINFO = −10*(0.1)*log2(0.1) = log2(10) = 3.32.
- Case 2: Consider 2 partitions containing 4 and 6 records. SplitINFO = −(4/10)*log2(4/10) − (6/10)*log2(6/10) = 0.97.
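A minimal sketch of the impurity and split measures above (Gini, entropy, classification error, information gain, SplitINFO, gain ratio); the two SplitINFO calls reproduce the worked example (3.32 and 0.97). Function names are illustrative.

import numpy as np

def gini(counts):
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def classification_error(counts):
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.max(p)

def info_gain(parent_counts, children_counts):
    n = np.sum(parent_counts)
    weighted = sum(np.sum(c) / n * entropy(c) for c in children_counts)
    return entropy(parent_counts) - weighted

def split_info(children_sizes):
    frac = np.asarray(children_sizes, dtype=float) / np.sum(children_sizes)
    return -np.sum(frac * np.log2(frac))

def gain_ratio(parent_counts, children_counts):
    sizes = [np.sum(c) for c in children_counts]
    return info_gain(parent_counts, children_counts) / split_info(sizes)

print(round(gini([5, 5]), 2), round(gini([9, 1]), 2))      # 0.5 (bad split) 0.18 (good split)
print(round(classification_error([9, 1]), 2))              # 0.1
print(round(info_gain([10, 10], [[10, 0], [0, 10]]), 2))   # 1.0 for a perfect split
print(round(split_info([1] * 10), 2))                      # 3.32 (Case 1 above)
print(round(split_info([4, 6]), 2))                        # 0.97 (Case 2 above)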
- Stopping Criteria for Tree Induction: (1) Stop expanding a node when all the records belong to the same class, (2) Stop expanding a node when all records have similar attribute values, (3) Early termination.
- Stopping for Hunt's Algorithm for Decision Trees: (1) No more attributes to consider, (2) The node is empty, (3) The node has only examples from one class.
- Advantages of Decision Trees: (1) Inexpensive to construct, (2) Extremely fast at classifying unknown records, (3) Easy to interpret for small-sized trees, (4) Accuracy is comparable to other classification techniques for many simple data sets.
- C4.5: (1) Simple depth-first construction, (2) Uses information gain, (3) Sorts continuous attributes at each node, (4) Needs the entire data set to fit in memory, (5) Unsuitable for large datasets.
- Why is impurity used for decision trees? (1) It helps determine which nodes to split, (2) An impure node may lead to higher generalization error and needs to be split; pure nodes do not need to be split.
- Given the impurity of the nodes of a split, how do you determine the impurity of the entire split? Take the weighted sum of the impurities of the nodes of the split, weighted by the fraction of samples present in each node w.r.t. the # of samples present in the parent node.
- Occam's Razor: Given two models with similar generalization errors, one should prefer the simpler model over the more complex model. For complex models, there is a greater chance that the model was fitted accidentally by errors in the data. Model complexity should be included when evaluating a model.
- What happens to the test error when you increase the # of training samples? The overall test error decreases.
- Estimating Generalization Errors: (1) Re-substitution errors (error on the training set), (2) Generalization errors (error on the test set; optimistic or pessimistic estimates).
- Generalization errors refer to test errors.
- Optimistic estimate: E'(t) = e(t) / Total; refers to the training error.
- Pessimistic estimate: E'(t) = (e(t) + 0.5 * # of leaf nodes) / Total.
- The pessimistic estimate is the training error plus a penalty for the # of leaf nodes.
- Validation: (1) Divide the original training data into two smaller subsets, (2) Train the model on one subset, (3) Use the other subset to estimate generalization errors, (4) The model with the lowest error on the validation set is used as the final model.
- Addressing Overfitting with Pre-pruning: Stop the algorithm before it becomes a fully grown tree (stop if all instances belong to the same class; stop if all the attribute values are the same). More restrictive conditions: stop if the # of instances is less than some user-specified threshold, or if expanding the current node does not improve the impurity measure by more than a certain threshold; it is difficult to choose the right threshold.
- Addressing Overfitting with Post-pruning: (1) Grow the decision tree to its entirety, then trim the nodes of the decision tree in a bottom-up fashion, (2) If the generalization error improves after trimming, replace the sub-tree with a leaf node, (3) The class label of the leaf node is determined from the majority class of instances in the sub-tree.
- Accuracy = (TP+TN) / (TP+TN+FP+FN)
- True Positive Rate = TP / (TP+FN) | True Negative Rate = TN / (TN+FP)
- False Positive Rate = FP / (FP+TN) | False Negative Rate = FN / (FN+TP)
- Precision: p = TP / (TP+FP) | Recall: r = TP / (TP+FN)
- Higher precision implies a lower # of false positive errors made by the classifier. Higher recall implies that very few positive examples are misclassified as the negative class.
- Good Model: (1) Maximizes both precision and recall, (2) It is easy to construct models that maximize one metric but not the other, (3) A model that assigns the positive class to every test record that matches one of the positive records in the training set has very high precision but low recall.
- F1 Measure = 2*r*p / (r+p)
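A small sketch computing the metrics above from confusion-matrix counts; the TP/TN/FP/FN numbers are made up for illustration.

def metrics(tp, tn, fp, fn):
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    tpr       = tp / (tp + fn)          # true positive rate = recall
    fpr       = fp / (fp + tn)          # false positive rate
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f1        = 2 * recall * precision / (recall + precision)
    return accuracy, tpr, fpr, precision, recall, f1

print(metrics(tp=40, tn=45, fp=5, fn=10))   # accuracy 0.85, precision ~0.889, recall 0.8, F1 ~0.842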
- Methods of Estimation: Holdout, Random subsampling, Cross-validation, Bootstrap.
- Holdout: (1) Partition the set of labeled samples into two disjoint sets, (2) Reserve 2/3 for training and 1/3 for testing, (3) Accuracy is estimated on the test set.
- Random subsampling: (1) Repeated holdout, (2) Improved estimation of the classifier's performance, (3) Overall accuracy is the average of the individual accuracies.
- Cross-validation: (1) Partition the data into k disjoint subsets of equal size, (2) k-fold: train on k−1 partitions, test on the remaining one, (3) Repeat the process for all the partitions, (4) Computationally expensive, (5) Accuracy is estimated as the average accuracy over all k folds, (6) The test sets are mutually exclusive and cover the entire data set, (7) Leave-one-out: the special case with k = n; each test set contains only one record.
- Consider N data samples in a classification problem. If we want to do leave-one-out cross-validation, what is the maximum # of folds possible? N.
- Bootstrap: (1) Training records are sampled with replacement, (2) The same training record can be selected multiple times, (3) Records not included in the bootstrap sample form the test set, (4) The process is repeated, and the average accuracy is computed.
- Receiver Operating Characteristic (ROC): (1) Developed in the 1950s for signal detection theory to analyze noisy signals, (2) An ROC curve plots TPR (on the y-axis) against FPR (on the x-axis), (3) Each point on the curve corresponds to one of the models induced by the classifier, (4) Changing the threshold of the algorithm, the sample distribution, or the cost matrix changes the location of the point.
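A minimal k-fold cross-validation sketch following the description above; train_fn and predict_fn are generic placeholders (not a specific library API), and the majority-class "model" in the demo is only for illustration.

import numpy as np

def k_fold_accuracy(X, y, k, train_fn, predict_fn, seed=0):
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)               # k disjoint folds that cover the data
    accs = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[train_idx], y[train_idx])   # train on k-1 partitions
        preds = predict_fn(model, X[test_idx])         # test on the held-out partition
        accs.append(np.mean(preds == y[test_idx]))
    return float(np.mean(accs))                  # average accuracy over all k folds

# Demo with a trivial majority-class "classifier"; leave-one-out is k = len(y).
X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 12 + [1] * 8)
train = lambda X, y: np.bincount(y).argmax()     # "model" = majority training label
predict = lambda model, X: np.full(len(X), model)
print(k_fold_accuracy(X, y, k=5, train_fn=train, predict_fn=predict))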
Week 3
- Instance-Based Classifiers: Rote learner, Nearest neighbor.
- Rote Learner: Memorizes the entire training data and performs classification only if the attributes of a record match one of the training examples exactly.
- Nearest neighbor: Uses the k closest points for performing classification.
- Nearest neighbor requires: (1) a set of stored records, (2) a distance metric, (3) the value of k, the # of nearest neighbors.
- A nearest-neighbor classifier classifies a record by: (1) computing the distance to the other records, (2) identifying the k nearest neighbors, (3) using the class labels of the nearest neighbors to determine the class label of the unknown record.
- Weight factor for distance-weighted voting: w = 1 / d^2.
- Too small a k can make the classifier sensitive to noise points.
- Too big a k can cause the neighborhood to include points from other classes.
- Scaling issue: attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes.
- Issue with data dimensionality: the curse of dimensionality.
- k-NN is a lazy learner because (1) it does not build models explicitly, (2) it is unlike eager learners such as decision tree induction, (3) classifying unknown records is relatively expensive.
- k-NN spends more computation time in the testing phase than in the training phase.
- Parallel Exemplar-Based Learning System (PEBLS): Works with both continuous and nominal features. Each record is assigned a weight factor. The number of nearest neighbors (k) is 1.
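A minimal k-nearest-neighbor sketch of the three steps above, with optional 1/d^2 distance weighting; the toy training data is made up for illustration.

import numpy as np

def knn_predict(X_train, y_train, x, k=3, weighted=False):
    d = np.linalg.norm(X_train - x, axis=1)         # (1) distances to all stored records
    nn = np.argsort(d)[:k]                          # (2) indices of the k nearest neighbors
    votes = {}
    for i in nn:                                    # (3) (weighted) majority vote
        w = 1.0 / (d[i] ** 2 + 1e-12) if weighted else 1.0   # weight factor w = 1/d^2
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + w
    return max(votes, key=votes.get)

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))   # 0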
- An experiment produces exactly one out of several possible outcomes. The sample space is the set of all possible outcomes. An event is a subset of the sample space.
- Probability Mass Function: A discrete random variable X has an associated probability mass function P_X(x), which denotes the probability that the random variable X assumes the value x.
- Probability Density Function: A continuous random variable X has an associated probability density function P_X(x), which denotes the probability that the random variable X assumes a value within a given interval.
- Normal Distribution: f(x) = 1 / (σ * sqrt(2π)) * exp( −(x − μ)^2 / (2σ^2) )
- Law of Total Probability: P(A) = Σ_n P(A|B_n) * P(B_n) = P(A|B1)*P(B1) + P(A|B2)*P(B2) + ...
- Conditional Independence: Let X, Y, and Z denote three random variables. X is said to be conditionally independent of Y given Z if the following condition holds: P(X | Y,Z) = P(X | Z).
- Conditional independence implies: P(X,Y | Z) = P(X|Z) * P(Y|Z).
- Bayes Classifier: P(C|X1,X2,X3) = [P(C) * P(X1,X2,X3|C)] / P(X1,X2,X3) = [P(C) * P(X1|C) * P(X2|C) * P(X3|C)] / P(X1,X2,X3), where the second equality uses the naïve conditional independence assumption.
- Gaussian Distribution (for continuous attributes): P(A_i | C_j) = 1 / sqrt(2π σ_ij^2) * exp( −(A_i − μ_ij)^2 / (2σ_ij^2) )
- Probability Estimation: Original P(A_i | C) = N_ic / N_c; Laplace P(A_i | C) = (N_ic + 1) / (N_c + c); m-estimate P(A_i | C) = (N_ic + m*p) / (N_c + m).
- A Naïve Bayes classifier assumes that the attributes are conditionally independent given the class. Modeling the dependence between attributes is computationally expensive.
- Naïve Bayes: (1) Robust to isolated noise points, since they are averaged out when estimating conditional probabilities, (2) Handles missing values by ignoring the instance during probability estimate calculations, (3) Robust to irrelevant attributes, (4) Correlated attributes can degrade the performance of the classifier because the conditional independence assumption no longer holds.
- # of parameters (with conditional independence): 2N + 1 → Naïve Bayes.
- # of parameters (without conditional independence): 2(2N + 1) → Bayes classifier.
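A sketch of a naïve Bayes classifier for categorical attributes using the Laplace estimate P(A_i|C) = (N_ic + 1)/(N_c + c) above; the tiny data set is made up for illustration.

import numpy as np
from collections import Counter, defaultdict

def train_nb(X, y):
    classes = Counter(y)                               # class counts N_c
    n_values = [len(set(col)) for col in zip(*X)]      # c = # of values per attribute
    counts = defaultdict(Counter)                      # (class, attribute) -> value counts N_ic
    for xi, yi in zip(X, y):
        for a, v in enumerate(xi):
            counts[(yi, a)][v] += 1
    return classes, n_values, counts

def predict_nb(model, x):
    classes, n_values, counts = model
    n = sum(classes.values())
    best, best_score = None, -np.inf
    for c, nc in classes.items():
        score = np.log(nc / n)                         # log prior P(C)
        for a, v in enumerate(x):
            nic = counts[(c, a)][v]
            score += np.log((nic + 1) / (nc + n_values[a]))   # Laplace estimate
        if score > best_score:
            best, best_score = c, score
    return best

X = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cool")]
y = ["no", "no", "yes", "yes"]
model = train_nb(X, y)
print(predict_nb(model, ("rain", "mild")))             # yes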
- Perceptron is a supervised learning algorithm for binary classifiers. It is a linear classifier (it makes its prediction based on a linear predictor function combining a set of weights with the feature vector). It allows online learning: it processes the elements of the training set one at a time.
- A perceptron is a linear classifier. A multilayer perceptron is a non-linear classifier.
- Learning the Perceptron Model: In the weight update formula, the weights should not be changed too drastically between one iteration and the next.
- The learning rate (lambda) has a value between 0 and 1 and is used to control the amount of adjustment made in each iteration. In some cases, an adaptive value of lambda can be used.
- The perceptron is guaranteed to converge to an optimal solution (as long as the learning rate is sufficiently small) for linearly separable classification problems.
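A minimal perceptron training sketch matching the description above (process one example at a time, adjust weights by a small learning rate); the toy data is linearly separable and made up for illustration.

import numpy as np

def train_perceptron(X, y, lam=0.1, epochs=100):
    """y in {-1, +1}; returns weights w and bias b."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):                 # one training example at a time
            pred = np.sign(w @ xi + b) or 1.0    # sign activation (treat 0 as +1)
            if pred != yi:                       # update only on mistakes
                w += lam * (yi - pred) * xi      # small, gradual weight adjustment
                b += lam * (yi - pred)
    return w, b

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = train_perceptron(X, y)
print(np.sign(X @ w + b))                        # [ 1.  1. -1. -1.]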
- Order of a Neural Network: Input layer, Hidden layers, Output layer.
- Common Activation Functions: Linear function, Sigmoid function, Tanh function, Sign function.
- ANN Model: Algorithm: need an efficient algorithm that converges to the right solution when sufficient training data is provided. Approach: treat each hidden node or output node in the network as an independent perceptron and apply the weight update. Disadvantage: we have no knowledge of the true outputs of the hidden nodes, so it is difficult to determine the error term associated with each node. Neural networks are sensitive to noise.
- Why do we need backpropagation; why not use the perceptron learning algorithm? In an ANN, we don't have the true error at each hidden layer.
- Gradient Descent Algorithm: θ = θ − α * dJ(θ)/dθ
- The goal of the ANN learning algorithm is to determine a set of weights w that minimize the total sum of squared errors.
- Gradient Descent Method: Algorithms based on the gradient descent method have been developed to efficiently solve this optimization problem.
- Weight update formula: W_j = W_j − λ * dE(W)/dW_j
- Design Issues in ANNs: (1) Input layer: determine the # of nodes; assign an input node to each of the input features, (2) Output layer: determine the # of nodes, (3) Select the network topology: weights and biases need to be initialized; perform random initialization. Training examples with missing values should be removed or replaced.
- Characteristics of ANNs: (1) Multilayer neural networks: it is important to select the appropriate network topology, (2) Redundant features: handled through the learned weights, (3) Training data: neural networks are sensitive to the presence of noise; the gradient descent method used for learning the weights of an ANN converges to some local optimum; training is a time-consuming process.
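A sketch of the gradient-descent weight update W_j = W_j − λ * dE/dW_j above, applied to a single sigmoid unit with squared error; this is a simplification of full multilayer backpropagation, and the AND data set is only for illustration.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_sigmoid_unit(X, y, lam=0.5, epochs=5000):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        out = sigmoid(X @ w + b)            # forward pass
        err = out - y                       # dE/d(out) for E = 0.5 * sum((out - y)^2)
        grad = err * out * (1 - out)        # chain rule through the sigmoid
        w -= lam * (X.T @ grad)             # W_j = W_j - lambda * dE/dW_j
        b -= lam * grad.sum()
    return w, b

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 0.0, 0.0, 1.0])          # logical AND
w, b = train_sigmoid_unit(X, y)
print(np.round(sigmoid(X @ w + b)))         # [0. 0. 0. 1.]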
- The learning task in an SVM can be formalized as minimizing ||w||^2 / 2, subject to the constraint that every training record lies on the correct side of the margin (y_i(w · x_i + b) ≥ 1).
- Non-separable case: introduce slack variables ξ_i ≥ 0 and minimize ||w||^2 / 2 + C * Σ_i ξ_i.
- Non-linear SVM: (1) Lower-dimensional space: input space, original attribute space, (2) Higher-dimensional space: feature space, kernel space, Hilbert space.
- Similarity function: (1) The function K is called a kernel function, (2) Not every similarity function can be used as a kernel function, (3) The kernel function computed for a pair of vectors should be equivalent to the dot product between the vectors in the feature space, (4) This is ensured by Mercer's Theorem.
- Mercer's Theorem: (1) Theorem: the function k(·,·) is a valid kernel if and only if the corresponding Gram matrix is symmetric and positive semi-definite (PSD), (2) Components: Mercer kernels, the kernel trick, (3) Consequences: with a valid kernel, the entire computation is done in the lower-dimensional input space, the curse of dimensionality can be avoided, and computing similarity using kernel functions is much cheaper.
- SVM Characteristics: (1) The SVM learning problem can be formulated as a convex optimization problem, so efficient algorithms can be used to find the global minimum of the objective function, (2) SVMs maximize the margin of the decision boundary; they are also called maximum margin models, (3) The user needs to provide parameters such as the type of kernel function, the kernel parameters, and the value of the constant C that introduces the slack variables, (4) SVMs can be applied to categorical data by introducing dummy variables for each categorical attribute value present in the data, (5) SVMs can be used for multiclass classification using the 1-against-rest or 1-against-1 approach.
- The dimensionality of the data doesn't affect the performance of SVMs.
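A sketch relating to Mercer's theorem above (assuming NumPy): build the Gram matrix of an RBF kernel and check that it is symmetric and positive semi-definite; the random data is for illustration only.

import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))      # k(x,z) = exp(-gamma * ||x - z||^2)

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
K = np.array([[rbf_kernel(a, b) for b in X] for a in X])   # Gram matrix

print(np.allclose(K, K.T))                            # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)          # PSD: eigenvalues >= 0 (up to tolerance)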