Lecture 17: Supervised Learning Recap
Machine Learning
April 6, 2010
Last Time
- Support Vector Machines
- Kernel Methods
Today
- Short recap of Kernel Methods
- Review of Supervised Learning
- Unsupervised Learning
  - (Soft) K-means clustering
  - Expectation Maximization
  - Spectral Clustering
  - Principal Components Analysis
  - Latent Semantic Analysis
Kernel Methods
- Feature extraction to higher-dimensional spaces.
- Kernels describe the relationship between vectors (points) rather than the new feature space directly.
When can we use kernels?
- Any time training and evaluation are both based on the dot product between two points:
  - SVMs
  - Perceptrons
  - k-nearest neighbors
  - k-means
  - etc.
Kernels in SVMs
- Optimize the αi's and the bias w.r.t. the kernel.
- Decision function:
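The formulas on this slide were images in the original deck; as a reconstruction in my own notation (not the slide's exact form), the soft-margin dual and the kernelized decision function are:

    \max_{\alpha}\; \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j K(x_i, x_j)
    \quad \text{s.t.}\;\; 0 \le \alpha_i \le C, \;\; \sum_i \alpha_i y_i = 0

    f(x) = \operatorname{sign}\Bigl( \sum_i \alpha_i y_i K(x_i, x) + b \Bigr)

Only the support vectors end up with αi > 0, so the sum in the decision function is typically sparse.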
Kernels in Perceptrons
- Training
- Decision function
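As a concrete sketch (my code, not the lecture's), a kernelized perceptron keeps a dual weight αi counting how often point i was misclassified; both training and prediction touch the data only through the kernel. The RBF kernel, gamma, and epochs here are illustrative choices.

    # Kernel perceptron sketch in Python/NumPy (illustrative; names are mine).
    import numpy as np

    def rbf_kernel(a, b, gamma=1.0):
        return np.exp(-gamma * np.sum((a - b) ** 2))

    def train_kernel_perceptron(X, y, kernel=rbf_kernel, epochs=10):
        # y must be in {-1, +1}; alpha[i] counts mistakes made on point i.
        n = len(X)
        alpha = np.zeros(n)
        for _ in range(epochs):
            for i in range(n):
                s = sum(alpha[j] * y[j] * kernel(X[j], X[i]) for j in range(n))
                if y[i] * s <= 0:      # mistake: bump this point's dual weight
                    alpha[i] += 1
        return alpha

    def predict(X_train, y_train, alpha, x, kernel=rbf_kernel):
        s = sum(alpha[j] * y_train[j] * kernel(X_train[j], x) for j in range(len(X_train)))
        return 1 if s >= 0 else -1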
Good and Valid Kernels
- Good: computing K(xi, xj) is cheaper than computing ϕ(xi).
- Valid:
  - Symmetric: K(xi, xj) = K(xj, xi)
  - Decomposable into ϕ(xi)T ϕ(xj)
  - Positive semi-definite Gram matrix
- Popular kernels:
  - Linear, Polynomial
  - Radial Basis Function
  - String (technically infinite dimensions)
  - Graph
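A quick way to sanity-check validity numerically (my example, not from the slides; the RBF kernel and gamma are assumptions) is to build the Gram matrix and confirm it is symmetric with non-negative eigenvalues:

    # Build an RBF Gram matrix and check symmetry / positive semi-definiteness.
    import numpy as np

    def rbf_gram(X, gamma=0.5):
        # K[i, j] = exp(-gamma * ||x_i - x_j||^2)
        sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        return np.exp(-gamma * sq_dists)

    X = np.random.randn(20, 3)
    K = rbf_gram(X)
    assert np.allclose(K, K.T)                                   # symmetric
    print("smallest eigenvalue:", np.linalg.eigvalsh(K).min())   # >= 0 up to rounding => PSD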
Supervised Learning
- Linear Regression
- Logistic Regression
- Graphical Models
- Hidden Markov Models
- Neural Networks
- Support Vector Machines
- Kernel Methods
Major concepts
- Gaussian, Multinomial, Bernoulli Distributions
- Joint vs. Conditional Distributions
- Marginalization
- Maximum Likelihood
- Risk Minimization
- Gradient Descent
- Feature Extraction, Kernel Methods
Some favorite distributions
- Bernoulli
- Multinomial
- Gaussian
Maximum Likelihood
- Identify the parameter values that yield the maximum likelihood of generating the observed data:
  - Take the partial derivative of the likelihood function
  - Set to zero
  - Solve
- NB: maximum likelihood parameters are the same as maximum log likelihood parameters.
Maximum Log Likelihood
- Why do we like the log function?
- It turns products (difficult to differentiate) into sums (easy to differentiate):
  - log(xy) = log(x) + log(y)
  - log(x^c) = c log(x)
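A short worked example (added here, not on the slides): for N coin flips x1, ..., xN ∈ {0, 1} with Bernoulli parameter μ, the log turns the product likelihood into a sum, and the derivative-set-to-zero recipe gives the sample mean:

    \ell(\mu) = \log \prod_{n=1}^{N} \mu^{x_n} (1 - \mu)^{1 - x_n}
              = \sum_{n=1}^{N} \bigl[ x_n \log \mu + (1 - x_n) \log(1 - \mu) \bigr]

    \frac{\partial \ell}{\partial \mu} = \frac{\sum_n x_n}{\mu} - \frac{N - \sum_n x_n}{1 - \mu} = 0
    \quad\Rightarrow\quad \hat{\mu}_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n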
Risk Minimization
- Pick a loss function:
  - Squared loss
  - Linear loss
  - Perceptron (classification) loss
- Identify the parameters that minimize the loss function:
  - Take the partial derivative of the loss function
  - Set to zero
  - Solve
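For squared loss with a linear model, that recipe can be carried out in closed form (a standard derivation in my notation, not copied from the slides):

    R(w) = \frac{1}{N} \sum_{n=1}^{N} (y_n - w^\top x_n)^2, \qquad
    \nabla_w R(w) = -\frac{2}{N} \sum_n (y_n - w^\top x_n)\, x_n = 0

    \Rightarrow\; X^\top X\, w = X^\top y
    \;\Rightarrow\; \hat{w} = (X^\top X)^{-1} X^\top y

For losses without a closed-form minimizer (e.g., perceptron or logistic loss), gradient descent takes over.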
Frequentists vs. Bayesians
- Point estimates vs. posteriors
- Risk Minimization vs. Maximum Likelihood
- L2-Regularization:
  - Frequentists: add a constraint on the size of the weight vector.
  - Bayesians: introduce a zero-mean prior on the weight vector.
  - The result is the same!
L2-Regularization
- Frequentists: introduce a cost on the size of the weights.
- Bayesians: introduce a prior on the weights.
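To make the "result is the same" claim concrete (my derivation sketch, standard notation; σ² and τ² are assumed noise and prior variances): with Gaussian noise and a zero-mean Gaussian prior on the weights, the MAP estimate minimizes exactly the L2-penalized squared loss.

    \hat{w}_{\text{ridge}} = \arg\min_w \sum_n (y_n - w^\top x_n)^2 + \lambda \|w\|^2

    \hat{w}_{\text{MAP}} = \arg\max_w \bigl[ \log p(y \mid X, w) + \log \mathcal{N}(w \mid 0, \tau^2 I) \bigr]
                         = \arg\min_w \frac{1}{2\sigma^2} \sum_n (y_n - w^\top x_n)^2 + \frac{1}{2\tau^2} \|w\|^2

The two coincide when λ = σ²/τ².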
Types of Classifiers
- Generative Models: highest resource requirements; need to approximate the joint probability.
- Discriminative Models: moderate resource requirements; typically fewer parameters to approximate than generative models.
- Discriminant Functions: can be trained probabilistically, but the output does not include confidence information.
Linear Regression
- Fit a line to a set of points.
Linear Regression
- Extension to higher dimensions
- Polynomial fitting
- Arbitrary function fitting:
  - Wavelets
  - Radial basis functions
- Classifier output
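A brief sketch (mine, not the lecture's code) of how the same linear machinery fits a polynomial: expand each input through basis functions, then solve ordinary least squares on the expanded design matrix. The degree, noise level, and helper names are illustrative assumptions.

    # Linear regression with a polynomial basis: linear in w, nonlinear in x.
    import numpy as np

    def poly_features(x, degree=3):
        # Columns are [1, x, x^2, ..., x^degree].
        return np.vander(x, degree + 1, increasing=True)

    def fit_least_squares(Phi, y):
        # Minimize ||Phi w - y||^2 with a numerically stable solver.
        w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        return w

    x = np.linspace(0, 1, 50)
    y = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(50)
    w = fit_least_squares(poly_features(x), y)
    y_hat = poly_features(x) @ w   # fitted curve at the training inputs

Swapping poly_features for wavelet or radial-basis features changes nothing else in the fit.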
Logistic Regression
- Fit Gaussians to the data for each class.
- The decision boundary is where the PDFs cross.
- Setting the gradient of the log likelihood to zero has no closed-form solution, so we use Gradient Descent.
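For reference (standard form, my notation, not reproduced from the slide), the model and the gradient used in the iterative updates are:

    p(y = 1 \mid x) = \sigma(w^\top x) = \frac{1}{1 + e^{-w^\top x}}

    \nabla_w \ell(w) = \sum_n \bigl( y_n - \sigma(w^\top x_n) \bigr)\, x_n, \qquad y_n \in \{0, 1\}

    w \leftarrow w + \eta\, \nabla_w \ell(w)

The gradient itself has a simple closed form; it is the equation ∇ℓ(w) = 0 that cannot be solved analytically, which is why the iterative update is needed.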
Graphical Models
- A general way to describe the dependence relationships between variables.
- The Junction Tree Algorithm allows us to efficiently calculate marginals over any variable.
Junction Tree Algorithm
- Moralization: "marry the parents", then make the graph undirected.
- Triangulation: add edges so that no chordless cycle of length four or more remains.
- Junction Tree Construction: identify separators such that the running intersection property holds.
- Introduction of Evidence: pass slices around the junction tree to generate marginals.
Hidden Markov Models
- Sequential modeling
- Generative model
- Models the relationship between observations and state (class) sequences.
Perceptron
- Step function used for squashing.
- Classifier-as-neuron metaphor.
Perceptron Loss
- Classification error vs. sigmoid error
- Loss is only calculated on mistakes.
- Perceptrons use strictly classification error.
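Written out (my notation, not the slide's), the perceptron criterion sums the negated margins of the misclassified points only:

    E_P(w) = -\sum_{n \in \mathcal{M}} y_n\, w^\top x_n, \qquad y_n \in \{-1, +1\},\;\; \mathcal{M} = \text{set of misclassified points}

    \text{update on each mistake:}\quad w \leftarrow w + \eta\, y_n x_n

Correctly classified points contribute zero loss and zero gradient, which matches the "loss only on mistakes" bullet above.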
Neural Networks
- Interconnected layers of Perceptron or Logistic Regression “neurons”.
Neural Networks
- There are many possible configurations of neural networks:
  - Vary the number of layers
  - Vary the size of the layers
Support Vector Machines
- Maximum margin classification
- (Figures: small margin vs. large margin)
Support Vector Machines
- Optimization function
- Decision function
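The slide showed these as images; in standard hard-margin form, and in my notation rather than the slide's, they are:

    \min_{w, b}\; \frac{1}{2} \|w\|^2 \quad \text{s.t.}\;\; y_i \,(w^\top x_i + b) \ge 1 \;\; \forall i

    f(x) = \operatorname{sign}(w^\top x + b)

The kernelized, soft-margin dual of this problem is the version given earlier under "Kernels in SVMs".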
Visualization of Support Vectors
Questions?
- Now would be a good time to ask questions about supervised techniques.
Clustering
- Identify discrete groups of similar data points.
- Data points are unlabeled.
Recall K-Means
Algorithm:
1. Select K, the desired number of clusters.
2. Initialize K cluster centroids.
3. For each point in the data set, assign it to the cluster with the closest centroid.
4. Update each centroid based on the points assigned to it.
5. If any data point has changed clusters, repeat.
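A compact implementation sketch of that loop (my code, for illustration only; the random initialization and max_iters cap are my choices):

    # K-means: alternate nearest-centroid assignment and centroid updates
    # until no point changes cluster.
    import numpy as np

    def kmeans(X, k, max_iters=100, seed=0):
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
        assignments = np.full(len(X), -1)
        for _ in range(max_iters):
            # Assignment step: each point goes to its closest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
            new_assignments = dists.argmin(axis=1)
            if np.array_equal(new_assignments, assignments):
                break                       # nothing changed clusters: converged
            assignments = new_assignments
            # Update step: centroid = mean of its assigned points
            # (empty clusters keep their old centroid).
            for j in range(k):
                if np.any(assignments == j):
                    centroids[j] = X[assignments == j].mean(axis=0)
        return centroids, assignments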
k-means output
Soft K-means
- In k-means, we force every data point to belong to exactly one cluster.
- This constraint can be relaxed.
- Minimizes the entropy of cluster assignment.
Soft k-means example
Soft k-means
- We still define a cluster by a centroid, but we calculate the centroid as the weighted mean of all the data points.
- Convergence is based on a stopping threshold rather than changed assignments.
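One common way to realize this (my sketch; the softmax form and the stiffness parameter beta are assumptions, not something specified on the slides) is to make the assignment weights a softmax over negative squared distances:

    # Soft k-means: soft responsibilities, weighted-mean centroid updates,
    # and a movement threshold as the stopping criterion.
    import numpy as np

    def soft_kmeans(X, k, beta=2.0, tol=1e-4, max_iters=200, seed=0):
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
        for _ in range(max_iters):
            # Soft assignment of every point to every cluster.
            sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
            logits = -beta * sq_dists
            resp = np.exp(logits - logits.max(axis=1, keepdims=True))
            resp /= resp.sum(axis=1, keepdims=True)       # each row sums to 1
            # Each centroid is the responsibility-weighted mean of all points.
            new_centroids = (resp.T @ X) / resp.sum(axis=0)[:, None]
            if np.linalg.norm(new_centroids - centroids) < tol:
                break                                     # stopping threshold
            centroids = new_centroids
        return centroids, resp

Larger beta makes the assignments harder (approaching standard k-means); smaller beta spreads each point's weight across more clusters.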
Gaussian Mixture Models
- Rather than identifying clusters by “nearest” centroids, fit a set of k Gaussians to the data.
GMM example
Gaussian Mixture Models
- Formally, a mixture model is the weighted sum of a number of PDFs, where the weights are determined by a distribution π:

  p(x) = π_0 f_0(x) + π_1 f_1(x) + π_2 f_2(x) + ... + π_k f_k(x)
Graphical Models with Unobserved Variables
- What if you have variables in a graphical model that are never observed? These are latent variables.
- Training latent variable models is an unsupervised learning application.
- (Figure: an example model whose nodes are labeled “uncomfortable”, “amused”, “laughing”, and “sweating”.)
Latent Variable HMMs
- We can cluster sequences using an HMM with unobserved state variables.
- We will train these latent variable models using Expectation Maximization.
Expectation Maximization
- The training of both GMMs and Gaussian models with latent variables is accomplished using Expectation Maximization.
- Step 1: Expectation (E-step). Evaluate the “responsibilities” of each cluster with the current parameters.
- Step 2: Maximization (M-step). Re-estimate the parameters using the existing “responsibilities”.
- Related to k-means.
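For the Gaussian mixture case, the two steps take the following standard form (my notation, not the lecture's, with N_k = Σ_n γ(z_nk)):

    \text{E-step:}\quad \gamma(z_{nk}) = \frac{\pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}

    \text{M-step:}\quad \mu_k = \frac{1}{N_k} \sum_n \gamma(z_{nk})\, x_n, \qquad
    \Sigma_k = \frac{1}{N_k} \sum_n \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^\top, \qquad
    \pi_k = \frac{N_k}{N}

If every responsibility is forced to 0 or 1, the E-step reduces to nearest-centroid assignment and the M-step to a plain mean, which is the k-means connection noted above.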
Questions
- One more time for questions on supervised learning…
Next Time
- Gaussian Mixture Models (GMMs)
- Expectation Maximization

Editor's Notes

- #39 (the GMM example slide): p(x) = \pi_0 f_0(x) + \pi_1 f_1(x) + \pi_2 f_2(x) + \cdots + \pi_k f_k(x)