CS 217: Artificial Intelligence and Machine Learning Jan 10, 2024
Lecture 1: The Basics of Optimization
Lecturer: Swaprava Nath Scribes: SG1 & SG2
Disclaimer: These notes aggregate content from several texts and have not been subjected to the usual
scrutiny deserved by formal publications. If you find errors, please bring them to the notice of the Instructor.
In this lecture, we discuss what optimization essentially is, where it is used, and look at a very
basic optimization technique called linear programming.
Broadly, optimization problems can be classified into two classes:
Class                     Variable     Solution space   Solution complexity
Continuous optimization   continuous   infinite         polynomial in the size of the problem
Discrete optimization     discrete     finite           exponential in the size of the problem

Table 1.1: Classes of optimization problems.
The knapsack problem is a classic discrete optimization problem, where we are given a bunch of objects with
specific weights and a knapsack (backpack, for instance), which can carry at most some amount of weight.
We want to fill our knapsack up to the highest permissible weight. There is no partial inclusion
of an item in the knapsack: an item is either in the knapsack or not. No polynomial-time algorithm is
known for the most general knapsack problem. One way to solve it is to brute-force all possible combinations
and pick the best feasible one, but this is not efficient.
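The brute-force approach can be sketched as follows (a minimal illustration, not from the lecture; the function name and the example weights/values are our own):

```python
from itertools import combinations

def knapsack_brute_force(weights, values, capacity):
    """Try every subset of items; keep the most valuable one that fits."""
    n = len(weights)
    best_value, best_subset = 0, ()
    for r in range(n + 1):
        for subset in combinations(range(n), r):
            total_weight = sum(weights[i] for i in subset)
            total_value = sum(values[i] for i in subset)
            if total_weight <= capacity and total_value > best_value:
                best_value, best_subset = total_value, subset
    return best_value, best_subset
```

Enumerating all 2^n subsets quickly becomes infeasible as n grows, which is exactly the exponential blow-up noted in Table 1.1.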
In continuous optimization problems, even though the solution space is infinite, it is not very difficult to find
the optimal solution. We will be talking about continuous optimization problems.
There is a significant trend in AI of formulating problems in terms of optimization problems.
1.1 Optimization: What is it? — Motivating Applications
There are two components of an optimization problem: the objective function, which we want to minimize
(maximizing is equivalent to minimizing the negative of the objective), and the constraint set. Let f(x) be
the objective function, and C be the constraint set. Then, the optimization problem is written as
$$\min_{x \in C} f(x).$$
Example. Minimize (x − 2)2 with the constraint that x ∈ [0, 1] ∪ [4, 7].
Here, the objective function is f (x) = (x−2)2 and the constraint set is C = [0, 1]∪[4, 7]. This is a one-variable
function, and we can easily see from the plot in Figure 1.1a that x∗ = 1 is the optimal solution: the
unconstrained minimizer x = 2 is infeasible, and x = 1 is the closest feasible point to it.
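For a convex objective like this one, the minimum over a union of closed intervals is attained either at an interval endpoint or at a feasible stationary point, so it can be found by checking a handful of candidates (a small sketch of ours, not part of the lecture):

```python
def minimize_over_intervals(f, intervals, stationary_points=()):
    """Minimize a convex f over a union of closed intervals by checking
    every interval endpoint plus any feasible stationary point of f."""
    candidates = [p for lo, hi in intervals for p in (lo, hi)]
    candidates += [p for p in stationary_points
                   if any(lo <= p <= hi for lo, hi in intervals)]
    return min(candidates, key=f)

# f(x) = (x - 2)^2 over C = [0, 1] ∪ [4, 7]; stationary point is x = 2
x_star = minimize_over_intervals(lambda x: (x - 2) ** 2, [(0, 1), (4, 7)],
                                 stationary_points=[2.0])
```

Here x = 2 lies in neither interval, so only the endpoints 0, 1, 4, 7 are compared and x∗ = 1 wins.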
Example (Geometry). Suppose we have a map of a country with cities represented by the points y1 , . . . , yn
in two dimensions. We want to set up a supply chain and deliver to all these cities. We want to build a
warehouse such that the sum of the distances from it to all the cities is minimized (assuming goods can be
transported along straight lines, i.e., Euclidean distances).
Figure 1.1: Examples of optimization problems. (a) A plot of f(x) = (x − 2)^2. (b) An example of
curve-fitting. (c) Map of a country with cities at y1 , . . . , yn and a warehouse at x.
Let the warehouse be located at x. Then, the optimization problem is
$$\min_{x \in C} \; \sum_{i=1}^{n} \lVert x - y_i \rVert_2,$$
where C is the set of all points in the country, and $\lVert x \rVert_2 = \sqrt{x_1^2 + x_2^2}$ is the L2 norm of x = (x1 , x2 ).
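This objective is the classical geometric-median (Fermat–Weber) problem; it has no closed-form solution in general, but a standard fixed-point scheme, Weiszfeld's algorithm, solves it iteratively. The sketch below is our own illustration (the iteration count and tolerance are arbitrary, and the constraint x ∈ C is ignored):

```python
import math

def geometric_median(points, iters=500, eps=1e-12):
    """Weiszfeld's algorithm: repeatedly re-average the points, weighting
    each by the inverse of its distance to the current estimate."""
    # start from the centroid
    x = [sum(p[0] for p in points) / len(points),
         sum(p[1] for p in points) / len(points)]
    for _ in range(iters):
        num, den = [0.0, 0.0], 0.0
        for px, py in points:
            d = math.hypot(x[0] - px, x[1] - py)
            if d < eps:          # estimate coincides with a city; stop here
                return x
            num[0] += px / d
            num[1] += py / d
            den += 1.0 / d
        x = [num[0] / den, num[1] / den]
    return x
```

For cities at the four corners of the unit square, the algorithm returns the centre (0.5, 0.5), as symmetry suggests.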
Example (Computer vision). Image de-blurring is a common problem in computer vision. We want to
de-blur a blurred image according to some policy.
We consider grayscale images of size m × n, where each pixel has an intensity value in [0, 1]: 0 meaning black,
and 1 meaning white. Let y = [yi,j ]m×n be the blurred image that we are given. Let x = [xi,j ]m×n be the
original image and k be the blurring filter which was applied to get y. Then, to get the original (de-blurred)
image, we have to solve
$$\min_{x \in [0,1]^{m \times n}} \; \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \bigl| y_{i,j} - (k * x)_{i,j} \bigr| \;+\; \lambda \underbrace{\left( \sum_{i=0}^{m-1} \sum_{j=0}^{n-2} (x_{i,j} - x_{i,j+1})^2 + \sum_{i=0}^{m-2} \sum_{j=0}^{n-1} (x_{i+1,j} - x_{i,j})^2 \right)}_{\text{to reduce sudden intensity changes}},$$
where the inner sums stop one index short of the boundary so that every pixel difference is well defined.
Here λ and k are the hyperparameters: λ penalizes a large difference in intensity between adjacent pixels, and k models the blur.
Example (Machine learning). Suppose we have inputs (xi , yi ) for i ∈ [n]. We are trying to fit a curve
through these points. Suppose that we hypothesize the curve to be a polynomial hθ (x) = w0 + w1 x + w2 x2 ,
where θ = (w0 , w1 , w2 ) is a vector of the parameters. For each point xi , we want the hypothesized point
hθ (xi ) to be close to yi . Let ℓ(x, y) = (x − y)2 be a (very simple) loss function. Our objective is to minimize
the sum of the loss over all the observed points.
The optimization problem is
$$\min_{\theta \in \mathbb{R}^3} \; \sum_{i=1}^{n} \ell(h_\theta(x_i), y_i).$$
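With the squared loss, this particular problem has a closed-form solution via the normal equations (Φ⊤Φ)θ = Φ⊤y, where Φ has rows (1, x_i, x_i²). A self-contained sketch of ours, with a hand-rolled 3×3 solver:

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of h(x) = w0 + w1*x + w2*x^2 by solving the
    normal equations (Phi^T Phi) theta = Phi^T y for theta = (w0, w1, w2)."""
    # entries of Phi^T Phi are power sums; Phi^T y are moment sums
    A = [[sum(x ** (i + j) for x in xs) for j in range(3)] for i in range(3)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(3)]
    # Gaussian elimination with partial pivoting
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, 3):
            f = A[r][col] / A[col][col]
            for c in range(col, 3):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    theta = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        theta[r] = (b[r] - sum(A[r][c] * theta[c] for c in range(r + 1, 3))) / A[r][r]
    return theta
```

On data generated exactly from a quadratic, the fit recovers the generating coefficients.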
Lecture 1: The Basics of Optimization 1-3
1.2 Linear Programming: A way for Optimization
Linear programming is another type of optimization, where we seek a vector x that maximizes a linear
function while satisfying some linear constraints. Let x be a vector of size n, and let A (of size m × n),
c (of size n), and b (of size m) be given matrices/vectors. Then the problem is
$$\max_x \; c^\top x \quad \text{subject to} \quad Ax \le b \ \text{and} \ x \ge 0.$$
The above form is the standard form of a linear program. If x = (x1 , x2 , . . . , xn ) and c = (c1 , c2 , . . . , cn ),
then the objective function to be maximized is
$$c^\top x = c_1 x_1 + c_2 x_2 + \cdots + c_n x_n.$$
For any two vectors u and v, u ≤ v means ui ≤ vi for all i.
Example (Political campaigning). Investing money to win an election.
Consider a political scenario where a political party P needs to invest money to win elections. There are
3 different demographic classes, class 1, class 2, and class 3, and four issues to be addressed: A, B, C, and
D. There is a specific pattern in which people from different classes respond to different issues. Table 1.2
contains the number of votes gained or lost by the political party in each class per unit money spent on an
issue, the population of each class, and the majority required by the party to win in each class.
Issues       Class 1   Class 2   Class 3
A              −2         5         3
B               8         2        −5
C               0         0        10
D              10         0        −2
Population  100 000   200 000    50 000
Majority     50 000   100 000    25 000

Table 1.2: Votes gained or lost in each class per unit money spent on each issue, with class populations and required majorities.
The aim of the political party is to minimize the total amount of money it needs to invest, yet get the
required majority across each class. Let x1 , x2 , x3 , x4 be the amount of money the party invests in issues A,
B, C, and D respectively. Then, we have the following optimization problem:
$$\min_{x_1, x_2, x_3, x_4} \; x_1 + x_2 + x_3 + x_4$$
subject to the constraints
−2x1 + 8x2 + 0x3 + 10x4 ≥ 50 000, (1.1)
5x1 + 2x2 + 0x3 + 0x4 ≥ 100 000, (1.2)
3x1 − 5x2 + 10x3 − 2x4 ≥ 25 000, (1.3)
and x1 , x2 , x3 , x4 ≥ 0. Let us look at the optimal solution:
$$x_1^* = \frac{2\,050\,000}{111}, \qquad x_2^* = \frac{425\,000}{111}, \qquad x_3^* = 0, \qquad x_4^* = \frac{625\,000}{111}.$$
The optimal value is x∗1 + x∗2 + x∗3 + x∗4 = 3 100 000/111.
To see that no feasible solution can do better, multiply and add the constraints as
(1.1) · 25/222 + (1.2) · 46/222 + (1.3) · 14/222 to get
$$x_1 + x_2 + \frac{140}{222}\,x_3 + x_4 \ge \frac{3\,100\,000}{111}.$$
Since x3 ≥ 0, every feasible solution satisfies x1 + x2 + x3 + x4 ≥ x1 + x2 + (140/222) x3 + x4 ≥ 3 100 000/111,
which proves that the given solution is truly optimal.
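The arithmetic of this optimality certificate can be verified exactly (a quick check of ours using Python's `fractions`; the constraint coefficients are read off Table 1.2):

```python
from fractions import Fraction as F

# Constraint rows (one per class, columns = issues A, B, C, D)
# and right-hand sides (required majorities)
A = [[-2, 8, 0, 10],   # class 1
     [5, 2, 0, 0],     # class 2
     [3, -5, 10, -2]]  # class 3
b = [50_000, 100_000, 25_000]
multipliers = [F(25, 222), F(46, 222), F(14, 222)]

# Combine the three constraints with the given multipliers
combined = [sum(m * row[j] for m, row in zip(multipliers, A)) for j in range(4)]
rhs = sum(m * bi for m, bi in zip(multipliers, b))

# Claimed optimal solution and its objective value
x = [F(2_050_000, 111), F(425_000, 111), F(0), F(625_000, 111)]
objective = sum(x)
```

The combined coefficients come out to (1, 1, 140/222, 1) with right-hand side 3 100 000/111, matching the objective value of the claimed solution; the solution also satisfies all three constraints with equality.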
1.2.1 Standard Form of Linear Program
Let x ∈ R^n be the vector containing the variables to optimize and c ∈ R^n be the vector of constants:
$$x = \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix}, \qquad c = \begin{pmatrix} c_1 \\ \vdots \\ c_n \end{pmatrix}.$$
Then, we can write the standard form of a linear program as $\max_x c^\top x$ subject to the constraints
$$A_{m \times n}\, x_{n \times 1} \le b_{m \times 1} = \begin{pmatrix} b_1 \\ \vdots \\ b_m \end{pmatrix} \quad \text{and} \quad x \ge 0,$$
where ≥ (and ≤) denote element-wise comparison:
$$\begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} \ge \begin{pmatrix} 0 \\ \vdots \\ 0 \end{pmatrix} \iff x_i \ge 0 \ \forall i.$$
The aforementioned problem is commonly referred to as the primal problem and it is accompanied by a
corresponding dual problem:
Primal Problem:
$$\max_x \; c^\top x \quad \text{s.t.} \quad Ax \le b, \ x \ge 0.$$
Dual Problem:
$$\min_y \; b^\top y \quad \text{s.t.} \quad A^\top y \ge c, \ y \ge 0.$$
1.2.2 Principles of Duality — Strong & Weak Duality
We will look at two famous and ubiquitous duality principles — weak duality and strong duality. We
provide a proof of the first in these notes.
Theorem 1.1 (Weak Duality Principle) Let x and y represent feasible solutions, i.e., solutions that
satisfy all the constraints, for the primal and dual problems, respectively. Then,
b⊤ y ≥ c⊤ x.
Proof. Since x is a feasible solution of the primal problem,
$$Ax \le b,$$
and taking transposes,
$$x^\top A^\top \le b^\top.$$
Since y ≥ 0, multiplying both sides by y on the right preserves the inequality:
$$x^\top A^\top y \le b^\top y.$$
Since y is a feasible solution of the dual problem, A⊤y ≥ c, and since x ≥ 0, multiplying on the left by x⊤ gives x⊤A⊤y ≥ x⊤c. Combining the two inequalities,
$$b^\top y \ge x^\top A^\top y \ge x^\top c = c^\top x.$$
Thus, the weak duality principle relates feasible solutions of the primal and dual problems: every dual feasible value upper-bounds every primal feasible value.
Theorem 1.2 (Strong Duality Principle) For any optimal solution x∗ of the primal problem and any
optimal solution y ∗ of the dual problem, the two optimal values are equal:
$$b^\top y^* = c^\top x^*.$$
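Weak duality is easy to observe numerically. The sketch below samples feasible points of a small primal/dual pair and checks the inequality of Theorem 1.1 (the matrix A, vectors b and c, and the sampling scheme are all our own illustrative choices):

```python
import random

# A small standard-form LP: max c^T x  s.t.  Ax <= b, x >= 0
A = [[1.0, 2.0], [3.0, 1.0]]
b = [4.0, 6.0]
c = [1.0, 1.0]

def primal_feasible(x):
    return all(xi >= 0 for xi in x) and all(
        sum(A[i][j] * x[j] for j in range(2)) <= b[i] for i in range(2))

def dual_feasible(y):
    # dual constraints: A^T y >= c, y >= 0
    return all(yi >= 0 for yi in y) and all(
        sum(A[i][j] * y[i] for i in range(2)) >= c[j] for j in range(2))

random.seed(0)
pairs_checked = 0
while pairs_checked < 1000:
    x = [random.uniform(0, 3) for _ in range(2)]
    y = [random.uniform(0, 3) for _ in range(2)]
    if primal_feasible(x) and dual_feasible(y):
        # weak duality: c^T x <= b^T y for every feasible pair
        assert sum(ci * xi for ci, xi in zip(c, x)) <= sum(bi * yi for bi, yi in zip(b, y))
        pairs_checked += 1
```

No sampled feasible pair ever violates the bound, illustrating that every dual feasible value sits above every primal feasible value.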