Algorithms for Data Science
CSOR W4246
Eleni Drinea
Computer Science Department
Columbia University
Thursday, September 10, 2015
Outline
1 Asymptotic notation
2 The divide & conquer principle; application: mergesort
3 Solving recurrences and running time of mergesort
Review of the last lecture
- Introduced the problem of sorting.
- Analyzed insertion-sort.
  - Worst-case running time: T(n) = O(n^2)
  - Space: in-place algorithm
- Worst-case running time analysis: a reasonable measure of algorithmic efficiency.
- Defined polynomial-time algorithms as efficient.
- Argued that detailed characterizations of running times are not convenient for understanding scalability of algorithms.
Running time in terms of # primitive steps
We need a coarser classification of running times of algorithms; exact characterizations
- are too detailed;
- do not reveal similarities between running times in an immediate way as n grows large;
- are often meaningless: pseudocode steps will expand by a constant factor that depends on the hardware.
Today
1 Asymptotic notation
2 The divide & conquer principle; application: mergesort
3 Solving recurrences and running time of mergesort
Asymptotic analysis
A framework that will allow us to compare the rate of growth of different running times as the input size n grows.
- We will express the running time as a function of the number of primitive steps. The number of primitive steps is itself a function of the size of the input n. Hence the running time is a function of the size of the input n.
- To compare functions expressing running times, we will ignore their low-order terms and focus solely on the highest-order term.
Asymptotic upper bounds: Big-O notation
Definition 1 (O).
We say that T(n) = O(f(n)) if there exist constants c > 0 and n0 ≥ 0 s.t. for all n ≥ n0, we have T(n) ≤ c·f(n).
[Figure: T(n) = O(f(n)); for n ≥ n0, the curve T(n) lies below c·f(n).]
Examples:
- T(n) = an^2 + b, a, b > 0 constants, and f(n) = n^2
- T(n) = an^2 + b and f(n) = n^3
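The first example can be sanity-checked numerically: with witness constants c = a + b and n0 = 1, we have an^2 + b ≤ c·n^2 for all n ≥ 1. A minimal sketch (the values a = 3, b = 5 are arbitrary choices for illustration):

```python
# Witness constants for a*n^2 + b = O(n^2): take c = a + b and n0 = 1.
# Then for n >= 1: a*n^2 + b <= a*n^2 + b*n^2 = (a + b) * n^2.
a, b = 3, 5          # arbitrary positive constants, for illustration only
c, n0 = a + b, 1     # witness constants from the definition of O

for n in range(n0, 10**4):
    T = a * n * n + b
    assert T <= c * n * n, (n, T)

print("a*n^2 + b <= (a+b)*n^2 for all tested n >= 1")
```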
Asymptotic lower bounds: Big-Ω notation
Definition 3 (Ω).
We say that T(n) = Ω(f(n)) if there exist constants c > 0 and n0 ≥ 0 s.t. for all n ≥ n0, we have T(n) ≥ c·f(n).
[Figure: T(n) = Ω(f(n)); for n ≥ n0, the curve T(n) lies above c·f(n).]
Examples:
- T(n) = an^2 + b, a, b > 0 constants, and f(n) = n^2
- T(n) = an^2 + b, a, b > 0 constants, and f(n) = n
Asymptotic tight bounds: Θ notation
Definition 5 (Θ).
We say that T(n) = Θ(f(n)) if there exist constants c1, c2 > 0 and n0 ≥ 0 s.t. for all n ≥ n0, we have
c1·f(n) ≤ T(n) ≤ c2·f(n).
[Figure: T(n) = Θ(f(n)); for n ≥ n0, T(n) lies between c1·f(n) and c2·f(n).]
Equivalent definition: T(n) = Θ(f(n)) if T(n) = O(f(n)) and T(n) = Ω(f(n)).
Examples:
- T(n) = an^2 + b, a, b > 0 constants, and f(n) = n^2
- T(n) = n log n + n, and f(n) = n log n
Asymptotic upper bounds that are not tight: little-o
Definition 7 (o).
We say that T(n) = o(f(n)) if for any constant c > 0 there exists a constant n0 ≥ 0 s.t. for all n ≥ n0, we have T(n) < c·f(n).
- Intuitively, T(n) becomes insignificant relative to f(n) as n → ∞.
- Proof by showing that lim_{n→∞} T(n)/f(n) = 0 (if the limit exists).
Examples:
- T(n) = an^2 + b, a, b > 0 constants, and f(n) = n^3
- T(n) = n log n and f(n) = n^2
Asymptotic lower bounds that are not tight: little-ω
Definition 8 (ω).
We say that T(n) = ω(f(n)) if for any constant c > 0 there exists n0 ≥ 0 s.t. for all n ≥ n0, we have T(n) > c·f(n).
- Intuitively, T(n) becomes arbitrarily large relative to f(n) as n → ∞.
- T(n) = ω(f(n)) implies that lim_{n→∞} T(n)/f(n) = ∞, if the limit exists. Then f(n) = o(T(n)).
Examples:
- T(n) = n^2 and f(n) = n log n
- T(n) = 2^n and f(n) = n^5
Basic rules for omitting low order terms from functions
1. Ignore multiplicative factors: e.g., 10n^3 becomes n^3
2. n^a dominates n^b if a > b: e.g., n^2 dominates n
3. Exponentials dominate polynomials: e.g., 2^n dominates n^4
4. Polynomials dominate logarithms: e.g., n dominates log^3 n
For large enough n,
log n < n < n log n < n^2 < 2^n < 3^n < n^n
Notation: log n stands for log_2 n
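The chain above can be verified at a concrete point (a sketch; n = 32 is an arbitrary choice, and some inequalities in the chain require n to be large enough):

```python
import math

n = 32  # an arbitrary "large enough" n, for illustration
values = [
    ("log n",   math.log2(n)),
    ("n",       n),
    ("n log n", n * math.log2(n)),
    ("n^2",     n ** 2),
    ("2^n",     2 ** n),
    ("3^n",     3 ** n),
    ("n^n",     n ** n),
]

# Verify log n < n < n log n < n^2 < 2^n < 3^n < n^n at n = 32.
for (name1, v1), (name2, v2) in zip(values, values[1:]):
    assert v1 < v2, (name1, name2)
print("ordering holds at n =", n)
```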
Properties of asymptotic growth rates
Transitivity
1. If f = O(g) and g = O(h) then f = O(h).
2. If f = Ω(g) and g = Ω(h) then f = Ω(h).
3. If f = Θ(g) and g = Θ(h) then f = Θ(h).
Sums of (up to a constant number of) functions
1. If f = O(h) and g = O(h) then f + g = O(h).
2. Let k be a fixed constant, and let f1, f2, ..., fk, h be functions s.t. for all i, fi = O(h). Then f1 + f2 + ... + fk = O(h).
Transpose symmetry
- f = O(g) if and only if g = Ω(f).
- f = o(g) if and only if g = ω(f).
Today
1 Asymptotic notation
2 The divide & conquer principle; application: mergesort
3 Solving recurrences and running time of mergesort
The divide & conquer principle
Divide the problem into a number of subproblems that are
smaller instances of the same problem.
Conquer the subproblems by solving them recursively.
Combine the solutions to the subproblems into the
solution for the original problem.
Divide & Conquer applied to sorting
Divide the problem into a number of subproblems that are smaller instances of the same problem.
  Divide the input array into two lists of equal size.
Conquer the subproblems by solving them recursively.
  Sort each list recursively. (Stop when lists have size 1.)
Combine the solutions to the subproblems into the solution for the original problem.
  Merge the two sorted lists and output the sorted array.
Mergesort: pseudocode
Mergesort(A, left, right)
  if right == left then return
  end if
  middle = left + ⌊(right - left)/2⌋
  Mergesort(A, left, middle)
  Mergesort(A, middle + 1, right)
  Merge(A, left, middle, right)

Remarks
- Mergesort is a recursive procedure (why?)
- Initial call: Mergesort(A, 1, n)
- Subroutine Merge merges two sorted lists of sizes ⌊n/2⌋, ⌈n/2⌉ into one sorted list of size n. How can we accomplish this?
Merge: intuition
Intuition: To merge two sorted lists of size n/2, repeatedly
- compare the two items at the front of the two lists;
- extract the smaller item and append it to the output;
- update the front of the list from which the item was extracted.
Example: n = 8, L = {1, 3, 5, 7}, R = {2, 6, 8, 10}
Merge: pseudocode
Merge(A, left, mid, right)
  L = A[left..mid]
  R = A[mid+1..right]
  Maintain two pointers CurrentL, CurrentR initialized to point to the first element of L, R
  while both lists are nonempty do
    Let x, y be the elements pointed to by CurrentL, CurrentR
    Compare x, y and append the smaller to the output
    Advance the pointer in the list with the smaller of x, y
  end while
  Append the remainder of the non-empty list to the output.

Remark: the output is stored directly in A[left..right], thus the subarray A[left..right] is sorted after Merge(A, left, mid, right).
Merge: optional exercises
Exercise 1: write detailed pseudocode (or Python code) for Merge
Exercise 2: write a recursive Merge
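One way Exercise 1 could be solved in Python (a sketch, not the official solution; indices here are 0-based and inclusive, unlike the 1-based pseudocode):

```python
def merge(A, left, mid, right):
    """Merge sorted subarrays A[left..mid] and A[mid+1..right] in place.

    Indices are 0-based and inclusive. Uses extra O(n) space for the
    copies L and R; the merged output is written back into A[left..right].
    """
    L = A[left:mid + 1]        # copy of the left sorted half
    R = A[mid + 1:right + 1]   # copy of the right sorted half
    i = j = 0                  # the CurrentL, CurrentR pointers
    k = left                   # next write position in A
    while i < len(L) and j < len(R):
        if L[i] <= R[j]:       # <= keeps the merge stable
            A[k] = L[i]
            i += 1
        else:
            A[k] = R[j]
            j += 1
        k += 1
    # Append the remainder of the non-empty list to the output.
    A[k:right + 1] = L[i:] + R[j:]

A = [1, 3, 5, 7, 2, 6, 8, 10]  # the two sorted halves from the example
merge(A, 0, 3, 7)
print(A)  # [1, 2, 3, 5, 6, 7, 8, 10]
```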
Analysis of Merge
1. Correctness
2. Running time
3. Space
Analysis of Merge: correctness
1. Correctness: the smaller number in the input is L[1] or
R[1] and it will be the first number in the output. The rest
of the output is just the list obtained by Merge(L, R) after
deleting the smallest element.
2. Running time
3. Space
Merge: pseudocode, revisited
Note that the steps L = A[left..mid] and R = A[mid+1..right] are not primitive computational steps: each copies an entire subarray, one element at a time.
Analysis of Merge: running time
2. Running time:
   - Suppose L, R have n/2 elements each.
   - How many iterations before all elements from both lists have been appended to the output?
   - How much work within each iteration?
Analysis of Merge: space
2. Running time:
   - L, R have n/2 elements each.
   - At most n - 1 iterations before all elements from both lists have been appended to the output.
   - Constant work within each iteration.
   Hence Merge takes O(n) time to merge L, R (why?).
3. Space: extra Θ(n) space to store L, R (the sorted output is stored directly in A).
Example of Mergesort
Input: 1, 7, 4, 3, 5, 8, 6, 2
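The recursion on this input can be sketched in Python (0-based inclusive indices; the compact merge here buffers both halves and writes the result back into A):

```python
def mergesort(A, left, right):
    """Sort A[left..right] in place (0-based, inclusive indices)."""
    if right <= left:
        return
    middle = left + (right - left) // 2
    mergesort(A, left, middle)       # sort the left half
    mergesort(A, middle + 1, right)  # sort the right half
    merge(A, left, middle, right)    # combine the two sorted halves

def merge(A, left, mid, right):
    """Merge the sorted halves A[left..mid] and A[mid+1..right]."""
    L, R = A[left:mid + 1], A[mid + 1:right + 1]
    i = j = 0
    for k in range(left, right + 1):
        # Take from L while its front element exists and is no larger.
        if j >= len(R) or (i < len(L) and L[i] <= R[j]):
            A[k] = L[i]; i += 1
        else:
            A[k] = R[j]; j += 1

A = [1, 7, 4, 3, 5, 8, 6, 2]  # the example input above
mergesort(A, 0, len(A) - 1)
print(A)  # [1, 2, 3, 4, 5, 6, 7, 8]
```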
Analysis of Mergesort
1. Correctness
2. Running time
3. Space
Mergesort: correctness
For simplicity, assume n = 2^k for integer k ≥ 0. We will use induction on k.
- Base case: For k = 0, the input consists of n = 1 item; Mergesort returns the item.
- Induction Hypothesis: For k > 0, assume that Mergesort correctly sorts any list of size 2^k.
- Induction Step: We will show that Mergesort correctly sorts any list of size 2^(k+1).
  - The input list is split into two lists, each of size 2^k.
  - Mergesort recursively calls itself on each list. By the hypothesis, when the subroutines return, each list is sorted.
  - Since Merge is correct, it will merge these two sorted lists into one sorted output list of size 2·2^k.
- Thus Mergesort correctly sorts any input of size 2^(k+1).
Running time of Mergesort
The running time of Mergesort satisfies:
T(n) = 2T(n/2) + cn, for n ≥ 2, constant c > 0
T(1) = c
This structure is typical of recurrence relations:
- an inequality or equation bounds T(n) in terms of an expression involving T(m) for m < n
- a base case generally says that T(n) is constant for small constant n
Remarks
- We ignore floor and ceiling notations.
- A recurrence does not provide an asymptotic bound for T(n): to this end, we must solve the recurrence.
Today
1 Asymptotic notation
2 The divide & conquer principle; application: mergesort
3 Solving recurrences and running time of mergesort
Solving recurrences, method 1: recursion trees
The technique consists of three steps
1. Analyze the first few levels of the tree of recursive calls
2. Identify a pattern
3. Sum over all levels of recursion
Example: analysis of running time of Mergesort
T(n) = 2T(n/2) + cn, n ≥ 2
T(1) = c
A general recurrence and its solution
The running times of many recursive algorithms can be expressed by the following recurrence:
T(n) = aT(n/b) + cn^k, for a, c > 0, b > 1, k ≥ 0
What is the recursion tree for this recurrence?
- a is the branching factor
- b is the factor by which the size of each subproblem shrinks
- at level i, there are a^i subproblems, each of size n/b^i
- each subproblem at level i requires c(n/b^i)^k work
- the height of the tree is log_b n levels
Total work:
∑_{i=0}^{log_b n} a^i · c(n/b^i)^k = cn^k · ∑_{i=0}^{log_b n} (a/b^k)^i
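As a sanity check, the level-by-level total can be compared against the recurrence itself for the Mergesort parameters a = b = 2, k = 1 (a sketch with c = 1, an arbitrary choice):

```python
import math

def T(n, a=2, b=2, c=1, k=1):
    """Unroll T(n) = a*T(n/b) + c*n^k with T(1) = c, for n a power of b."""
    if n == 1:
        return c
    return a * T(n // b) + c * n ** k

def level_sum(n, a=2, b=2, c=1, k=1):
    """Sum the work over the recursion-tree levels: sum_i a^i * c * (n/b^i)^k."""
    levels = round(math.log(n, b))  # exact when n is a power of b
    return sum(a ** i * c * (n // b ** i) ** k for i in range(levels + 1))

# The recurrence and the recursion-tree sum agree for powers of 2.
for m in range(1, 12):
    n = 2 ** m
    assert T(n) == level_sum(n), n
print("recurrence total matches the recursion-tree sum up to n = 2^11")
```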
Solving recurrences, method 2: Master theorem
Theorem 9 (Master theorem).
If T(n) = aT(⌈n/b⌉) + O(n^k) for some constants a > 0, b > 1, k ≥ 0, then
         O(n^(log_b a)),  if a > b^k
T(n) =   O(n^k log n),    if a = b^k
         O(n^k),          if a < b^k
Example: running time of Mergesort
- T(n) = 2T(n/2) + cn: a = 2, b = 2, k = 1, b^k = 2 = a, hence T(n) = O(n log n)
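The three cases can be packaged into a small helper and checked against the Mergesort parameters (a sketch; the function name and string output are illustrative choices, not course notation):

```python
import math

def master_bound(a, b, k):
    """Return the Master-theorem bound for T(n) = a*T(n/b) + O(n^k) as a string."""
    if a > b ** k:
        return f"O(n^{math.log(a, b):.3g})"  # case a > b^k: O(n^(log_b a))
    if a == b ** k:
        return f"O(n^{k} log n)"             # case a = b^k
    return f"O(n^{k})"                       # case a < b^k

print(master_bound(2, 2, 1))  # Mergesort: a = b^k, so O(n^1 log n)
print(master_bound(4, 2, 1))  # a > b^k: O(n^2), i.e., n^(log_2 4)
print(master_bound(3, 2, 2))  # a < b^k: O(n^2)
```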
Solving recurrences, method 3: the substitution method
The technique consists of two steps
1. Guess a bound
2. Use (strong) induction to prove that the guess is correct
Remark 1 (simple vs strong induction).
1. Simple induction: the induction step at n requires that the inductive hypothesis holds at step n - 1.
2. Strong induction: the induction step at n requires that the inductive hypothesis holds at all steps 1, 2, ..., n - 1.
Exercise: show inductively that Mergesort runs in time
O(n log n).
What about...
1. T(n) = 2T(n - 1) + 1, T(1) = 2
2. T(n) = 2T^2(n - 1), T(1) = 4
3. T(n) = T(2n/3) + T(n/3) + cn