The Newton Raphson Algorithm for Function Optimization
1 Introduction
Throughout this course we will be interested in calculating maximum likelihood estimates
(MLEs). Such estimates are often extremely complicated nonlinear functions of the observed
data. As a result, closed form expressions for the MLEs will generally not exist for the models
we are working with.
The Newton Raphson algorithm is an iterative procedure that can be used to calculate
MLEs. The basic idea behind the algorithm is the following. First, construct a quadratic
approximation to the function of interest around some initial parameter value (hopefully
close to the MLE). Next, adjust the parameter value to that which maximizes the quadratic
approximation. This procedure is iterated until the parameter values stabilize.
These notes begin with the easy-to-visualize case of maximizing a function of one variable.
After this case is developed, we turn to the more general case of maximizing a function of k
variables.
2 The Newton Raphson Algorithm for Finding the Maximum of a Function of 1 Variable
2.1 Taylor Series Approximations
The first part of developing the Newton Raphson algorithm is to devise a way to approximate
the likelihood function with a function that can be easily maximized analytically. To do this
we need to make use of Taylor’s Theorem.
Theorem 1 (Taylor’s Theorem (1 Dimension)). Suppose the function f is k + 1 times
differentiable on an open interval I. For any points x and x + h in I there exists a point w
between x and x + h such that
$$f(x+h) = f(x) + f'(x)h + \frac{1}{2}f''(x)h^2 + \cdots + \frac{1}{k!}f^{[k]}(x)h^k + \frac{1}{(k+1)!}f^{[k+1]}(w)h^{k+1} \tag{1}$$
It can be shown that, as $h$ goes to 0, the higher order terms in equation (1) go to 0 much faster than $h$ itself. This means that (for small values of $h$)
$$f(x+h) \approx f(x) + f'(x)h.$$
This is referred to as a first order Taylor approximation of $f$ at $x$. A more accurate approximation to $f(x+h)$ can be constructed for small values of $h$ as:
$$f(x+h) \approx f(x) + f'(x)h + \frac{1}{2}f''(x)h^2.$$
This is known as a second order Taylor approximation of $f$ at $x$.
Note that the first order Taylor approximation can be rewritten as
$$f(x+h) \approx a + bh$$
where $a = f(x)$ and $b = f'(x)$. This highlights the fact that the first order Taylor approximation is a linear function in $h$.
Similarly, the second order Taylor approximation can be rewritten as
$$f(x+h) \approx a + bh + \frac{1}{2}ch^2$$
where $a = f(x)$, $b = f'(x)$, and $c = f''(x)$. This highlights the fact that the second order Taylor approximation is a second order polynomial in $h$.
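As a quick numerical illustration, the Python sketch below (using $f(x) = \ln(x)$ around $x = 1$ as an arbitrary test function, not one from these notes) shows the second order approximation shrinking its error much faster than the first order one as $h$ gets small.

```python
import math

# Compare the first and second order Taylor approximations of
# f(x) = ln(x) around x = 1, where f'(x) = 1/x and f''(x) = -1/x^2.
x = 1.0
f, b, c = math.log(x), 1.0 / x, -1.0 / x**2

for h in [0.5, 0.1, 0.01]:
    exact = math.log(x + h)
    first = f + b * h                    # a + b h
    second = f + b * h + 0.5 * c * h**2  # a + b h + (1/2) c h^2
    print(f"h = {h:4}: first order error = {abs(exact - first):.2e}, "
          f"second order error = {abs(exact - second):.2e}")
```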
2.2 Finding the Maximum of a Second Order Polynomial
Suppose we want to find the value of x that maximizes
$$f(x) = a + bx + cx^2.$$
First, we calculate the first derivative of $f$:
$$f'(x) = b + 2cx.$$
We know that $f'(\hat{x}) = 0$, where $\hat{x}$ is the value of $x$ at which $f$ attains its maximum. In other words, we know that
$$0 = b + 2c\hat{x}.$$
Solving for $\hat{x}$ we find that $\hat{x} = -\frac{b}{2c}$. The second order condition is $f''(x) = 2c < 0$. This implies that $f\left(-\frac{b}{2c}\right)$ will be a maximum whenever $c < 0$.
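A quick numerical sanity check of this formula, with arbitrarily chosen coefficients satisfying $c < 0$:

```python
# Arbitrarily chosen coefficients with c < 0, so the quadratic is concave.
a, b, c = 1.0, 4.0, -2.0
x_hat = -b / (2 * c)  # the claimed maximizer, here 1.0

f = lambda x: a + b * x + c * x**2
# f at x_hat should be at least as large as f at any nearby point.
assert all(f(x_hat) >= f(x_hat + d) for d in (-0.1, -0.01, 0.01, 0.1))
print(x_hat, f(x_hat))
```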
2.3 The Newton Raphson Algorithm
Suppose we want to find the value of x that maximizes some twice continuously differentiable
function f (x).
Recall
$$f(x+h) \approx a + bh + \frac{1}{2}ch^2$$
where $a = f(x)$, $b = f'(x)$, and $c = f''(x)$. This implies
$$f'(x+h) \approx b + ch.$$
The first order condition for the value of $h$ (denoted $\hat{h}$) that maximizes $f(x+h)$ is
$$0 = b + c\hat{h}$$
which implies $\hat{h} = -\frac{b}{c}$. In other words, the value that maximizes the second order Taylor approximation to $f$ at $x$ is
$$x + \hat{h} = x - \frac{b}{c} = x - \frac{f'(x)}{f''(x)}.$$
With this in mind we can specify the Newton Raphson algorithm for 1 dimensional
function optimization.
Algorithm 2.1: NewtonRaphson1D(f, x_0, tolerance)

    comment: Find the value x̂ of x that maximizes f(x)

    i ← 0
    while |f'(x_i)| > tolerance
        do i ← i + 1
           x_i ← x_{i-1} − f'(x_{i-1}) / f''(x_{i-1})
    x̂ ← x_i
    return (x̂)
Caution: Note that the Newton Raphson algorithm doesn't check the second order conditions necessary for $\hat{x}$ to be a maximizer. This means that if you give the algorithm a bad starting value for $x_0$ you may end up with a min rather than a max.
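For concreteness, here is a minimal Python sketch of Algorithm 2.1. The argument names fprime and fprime2 (for $f'$ and $f''$) and the max_iter safeguard are our own choices; the pseudocode above has no iteration cap.

```python
# A minimal sketch of Algorithm 2.1. The caller supplies the first and
# second derivatives of f as Python functions.
def newton_raphson_1d(fprime, fprime2, x0, tolerance=1e-5, max_iter=100):
    x = x0
    for _ in range(max_iter):
        if abs(fprime(x)) <= tolerance:  # |f'(x_i)| within tolerance: stop
            break
        x = x - fprime(x) / fprime2(x)   # Newton step: x - f'(x)/f''(x)
    # Per the caution above, this only finds a stationary point; check
    # fprime2(x) < 0 yourself to confirm it is a maximum.
    return x
```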
2.4 Example: Calculating the MLE of a Binomial Sampling Model
To see how the Newton Raphson algorithm works in practice, let's look at a simple example with an analytical solution: a simple model of binomial sampling.
Our log-likelihood function is:
$$\ell(\pi \mid y) = y \ln(\pi) + (n - y)\ln(1 - \pi)$$
where n is the sample size, y is the number of successes, and π is the probability of a success.
The first derivative of the log-likelihood function is
$$\ell'(\pi \mid y) = \frac{y}{\pi} - \frac{n-y}{1-\pi}$$
and the second derivative of the log-likelihood function is
$$\ell''(\pi \mid y) = -\frac{y}{\pi^2} - \frac{n-y}{(1-\pi)^2}.$$
Analytically, we know that the MLE is $\hat{\pi} = \frac{y}{n}$.
For the sake of example, suppose $n = 5$ and $y = 2$. Analytically, we know that the MLE is $\hat{\pi} = \frac{y}{n} = 0.4$. Let's see how the Newton Raphson algorithm works in this situation.
We begin by setting a tolerance level. In this case, let's set it to 0.01 (in practice you probably want something closer to 0.00001). Next we make an initial guess (denoted $\pi_0$) as to the MLE. Suppose $\pi_0 = 0.55$. Then $\ell'(\pi_0 \mid y) \approx -3.03$, which is larger in absolute value than our tolerance of 0.01. Thus we set
$$\pi_1 \leftarrow \pi_0 - \frac{\ell'(\pi_0 \mid y)}{\ell''(\pi_0 \mid y)} \approx 0.40857.$$
Now we calculate $\ell'(\pi_1 \mid y) \approx -0.1774$, which is still larger in absolute value than our tolerance of 0.01. Thus we set
$$\pi_2 \leftarrow \pi_1 - \frac{\ell'(\pi_1 \mid y)}{\ell''(\pi_1 \mid y)} \approx 0.39994.$$
Finally, $\ell'(\pi_2 \mid y)$ is approximately equal to 0.0012, which is smaller in absolute value than our tolerance of 0.01, so we can stop. The Newton Raphson algorithm here returns a value of $\hat{\pi}$ equal to 0.39994, which is reasonably close to the analytical value of 0.40. Note that we can make the Newton Raphson procedure more accurate (within machine precision) by setting the tolerance level closer to 0.
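The short Python script below is a direct transcription of this worked example using the score and Hessian formulas above; it reproduces the iterates 0.40857 and 0.39994 computed by hand (the variable names are our own).

```python
# Reproducing the worked example: n = 5, y = 2, pi_0 = 0.55, tolerance 0.01.
n, y = 5, 2

def score(p):    # first derivative of the log-likelihood at pi = p
    return y / p - (n - y) / (1 - p)

def hessian(p):  # second derivative of the log-likelihood at pi = p
    return -y / p**2 - (n - y) / (1 - p)**2

p, tolerance = 0.55, 0.01
while abs(score(p)) > tolerance:
    p = p - score(p) / hessian(p)    # Newton step
    print(f"pi = {p:.5f}, score = {score(p):+.4f}")

print(f"Newton Raphson MLE: {p:.5f} (analytical MLE: {y / n})")
```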
3 The Newton Raphson Algorithm for Finding the Maximum of a Function of k Variables
3.1 Taylor Series Approximations in k Dimensions
Consider a function $f : \mathbb{R}^k \to \mathbb{R}$ that is at least twice continuously differentiable. Suppose $x \in \mathbb{R}^k$ and $h \in \mathbb{R}^k$. Then the first order Taylor approximation to $f$ at $x$ is given by
$$f(x+h) \approx f(x) + \nabla f(x)'h$$
and the second order Taylor approximation to $f$ at $x$ is given by
$$f(x+h) \approx f(x) + \nabla f(x)'h + \frac{1}{2}h'D^2 f(x)h$$
where $\nabla f(x)$ is the gradient (vector of first derivatives) of $f$ at $x$, and $D^2 f(x)$ is the Hessian (matrix of second derivatives) of $f$ at $x$.
3.2 Finding the Maximum of a Second Order Polynomial in k Variables
Consider
$$f(x) = a + b'x + x'Cx$$
where $a$ is a scalar, $b$ and $x$ are $k$-vectors, and $C$ is a $k \times k$ symmetric, negative definite matrix. The gradient of $f$ at $x$ is
$$\nabla f(x) = b + 2Cx.$$
Since the gradient at the value that maximizes $f$ will be a vector of zeros, we know that the maximizer $\hat{x}$ satisfies
$$0 = b + 2C\hat{x}.$$
Solving for $\hat{x}$ we find that
$$\hat{x} = -\frac{1}{2}C^{-1}b.$$
Since $C$ is assumed to be negative definite, we know that this is a maximum.
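A small numpy check of this formula, with an arbitrarily chosen symmetric, negative definite $C$:

```python
import numpy as np

# Arbitrarily chosen b and symmetric, negative definite C (k = 2).
b = np.array([1.0, -2.0])
C = np.array([[-2.0, 0.5],
              [0.5, -1.0]])

x_hat = -0.5 * np.linalg.solve(C, b)  # x_hat = -(1/2) C^{-1} b

# The gradient b + 2 C x should be (numerically) zero at x_hat.
print(x_hat, b + 2 * C @ x_hat)
```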
3.3 The Newton Raphson Algorithm in k Dimensions
Suppose we want to find the $\hat{x} \in \mathbb{R}^k$ that maximizes the twice continuously differentiable function $f : \mathbb{R}^k \to \mathbb{R}$.
Recall
$$f(x+h) \approx a + b'h + \frac{1}{2}h'Ch$$
where $a = f(x)$, $b = \nabla f(x)$, and $C = D^2 f(x)$. Note that $C$ will be symmetric. This implies
$$\nabla f(x+h) \approx b + Ch.$$
Once again, the first order condition for a maximum is
$$0 = b + C\hat{h}$$
which implies that
$$\hat{h} = -C^{-1}b.$$
In other words, the vector that maximizes the second order Taylor approximation to $f$ at $x$ is
$$x + \hat{h} = x - C^{-1}b = x - \left(D^2 f(x)\right)^{-1}\nabla f(x).$$
With this in mind we can specify the Newton Raphson algorithm for k-dimensional
function optimization.
Algorithm 3.1: NewtonRaphsonKD(f, x_0, tolerance)

    comment: Find the value x̂ of x that maximizes f(x)

    i ← 0
    while ‖∇f(x_i)‖ > tolerance
        do i ← i + 1
           x_i ← x_{i-1} − (D^2 f(x_{i-1}))^{-1} ∇f(x_{i-1})
    x̂ ← x_i
    return (x̂)
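Finally, a minimal numpy sketch of Algorithm 3.1, exercised on a concave quadratic (a test function of our own choosing). As in the 1-D case, the max_iter cap is our own addition, and we solve a linear system rather than forming the Hessian inverse explicitly.

```python
import numpy as np

# A minimal sketch of Algorithm 3.1. The caller supplies the gradient and
# Hessian as functions; the stopping rule uses the Euclidean norm of the
# gradient.
def newton_raphson_kd(grad, hess, x0, tolerance=1e-5, max_iter=100):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tolerance:       # gradient near zero: stop
            break
        x = x - np.linalg.solve(hess(x), g)      # solve D^2 f(x) h = grad f(x)
    return x

# Example: maximize f(x) = -(x1 - 1)^2 - 2 (x2 + 3)^2, maximized at (1, -3).
grad = lambda x: np.array([-2.0 * (x[0] - 1.0), -4.0 * (x[1] + 3.0)])
hess = lambda x: np.array([[-2.0, 0.0], [0.0, -4.0]])
print(newton_raphson_kd(grad, hess, [0.0, 0.0]))  # -> [ 1. -3.]
```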