4.2 Prediction rule
This two-step process is more typical:
1. “Fit” a model to the training data
2. Use the model directly to make predictions
In the prediction rule setting of regression or classification, the model will be some
hypothesis or prediction rule y = h(x; θ) for some functional form h. The idea is that θ is
a vector of one or more parameter values that will be determined by fitting the model to
the training data and then be held fixed. Given a new $x^{(n+1)}$, we would then make the
prediction $h(x^{(n+1)}; \theta)$.
We write f(a; b) to describe a function that is usually applied to a single argument a,
but is a member of a parametric family of functions, with the particular function
determined by parameter value b. So, for example, we might write $h(x; p) = x^p$ to
describe a function of a single argument that is parameterized by p.
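As a minimal sketch of this notation, the parametric family h(x; p) = x^p can be written as a function of both x and p, with a helper that fixes p and returns a function of the single argument x. The helper name `make_h` is a hypothetical choice for illustration only.

```python
def h(x, p):
    """Power family h(x; p) = x ** p: x is the argument, p the parameter."""
    return x ** p


def make_h(p):
    """Fix the parameter p and return a function of the single argument x."""
    return lambda x: h(x, p)


square = make_h(2)  # the member of the family with p = 2
cube = make_h(3)    # the member of the family with p = 3

print(square(3))  # 9
print(cube(2))    # 8
```

Once p is fixed, `square` and `cube` behave like ordinary one-argument functions, which mirrors the way θ is held fixed after fitting.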
The fitting process is often articulated as an optimization problem: find a value of θ
that minimizes some criterion involving θ and the data. If we knew the actual underlying
distribution on our data, Pr(X, Y), an optimal strategy would be to predict the value of y
that minimizes the expected loss, which is also known as the test error. If we don’t have
that actual underlying distribution, or even an estimate of it, we can take the approach
of minimizing the training error: that is, finding the prediction rule h that minimizes the
average loss on our training data set. So, we would seek θ that minimizes
$$E_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} L\left(h(x^{(i)}; \theta),\, y^{(i)}\right),$$
where the loss function L(g, a) measures how bad it would be to make a guess of g when
the actual value is a.
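The training error defined above can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the linear form of h, the choice of squared loss for L, the made-up data, and the brute-force search over a small set of candidate θ values, which stands in for a real optimization method.

```python
def h(x, theta):
    """Hypothetical linear hypothesis h(x; theta) = theta[0] + theta[1] * x."""
    return theta[0] + theta[1] * x


def squared_loss(g, a):
    """L(g, a): penalty for guessing g when the actual value is a."""
    return (g - a) ** 2


def training_error(theta, xs, ys, loss=squared_loss):
    """E_n(theta): average loss of h(.; theta) over the n training pairs."""
    n = len(xs)
    return sum(loss(h(x, theta), y) for x, y in zip(xs, ys)) / n


# Made-up training data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

print(training_error((1.0, 2.0), xs, ys))  # 0.0: this theta fits the data exactly
print(training_error((0.0, 2.0), xs, ys))  # 1.0: every guess is off by 1

# "Fitting" as optimization: pick the candidate theta with the lowest E_n.
candidates = [(a, b) for a in (0.0, 1.0, 2.0) for b in (1.0, 2.0, 3.0)]
best_theta = min(candidates, key=lambda t: training_error(t, xs, ys))
print(best_theta)  # (1.0, 2.0)
```

Once `best_theta` is chosen it is held fixed, and a prediction for a new input is simply `h(x_new, best_theta)`.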
We will find that minimizing training error alone is often not a good choice: it is possible
to emphasize fitting the current data too strongly and end up with a hypothesis that does
not generalize well when presented with new x values.
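A deliberately extreme sketch of this failure mode compares two hypotheses on made-up data: one that memorizes the training pairs, and one simple rule. The data, the fallback guess of 0.0 for unseen inputs, and the names `memorizer` and `doubler` are all assumptions chosen for illustration.

```python
def squared_loss(g, a):
    """L(g, a): penalty for guessing g when the actual value is a."""
    return (g - a) ** 2


def average_loss(h, xs, ys):
    """Average squared loss of hypothesis h over the given (x, y) pairs."""
    return sum(squared_loss(h(x), y) for x, y in zip(xs, ys)) / len(xs)


# Made-up data: y is roughly 2x, with small perturbations.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [2.1, 3.9, 6.2, 7.8]
new_x = [1.5, 2.5, 3.5]
new_y = [3.0, 5.1, 6.9]

# Hypothesis A memorizes the training pairs and guesses 0.0 for any unseen x.
table = dict(zip(train_x, train_y))


def memorizer(x):
    return table.get(x, 0.0)


# Hypothesis B is the simple rule h(x) = 2x.
def doubler(x):
    return 2.0 * x


for name, hyp in (("memorizer", memorizer), ("doubler", doubler)):
    print(name,
          "training error:", average_loss(hyp, train_x, train_y),
          "error on new data:", average_loss(hyp, new_x, new_y))
```

The memorizer achieves a training error of exactly zero, yet its error on the new x values is far larger than that of the simple rule, whose training error is small but nonzero.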