In reinforcement learning, the goal is to learn a mapping from input values x to output
values y, but without a direct supervision signal to specify which output values y are
best for a particular input. There is no training set specified a priori. Instead, the learning
problem is framed as an agent interacting with an environment, in the following setting:
.
MIT 6.036
Fall 2021
7
• The environment transitions probabilistically to a new state, x
(
1)
, with a distribution
that depends only on x
(
0)
and y
(
0)
.
• The agent observes the current state, x
(
1)
.
• . . .
The goal is to find a policy π, mapping x to y, (that is, states to actions) such that some
long-term sum or average of rewards r is maximized.
This setting is very different from either supervised learning or unsupervised learning,
because the agent’s action choices affect both its reward and its ability to observe the envi-
ronment. It requires careful consideration of the long-term effects of actions, as well as all
of the other issues that pertain to supervised learning.
Dostları ilə paylaş: