
Reinforcement learning (RL)

Definition

Reinforcement learning trains an agent to maximize cumulative reward in an environment. The agent takes actions, receives observations and rewards, and improves its policy (e.g. value-based, policy gradient, actor-critic).
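The action-observation-reward loop can be sketched with a toy environment and agent; the names (`CoinFlipEnv`, `RandomAgent`) are hypothetical stand-ins for illustration, not a real library API.

```python
import random

class CoinFlipEnv:
    """Toy environment: guess a coin flip; reward 1.0 for a correct guess."""
    def step(self, action):
        outcome = random.randint(0, 1)
        reward = 1.0 if action == outcome else 0.0
        observation = outcome            # the agent observes the last outcome
        return observation, reward

class RandomAgent:
    """Baseline agent with a fixed uniform-random policy (no learning yet)."""
    def act(self, observation):
        return random.randint(0, 1)

env, agent = CoinFlipEnv(), RandomAgent()
obs, total_reward = 0, 0.0
for _ in range(1000):                    # interaction loop: act, observe, collect reward
    action = agent.act(obs)
    obs, reward = env.step(action)
    total_reward += reward
print(total_reward)                      # around 500 on average for a random policy
```

A learning agent would replace `RandomAgent` with one that updates its policy from the observed rewards, which is what the algorithms below do.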

It differs from supervised and unsupervised learning in that the feedback (rewards) is sparse and delayed, and the agent must explore. It is used in games, robotics, and LLM alignment (RLHF). For high-dimensional states and actions, see deep RL.

How it works

The setting is usually a Markov decision process (MDP): the agent observes a state, chooses an action, and the environment returns a reward and the next state. The agent improves its policy (a mapping from states to actions) to maximize cumulative reward. Value-based methods (e.g. Q-learning, DQN) learn a value function and derive the policy from it; policy gradient and actor-critic methods (e.g. PPO, SAC) optimize the policy directly. Exploration (e.g. epsilon-greedy, entropy bonus) is needed because rewards are observed only for the actions actually taken. Algorithms differ in how they handle off-policy data, continuous actions, and scaling to large state spaces.

Use cases

Reinforcement learning applies wherever an agent learns from rewards through sequential decisions (games, control, alignment).

  • Game playing (e.g. Atari, Go, poker) and simulation
  • Robotics control and continuous control (e.g. manipulation)
  • LLM alignment (e.g. RLHF) and sequential decision systems

External documentation

See also