Reinforcement learning (RL)
Definition
Reinforcement learning trains agents to maximize cumulative reward in an environment. The agent takes actions, receives observations and rewards, and improves its policy (e.g. via value-based, policy-gradient, or actor-critic methods).
It differs from supervised and unsupervised learning because the feedback (rewards) is sparse and delayed, and the agent must explore. It is used in games, robotics, and LLM alignment (RLHF). For high-dimensional states and actions, see deep RL.
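The action-observation-reward loop described above can be sketched in a few lines. This is a minimal illustration with a hypothetical toy environment (`CoinFlipEnv` is invented for this example) and a random policy, before any learning is involved:

```python
import random

class CoinFlipEnv:
    """Hypothetical toy environment: guess a coin flip; reward 1 if correct."""
    def reset(self):
        self.flip = random.randint(0, 1)
        return 0  # a single dummy state

    def step(self, action):
        reward = 1.0 if action == self.flip else 0.0
        self.flip = random.randint(0, 1)  # draw the next flip
        return 0, reward, False  # next_state, reward, done

random.seed(0)
env = CoinFlipEnv()
state = env.reset()
total_reward = 0.0
for _ in range(1000):
    action = random.randint(0, 1)            # random policy, no learning
    state, reward, done = env.step(action)   # the agent-environment loop
    total_reward += reward

print(f"average reward: {total_reward / 1000:.2f}")  # ~0.5 for a random policy
```

A learning agent would replace the random action with one chosen from its policy and use the reward signal to update that policy.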
How it works
The setting is usually a Markov decision process (MDP): the agent observes a state, chooses an action, and the environment returns a reward and the next state. The agent improves its policy (a mapping from states to actions) to maximize cumulative reward. Value-based methods (e.g. Q-learning, DQN) learn a value function and derive the policy from it; policy-gradient and actor-critic methods (e.g. PPO, SAC) optimize the policy directly. Exploration (e.g. epsilon-greedy action selection, entropy bonuses) is needed because rewards are only observed for the actions actually taken. Algorithms differ in how they handle off-policy data, continuous action spaces, and scaling to large state spaces.
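To make the value-based approach concrete, here is a small sketch of tabular Q-learning with epsilon-greedy exploration. The environment is a hypothetical 5-state chain (invented for this example) where moving right from the second-to-last state yields reward 1; the learned greedy policy should move right everywhere:

```python
import random

# Hypothetical 5-state chain MDP: actions are 0 = left, 1 = right;
# reaching the rightmost state ends the episode with reward 1.
N_STATES, ACTIONS = 5, [0, 1]
alpha, gamma, epsilon = 0.1, 0.9, 0.1  # learning rate, discount, exploration rate
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def greedy(s):
    """Pick the highest-valued action, breaking ties at random."""
    best = max(Q[(s, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(s, a)] == best])

random.seed(0)
for episode in range(500):
    s = 0
    while s != N_STATES - 1:
        # epsilon-greedy: mostly exploit, occasionally explore
        a = random.choice(ACTIONS) if random.random() < epsilon else greedy(s)
        s_next = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s_next == N_STATES - 1 else 0.0
        # Q-learning update: bootstrap from the best action in the next state
        best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next

policy = [greedy(s) for s in range(N_STATES - 1)]
print(policy)  # expected: [1, 1, 1, 1] — always move right
```

The same update rule underlies DQN, which replaces the table `Q` with a neural network so it can scale to large state spaces.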
Use cases
Reinforcement learning applies wherever an agent must learn from rewards while making sequential decisions (games, control, alignment).
- Game playing (e.g. Atari, Go, poker) and simulation
- Robotics and continuous control (e.g. manipulation)
- LLM alignment (e.g. RLHF) and other sequential decision systems
External documentation
- Reinforcement Learning (Sutton & Barto) — Free online book
- Spinning Up in Deep RL (OpenAI)