Deep reinforcement learning (DRL)
Definition
Deep RL combines reinforcement learning with deep neural networks to handle high-dimensional state and action spaces. Examples: DQN, A3C, PPO, SAC.
Neural networks approximate the value function and/or policy so RL can scale to raw pixels, high-dimensional continuous controls, and large discrete action spaces. Training is unstable without stabilization tricks (experience replay, target networks, advantage estimation); modern algorithms such as PPO and SAC are widely used in robotics and in LLM alignment (RLHF; DPO is a closely related RL-free alternative).
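Two of those stabilization tricks, experience replay and a target network, can be sketched in a few lines. This is a minimal illustration, not a full DQN: to stay self-contained it uses a linear Q-function (one weight vector per action) as a stand-in for the deep network, a toy random "environment", and hypothetical helper names (`store`, `train_step`); the shape of the update is the same as in the deep case.

```python
import random
from collections import deque

import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, N_ACTIONS = 4, 2
GAMMA, LR = 0.99, 0.05

# Q(s, a) approximated linearly: one weight row per action.
q_weights = np.zeros((N_ACTIONS, STATE_DIM))
target_weights = q_weights.copy()          # slow-moving target network

replay = deque(maxlen=10_000)              # experience replay buffer

def q_values(weights, state):
    return weights @ state                 # shape: (N_ACTIONS,)

def store(state, action, reward, next_state, done):
    replay.append((state, action, reward, next_state, done))

def train_step(batch_size=32):
    batch = random.sample(list(replay), batch_size)
    for s, a, r, s2, done in batch:
        # Bootstrap from the *target* network, not the online one.
        target = r if done else r + GAMMA * q_values(target_weights, s2).max()
        td_error = target - q_values(q_weights, s)[a]
        q_weights[a] += LR * td_error * s   # semi-gradient TD update

# Fill the buffer with transitions from a toy random environment.
for _ in range(200):
    s = rng.standard_normal(STATE_DIM)
    a = int(rng.integers(N_ACTIONS))
    store(s, a, float(rng.random()), rng.standard_normal(STATE_DIM), False)

for step in range(100):
    train_step()
    if step % 20 == 0:                     # periodically sync target network
        target_weights = q_weights.copy()
```

Sampling random batches from the buffer breaks the correlation between consecutive transitions, and bootstrapping from the frozen target copy keeps the regression target from chasing the network's own updates.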
How it works
The state (e.g. an image or feature vector) is fed into a neural-network policy (or value network) that outputs an action. The environment returns a reward and the next state, and the agent uses this experience to update the network (e.g. via policy gradients or Q-learning with function approximation). Experience replay (store transitions, sample random batches) and target networks (a slow-moving copy of the network) stabilize training. Advantage estimation (e.g. GAE) reduces the variance of policy gradients. PPO and SAC are common for continuous control; DQN and its variants for discrete actions.
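The GAE step above is a short backward recursion over one trajectory: each one-step TD error delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) is accumulated with decay gamma * lambda. A minimal sketch (the function name `gae` and its signature are illustrative, not from any particular library):

```python
import numpy as np

def gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    rewards, dones: arrays of length T; values: V(s_0)..V(s_{T-1});
    last_value: V(s_T), used to bootstrap the final step.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    next_value, running = last_value, 0.0
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        # One-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of future TD errors (decay gamma * lam).
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
        next_value = values[t]
    return advantages

# With gamma = lam = 1 and a zero value function, the advantage reduces to
# the undiscounted return-to-go: three unit rewards give [3, 2, 1].
adv = gae(np.ones(3), np.zeros(3), 0.0, np.array([0.0, 0.0, 1.0]),
          gamma=1.0, lam=1.0)  # -> [3., 2., 1.]
```

Setting lam=0 recovers the low-variance, high-bias one-step TD error; lam=1 recovers the high-variance Monte Carlo return minus the baseline.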
Use cases
Deep RL is used when the decision problem is complex and experience can be gathered by trial and error (in simulation or a real environment).
- High-dimensional control (e.g. robotics, autonomous driving)
- Game AI and simulation (e.g. DQN, PPO in complex environments)
- LLM alignment via policy optimization (e.g. RLHF; DPO is a related RL-free method)