Chapter 02  ·  Reinforcement Learning · Memory · POMDP

Remember To Play

When memory enters the Q-network, the agent stops guessing and starts planning.

Concept Note

Treat single-frame Atari as partially observable instead of fully observable — a more realistic framing.

Compare memoryless Q-learning versus recurrent Q-learning under controlled partial observability.

Evaluate exactly where temporal memory changes policy quality across different game dynamics.
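Both the memoryless and recurrent agents in this comparison learn from the same one-step Q-learning target; a minimal sketch of that target (the function name is illustrative, not from the repository):

```python
import numpy as np

def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step Q-learning target: y = r + gamma * max_a Q(s', a).

    For DQN the next-state Q-values come from a frame-based CNN; for
    DRQN they come from an LSTM conditioned on the episode so far.
    The target itself is identical in both cases; only how Q is
    computed differs.
    """
    return reward + gamma * (1.0 - done) * np.max(next_q_values)

# Terminal transitions (done=1) drop the bootstrap term entirely.
y = td_target(reward=1.0, next_q_values=np.array([0.2, 0.5]), done=0.0)
```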

The Project Brief

A systematic empirical study comparing Deep Q-Networks (DQN) and Deep Recurrent Q-Networks (DRQN) across standard and partially observable reinforcement learning environments.

Motivation: Atari games are typically treated as fully observable MDPs, but restricting the agent to a single frame induces partial observability — turning the problem into a POMDP. We hypothesize that LSTM-augmented networks can recover missing temporal context and outperform memoryless DQN under this constraint.
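The repository's exact wrappers are not shown here, but the core idea behind a "blackout" observation can be sketched as a pure function (names and the probability parameter are illustrative): with probability p the agent sees a blank screen instead of the true frame, so reward-relevant information such as ball velocity must be carried in memory.

```python
import numpy as np

def blackout(frame, p=0.5, rng=None):
    """Return the frame unchanged with prob (1 - p), else a black screen.

    The underlying emulator state is untouched; only the agent's
    single-frame observation is occasionally blanked. This induces
    partial observability: a memoryless policy cannot distinguish a
    blanked frame from a genuinely dark screen.
    """
    rng = rng or np.random.default_rng()
    return np.zeros_like(frame) if rng.random() < p else frame
```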

Contributions:

  • Clean, reproducible PyTorch implementations of DQN (3-layer CNN) and DRQN (CNN + LSTM).
  • Systematic comparison across Atari environments (Assault, Breakout) and CartPole under single-frame observation.
  • Ablation over replay memory strategies: standard experience replay vs. episode-based sequential replay for DRQN.
  • Quantitative evidence that recurrent memory provides measurable gains in partially observable settings.
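The episode-based sequential replay mentioned above can be sketched as follows (class and parameter names are assumptions, not taken from the repository). Standard replay samples i.i.d. transitions, which destroys the temporal order the LSTM needs; instead, whole episodes are stored and fixed-length contiguous windows are sampled so the hidden state can be unrolled over coherent experience.

```python
import random
from collections import deque

class EpisodeReplay:
    """Episode-based replay for a recurrent Q-network (sketch)."""

    def __init__(self, capacity=1000):
        # Oldest episodes are evicted once capacity is reached.
        self.episodes = deque(maxlen=capacity)

    def add_episode(self, transitions):
        self.episodes.append(list(transitions))

    def sample(self, batch_size, seq_len):
        """Sample contiguous windows of seq_len transitions."""
        batch = []
        for _ in range(batch_size):
            ep = random.choice(self.episodes)
            if len(ep) <= seq_len:
                # Short episodes are used whole in this sketch;
                # a full implementation would pad and mask them.
                batch.append(ep)
            else:
                start = random.randrange(len(ep) - seq_len + 1)
                batch.append(ep[start:start + seq_len])
        return batch
```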

Stack: Python · PyTorch · OpenAI Gymnasium · CUDA

Techniques

Deep Q-Network (DQN) DRQN (CNN + LSTM) POMDP Formulation Atari Wrappers Sequential Replay

Key Components

🧠
DQN Baseline
3-layer CNN with experience replay
🔁
DRQN + LSTM
Recurrent memory over temporal frames
🎮
Atari Environments
Assault, Breakout, CartPole-v1
🌫️
Partial Observability
Single-frame blackout wrappers
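The DRQN component above pairs the DQN convolutional stack with an LSTM in place of the first fully connected layer. A PyTorch sketch, assuming 84×84 grayscale inputs and the common three-layer Atari CNN (the layer sizes are assumptions, not read from the repository):

```python
import torch
import torch.nn as nn

class DRQN(nn.Module):
    """CNN feature extractor + LSTM over time (sketch; sizes assumed).

    Input: (batch, seq_len, 1, 84, 84) single grayscale frames.
    The CNN encodes each frame independently; the LSTM integrates
    per-frame features across the sequence to recover temporal context.
    """

    def __init__(self, n_actions, hidden=512):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # 84x84 input -> 7x7x64 feature map after the three convs.
        self.lstm = nn.LSTM(64 * 7 * 7, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, x, state=None):
        b, t = x.shape[:2]
        # Fold time into the batch for the CNN, then unfold for the LSTM.
        feats = self.cnn(x.flatten(0, 1)).view(b, t, -1)
        out, state = self.lstm(feats, state)
        return self.head(out), state  # Q-values per timestep, LSTM state
```

At evaluation time the recurrent state is threaded through successive calls, one frame at a time, so the agent accumulates context across the episode.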

Project Highlights

1
Abstract Focus

README frames the study as a POMDP testbed and investigates whether recurrent memory improves decision quality under limited observations.

2
Environments

CartPole-v1, Assault-v5, and Breakout-v5 are evaluated with wrappers that induce partial observability.

3
Results

Reported results show DRQN benefits in scenarios where temporal context is critical for high reward.

Quickstart

1
Install Python dependencies listed in README and open the Atari notebooks.
2
Run the DQN baseline notebook first to establish a reward curve to compare against.
3
Run DRQN notebook and compare rewards under blackout/partial-observation wrappers.

"Memory is not a luxury for an agent — it is the difference between reacting and reasoning."