Chapter 02  ·  Reinforcement Learning · Memory · POMDP

Remember To Play

When memory enters the Q-network, the agent stops guessing and starts planning.

Concept Note

Treat single-frame Atari as partially observable instead of fully observable — a more realistic framing.

Compare memoryless Q-learning versus recurrent Q-learning under controlled partial observability.

Evaluate exactly where temporal memory changes policy quality across different game dynamics.
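Both the memoryless and recurrent agents in this comparison learn from the same one-step Q-learning target; a minimal sketch of that target (the function name is illustrative, not from the repository):

```python
import numpy as np

def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step Q-learning target: y = r + gamma * max_a Q(s', a).

    For DQN the next-state Q-values come from a frame-based CNN; for
    DRQN they come from an LSTM conditioned on the episode so far.
    The target itself is identical in both cases; only how Q is
    computed differs.
    """
    return reward + gamma * (1.0 - done) * np.max(next_q_values)

# Terminal transitions (done=1) drop the bootstrap term entirely.
y = td_target(reward=1.0, next_q_values=np.array([0.2, 0.5]), done=0.0)
```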

The Project Brief

A systematic empirical study comparing Deep Q-Networks (DQN) and Deep Recurrent Q-Networks (DRQN) across standard and partially observable reinforcement learning environments.

Motivation: Atari games are typically treated as fully observable MDPs, but restricting the agent to a single frame induces partial observability — turning the problem into a POMDP. We hypothesize that LSTM-augmented networks can recover missing temporal context and outperform memoryless DQN under this constraint.
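The repository's exact wrappers are not shown here, but the core idea behind a "blackout" observation can be sketched as a pure function (names and the probability parameter are illustrative): with probability p the agent sees a blank screen instead of the true frame, so reward-relevant information such as ball velocity must be carried in memory.

```python
import numpy as np

def blackout(frame, p=0.5, rng=None):
    """Return the frame unchanged with prob (1 - p), else a black screen.

    The underlying emulator state is untouched; only the agent's
    single-frame observation is occasionally blanked. This induces
    partial observability: a memoryless policy cannot distinguish a
    blanked frame from a genuinely dark screen.
    """
    rng = rng or np.random.default_rng()
    return np.zeros_like(frame) if rng.random() < p else frame
```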

Contributions:

  • Clean, reproducible PyTorch implementations of DQN (3-layer CNN) and DRQN (CNN + LSTM).
  • Systematic comparison across Atari environments (Assault, Breakout) and CartPole under single-frame observation.
  • Ablation over replay memory strategies: standard experience replay vs. episode-based sequential replay for DRQN.
  • Quantitative evidence that recurrent memory provides measurable gains in partially observable settings.
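The episode-based sequential replay mentioned above can be sketched as follows (class and parameter names are assumptions, not taken from the repository). Standard replay samples i.i.d. transitions, which destroys the temporal order the LSTM needs; instead, whole episodes are stored and fixed-length contiguous windows are sampled so the hidden state can be unrolled over coherent experience.

```python
import random
from collections import deque

class EpisodeReplay:
    """Episode-based replay for a recurrent Q-network (sketch)."""

    def __init__(self, capacity=1000):
        # Oldest episodes are evicted once capacity is reached.
        self.episodes = deque(maxlen=capacity)

    def add_episode(self, transitions):
        self.episodes.append(list(transitions))

    def sample(self, batch_size, seq_len):
        """Sample contiguous windows of seq_len transitions."""
        batch = []
        for _ in range(batch_size):
            ep = random.choice(self.episodes)
            if len(ep) <= seq_len:
                # Short episodes are used whole in this sketch;
                # a full implementation would pad and mask them.
                batch.append(ep)
            else:
                start = random.randrange(len(ep) - seq_len + 1)
                batch.append(ep[start:start + seq_len])
        return batch
```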

Stack: Python · PyTorch · OpenAI Gymnasium · CUDA

Techniques

Deep Q-Network (DQN) DRQN (CNN + LSTM) POMDP Formulation Atari Wrappers Sequential Replay

Key Components

🧠
DQN Baseline
3-layer CNN with experience replay
🔁
DRQN + LSTM
Recurrent memory over temporal frames
🎮
Atari Environments
Assault, Breakout, CartPole-v1
🌫️
Partial Observability
Single-frame blackout wrappers
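The DRQN component above pairs the DQN convolutional stack with an LSTM in place of the first fully connected layer. A PyTorch sketch, assuming 84×84 grayscale inputs and the common three-layer Atari CNN (the layer sizes are assumptions, not read from the repository):

```python
import torch
import torch.nn as nn

class DRQN(nn.Module):
    """CNN feature extractor + LSTM over time (sketch; sizes assumed).

    Input: (batch, seq_len, 1, 84, 84) single grayscale frames.
    The CNN encodes each frame independently; the LSTM integrates
    per-frame features across the sequence to recover temporal context.
    """

    def __init__(self, n_actions, hidden=512):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # 84x84 input -> 7x7x64 feature map after the three convs.
        self.lstm = nn.LSTM(64 * 7 * 7, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, x, state=None):
        b, t = x.shape[:2]
        # Fold time into the batch for the CNN, then unfold for the LSTM.
        feats = self.cnn(x.flatten(0, 1)).view(b, t, -1)
        out, state = self.lstm(feats, state)
        return self.head(out), state  # Q-values per timestep, LSTM state
```

At evaluation time the recurrent state is threaded through successive calls, one frame at a time, so the agent accumulates context across the episode.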

Project Highlights

1
Abstract Focus

README frames the study as a POMDP testbed and investigates whether recurrent memory improves decision quality under limited observations.

2
Environments

CartPole-v1, Assault-v5, and Breakout-v5 are evaluated with wrappers that induce partial observability.

3
Results

Reported results show DRQN benefits in scenarios where temporal context is critical for high reward.

Quickstart

1
Install Python dependencies listed in README and open the Atari notebooks.
2
Run the DQN baseline notebook first to establish a reward curve to compare against.
3
Run DRQN notebook and compare rewards under blackout/partial-observation wrappers.

"Memory is not a luxury for an agent — it is the difference between reacting and reasoning."