2025 Summer School @ Peking University – Foundations of Reinforcement Learning

A rigorous introduction to sequential decision-making, from Markov decision processes to deep reinforcement learning.

What is Reinforcement Learning?

Reinforcement learning (RL) is a computational framework for decision-making under uncertainty. An RL agent learns to interact with its environment by trial and error, aiming to maximize long-term reward.
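The trial-and-error loop can be sketched in a few lines of Python. Below is a minimal, self-contained illustration with a hypothetical one-step "guess the coin flip" environment (the `CoinFlipEnv` class and its `reset`/`step` interface are invented here for illustration, loosely in the style of Gym-like APIs, and are not part of the course materials):

```python
import random

random.seed(0)  # for reproducibility

class CoinFlipEnv:
    """Toy one-step environment: guess a coin flip; reward 1 for a correct guess."""
    def reset(self):
        self.done = False
        return 0  # a single dummy state

    def step(self, action):
        coin = random.randint(0, 1)
        reward = 1.0 if action == coin else 0.0
        self.done = True  # episode ends after one guess
        return 0, reward, self.done

def run_episode(env, policy):
    """One episode of agent-environment interaction: observe, act, receive reward."""
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(state)
        state, reward, done = env.step(action)
        total_reward += reward
    return total_reward

env = CoinFlipEnv()
returns = [run_episode(env, policy=lambda s: random.randint(0, 1))
           for _ in range(1000)]
print(sum(returns) / len(returns))  # a random policy averages about 0.5
```

The agent here acts at random; learning algorithms replace the fixed `policy` with one that improves from observed rewards.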

From early successes in games to modern applications in robotics, healthcare, and education, RL provides a foundation for building intelligent systems that learn from experience.

A Principled Approach

At the heart of reinforcement learning lies the Markov decision process (MDP): a formal model describing how agents, states, actions, and rewards evolve over time.
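In symbols, one standard formalization (notation varies across textbooks) is:

```latex
\[
  \mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \gamma),
\]
```

where $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, $P(s' \mid s, a)$ the transition kernel, $r(s, a)$ the reward function, and $\gamma \in [0, 1)$ the discount factor. The Markov property states that the next state depends only on the current state and action, not on the earlier history.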

This course builds from first principles, covering:

  • Transition dynamics and reward models
  • Policies and value functions
  • The Bellman equations
  • Monte Carlo and temporal-difference methods
  • Policy optimization techniques

We emphasize clarity, rigor, and the connections between theory and practice.

Topics Covered

  • What is an MDP?
  • Policies and Interaction Protocols
  • Value Functions and the Bellman Equations
  • Monte Carlo and TD Learning
  • Function Approximation
  • Policy Gradient and Actor-Critic Methods
  • Exploration, Generalization, and Safety
  • Real-World Challenges and Applications

Lecture Series Overview

10 Lectures – 2 Hours Each

Week 1: Theoretical Foundations

Week 2: Algorithms and Applications

Syllabus

An overview of the topics can be found here: Overview

Lecture | Files | Topics
Lecture 1 | Notes, Slides | What is RL? · MDP components · Agent-environment interaction · Markov property · Policies
Lecture 2 | Notes | Returns and task types · RL objective · Adequacy of Markov policies · Value functions · Bellman equations
Lecture 3 | – | Dynamic programming · Policy evaluation and improvement · Value and policy iteration
Lecture 4 | – | Multi-armed bandits · Exploration vs. exploitation · Regret, ε-greedy, UCB
Lecture 5 | – | Monte Carlo methods · First-visit and every-visit estimation · Monte Carlo control
Lecture 6 | – | Temporal-difference learning · TD(0), SARSA, Q-learning
Lecture 7 | – | Function approximation · Linear methods · Semi-gradient TD
Lecture 8 | – | Policy gradient methods · REINFORCE · Variance reduction
Lecture 9 | – | Actor-critic methods · Deep RL: instability, tricks · Replay buffers
Lecture 10 | – | Advanced topics: safe RL, offline RL, AlphaZero · Applications and open challenges