2025 Summer School @ Peking University – Foundations of Reinforcement Learning
A rigorous introduction to sequential decision-making, from Markov decision processes to deep reinforcement learning.
What is Reinforcement Learning?
Reinforcement learning (RL) is a computational framework for decision-making under uncertainty. An RL agent learns to interact with its environment by trial and error, aiming to maximize long-term reward.
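To make the trial-and-error idea concrete, here is a minimal sketch of an agent interacting with a toy environment (the guessing game, the ε-greedy rule, and all numbers below are illustrative assumptions, not course material): the agent tries actions, observes rewards, and gradually prefers the actions that have paid off.

```python
import random

random.seed(0)

# A toy "environment": the agent guesses a hidden number in {0, 1, 2};
# reward is 1 for a correct guess, 0 otherwise.
HIDDEN = 1

def step(action):
    """Return (reward, done) for the chosen action."""
    return (1.0 if action == HIDDEN else 0.0), True

def estimate(a, totals, counts):
    """Average reward observed for action a so far (0 if untried)."""
    return totals[a] / counts[a] if counts[a] else 0.0

# Trial and error: keep running reward estimates per action and
# mostly exploit the best one, while occasionally exploring.
counts = [0, 0, 0]
totals = [0.0, 0.0, 0.0]
for episode in range(200):
    if random.random() < 0.1:                      # explore
        action = random.randrange(3)
    else:                                          # exploit current estimates
        action = max(range(3), key=lambda a: estimate(a, totals, counts))
    reward, _ = step(action)
    counts[action] += 1
    totals[action] += reward

best = max(range(3), key=lambda a: estimate(a, totals, counts))
```

After a few hundred interactions the agent's greedy choice settles on the rewarding action, which is the core learning-from-experience loop the course formalizes.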
From early successes in games to modern applications in robotics, healthcare, and education, RL provides a foundation for building intelligent systems that learn from experience.
A Principled Approach
At the heart of reinforcement learning lies the Markov decision process (MDP): a formal model describing how agents, states, actions, and rewards evolve over time.
This course builds from first principles, covering:
- Transition dynamics and reward models
- Policies and value functions
- The Bellman equations
- Monte Carlo and temporal-difference methods
- Policy optimization techniques
We emphasize clarity, rigor, and the connections between theory and practice.
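As a small taste of how these pieces fit together, the following sketch evaluates a fixed policy on a made-up two-state MDP by iterating the Bellman expectation equation V(s) = Σ_a π(a|s) Σ_s' P(s'|s,a) [R(s,a) + γ V(s')] to a fixed point (the transition probabilities, rewards, and discount factor are illustrative assumptions):

```python
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]

# Hypothetical dynamics: P[s][a] = list of (next_state, probability);
# R[s][a] = expected immediate reward for taking a in s.
P = {0: {0: [(0, 0.7), (1, 0.3)], 1: [(1, 1.0)]},
     1: {0: [(0, 1.0)],           1: [(1, 1.0)]}}
R = {0: {0: 0.0, 1: 1.0},
     1: {0: 0.0, 1: 2.0}}

def evaluate(pi, tol=1e-8):
    """Iterative policy evaluation: sweep the Bellman update until convergence."""
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            v = sum(pi[s][a] * (R[s][a] + GAMMA * sum(p * V[s2] for s2, p in P[s][a]))
                    for a in ACTIONS)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

# Evaluate the uniformly random policy pi(a|s) = 0.5.
uniform = {s: {a: 0.5 for a in ACTIONS} for s in STATES}
V = evaluate(uniform)
```

The returned values satisfy the Bellman expectation equation, which is exactly the fixed-point property the lectures derive.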
Topics Covered
- What is an MDP?
- Policies and Interaction Protocols
- Value Functions and the Bellman Equations
- Monte Carlo and TD Learning
- Function Approximation
- Policy Gradient and Actor-Critic Methods
- Exploration, Generalization, and Safety
- Real-World Challenges and Applications
Lecture Series Overview
10 Lectures – 2 Hours Each
Week 1: Theoretical Foundations
Week 2: Algorithms and Applications
Syllabus
An overview of the topics can be found here: Overview
| Lectures | Files | Topics |
| --- | --- | --- |
| Lecture 1 | Notes, Slides | What is RL? · MDP components · Agent-environment interaction · Markov property · Policies |
| Lecture 2 | Notes | Discounted return · Task types · RL objective · Occupancy measures · Adequacy of Markov policies |
| Lecture 3 | Notes | Value functions · Bellman equations · Policy evaluation · Policy improvement |
| Lecture 4 | Notes | Value and policy iteration · Multi-armed bandits · Exploration vs. exploitation · Regret, ε-greedy, UCB |
| Lecture 5 | | Monte Carlo methods · First-visit and every-visit estimation · Monte Carlo control |
| Lecture 6 | | Temporal-difference learning · TD(0) · SARSA · Q-learning |
| Lecture 7 | | Function approximation · Linear methods · Semi-gradient TD |
| Lecture 8 | | Policy gradient methods · REINFORCE · Variance reduction |
| Lecture 9 | | Actor-critic methods · Deep RL: instability, tricks, replay buffers |
| Lecture 10 | | Advanced topics: safe RL, offline RL, AlphaZero · Applications and open challenges |
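To preview the algorithmic material of the second week, here is a minimal tabular Q-learning sketch of the kind developed in Lecture 6. The chain environment and all hyperparameters are hypothetical choices for illustration, not course code:

```python
import random

random.seed(0)

# Toy 5-state chain: start in state 0; action 1 moves right, action 0
# moves left; reaching state 4 (the goal) pays reward +1 and ends the episode.
N, GOAL = 5, 4
ALPHA, GAMMA, EPS = 0.1, 0.95, 0.1

def step(s, a):
    """One environment transition: (next_state, reward, done)."""
    s2 = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

Q = [[0.0, 0.0] for _ in range(N)]
for _ in range(500):
    s, done = 0, False
    while not done:
        # epsilon-greedy behavior policy
        a = random.randrange(2) if random.random() < EPS else (0 if Q[s][0] > Q[s][1] else 1)
        s2, r, done = step(s, a)
        # Q-learning update: bootstrap from the greedy value of the next state
        target = r + (0.0 if done else GAMMA * max(Q[s2]))
        Q[s][a] += ALPHA * (target - Q[s][a])
        s = s2

greedy = [0 if Q[s][0] > Q[s][1] else 1 for s in range(N)]
```

After training, the greedy policy moves right in every non-goal state, i.e. the agent has learned the shortest path to the reward from sampled transitions alone.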