2025 Summer School @ Peking University – Foundations of Reinforcement Learning

A rigorous introduction to sequential decision-making, from Markov decision processes to deep reinforcement learning.

What is Reinforcement Learning?

Reinforcement learning (RL) is a computational framework for decision-making under uncertainty. An RL agent learns to interact with its environment by trial and error, aiming to maximize long-term reward.
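The trial-and-error loop can be sketched in a few lines of Python. Below is a minimal, self-contained illustration with a hypothetical one-step "guess the coin flip" environment (the `CoinFlipEnv` class and its `reset`/`step` interface are invented here for illustration, loosely in the style of Gym-like APIs, and are not part of the course materials):

```python
import random

random.seed(0)  # for reproducibility

class CoinFlipEnv:
    """Toy one-step environment: guess a coin flip; reward 1 for a correct guess."""
    def reset(self):
        self.done = False
        return 0  # a single dummy state

    def step(self, action):
        coin = random.randint(0, 1)
        reward = 1.0 if action == coin else 0.0
        self.done = True  # episode ends after one guess
        return 0, reward, self.done

def run_episode(env, policy):
    """One episode of agent-environment interaction: observe, act, receive reward."""
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(state)
        state, reward, done = env.step(action)
        total_reward += reward
    return total_reward

env = CoinFlipEnv()
returns = [run_episode(env, policy=lambda s: random.randint(0, 1))
           for _ in range(1000)]
print(sum(returns) / len(returns))  # a random policy averages about 0.5
```

The agent here acts at random; learning algorithms replace the fixed `policy` with one that improves from observed rewards.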

From early successes in games to modern applications in robotics, healthcare, and education, RL provides a foundation for building intelligent systems that learn from experience.

A Principled Approach

At the heart of reinforcement learning lies the Markov decision process (MDP): a formal model describing how agents, states, actions, and rewards evolve over time.
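In symbols, one standard formalization (notation varies across textbooks) is:

```latex
\[
  \mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \gamma),
\]
```

where $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, $P(s' \mid s, a)$ the transition kernel, $r(s, a)$ the reward function, and $\gamma \in [0, 1)$ the discount factor. The Markov property states that the next state depends only on the current state and action, not on the earlier history.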

This course builds from first principles, covering:

  • Transition dynamics and reward models
  • Policies and value functions
  • The Bellman equations
  • Monte Carlo and temporal-difference methods
  • Policy optimization techniques

We emphasize clarity, rigor, and the connections between theory and practice.

Topics Covered

  • What is an MDP?
  • Policies and Interaction Protocols
  • Value Functions and the Bellman Equations
  • Monte Carlo and TD Learning
  • Function Approximation
  • Policy Gradient and Actor-Critic Methods
  • Exploration, Generalization, and Safety
  • Real-World Challenges and Applications

Lecture Series Overview

10 Lectures – 2 Hours Each

Week 1: Theoretical Foundations

Week 2: Algorithms and Applications

Syllabus

An overview of the topics can be found here: Overview

Lecture | Files | Topics
Lecture 1 | Notes, Slides | What is RL? · MDP components · Agent-environment interaction · Markov property · Policies
Lecture 2 | Notes | Returns and task types · RL objective · Adequacy of Markov policies · Value functions · Bellman equations
Lecture 3 | – | Dynamic programming · Policy evaluation and improvement · Value and policy iteration
Lecture 4 | – | Multi-armed bandits · Exploration vs. exploitation · Regret, ε-greedy, UCB
Lecture 5 | – | Monte Carlo methods · First-visit and every-visit estimation · Monte Carlo control
Lecture 6 | – | Temporal-difference learning · TD(0), SARSA, Q-learning
Lecture 7 | – | Function approximation · Linear methods · Semi-gradient TD
Lecture 8 | – | Policy gradient methods · REINFORCE · Variance reduction
Lecture 9 | – | Actor-critic methods · Deep RL: instability, tricks · Replay buffers
Lecture 10 | – | Advanced topics: safe RL, offline RL, AlphaZero · Applications and open challenges