1 paper accepted to L4DC

Our paper on data-driven acceleration of Model Predictive Control (MPC) [1] has been accepted to the 8th Annual Learning for Dynamics and Control Conference. Congrats to Agustin and Shijie!

[1] A. Castellano, S. Pan, and E. Mallada, “Data-driven Acceleration of MPC with Guarantees,” in Proceedings of The 8th Annual Learning for Dynamics and Control Conference, 2026.

Model Predictive Control (MPC) is a powerful framework for optimal control but can be too slow for low-latency applications. We present a data-driven framework to accelerate MPC by replacing online optimization with a nonparametric policy constructed from offline MPC solutions. Our policy is greedy with respect to a constructed upper bound on the optimal cost-to-go, and can be implemented as a nonparametric lookup rule that is orders of magnitude faster than solving MPC online. Our analysis shows that under a sufficient coverage condition on the offline data, the policy is recursively feasible and admits a provable, bounded optimality gap. These conditions establish an explicit trade-off between the amount of data collected and the tightness of the bounds. Our experiments show that this policy is between $100$ and $1000$ times faster than standard MPC, with only a modest hit to optimality, showing potential for real-time control tasks.
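
As a rough illustration of the lookup idea in this abstract, the Python sketch below returns the stored action whose data point minimizes a Lipschitz-style bound `J_i + L * ||x - x_i||`. The form of the bound, the constant `L`, the toy data, and the name `lookup_policy` are assumptions made for illustration and are not taken from the paper; this simplified rule also does not optimize over the next state the way a fully greedy policy would.

```python
import math


def lookup_policy(x, states, actions, costs, lipschitz):
    """Return the stored action minimizing the assumed bound J_i + L * ||x - x_i||."""
    best_index, best_bound = None, math.inf
    for i, (x_i, j_i) in enumerate(zip(states, costs)):
        # Assumed upper bound on the cost-to-go at x, built from data point i.
        bound = j_i + lipschitz * math.dist(x, x_i)
        if bound < best_bound:
            best_index, best_bound = i, bound
    return actions[best_index], best_bound


# Hypothetical offline data: 1-D states, stored first MPC inputs, optimal costs.
states = [(-1.0,), (0.0,), (1.0,)]
actions = [0.5, 0.0, -0.5]
costs = [1.0, 0.0, 1.0]

action, bound = lookup_policy((0.9,), states, actions, costs, lipschitz=2.0)
print(action, bound)
```

Each query is a single pass over the stored data, with no optimization problem solved online.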

@inproceedings{cpm2026l4dc,
  abstract = {Model Predictive Control (MPC) is a powerful framework for optimal control but can be too slow for low-latency applications. We present a data-driven framework to accelerate MPC by replacing online optimization with a nonparametric policy constructed from offline MPC solutions. Our policy is greedy with respect to a constructed upper bound on the optimal cost-to-go, and can be implemented as a nonparametric lookup rule that is orders of magnitude faster than solving MPC online. Our analysis shows that under a sufficient coverage condition on the offline data, the policy is recursively feasible and admits a provable, bounded optimality gap. These conditions establish an explicit trade-off between the amount of data collected and the tightness of the bounds. Our experiments show that this policy is between $100$ and $1000$ times faster than standard MPC, with only a modest hit to optimality, showing potential for real-time control tasks.},
  author = {Castellano, Agustin and Pan, Shijie and Mallada, Enrique},
  booktitle = {Proceedings of The 8th Annual Learning for Dynamics and Control Conference},
  grants = {Global-Centers-2330450; DOE-ASCR-826565},
  month = {6},
  organization = {PMLR},
  pubstate = {accepted},
  record = {accepted Mar. 2026, submitted Nov. 2025},
  title = {Data-driven Acceleration of MPC with Guarantees},
  url = {https://mallada.ece.jhu.edu/pubs/2026-L4DC-CPM.pdf},
  year = {2026}
}

2 papers accepted to ACC

Our papers on safety-critical control via recurrent tracking functions [1] and on data-driven practical stabilization via chain policies [2] have been accepted to the American Control Conference. Congrats Jixian and Roy!

[1] J. Liu and E. Mallada, “Safety-Critical Control via Recurrent Tracking Functions,” in American Control Conference (ACC), 2026, pp. 1-7.

This paper addresses the challenge of synthesizing safety-critical controllers for high-order nonlinear systems, where constructing valid Control Barrier Functions (CBFs) remains computationally intractable. Leveraging layered control, we design CBFs in reduced-order models (RoMs) while regulating full-order models’ (FoMs) dynamics at the same time. Traditional Lyapunov tracking functions are required to decrease monotonically, but systematic synthesis methods for such functions exist only for fully-actuated systems. To overcome this limitation, we introduce Recurrent Tracking Functions (RTFs), which replace the monotonic decay requirement with a weaker finite-time recurrence condition. This relaxation permits transient deviations of tracking errors while ensuring safety. By augmenting CBFs for RoMs with RTFs, we construct recurrent CBFs (RCBFs) whose zero-superlevel set is control $τ$-recurrent, and guarantee safety for all initial states in such a set when RTFs are satisfied. We establish theoretical safety guarantees and validate the approach through numerical experiments, demonstrating RTFs’ effectiveness and the safety of FoMs.

@inproceedings{lm2026acc,
  abstract = {This paper addresses the challenge of synthesizing safety-critical controllers for high-order nonlinear systems, where constructing valid Control Barrier Functions (CBFs) remains computationally intractable. Leveraging layered control, we design CBFs in reduced-order models (RoMs) while regulating full-order models' (FoMs) dynamics at the same time. Traditional Lyapunov tracking functions are required to decrease monotonically, but systematic synthesis methods for such functions exist only for fully-actuated systems. To overcome this limitation, we introduce Recurrent Tracking Functions (RTFs), which replace the monotonic decay requirement with a weaker finite-time recurrence condition. This relaxation permits transient deviations of tracking errors while ensuring safety. By augmenting CBFs for RoMs with RTFs, we construct recurrent CBFs (RCBFs) whose zero-superlevel set is control $τ$-recurrent, and guarantee safety for all initial states in such a set when RTFs are satisfied. We establish theoretical safety guarantees and validate the approach through numerical experiments, demonstrating RTFs' effectiveness and the safety of FoMs.},
  author = {Liu, Jixian and Mallada, Enrique},
  booktitle = {American Control Conference (ACC)},
  grants = {Global-Centers-2330450; DOE-ASCR-826565},
  month = {5},
  pages = {1-7},
  pubstate = {accepted},
  record = {accepted Feb. 2026, submitted Sep. 2025},
  title = {Safety-Critical Control via Recurrent Tracking Functions},
  url = {https://mallada.ece.jhu.edu/pubs/2026-ACC-LM.pdf},
  year = {2026}
}
[2] R. Siegelmann and E. Mallada, “Data-driven Practical Stabilization of Nonlinear Systems via Chain Policies: Sample Complexity and Incremental Learning,” in American Control Conference (ACC), 2026, pp. 1-8.

We propose a method for data-driven practical stabilization of nonlinear systems with provable guarantees, based on the concept of Nonparametric Chain Policies (NCPs). The approach employs a normalized nearest-neighbor rule to assign, at each state, a finite-duration control signal derived from stored data, after which the process repeats. Unlike recent works that model the system as linear, polynomial, or polynomial fraction, we only assume the system to be locally Lipschitz. Our analysis builds on the framework of Recurrent Lyapunov Functions (RLFs), which enable data-driven certification of (practical) stability using standard norm functions instead of requiring the explicit construction of a classical Lyapunov function. To extend this framework, we introduce the concept of Recurrent Control Lyapunov Functions (R-CLFs), which can certify the existence of an NCP that practically stabilizes an arbitrarily small $c$-neighborhood of an equilibrium point. We also provide an explicit sample complexity guarantee of $\mathcal{O}\!\left((3/\rho)^d \log(R/c)\right)$ trajectories, where $R$ is the domain radius, $d$ the state dimension, and $\rho$ a system-dependent constant. The proposed Chain Policies are nonparametric, thus allowing new verified data to be readily incorporated into the policy to either improve convergence rate or enlarge the certified region. Numerical experiments illustrate and validate these properties.
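
The "look up, apply a finite-duration signal, repeat" loop described in this abstract can be sketched in a few lines of Python. This is only a toy under stated assumptions: it uses a plain Euclidean nearest-neighbor lookup in place of the normalized rule the abstract mentions, and the scalar system `x_next = 1.1 * x + u`, the stored data, and all function names are hypothetical.

```python
def nearest_index(x, stored_states):
    """Index of the stored state closest to x (plain distance, 1-D toy case)."""
    return min(range(len(stored_states)), key=lambda i: abs(x - stored_states[i]))


def step(x, u):
    """Hypothetical unstable scalar system: x_next = 1.1 * x + u."""
    return 1.1 * x + u


def run_chain_policy(x, stored_states, stored_controls, rounds):
    """Repeat: find the nearest stored state, apply its whole control signal open loop."""
    for _ in range(rounds):
        i = nearest_index(x, stored_states)
        for u in stored_controls[i]:
            x = step(x, u)
    return x


# Hypothetical data: each stored state is paired with a two-step control signal.
stored_states = [-2.0, -1.0, 1.0, 2.0]
stored_controls = [[1.2, 0.6], [0.6, 0.3], [-0.6, -0.3], [-1.2, -0.6]]

final_state = run_chain_policy(1.8, stored_states, stored_controls, rounds=5)
print(final_state)
```

In this toy run the state does not converge to the origin but remains in a neighborhood of it, which is the flavor of practical stabilization. Because the policy is just a table, new state and control pairs can be appended to the two lists without retraining anything.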

@inproceedings{sm2026acc,
  abstract = {We propose a method for data-driven practical stabilization of nonlinear systems with provable guarantees, based on the concept of \emph{Nonparametric Chain Policies} (NCPs). The approach employs a normalized nearest-neighbor rule to assign, at each state, a finite-duration control signal derived from stored data, after which the process repeats.
Unlike recent works that model the system as linear, polynomial, or polynomial fraction, we only assume the system to be locally Lipschitz.
Our analysis builds on the framework of Recurrent Lyapunov Functions (RLFs), which enable data-driven certification of (practical) stability using standard norm functions instead of requiring the explicit construction of a classical Lyapunov function. To extend this framework, we introduce the concept of Recurrent Control Lyapunov Functions (R-CLFs), which can certify the existence of an NCP that practically stabilizes an arbitrarily small $c$-neighborhood of an equilibrium point.
We also provide an explicit sample complexity guarantee of $\mathcal{O}\!\left((3/\rho)^d \log(R/c)\right)$ trajectories, where $R$ is the domain radius, $d$ the state dimension, and $\rho$ a system-dependent constant. The proposed Chain Policies are nonparametric, thus allowing new verified data to be readily incorporated into the policy to either improve convergence rate or enlarge the certified region. Numerical experiments illustrate and validate these properties.},
  author = {Siegelmann, Roy and Mallada, Enrique},
  booktitle = {American Control Conference (ACC)},
  grants = {Global-Centers-2330450; DOE-ASCR-826565},
  month = {5},
  pages = {1-8},
  pubstate = {accepted},
  record = {accepted Feb. 2026, submitted Sep. 2025},
  title = {Data-driven Practical Stabilization of Nonlinear Systems via Chain Policies: Sample Complexity and Incremental Learning},
  url = {https://mallada.ece.jhu.edu/pubs/2026-ACC-SgM.pdf},
  year = {2026}
}

1 paper published in NAHS

Our paper on recurrence entropy of nonlinear control systems [1] has been published in Nonlinear Analysis: Hybrid Systems. The paper shows that making a set recurrent, in the sense that trajectories starting in the set must return to it, is provably less complex than making it invariant, and characterizes the minimum data rates and finite control alphabets needed to achieve it.

[1] H. Sibai and E. Mallada, “Recurrence of Nonlinear Control Systems: Entropy, Bit Rates, and Finite Alphabets,” Nonlinear Analysis: Hybrid Systems, vol. 59, iss. 101649, pp. 1-16, 2026.

In this paper, we introduce the notion of recurrence entropy in the context of nonlinear control systems. A set is said to be ($τ$-)recurrent if every trajectory that starts in the set returns to it (within at most $τ$ units of time). The recurrence entropy of a control system quantifies the complexity of making a set $τ$-recurrent measured by the average rate of growth, as time increases, of the number of control signals required to achieve this goal. Our analysis reveals that, compared to invariance, recurrence is quantitatively less complex, meaning that the recurrence entropy of a set is no larger than, and often strictly smaller than, the invariance entropy. We provide upper and lower bounds on recurrence entropy and show that they converge to the bounds on invariance entropy as $τ$ decreases to zero. Further, our results show that recurrence entropy lower bounds the minimum data rate between the sensor and controller required for achieving recurrence. We present an algorithm according to which the sensor can send state estimates to the controller over a limited-bandwidth channel to achieve recurrence asymptotically at an exponential rate. Finally, we show that, under mildly stricter conditions on the set and dynamics, the control signals that enforce the $τ$-recurrence of a set can be generated by a finite alphabet of control signals of durations of at most $τ$ units of time, which allows us to store them for quick online execution.
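
The difference between invariance and recurrence in the definition above can be seen on a small example. The Python sketch below is a toy under stated assumptions: it uses a discrete-time, uncontrolled quarter-turn rotation and a rectangular set, both chosen for illustration rather than taken from the paper, and it checks the two properties only on a finite grid of sample points, which is a sanity check and not a proof.

```python
def step(state):
    """Quarter-turn rotation of the plane: (x, y) -> (-y, x)."""
    x, y = state
    return (-y, x)


def in_set(state):
    """Membership in the rectangle [-2, 2] x [-1, 1]."""
    x, y = state
    return abs(x) <= 2 and abs(y) <= 1


def returns_within(state, tau):
    """True if the trajectory from `state` re-enters the set within tau steps."""
    for _ in range(tau):
        state = step(state)
        if in_set(state):
            return True
    return False


# Sample points covering the rectangle on a half-unit grid.
grid = [(x / 2, y / 2) for x in range(-4, 5) for y in range(-2, 3)]

is_invariant = all(in_set(step(p)) for p in grid)
is_recurrent = all(returns_within(p, 2) for p in grid)
print(is_invariant, is_recurrent)
```

The point `(2, 0)` leaves the rectangle after one step, so the set is not invariant, yet two steps amount to a half-turn that maps the rectangle onto itself, so every sampled trajectory is back within two steps.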

@article{sm2026nahs,
  abstract = {In this paper, we introduce the notion of recurrence entropy in the context of nonlinear control systems. A set is said to be ($τ$-)recurrent if every trajectory that starts in the set returns to it (within at most $τ$ units of time). The recurrence entropy of a control system quantifies the complexity of making a set $τ$-recurrent measured by the average rate of growth, as time increases, of the number of control signals required to achieve this goal. Our analysis reveals that, compared to invariance, recurrence is quantitatively less complex, meaning that the recurrence entropy of a set is no larger than, and often strictly smaller than, the invariance entropy. We provide upper and lower bounds on recurrence entropy and show that they converge to the bounds on invariance entropy as $τ$ decreases to zero. Further, our results show that recurrence entropy lower bounds the minimum data rate between the sensor and controller required for achieving recurrence. We present an algorithm according to which the sensor can send state estimates to the controller over a limited-bandwidth channel to achieve recurrence asymptotically at an exponential rate. Finally, we show that, under mildly stricter conditions on the set and dynamics, the control signals that enforce the $τ$-recurrence of a set can be generated by a finite alphabet of control signals of durations of at most $τ$ units of time, which allows us to store them for quick online execution.},
  author = {Sibai, Hussein and Mallada, Enrique},
  doi = {10.1016/j.nahs.2025.101649},
  grants = {CPS-2136324; Global-Centers-2330450; CAREER-1752362},
  journal = {Nonlinear Analysis: Hybrid Systems},
  month = {2},
  number = {101649},
  pages = {1-16},
  record = {published Feb 2026, online Oct 2025, accepted Oct 2025, submitted Feb 2025},
  title = {Recurrence of Nonlinear Control Systems: Entropy, Bit Rates, and Finite Alphabets},
  url = {https://mallada.ece.jhu.edu/pubs/2026-NAHS-SM.pdf},
  volume = {59},
  year = {2026}
}

1 paper published in EJOR

Our paper on counterfactual analysis of default bid market power mitigation strategies in two-stage electricity markets [1] has been published in the European Journal of Operational Research. Congrats Rajni!

[1] R. K. Bansal, P. You, Y. Chen, and E. Mallada, “Counterfactual analysis of default bid market power mitigation strategies in two-stage electricity markets,” European Journal of Operational Research, pp. 1-18, 2025.

Market power remains a persistent challenge in many liberalized electricity markets worldwide, driving the adoption of ex-ante and ex-post mitigation measures. Despite locational mitigation tools (e.g., cost-based reference levels or default energy bids), evidence of price manipulation has motivated system-level market power mitigation (MPM) policies. However, the full implications of these rules are not well understood, and limited insight into participant behavior can lead to unintended consequences, including increased market power and welfare losses. We study sequentially cleared electricity markets and analyze a two-stage settlement structure commonly used by system operators (e.g., day-ahead and real-time markets in North America). Our focus is on MPM policies that replace noncompetitive generator offers with operator-estimated default bids, and we model competition between generators and loads with inelastic energy requirements who act strategically in allocating demand across stages under real-time, day-ahead, and simultaneous applications of MPM policies. Motivated by the loss of Nash equilibrium under conventional supply-function bidding, we adopt an alternative mechanism in which generators bid the intercept of an affine supply function. Under real-time MPM, strategic interaction in the day-ahead market drives all demand to real time, producing an undesirable outcome. To test robustness, we incorporate demand uncertainty using a variance-penalized expectation framework. Low risk aversion still leads to substantial real-time clearing, while imbalances in risk preferences further amplify market power. Overall, intercept-function bidding combined with day-ahead and simultaneous MPM policies mitigates generator market power more effectively than real-time substitution alone, although these policies shift some market power toward loads.

@article{bcym2025ejor,
  abstract = {Market power remains a persistent challenge in many liberalized electricity markets worldwide, driving the adoption of ex-ante and ex-post mitigation measures. Despite locational mitigation tools (e.g., cost-based reference levels or default energy bids), evidence of price manipulation has motivated system-level market power mitigation (MPM) policies. However, the full implications of these rules are not well understood, and limited insight into participant behavior can lead to unintended consequences, including increased market power and welfare losses. We study sequentially cleared electricity markets and analyze a two-stage settlement structure commonly used by system operators (e.g., day-ahead and real-time markets in North America). Our focus is on MPM policies that replace noncompetitive generator offers with operator-estimated default bids, and we model competition between generators and loads with inelastic energy requirements who act strategically in allocating demand across stages under real-time, day-ahead, and simultaneous applications of MPM policies. Motivated by the loss of Nash equilibrium under conventional supply-function bidding, we adopt an alternative mechanism in which generators bid the intercept of an affine supply function. Under real-time MPM, strategic interaction in the day-ahead market drives all demand to real time, producing an undesirable outcome. To test robustness, we incorporate demand uncertainty using a variance-penalized expectation framework. Low risk aversion still leads to substantial real-time clearing, while imbalances in risk preferences further amplify market power. Overall, intercept-function bidding combined with day-ahead and simultaneous MPM policies mitigates generator market power more effectively than real-time substitution alone, although these policies shift some market power toward loads.},
  author = {Bansal, Rajni Kant and You, Pengcheng and Chen, Yue and Mallada, Enrique},
  doi = {10.1016/j.ejor.2025.12.030},
  grants = {CAREER-1752362; CPS-2136324; Global-Centers-2330450},
  issn = {0377-2217},
  journal = {European Journal of Operational Research},
  month = {12},
  pages = {1-18},
  record = {online Dec 2025, accepted Dec 2025, under revision Jan 2024, submitted Aug 2023},
  title = {Counterfactual analysis of default bid market power mitigation strategies in two-stage electricity markets},
  url = {https://mallada.ece.jhu.edu/pubs/2025-EJOR-BCYM.pdf},
  year = {2025}
}

1 paper published in TAC

Our paper on the stability, economic efficiency, and incentive compatibility of electricity market dynamics [1] has been published in the IEEE Transactions on Automatic Control. Congrats Pengcheng and Yan!

[1] P. You, Y. Jiang, E. Yeung, D. Gayme, and E. Mallada, “On the Stability, Economic Efficiency and Incentive Compatibility of Electricity Market Dynamics,” IEEE Transactions on Automatic Control, vol. 70, iss. 10, pp. 6815-6830, 2025.

This paper focuses on the operation of an electricity market that accounts for participants that bid at a sub-minute timescale. To that end, we model the market-clearing process as a dynamical system, called market dynamics, which is temporally coupled with the grid frequency dynamics and is thus required to guarantee system-wide stability while meeting the system operational constraints. We characterize participants as price-takers who rationally update their bids to maximize their utility in response to real-time schedules of prices and dispatch. For two common bidding mechanisms, based on quantity and price, we identify a notion of alignment between participants’ behavior and planners’ goals that leads to a saddle-based design of the market that guarantees convergence to a point meeting all operational constraints. We further explore cases where this alignment property does not hold and observe that misaligned participants’ bidding can destabilize the closed-loop system. We thus design a regularized version of the market dynamics that recovers all the desirable stability and steady-state performance guarantees. Numerical tests validate our results on the IEEE 39-bus system.

@article{yjygm2025tac,
  abstract = {This paper focuses on the operation of an electricity market that accounts for participants that bid at a sub-minute timescale. To that end, we model the market-clearing process as a dynamical system, called market dynamics, which is temporally coupled with the grid frequency dynamics and is thus required to guarantee system-wide stability while meeting the system operational constraints. We characterize participants as price-takers who rationally update their bids to maximize their utility in response to real-time schedules of prices and dispatch. For two common bidding mechanisms, based on quantity and price, we identify a notion of alignment between participants' behavior and planners' goals that leads to a saddle-based design of the market that guarantees convergence to a point meeting all operational constraints. We further explore cases where this alignment property does not hold and observe that misaligned participants' bidding can destabilize the closed-loop system.  We thus design a regularized version of the market dynamics that recovers all the desirable stability and steady-state performance guarantees. Numerical tests validate our results on the IEEE 39-bus system.},
  author = {You, Pengcheng and Jiang, Yan and Yeung, Enoch and Gayme, Dennice and Mallada, Enrique},
  bdsk-url-3 = {https://mallada.ece.jhu.edu/pubs/2024-TAC-YJYGM.pdf},
  doi = {10.1109/TAC.2025.3589447},
  grants = {CPS-2136324; Global-Centers-2330450},
  journal = {IEEE Transactions on Automatic Control},
  month = {10},
  number = {10},
  pages = {6815-6830},
  record = {published Oct 2025, accepted Aug 2024, revised Dec 2023, submitted Dec 2021},
  title = {On the Stability, Economic Efficiency and Incentive Compatibility of Electricity Market Dynamics},
  url = {https://mallada.ece.jhu.edu/pubs/2025-TAC-YJYGM.pdf},
  volume = {70},
  year = {2025}
}

Roy defended his dissertation

Roy Siegelmann, an AMS Ph.D. student in our lab, defended his dissertation entitled “Data-Driven Analysis and Control of Dynamical Systems via Recurrent Lyapunov Functions” on Monday, August 18th. Congratulations, Dr. Siegelmann!

1 paper accepted to CDC

Our paper on recurrent control barrier functions [1] has been accepted to the 64th IEEE Conference on Decision and Control. Congrats Jixian!

[1] J. Liu and E. Mallada, “Recurrent Control Barrier Functions: A Path Towards Nonparametric Safety Verification,” in 64th IEEE Conference on Decision and Control (CDC), 2025.

Ensuring the safety of complex dynamical systems often relies on Hamilton-Jacobi (HJ) Reachability Analysis or Control Barrier Functions (CBFs). Both methods require computing a function that characterizes a safe set that can be made (control) invariant. However, the computational burden of solving high-dimensional partial differential equations (for HJ Reachability) or large-scale semidefinite programs (for CBFs) makes finding such functions challenging. In this paper, we introduce the notion of Recurrent Control Barrier Functions (RCBFs), a novel class of CBFs that leverages a recurrent property of the trajectories, i.e., coming back to a safe set, for safety verification. Under mild assumptions, we show that the RCBF condition holds for the signed-distance function, turning function design into set identification. Notably, the resulting set need not be invariant to certify safety. We further propose a data-driven nonparametric method to compute safe sets that is massively parallelizable and trades off conservativeness against computational cost.

@inproceedings{lm2025cdc,
  abstract = {Ensuring the safety of complex dynamical systems often relies on Hamilton-Jacobi (HJ) Reachability Analysis or Control Barrier Functions (CBFs). Both methods require computing a function that characterizes a safe set that can be made (control) invariant. However, the computational burden of solving high-dimensional partial differential equations (for HJ Reachability) or large-scale semidefinite programs (for CBFs) makes finding such functions challenging. In this paper, we introduce the notion of Recurrent Control Barrier Functions (RCBFs), a novel class of CBFs that leverages a recurrent property of the trajectories, i.e., coming back to a safe set, for safety verification. Under mild assumptions, we show that the RCBF condition holds for the signed-distance function, turning function design into set identification. Notably, the resulting set need not be invariant to certify safety. We further propose a data-driven nonparametric method to compute safe sets that is massively parallelizable and trades off conservativeness against computational cost.},
  author = {Liu, Jixian and Mallada, Enrique},
  booktitle = {64th IEEE Conference on Decision and Control (CDC)},
  doi = {10.1109/CDC57313.2025.11312572},
  grants = {CPS-2136324; Global-Centers-2330450},
  month = {12},
  record = {presented Dec. 2025, accepted Jul. 2025, submitted Mar. 2025},
  title = {Recurrent Control Barrier Functions: A Path Towards Nonparametric Safety Verification},
  url = {https://mallada.ece.jhu.edu/pubs/2025-CDC-LM.pdf},
  year = {2025}
}

1 paper published in TMLR

Our paper on a local Polyak-Łojasiewicz condition and descent lemma of gradient descent for overparametrized linear models [1] has been published in Transactions on Machine Learning Research. Congrats Ziqing!

[1] Z. Xu, H. Min, S. Tarmoun, E. Mallada, and R. Vidal, “A Local Polyak-Łojasiewicz and Descent Lemma of Gradient Descent For Overparametrized Linear Models,” Transactions on Machine Learning Research (TMLR), 2025.
@article{xmtmv2025tmlr,
  author = {Xu, Ziqing and Min, Hancheng and Tarmoun, Salma and Mallada, Enrique and Vidal, Rene},
  grants = {Global-Centers-2330450},
  issn = {2835-8856},
  journal = {Transactions on Machine Learning Research (TMLR)},
  month = {5},
  record = {accepted May 2025, submitted Feb 2025},
  title = {A Local Polyak-Łojasiewicz and Descent Lemma of Gradient Descent For Overparametrized Linear Models},
  url = {https://mallada.ece.jhu.edu/pubs/2025-TMLR-XMTMV.pdf},
  year = {2025}
}

1 paper accepted to RLC

Our paper on nonparametric policy improvement in continuous action spaces via expert demonstrations [1] has been accepted to the Reinforcement Learning Conference. Congrats Agustin!

[1] A. Castellano, S. Rezaei, J. Markowitz, and E. Mallada, “Nonparametric Policy Improvement in Continuous Action Spaces via Expert Demonstrations,” in Reinforcement Learning Conference, 2025, pp. 1158-1179.

The policy improvement theorem is a fundamental building block of classical reinforcement learning for discrete action spaces. Unfortunately, the lack of an analogous result for continuous action spaces with function approximation has historically limited theoretical guarantees of policy optimization algorithms, undermining their reliability. Here, we introduce a novel nonparametric policy that relies purely on data to take actions and that admits a policy improvement theorem for deterministic Markov Decision Processes (MDPs). By imposing mild regularity assumptions on the optimal policy, we show that, when data come from expert demonstrations, one can construct a nonparametric lower bound on the value of the policy, thus enabling its robust evaluation. The constructed lower bound naturally leads to a simple improvement mechanism based on adding more demonstrations. We also provide conditions to identify regions of the state space where additional demonstrations are needed to meet specific performance goals. Finally, we propose a policy optimization algorithm that ensures a monotonic improvement of the lower bound and leads to high probability performance guarantees. These contributions provide a foundational step toward establishing a rigorous framework for policy improvement in continuous action spaces.

@inproceedings{crmm2025rlc,
  abstract = {The policy improvement theorem is a fundamental building block of classical reinforcement learning for discrete action spaces. Unfortunately, the lack of an analogous result for continuous action spaces with function approximation has historically limited theoretical guarantees of policy optimization algorithms, undermining their reliability. Here, we introduce a novel nonparametric policy that relies purely on data to take actions and that admits a 
policy improvement theorem for deterministic Markov Decision Processes (MDPs). By imposing mild regularity assumptions on the optimal policy, we show that, when data come from expert demonstrations, one can construct a nonparametric lower bound on the value of the policy, thus enabling its robust evaluation. The constructed lower bound naturally leads to a simple improvement mechanism based on adding more demonstrations. We also provide conditions to identify regions of the state space where additional demonstrations are needed to meet specific performance goals. Finally, we propose a policy optimization algorithm that ensures a monotonic improvement of the lower bound and leads to high probability performance guarantees. These contributions provide a foundational step toward establishing a rigorous framework for policy improvement in continuous action spaces.},
  author = {Castellano, Agustin and Rezaei, Sohrab and Markowitz, Jared and Mallada, Enrique},
  booktitle = {Reinforcement Learning Conference},
  month = {8},
  pages = {1158-1179},
  record = {presented Aug. 2025, accepted May 2025, submitted Feb. 2025},
  title = {Nonparametric Policy Improvement in Continuous Action Spaces via Expert Demonstrations},
  url = {https://mallada.ece.jhu.edu/pubs/2025-RLC-CRMM.pdf},
  year = {2025}
}