AI Seminar Series - Daniil Tiapkin | Technology Innovation Institute

Daniil Tiapkin

PhD student, École Polytechnique

27th March 2024, 10:00am - 11:00am (GST)

Title:	Demonstration-Regularized RL
Venue:	TII Auditorium, Yas Arcade, Abu Dhabi
Abstract:	Incorporating expert demonstrations has empirically helped to improve the sample efficiency of reinforcement learning (RL). This paper quantifies theoretically to what extent this extra information reduces RL's sample complexity. In particular, we study demonstration-regularized reinforcement learning, which leverages expert demonstrations through KL-regularization toward a policy learned by behavior cloning. Our findings reveal that using N expert demonstrations enables the identification of an optimal policy at a sample complexity of order O(Poly(S,A,H)/(N epsilon^2)) in finite and O(Poly(d,H)/(N epsilon^2)) in linear Markov decision processes, where epsilon is the target precision, H the horizon, A the number of actions, S the number of states in the finite case, and d the dimension of the feature space in the linear case. As a by-product, we provide tight convergence guarantees for the behavior cloning procedure under general assumptions on the policy classes. Additionally, we establish that demonstration-regularized methods are provably efficient for reinforcement learning from human feedback (RLHF). In this respect, we provide theoretical evidence showing the benefits of KL-regularization for RLHF in tabular and linear MDPs. Interestingly, we avoid pessimism injection by employing computationally feasible regularization to handle reward estimation uncertainty, which sets our approach apart from prior work.
Bio:	Daniil is a first-year PhD student at École Polytechnique and Université Paris-Saclay, advised by Eric Moulines and Gilles Stoltz. Prior to that, he received his BSc and MSc from HSE University, where he started working in collaboration with Michal Valko and Pierre Ménard. His research interests focus on reinforcement learning, in particular randomized exploration, regularization, the intersection of RL with sampling, and reinforcement learning from human feedback.
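
To make the KL-regularization mentioned in the abstract concrete, here is a minimal single-state sketch in Python. It is not the algorithm from the talk: it only illustrates the standard closed form for maximizing expected value minus a KL penalty toward a reference policy, where the maximizer is proportional to bc_policy(a) * exp(q(a) / lam). The function names, the toy numbers, and the regularization weight `lam` are all assumptions made for illustration.

```python
import math


def kl_regularized_policy(q_values, bc_policy, lam):
    """Maximizer of sum_a pi(a) * q(a) - lam * KL(pi || bc_policy) for one state.

    Hypothetical helper: the closed form is pi(a) proportional to
    bc_policy(a) * exp(q(a) / lam).
    """
    # Subtract the max value before exponentiating for numerical stability;
    # the shift cancels out in the normalization.
    q_max = max(q_values)
    weights = [p * math.exp((q - q_max) / lam) for q, p in zip(q_values, bc_policy)]
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy inputs (assumed): value estimates for three actions and a
# behavior-cloning policy fitted to expert demonstrations.
q_values = [1.0, 0.5, 0.0]
bc_policy = [0.2, 0.7, 0.1]

# Strong regularization keeps the policy close to the behavior-cloning policy.
strong = kl_regularized_policy(q_values, bc_policy, lam=100.0)
# Weak regularization lets the policy concentrate on the highest-value action.
weak = kl_regularized_policy(q_values, bc_policy, lam=0.01)

print("strong regularization:", strong)
print("weak regularization:  ", weak)
```

In this sketch, a large `lam` trusts the demonstrations and a small `lam` trusts the value estimates; how the weight should be chosen as a function of N and epsilon is a question for the talk itself and is not reproduced here.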