EE ZOOM Seminar: Generalization in Reinforcement Learning via Structural Priors
https://tau-ac-il.zoom.us/j/84875921874
Electrical Engineering Systems Seminar
Speaker: Maayan Shalom
M.Sc. student under the supervision of Dr. Alon Cohen
Monday, 2nd July 2025, at 16:00
Generalization in Reinforcement Learning via Structural Priors
Abstract
Generalization is a central challenge in reinforcement learning (RL) applications where an agent must succeed across many possible environments, not merely the handful it encountered during training. We formalize this challenge by assuming that, before each episode, Nature draws an unknown Markov Decision Process (MDP) from a fixed—yet hidden—distribution, and the agent must learn, from a finite training sample of such MDPs, a policy whose expected return over the entire distribution is near-optimal.
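In symbols (an illustrative formalization; the notation here is ours, not necessarily the talk's): the learner observes training MDPs M_1, ..., M_m drawn i.i.d. from the hidden distribution D and outputs a policy π̂ satisfying

    E_{M ~ D}[ V_M^{π̂} ]  ≥  max_π E_{M ~ D}[ V_M^{π} ] − ε,

where V_M^{π} denotes the expected return of policy π in MDP M and ε is the generalization error.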
Earlier theory has shown that this problem is intractable in the worst case: partial observability of the true MDP identity induces an Epistemic Partially Observable MDP (Epistemic-POMDP), whose sample complexity can grow exponentially with the planning horizon. While positive results do exist, they typically rely on regularized learning objectives or strong Bayesian priors.
In this thesis, we revisit generalization through two natural structural lenses that make the problem tractable without resorting to explicit regularization. The first is a uniform similarity assumption, where every pair of MDPs induces statistically similar trajectory distributions under any policy. In this setting, we show that plain Empirical Risk Minimization (ERM) achieves a generalization error of O(1/m), where m is the number of training environments. This improves over the best known O(1/m^{1/4}) rate for regularized ERM and highlights how trajectory-level similarity implicitly curbs hypothesis-class complexity. The second is a decodability assumption, where a short trajectory prefix uniquely reveals the identity of the underlying MDP. We show that in this case, ERM again enjoys the same O(1/m) sample complexity. Our analysis constructs truncated policies that depend on history only until the MDP is identified, and then act optimally according to the identified model.
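As a rough illustration of the plain ERM procedure analyzed above (a minimal sketch; the names policy_class, training_mdps, and rollout_return are placeholders, not the thesis's construction):

    import statistics

    def erm_policy(policy_class, training_mdps, rollout_return, n_rollouts=100):
        # Plain ERM: return the policy whose average Monte-Carlo return,
        # estimated over the m training MDPs, is highest. No regularization.
        def empirical_return(policy):
            return statistics.mean(
                statistics.mean(rollout_return(policy, mdp) for _ in range(n_rollouts))
                for mdp in training_mdps
            )
        return max(policy_class, key=empirical_return)

Under the similarity or decodability assumptions above, the expected suboptimality of the returned policy over the full environment distribution scales as O(1/m).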
Together, these results provide new foundations for learning under epistemic uncertainty. They delineate precise conditions under which simple empirical learning suffices, quantify the role of environment structure in determining sample complexity, and offer guidance for the design of agents that must generalize reliably in practice.
Participation in the seminar grants attendance credit, based on registering your full name + ID number in the chat.