EE ZOOM Seminar: Generalization in Reinforcement Learning via Structural Priors

02 July 2025, 16:00
Zoom Seminar

https://tau-ac-il.zoom.us/j/84875921874

Electrical Engineering Systems Seminar

 

Speaker: Maayan Shalom

M.Sc. student under the supervision of Dr. Alon Cohen

 

Wednesday, 2 July 2025, at 16:00

 

Generalization in Reinforcement Learning via Structural Priors

Abstract

Generalization is a central challenge in reinforcement learning (RL) applications where an agent must succeed across many possible environments, not merely the handful it encountered during training. We formalize this challenge by assuming that, before each episode, Nature draws an unknown Markov Decision Process (MDP) from a fixed—yet hidden—distribution, and the agent must learn, from a finite training sample of such MDPs, a policy whose expected return over the entire distribution is near-optimal.
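In symbols (a minimal formalization sketch; the notation $\mathcal{D}$ for the hidden distribution, $V^{\pi}_{M}$ for the expected return of policy $\pi$ in MDP $M$, and the sample size $m$ are introduced here for illustration and do not appear in the abstract itself):

\[
J(\pi) \;=\; \mathbb{E}_{M \sim \mathcal{D}}\!\left[ V^{\pi}_{M} \right],
\qquad
M_1, \dots, M_m \overset{\text{i.i.d.}}{\sim} \mathcal{D}
\ \text{ observed in training},
\]

and the goal is to output a policy $\hat{\pi}$ for which $J(\pi^{\star}) - J(\hat{\pi})$ is small, where $\pi^{\star}$ maximizes $J$.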

Earlier theory has shown that this problem is intractable in the worst case: partial observability of the true MDP identity induces an Epistemic Partially Observable MDP (Epistemic-POMDP), whose sample complexity can grow exponentially with the planning horizon. While positive results do exist, they typically rely on regularized learning objectives or strong Bayesian priors.

In this thesis, we revisit generalization through two natural structural lenses that make the problem tractable without resorting to explicit regularization. The first is a uniform similarity assumption, where every pair of MDPs induces statistically similar trajectory distributions under any policy. In this setting, we show that plain Empirical Risk Minimization (ERM) achieves a generalization error of O(1/m), where m is the number of training environments. This improves over the best known O(1/m^{1/4}) rate for regularized ERM and highlights how trajectory-level similarity implicitly curbs hypothesis-class complexity. The second is a decodability assumption, where a short trajectory prefix uniquely reveals the identity of the underlying MDP. We show that in this case, ERM again enjoys the same O(1/m) sample complexity. Our analysis constructs truncated policies that depend on history only until the MDP is identified, and then act optimally according to the identified model.
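As a rough sketch of the two objects discussed above (the notation is ours, introduced only for illustration and not taken from the abstract): plain ERM selects the policy maximizing average return over the m sampled training MDPs,

\[
\hat{\pi} \;\in\; \arg\max_{\pi \in \Pi} \; \frac{1}{m} \sum_{i=1}^{m} V^{\pi}_{M_i},
\]

and under decodability a truncated policy can be thought of as depending on the history $h_t$ only up to some prefix length $H_0$ by which the MDP is identified, and thereafter playing the optimal policy of the decoded model $\widehat{M}$:

\[
\bar{\pi}(a \mid h_t) \;=\;
\begin{cases}
\pi_0(a \mid h_t), & t \le H_0,\\
\pi^{\star}_{\widehat{M}(h_{H_0})}(a \mid s_t), & t > H_0.
\end{cases}
\]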

Together, these results provide new foundations for learning under epistemic uncertainty. They delineate precise conditions under which simple empirical learning suffices, quantify the role of environment structure in determining sample complexity, and offer guidance for the design of agents that must generalize reliably in practice.

 

Attendance credit for the seminar will be granted based on writing your full name + ID number in the chat.

 

 
