Periodic agent-state based Q-learning for POMDPs
Informal Systems Seminar (ISS), Centre for Intelligent Machines (CIM) and Groupe d'Etudes et de Recherche en Analyse des Decisions (GERAD)
Speaker: Amit Sinha
** Note that this is a hybrid event.
** This seminar will be projected at McConnell 437 at McGill University.
Meeting ID: 845 1388 1004
Passcode: VISS
Abstract: The traditional approach to POMDPs is to convert them into fully observed MDPs by treating the belief state as an information state. However, a belief-state based approach requires perfect knowledge of the system dynamics and is therefore not applicable in the learning setting, where the system model is unknown. Various approaches to circumvent this limitation have been proposed in the literature. A unified treatment of these approaches involves considering the "agent state", which is a model-free, recursively updatable function of the observation history. Examples of an agent state include frame stacking and recurrent neural networks. Because the agent state is model-free, it can be used to adapt standard RL algorithms to POMDPs. However, standard RL algorithms like Q-learning learn a deterministic stationary policy. Since the agent state is not an information state, the standard results for MDPs do not carry over directly, so we must first consider how the different policy classes compare: stationary/non-stationary and deterministic/stochastic. Our main thesis, which we illustrate via examples, is that because the agent state is not an information state, non-stationary agent-state based policies can outperform stationary ones. To leverage this feature, we propose PASQL (periodic agent-state based Q-learning), a variant of agent-state based Q-learning that learns periodic policies. By combining ideas from periodic Markov chains and stochastic approximation, we rigorously establish that PASQL converges to a cyclic limit and characterize the approximation error of the converged periodic policy. Finally, we present a numerical experiment to highlight the salient features of PASQL and demonstrate the benefit of learning periodic policies over stationary policies.
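To make the idea of a periodic policy concrete, here is a minimal tabular sketch of a periodic agent-state Q-learning step. It is an illustration under stated assumptions, not the speaker's implementation: the agent state is assumed to be a finite index `z`, one Q-table is kept per phase `t mod period`, and each update bootstraps from the table of the next phase. The names `make_q_tables`, `pasql_update`, and `greedy_periodic_policy`, as well as the step size `alpha` and discount `gamma`, are hypothetical.

```python
def make_q_tables(period, n_agent_states, n_actions):
    """One Q-table per phase, indexed as q[phase][agent_state][action]."""
    return [[[0.0] * n_actions for _ in range(n_agent_states)]
            for _ in range(period)]


def pasql_update(q, t, z, a, reward, z_next, alpha=0.1, gamma=0.9):
    """Apply one Q-learning step to the table of the current phase.

    Assumption: the bootstrap target uses the table of the NEXT phase,
    which is what makes the learned policy periodic rather than stationary.
    """
    period = len(q)
    phase = t % period
    next_phase = (t + 1) % period
    target = reward + gamma * max(q[next_phase][z_next])
    q[phase][z][a] += alpha * (target - q[phase][z][a])


def greedy_periodic_policy(q):
    """Return policy[phase][agent_state] = greedy action for that phase."""
    return [[max(range(len(row)), key=row.__getitem__) for row in table]
            for table in q]


# Usage: two phases, three agent states, two actions.
q = make_q_tables(period=2, n_agent_states=3, n_actions=2)
pasql_update(q, t=0, z=1, a=0, reward=1.0, z_next=2, alpha=0.5, gamma=0.9)
pasql_update(q, t=1, z=2, a=1, reward=0.0, z_next=1, alpha=0.5, gamma=0.9)
policy = greedy_periodic_policy(q)
```

With `period=1` the sketch reduces to ordinary stationary agent-state Q-learning, which is one way to compare the two policy classes on the same problem.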
Affiliation: Amit Sinha is a PhD candidate in the Department of Electrical and Computer Engineering, McGill University.