Sample-efficient multi-task and multi-objective reinforcement learning by combining multiple behaviors

When solving sequential decision-making problems, humans exhibit different behaviors depending on the problem (or task) they face at a given moment. One of the main challenges in artificial intelligence, and in reinforcement learning (RL) in particular, is the development of generalist and flexible agents capable of solving multiple tasks, each of which may require the agent to learn a new, specialized behavior. Importantly, tackling this challenge requires agents to learn behaviors that may involve optimizing a single objective or trading off between multiple conflicting objectives. We argue that many important real-world tasks are naturally defined by multiple objectives which, when prioritized differently, may require the agent to adapt its behavior accordingly. In this thesis, we study how to design flexible RL agents that can, in a sample-efficient manner, adapt their behavior to solve any given task, where each task is defined by multiple (possibly conflicting) objectives. The main hypothesis of this thesis is that it is possible to meaningfully combine insights from two apparently disparate sub-fields of machine learning, multi-objective RL and multi-task RL, to design novel and principled techniques that address this problem. In particular, such insights arise from the fact that both fields typically deal with problems where an agent needs to learn multiple behaviors (policies). We introduce new multi-policy methods that enable RL agents to (i) carefully learn multiple behaviors, each specialized in a different task or in tasks in which the agent assigns different priorities (or preferences) to each of its objectives; and (ii) combine previously learned behaviors to efficiently identify solutions to novel tasks.
The methods we investigate and introduce come with important theoretical guarantees regarding the optimality of the set of behaviors they identify and their ability to solve new tasks in a zero-shot manner, even in the presence of function approximation errors. We evaluate the proposed methods on a range of challenging multi-task and multi-objective RL problems and show that our algorithms outperform several current state-of-the-art methods in domains with both discrete and continuous state and action spaces. ...

Abstract (translated from Portuguese)

When solving sequential decision-making problems, human beings exhibit different behaviors depending on the problem (or task) they must solve at a given moment. One of the main challenges in artificial intelligence, and in reinforcement learning (RL) in particular, is the development of generalist and flexible agents that are capable of solving multiple tasks or problems, each requiring the agent to learn a potentially new and specialized behavior. Overcoming this challenge requires agents to learn behaviors that may involve optimizing a single objective or making trade-offs between multiple conflicting objectives. We argue that many important real-world tasks are naturally defined by multiple objectives which, when prioritized differently, may require the agent to adapt its behavior. In this thesis, we study the problem of how to design flexible RL agents that can adapt their behavior to solve any task, each defined by multiple possibly conflicting objectives, in a manner that is efficient in terms of the number of interactions with the environment. The main hypothesis of this thesis is that it is possible to combine ideas from two apparently disparate sub-fields of machine learning, multi-objective RL and multi-task RL, to design new techniques with theoretical guarantees for solving the problem discussed above. In particular, this combination is possible because both fields deal with problems where the agent needs to learn multiple behaviors (policies). We introduce new multi-policy methods that enable agents to (i) carefully learn multiple behaviors, each specialized in a different task or in tasks in which an agent assigns different preferences to each of its objectives; and (ii) combine previously learned behaviors to efficiently identify solutions to new tasks.
The methods we investigate and introduce have important theoretical guarantees regarding the optimality of the set of behaviors identified and their ability to solve new tasks in a zero-shot manner, even in the presence of function approximation errors. We evaluate the proposed methods on several challenging multi-task and multi-objective RL problems and show that our algorithms outperform several state-of-the-art methods in domains with discrete or continuous state and action spaces. ...

Institution

Universidade Federal do Rio Grande do Sul. Instituto de Informática. Programa de Pós-Graduação em Computação.

Collections

Exact and Earth Sciences (5196)

Computing (1786)

This item is licensed under a Creative Commons License