RoboErgoSum Project

There is an intricate relationship between self-awareness and the ability to perform cognitive-level reasoning.

Learning abilities

A central issue in the project is the inclusion of a learning capacity, and more specifically of reinforcement learning. Reinforcement Learning (RL) studies how an agent can adapt its behavioral policy so as to reach a particular goal in a given environment [1]. Here, we assume this goal to be maximizing the amount of reward obtained by the agent. RL methods rely on Markov Decision Processes, which suppose that the agent is situated in a stochastic or deterministic environment, that it has a certain representation of its state (e.g. its location in the environment, the presence of stimuli or rewards, its motivational state), and that future states depend on the actions performed in the current state. The objective of the agent is thus to learn the value Q(s, a) associated with performing each possible action a in each possible state s, in terms of the amount of reward that it provides. Two main classes of RL algorithms will be implemented, showing complementary performance in different conditions: model-based and model-free RL [1]. The former involves learning the transition function of the environment, a mapping from (state, action) pairs to states, which makes it possible to plan sequences of actions from a starting state to a particular goal. While model-based RL techniques such as Dyna-Q can provide efficient decision-making and adapt quickly following changes in the goal, they produce long reaction times and are limited by the scaling problem: they do not cope well with large task domains, i.e., domains involving a large space of possible world states or a large set of possible actions. Model-free techniques such as Temporal-Difference methods (e.g. Q-learning, Actor-Critic, SARSA) learn values associated with (state, action) pairs without estimating the transition function. These techniques are thus computationally less expensive and produce shorter reaction times.
However, model-free techniques are also limited by the scaling problem: updating each action value within a long behavioral sequence can be very time-consuming.
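
The contrast between the two classes can be illustrated with a minimal tabular sketch. It is not the project's implementation: the four-state chain environment, the parameter values (alpha, gamma) and all function names are assumptions chosen for illustration, and the planning routine is a simplified Dyna-Q-style replay over a deterministic learned model.

```python
import random

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One model-free temporal-difference (Q-learning) update of Q[s][a]."""
    td_target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q[s][a]

def dyna_q_planning(Q, model, n_steps, alpha=0.1, gamma=0.9, rng=random):
    """Model-based part (Dyna-Q style): replay transitions stored in a learned model."""
    experienced = list(model.keys())
    for _ in range(n_steps):
        s, a = rng.choice(experienced)
        r, s_next = model[(s, a)]
        q_learning_update(Q, s, a, r, s_next, alpha, gamma)

# Hypothetical toy task: states 0..3 on a chain; action 1 moves right,
# action 0 stays in place; reward 1 is given when the goal state is reached.
N_STATES, N_ACTIONS, GOAL = 4, 2, 3

def step(s, a):
    s_next = min(s + 1, GOAL) if a == 1 else s
    r = 1.0 if (s_next == GOAL and s != GOAL) else 0.0
    return r, s_next

def run(n_episodes=50, planning_steps=0, seed=0):
    """Learn Q with random exploration; planning_steps=0 gives pure model-free RL."""
    rng = random.Random(seed)
    Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    model = {}  # learned deterministic transition model: (s, a) -> (r, s_next)
    for _ in range(n_episodes):
        s = 0
        while s != GOAL:
            a = rng.randrange(N_ACTIONS)
            r, s_next = step(s, a)
            q_learning_update(Q, s, a, r, s_next)        # learn from real experience
            model[(s, a)] = (r, s_next)                  # update the model
            dyna_q_planning(Q, model, planning_steps, rng=rng)  # simulated experience
            s = s_next
    return Q

if __name__ == "__main__":
    print("model-free only:", run(n_episodes=3, planning_steps=0)[0])
    print("with planning  :", run(n_episodes=3, planning_steps=50)[0])
```

In this sketch, planning propagates the reward back to the start state within a few episodes, at the cost of many extra updates per real step, which mirrors the trade-off between quick adaptation and reaction time described above.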

A first solution that we want to investigate to address these limitations consists in combining model-free and model-based RL within the same cognitive architecture so as to benefit from the advantages of both. As mentioned above, recent computational neuroscience work provides inspiration from the way these learning processes are coordinated within the mammalian brain [2][3][4][5]. Peter Dayan and colleagues have proposed a computational model combining model-based and model-free reinforcement learning [4]. The system selects the controller (model-based or model-free) that currently has the lowest uncertainty in its reward estimation, and lets it take over behavior. A robotic implementation inspired by this model has already been tested in the spatial navigation domain [6][7][8]. However, the model is difficult to extend to large-scale real-world applications because the complexity of estimating the uncertainty of the model-based controller grows exponentially, so that this uncertainty becomes slow and difficult to estimate after the initial learning phase. An extension of the model of Peter Dayan and colleagues has been proposed [9], which avoids estimating the uncertainty of the model-based controller. This extension successfully reproduced the behavioral performance of rats in various instrumental conditioning tasks. We thus plan to test this extension on one of the robotic platforms of the project (e.g. iCub or PR2).
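
The uncertainty-based selection rule can be sketched as follows. This is a deliberately simplified illustration, not the model of [4] or its extension [9]: each controller is assumed to report a single scalar uncertainty alongside its action values, and the names Estimate and arbitrate are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Estimate:
    """Output of one controller for the current state (hypothetical structure)."""
    values: List[float]   # estimated value of each available action
    uncertainty: float    # scalar uncertainty of these estimates (assumption)

def arbitrate(estimates: Dict[str, Estimate]) -> Tuple[str, int]:
    """Return (controller name, greedy action) for the least uncertain controller."""
    name = min(estimates, key=lambda k: estimates[k].uncertainty)
    values = estimates[name].values
    action = max(range(len(values)), key=values.__getitem__)
    return name, action

if __name__ == "__main__":
    # Early in learning: the model-free values are still unreliable.
    early = {
        "model_based": Estimate(values=[0.2, 0.8], uncertainty=0.1),
        "model_free": Estimate(values=[0.5, 0.4], uncertainty=0.6),
    }
    # After extensive training: the model-free values have become reliable.
    late = {
        "model_based": Estimate(values=[0.2, 0.8], uncertainty=0.1),
        "model_free": Estimate(values=[0.1, 0.9], uncertainty=0.05),
    }
    print(arbitrate(early))
    print(arbitrate(late))
```

The numbers in the usage example are invented; they only show control passing from one controller to the other as their relative uncertainties change.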

Another essential issue in this context is self and environment monitoring. In this task, we plan to add a monitoring system to coordinate the multiple learning modules described above. The idea is to enable the robot to perform cognitive control: to detect when the environment changes, when its own performance changes, or when there is a conflict between goals, in order to adjust its current behavioral policy. Such monitoring functions will enable the robot to flexibly and efficiently adapt its behavior to the environment and to the changing constraints of the task. They will also provide crucial representations of the robot’s own actions, of their evaluations (in terms of expected consequences, uncertainty and long-term value), and of the desired goals in relation to the current state of the robot and of the environment.
This work will extend our previous implementation of a context-switching meta-controller for a navigating robot [7][8]. This controller enabled the robot to detect changes in the context of the task by recognizing that the profile of propagation of goal information in its internal model of the environment had changed. The robot could then quickly adapt to the new context by dedicating a new memory component, which learns which behavioral strategy (model-based or model-free) is the most efficient in that context. It could also quickly restore the relevant strategy when a previously experienced context was recognized. However, this meta-controller could not deal with changes in meta-goals (energy, rest, integrity) or in goals (energy resource A, energy resource B).
Gaussier and colleagues recently developed a regulatory mechanism that copes with this limitation by referring to the global notion of frustration [10]. When the robot failed to reach a particular goal, or to perform a given behavioral strategy, a frustration level increased until it exceeded a certain threshold and triggered a behavioral switch. They tested three types of switch in isolation: switching between navigation strategies (place-based navigation versus path integration); switching between goals (when the robot fails to reach food A, if there exists another resource, food B, the robot selects food B as the new goal); and switching between meta-goals (if food is unattainable, the robot gives up on food and starts to work for water). However, no generic test was attempted to have a robot decide, in a particular situation, which type of switch is the most relevant to perform.
Thus in this task, we plan to adapt the frustration mechanism to our model-based/model-free control system, and to extend it to recognize the type of switch that is relevant in different situations and contexts.
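
A frustration-driven switch of the kind described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the mechanism of [10]: the threshold, gain and decay values, the reset after a switch, and the names FrustrationMonitor and pursue are all hypothetical. Only switching between goals is shown, not the selection of the type of switch.

```python
class FrustrationMonitor:
    """Accumulates frustration on failures and signals a switch above a threshold."""

    def __init__(self, threshold=1.0, gain=0.3, decay=0.5):
        self.threshold = threshold  # level above which a switch is triggered
        self.gain = gain            # increase of the level after each failure
        self.decay = decay          # multiplicative decrease after each success
        self.level = 0.0

    def update(self, success):
        """Update the level after one attempt; return True if a switch is triggered."""
        if success:
            self.level *= self.decay
        else:
            self.level += self.gain
        if self.level > self.threshold:
            self.level = 0.0  # assumption: frustration is reset after a switch
            return True
        return False

def pursue(goals, outcomes, monitor):
    """Follow goals in order, moving to the next one each time a switch is triggered."""
    index = 0
    history = []
    for success in outcomes:
        history.append(goals[index])
        if monitor.update(success) and index + 1 < len(goals):
            index += 1
    return history

if __name__ == "__main__":
    # Four failed attempts on food A trigger a switch to food B.
    outcomes = [False, False, False, False, True]
    print(pursue(["food A", "food B"], outcomes, FrustrationMonitor()))
```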

The self-monitoring system will be extended to meta-learning mechanisms [11]. These mechanisms consist in learning to learn: the agent's evaluation of its own performance makes it possible to dynamically regulate meta-parameters of learning, such as the exploration rate, the time scale of reward expectation or the learning rate of the system [12][13][14]. We recently published a review showing that some of the prefrontal cortical functions related to cognitive control can be formalized within the meta-learning computational framework [15]. When applied to a neuro-inspired reinforcement learning controller, such meta-learning enabled an iCub humanoid robot to adaptively switch from exploration to exploitation and vice versa in a non-stationary task [16]. Here, we plan to apply the meta-learning framework to the regulation of the temporal scale of reward expectation (discount factor gamma) in our model-based/model-free control system. Several scenarios inspired by neurobiological studies will be tested: the agent will face multichoice tasks in which it needs to learn, based on the contextual stimuli presented, to select among a set of targets on a touch screen that are rewarded with various probabilities (similar to a multi-armed bandit task).
The meta-learning algorithm should enable the agent to switch between a long-term condition, where several consecutive non-rewarding or moderately punishing choices need to be performed before a large positive reward is obtained, and a short-term condition, where the appropriate target delivers an immediate small reward [17].
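
The role of the discount factor in these two conditions can be made concrete with a small worked example. The reward sequences below are invented for illustration and the function names are hypothetical. The sketch only shows why gamma is the relevant meta-parameter, not how the meta-learning algorithm would regulate it.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards discounted by gamma**t, where t is the step index."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))

# Hypothetical options over four consecutive choices:
# moderately punishing choices followed by a large reward ...
LONG_TERM = [-1.0, -1.0, -1.0, 10.0]
# ... versus a small immediate reward on every choice.
SHORT_TERM = [1.0, 1.0, 1.0, 1.0]

def preferred_option(gamma):
    """Return the option with the higher discounted return for a given gamma."""
    long_value = discounted_return(LONG_TERM, gamma)
    short_value = discounted_return(SHORT_TERM, gamma)
    return "long-term" if long_value > short_value else "short-term"

if __name__ == "__main__":
    for gamma in (0.3, 0.95):
        print(gamma, preferred_option(gamma))
```

With these numbers, a low discount factor (0.3) favors the immediate small rewards, whereas a high one (0.95) favors enduring the punishing choices to reach the large reward. An agent that regulates gamma can therefore move between the two behaviors.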


[1]	• Sutton, R.S. & Barto, A.G. Reinforcement Learning: An Introduction. MIT Press, 1998.
[2]	• Doya, K. Complementary roles of basal ganglia and cerebellum in learning and motor control. Current Opinion in Neurobiology, 10:732-739, 2000.
[3]	• Dayan, P. & Balleine, B. Reward, motivation, and reinforcement learning. Neuron, 36(2):285-298, 2002.
[4]	• Daw, N., Niv, Y. & Dayan, P. Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nature Neuroscience, 8(12):1704-1711, 2005.
[5]	• Samejima, K. & Doya, K. Multiple representations of belief states and action values in corticobasal ganglia loops. Annals of the New York Academy of Sciences, 1104:213-228, 2007.
[6]	• Dollé, L., Sheynikhovich, D., Girard, B., Chavarriaga, R. & Guillot, A. Path planning versus cue responding: a bioinspired model of switching between navigation strategies. Biological Cybernetics, 103(4):299-317, Springer Verlag, 2010.
[7]	• Caluwaerts, K., Grand, C., N'Guyen, S., Dollé, L., Guillot, A. & Khamassi, M. Design of a biologically inspired navigation system for the Psikharpax rodent robot. In International workshop on bio-inspired robots (CFP-2011), 2011.
[8]	• Caluwaerts, K., Staffa, M., N'Guyen, S., Grand, C., Dollé, L., Favre-Felix, A., Girard, B. & Khamassi, M. A biologically inspired meta-control navigation system for the Psikharpax rat robot. Bioinspiration & Biomimetics, 2012.
[9]	• Keramati, M., Dezfouli, A. & Piray, P. Speed/accuracy trade-off between the habitual and the goal-directed processes. PLoS Computational Biology, 7(5):e1002055, 2011.
[10]	• Hasson, C. & Gaussier, P. Frustration as a generical regulatory mechanism for motivated navigation. In Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ International Conference on, pages 4704-4709, 2010.
[11]	• Giraud-Carrier, C., Brazdil, P. & Vilalta, R. Introduction to the special issue on meta-learning. Machine Learning, 54(3):187-193, 2004.
[12]	• Auer, P., Cesa-Bianchi, N. & Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235-256, 2002.
[13]	• Doya, K. Metalearning and neuromodulation. Neural Networks, 15(4-6):495-506, 2002.
[14]	• Ishii, S., Yoshida, W. & Yoshimoto, J. Control of exploitation-exploration meta-parameter in reinforcement learning. Neural Networks, 15(4-6):665-687, 2002.
[15]	• Khamassi, M., Wilson, C., Rothé, R., Quilodran, R., Dominey, P. & Procyk, E. Meta-learning, cognitive control, and physiological interactions between medial and lateral prefrontal cortex. In Neural Basis of Motivational and Cognitive Control, pages 351-370, Cambridge, MA: MIT Press, 2011.
[16]	• Khamassi, M., Lallée, S., Enel, P., Procyk, E. & Dominey, P. Robot cognitive control with a neurophysiologically inspired reinforcement learning model. Frontiers in Neurorobotics, 5:1, 2011.
[17]	• Tanaka, S., Doya, K., Okada, G., Ueda, K.O. & Yamawaki, S. Prediction of immediate and future rewards differentially recruits cortico-basal ganglia loops. Nature Neuroscience, 7(8):887-893, 2004.

RoboErgoSum project is funded by an ANR grant under reference ANR-12-CORD-0030