One of the most successful recipes in robot learning is to initialize the robot with a *good policy* via imitation learning and then refine the result with reinforcement learning. This recipe rests on a strong assumption: the human must be capable of providing the initial demonstration. The assumption holds well for certain quasi-static tasks (grasping, reaching, etc.), or when the robot can generate accelerations close to those of a human (very few robots can). Often, however, it does not hold.
For example, how would you teach a limited-torque robot arm to swing up a heavy ball and solve the cup-and-ball game? The robot cannot launch the ball vertically, as most of us would, because it lacks the torque to generate large upward accelerations. The solution is to swing the cup sideways to build up momentum. But how many swings are required? One? Two? … Five?
Here, we investigate how the human and robot can solve a task together when the human also lacks the intuition to provide a single initial demonstration. We assume the robot starts with a blank policy and use policy search to give the robot local exploration noise, combined with human feedback in the action space. In a project led by Carlos Celemin (University of Chile), in collaboration with Jens Kober (TU Delft), we proposed a method that combines COrrective Advice Communicated by Humans (COACH) [pdf][BiBTeX] with policy search. The combination optimizes a movement primitive that is used to generate robot trajectories.
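To make the policy-search side concrete, here is a minimal, hypothetical sketch of episodic policy search over the weights of a movement primitive. The primitive, the RBF features, the reward-weighted update, and the `correction` hook for action-space human advice are all illustrative assumptions, not the exact method from the paper:

```python
import numpy as np

def rbf_features(t, n_basis=5):
    """Normalized radial basis functions over phase t in [0, 1]
    (a common movement-primitive parameterization; assumed here)."""
    centers = np.linspace(0.0, 1.0, n_basis)
    width = 1.0 / n_basis
    phi = np.exp(-0.5 * ((t - centers) / width) ** 2)
    return phi / phi.sum()

def rollout(w, noise_std=0.05, correction=None, n_steps=50, rng=None):
    """One trajectory from weights w with Gaussian exploration noise.
    `correction` is an optional per-step action-space offset standing in
    for human corrective advice."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.normal(0.0, noise_std, size=w.shape)  # local exploration noise
    traj = []
    for k in range(n_steps):
        t = k / (n_steps - 1)
        a = rbf_features(t, len(w)) @ (w + eps)
        if correction is not None:
            a += correction[k]                      # human advice in action space
        traj.append(a)
    return np.array(traj), eps

def policy_search_step(w, rewards, epsilons):
    """Reward-weighted averaging of exploration noises, in the spirit of
    episodic policy search; rewards are standardized before weighting."""
    z = (rewards - rewards.mean()) / (rewards.std() + 1e-9)
    weights = np.exp(z)
    weights /= weights.sum()
    return w + sum(p * e for p, e in zip(weights, epsilons))
```

When no human feedback is present (`correction=None`), this is plain autonomous policy search; advice simply biases the executed trajectory, which in turn shifts which exploration noises are rewarded.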
COACH accepts human feedback that is purely qualitative (“go up”, “go left”), so even non-expert users can train the robot. Under the hood, the method models the human feedback to predict the magnitude of the advice. Moreover, feedback can be given at any time during training: at every roll-out, once in a while, or not at all. In the last case, the process reduces to pure autonomous learning.
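The idea of turning binary advice into a scaled correction can be sketched roughly as follows. This is a simplified illustration, not COACH's actual update rules: a small linear "human model" is trained to predict the advice signal, and the predictability of the advice in a region scales the size of the policy correction there. All names and constants are assumptions:

```python
import numpy as np

class CoachSketch:
    """Toy model of learning from binary corrective advice h in {-1, +1}."""

    def __init__(self, n_features, e=0.5, alpha=0.2, beta=0.1):
        self.w = np.zeros(n_features)  # policy weights: action(s) = w . phi(s)
        self.v = np.zeros(n_features)  # human-model weights: H(s) = v . phi(s)
        self.e = e                     # base correction magnitude (assumed)
        self.alpha = alpha             # human-model learning rate (assumed)
        self.beta = beta               # policy learning rate (assumed)

    def action(self, phi):
        return self.w @ phi

    def advise(self, phi, h):
        """Incorporate one piece of advice h at state features phi."""
        # Train the human model toward the observed advice.
        self.v += self.alpha * (h - self.v @ phi) * phi
        # Consistent advice in this region -> larger predicted magnitude.
        magnitude = self.e * (1.0 + abs(self.v @ phi))
        # Nudge the policy in the advised direction, scaled by that magnitude.
        self.w += self.beta * h * magnitude * phi
```

For example, repeatedly hearing "go up" (+1) in the same region makes the human model confident there, so later corrections in that region are larger, while regions that never receive advice are left untouched.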
Preliminary results and a brief overview of the method can be found in this poster presented at IROS 2017 [pdf].