Reinforcement Learning for Robotics
Reinforcement learning (RL) gained popularity after the success of agents that play Atari games at a superhuman level. Applying RL to intelligent decision-making problems in robotics is key to building smarter robots. In this post, we briefly discuss how an MDP can be defined for a robot and talk about deep reinforcement learning and its shortcomings.
Markov Decision Process (MDP)
Reinforcement learning agents learn actions from scratch through their interactions with the environment. There are four basic components of a reinforcement learning system: states, actions, rewards and a policy. The state in RL is defined by the current configuration of the robot. For example, the current sensor readings, such as motor encoder values or the position and orientation of the robot's centre of mass, can be used to define the state space. The actions comprise the possible motions of the robot in response to its interaction with the environment; the action space can include the desired motor angles of a legged robot's legs or the desired joint angles of a manipulator. The reward function is a scalar signal that the robot seeks to maximize. Since the main goal in RL is to maximize total reward, researchers design the reward function so that desired behaviours are rewarded heavily and undesired outcomes are punished with negative rewards. In this way, the robot learns the desired behaviour and meets human expectations as closely as possible.
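To make these components concrete, here is a minimal sketch, using plain NumPy and not tied to any particular robot or library, of how a state vector and a reward signal might be assembled for a small legged robot. Names such as get_state, forward_velocity and energy_penalty_weight are illustrative assumptions, not a standard API.

```python
import numpy as np

# State: concatenate the robot's sensor readings into one vector.
def get_state(joint_angles, joint_velocities, base_orientation):
    """Stack sensor readings (angles, velocities, base orientation) into a state vector."""
    return np.concatenate([joint_angles, joint_velocities, base_orientation])

# Reward: encourage forward progress, penalise energy use (weights are hypothetical).
def reward(forward_velocity, joint_torques, energy_penalty_weight=0.01):
    """Scalar reward: forward motion is good, large torques cost energy."""
    return forward_velocity - energy_penalty_weight * np.sum(np.square(joint_torques))

# Example: a 4-joint robot standing still with unit torques gets a small negative reward.
state = get_state(np.zeros(4), np.zeros(4), np.array([0.0, 0.0, 0.0, 1.0]))
print(state.shape)                                          # (12,)
print(reward(forward_velocity=0.0, joint_torques=np.ones(4)))  # -0.04
```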
The states, actions and rewards define a Markov Decision Process (MDP), which is solved using various algorithms. A solution to an MDP is called a policy. A policy takes the current state as input and outputs the action it estimates will maximize the total reward. After an action is carried out, the agent receives a reward as feedback on the action it has taken. Because of this flexible, trial-and-error nature, reinforcement learning is well suited to decision-making tasks. Apart from these three components, a transition function is optionally added to the MDP formulation. The transition function models the stochasticity of the environment the agent is acting in: each action leads to a given next state only with some probability. In contrast, in a deterministic environment all actions are carried out with full certainty (probability of 1). Further, if the state of the robot is not fully observable, the policy instead receives observations and the agent maintains a belief, a probability distribution over states, which turns the MDP into a Partially Observable Markov Decision Process (POMDP).
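As an illustration, the following toy example (with made-up numbers, not drawn from any real robot) builds a tabular MDP with three states and two actions, a stochastic transition function P and a reward table R, and recovers a policy with value iteration:

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9

# Transition function P[s, a, s']: probability of landing in s' after taking a in s.
# Each row sums to 1; state 2 is absorbing.
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.9, 0.0]],
    [[0.0, 0.7, 0.3], [0.0, 0.2, 0.8]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
])
# Reward R[s, a]: only action 1 in state 1 pays off.
R = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 0.0]])

# Value iteration: repeatedly back up expected future reward.
V = np.zeros(n_states)
for _ in range(100):
    Q = R + gamma * (P @ V)   # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * V[s']
    V = Q.max(axis=1)

policy = Q.argmax(axis=1)     # the policy maps each state to its best action
print(policy)
```

The greedy policy read off the converged Q-values is exactly the state-to-action mapping described above: given a state index, it returns the action with the highest expected total reward.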
Deep Reinforcement Learning
Deep reinforcement learning combines RL with deep neural networks as function approximators; a landmark success was the AlphaGo agent that defeated the human world champion at Go. Deep RL has been applied to a wide range of applications, and it is most prominent in robotics, where control policies are learnt directly from scratch without human intervention or guidance. The big advantage of deep RL over hand-crafted policies is generalization: the agent can handle scenarios it has not encountered during training. Sub-branches of deep reinforcement learning such as imitation learning and behaviour cloning use human demonstrations to learn a particular task, while inverse reinforcement learning uses expert trajectories to learn a reward function. These methods can then generalize to scenarios not included in the training data as well. A major shortcoming of deep reinforcement learning, however, is the amount of data required for training: a huge number of interactions is needed before the learning curve converges, so training time is usually measured in days. Some researchers focus specifically on reducing this training time, while others focus on producing exceptional behaviour after training through careful choice of training algorithms and reward signals.
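For concreteness, here is a minimal sketch, assuming PyTorch, of the kind of neural-network policy trained in deep RL for robots: a small MLP maps the state vector to a Gaussian distribution over continuous actions (for example, desired joint angles). The layer sizes and class name are illustrative assumptions, and the training loop, which would consume the returned log-probabilities in a policy-gradient update, is omitted.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Small MLP that maps a state vector to a distribution over continuous actions."""
    def __init__(self, state_dim=12, action_dim=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, action_dim),        # mean of the action distribution
        )
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        mean = self.net(state)
        dist = torch.distributions.Normal(mean, self.log_std.exp())
        action = dist.sample()                    # stochastic action for exploration
        return action, dist.log_prob(action).sum(-1)

policy = GaussianPolicy()
state = torch.zeros(12)                           # dummy sensor reading
action, log_prob = policy(state)
print(action.shape, log_prob.item())
```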