
Navigation with a Learned Policy #46

Open

GM-Whooshi opened this issue May 25, 2021 · 4 comments

Labels
question Further information is requested

@GM-Whooshi commented May 25, 2021

Hi,

Firstly, thank you for making this amazing piece of work available. I've learned a lot from your code.

I have mainly been interested in the singleagent.py problems. I've worked with all of the different aviaries, and the agent does not seem to be able to solve any of the control problems. So for now, I have just been working on the simplest problem, hoveraviary.py, where the agent learns to move the drone from its initial conditions to the position [0, 0, 1] and continue to hover at that position. I've tried quite a few different things to get the agent to solve the problem, mostly different approaches to the reward function _computeReward() and the termination conditions _computeDone().

One of the many iterations of the reward function for hoveraviary.py, which I tried in several different forms, resembles the following:

    def _computeReward(self):

        state = self._getDroneStateVector(0)

        pos = state[0:3]
        att = state[7:10]
        vel = state[10:13]
        ang_vel = state[13:16]

        target_pos = np.array([0, 0, 1])
        pos_err = np.linalg.norm(target_pos - pos)

        att_err = np.linalg.norm(att)
        vel_err = np.linalg.norm(vel)
        ang_vel_err = np.linalg.norm(ang_vel)

        # Weightings applied to the different state errors
        W_pos_err = 1
        W_att = 0
        W_vel = 0
        W_ang_vel = 0

        reward = ( -1 * W_pos_err * pos_err ) + ( -1 * W_att * att_err ) + ( -1 * W_vel * vel_err ) + ( -1 * W_ang_vel * ang_vel_err )

        # Bonus for staying very close to the set-point
        if pos_err < 0.0001:
            reward += 1

        return reward

where W_pos_err, W_att, W_vel and W_ang_vel are weightings applied to different state errors from the desired set-point values. I've tested out a few different weightings, but none have worked out so far. In the above block of code, I just set the weightings such that the reward function more closely resembles the original. The function includes a conditional reward bonus as well if the drone is able to stay within a 0.0001 error of the set-point.

I also tried completely different approaches to the reward function. Unfortunately, none of the iterations (including the one above) worked out, even after training for over 30 million time steps. Between the 4-dimensional continuous action space, the complexity of learning the drone dynamics by exploring this massive action space, and the challenge of simultaneously learning those dynamics and using them to move the drone from one point to a completely different point in space in a stable fashion, the problem seems far too complicated to learn within a reasonable time.

So I had a simple idea of how to better guide learning, and the results were very promising:
I initialised the drone so that it would already be located at the desired set-point. The objective of the agent was now only to keep the drone at this set-point.

Of course, the intuitive reason why this works so much better is that the problem is much simpler than trying to move from point A to point B. But another way of understanding it is that the agent could now focus on learning the basic dynamics of the drone, instead of simultaneously having to learn how to manipulate those dynamics to move from one point to another.

The idea is that once the model is trained, it can be used as a basis for learning much more complex problems by continuing to train the same model.

But even with this much simpler problem, after more than 20 million time steps, the agent would only manage to get the drone to hover at the set-point for about 1 second before beginning to oscillate around and crash.

So, another change I made was to the termination conditions in _computeDone(). If the drone began to drift too far away from the desired set-point, I would terminate the episode immediately so that the agent would not spend any more time exploring a trajectory that was already a lost cause. This immensely improved the learning rate, and the agent was able to figure out how to stabilise the drone in about 5 million time steps! The termination condition function was as follows:

    def _computeDone(self):

        state = self._getDroneStateVector(0)

        curr_pos = state[0:3]
        curr_att = state[7:10]
        curr_vel = state[10:13]
        curr_ang_vel = state[13:16]

        target_pos = np.array([0, 0, 1])
        pos_err = np.linalg.norm(target_pos - curr_pos)

        done = False

        # End the episode when its nominal duration has elapsed
        if self.step_counter/self.SIM_FREQ > self.EPISODE_LEN_SEC:
            done = True

        # ... or as soon as the drone drifts too far from the set-point
        if pos_err > 0.01:
            done = True

        return done

This was something I could not do as easily with the original hover aviary problem description (point A to point B), because terminating early would create inconsistent episode rewards. Recall that the original reward function was really a "cost function" in which the agent receives penalties for increasing state errors. What would happen, at times, is that the agent would notice that if it went extremely off course from reaching point B, the episode would terminate earlier and it would accumulate fewer penalties. As a result, the agent would intentionally fly off course as soon as possible to terminate the episode and receive a smaller penalty (i.e. a higher reward). The agent is so greedy smh - it only wants one thing.
I guess the issue could be fixed by having a positive-valued reward function instead of penalties, so that the agent keeps trying to reach the set-point to collect the most reward rather than to avoid the most penalty. Another option is to give the drone a massive negative reward if the episode terminates early. These are just some of the options I briefly tried out.
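
To make the first option concrete, here is a rough, untested sketch of what such a reward could look like for the original point-A-to-point-B problem: a bounded, always non-negative reward that grows as the drone approaches the target, plus a large one-off penalty when the drone crosses the bound that would trigger early termination (the constants 2.0, 1.5 and -100 are arbitrary placeholders):

    def _computeReward(self):

        state = self._getDroneStateVector(0)
        pos_err = np.linalg.norm(np.array([0, 0, 1]) - state[0:3])

        # Always >= 0, so surviving longer can only increase the return
        reward = max(0.0, 2.0 - pos_err)

        # Large one-off penalty when the drone crosses the early-termination bound,
        # so "bailing out" of the episode is never the better option
        if pos_err > 1.5:
            reward += -100.0

        return reward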

Anyway, for this new hover problem, another useful feature of terminating the episode early when the drone looks like it is about to crash is that, during training, the mean episode length ep_len_mean printed in the terminal becomes a direct indicator of whether the agent has converged to a valid solution. So now, instead of monitoring the average reward ep_rew_mean, which has some unknown value at the converged solution, you can just look at the length of the episode, because the agent would only have been able to solve the stabilisation problem if it could keep the drone at the set-point for the entire duration of the episode.
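
As a side note, this check could probably be automated. Below is a rough sketch (mine, untested) of a custom stable-baselines3 callback that stops training once the rolling mean episode length reaches a target value; the window size of 10 and the threshold are assumptions:

    import numpy as np
    from stable_baselines3.common.callbacks import BaseCallback

    class StopOnEpisodeLength(BaseCallback):
        """Stop training once recent episodes reach the full episode duration."""
        def __init__(self, min_mean_length, verbose=0):
            super().__init__(verbose)
            self.min_mean_length = min_mean_length

        def _on_step(self) -> bool:
            # ep_info_buffer holds the length 'l' and return 'r' of recent episodes
            lengths = [ep["l"] for ep in self.model.ep_info_buffer]
            if len(lengths) >= 10 and np.mean(lengths) >= self.min_mean_length:
                return False   # returning False stops training
            return True

    # e.g. model.learn(total_timesteps=5_000_000, callback=StopOnEpisodeLength(1200))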

Hopefully this is helpful to anyone who is trying to get, at the very least, a trained agent that works. As mentioned earlier, this hover model could be used as a base for training other, more complicated tasks. For now, my hope is that, having solved this simple problem, the agent will be able to generalise to other, more complicated navigational problems - which brings me to my actual question:

How can I change the set-point which a trained model is trying to reach? In all the aviaries, the objective position is specified internally within _computeReward(). What if I wanted to feed a stream of new set-points to the agent to have it track a trajectory instead of training the agent to follow a specific trajectory?

What I've thought of so far is that maybe the desired set-point could be an additional argument to the environment class. Something resembling the following:

class BaseAviary(gym.Env):

    def __init__(self,
                 drone_model: DroneModel=DroneModel.CF2X,
                 num_drones: int=1,
                 neighbourhood_radius: float=np.inf,
                 initial_xyzs=None,
                 initial_rpys=None,
                 physics: Physics=Physics.PYB,
                 freq: int=240,
                 aggregate_phy_steps: int=1,
                 gui=False,
                 record=False,
                 obstacles=False,
                 user_debug_gui=True,
                 vision_attributes=False,
                 dynamics_attributes=False,
                 set_point=None ########## Additional set_point argument ##########
                 ):

and from there, the set-point would be passed to _computeReward() and _computeDone() like this:

def _computeReward(self, set_point):
        ...
        return reward

def _computeDone(self, set_point):
        ...
        return done

_computeReward() would use the set-point to calculate the reward as the agent tries to reduce the state error, and _computeDone() would tell us when the agent has reached the set-point and can move on to the next one. There would then be some sort of outer loop that steps through the set-points and creates an instance of the environment with the current set-point as an argument. The environment is passed to the trained model to predict the next action. The state and the corresponding action are computed continuously in an inner loop that only terminates when _computeDone() returns True, i.e. when the drone has reached the set-point. From there we go back to the outer loop, which moves to the next set-point, creates a new instance of the environment with it, and the loop continues until the entire trajectory has been tracked. Something like this:

target_pos = np.array([ [x_1, y_1, z_1], [x_2, y_2, z_2], .... ])

for i in range(len(target_pos)):

    env = HoverAviary(initial_xyzs=None,
                      initial_rpys=None,
                      obs=ObservationType.KIN,
                      act=ActionType.RPM,
                      set_point=target_pos[i] ########## Additional set_point argument ##########
                      )

    model = PPO.load("Name_of_Model.zip", env=env)

    obs = env.reset()

    done = False
    while not done:
        action, _states = model.predict(obs)
        obs, reward, done, info = env.step(action)
        env.render()

My issue is that I don't think this works. Does an already trained model still use the reward function at all, or is the specific problem the agent was trained to solve internal to the model (i.e. passing a set-point to the reward function does nothing for an already trained model)?

If this is the case, then how would I go about testing the ability of a trained model to solve navigational problems other than the one it was trained for?

Thanks again for sharing this amazing project.

@JacopoPan JacopoPan added the question Further information is requested label May 25, 2021
@JacopoPan
Member

The agent is so greedy smh - it only wants one thing.

lol

So now, instead of monitoring the average reward ep_rew_mean, which has some unknown value at the converged solution, you can just look at the length of the episode, because the agent would only have been able to solve the stabilisation problem if it could keep the drone at the set-point for the entire duration of the episode.

Yes, for stabilization, e.g. in the classic CartPole environment, one way to do this is to give a positive reward at each timestep (regardless of state) and have a done condition that terminates the episode when the robot leaves the desired position.
As you pointed out, the agent will try to stay in the episode longer to collect more reward.
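
For concreteness, a minimal sketch of that shaping in the HoverAviary subclass discussed above might look like this (the 0.05 tolerance is only an illustrative value):

    def _computeReward(self):
        # Constant positive reward for every timestep the drone survives
        return 1.0

    def _computeDone(self):
        state = self._getDroneStateVector(0)
        pos_err = np.linalg.norm(np.array([0, 0, 1]) - state[0:3])
        # Terminate as soon as the drone leaves a small ball around the set-point,
        # or when the nominal episode length is reached
        return pos_err > 0.05 or self.step_counter/self.SIM_FREQ > self.EPISODE_LEN_SEC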

Does an already trained model still use the reward function at all, or is the specific problem the agent was trained to solve internal to the model (i.e. passing a set-point to the reward function does nothing for an already trained model)?

Short answer, no.
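
To illustrate: at test time only the policy network is queried, so the reward function never influences the frozen model's behaviour:

    obs = env.reset()
    action, _ = model.predict(obs, deterministic=True)   # uses only the observation
    obs, reward, done, info = env.step(action)           # reward is logged, never fed back into the trained policy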

One path you might want to try (though it requires some work): what about training an RL agent with a stabilization policy first, then plugging this policy into the action preprocessing, and finally training a second agent that uses the knowledge of the previously trained one but is also allowed small perturbations of it to navigate waypoints/reach a desired position?
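
A very rough sketch of that idea (not existing code in this repo; the file name, class name, and scaling factor below are placeholders) would be a wrapper that queries the frozen stabilization policy and lets the second agent add a small residual on top of its action:

    import gym
    import numpy as np
    from stable_baselines3 import PPO

    class ResidualActionWrapper(gym.Wrapper):
        """Second agent outputs small perturbations on top of a frozen base policy."""
        def __init__(self, env, base_model_path="hover_base.zip", residual_scale=0.1):
            super().__init__(env)
            self.base_policy = PPO.load(base_model_path)   # frozen stabilization policy
            self.residual_scale = residual_scale
            self._last_obs = None

        def reset(self, **kwargs):
            self._last_obs = self.env.reset(**kwargs)
            return self._last_obs

        def step(self, residual_action):
            base_action, _ = self.base_policy.predict(self._last_obs, deterministic=True)
            action = np.clip(base_action + self.residual_scale * residual_action, -1, 1)
            obs, reward, done, info = self.env.step(action)
            self._last_obs = obs
            return obs, reward, done, info

    # The second (navigation) agent is then trained on the wrapped environment, e.g.
    # model = PPO("MlpPolicy", ResidualActionWrapper(HoverAviary(...))).learn(...)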

I'm glad you are having fun with this repo, and I'd like to restate that having people try to do the things you are trying to do is exactly why it exists.

@alchemi5t
alchemi5t commented Mar 22, 2022

@GM-Whooshi Just reading this thread gave me so much more knowledge about the codebase than the readme did (not to put the readme down; that helped a lot too). I really think this should be part of some example readme if it isn't already! I truly appreciate the detail you've gone into in this issue!

Getting back to the question: did you figure out how to train a model on a specific trajectory? If yes, how did you create the trajectory, and did you use the Euclidean norm in the reward function?

@ngurnard

I am also curious as to what @alchemi5t asked!

@HimGautam

@GM-Whooshi, what initialization and activation did you use in your neural networks?
