Normalizing advantages or returns #20
Hi @DuaneNielsen, the return is normalized as described in the paper, and the value baseline is normalized by the same statistics so that they are in the same space. They are then subtracted. Effectively, this means the advantage is normalized by the statistics of the return (not of the advantage). Does that answer your question?
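For concreteness, a minimal sketch of that normalization (the names `normalized_advantage`, `returns`, and `values` are mine, not taken from the repo, and the real implementation may use running statistics rather than a single batch):

```python
import numpy as np

def normalized_advantage(returns, values):
    # Range between the 5th and 95th percentile of the lambda-returns,
    # clipped below 1 so that small returns are not blown up.
    scale = max(1.0, np.percentile(returns, 95) - np.percentile(returns, 5))
    normed_ret = returns / scale
    normed_base = values / scale      # normalized by the same statistics as the return
    return normed_ret - normed_base   # advantage in the normalized space

returns = np.array([10.0, 12.0, 8.0, 11.0])
values = np.array([9.5, 11.5, 8.5, 10.0])
print(normalized_advantage(returns, values))
```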
Oh yeah, I get that the value baseline is normalized by the same amount, so the advantage is normalized by the percentile scale. That's clear. But if I read the paper, I come to the conclusion that I should just use normed_ret (no advantage). Perhaps I'm going wrong there. Empirically, the scale of normed_ret - normed_base is going to be roughly 20 times smaller than the scale of normed_ret alone, I think. So this will increase the role the entropy loss plays quite a bit, i.e. if you use the advantage loss you will explore a whole lot more, and if you use the loss as written in the paper you will explore a lot less. I noticed this when I ran tests. Does that make sense, or do you think my reasoning/math is misguided? In any case, I think you have provided what I was looking for: the advantage estimate is what I should use.
The paper shows the gradient of the expectation of the (normalized) return. If you estimate that gradient using Reinforce with a baseline, you get the expectation of the (normalized) advantage times the gradient of the policy log-probability, which is what's implemented in the code. It's easy to confuse the two, but the first one talks about returns and has the gradient outside of the expectation, whereas the second one talks about advantages and has the gradient inside. So the two are actually the same, despite looking different at first sight. Thus, despite using the advantage instead of the return, the code correctly trades off expected normalized returns and entropy. This is in contrast to most existing policy gradient algorithms, which trade off expected returns scaled by the advantage scale against entropy, which I think makes less sense.
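To see this numerically, here is a toy check (my own sketch, nothing from the repo): for a small categorical policy, the Reinforce estimator converges to the exact gradient of the expected return whether or not an action-independent baseline is subtracted; the baseline only changes the variance.

```python
import numpy as np

rng = np.random.default_rng(0)

logits = np.array([0.2, -0.5, 1.0])   # parameters theta of a 3-action softmax policy
probs = np.exp(logits) / np.exp(logits).sum()
returns = np.array([1.0, 3.0, 2.0])   # return obtained for each action
baseline = 1.9                        # any action-independent baseline

# Exact gradient of E_pi[R] with respect to the logits.
exact = probs * (returns - probs @ returns)

def reinforce_estimate(num_samples, b):
    grad = np.zeros_like(logits)
    for _ in range(num_samples):
        a = rng.choice(len(probs), p=probs)
        score = np.eye(len(probs))[a] - probs   # grad of log pi(a) w.r.t. the logits
        grad += (returns[a] - b) * score
    return grad / num_samples

print("exact gradient    ", exact)
print("Reinforce, b = 0  ", reinforce_estimate(100_000, 0.0))
print("Reinforce, b != 0 ", reinforce_estimate(100_000, baseline))
```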
Thanks a lot for clearing that up, Danijar. I missed the subtlety of the expectation being on the outside versus the inside, so I'll try deriving that to prove it to myself. Thanks for sharing the code and for creating this wonderful algorithm!
Cool! The derivation for Reinforce is basically this:

$$\nabla_\theta \, \mathbb{E}_{a \sim \pi_\theta}\!\left[ R(a) \right] = \mathbb{E}_{a \sim \pi_\theta}\!\left[ R(a) \, \nabla_\theta \log \pi_\theta(a) \right]$$

Then you can also show that subtracting a "baseline" from the return does not change the gradient, as long as the baseline does not depend on the action:

$$\mathbb{E}_{a \sim \pi_\theta}\!\left[ b \, \nabla_\theta \log \pi_\theta(a) \right] = b \sum_a \nabla_\theta \pi_\theta(a) = b \, \nabla_\theta \sum_a \pi_\theta(a) = 0$$

If you put these together, you get:

$$\nabla_\theta \, \mathbb{E}_{a \sim \pi_\theta}\!\left[ R(a) \right] = \mathbb{E}_{a \sim \pi_\theta}\!\left[ \big( R(a) - b \big) \, \nabla_\theta \log \pi_\theta(a) \right]$$

Now if you follow these derivations but with the return and baseline being normalized as in DreamerV3, where $S = \max\!\big(1, \operatorname{Per}(R, 95) - \operatorname{Per}(R, 5)\big)$ and the baseline is the value prediction $v$, you get:

$$\nabla_\theta \, \mathbb{E}_{a \sim \pi_\theta}\!\left[ \frac{R(a)}{S} \right] = \mathbb{E}_{a \sim \pi_\theta}\!\left[ \frac{R(a) - v}{S} \, \nabla_\theta \log \pi_\theta(a) \right]$$

So the estimator multiplies the log-probability by the normalized advantage, while the quantity whose gradient is being estimated is the expected normalized return.
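Put as code, the result above says the actor can maximize the expected normalized return by weighting log-probabilities with the normalized advantage, treated as a constant. A rough PyTorch-style sketch under my own naming (the entropy scale `eta` and the exact normalization details are assumptions for illustration, not the repo's code):

```python
import torch

def actor_loss(logprob, entropy, lambda_returns, values, eta=3e-4):
    # Normalize return and baseline by the same percentile range, clipped below 1.
    scale = torch.clamp(
        torch.quantile(lambda_returns, 0.95) - torch.quantile(lambda_returns, 0.05),
        min=1.0)
    # Normalized advantage, detached so it only scales the policy gradient.
    adv = ((lambda_returns - values) / scale).detach()
    # The gradient of the first term is the Reinforce-with-baseline estimator of
    # the gradient of the expected *normalized* return, so its scale stays
    # comparable to the entropy bonus.
    return (-adv * logprob - eta * entropy).mean()
```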
By the way, I think this can also be explained much better in the paper. We'll update it at some point.
Hi Danijar,
Thanks so much for developing and sharing this algorithm. I've been following your work for some time and I think it's really great.
I'm attempting a re-implementation from scratch in PyTorch and I have a question about the actor loss.
In the paper the loss is provided as...
However, in the code I see something that looks more like an advantage estimate.
If I'm interpreting the code correctly, normed_base seems to come from the value of the state the actor was in prior to the transition, and normed_ret is the percentile-scaled return as described in the paper.
Also, there is a little trick at the end..
where 'weight' is computed by exponentially discounting the model's future predictions.
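For other readers, here is my reading of that trick as a sketch (the function name, the shift so that the first step keeps weight 1, and the discount value are assumptions of mine, not the repo's code): the weight compounds the model's predicted continuation probabilities with the discount factor and is detached from the gradient.

```python
import torch

def trajectory_weight(cont, discount=0.997):
    # cont: (T, B) predicted continuation probabilities from the world model.
    # Each imagined step is down-weighted by the compounded probability that
    # the episode is still running, times the discount factor.
    running = torch.cumprod(discount * cont, dim=0)
    # Shift so the first imagined step keeps weight 1 (illustrative choice).
    weight = torch.cat([torch.ones_like(running[:1]), running[:-1]], dim=0)
    return weight.detach()
```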
This all makes sense, and is a good policy gradient, but in the paper it's mentioned that it's important to keep the scale of the policy gradient loss proportional to the entropy.
I noticed a big difference in policy gradient scale between using the returns as presented in the paper, and what we have here.
It would be great if you could clarify: did I just misread the paper, or are these simply implementation details that don't matter a whole lot in practice?