Fix: PQN needs to mask the return computation if next state is done #494
base: master
Conversation
Do you not need to zero the error for nextnonterminal rather than the return? Or are they equivalent?
@pseudo-rnd-thoughts No, I don't think so. In a terminal state, your TD target return is the episode's last reward (which is what this PR corrects), and you need to regress towards that (hence your TD error might not be zero). In a non-terminal state, your TD target return (with Q(λ)) is the reward at that state plus a discounted, recursive combination of the returns and bootstrapped values from future states. You always need to regress towards the TD targets, so the loss shouldn't be zeroed out.
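For reference, the target being described can be written roughly as follows (this notation is illustrative and not taken from the code; $d_{t+1}$ denotes the done flag of the next state):

$$R_t = r_t + \gamma\,(1 - d_{t+1})\,\big[\lambda\,R_{t+1} + (1 - \lambda)\,\max_a Q(s_{t+1}, a)\big]$$

so when $d_{t+1} = 1$ the target collapses to the final reward $r_t$; otherwise it mixes the recursive return with the bootstrapped value.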
Sorry, I think I was thinking of a different problem that I thought this was fixing. See #475 for more detail. In short, within a rollout you will have the end of one episode and the first step of the next episode. I hope that makes sense; solving that issue requires zeroing out this particular observation to prevent learning from this error.
@pseudo-rnd-thoughts This is a known limitation in all current CleanRL+EnvPool implementations. The problem arises because of the observation returned by EnvPool when […]. I fixed this error by modifying […], like this: […]
In this way, the errors that get masked are those between the second-to-last observation of the episode (now taken as the true last) and the first observation of the next episode. With these changes, the observation accompanying the […]. I can include this in the PQN implementations or do it for all the envpool implementations in CleanRL to close #475. Let me know. Tagging @vwxyzjn as well :)
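As a rough sketch of the alternative being discussed here, masking out the learning signal for transitions that cross the auto-reset boundary; the names `masked_td_loss` and `boundary_dones` are assumptions for illustration, not code from the repo:

```python
import torch

def masked_td_loss(predicted_q, returns, boundary_dones):
    """Mean-squared TD error that ignores transitions crossing an auto-reset
    boundary. `boundary_dones[t, e] == 1` marks a transition whose next
    observation already belongs to the following episode (illustrative)."""
    mask = 1.0 - boundary_dones
    td_error = (returns - predicted_q) * mask
    return td_error.pow(2).sum() / mask.sum().clamp(min=1.0)
```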
While I agree that this is an option, personally I dislike it.
Understood! Still, this PR addresses a different issue, regarding the return computation, which needs to be fixed as well.
Totally, sorry for derailing this PR.
Yeah, the auto-reset stuff is a bit tricky; also, I did a benchmark a while ago and did not find much performance impact, so overall it doesn't feel worth fixing. The PR looks good to me @roger-creus. Would this be a performance-impacting change? If so, please re-run the benchmark experiments to make sure there is no significant regression. Otherwise feel free to merge.
Description
In PQN, the `nextnonterminal` flag is currently used to mask only the next value prediction, but it should also mask the next return, as sketched below.
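A minimal sketch of the corrected Q(λ) target computation under this masking, assuming the usual CleanRL rollout tensors; the function name, argument names, and shapes below are illustrative, not the exact code in the diff:

```python
import torch

def q_lambda_returns(rewards, dones, max_q_next, next_value, next_done, gamma, q_lambda):
    """Q(lambda) targets with `nextnonterminal` masking both the bootstrapped
    value and the recursive return (the change proposed in this PR).

    Illustrative shapes: rewards, dones, max_q_next are (num_steps, num_envs),
    with max_q_next[t] = max_a Q(s_{t+1}, a); next_value and next_done describe
    the step right after the rollout.
    """
    num_steps = rewards.shape[0]
    returns = torch.zeros_like(rewards)
    for t in reversed(range(num_steps)):
        if t == num_steps - 1:
            nextnonterminal = 1.0 - next_done
            nextvalue, next_return = next_value, next_value
        else:
            nextnonterminal = 1.0 - dones[t + 1]
            nextvalue, next_return = max_q_next[t], returns[t + 1]
        # Previously only `nextvalue` was multiplied by `nextnonterminal`;
        # masking the whole bracket also zeroes the recursive return, so the
        # target of a terminal transition reduces to the final reward.
        returns[t] = rewards[t] + gamma * nextnonterminal * (
            q_lambda * next_return + (1 - q_lambda) * nextvalue
        )
    return returns
```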
This PR fixes this issue in `pqn.py`, `pqn_atari_envpool.py`, and `pqn_atari_envpool_lstm.py`.
Types of changes
Checklist:
- I have ensured `pre-commit run --all-files` passes (required).
- I have updated the documentation and previewed the changes via `mkdocs serve`.

If you need to run benchmark experiments for a performance-impacting change:
- I have used the benchmark utility to submit the tracked experiments, optionally with `--capture_video`.
- I have performed RLops with `python -m openrlbenchmark.rlops`.
- I have added the learning curves generated by the `python -m openrlbenchmark.rlops` utility to the documentation.
- I have added links to the tracked experiments, generated by `python -m openrlbenchmark.rlops ....your_args... --report`, to the documentation.