
Fix: PQN needs to mask the return computation if next state is done #494

Open · wants to merge 1 commit into master

Conversation

roger-creus (Collaborator)

Description

In PQN, nextnonterminal is currently used to mask only the next value prediction:

returns[t] = rewards[t] + args.gamma * (
    args.q_lambda * returns[t + 1] + (1 - args.q_lambda) * next_value * nextnonterminal
)

but it should also mask the bootstrapped next return, which can be fixed as:

returns[t] = rewards[t] + args.gamma * (
    args.q_lambda * returns[t + 1] + (1 - args.q_lambda) * next_value
) * nextnonterminal

This PR fixes this issue in pqn.py, pqn_atari_envpool.py, and pqn_atari_envpool_lstm.py.
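
For reference, here is a minimal sketch of the full backward return computation with the corrected masking; the function signature, tensor names, and shapes are assumptions for illustration and do not match the CleanRL scripts line for line:

import torch

def compute_q_lambda_returns(rewards, dones, values, next_done, next_value, gamma, q_lambda):
    """Backward Q(lambda) returns with the corrected terminal masking.

    rewards, dones, values are (num_steps, num_envs) tensors, with
    values[t] = max_a Q(s_t, a); next_done and next_value describe the state
    reached after the last rollout step. All names here are illustrative.
    """
    num_steps = rewards.shape[0]
    returns = torch.zeros_like(rewards)
    for t in reversed(range(num_steps)):
        if t == num_steps - 1:
            nextnonterminal = 1.0 - next_done
            returns[t] = rewards[t] + gamma * next_value * nextnonterminal
        else:
            nextnonterminal = 1.0 - dones[t + 1]
            # The fix: mask both the recursive return and the bootstrapped
            # value when the next state is terminal, not just the value.
            returns[t] = rewards[t] + gamma * (
                q_lambda * returns[t + 1] + (1 - q_lambda) * values[t + 1]
            ) * nextnonterminal
    return returns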

Types of changes

  • Bug fix
  • New feature
  • New algorithm
  • Documentation

Checklist:

  • I've read the CONTRIBUTION guide (required).
  • I have ensured pre-commit run --all-files passes (required).
  • I have updated the tests accordingly (if applicable).
  • I have updated the documentation and previewed the changes via mkdocs serve.
    • I have explained note-worthy implementation details.
    • I have explained the logged metrics.
    • I have added links to the original paper and related papers.

If you need to run benchmark experiments for a performance-impacting change:

  • I have contacted @vwxyzjn to obtain access to the openrlbenchmark W&B team.
  • I have used the benchmark utility to submit the tracked experiments to the openrlbenchmark/cleanrl W&B project, optionally with --capture_video.
  • I have performed RLops with python -m openrlbenchmark.rlops.
    • For new feature or bug fix:
      • I have used the RLops utility to understand the performance impact of the changes and confirmed there is no regression.
    • For new algorithm:
      • I have created a table comparing my results against those from reputable sources (i.e., the original paper or other reference implementation).
    • I have added the learning curves generated by the python -m openrlbenchmark.rlops utility to the documentation.
    • I have added links to the tracked experiments in W&B, generated by python -m openrlbenchmark.rlops ....your_args... --report, to the documentation.

@pseudo-rnd-thoughts (Collaborator) left a comment

Do you not need to zero the error for nextnonterminal rather than the return? Or are they equivalent?

@roger-creus (Collaborator, Author)

Do you not need to zero the error for nextnonterminal rather than the return? Or are they equivalent?

@pseudo-rnd-thoughts No, I don't think so. In a terminal state, your TD target return is the episode's last reward (which is what this PR corrects), and you still need to regress onto that target, so your TD error might not be zero. In a non-terminal state, your TD target return (with Q(λ)) is the reward at that state plus a recursive combination of the returns from future states. You always need to regress onto the TD targets, so the loss shouldn't be zeroed out.
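
Written out, the corrected Q(λ) target from the description above is

$$
G_t = r_t + \gamma \left( \lambda\, G_{t+1} + (1 - \lambda) \max_a Q(s_{t+1}, a) \right) (1 - d_{t+1}),
$$

so when the next state is terminal ($d_{t+1} = 1$) it reduces to $G_t = r_t$: the bootstrapped part is dropped, but this target still has to be regressed against $Q(s_t, a_t)$, so the loss itself is not zeroed.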

@pseudo-rnd-thoughts (Collaborator)

Sorry, I think I was thinking of a different problem from the one this PR is fixing.

See #475 for more detail.

In short, a rollout will contain the end of one episode and the start of the next.
The problem with the original PQN, and I think with your implementation, is that we don't account for this and still compute the error between the last and first observations of neighbouring episodes.

I hope that makes sense; solving this issue requires zeroing out this particular observation to prevent learning from this error.

@roger-creus (Collaborator, Author)

@pseudo-rnd-thoughts This is a known limitation in all current CleanRL+Envpool implementations. The problem arises because the observation returned by EnvPool when done=True is the final observation of the episode. CleanRL's current implementation incorrectly masks the error between the penultimate and final observations of each episode, while it should mask the error between the final observation of one episode and the first observation of the next. For example, here's the current observation accompanying a done=True signal:

[screenshot doneOld_13064: the observation returned with the done=True signal before the fix]

I fixed this error by modifying the RecordEpisodeStatistics wrapper in the envpool scripts to account for the auto-resetting differences between envpool and gymnasium, as done here.

Like this:

def step(self, action):
    observations, rewards, dones, infos = super().step(action)

    # Fix envpool auto-resetting: step the finished sub-environments once
    # more so that `observations` holds the first observation of the new
    # episode instead of the final observation of the finished one.
    env_ids_to_reset = np.where(dones)[0]
    if len(env_ids_to_reset) > 0:
        (
            reset_obs,
            _,
            _,
            _,
        ) = self.env.step(np.zeros_like(action), env_ids_to_reset)
        observations[env_ids_to_reset] = reset_obs

    self.episode_returns += infos["reward"]
    self.episode_lengths += 1
    self.returned_episode_returns[:] = self.episode_returns
    self.returned_episode_lengths[:] = self.episode_lengths
    self.episode_returns *= 1 - infos["terminated"]
    self.episode_lengths *= 1 - infos["terminated"]
    infos["r"] = self.returned_episode_returns
    infos["l"] = self.returned_episode_lengths
    return (
        observations,
        rewards,
        dones,
        infos,
    )

In this way, the errors that get masked are between the second-to-last observation of the episode (now treated as its last) and the first observation of the next episode. With these changes, the observation accompanying the done=True signal is now the following:

[screenshot doneFixed_11392: the observation returned with the done=True signal after the fix]
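
(For context, a rough sketch of how the modified wrapper above could be wired up with envpool; the task name and arguments are illustrative and not taken from this PR:)

import envpool
import numpy as np

# Illustrative setup: wrap an envpool vectorized env with the modified
# RecordEpisodeStatistics defined above.
envs = envpool.make("Breakout-v5", env_type="gym", num_envs=8)
envs = RecordEpisodeStatistics(envs)

obs = envs.reset()
actions = np.zeros(8, dtype=np.int32)  # dummy actions, for illustration only
obs, rewards, dones, infos = envs.step(actions)
# Wherever dones[i] is True, obs[i] is now the first observation of the new
# episode instead of the final observation of the finished one.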

I can include this in the PQN implementations or do it for all the envpool implementations in CleanRL to close #475. Let me know.

Tagging @vwxyzjn as well :)

@pseudo-rnd-thoughts (Collaborator)

While I agree that this is an option, personally I dislike it.
I would prefer to keep all the observations, since we can then terminate and truncate correctly with all the data, and instead mask the loss function to solve the autoreset issue.
I know this adds some complexity but it should be possible.
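
(A rough sketch of what this loss-masking alternative could look like, assuming a per-transition flag marks autoreset boundaries; the names below are hypothetical and this is not part of the PR:)

import torch

def masked_td_loss(q_taken, returns, crosses_reset):
    """Mean-squared TD loss that ignores transitions crossing an autoreset.

    q_taken, returns, crosses_reset are (num_steps, num_envs) tensors;
    crosses_reset is 1.0 where the stored next observation already belongs
    to a new episode. All names here are hypothetical.
    """
    valid = 1.0 - crosses_reset
    td_error = (q_taken - returns) * valid
    return td_error.pow(2).sum() / valid.sum().clamp(min=1.0)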

@roger-creus (Collaborator, Author) commented Jan 14, 2025

While I agree that this is an option, personally I dislike it.
I would prefer to keep all the observations, since we can then terminate and truncate correctly with all the data, and instead mask the loss function to solve the autoreset issue.
I know this adds some complexity but it should be possible.

Understood! Still, this PR fixes a different issue with the return computation, which also needs to be fixed.

@pseudo-rnd-thoughts (Collaborator)

Still, this PR fixes a different issue with the return computation, which also needs to be fixed.

Totally, sorry for derailing this PR.

@vwxyzjn (Owner) commented Jan 22, 2025

Yeah, the auto-reset stuff is a bit tricky. I also ran a benchmark a while ago and did not find much performance impact, so overall it doesn't feel worth fixing.

The PR looks good to me, @roger-creus. Would this be a performance-impacting change? If so, please re-run the benchmark experiments to make sure there is no significant regression. Otherwise, feel free to merge.
