
Fix: PQN needs to mask the return computation if next state is done #494

Open · wants to merge 1 commit into master

Conversation

roger-creus (Collaborator)

Description

In PQN, nextnonterminal is currently used to mask only the next value prediction:

returns[t] = rewards[t] + args.gamma * (
    args.q_lambda * returns[t + 1] + (1 - args.q_lambda) * next_value * nextnonterminal
)

but it should also mask the bootstrapped next return, which can be fixed as:

returns[t] = rewards[t] + args.gamma * (
    args.q_lambda * returns[t + 1] + (1 - args.q_lambda) * next_value
) * nextnonterminal

This PR fixes this issue in pqn.py, pqn_atari_envpool.py, and pqn_atari_envpool_lstm.py.
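
For reference, here is a minimal sketch of the full backward return computation with the corrected masking; the function signature, tensor names, and shapes are assumptions for illustration and do not match the CleanRL scripts line for line:

import torch

def compute_q_lambda_returns(rewards, dones, values, next_done, next_value, gamma, q_lambda):
    """Backward Q(lambda) returns with the corrected terminal masking.

    rewards, dones, values are (num_steps, num_envs) tensors, with
    values[t] = max_a Q(s_t, a); next_done and next_value describe the state
    reached after the last rollout step. All names here are illustrative.
    """
    num_steps = rewards.shape[0]
    returns = torch.zeros_like(rewards)
    for t in reversed(range(num_steps)):
        if t == num_steps - 1:
            nextnonterminal = 1.0 - next_done
            returns[t] = rewards[t] + gamma * next_value * nextnonterminal
        else:
            nextnonterminal = 1.0 - dones[t + 1]
            # The fix: mask both the recursive return and the bootstrapped
            # value when the next state is terminal, not just the value.
            returns[t] = rewards[t] + gamma * (
                q_lambda * returns[t + 1] + (1 - q_lambda) * values[t + 1]
            ) * nextnonterminal
    return returns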

Types of changes

  • Bug fix
  • New feature
  • New algorithm
  • Documentation

Checklist:

  • I've read the CONTRIBUTION guide (required).
  • I have ensured pre-commit run --all-files passes (required).
  • I have updated the tests accordingly (if applicable).
  • I have updated the documentation and previewed the changes via mkdocs serve.
    • I have explained note-worthy implementation details.
    • I have explained the logged metrics.
    • I have added links to the original paper and related papers.

If you need to run benchmark experiments for a performance-impacting change:

  • I have contacted @vwxyzjn to obtain access to the openrlbenchmark W&B team.
  • I have used the benchmark utility to submit the tracked experiments to the openrlbenchmark/cleanrl W&B project, optionally with --capture_video.
  • I have performed RLops with python -m openrlbenchmark.rlops.
    • For new feature or bug fix:
      • I have used the RLops utility to understand the performance impact of the changes and confirmed there is no regression.
    • For new algorithm:
      • I have created a table comparing my results against those from reputable sources (i.e., the original paper or other reference implementation).
    • I have added the learning curves generated by the python -m openrlbenchmark.rlops utility to the documentation.
    • I have added links to the tracked experiments in W&B, generated by python -m openrlbenchmark.rlops ....your_args... --report, to the documentation.

@pseudo-rnd-thoughts (Collaborator) left a comment

Do you not need to zero the error for nextnonterminal rather than the return? Or are they equivalent?

@roger-creus (Collaborator, Author)

Do you not need to zero the error for nextnonterminal rather than the return? Or are they equivalent?

@pseudo-rnd-thoughts No, I don't think so. In a terminal state, your TD target return is the episode's last reward (which is what this PR corrects), and you still need to regress onto that target, so your TD error might not be zero. In a non-terminal state, your TD target return (with Q(λ)) is the reward at that state plus a recursive combination of the returns from future states. You always need to regress onto the TD targets, so the loss shouldn't be zeroed out.
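
Written out, the corrected Q(λ) target from the description above is

$$
G_t = r_t + \gamma \left( \lambda\, G_{t+1} + (1 - \lambda) \max_a Q(s_{t+1}, a) \right) (1 - d_{t+1}),
$$

so when the next state is terminal ($d_{t+1} = 1$) it reduces to $G_t = r_t$: the bootstrapped part is dropped, but this target still has to be regressed against $Q(s_t, a_t)$, so the loss itself is not zeroed.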

@pseudo-rnd-thoughts (Collaborator)

Sorry, I think I was thinking of a different problem from the one this PR is fixing.

See #475 for more detail.

In short, a rollout will contain the end of one episode and the start of the next.
The problem with the original PQN, and I think with your implementation, is that we don't account for this and still compute the error between the last and first observations of neighbouring episodes.

I hope that makes sense; solving this issue requires zeroing out this particular observation to prevent learning from this error.

@roger-creus (Collaborator, Author)

@pseudo-rnd-thoughts This is a known limitation in all current CleanRL+Envpool implementations. The problem arises because the observation returned by EnvPool when done=True is the final observation of the episode. CleanRL's current implementation incorrectly masks the error between the penultimate and final observations of each episode, while it should mask the error between the final observation of one episode and the first observation of the next. For example, here's the current observation accompanying a done=True signal:

[screenshot doneOld_13064: the observation returned with the done=True signal before the fix]

I fixed this error by modifying the RecordEpisodeStatistics wrapper in the envpool scripts to account for the auto-resetting differences between envpool and gymnasium, as done here.

Like this:

def step(self, action):
    observations, rewards, dones, infos = super().step(action)

    # Fix envpool auto-resetting: step the finished sub-environments once
    # more so that `observations` holds the first observation of the new
    # episode instead of the final observation of the finished one.
    env_ids_to_reset = np.where(dones)[0]
    if len(env_ids_to_reset) > 0:
        (
            reset_obs,
            _,
            _,
            _,
        ) = self.env.step(np.zeros_like(action), env_ids_to_reset)
        observations[env_ids_to_reset] = reset_obs

    self.episode_returns += infos["reward"]
    self.episode_lengths += 1
    self.returned_episode_returns[:] = self.episode_returns
    self.returned_episode_lengths[:] = self.episode_lengths
    self.episode_returns *= 1 - infos["terminated"]
    self.episode_lengths *= 1 - infos["terminated"]
    infos["r"] = self.returned_episode_returns
    infos["l"] = self.returned_episode_lengths
    return (
        observations,
        rewards,
        dones,
        infos,
    )

In this way, the errors that get masked are between the second-to-last observation of the episode (now treated as its last) and the first observation of the next episode. With these changes, the observation accompanying the done=True signal is now the following:

[screenshot doneFixed_11392: the observation returned with the done=True signal after the fix]
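
(For context, a rough sketch of how the modified wrapper above could be wired up with envpool; the task name and arguments are illustrative and not taken from this PR:)

import envpool
import numpy as np

# Illustrative setup: wrap an envpool vectorized env with the modified
# RecordEpisodeStatistics defined above.
envs = envpool.make("Breakout-v5", env_type="gym", num_envs=8)
envs = RecordEpisodeStatistics(envs)

obs = envs.reset()
actions = np.zeros(8, dtype=np.int32)  # dummy actions, for illustration only
obs, rewards, dones, infos = envs.step(actions)
# Wherever dones[i] is True, obs[i] is now the first observation of the new
# episode instead of the final observation of the finished one.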

I can include this in the PQN implementations or do it for all the envpool implementations in CleanRL to close #475. Let me know.

Tagging @vwxyzjn as well :)

@pseudo-rnd-thoughts (Collaborator)

While I agree that this is an option, personally I dislike it.
I would prefer to keep all the observations, since we can then terminate and truncate correctly with all the data, and instead mask the loss function to solve the autoreset issue.
I know this adds some complexity but it should be possible.
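
(A rough sketch of what this loss-masking alternative could look like, assuming a per-transition flag marks autoreset boundaries; the names below are hypothetical and this is not part of the PR:)

import torch

def masked_td_loss(q_taken, returns, crosses_reset):
    """Mean-squared TD loss that ignores transitions crossing an autoreset.

    q_taken, returns, crosses_reset are (num_steps, num_envs) tensors;
    crosses_reset is 1.0 where the stored next observation already belongs
    to a new episode. All names here are hypothetical.
    """
    valid = 1.0 - crosses_reset
    td_error = (q_taken - returns) * valid
    return td_error.pow(2).sum() / valid.sum().clamp(min=1.0)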

@roger-creus (Collaborator, Author) commented Jan 14, 2025

While I agree that this is an option, personally I dislike it.
I would prefer to keep all the observations, since we can then terminate and truncate correctly with all the data, and instead mask the loss function to solve the autoreset issue.
I know this adds some complexity but it should be possible.

Understood! Still, this PR fixes a different issue with the return computation, which also needs to be fixed.

@pseudo-rnd-thoughts (Collaborator)

Still, this PR fixes a different issue with the return computation, which also needs to be fixed.

Totally, sorry for derailing this PR.

@vwxyzjn (Owner) commented Jan 22, 2025

Yeah, the auto-reset stuff is a bit tricky. I also ran a benchmark a while ago and did not find much performance impact, so overall it doesn't feel worth fixing.

The PR looks good to me, @roger-creus. Would this be a performance-impacting change? If so, please re-run the benchmark experiments to make sure there is no significant regression. Otherwise, feel free to merge.
