This repository has been archived by the owner on Jun 13, 2024. It is now read-only.
Hi Quan,
I came across your paper and found it interesting. I have a question about the implementation of the optimistic policies. Why do you compute the gradient of the upper bound w.r.t. the pre-tanh output of the policy? According to the paper, isn't it supposed to be taken w.r.t. the deterministic action (the output of the tanh policy)?
Regards,
Kartik
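To make the distinction in the question concrete, here is a minimal sketch of how the two gradients differ. It is not the repository's code: `q_upper_bound` is a made-up toy function standing in for the critic's upper bound, and all names are hypothetical. By the chain rule, with a = tanh(u), the gradient w.r.t. the pre-tanh value u equals the gradient w.r.t. the action a scaled by 1 - tanh(u)^2.

```python
import math


def q_upper_bound(action):
    # Hypothetical stand-in for the critic's upper bound (assumption, not from the paper or repo).
    return -(action - 0.5) ** 2


def grad_wrt_action(action):
    # Analytic derivative of the toy bound with respect to the squashed action a.
    return -2.0 * (action - 0.5)


def grad_wrt_pre_tanh(pre_tanh):
    # Chain rule: dQ/du = dQ/da * da/du, where a = tanh(u) and da/du = 1 - tanh(u)^2.
    action = math.tanh(pre_tanh)
    return grad_wrt_action(action) * (1.0 - action ** 2)


def finite_difference(f, x, eps=1e-6):
    # Central difference, used only to sanity-check the analytic gradients.
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


u = 1.2                      # arbitrary pre-tanh value
a = math.tanh(u)             # deterministic (squashed) action
g_action = grad_wrt_action(a)
g_pre_tanh = grad_wrt_pre_tanh(u)
print("gradient w.r.t. action:  ", g_action)
print("gradient w.r.t. pre-tanh:", g_pre_tanh)
```

The two gradients share a sign in this toy case but differ in magnitude, and the gap grows as the action saturates toward ±1, so an update step taken in pre-tanh space is not the same as one taken in action space.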