Why does a single process not work on Push? #19
Comments
I find that even with a larger batch size, HER still does not work. Do you know why?
@Ericonaldo Hi, actually,
Hi, I've tried 4 processes and 2 processes. Both work, but a single process with a batch size of 2048 does not.
@Ericonaldo Hi - My guess is that it comes down to the diversity of samples. With a single process, the agent collects only 2*50 = 100 episodes in each epoch before it updates the network. It then samples a batch of episodes from the replay buffer and samples one transition from each sampled episode for training. So even over 50 epochs, the agent collects only 5000 unique episodes (50 * 100). Although you use
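The episode arithmetic in this comment can be sketched as below. The function and parameter names are hypothetical, not taken from the repository, and reading the 2 as rollouts per cycle and the 50 as cycles per epoch is an assumption; only the numbers themselves (2, 50, 50 epochs) come from the comment.

```python
def unique_episodes(n_workers, rollouts_per_cycle=2, cycles_per_epoch=50, n_epochs=50):
    """Total episodes collected by all workers over a full training run.

    All names here are illustrative; the defaults mirror the figures
    quoted in the comment (2 * 50 episodes per epoch, 50 epochs).
    """
    episodes_per_epoch = n_workers * rollouts_per_cycle * cycles_per_epoch
    return episodes_per_epoch * n_epochs


# Compare a single process against the 2- and 4-process runs mentioned above.
for n in (1, 2, 4):
    print(n, unique_episodes(n))
```

Under these assumptions a single process sees 5000 episodes, while 4 processes see four times as many in the same number of epochs.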
If this is true, shouldn't we be able to succeed by scaling the number of episodes by K times? However, that does not seem to work either.
@Ericonaldo Hmm - that's a good point. There is an interesting detail here: https://github.com/TianhongDai/hindsight-experience-replay/blob/master/mpi_utils/mpi_utils.py#L21-L22 . I followed OpenAI's setting; they use
Great, many thanks. I tried this because my own implementation of HER only reaches a success rate of 70-80%, and I am trying to figure out what really matters in training.
@Ericonaldo Yes - implementing HER is quite tricky...
@Ericonaldo I found that the

```python
comm.Allreduce(flat_grads, global_grads, op=MPI.SUM)
# average the gradient.
global_grads /= comm.Get_size()
```

Then I plotted the training curve using 2 MPI workers, and when the gradient is averaged, the performance drops. In this case, if we don't average the gradient, the update of the network becomes something like:
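One way to see why summing rather than averaging can matter is the following minimal sketch. It assumes a plain SGD update, which is not necessarily the optimizer used in the repository, and it simulates the reduction with NumPy instead of calling MPI. `worker_grads` and `sgd_step` are hypothetical names introduced only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_workers = 2

# Hypothetical flattened gradients, one per worker, standing in for
# flat_grads on each MPI rank.
worker_grads = [rng.normal(size=5) for _ in range(n_workers)]

# An Allreduce with op=MPI.SUM leaves every rank holding the element-wise sum.
summed = np.sum(worker_grads, axis=0)
# Dividing by the number of ranks turns the sum into an average.
averaged = summed / n_workers


def sgd_step(params, grad, lr):
    """Plain SGD update (an assumption made for this sketch)."""
    return params - lr * grad


params = np.zeros(5)
lr = 1e-3

step_with_sum = sgd_step(params, summed, lr)
step_with_scaled_avg = sgd_step(params, averaged, lr * n_workers)

# Under plain SGD, a summed gradient acts like an averaged gradient
# with the learning rate multiplied by the number of workers.
print(np.allclose(step_with_sum, step_with_scaled_avg))
```

For plain SGD the two updates coincide, so skipping the average behaves like a larger effective learning rate. With an adaptive optimizer the relationship is less direct, so this should be read as intuition rather than an explanation of the observed curves.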
This seems to be an important factor, but when I run with a single process, I cannot see any evidence of learning at all... (with 2 processes and an averaged gradient, it at least learns slowly)
@Ericonaldo Yes - I agree, we need to carry out more experiments to verify this. We can use this channel to continue the discussion.
I think the learning rates of both the policy network and the value network are important hyper-parameters for these goal-conditioned tasks. After fine-tuning some values, I found that even a single process can achieve good results.
@Ericonaldo Thanks! This is a great finding.
Hi Tianhong, thanks for sharing the code. I've tried to run your code following the guidance in the README.
But surprisingly, I find that running
does not work at all.
Do you happen to know why it does not work?