A question about the buffer size #30
-
I would like to know the buffer size used for the SMAC scenarios in the paper. I reproduced the experiment with a buffer size of 100,000 on the 3m_good dataset and found that BC and BCQ performed significantly better than the results shown in your paper. I also noticed that the original size of this dataset is 120,569. I suspect the reason is that the 20,569 trajectories I did not load are relatively random or inferior trajectories.
Replies: 30 comments
-
Hi @zyh1999, I suspect the difference in performance is due to the missing trajectories. The results in the paper used all of the trajectories. Can you try re-running your experiments, making sure to set the buffer size to be greater than the number of trajectories in the respective dataset? You can get the number of trajectories in each dataset from the table in the appendix of the paper. I'll work on a feature to set the buffer size automatically.
-
Thanks. I have tested the 3m_good buffer with all 120,569 trajectories, but the performance of BC is still much better than the results shown in your paper. Another thing that confuses me: I checked the average return for the 3m_good buffer (mean episode return), and it is about 10.69, far from the 16.0 in your paper.
-
Hi there, I will be back at my PC on Monday and will be able to investigate the discrepancy in the reported performance for BC on 3m then. In the meantime, I wanted to respond to your question about the episode return. There is a problem with how you are computing the average return. Each sample contains 20 sequential timesteps, which are not necessarily an entire episode (episodes in 3m are usually around 50 timesteps long), so an episode may be split across several samples. When the remainder of an episode does not fill the entire 20 timesteps in a sample, we zero-pad the end of the sample. To work around this, I recorded the episode_return associated with each sample in the dataset, which can be accessed from the sample. I have also added an example of how to compute the mean episode return to the script. If it would be helpful to you, I can work on stitching the samples together into entire episodes, rather than 20-timestep snippets.
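To illustrate why summing the rewards inside each sample underestimates the return, here is a minimal sketch. It does not use the OG-MARL dataset API: the dictionary layout, the `split_episode` helper, and the reward values are all hypothetical, and only the 20-timestep sample length, the zero-padding, and the per-sample `episode_return` field come from the description above.

```python
SAMPLE_LEN = 20  # timesteps per stored sample, as described above


def split_episode(rewards, sample_len=SAMPLE_LEN):
    """Split one episode's rewards into fixed-length, zero-padded samples.

    Hypothetical helper: every sample also records the return of the
    full episode it was cut from.
    """
    episode_return = sum(rewards)
    samples = []
    for start in range(0, len(rewards), sample_len):
        chunk = rewards[start:start + sample_len]
        chunk = chunk + [0.0] * (sample_len - len(chunk))  # zero-pad the tail
        samples.append({"rewards": chunk, "episode_return": episode_return})
    return samples


# Two made-up episodes: 50 steps with reward 1.0, and 30 steps with reward 0.5.
episodes = [[1.0] * 50, [0.5] * 30]
samples = [s for ep in episodes for s in split_episode(ep)]

# Naive: treat every 20-step sample as if it were a whole episode.
naive_mean = sum(sum(s["rewards"]) for s in samples) / len(samples)

# Using the recorded field: average the episode return stored with each sample.
recorded_mean = sum(s["episode_return"] for s in samples) / len(samples)

print(len(samples), naive_mean, recorded_mean)
```

Note that averaging `episode_return` over samples counts a long episode once per sample it was split into, so it is not identical to a per-episode mean. Whether that matters depends on how the paper's statistic was defined, which this sketch does not assume.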
-
Thanks. So do you mean that, in the buffer, a complete trajectory may be divided into two samples, so that the average return I calculated is smaller than the true one? May I ask why you split trajectories this way? Is there a problem with keeping complete trajectories?
-
Another thing I would like to ask is whether these baselines eventually converge, or whether they only reach a higher point earlier in training. For example, when I ran QMIX on the 3m_good scenario, its performance reached around 13.8 in the early stages of training and then gradually decreased to a much lower value.
-
The reason the samples are only portions of an entire trajectory is simply a relic of how my replay buffer was implemented. It was convenient to unroll the recurrent neural networks over shorter sequences (e.g. 20 timesteps) rather than the full episode, because longer sequences pose several challenges. For example, in environments with many agents, it can be challenging to do the RNN unrolling calculation across all agents and the entire episode without running out of VRAM on the GPU. Using shorter sequences was an easy workaround. Furthermore, training RNN policies on long sequences can sometimes cause instability during training, and simply using shorter sequences can mitigate that. Having said that, I do want to add support to OG-MARL for loading entire trajectories from the dataset. Now that I am back at work, I'll try to implement it as soon as possible.
-
With regard to your second question: when training offline, the main challenge is compounding extrapolation error on out-of-distribution actions. Basically, the neural network is likely to erroneously assign a high value to out-of-distribution actions (i.e. actions not seen in the dataset). This is called extrapolation error. In online RL such erroneous extrapolations are quickly corrected through interaction with and feedback from the environment. In offline RL such feedback is not available, so the errors are not corrected. These errors then compound during the course of training because of the Bellman update, where the value of a given state-action pair is updated to be the reward plus the max value over next actions. That max operation means that erroneously high values are preferred, which ultimately causes the performance of the policy to collapse. Thus, in offline RL, the longer one trains, the more likely it is that performance has begun to degrade due to compounding extrapolation error on out-of-distribution actions. I hope this clarifies things for you.
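The way the max operator favours erroneously high values can be seen in a toy simulation. This is only a sketch under stated assumptions and not code from OG-MARL: the true value of every next action is assumed to be 0, and the value estimates are assumed to carry independent Gaussian noise (the constants are arbitrary).

```python
import random

random.seed(0)

NUM_ACTIONS = 5      # hypothetical number of next actions
NUM_TRIALS = 10_000  # number of simulated target computations
NOISE_STD = 1.0      # assumed standard deviation of the estimation error

# The true Q-value of every next action is 0, so the true max is also 0.
max_estimates = []
single_estimates = []
for _ in range(NUM_TRIALS):
    noisy_q = [random.gauss(0.0, NOISE_STD) for _ in range(NUM_ACTIONS)]
    max_estimates.append(max(noisy_q))   # what a Bellman target of the form r + gamma * max_a Q(s', a) uses
    single_estimates.append(noisy_q[0])  # one fixed action, for comparison

mean_of_max = sum(max_estimates) / NUM_TRIALS
mean_of_single = sum(single_estimates) / NUM_TRIALS

print(f"mean of a single noisy estimate: {mean_of_single:.3f}")
print(f"mean of the max over actions:    {mean_of_max:.3f}")
```

Each individual estimate is unbiased, yet the max over them is biased upwards. In online RL the agent would try the overvalued action and observe its real return; offline, that bias is bootstrapped into later targets instead.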
-
So you mean that you selected the highest performance reached during the training period as the data for the table in your paper, rather than the final performance?
-
I agree that reporting the best performance during training would not be fair in the offline setting. Therefore, we did not do that. We set an offline training budget (50,000 offline training steps in SMAC) and then reported the final performance at the end of that budget, for all algorithms. Furthermore, we tuned hyperparameters on 3m only and kept them fixed on all other scenarios. For your reference, here is a WandB report I made with all of the runs. You can see that I reported the final performance of each run in the table. https://api.wandb.ai/links/off-the-grid-marl-team/0iopyeen
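The difference between the two reporting conventions can be made concrete with a toy evaluation curve. The numbers below are invented for illustration (only the 13.8 peak echoes the QMIX run mentioned earlier in this thread) and are not results from the paper.

```python
# Hypothetical evaluation returns logged at regular intervals during one offline run.
eval_returns = [6.2, 13.8, 12.4, 9.1, 7.3]

# Optimistic: picking this checkpoint requires evaluating in the environment during training.
best_during_training = max(eval_returns)

# Value at the end of the fixed offline training budget.
final_performance = eval_returns[-1]

print(f"best={best_during_training}, final={final_performance}")
```

Selecting the best checkpoint implicitly uses online evaluation, which is why it is usually considered unfair in a purely offline setting.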
-
Thank you for your detailed explanation!
-
Hello, I noticed that the hyperparameters in the appendix say you use a soft target update rate for QMIX, but I found it has been removed in og-marl/og_marl/tf2/systems/idrqn.py (line 267 in e752264). So which version was used for the original results in the paper?
-
By the way, I tested BC on terran_5_5, and it is also much higher than the performance in your paper. My BC performance is similar to the "sample mean return".
-
Hi, have you investigated the BC discrepancy yet?
-
Sorry for the late reply. We have been very busy recently. We are releasing a fairly big update to OG-MARL soon, which should make it easier to use. But to try and answer your questions:
In the paper we used soft target updates. Since refactoring the code, we have switched to hard target updates to be consistent with the MAICQ implementation.
I am re-running the refactored baselines on terran_5_vs_5 to see if there is a discrepancy in the BC results. I will share the results with you as soon as possible.
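For readers unfamiliar with the distinction, here is a minimal sketch of the two target-network update rules. It is not taken from the OG-MARL code: weights are plain Python lists, and `tau` and `period` are hypothetical hyperparameter names.

```python
def soft_update(target, online, tau):
    """Polyak averaging: move each target weight a fraction tau toward the online weight."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]


def hard_update(target, online, step, period):
    """Copy the online weights into the target every `period` training steps."""
    return list(online) if step % period == 0 else target


online_weights = [1.0, 4.0]
target_weights = [0.0, 2.0]

# Soft update: applied every step with a small tau (0.5 here only to keep the numbers readable).
print(soft_update(target_weights, online_weights, tau=0.5))  # [0.5, 3.0]

# Hard update: the target is unchanged except on steps that are multiples of `period`.
print(hard_update(target_weights, online_weights, step=200, period=200))  # [1.0, 4.0]
print(hard_update(target_weights, online_weights, step=201, period=200))  # [0.0, 2.0]
```

The two schemes change how quickly the bootstrapped targets move, so results obtained with one are not guaranteed to match results obtained with the other.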
-
Hi @zyh1999, here are the results from my latest benchmarking sweep: https://api.wandb.ai/links/off-the-grid-marl-team/fmnpbz44 The mean performance of BC on terran_5_vs_5 seems in line with what we reported in the paper. It may be slightly better (8.4 rather than 7.3), but I only ran 5 seeds this time. I also ran these experiments using the code on a branch that we will be merging into main today.
-
Thanks for your latest report. It also seems that QMIX collapses completely by the end of training, which differs from the original results and matches what I described in my earlier comment.
-
And in terran_5_5, it seems the performance of most of the methods except DBC has become lower than the original. I will also reproduce them with my code (based on https://github.com/oxwhirl/pymarl/) soon.
-
Hello, since you have updated the code and fixed some bugs, the experimental results you presented afterwards (as mentioned above) differ slightly from those in the original paper. I sincerely hope that you can update the experimental results in the paper, as it will make comparisons easier for our subsequent work. Thank you very much!
-
Hi @zyh1999, thanks for all the discussion! I'll update the results in the paper this week. Good luck with your research! I look forward to reading it when you have published it.
-
Hi, have you updated the experimental results? I ask because I just found that when running some scenarios (e.g., 5m_vs_6m_good) with the original MAICQ code (https://github.com/yiqinyang/icq), the loss quickly explodes to inf. I would like to know whether you have encountered this problem with your reimplemented MAICQ (TF version).
-
Also, I would like to know how many different well-trained models were typically used to collect the data in a dataset such as "Good" (excluding any additional perturbation data), in order to ensure diversity among policies of similar performance?
-
Hello, have the results in the paper been updated?
-
Hi there, sorry about the late reply. We used at least 4 independently trained policies for each dataset, and we included epsilon=0.05 exploration to ensure diversity. Concerning the results, we have not yet finished all the benchmarks we were hoping to do before we update the paper. But if you let me know which results are particularly urgent for you, I can update those. Are they the SMAC results? And do you want the results in the paper to reflect the results here: https://instadeepai.github.io/og-marl/baselines/smac_v1/ ?
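As a rough illustration of what epsilon=0.05 exploration could look like during data collection, here is a sketch that assumes standard epsilon-greedy action selection over the legal actions. The thread does not state the exact exploration scheme, so the function name, its arguments, and the use of Q-values are all assumptions.

```python
import random


def epsilon_greedy(q_values, legal_actions, epsilon=0.05, rng=random):
    """With probability epsilon pick a random legal action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(legal_actions)
    return max(legal_actions, key=lambda a: q_values[a])


rng = random.Random(0)
q_values = [0.1, 0.5, 0.9]  # made-up action values for one agent
actions = [epsilon_greedy(q_values, [0, 1, 2], epsilon=0.05, rng=rng) for _ in range(1000)]

# Most actions are greedy (action 2); a small fraction are random deviations.
print(sum(a != 2 for a in actions) / len(actions))
```

With several independently trained policies, these occasional random actions spread the collected trajectories over more states than any single greedy policy would visit.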
-
The most urgent thing for me is that I found the loss of the official MAICQ code (https://github.com/yiqinyang/icq) sometimes quickly explodes to inf (e.g. on 8m and 5m_vs_6m). This does not seem to happen with your TensorFlow version of MAICQ, so I would like to know whether there are any differences between them. In other words, the MAICQ results I reproduced with the official code are not always as stable as yours.
-
Yeah, I also found that their official implementation can be unstable. Even my version explodes on some datasets, such as our Flatland datasets. But as you found, our version is at least always stable on SMAC. Unfortunately, I am not 100% sure why that is. I am not aware of any major differences between our implementations, but I did implement it quite a while ago, so I may have forgotten about something. This kind of instability can also come from a very minor implementation detail, which may be hard to track down.
-
Sorry for bothering you again. I had a quick look at your TensorFlow code for MAICQ, and there seem to be two minor differences. First, it appears that you do not use "td_lambda_targets" (og_marl/tf2/systems/maicq.py, line 249). Second, you do not apply "clip_grad_norm" during gradient descent. Did I miss anything? Additionally, I would like to ask whether you recently modified the network architecture in the og-marl version of MAICQ. I noticed that you seem to have added a CNN embedding. Will this have any effect on performance in the SMAC environment, considering that the input for SMAC is not an image?
-
Hi there, if I were you, I would try removing gradient clipping from their implementation. I have sometimes found that it affects the stability of algorithms that use mixing networks. With regard to TD(lambda), I had implemented it previously and found it did not help that much, so I removed it. So I do not think that is the reason the original MAICQ is failing. The CNN embedding is not used in SMAC. I only add a CNN embedding for environments with pixel observations (e.g. the PettingZoo datasets).
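For reference, this is roughly what clipping by global gradient norm does, which is the operation that PyTorch's `clip_grad_norm_` performs on a model's parameters. The sketch below is a pure-Python illustration on a flat list of gradient values; it is not code from either MAICQ implementation.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients by one common factor so their joint L2 norm is at most max_norm."""
    global_norm = math.sqrt(sum(g * g for g in grads))
    if global_norm <= max_norm:
        return list(grads), global_norm  # small enough: leave unchanged
    scale = max_norm / global_norm
    return [g * scale for g in grads], global_norm


clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=2.5)
print(clipped, norm)  # [1.5, 2.0] 5.0
```

Because every gradient is scaled by the same factor, a large gradient in one part of the model (for example a mixing network) also shrinks the update applied to every other part. That is one plausible way clipping could interact with stability, though this thread does not establish it as the cause.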
-
OK, thanks, I will try it.
-
Hello, may I ask whether it is possible to obtain the original dataset without trajectory splitting, or whether you could provide guidance on how to recover the complete trajectories from the existing code? I have observed that ICQ's mini-batch softmax approximation incurs larger errors when trajectories are split. In other words, instead of performing the softmax over a batch of s_t, it now performs the softmax over s_t and s_{t+20} together.
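To make the concern concrete, here is a small sketch of softmax weights computed across the elements of a mini-batch. It is a simplified stand-in and not ICQ's actual code: `alpha` is a hypothetical temperature, and the Q-values are made up. It only shows that the weight assigned to one element depends on which other elements share its batch.

```python
import math


def batch_softmax_weights(q_values, alpha=1.0):
    """Softmax of q / alpha taken across the batch dimension (numerically stabilised)."""
    m = max(q_values)
    exps = [math.exp((q - m) / alpha) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]


# The same first element (Q = 1.0) batched with two different companions.
same_scale = batch_softmax_weights([1.0, 1.0])
mixed_scale = batch_softmax_weights([1.0, 3.0])

print(same_scale[0])   # 0.5
print(mixed_scale[0])  # noticeably smaller, because the companion has a higher Q-value
```

If Q-values differ systematically between early and late timesteps of an episode, mixing them in one softmax changes the relative weights compared with normalising over states from the same timestep, which is the effect described above.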
-
Could you look at this tutorial and let me know if it does what you want?
https://colab.research.google.com/github/instadeepai/og-marl/blob/main/examples/dataset_api_demo.ipynb