Hi, thanks very much for sharing this inspiring work!

I have attempted to execute the code to replicate the results using the following scripts:
However, the rewards I achieved were significantly lower than those reported in the paper. For instance, here are the training logs from my CartPole environment:
```
python train_rl_il.py --env CartPole --algo ppo --steps 1000000
> Training with cuda
> No pre-training
> Training PPO...
Num steps: 20000 , Return: 27.9 , Min/Max Return: 20.4 /31.7
Num steps: 40000 , Return: 38.3 , Min/Max Return: 27.3 /52.4
Num steps: 60000 , Return: 30.4 , Min/Max Return: 25.3 /36.7
Num steps: 80000 , Return: 37.1 , Min/Max Return: 29.6 /45.1
Num steps: 100000, Return: 19.2 , Min/Max Return: 18.2 /20.8
Num steps: 120000, Return: 21.7 , Min/Max Return: 19.2 /26.5
Num steps: 140000, Return: 29.6 , Min/Max Return: 24.2 /39.9
Num steps: 160000, Return: 38.9 , Min/Max Return: 31.3 /48.5
Num steps: 180000, Return: 30.1 , Min/Max Return: 26.5 /34.3
Num steps: 200000, Return: 51.7 , Min/Max Return: 32.7 /94.9
Num steps: 220000, Return: 43.5 , Min/Max Return: 37.8 /54.0
Num steps: 240000, Return: 31.6 , Min/Max Return: 28.3 /33.9
Num steps: 260000, Return: 36.2 , Min/Max Return: 29.2 /42.1
Num steps: 280000, Return: 30.3 , Min/Max Return: 27.9 /35.6
Num steps: 300000, Return: 31.1 , Min/Max Return: 27.5 /36.1
Num steps: 320000, Return: 19.7 , Min/Max Return: 18.6 /20.9
Num steps: 340000, Return: 46.6 , Min/Max Return: 34.7 /64.5
Num steps: 360000, Return: 40.5 , Min/Max Return: 30.2 /53.2
Num steps: 380000, Return: 28.8 , Min/Max Return: 24.9 /32.1
Num steps: 400000, Return: 48.6 , Min/Max Return: 30.7 /76.2
Num steps: 420000, Return: 55.4 , Min/Max Return: 43.2 /78.2
Num steps: 440000, Return: 31.7 , Min/Max Return: 29.0 /35.4
Num steps: 460000, Return: 39.0 , Min/Max Return: 30.2 /43.9
Num steps: 480000, Return: 34.8 , Min/Max Return: 29.2 /41.2
Num steps: 500000, Return: 39.7 , Min/Max Return: 34.2 /44.9
Num steps: 520000, Return: 50.4 , Min/Max Return: 40.1 /64.2
Num steps: 540000, Return: 34.2 , Min/Max Return: 27.9 /43.6
Num steps: 560000, Return: 35.3 , Min/Max Return: 29.2 /43.6
Num steps: 580000, Return: 43.1 , Min/Max Return: 36.1 /46.3
Num steps: 600000, Return: 41.8 , Min/Max Return: 34.1 /49.6
Num steps: 620000, Return: 46.7 , Min/Max Return: 34.6 /58.7
Num steps: 640000, Return: 38.3 , Min/Max Return: 30.2 /41.1
Num steps: 660000, Return: 46.7 , Min/Max Return: 39.5 /56.8
Num steps: 680000, Return: 41.1 , Min/Max Return: 32.2 /54.3
Num steps: 700000, Return: 40.6 , Min/Max Return: 31.7 /53.0
Num steps: 720000, Return: 41.5 , Min/Max Return: 30.8 /56.2
Num steps: 740000, Return: 35.5 , Min/Max Return: 30.2 /44.1
Num steps: 760000, Return: 49.5 , Min/Max Return: 38.3 /69.2
Num steps: 780000, Return: 55.8 , Min/Max Return: 42.3 /80.5
Num steps: 800000, Return: 59.2 , Min/Max Return: 34.4 /109.7
Num steps: 820000, Return: 40.1 , Min/Max Return: 34.6 /54.1
Num steps: 840000, Return: 41.6 , Min/Max Return: 28.8 /53.7
Num steps: 860000, Return: 40.9 , Min/Max Return: 31.1 /56.4
Num steps: 880000, Return: 45.9 , Min/Max Return: 38.9 /50.8
Num steps: 900000, Return: 50.6 , Min/Max Return: 33.9 /108.5
Num steps: 920000, Return: 56.2 , Min/Max Return: 35.7 /123.4
Num steps: 940000, Return: 31.3 , Min/Max Return: 27.7 /35.2
Num steps: 960000, Return: 50.9 , Min/Max Return: 35.4 /63.1
Num steps: 980000, Return: 38.2 , Min/Max Return: 31.3 /58.3
Num steps: 1000000, Return: 45.5 , Min/Max Return: 35.2 /76.3
100%|████████████████████████████████| 1000000/1000000 [42:34<00:00, 391.46it/s]
> Done in 2555s
```
For both the PPO and LYGE policies, the outcomes are substantially lower than those reported in the paper (refer to Figure 5 below).
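For reference, this is the small helper I used to turn the console output above into a return curve for the comparison. It is only a quick sketch: it assumes the exact log format printed above, and the file name `train_log.txt` (a saved copy of the console output) is hypothetical.

```python
import re
import matplotlib.pyplot as plt

# Matches lines like:
#   Num steps: 20000 , Return: 27.9 , Min/Max Return: 20.4 /31.7
pattern = re.compile(
    r"Num steps:\s*(\d+)\s*,\s*Return:\s*([\d.]+)\s*,"
    r"\s*Min/Max Return:\s*([\d.]+)\s*/\s*([\d.]+)"
)

steps, mean_ret, min_ret, max_ret = [], [], [], []
with open("train_log.txt") as f:  # hypothetical file holding the console output above
    for line in f:
        m = pattern.search(line)
        if m:
            steps.append(int(m.group(1)))
            mean_ret.append(float(m.group(2)))
            min_ret.append(float(m.group(3)))
            max_ret.append(float(m.group(4)))

# Mean return with a min/max band, for a rough side-by-side with Figure 5.
plt.plot(steps, mean_ret, label="mean return")
plt.fill_between(steps, min_ret, max_ret, alpha=0.3, label="min/max return")
plt.xlabel("environment steps")
plt.ylabel("return")
plt.title("PPO on CartPole (1M steps)")
plt.legend()
plt.savefig("cartpole_ppo_returns.png")
```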
For the first step (training the initial PPO policy), I have also experimented with other environments, namely InvertedPendulum, F16GCAS, and NeuralLander, each for 1 million steps. The rewards in all of these environments were likewise notably below the values reported in Figure 5 of the paper (InvertedPendulum: approximately 100-200, F16GCAS: approximately 100, NeuralLander: approximately 2000).
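Each of these runs used the same command as the CartPole run above with only the environment name changed, e.g. via a small driver like the sketch below (this assumes `--env` accepts these exact names and that all other flags can be left as in the CartPole command):

```python
import subprocess

# Launch the first-stage PPO training for each environment with the same
# flags as the CartPole run above; the --env values are assumed to match
# the identifiers used by the repository.
for env in ["InvertedPendulum", "F16GCAS", "NeuralLander"]:
    subprocess.run(
        ["python", "train_rl_il.py", "--env", env, "--algo", "ppo", "--steps", "1000000"],
        check=True,
    )
```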
Could you let me know whether I have missed any steps, or whether I am misinterpreting the paper or the code? Thanks very much for your attention and time!