How to run the code and replicate the results of an initial policy like PPO? #7

zhaozj89 commented Aug 8, 2024

Hi, thanks very much for sharing this inspiring work!

I have attempted to execute the code to replicate the results using the following scripts:

python train_rl_il.py --env CartPole --algo ppo --steps 1000000

python collect_demo.py --env CartPole --agent-path ./logs/CartPole/ppo/seed0_20240808225352

python train_clf.py --env CartPole --buffer ./demos/CartPole/size20000_std0.1_bias0.1_reward11.pkl

However, the rewards I achieved were significantly lower than those reported in the paper. For instance, here are the training logs from my CartPole environment:

python train_rl_il.py --env CartPole --algo ppo --steps 1000000

> Training with cuda
> No pre-training
> Training PPO...
Num steps: 20000 , Return: 27.9 , Min/Max Return: 20.4 /31.7                    
Num steps: 40000 , Return: 38.3 , Min/Max Return: 27.3 /52.4                    
Num steps: 60000 , Return: 30.4 , Min/Max Return: 25.3 /36.7                    
Num steps: 80000 , Return: 37.1 , Min/Max Return: 29.6 /45.1                    
Num steps: 100000, Return: 19.2 , Min/Max Return: 18.2 /20.8                    
Num steps: 120000, Return: 21.7 , Min/Max Return: 19.2 /26.5                    
Num steps: 140000, Return: 29.6 , Min/Max Return: 24.2 /39.9                    
Num steps: 160000, Return: 38.9 , Min/Max Return: 31.3 /48.5                    
Num steps: 180000, Return: 30.1 , Min/Max Return: 26.5 /34.3                    
Num steps: 200000, Return: 51.7 , Min/Max Return: 32.7 /94.9                    
Num steps: 220000, Return: 43.5 , Min/Max Return: 37.8 /54.0                    
Num steps: 240000, Return: 31.6 , Min/Max Return: 28.3 /33.9                    
Num steps: 260000, Return: 36.2 , Min/Max Return: 29.2 /42.1                    
Num steps: 280000, Return: 30.3 , Min/Max Return: 27.9 /35.6                    
Num steps: 300000, Return: 31.1 , Min/Max Return: 27.5 /36.1                    
Num steps: 320000, Return: 19.7 , Min/Max Return: 18.6 /20.9                    
Num steps: 340000, Return: 46.6 , Min/Max Return: 34.7 /64.5                    
Num steps: 360000, Return: 40.5 , Min/Max Return: 30.2 /53.2                    
Num steps: 380000, Return: 28.8 , Min/Max Return: 24.9 /32.1                    
Num steps: 400000, Return: 48.6 , Min/Max Return: 30.7 /76.2                    
Num steps: 420000, Return: 55.4 , Min/Max Return: 43.2 /78.2                    
Num steps: 440000, Return: 31.7 , Min/Max Return: 29.0 /35.4                    
Num steps: 460000, Return: 39.0 , Min/Max Return: 30.2 /43.9                    
Num steps: 480000, Return: 34.8 , Min/Max Return: 29.2 /41.2                    
Num steps: 500000, Return: 39.7 , Min/Max Return: 34.2 /44.9                    
Num steps: 520000, Return: 50.4 , Min/Max Return: 40.1 /64.2                    
Num steps: 540000, Return: 34.2 , Min/Max Return: 27.9 /43.6                    
Num steps: 560000, Return: 35.3 , Min/Max Return: 29.2 /43.6                    
Num steps: 580000, Return: 43.1 , Min/Max Return: 36.1 /46.3                    
Num steps: 600000, Return: 41.8 , Min/Max Return: 34.1 /49.6                    
Num steps: 620000, Return: 46.7 , Min/Max Return: 34.6 /58.7 
Num steps: 640000, Return: 38.3 , Min/Max Return: 30.2 /41.1                    
Num steps: 660000, Return: 46.7 , Min/Max Return: 39.5 /56.8                    
Num steps: 680000, Return: 41.1 , Min/Max Return: 32.2 /54.3                    
Num steps: 700000, Return: 40.6 , Min/Max Return: 31.7 /53.0                    
Num steps: 720000, Return: 41.5 , Min/Max Return: 30.8 /56.2                    
Num steps: 740000, Return: 35.5 , Min/Max Return: 30.2 /44.1                    
Num steps: 760000, Return: 49.5 , Min/Max Return: 38.3 /69.2                    
Num steps: 780000, Return: 55.8 , Min/Max Return: 42.3 /80.5                    
Num steps: 800000, Return: 59.2 , Min/Max Return: 34.4 /109.7                   
Num steps: 820000, Return: 40.1 , Min/Max Return: 34.6 /54.1                    
Num steps: 840000, Return: 41.6 , Min/Max Return: 28.8 /53.7                    
Num steps: 860000, Return: 40.9 , Min/Max Return: 31.1 /56.4                    
Num steps: 880000, Return: 45.9 , Min/Max Return: 38.9 /50.8                    
Num steps: 900000, Return: 50.6 , Min/Max Return: 33.9 /108.5                   
Num steps: 920000, Return: 56.2 , Min/Max Return: 35.7 /123.4                   
Num steps: 940000, Return: 31.3 , Min/Max Return: 27.7 /35.2                    
Num steps: 960000, Return: 50.9 , Min/Max Return: 35.4 /63.1                    
Num steps: 980000, Return: 38.2 , Min/Max Return: 31.3 /58.3                    
Num steps: 1000000, Return: 45.5 , Min/Max Return: 35.2 /76.3                   
100%|████████████████████████████████| 1000000/1000000 [42:34<00:00, 391.46it/s]
> Done in 2555s
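
As a quick sanity check on my side, I was also considering a cross-check with a reference PPO implementation. This is only a rough sketch, assuming stable-baselines3 and gymnasium are installed; gymnasium's CartPole-v1 may well differ from the CartPole environment used in this repo, so the returns would not be directly comparable — it would only rule out a problem with my local setup:

# Rough sanity check against a reference PPO implementation.
# Assumes stable-baselines3 and gymnasium are installed; CartPole-v1 may
# differ from this repo's CartPole, so this only checks my local setup.
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, verbose=0, seed=0)
model.learn(total_timesteps=200_000)

mean_ret, std_ret = evaluate_policy(model, model.get_env(), n_eval_episodes=20)
print(f"reference PPO on CartPole-v1: mean return {mean_ret:.1f} +/- {std_ret:.1f}")
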
python collect_demo.py --env CartPole --agent-path ./logs/CartPole/ppo/seed0_20240808225352

epi: 0, reward: 11, steps: 20                                                                                                                                                                                      
epi: 1, reward: 21, steps: 73                                                                                                                                                                                      
epi: 2, reward: 8, steps: 33                                                                                                                                                                                       
epi: 3, reward: 10, steps: 50                                                                                                                                                                                      
epi: 4, reward: 17, steps: 27                                                                                                                                                                                      
epi: 5, reward: 6, steps: 15                                                                                                                                                                                       
epi: 6, reward: 22, steps: 52                                                                                                                                                                                      
epi: 7, reward: 10, steps: 24                                                                                                                                                                                      
epi: 8, reward: 11, steps: 18                                                                                                                                                                                      
epi: 9, reward: 3, steps: 9                                                                                                                                                                                        
epi: 10, reward: 17, steps: 35                                                                                                                                                                                     
epi: 11, reward: 15, steps: 32                                                                                                                                                                                     
epi: 12, reward: 10, steps: 18                                                                                                                                                                                     
epi: 13, reward: 7, steps: 17                                                                                                                                                                                      
epi: 14, reward: 4, steps: 11                                                                                                                                                                                      
epi: 15, reward: 14, steps: 30                                                                                                                                                                                     
epi: 16, reward: 11, steps: 21                                                                                                                                                                                     
epi: 17, reward: 8, steps: 17                                                                                                                                                                                      
epi: 18, reward: 12, steps: 24                                                                                                                                                                                     
epi: 19, reward: 10, steps: 56                                                                                                                                                                                     
epi: 20, reward: 13, steps: 28                                                                                                                                                                                     
epi: 21, reward: 6, steps: 12                                                                                                                                                                                      
epi: 22, reward: 14, steps: 26                                                                                                                                                                                     
epi: 23, reward: 11, steps: 17                                                                                                                                                                                     
epi: 24, reward: 12, steps: 23                                                                                                                                                                                     
epi: 25, reward: 18, steps: 39                                                                                                                                                                                     
epi: 26, reward: 11, steps: 25                                                                                                                                                                                     
epi: 27, reward: 8, steps: 52                                                                                                                                                                                      
epi: 28, reward: 17, steps: 32             
...
epi: 666, reward: 19, steps: 73                                                                                                                                                                                    
epi: 667, reward: 9, steps: 19                                                                                                                                                                                     
epi: 668, reward: 6, steps: 12                                                                                                                                                                                     
epi: 669, reward: 4, steps: 9                                                                                                                                                                                      
epi: 670, reward: 11, steps: 65                                                                                                                                                                                    
epi: 671, reward: 9, steps: 19                                                                                                                                                                                     
epi: 672, reward: 11, steps: 29                                                                                                                                                                                    
epi: 673, reward: 7, steps: 15                                                                                                                                                                                     
epi: 674, reward: 13, steps: 25                                                                                                                                                                                    
epi: 675, reward: 6, steps: 13                                                                                                                                                                                     
epi: 676, reward: 12, steps: 37                                                                                                                                                                                    
epi: 677, reward: 14, steps: 24                                                                                                                                                                                    
epi: 678, reward: 9, steps: 30                                                                                                                                                                                     
epi: 679, reward: 17, steps: 35                                                                                                                                                                                    
epi: 680, reward: 11, steps: 22                                                                                                                                                                                    
epi: 681, reward: 15, steps: 46                                                                                                                                                                                    
epi: 682, reward: 6, steps: 34                                                                                                                                                                                     
epi: 683, reward: 24, steps: 45                                                                                                                                                                                    
epi: 684, reward: 18, steps: 52                                                                                                                                                                                    
epi: 685, reward: 14, steps: 27                                                                                                                                                                                    
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20000/20000 [01:24<00:00, 236.97it/s]
> buffer saved, mean reward: 11.35, std: 6.36
> video saved
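
Since the CLF training in the next step consumes this demo buffer (mean reward 11.35 above), I also wanted to double-check what was actually saved. The snippet below is just a generic inspection sketch — it makes no assumptions about the buffer layout beyond it being a pickle, and the path is simply the file produced by collect_demo.py in my run:

# Inspect the collected demo buffer to confirm what was saved.
import pickle

path = "./demos/CartPole/size20000_std0.1_bias0.1_reward11.pkl"
with open(path, "rb") as f:
    buf = pickle.load(f)

print(type(buf))
if isinstance(buf, dict):
    # Print each entry's type and shape (if any) without assuming a layout.
    for key, val in buf.items():
        print(key, type(val).__name__, getattr(val, "shape", None))
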
python train_clf.py --env CartPole --buffer ./demos/CartPole/size20000_std0.1_bias0.1_reward11.pkl

> Training with cuda
> Using tuned hyper-parameters
> Training BC policy...
iter: 0, loss: 5.90e-01                                                         
iter: 500, loss: 4.12e-01                                                       
iter: 1000, loss: 3.98e-01                                                      
iter: 1500, loss: 3.48e-01                                                      
iter: 2000, loss: 3.62e-01                                                      
iter: 2500, loss: 4.65e-01                                                      
iter: 3000, loss: 3.31e-01                                                      
iter: 3500, loss: 3.60e-01                                                      
iter: 4000, loss: 3.94e-01                                                      
iter: 4500, loss: 3.55e-01                                                      
iter: 4999, loss: 3.23e-01                                                      
100%|██████████████████████████████████████| 5000/5000 [00:34<00:00, 146.67it/s]
> Done
> Evaluating BC policy
average reward: 20.88, average length: 28.40
> Done
> ----------- Outer iter: 1 -----------
> Training environment model...
iter: 0, loss: 1.11e+01                                                         
iter: 500, loss: 3.52e-01                                                       
iter: 1000, loss: 1.24e+00                                                      
iter: 1500, loss: 4.83e-01                        
...
> ----------- Outer iter: 2 -----------
> Training environment model...
iter: 0, loss: 2.45e-01                                                         
iter: 500, loss: 9.09e-02                                                       
iter: 1000, loss: 1.77e-01                                                      
iter: 1500, loss: 9.78e-02                                                      
iter: 2000, loss: 2.33e-01                                                      
iter: 2500, loss: 2.25e-01                                                      
iter: 3000, loss: 8.33e-02                                                      
iter: 3500, loss: 1.53e-01                                                      
iter: 4000, loss: 1.63e-01                                                      
iter: 4500, loss: 9.96e-02                                                      
iter: 4999, loss: 1.57e-01                                                      
100%|██████████████████████████████████████| 5000/5000 [00:28<00:00, 174.13it/s]
> Done
> Training CLF controller...
iter: 200, loss: 3.63e+01                                                       
iter: 400, loss: 2.67e+01                                                       
iter: 600, loss: 2.71e+01                                                       
iter: 800, loss: 2.68e+01                                                       
iter: 1000, loss: 2.09e+01                                                      
iter: 1200, loss: 1.99e+01                                                      
iter: 1400, loss: 1.89e+01                                                      
iter: 1600, loss: 1.81e+01                                                      
iter: 1800, loss: 2.69e+01                                                      
iter: 1999, loss: 2.06e+01                                                      
100%|███████████████████████████████████████| 1999/1999 [09:43<00:00,  3.42it/s]
> Evaluating policy...
average reward: 26.57, average length: 38.80
> Done
> Collecting demos...
mean reward: 17.36
> Done
> Final demo saved
> Iter time: 637s, total time: 1288s
...
> ----------- Outer iter: 48 -----------
> Training environment model...
iter: 0, loss: 5.08e-03                                                         
iter: 500, loss: 2.13e-03                                                       
iter: 1000, loss: 3.25e-01                                                      
iter: 1500, loss: 7.04e-03                                                      
iter: 2000, loss: 1.06e-02                                                      
iter: 2500, loss: 1.67e-03                                                      
iter: 3000, loss: 5.57e-02                                                      
iter: 3500, loss: 2.24e-03                                                      
iter: 4000, loss: 2.24e-03                                                      
iter: 4500, loss: 1.56e-01                                                      
iter: 4999, loss: 9.13e-03                                                      
100%|██████████████████████████████████████| 5000/5000 [00:23<00:00, 211.62it/s]
> Done
> Training CLF controller...
iter: 200, loss: 5.55e+00                                                       
iter: 400, loss: 5.22e+00                                                       
iter: 600, loss: 6.63e+00                                                       
iter: 800, loss: 5.17e+00                                                       
iter: 1000, loss: 5.09e+00                                                      
iter: 1200, loss: 4.63e+00                                                      
iter: 1400, loss: 8.60e+00                                                      
iter: 1600, loss: 5.35e+00                                                      
iter: 1800, loss: 4.95e+00                                                      
iter: 1999, loss: 5.37e+00                                                      
100%|███████████████████████████████████████| 1999/1999 [01:47<00:00, 18.52it/s]
> Evaluating policy...
average reward: 69.63, average length: 97.40
> Done
> Collecting demos...
mean reward: 34.59
> Done
> Final demo saved
> Iter time: 146s, total time: 10898s
> ----------- Outer iter: 49 -----------
> Training environment model...
iter: 0, loss: 7.01e-02                                                         
iter: 500, loss: 9.32e-02                                                       
iter: 1000, loss: 5.55e-02                                                      
iter: 1500, loss: 7.36e-03                                                      
iter: 2000, loss: 1.15e-02                                                      
iter: 2500, loss: 1.96e-03                                                      
iter: 3000, loss: 1.28e-02                                                      
iter: 3500, loss: 3.23e-03                                                      
iter: 4000, loss: 2.97e-03                                                      
iter: 4500, loss: 1.22e-03                                                      
iter: 4999, loss: 3.33e-03                                                      
100%|██████████████████████████████████████| 5000/5000 [00:24<00:00, 207.89it/s]
> Done
> Training CLF controller...
iter: 200, loss: 5.38e+00                                                       
iter: 400, loss: 6.41e+00                                                       
iter: 600, loss: 5.36e+00                                                       
iter: 800, loss: 5.71e+00                                                       
iter: 1000, loss: 5.38e+00                                                      
iter: 1200, loss: 4.79e+00                                                      
iter: 1400, loss: 5.99e+00                                                      
iter: 1600, loss: 5.13e+00                                                      
iter: 1800, loss: 4.94e+00                                                      
iter: 1999, loss: 5.89e+00                                                      
100%|███████████████████████████████████████| 1999/1999 [01:47<00:00, 18.66it/s]
> Evaluating policy...
average reward: 74.99, average length: 101.80
> Done
> Collecting demos...
mean reward: 34.93
> Done
> Final demo saved
> Iter time: 145s, total time: 11043s
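
To compare against Figure 5 more directly, the "average reward" lines printed after each outer iteration can be parsed and plotted. A rough helper I put together for this is below (it assumes the train_clf.py console output is redirected to a log file in the format shown above; the file name is just my own choice):

# Parse "average reward" per outer iteration from the train_clf.py log
# (format as shown above) and plot reward vs. outer iteration.
import re
import matplotlib
matplotlib.use("Agg")  # allow saving the figure on a headless machine
import matplotlib.pyplot as plt

log_path = "train_clf_cartpole.log"  # stdout of train_clf.py, redirected to a file
iters, rewards = [], []
current_iter = None

with open(log_path) as f:
    for line in f:
        m = re.search(r"Outer iter:\s*(\d+)", line)
        if m:
            current_iter = int(m.group(1))
            continue
        m = re.search(r"average reward:\s*([\d.]+)", line)
        if m and current_iter is not None:
            iters.append(current_iter)
            rewards.append(float(m.group(1)))
            current_iter = None  # keep only the first evaluation per outer iter

plt.plot(iters, rewards, marker="o")
plt.xlabel("Outer iteration")
plt.ylabel("Average reward")
plt.title("CartPole: policy reward per outer iteration")
plt.savefig("cartpole_reward_vs_iter.png")
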

For both the PPO and LYGE policies, the outcomes are substantially lower than those reported in the paper (refer to Figure 5 below).

[screenshot: Figure 5 from the paper]

For the first step (training the PPO initial policy), I have also experimented with other environments such as InvertedPendulum, F16GCAS, and NeuralLander, each for 1 million steps, and in all of them the rewards were notably below the figures presented in Figure 5 of the paper (InvertedPendulum: approximately 100-200, F16GCAS: approximately 100, NeuralLander: approximately 2000).

Could you point out any missteps I may have made, or anything I may have misinterpreted in the paper or the code? Thanks very much for your attention and time!
