RL with PPO on the Cartpole environment

This GitHub repo documents my progress learning proximal policy optimization (PPO). So far, I have attempted to solve the CartPole environment with PPO, using this page as a source of inspiration and guidance: https://medium.com/deeplearningmadeeasy/simple-ppo-implementation-e398ca0f2e7c
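
For context, the heart of PPO is the clipped surrogate objective: the probability ratio between the new and old policies is clipped so a single update can't move the policy too far. Here's a minimal PyTorch-style sketch of that loss, written from the paper rather than taken from this repo's code:

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective from the PPO paper (negated for gradient descent)."""
    ratio = torch.exp(new_log_probs - old_log_probs)  # pi_theta(a|s) / pi_theta_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```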

Very initial attempt with a smaller network:

(Video: Screen.Recording.2024-06-23.at.7.59.52.PM.mov)

Second attempt with a larger network:

(Video: ppo_better.mov)

This project is still a work in progress for me, and I will be gradually updating it with my results. In my initial attempts at solving the problem, I'm noticing a couple of details:

  • With a smaller network (1st attempt), the algorithm loves to tilt the pole slightly in one direction and move continuously that way. This earns a mediocre but good-enough reward, and the model struggles to overcome this plateau
  • With a larger network (2nd attempt) and the LeakyReLU activation function, the model actually learns to balance, yet the cart still tends to drift gradually in one direction. I think modifying the reward function to incentivize the cart to stay near the center might be important - I'll try this next (see the sketch after this list)
  • I think the termination step of each episode is actually very important for the model to learn from. Receiving a single terminal reward with no predicted future rewards (because the episode has ended) pushes the model to make changes that prevent the episode from ending
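
One way to add that centering incentive is a small penalty proportional to the cart's distance from the middle of the track. Below is a minimal sketch of such a reward-shaping wrapper, assuming the Gymnasium API; the wrapper name and penalty coefficient are illustrative, not something this repo uses yet:

```python
import gymnasium as gym

class CenteringReward(gym.Wrapper):
    """Illustrative wrapper: subtract a small penalty proportional to the
    cart's distance from the center of the track (observation index 0)."""

    def __init__(self, env, penalty_coef=0.05):
        super().__init__(env)
        self.penalty_coef = penalty_coef

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        cart_position = obs[0]  # CartPole obs: [x, x_dot, theta, theta_dot]
        reward -= self.penalty_coef * abs(cart_position)
        return obs, reward, terminated, truncated, info

env = CenteringReward(gym.make("CartPole-v1"))
```

Keeping the penalty small relative to the +1 survival reward should discourage drift without drowning out the balancing signal.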

I'm planning to keep working on this environment until the model can consistently score an average of 195 over 100 consecutive episodes, then move on to bigger and better things!

Third and final attempt with an even larger network and batched updates using recent memory:

This time, I found some resources online to help me implement a PPO memory, and reimplemented the code to use advantage values computed from sequential states stored in that memory. This version finally passes the benchmark I wanted to achieve, reaching an average score of 195 over 100 consecutive episodes fairly consistently.
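
Roughly, the memory stores each transition from a rollout, and advantages are computed by sweeping backwards over the stored rewards and value estimates (generalized advantage estimation). Here is a minimal sketch of the idea; the class name, buffer layout, and hyperparameters are my assumptions rather than this repo's exact code:

```python
import numpy as np

class PPOMemory:
    """Minimal rollout buffer: stores one batch of sequential transitions."""

    def __init__(self):
        self.states, self.actions, self.log_probs = [], [], []
        self.rewards, self.values, self.dones = [], [], []

    def store(self, state, action, log_prob, reward, value, done):
        self.states.append(state)
        self.actions.append(action)
        self.log_probs.append(log_prob)
        self.rewards.append(reward)
        self.values.append(value)
        self.dones.append(done)

    def compute_advantages(self, gamma=0.99, lam=0.95):
        """Generalized advantage estimation over the stored trajectory."""
        advantages = np.zeros(len(self.rewards), dtype=np.float32)
        gae = 0.0
        for t in reversed(range(len(self.rewards))):
            # Bootstrap with the next stored value; use 0 at the end of the buffer
            # or when the episode terminated at step t.
            last = t == len(self.rewards) - 1
            next_value = 0.0 if (last or self.dones[t]) else self.values[t + 1]
            delta = self.rewards[t] + gamma * next_value * (1 - self.dones[t]) - self.values[t]
            gae = delta + gamma * lam * (1 - self.dones[t]) * gae
            advantages[t] = gae
        return advantages

    def clear(self):
        self.__init__()
```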

I'm happy with my progress here, and am ready to move on to other projects. Here's the output from one of the better runs:

(Screenshot: output from one of the better runs, 2024-11-14)

New update!!

I made some further small optimizations, like annealing the learning rate and changing the Adam optimizer's epsilon value, which greatly stabilized the algorithm. With this, I was able to get a score of 36000 on one of the runs:

(Recording: best)
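
For reference, here's a sketch of those two tweaks in PyTorch; the initial learning rate, epsilon value, and update count are typical choices I'm assuming, not necessarily the exact ones used here:

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))  # placeholder network
initial_lr = 2.5e-4
# Raise Adam's epsilon from the default 1e-8 for more stable updates.
optimizer = torch.optim.Adam(policy.parameters(), lr=initial_lr, eps=1e-5)

num_updates = 1000
for update in range(num_updates):
    # Linearly anneal the learning rate toward zero over training.
    frac = 1.0 - update / num_updates
    for param_group in optimizer.param_groups:
        param_group["lr"] = frac * initial_lr
    # ... collect a rollout and run the PPO update with this optimizer ...
```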
