This GitHub repo documents my progress learning Proximal Policy Optimization (PPO). Thus far, I have attempted to solve the CartPole environment with PPO, using this page as a source of inspiration and guidance: https://medium.com/deeplearningmadeeasy/simple-ppo-implementation-e398ca0f2e7c
Screen.Recording.2024-06-23.at.7.59.52.PM.mov
ppo_better.mov
This project is still a work in progress for me, and I will be gradually updating it with my results. In my initial attempts at solving the problem, I'm noticing a couple of details:
- With a smaller network (1st attempt), the algorithm loves to tilt the pole slightly in one direction and move continuously that way. This yields a mediocre but passable reward, and the model struggles to overcome this plateau.
- With a larger network (2nd attempt) and the LeakyReLU activation function, the model actually learns to balance, yet the cart still tends to drift gradually in one direction. I think modifying the reward function to incentivize the cart to stay near the center might be important - I'll try this next (see the sketch after this list)!
- I think the termination step of each episode is actually very important for the model to learn from. Receiving a single reward with no predicted future rewards (because the episode has ended) pushes the model to change its behavior and keep the episode from ending early.
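
For reference, here's a rough sketch of the kind of center-keeping reward shaping I have in mind. The wrapper, the penalty scale of 0.1, and the use of Gymnasium's CartPole-v1 are illustrative assumptions, not my actual training code:

```python
import gymnasium as gym

# Hypothetical reward-shaping wrapper: penalize distance from the track center
# so the cart is discouraged from drifting off in one direction.
class CenteredCartPole(gym.Wrapper):
    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        cart_position = obs[0]  # first observation component is the cart's x position
        reward -= 0.1 * abs(cart_position)  # small penalty for drifting away from x = 0
        return obs, reward, terminated, truncated, info

env = CenteredCartPole(gym.make("CartPole-v1"))
```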
I'm planning to keep working on this environment until the model is consistently able to score an average of 195 over 100 consecutive episodes, then move on to bigger and better things!
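
The success check itself is simple: a rolling average of episodic returns over the last 100 episodes. A minimal sketch (the function and variable names are just for illustration):

```python
from collections import deque

# The environment counts as "solved" once the average return
# over the most recent 100 episodes reaches 195.
recent_returns = deque(maxlen=100)

def record_episode(episode_return):
    recent_returns.append(episode_return)
    return len(recent_returns) == 100 and sum(recent_returns) / 100 >= 195
```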
This time, I found some resources online to help me implement PPO memory, and reimplemented the code using advantage values computed from the sequential states stored in this memory. This version finally passes the benchmark I wanted to achieve, reaching an average score of 195 over 100 consecutive episodes fairly consistently.
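
Roughly, the memory stores sequential transitions and advantages are computed backwards over them. Here's a minimal sketch of that idea using GAE; the buffer layout and the gamma/lambda values are assumptions for illustration, not necessarily the exact code in this repo:

```python
import numpy as np

# Sketch of a rollout memory plus advantage estimation over sequential states.
class RolloutMemory:
    def __init__(self):
        self.states, self.actions, self.log_probs = [], [], []
        self.rewards, self.values, self.dones = [], [], []

    def store(self, state, action, log_prob, reward, value, done):
        self.states.append(state)
        self.actions.append(action)
        self.log_probs.append(log_prob)
        self.rewards.append(reward)
        self.values.append(value)
        self.dones.append(done)

    def compute_advantages(self, last_value, gamma=0.99, lam=0.95):
        advantages = np.zeros(len(self.rewards), dtype=np.float32)
        gae = 0.0
        next_value = last_value
        for t in reversed(range(len(self.rewards))):
            # A terminal step contributes only its immediate reward: the done flag
            # zeroes out the bootstrapped future value, which is the "no predicted
            # further rewards" signal mentioned above.
            mask = 1.0 - float(self.dones[t])
            delta = self.rewards[t] + gamma * next_value * mask - self.values[t]
            gae = delta + gamma * lam * mask * gae
            advantages[t] = gae
            next_value = self.values[t]
        return advantages
```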
I'm happy with my progress here, and am ready to move on to other projects. Here's the output from one of the better runs:
![Screenshot 2024-11-14 at 11 24 56 PM](https://private-user-images.githubusercontent.com/52140136/386457990-e353f917-836e-4a8b-a869-65da9f8269a7.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk0MTE3MDYsIm5iZiI6MTczOTQxMTQwNiwicGF0aCI6Ii81MjE0MDEzNi8zODY0NTc5OTAtZTM1M2Y5MTctODM2ZS00YThiLWE4NjktNjVkYTlmODI2OWE3LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEzVDAxNTAwNlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWVmZjNhYjE0MDczM2NhMTA5MjNmMDYxZjc1ZWU2NzEyYzEwYTEyYWY0YmI4MWYxYTM5ZmM5NDM0ZDhhODc3MTYmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.nLZSTCiUFpPH3Ft5ar24QzzQYKToQw_5VYlh2q721ho)
I made some further small optimizations, like using an annealed learning rate and changing the Adam optimizer's epsilon value, which greatly stabilized my algorithm. With these changes, I was able to reach a score of 36,000 on one of the runs:
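
For context, here's a minimal sketch of those two tweaks in PyTorch. The placeholder network, the initial learning rate of 2.5e-4, the eps value of 1e-5, and the linear schedule are assumptions for illustration:

```python
import torch

policy = torch.nn.Linear(4, 2)  # placeholder network, stands in for the actor-critic
optimizer = torch.optim.Adam(policy.parameters(), lr=2.5e-4, eps=1e-5)  # smaller eps than the default 1e-8

total_updates = 1000
for update in range(total_updates):
    # Linearly anneal the learning rate from its initial value down to zero.
    frac = 1.0 - update / total_updates
    for group in optimizer.param_groups:
        group["lr"] = frac * 2.5e-4
    # ... collect a rollout and run the PPO update here ...
```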
![best](https://private-user-images.githubusercontent.com/52140136/386737352-b2418d77-8b27-4069-9971-7db4f28bfcec.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk0MTE3MDYsIm5iZiI6MTczOTQxMTQwNiwicGF0aCI6Ii81MjE0MDEzNi8zODY3MzczNTItYjI0MThkNzctOGIyNy00MDY5LTk5NzEtN2RiNGYyOGJmY2VjLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEzVDAxNTAwNlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTJlMjgzZGQzMGU4OGNhOGFkMmZiOGQxODVhYWVkNGUxZDVhODczMTdmOTZjNDVlYWNlOGNhMWIwYmMyMjgwOGEmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.0ZXzECaPJ6AQ2P7WejS5NioVGWXDQtehRR4zdY8Ev58)