This GitHub repo documents my progress learning Proximal Policy Optimization (PPO). Thus far, I have attempted to solve the CartPole environment with PPO, using this page as a source of inspiration and guidance: https://medium.com/deeplearningmadeeasy/simple-ppo-implementation-e398ca0f2e7c
Screen.Recording.2024-06-23.at.7.59.52.PM.mov
ppo_better.mov
This project is still a work in progress for me, and I will be gradually updating it with my results. In my initial attempts at solving the problem, I'm noticing a couple of details:
- With a smaller network (1st attempt), the algorithm loves to tilt the pole slightly in one direction and move continuously that way. This yields a mediocre but passable reward, and the model struggles to overcome this plateau.
- With a larger network (2nd attempt) and the LeakyReLU activation function, the model actually learns to balance, yet the cart still tends to drift gradually in one direction. I think modifying the reward function to incentivize the cart to stay near the center might be important - I'll try this next (see the sketch after this list)!
- I think the termination step of each episode is actually very important for the model to learn from. Receiving a single reward with no predicted future rewards (because the episode has ended) pushes the model to change its behavior and keep the episode from ending early.
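
For reference, here's a rough sketch of the kind of center-keeping reward shaping I have in mind. The wrapper, the penalty scale of 0.1, and the use of Gymnasium's CartPole-v1 are illustrative assumptions, not my actual training code:

```python
import gymnasium as gym

# Hypothetical reward-shaping wrapper: penalize distance from the track center
# so the cart is discouraged from drifting off in one direction.
class CenteredCartPole(gym.Wrapper):
    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        cart_position = obs[0]  # first observation component is the cart's x position
        reward -= 0.1 * abs(cart_position)  # small penalty for drifting away from x = 0
        return obs, reward, terminated, truncated, info

env = CenteredCartPole(gym.make("CartPole-v1"))
```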
I'm planning to keep working on this environment until the model is consistently able to score an average of 195 over 100 consecutive episodes, then move on to bigger and better things!
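
The success check itself is simple: a rolling average of episodic returns over the last 100 episodes. A minimal sketch (the function and variable names are just for illustration):

```python
from collections import deque

# The environment counts as "solved" once the average return
# over the most recent 100 episodes reaches 195.
recent_returns = deque(maxlen=100)

def record_episode(episode_return):
    recent_returns.append(episode_return)
    return len(recent_returns) == 100 and sum(recent_returns) / 100 >= 195
```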
This time, I found some resources online to help me implement PPO memory, and reimplemented the code using advantage values computed from the sequential states stored in this memory. This version finally passes the benchmark I wanted to achieve, reaching an average score of 195 over 100 consecutive episodes fairly consistently.
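
Roughly, the memory stores sequential transitions and advantages are computed backwards over them. Here's a minimal sketch of that idea using GAE; the buffer layout and the gamma/lambda values are assumptions for illustration, not necessarily the exact code in this repo:

```python
import numpy as np

# Sketch of a rollout memory plus advantage estimation over sequential states.
class RolloutMemory:
    def __init__(self):
        self.states, self.actions, self.log_probs = [], [], []
        self.rewards, self.values, self.dones = [], [], []

    def store(self, state, action, log_prob, reward, value, done):
        self.states.append(state)
        self.actions.append(action)
        self.log_probs.append(log_prob)
        self.rewards.append(reward)
        self.values.append(value)
        self.dones.append(done)

    def compute_advantages(self, last_value, gamma=0.99, lam=0.95):
        advantages = np.zeros(len(self.rewards), dtype=np.float32)
        gae = 0.0
        next_value = last_value
        for t in reversed(range(len(self.rewards))):
            # A terminal step contributes only its immediate reward: the done flag
            # zeroes out the bootstrapped future value, which is the "no predicted
            # further rewards" signal mentioned above.
            mask = 1.0 - float(self.dones[t])
            delta = self.rewards[t] + gamma * next_value * mask - self.values[t]
            gae = delta + gamma * lam * mask * gae
            advantages[t] = gae
            next_value = self.values[t]
        return advantages
```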
I'm happy with my progress here, and am ready to move on to other projects. Here's the output from one of the better runs:
![Screenshot 2024-11-14 at 11 24 56 PM](https://private-user-images.githubusercontent.com/52140136/386457990-e353f917-836e-4a8b-a869-65da9f8269a7.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk0MTE3MDYsIm5iZiI6MTczOTQxMTQwNiwicGF0aCI6Ii81MjE0MDEzNi8zODY0NTc5OTAtZTM1M2Y5MTctODM2ZS00YThiLWE4NjktNjVkYTlmODI2OWE3LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEzVDAxNTAwNlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWVmZjNhYjE0MDczM2NhMTA5MjNmMDYxZjc1ZWU2NzEyYzEwYTEyYWY0YmI4MWYxYTM5ZmM5NDM0ZDhhODc3MTYmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.nLZSTCiUFpPH3Ft5ar24QzzQYKToQw_5VYlh2q721ho)
I made some further small optimizations, like using an annealed learning rate and changing the Adam optimizer's epsilon value, which greatly stabilized my algorithm. With these changes, I was able to reach a score of 36,000 on one of the runs:
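
For context, here's a minimal sketch of those two tweaks in PyTorch. The placeholder network, the initial learning rate of 2.5e-4, the eps value of 1e-5, and the linear schedule are assumptions for illustration:

```python
import torch

policy = torch.nn.Linear(4, 2)  # placeholder network, stands in for the actor-critic
optimizer = torch.optim.Adam(policy.parameters(), lr=2.5e-4, eps=1e-5)  # smaller eps than the default 1e-8

total_updates = 1000
for update in range(total_updates):
    # Linearly anneal the learning rate from its initial value down to zero.
    frac = 1.0 - update / total_updates
    for group in optimizer.param_groups:
        group["lr"] = frac * 2.5e-4
    # ... collect a rollout and run the PPO update here ...
```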
![best](https://private-user-images.githubusercontent.com/52140136/386737352-b2418d77-8b27-4069-9971-7db4f28bfcec.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk0MTE3MDYsIm5iZiI6MTczOTQxMTQwNiwicGF0aCI6Ii81MjE0MDEzNi8zODY3MzczNTItYjI0MThkNzctOGIyNy00MDY5LTk5NzEtN2RiNGYyOGJmY2VjLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEzVDAxNTAwNlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTJlMjgzZGQzMGU4OGNhOGFkMmZiOGQxODVhYWVkNGUxZDVhODczMTdmOTZjNDVlYWNlOGNhMWIwYmMyMjgwOGEmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.0ZXzECaPJ6AQ2P7WejS5NioVGWXDQtehRR4zdY8Ev58)