This repository contains the course project and presentation slides for the course IEOR E8100: Reinforcement Learning from Human Feedback and Related Topics, taught by Professor Shipra Agrawal in Fall 2024.
The main contribution of this project is a novel method for recovering a transformed reward function that induces the same optimal policy as the one that generated the observed data. The transformed reward is also robust to environmental changes, so it still yields an optimal policy when the environment varies.
The method incorporates the policy gradient theorem, reward shaping, and pessimistic learning, offering fast and stable training without the need to solve dynamic programming problems or train GANs.
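The reward-shaping component refers to the standard potential-based form, which modifies the reward while leaving the set of optimal policies unchanged. Below is a minimal sketch of that idea; the potential `phi`, the goal state, and the discount `gamma` are illustrative placeholders, not the project's actual choices.

```python
import numpy as np

def shaped_reward(reward, state, next_state, phi, gamma=0.99):
    """Potential-based reward shaping: r'(s, a, s') = r(s, a) + gamma * phi(s') - phi(s).

    A shaping term of this form rescales returns but does not change which
    policies are optimal, which is what lets the transformed reward be reused
    when the environment changes.
    """
    return reward + gamma * phi(next_state) - phi(state)

# Toy example with a placeholder potential (negative distance to a goal state).
goal = np.zeros(4)
phi = lambda s: -np.linalg.norm(np.asarray(s) - goal)
r_shaped = shaped_reward(-1.0, np.ones(4), 0.5 * np.ones(4), phi)
```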
Assuming the observations (sequences of state-action pairs) are generated by an optimal policy under an unobserved, fixed reward function, we derive a transformed reward function that induces the same optimal policy. By manipulating the policy gradient objective and applying reward shaping, the transformation preserves the optimal policy while remaining valid when the environment changes. Since only a subset of state-action pairs is observed, we pessimistically assume that unobserved pairs yield lower rewards; this imposes a higher penalty when the transformed reward function is used to derive the optimal policy.
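A minimal sketch of this pessimism step in a tabular setting is given below; `reward_table`, `observed`, and `penalty` are illustrative names for this README, not the project's actual implementation.

```python
import numpy as np

def pessimistic_rewards(reward_table, observed, penalty=1.0):
    """Lower the reward of state-action pairs never seen in the demonstrations.

    reward_table: (n_states, n_actions) array of learned transformed rewards.
    observed:     boolean array of the same shape, True where (s, a) appears
                  in the demonstration data.
    penalty:      amount subtracted from unobserved pairs, pushing the
                  recovered optimal policy toward the observed behavior.
    """
    pessimistic = reward_table.copy()
    pessimistic[~observed] -= penalty
    return pessimistic

# Toy example: 3 states, 2 actions, only a few pairs observed.
rewards = np.zeros((3, 2))
observed = np.array([[True, False], [True, True], [False, False]])
print(pessimistic_rewards(rewards, observed, penalty=2.0))
```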
We compare various methods for solving environments under default settings. The demonstration uses true rewards for validation. Behavioral cloning and the transformed reward method are trained and evaluated on the same dataset. Episode rewards and lengths are averaged over 1,000 trajectories.
The results show that, in the unchanged environments, the transformed reward method achieves essentially the same episode rewards as the demonstration policy trained on the true reward function.
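The evaluation protocol can be sketched as below using Gymnasium; the `policy` callable stands in for any of the three trained agents, and the random policy at the end is only a placeholder, not this repository's exact evaluation script.

```python
import gymnasium as gym
import numpy as np

def evaluate(policy, env_id="Acrobot-v1", n_episodes=1000, seed=0):
    """Return mean and std of episode reward and length over n_episodes rollouts."""
    env = gym.make(env_id)
    returns, lengths = [], []
    for ep in range(n_episodes):
        obs, _ = env.reset(seed=seed + ep)
        done, ep_return, ep_len = False, 0.0, 0
        while not done:
            action = policy(obs)  # action from the trained agent
            obs, reward, terminated, truncated, _ = env.step(action)
            ep_return += reward
            ep_len += 1
            done = terminated or truncated
        returns.append(ep_return)
        lengths.append(ep_len)
    env.close()
    return (np.mean(returns), np.std(returns)), (np.mean(lengths), np.std(lengths))

# Example: a random policy as a stand-in for the demonstration/BC/A2C agent.
env = gym.make("Acrobot-v1")
random_policy = lambda obs: env.action_space.sample()
(reward_mu, reward_sd), (len_mu, len_sd) = evaluate(random_policy, n_episodes=10)
```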
| Default Setting | Demonstration (DQN) | Behavioral cloning | Transformed reward (A2C) |
|---|---|---|---|
| **Acrobot-v1** | | | |
| reward | -77.41 (11.65) | -77.07 (41.51) | -77.01 (17.19) |
| length | 78.41 (11.65) | 78.07 (41.51) | 78.01 (17.19) |
| **MountainCar-v0** | | | |
| reward | -101.55 (9.79) | -101.51 (9.81) | -98.70 (7.16) |
| length | 102.55 (9.79) | 102.51 (9.81) | 99.70 (7.16) |
| **LunarLander-v3** | | | |
| reward | 230.39 (93.59) | 234.97 (91.35) | 234.18 (65.16) |
| length | 235.24 (188.50) | 236.52 (180.36) | 240.33 (78.53) |
Table: Comparison of different methods for solving environments under default settings. Demonstration uses the true rewards. Behavioral cloning and Transformed reward are trained and evaluated on the same data. Episode rewards and lengths are averaged over 1,000 trajectories.
When the environment changes, the transformed reward method still performs effectively, achieving rewards close to the demonstration policy, while behavioral cloning fails to adapt.
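One way to realize the "different initialization" setting is to perturb the environment's internal state after each reset. The wrapper below sketches this for MountainCar-v0, whose observation equals the internal (position, velocity) state stored in `env.unwrapped.state`; the shift value is an illustrative assumption, not necessarily the one used for the table below.

```python
import gymnasium as gym
import numpy as np

class ShiftedInitWrapper(gym.Wrapper):
    """Shift the initial state distribution of MountainCar-v0.

    The classic-control implementation keeps its state in
    `env.unwrapped.state`, and its observation is exactly that state.
    """

    def __init__(self, env, shift):
        super().__init__(env)
        self.shift = np.asarray(shift, dtype=np.float64)

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        # Perturb the internal state and return the matching observation.
        self.env.unwrapped.state = self.env.unwrapped.state + self.shift
        return np.array(self.env.unwrapped.state, dtype=np.float32), info

# Example: start the car slightly further up the left slope.
env = ShiftedInitWrapper(gym.make("MountainCar-v0"), shift=[-0.1, 0.0])
obs, info = env.reset(seed=0)
```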
| Different Initialization | Demonstration (DQN) | Behavioral cloning | Transformed reward (A2C) |
|---|---|---|---|
| **Acrobot-v1** | | | |
| reward | -78.43 (32.67) | -77.07 (41.51) | -78.31 (23.23) |
| length | 79.43 (32.67) | 78.07 (41.51) | 79.31 (23.23) |
| **MountainCar-v0** | | | |
| reward | -65.92 (8.4) | -100.71 (9.25) | -65.27 (12.83) |
| length | 66.92 (8.4) | 101.71 (9.25) | 66.27 (12.83) |
| **LunarLander-v3** | | | |
| reward | 265.59 (70.63) | 208.11 (122.88) | 256.87 (63.13) |
| length | 296.45 (216.22) | 211.99 (159.65) | 289.33 (111.52) |
Table: Comparison of different methods for solving environments with modified initialization settings. Demonstration uses the true rewards. Behavioral cloning and Transformed reward are trained on data collected from the original settings and evaluated in the modified environments. Episode rewards and lengths are averaged over 1,000 trajectories.