
Will You Skip The Next Song? Recovering Consumer Preference From Dynamic Decision-Making

This repository contains the course project and presentation slides for IEOR E8100: Reinforcement Learning from Human Feedback and Related Topics, taught by Professor Shipra Agrawal in Fall 2024.

Contribution

The main contribution of this project is a novel method for recovering a transformed reward function that induces the same optimal policy as the unobserved reward underlying the observed data. The transformed reward function is also robust to environmental changes, so it can still yield the optimal policy when the environment varies.

The method incorporates the policy gradient theorem, reward shaping, and pessimistic learning, offering fast and stable training without the need to solve dynamic programming problems or train GANs.
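To make the setup concrete, here is a minimal sketch (not the repository's code) of how a recovered reward could be plugged into an off-the-shelf policy-gradient learner. It assumes Gymnasium and Stable-Baselines3 are available; `transformed_reward` is a hypothetical placeholder for the learned reward model.

```python
import gymnasium as gym
from stable_baselines3 import A2C


class TransformedRewardWrapper(gym.Wrapper):
    """Replace the environment's reward with a learned, transformed reward."""

    def __init__(self, env, reward_fn):
        super().__init__(env)
        self.reward_fn = reward_fn
        self._last_obs = None

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self._last_obs = obs
        return obs, info

    def step(self, action):
        obs, _, terminated, truncated, info = self.env.step(action)
        # Train against the recovered reward instead of the true one.
        reward = float(self.reward_fn(self._last_obs, action))
        self._last_obs = obs
        return obs, reward, terminated, truncated, info


def transformed_reward(obs, action):
    # Placeholder: a real implementation would query the learned reward model.
    return 0.0


env = TransformedRewardWrapper(gym.make("Acrobot-v1"), transformed_reward)
model = A2C("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=50_000)
```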

Method (main ideas)

Assuming the observations (sequences of state-action pairs) are generated by an optimal policy under a fixed but unobserved reward function, we derive a $Q$ function under which the policy gradient of the demonstrated policy is zero.
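For reference, the condition this relies on can be sketched with the policy gradient theorem (standard notation, not quoted from the report): if the demonstrated policy $\pi_\theta$ is optimal, its parameters are a stationary point of the expected return, so

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta(\cdot \mid s)}
    \big[\, \nabla_\theta \log \pi_\theta(a \mid s)\; Q^{\pi_\theta}(s, a) \,\big]
  = 0 ,
```

and the method looks for a $Q$ function that makes this identity hold on the observed state-action pairs.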

By manipulating this $Q$ function, we define a transformed reward function that is mathematically guaranteed to induce the same optimal policy as the true reward function.
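One classical illustration of why such policy-preserving transformations exist is potential-based reward shaping (the report's construction works directly on the $Q$ function and may differ in detail): for any potential function $\Phi$ over states,

```latex
\tilde{r}(s, a, s') = r(s, a, s') + \gamma\, \Phi(s') - \Phi(s)
\quad \Longrightarrow \quad
\tilde{Q}^{\pi}(s, a) = Q^{\pi}(s, a) - \Phi(s),
```

so the per-state argmax over actions, and therefore the optimal policy, is unchanged.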

Since only a subset of state-action pairs is observed, we assume the unobserved pairs yield lower rewards. This pessimism imposes a higher penalty on out-of-distribution pairs when the transformed reward function is used to derive the optimal policy.
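A minimal sketch of this pessimism step, using hypothetical names rather than the repository's actual interfaces: state-action pairs never seen in the demonstrations are clamped to a reward below anything observed.

```python
import numpy as np


def pessimistic_reward_table(q_values, observed_pairs, penalty_margin=1.0):
    """Turn estimated values into a pessimistic reward table.

    q_values: array of shape (n_states, n_actions) from the transformed-reward
              construction (assumed given).
    observed_pairs: set of (state, action) tuples seen in the demonstrations.
    Unobserved pairs receive a reward below the smallest observed value,
    discouraging the learner from leaving the demonstrated support.
    """
    rewards = np.array(q_values, dtype=float)
    states, actions = zip(*observed_pairs)
    floor = rewards[list(states), list(actions)].min() - penalty_margin
    mask = np.ones_like(rewards, dtype=bool)
    mask[list(states), list(actions)] = False  # keep observed pairs as-is
    rewards[mask] = floor                      # pessimistic value elsewhere
    return rewards
```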

Benchmark

Same Environment

We compare several methods for solving each environment under its default settings. The demonstration policy (DQN) is trained with the true rewards and serves as the reference. Behavioral cloning and the transformed reward method are trained and evaluated on the same demonstration data. Episode rewards and lengths are averaged over 1,000 trajectories.
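For concreteness, a minimal sketch of this evaluation protocol (assuming Gymnasium; `policy` is a hypothetical function mapping an observation to an action):

```python
import gymnasium as gym
import numpy as np


def evaluate(policy, env_id="Acrobot-v1", n_episodes=1_000, seed=0):
    """Average episode reward and length of a policy over n_episodes rollouts."""
    env = gym.make(env_id)
    returns, lengths = [], []
    for ep in range(n_episodes):
        obs, _ = env.reset(seed=seed + ep)
        done, total, steps = False, 0.0, 0
        while not done:
            obs, reward, terminated, truncated, _ = env.step(policy(obs))
            total += reward
            steps += 1
            done = terminated or truncated
        returns.append(total)
        lengths.append(steps)
    return np.mean(returns), np.std(returns), np.mean(lengths), np.std(lengths)
```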

The results show that, in an unchanged environment, the transformed reward method recovers essentially the same episode rewards as the demonstration policy trained on the true reward function.

| Default setting | Metric | Demonstration (DQN) | Behavioral clone | Transformed reward (A2C) |
| --- | --- | --- | --- | --- |
| Acrobot-v1 | reward | -77.41 (11.65) | -77.07 (41.51) | -77.01 (17.19) |
| Acrobot-v1 | length | 78.41 (11.65) | 78.07 (41.51) | 78.01 (17.19) |
| MountainCar-v0 | reward | -101.55 (9.79) | -101.51 (9.81) | -98.70 (7.16) |
| MountainCar-v0 | length | 102.55 (9.79) | 102.51 (9.81) | 99.70 (7.16) |
| LunarLander-v3 | reward | 230.39 (93.59) | 234.97 (91.35) | 234.18 (65.16) |
| LunarLander-v3 | length | 235.24 (188.50) | 236.52 (180.36) | 240.33 (78.53) |

Table: Comparison of different methods for solving environments under default settings. The demonstration uses the true rewards. The behavioral clone and the transformed reward model are trained and evaluated on the same data. Episode rewards and lengths are averaged over 1,000 trajectories.

Different Environment

When the environment's initialization changes, the transformed reward method still performs effectively, closely matching the rewards of the demonstration policy, while behavioral cloning fails to adapt.

| Different initialization | Metric | Demonstration (DQN) | Behavioral clone | Transformed reward (A2C) |
| --- | --- | --- | --- | --- |
| Acrobot-v1 | reward | -78.43 (32.67) | -77.07 (41.51) | -78.31 (23.23) |
| Acrobot-v1 | length | 79.43 (32.67) | 78.07 (41.51) | 79.31 (23.23) |
| MountainCar-v0 | reward | -65.92 (8.40) | -100.71 (9.25) | -65.27 (12.83) |
| MountainCar-v0 | length | 66.92 (8.40) | 101.71 (9.25) | 66.27 (12.83) |
| LunarLander-v3 | reward | 265.59 (70.63) | 208.11 (122.88) | 256.87 (63.13) |
| LunarLander-v3 | length | 296.45 (216.22) | 211.99 (159.65) | 289.33 (111.52) |

Table: Comparison of different methods for solving environments with modified initialization settings. The demonstration uses the true rewards. The behavioral clone and the transformed reward model are trained on data collected from the original settings and evaluated on the modified environments. Episode rewards and lengths are averaged over 1,000 trajectories.
