This repository contains the course project and presentation slides for the course IEOR E8100: Reinforcement Learning from Human Feedback and Related Topics, taught by Professor Shipra Agrawal in Fall 2024.
The main contribution of this project is a novel method for recovering a transformed reward function that induces the same optimal policy as the one that generated the observed data. The transformed reward is also robust to environmental changes, so it still yields an optimal policy when the environment varies.
The method incorporates the policy gradient theorem, reward shaping, and pessimistic learning, offering fast and stable training without the need to solve dynamic programming problems or train GANs.
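The reward-shaping component refers to the standard potential-based form, which modifies the reward while leaving the set of optimal policies unchanged. Below is a minimal sketch of that idea; the potential `phi`, the goal state, and the discount `gamma` are illustrative placeholders, not the project's actual choices.

```python
import numpy as np

def shaped_reward(reward, state, next_state, phi, gamma=0.99):
    """Potential-based reward shaping: r'(s, a, s') = r(s, a) + gamma * phi(s') - phi(s).

    A shaping term of this form rescales returns but does not change which
    policies are optimal, which is what lets the transformed reward be reused
    when the environment changes.
    """
    return reward + gamma * phi(next_state) - phi(state)

# Toy example with a placeholder potential (negative distance to a goal state).
goal = np.zeros(4)
phi = lambda s: -np.linalg.norm(np.asarray(s) - goal)
r_shaped = shaped_reward(-1.0, np.ones(4), 0.5 * np.ones(4), phi)
```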
Assuming the observations (sequences of state-action pairs) are generated by an optimal policy under an unobserved, fixed reward function, we derive a transformed reward function that induces the same optimal policy. By manipulating the policy gradient objective and applying reward shaping, the transformation preserves the optimal policy while remaining valid when the environment changes. Since only a subset of state-action pairs is observed, we pessimistically assume that unobserved pairs yield lower rewards; this imposes a higher penalty when the transformed reward function is used to derive the optimal policy.
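A minimal sketch of this pessimism step in a tabular setting is given below; `reward_table`, `observed`, and `penalty` are illustrative names for this README, not the project's actual implementation.

```python
import numpy as np

def pessimistic_rewards(reward_table, observed, penalty=1.0):
    """Lower the reward of state-action pairs never seen in the demonstrations.

    reward_table: (n_states, n_actions) array of learned transformed rewards.
    observed:     boolean array of the same shape, True where (s, a) appears
                  in the demonstration data.
    penalty:      amount subtracted from unobserved pairs, pushing the
                  recovered optimal policy toward the observed behavior.
    """
    pessimistic = reward_table.copy()
    pessimistic[~observed] -= penalty
    return pessimistic

# Toy example: 3 states, 2 actions, only a few pairs observed.
rewards = np.zeros((3, 2))
observed = np.array([[True, False], [True, True], [False, False]])
print(pessimistic_rewards(rewards, observed, penalty=2.0))
```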
We compare various methods for solving environments under default settings. The demonstration uses true rewards for validation. Behavioral cloning and the transformed reward method are trained and evaluated on the same dataset. Episode rewards and lengths are averaged over 1,000 trajectories.
The results show that, in the unchanged environments, the transformed reward method achieves essentially the same episode rewards as the demonstration policy trained on the true reward function.
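The evaluation protocol can be sketched as below using Gymnasium; the `policy` callable stands in for any of the three trained agents, and the random policy at the end is only a placeholder, not this repository's exact evaluation script.

```python
import gymnasium as gym
import numpy as np

def evaluate(policy, env_id="Acrobot-v1", n_episodes=1000, seed=0):
    """Return mean and std of episode reward and length over n_episodes rollouts."""
    env = gym.make(env_id)
    returns, lengths = [], []
    for ep in range(n_episodes):
        obs, _ = env.reset(seed=seed + ep)
        done, ep_return, ep_len = False, 0.0, 0
        while not done:
            action = policy(obs)  # action from the trained agent
            obs, reward, terminated, truncated, _ = env.step(action)
            ep_return += reward
            ep_len += 1
            done = terminated or truncated
        returns.append(ep_return)
        lengths.append(ep_len)
    env.close()
    return (np.mean(returns), np.std(returns)), (np.mean(lengths), np.std(lengths))

# Example: a random policy as a stand-in for the demonstration/BC/A2C agent.
env = gym.make("Acrobot-v1")
random_policy = lambda obs: env.action_space.sample()
(reward_mu, reward_sd), (len_mu, len_sd) = evaluate(random_policy, n_episodes=10)
```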
| Default Setting | Demonstration (DQN) | Behavioral cloning | Transformed reward (A2C) |
|---|---|---|---|
| **Acrobot-v1** | | | |
| reward | -77.41 (11.65) | -77.07 (41.51) | -77.01 (17.19) |
| length | 78.41 (11.65) | 78.07 (41.51) | 78.01 (17.19) |
| **MountainCar-v0** | | | |
| reward | -101.55 (9.79) | -101.51 (9.81) | -98.70 (7.16) |
| length | 102.55 (9.79) | 102.51 (9.81) | 99.70 (7.16) |
| **LunarLander-v3** | | | |
| reward | 230.39 (93.59) | 234.97 (91.35) | 234.18 (65.16) |
| length | 235.24 (188.50) | 236.52 (180.36) | 240.33 (78.53) |
Table: Comparison of different methods for solving environments under default settings. Demonstration uses the true rewards. Behavioral cloning and Transformed reward are trained and evaluated on the same data. Episode rewards and lengths are averaged over 1,000 trajectories.
When the environment changes, the transformed reward method still performs effectively, achieving rewards close to the demonstration policy, while behavioral cloning fails to adapt.
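One way to realize the "different initialization" setting is to perturb the environment's internal state after each reset. The wrapper below sketches this for MountainCar-v0, whose observation equals the internal (position, velocity) state stored in `env.unwrapped.state`; the shift value is an illustrative assumption, not necessarily the one used for the table below.

```python
import gymnasium as gym
import numpy as np

class ShiftedInitWrapper(gym.Wrapper):
    """Shift the initial state distribution of MountainCar-v0.

    The classic-control implementation keeps its state in
    `env.unwrapped.state`, and its observation is exactly that state.
    """

    def __init__(self, env, shift):
        super().__init__(env)
        self.shift = np.asarray(shift, dtype=np.float64)

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        # Perturb the internal state and return the matching observation.
        self.env.unwrapped.state = self.env.unwrapped.state + self.shift
        return np.array(self.env.unwrapped.state, dtype=np.float32), info

# Example: start the car slightly further up the left slope.
env = ShiftedInitWrapper(gym.make("MountainCar-v0"), shift=[-0.1, 0.0])
obs, info = env.reset(seed=0)
```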
| Different Initialization | Demonstration (DQN) | Behavioral cloning | Transformed reward (A2C) |
|---|---|---|---|
| **Acrobot-v1** | | | |
| reward | -78.43 (32.67) | -77.07 (41.51) | -78.31 (23.23) |
| length | 79.43 (32.67) | 78.07 (41.51) | 79.31 (23.23) |
| **MountainCar-v0** | | | |
| reward | -65.92 (8.4) | -100.71 (9.25) | -65.27 (12.83) |
| length | 66.92 (8.4) | 101.71 (9.25) | 66.27 (12.83) |
| **LunarLander-v3** | | | |
| reward | 265.59 (70.63) | 208.11 (122.88) | 256.87 (63.13) |
| length | 296.45 (216.22) | 211.99 (159.65) | 289.33 (111.52) |
Table: Comparison of different methods for solving environments with modified initialization settings. Demonstration uses the true rewards. Behavioral cloning and Transformed reward are trained on data collected from the original settings and evaluated in the modified environments. Episode rewards and lengths are averaged over 1,000 trajectories.