Merge pull request #41 from instadeepai/docs/research-page
docs: research page
Showing 24 changed files with 147 additions and 0 deletions.
Binary files added:

- `docs/assets/research/selective-reincarnation/3_reincarnated_agents.png` (+156 KB)
- `docs/assets/research/selective-reincarnation/4_reincarnated_agents.png` (+159 KB)
- `docs/assets/research/selective-reincarnation/5_reincarnated_agents.png` (+145 KB)
- `...assets/research/selective-reincarnation/arbitrarily_selective_reincarnation.png` (+205 KB)
- `docs/assets/research/selective-reincarnation/instadeep_square_logo.png` (+8.55 KB)
- `...sets/research/selective-reincarnation/university_of_cape_town_and_instadeep.png` (+81.7 KB)
// KaTeX auto-render hook for the docs site: re-render math whenever
// mkdocs-material swaps in a new page body.
document$.subscribe(({ body }) => {
  renderMathInElement(body, {
    // Recognise dollar-sign and \( \) / \[ \] delimiters; `display` selects
    // block (true) versus inline (false) rendering.
    delimiters: [
      { left: "$$", right: "$$", display: true },
      { left: "$", right: "$", display: false },
      { left: "\\(", right: "\\)", display: false },
      { left: "\\[", right: "\\]", display: true }
    ],
  })
})
These pages detail the cutting-edge offline MARL research directions that utilise OG-MARL. We strive to update this list regularly. Please open a pull request on the GitHub repo if you would like to be featured!
# Coordination Failure in Cooperative Offline MARL

*[Paper](https://arxiv.org/abs/2407.01343) | [Notebook](https://tinyurl.com/pjap-polygames) | [Announcement](https://x.com/callumtilbury/status/1816489404766224479)*

What happens when trying to learn multi-agent coordination from a static dataset? Catastrophe, if you’re not careful! This is the topic of our work on ✨Coordination Failure in Offline Multi-Agent Reinforcement Learning✨

<p align="center"><img src="../assets/research/polygames/overview.png" alt="" width="100%"/></p>

Many offline MARL methods build on an MADDPG-style update, which we call the “Best Response Under Dataset” (BRUD). Essentially, agents optimise their action in best response to the other agents’ actions, as sampled from the dataset 🤼
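
To make the BRUD idea concrete, here is a minimal, illustrative sketch of such an actor objective in Python. The `critic`, `policies` and `batch` objects are hypothetical stand-ins for this sketch, not OG-MARL's actual API.

```python
import numpy as np

def brud_actor_loss(critic, policies, batch, agent_idx):
    """Sketch of a BRUD-style actor objective for agent `agent_idx`.

    The agent's own action comes from its current policy, while the other
    agents' actions are taken directly from the sampled dataset batch, so the
    agent learns a best response to the *dataset* behaviour of its teammates.
    """
    obs = batch["observations"]              # shape: (batch, n_agents, obs_dim)
    joint_actions = batch["actions"].copy()  # shape: (batch, n_agents, act_dim)

    # Replace only agent i's action with its current policy output.
    joint_actions[:, agent_idx] = policies[agent_idx](obs[:, agent_idx])

    # Maximise the critic's value of this mixed joint action (minimise the negative).
    return -np.mean(critic(obs, joint_actions))
```

In a MADDPG-style method, the gradient of this objective flows through the critic into agent *i*'s policy alone, while the other agents' (dataset) actions are treated as fixed.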

<p align="center">
  <b>But this can lead to catastrophic miscoordination! 🥊</b>
</p>

To illustrate this phenomenon, we use polynomial games for tractable insights. For example, consider a simple game, $R = xy$, dubbed the "sign-agreement" game. Agents X and Y aim to choose actions of the same sign ($++$ or $--$) to yield good rewards. 📈

<p align="center"><img src="../assets/research/polygames/sign-agreement-surface.png" alt="" width="40%"/></p>

Suppose in this game that Agent X currently takes a NEGATIVE action, and Agent Y currently takes a POSITIVE action—illustrated by the _Current Policy_ on the left. Now suppose we sample a point from the static dataset, where X took a POSITIVE action and Y took a NEGATIVE action, illustrated on the right.

<p align="center"><img src="../assets/research/polygames/joint-policy-and-datapoint.png" alt="" width="80%"/></p>

With a BRUD-style update, the agent policies will update according to the illustration below. Agent X looks at the datapoint, where Y took a negative action, and makes its own action more negative in best response. The opposite happens for Agent Y when looking at the datapoint from X, making its action more positive.

<p align="center"><img src="../assets/research/polygames/x-and-y-updates.png" alt="" width="80%"/></p>
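
As a quick worked example (using the true reward $R = xy$ as the critic, for simplicity), the BRUD updates in the sign-agreement game reduce to

$$
x \leftarrow x + \eta \,\frac{\partial}{\partial x}\big(x\, y_D\big) = x + \eta\, y_D,
\qquad
y \leftarrow y + \eta \,\frac{\partial}{\partial y}\big(x_D\, y\big) = y + \eta\, x_D,
$$

where $(x_D, y_D)$ is the sampled datapoint and $\eta$ is the learning rate. With the current policy at $x < 0,\ y > 0$ and a datapoint with $x_D > 0,\ y_D < 0$, we get $\eta\, y_D < 0$ and $\eta\, x_D > 0$: $x$ becomes more negative and $y$ more positive, pushing the joint action into the quadrant where $R = xy < 0$.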

The result is catastrophic! Agents move towards a low-reward region, in the opposite direction of the true optimal update. Our work goes further to ground this result mathematically, and demonstrates how and why other instances of miscoordination arise in a variety of polynomial games. 🤓

<p align="center"><img src="../assets/research/polygames/joint-update.png" alt="" width="40%"/></p>

How do we solve this problem? Our key insight is that miscoordination arises because of the dissimilarity between the current joint-policy output and the sampled joint action.

<p align="center">
  <b>⚠️ Not all data is equally important at all times ⚠️</b>
</p>

Instead, we want to prioritise sampling experience that comes from a dataset-generating policy similar to the current joint policy. We do this by setting the priorities to be inversely proportional to some function of the distance between the policies.

We call this *Proximal Joint-Action Prioritisation (PJAP)* 🤠
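
As a rough sketch of the idea (the distance measure and priority function here are assumptions for illustration, not necessarily the exact ones used in the paper), the priorities might be computed as follows:

```python
import numpy as np

def pjap_priorities(policy_joint_actions, dataset_joint_actions, eps=1e-3):
    """Illustrative PJAP-style priorities.

    Each transition's priority is inversely proportional to the distance
    between the current joint-policy action and the joint action stored in the
    dataset, so joint actions "proximal" to the current policy are replayed
    more often.
    """
    # Both inputs: (num_transitions, n_agents, act_dim)
    distance = np.linalg.norm(
        policy_joint_actions - dataset_joint_actions, axis=(1, 2)
    )
    # A prioritised buffer then samples transitions proportionally to these values.
    return 1.0 / (distance + eps)
```

Because these are just per-transition priorities, they can be plugged into any standard prioritised experience replay implementation.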

Returning to the sign-agreement game from before, we see below how vanilla MADDPG trained on a static dataset fails to learn the optimal policy 😭 Here, experience is simply sampled uniformly from the dataset!

<p align="center"><img src="../assets/research/polygames/maddpg-animation.gif" alt="" width="100%"/></p>

If we instead prioritise sampling actions that are close to our current joint policy, using PJAP, then MADDPG can find the optimal reward region! 🎉

<p align="center"><img src="../assets/research/polygames/maddpg+pjap-animation.gif" alt="" width="100%"/></p>

Here’s a visualisation of the priorities in the underlying buffer. Prioritised experience replay is already a popular tool in RL, so PJAP can easily be integrated with existing code. 😌

<p align="center"><img src="../assets/research/polygames/buffer-animation.gif" alt="" width="50%"/></p>

In a more complex polynomial game, a clear improvement occurs once again. Crucially, we see how the mean distance between the sampled actions and the current policy is reduced, which leads to higher returns. 💃

<p align="center"><img src="../assets/research/polygames/twin-peaks-game.png" alt="" width="100%"/></p>

Excitingly, this result transfers to more complex scenarios! Here we look at 2halfcheetah from MAMuJoCo, and see that PJAP yields a lower average distance between the sampled actions and the current joint policy, which leads to statistically significantly higher returns 🐆🔥

<p align="center"><img src="../assets/research/polygames/mamujoco-results.png" alt="" width="70%"/></p>

Importantly, our work shows how simplified, tractable games can yield useful, theoretically grounded insights that transfer to more complex contexts. A core component of our offering is an interactive notebook, from which almost all of our results can be reproduced, simply in a browser! 💻

<a href="https://tinyurl.com/pjap-polygames" target="_blank">
  <p align="center"><img src="../assets/research/polygames/notebook.png" alt="" width="100%"/></p>
</a>

We presented this paper at the [ARLET workshop](https://icml.cc/virtual/2024/workshop/29964) at ICML 2024.

## Cite

```
@inproceedings{tilbury2024coordination,
  title={Coordination Failure in Cooperative Offline MARL},
  author={Tilbury, Callum Rhys and Formanek, Juan Claude and Beyers, Louise and Shock, Jonathan Phillip and Pretorius, Arnu},
  booktitle={ICML 2024 Workshop: Aligning Reinforcement Learning Experimentalists and Theorists},
  year={2024},
  url={https://arxiv.org/abs/2407.01343}
}
```
# Selective Reincarnation in Multi-Agent Reinforcement Learning

[Reincarnation](https://agarwl.github.io/reincarnating_rl/) in reinforcement learning has been proposed as a formalisation of reusing prior computation from past experiments when training an agent in an environment. In this work, we present a brief foray into the paradigm of reincarnation in the multi-agent reinforcement learning (MARL) context. We consider the case where only some agents are reincarnated, whereas the others are trained from scratch — *selective reincarnation*.

## Selectively-Reincarnated Policy-to-Value MARL

In this work we present a case study in multi-agent *policy-to-value RL* (PVRL), focusing on one of the methods invoked by [Agarwal et al. (2022)](https://arxiv.org/abs/2206.01626), called ‘Rehearsal’ ([Gülçehre et al., 2020](https://openreview.net/forum?id=SygKyeHKDH)).
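
Roughly speaking, Rehearsal gives the learning student access to the teacher's offline data while it trains online. The sketch below shows one simple way such mixing could look; the fixed 50/50 split and the names are assumptions for illustration, not the exact scheme from the paper.

```python
import numpy as np

def rehearsal_batch(student_buffer, teacher_dataset, batch_size,
                    teacher_fraction=0.5, rng=None):
    """Illustrative rehearsal-style sampling: mix offline teacher data into the batch.

    `student_buffer` and `teacher_dataset` are assumed to be sequences of
    transitions; a reincarnated agent draws part of every training batch from
    its teacher's offline dataset and the rest from its own online experience.
    """
    rng = rng or np.random.default_rng()
    n_teacher = int(batch_size * teacher_fraction)
    teacher_idx = rng.integers(0, len(teacher_dataset), size=n_teacher)
    student_idx = rng.integers(0, len(student_buffer), size=batch_size - n_teacher)
    return [teacher_dataset[i] for i in teacher_idx] + [student_buffer[i] for i in student_idx]
```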

For the purposes of studying selective reincarnation, we use the 6-Agent HALFCHEETAH environment from [Multi-Agent MuJoCo](https://github.com/schroederdewitt/multiagent_mujoco), where each of the six degrees of freedom is controlled by a separate agent.

<p align="center"><img src="../assets/research/selective-reincarnation/halfcheetah.png" width="450" height="300" alt="6-Agent HALFCHEETAH"></p>

We enumerate all combinations of agents for reincarnation, a total of $2^6 = 64$ subsets. For each subset, we retrain the system on HALFCHEETAH, where that particular group of agents gains access to their teacher's offline data (i.e. they are reincarnated). For each combination, we train the system for *200k* timesteps, remove the teacher data, and then train for a further *50k* timesteps on student data alone.
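
For illustration, enumerating these reincarnation subsets is straightforward; the joint labels below (front/back ankle, knee, hip) are hypothetical shorthand, not identifiers from the codebase:

```python
from itertools import combinations

# Hypothetical labels for the six HALFCHEETAH joint agents:
# front/back ankle (FA/BA), knee (FK/BK), hip (FH/BH).
agents = ["FA", "FK", "FH", "BA", "BK", "BH"]

# Every subset of agents to reincarnate, from none (tabula rasa) to all six.
reincarnation_subsets = [
    subset
    for r in range(len(agents) + 1)
    for subset in combinations(agents, r)
]
assert len(reincarnation_subsets) == 2 ** 6 == 64
```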

### Impact of Teacher Dataset Quality in Reincarnating MARL

First, we show that fully reincarnating a MARL system can speed up convergence. Additionally, we show that providing access solely to *Good* teacher data does not initially help speed up training, and even seems to hamper it. It is only after around *125k* timesteps that we observe a dramatic peak in performance, thereafter significantly outperforming the *tabula rasa* system. In contrast, having additional *Medium* samples enables higher returns from the beginning of training, converging faster than with the *Good* dataset alone.

<p align="center"><img src="../assets/research/selective-reincarnation/dataset_quality.png" width="450" height="300" alt="Impact of Teacher Datasets"></p>

### Arbitrarily Selective Reincarnation

Next, we show that a selectively reincarnated setup also yields benefits: e.g. reincarnating with just half of the agents provides an improvement over *tabula rasa*.

<p align="center"><img src="../assets/research/selective-reincarnation/arbitrarily_selective_reincarnation.png" width="450" height="300" alt="Arbitrarily Selective Reincarnation"></p>

### Targeted Selective Reincarnation Matters

Finally, we present a vital consideration: in a multi-agent system, even in the simpler homogeneous case, agents can sometimes assume dissimilar roles with different degrees of importance to the whole system. In the HALFCHEETAH environment particularly, consider the unique requirements for the ankle, knee, and hip joints, and how these differ across the front and back legs, in order for the cheetah to walk. It is thus important that we compare, for a given integer *x*, the results across various combinations of *x* reincarnated agents; that is, e.g., compare reincarnating the back ankle and back knee (BA, BK) with the back ankle and back hip (BA, BH). We find that the choice of which agents to reincarnate plays a significant role in the experiment’s outcome.

#### Best and Worst of Three Reincarnated Agents

<p align="center"><img src="../assets/research/selective-reincarnation/3_reincarnated_agents.png" width="450" height="300" alt="Targeted Selective Reincarnation 3 Agents"></p>

#### Best and Worst of Four Reincarnated Agents

<p align="center"><img src="../assets/research/selective-reincarnation/4_reincarnated_agents.png" width="450" height="300" alt="Targeted Selective Reincarnation 4 Agents"></p>

#### Best and Worst of Five Reincarnated Agents

<p align="center"><img src="../assets/research/selective-reincarnation/5_reincarnated_agents.png" width="450" height="300" alt="Targeted Selective Reincarnation 5 Agents"></p>

## Cite

```
@inproceedings{formanek2023selective,
  title={Reduce, Reuse, Recycle: Selective Reincarnation in Multi-Agent Reinforcement Learning},
  author={Juan Claude Formanek and Callum Rhys Tilbury and Jonathan Phillip Shock and Kale-ab Tessera and Arnu Pretorius},
  booktitle={Workshop on Reincarnating Reinforcement Learning at ICLR 2023},
  year={2023},
  url={https://openreview.net/forum?id=_Nz9lt2qQfV}
}
```