diff --git a/README.md b/README.md
new file mode 100644
index 0000000..9db4090
--- /dev/null
+++ b/README.md
@@ -0,0 +1,20 @@
+# 2nd collaboration workshop on Reinforcement Learning for Autonomous Accelerators (RL4AA'24)
+
+This repository contains the material for the second day of the [RL4AA'24](https://indico.scc.kit.edu/event/3746/timetable/#all.detailed) event.
+
+Homepage for the RL4AA Collaboration: [https://rl4aa.github.io/](https://rl4aa.github.io/)
+
+## Theory slides for the tutorial
+
+- [Advanced concepts in RL](https://github.com/RL4AA/RL4AA23/blob/main/slides/Hirlaender_advanced_concepts.pdf), Simon Hirländer
+
+## Python tutorial: meta reinforcement learning implementation example
+
+- GitHub repository containing the material: [https://github.com/RL4AA/rl4aa24-tutorial](https://github.com/RL4AA/rl4aa24-tutorial/)
+- Tutorial in slide form: [here](https://github.com/RL4AA/rl4aa24-tutorial/tutorial.html#/)
+
+## Getting started
+
+- First, download the material to your local disk by cloning the repository:
+`git clone https://github.com/RL4AA/rl4aa24-tutorial.git`
+- If you don't have git installed, you can click on the green button that says "Code" and choose to download it as a `.zip` file.
\ No newline at end of file
diff --git a/tutorial.html b/tutorial.html
new file mode 100644
index 0000000..f1cb7d2
--- /dev/null
+++ b/tutorial.html
@@ -0,0 +1,15270 @@
Simon Hirländer, Jan Kaiser, Chenran Xu, Andrea Santamaria Garcia
First, clone the repository:

`git clone https://github.com/RL4AA/rl4aa24-tutorial.git`

Then create the conda environment:

`conda env create -f environment.yml`

This should create an environment named `rl-tutorial` and install the necessary packages inside. Afterwards, activate the environment using

`conda activate rl-tutorial`

If you don't have conda installed, you can alternatively create a virtual environment with

`python -m venv rl-tutorial`

and activate it with `source <venv>/bin/activate` (bash) or `<venv>\Scripts\activate.bat` (Windows). Then, install the packages with pip within the activated environment:

`python -m pip install -r requirements.txt`

Afterwards, you should be able to run the provided scripts.
AWAKE is an accelerator R&D project based at CERN. It investigates the use of plasma wakefields driven by a proton bunch to accelerate charged particles.
- The goal is to minimize the distance $\Delta x_i$ between the initial beam trajectory and a target trajectory at different points $i$ (here marked as "position") along the accelerator, in as few steps as possible.
- The problem is formulated in an episodic manner.
- The optics might be different from what we expect in real life.
The environment dynamics are determined by the response matrix, which in linear systems encapsulates the dynamics of the problem.
More specifically, given the response matrix $\mathbf{R}$, the change in actions $\Delta a$ (corrector magnet strengths), and the change in states $\Delta s$ (BPM readings), we have:

\begin{align}
\Delta s &= \mathbf{R}\,\Delta a
\end{align}

During this tutorial we want to compare the trained policies we obtain with different methods to a benchmark policy.
For this problem, our "benchmark policy" is simply the inverse of the environment's response matrix.
More specifically, we have:

\begin{align}
\Delta a &= \mathbf{R}^{-1}\Delta s
\end{align}

$\implies$ In theory, the problem can therefore be solved by applying the inverse response matrix, $\mathbf{R}^{-1}$, directly to the system.
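As a rough numerical illustration of the two relations above, here is a minimal NumPy sketch. The matrix values and dimensions are made up for illustration and are not the actual AWAKE response matrix:

```python
import numpy as np

rng = np.random.default_rng(42)
n_bpms = n_correctors = 10           # hypothetical numbers of BPMs and corrector magnets

# Hypothetical, well-conditioned response matrix R mapping corrector changes to BPM changes
R = np.eye(n_bpms) + 0.1 * rng.normal(size=(n_bpms, n_correctors))

# Forward dynamics: a change in corrector strengths produces a change in BPM readings
delta_a = rng.normal(size=n_correctors)
delta_s = R @ delta_a

# "Benchmark policy": apply the inverse response matrix to remove an observed offset in one shot
offset = rng.normal(size=n_bpms)                  # current distance to the target trajectory
correction = -np.linalg.inv(R) @ offset           # Delta a = -R^{-1} Delta s
print(np.allclose(offset + R @ correction, 0.0))  # True: the trajectory is back on target
```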
Side note:

- `ppo.py`: runs the training and evaluation stages sequentially.
- `configs/maml/verification_tasks.pkl`: contains 5 tasks (environments/optics) upon which the policies will be evaluated.
- `n_env` = 1
- `n_steps` = 2048 (default params)
- `buffer_size` = `n_steps` x `n_env` = 2048
- `n_epochs` = 10 (default params)
- Total number of gradient update epochs: `int((total_timesteps / buffer_size)) * n_epochs` (a quick calculation is sketched after this list)
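To make the bookkeeping concrete, here is a quick back-of-the-envelope calculation (plain Python, not code from the tutorial; it just evaluates the formula from the list above) for the two `total_timesteps` settings used in the exercises below:

```python
n_env = 1
n_steps = 2048                      # default rollout length mentioned above
buffer_size = n_steps * n_env       # 2048
n_epochs = 10                       # default number of epochs per filled buffer

for total_timesteps in (100, 50_000):
    updates = int(total_timesteps / buffer_size) * n_epochs
    print(f"total_timesteps={total_timesteps}: buffer filled "
          f"{total_timesteps // buffer_size} time(s), {updates} update epochs")

# total_timesteps=100:    the 2048-step buffer is never filled -> 0 update epochs
# total_timesteps=50,000: the buffer is filled 24 times        -> 240 update epochs
```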
Go to `ppo.py` and change `total_timesteps` to 100. This can be done by providing the command line argument `--steps [num_steps]`.

Run it in the terminal with `python ppo.py --steps 100`.
$\implies$ Considering the PPO agent settings: will we fill the buffer? What do you expect will happen?

$\implies$ What is the difference in episode length between the benchmark policy and PPO?

$\implies$ Look at the cumulative episode length: which policy takes longer?

$\implies$ Compare both cumulative rewards: which reward is higher, and why?

$\implies$ Look at the final reward (-10 * RMS(BPM readings)) and consider the convergence (in red) and termination conditions mentioned before. What can you say about how the episode ended?
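For reference, a minimal sketch of the final-reward expression quoted in the last question, assuming the BPM readings arrive as a NumPy array (the actual environment code may differ in details such as scaling and units):

```python
import numpy as np

def final_reward(bpm_readings: np.ndarray) -> float:
    """-10 * RMS(BPM readings): closer to 0 (less negative) means a better-corrected trajectory."""
    return -10.0 * np.sqrt(np.mean(np.square(bpm_readings)))

print(final_reward(np.array([2.0, -1.5, 1.0])))       # strongly negative: large trajectory offsets
print(final_reward(np.array([0.02, -0.015, 0.01])))   # close to zero: well-corrected trajectory
```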
Set `total_timesteps` to 50,000 this time. Run it in the terminal with `python ppo.py --steps 50000`.

$\implies$ What are the main differences between the untrained and trained PPO policies?
Meta-learning occurs when one learning system progressively adjusts the operation of a second learning system, such that the latter operates with increasing speed and efficiency.

This scenario is often described in terms of two ‘loops’ of learning: an ‘outer loop’ that uses its experiences over many task contexts to gradually adjust parameters that govern the operation of an ‘inner loop’, so that the inner loop can adjust rapidly to new tasks (see Figure 1). Meta-RL refers to the case where both the inner and outer loop implement RL algorithms, learning from reward outcomes and optimizing toward behaviors that yield maximal reward.
There are MANY flavors of meta-RL.

In this tutorial we will adapt the parameters of our model (policy) through gradient descent with the MAML algorithm.
We sample a batch of tasks (`meta-batch-size` in the code) from a task distribution, each one with its particular initial task policy $\varphi_{0}^i=\phi_0$.

Here, $\beta$ is the meta learning rate and $\alpha$ is the fast learning rate (used for the inner-loop gradient updates); a toy sketch of this two-loop update is given below.
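To make the two-loop structure concrete, here is a self-contained toy sketch. It is not the code of this repository: the tutorial's inner loop uses REINFORCE and its outer loop uses TRPO on the AWAKE environment, whereas this toy replaces the RL objective with a simple regression loss on a family of random linear tasks and uses the first-order MAML approximation. All names (`sample_task`, `loss_and_grad`, etc.) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 0.05, 0.01       # fast (inner-loop) and meta (outer-loop) learning rates
theta = rng.normal(size=2)     # meta-parameters phi_0 of a tiny linear model y = theta[0]*x + theta[1]

def sample_task():
    """A 'task' is a random linear function y = a*x + b with some sample points."""
    a, b = rng.uniform(-2.0, 2.0, size=2)
    x = rng.uniform(-1.0, 1.0, size=16)
    return x, a * x + b

def loss_and_grad(params, x, y):
    """Mean-squared error of the linear model and its gradient w.r.t. the parameters."""
    err = params[0] * x + params[1] - y
    return np.mean(err ** 2), np.array([2 * np.mean(err * x), 2 * np.mean(err)])

for _ in range(1000):                              # outer loop (meta-training iterations)
    meta_grad = np.zeros_like(theta)
    for _ in range(5):                             # meta-batch of tasks
        x, y = sample_task()
        _, g = loss_and_grad(theta, x, y)
        phi_i = theta - alpha * g                  # inner loop: one fast adaptation step
        _, g_adapted = loss_and_grad(phi_i, x, y)
        meta_grad += g_adapted                     # first-order MAML approximation
    theta -= beta * meta_grad / 5                  # outer loop: update the meta-parameters

print("meta-initialization after training:", theta)
```

The important part is the structure: an inner gradient step with the fast learning rate `alpha` for each task, followed by a meta-update of the shared initialization with the meta learning rate `beta`.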
- `train.py`: performs the meta-training on the AWAKE problem.
- `test.py`: performs the evaluation of the trained policy.
- `configs/`: stores the YAML files for the training configurations.

Run the following command to train the task policy $\varphi_0^0$ for 500 steps:

`python test.py --experiment-name tutorial --experiment-type adapt_from_scratch --num-batches=500 --plot-interval=50 --task-ids 0`

Once it has run, you can look at the adaptation progress by running:

`python read_out_train.py --experiment-name tutorial --experiment-type adapt_from_scratch`

You can now run several tasks.
You can run the meta-training via (but don't run it now!):

`python train.py --experiment-name <give_a_meaningful_name>`

Note: the meta-training takes about 30 minutes for the current configuration. Therefore, we have provided a pre-trained policy that can be used for evaluation later.
We will now use a pre-trained policy located in `awake/pretrained_policy.th` and evaluate it against a certain number of fixed tasks.

`python test.py --experiment-name tutorial --experiment-type test_meta --use-meta-policy --policy awake/pretrained_policy.th --num-batches=500 --plot-interval=50 --task-ids 0 1 2 3 4`
- Use `--task-ids 0 1 2 3 4` to run the evaluation against all 5 tasks, or e.g. `--task-ids 0` to evaluate only task 0.
- Pass `--use-meta-policy` so that the script uses the pre-trained policy.

Afterwards, you can look at the adaptation progress by running:

`python read_out_train.py --experiment-name tutorial --experiment-type test_meta`

$\implies$ What difference can you see compared to the untrained policy?
This part is important if you want to gain a deeper understanding of the MAML algorithm.
- `maml_rl/metalearners/maml_trpo.py`: implements the TRPO algorithm for the outer loop.
- `maml_rl/policies/normal_mlp.py`: implements a simple MLP policy for the RL agent (a rough sketch of such a policy is given after this list).
- `maml_rl/utils/reinforcement_learning.py`: implements the REINFORCE algorithm for the inner loop.
- `maml_rl/samplers/`: handles the sampling of the meta-trajectories of the environment using the multiprocessing package.
- `maml_rl/baseline.py`: a linear baseline for the advantage calculation in RL.
- `maml_rl/episodes.py`: a custom class to store the results and statistics of the episodes for meta-training.
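As a rough idea of what such a "normal MLP" policy typically looks like, here is a hedged PyTorch sketch; the actual `maml_rl/policies/normal_mlp.py` will differ in architecture, initialization, and in how parameters are passed around for the MAML updates:

```python
import torch
import torch.nn as nn

class NormalMLPPolicy(nn.Module):
    """Gaussian policy: an MLP outputs the mean of a Normal distribution over actions,
    with a learnable, state-independent log standard deviation."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs: torch.Tensor) -> torch.distributions.Normal:
        mean = self.net(obs)
        return torch.distributions.Normal(mean, self.log_std.exp())

# Example: sample an action and its log-probability for a random 10-dimensional observation
policy = NormalMLPPolicy(obs_dim=10, act_dim=10)
dist = policy(torch.randn(10))
action = dist.sample()
log_prob = dist.log_prob(action).sum()   # used by REINFORCE/TRPO-style policy-gradient losses
```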