This repository contains a proof-of-concept study using conditional diffusion models to learn the mapping of partons to hadrons in quantum chromodynamics (QCD). Understanding this hadronization process is one of the biggest open questions in fundamental physics: the confining interactions of QCD generate approximately 99% of the mass in the visible universe, and the problem is closely related to the Clay Millennium Prize Yang-Mills mass gap problem.
We implement:
- Generation of a data set of simulated high-energy electron-positron collisions, consisting of paired images (partons, hadrons) where each pixel location represents the angular coordinates of a particle and each pixel intensity represents its energy.
- Training of conditional diffusion models to learn the forward or inverse hadronization mapping.
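To make the image representation concrete, here is a minimal sketch of how particles could be binned into such an event image. The 32x32 grid size and the coordinate ranges are illustrative assumptions, not the repository's actual choices:

```python
import math

def make_event_image(particles, n_pix=32, eta_max=1.0):
    """Toy event image: pixel position = angular coordinates (eta, phi),
    pixel value = energy. particles: list of (eta, phi, energy) tuples."""
    image = [[0.0] * n_pix for _ in range(n_pix)]
    for eta, phi, energy in particles:
        # Map eta in [-eta_max, eta_max] and phi in [-pi, pi] to pixel indices
        i = int((eta + eta_max) / (2 * eta_max) * n_pix)
        j = int((phi + math.pi) / (2 * math.pi) * n_pix)
        if 0 <= i < n_pix and 0 <= j < n_pix:
            image[i][j] += energy  # particles sharing a pixel sum their energies
    return image

# Example: two particles land in the same pixel, one in a different pixel
img = make_event_image([(0.1, 0.2, 5.0), (0.1, 0.2, 3.0), (-0.5, -1.0, 2.0)])
print(sum(map(sum, img)))  # total pixel intensity equals total energy: 10.0
```

Summing intensities over pixels recovers the total energy in the event, which is a useful sanity check when generating the dataset.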
Using the Shift-DDPM implementation of conditional diffusion, we demonstrate as a proof of concept that the ML model can successfully map different parton images to their corresponding hadron images. This result is nontrivial since hadronization is stochastic: the parton-to-hadron mapping is one-to-many. We show that, in our simplified setup, we can robustly invert any sampled hadron image into its single corresponding parton image, as shown below.
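The conditioning mechanism can be illustrated with a toy forward process. In Shift-DDPM, the conditioning information shifts the mean of the diffused latents, so the model denoises toward a condition-dependent prior rather than a standard Gaussian. The linear schedule and scalar data below are deliberate simplifications, not the schedule or architecture used in this repository:

```python
import math, random

def q_sample(x0, shift, t, T, rng):
    """Toy forward diffusion sample:
    x_t = sqrt(abar_t)*x0 + (1 - sqrt(abar_t))*shift + sqrt(1 - abar_t)*eps.
    `shift` stands in for the condition-dependent shift of the latent mean
    (the defining idea of Shift-DDPM); the linear schedule is a simplification."""
    abar = 1.0 - t / T  # toy schedule: abar = 1 at t = 0, abar = 0 at t = T
    mean = math.sqrt(abar) * x0 + (1.0 - math.sqrt(abar)) * shift
    return mean + math.sqrt(1.0 - abar) * rng.gauss(0.0, 1.0)

rng = random.Random(0)
T = 1000
# At t = 0 the sample equals the data; at t = T it is a unit Gaussian
# centred on the condition-dependent shift.
x_start = q_sample(2.0, shift=5.0, t=0, T=T, rng=rng)
samples_end = [q_sample(2.0, shift=5.0, t=T, T=T, rng=rng) for _ in range(2000)]
mean_end = sum(samples_end) / len(samples_end)
print(x_start)   # 2.0 exactly
print(mean_end)  # close to the shift value 5.0
```

Because the fully-noised distribution already encodes the condition, the reverse (denoising) process is guided toward the correct conditional output from the very first sampling step.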
The data pipeline consists of the following steps:
- Create dataset
- Generate simulated electron-positron collisions. This requires installation of our heppy package (available via docker image), which provides a python interface for PYTHIA physics simulations and reconstruction.
- Perform jet reconstruction and record relevant particle information from each event.
- Load dataset and do ML training and analysis
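As a conceptual illustration of the jet-reconstruction step, here is a naive fixed-cone grouping in pure Python. This is a toy stand-in only: the actual pipeline uses heppy's interface to standard jet-finding algorithms, and the cone radius R = 0.4 is an illustrative assumption:

```python
import math

def cone_cluster(particles, R=0.4):
    """Naive fixed-cone grouping: a toy stand-in for real jet reconstruction
    (the repository's pipeline uses heppy, not this).
    particles: list of (eta, phi, energy) tuples."""
    jets = []
    remaining = sorted(particles, key=lambda p: -p[2])  # hardest particle first
    while remaining:
        seed = remaining[0]
        in_cone, out = [], []
        for p in remaining:
            deta = p[0] - seed[0]
            # wrap the phi difference into (-pi, pi]
            dphi = math.atan2(math.sin(p[1] - seed[1]), math.cos(p[1] - seed[1]))
            (in_cone if math.hypot(deta, dphi) < R else out).append(p)
        jets.append({"energy": sum(p[2] for p in in_cone), "constituents": in_cone})
        remaining = out
    return jets

jets = cone_cluster([(0.0, 0.0, 10.0), (0.1, 0.1, 5.0), (1.5, 2.0, 3.0)])
print(len(jets))          # 2 clusters: the two nearby particles merge
print(jets[0]["energy"])  # 15.0
```

The recorded per-jet constituent information (angles and energies) is what gets binned into the paired parton/hadron images for training.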
The pipeline is steered by the script `steer_analysis.py`, where you can specify which parts of the pipeline you want to run, along with a config file `config.yaml`.
Remember that you will first need to initialize the python virtual environment. An example of the required setup is:
cd /path/to/QCD-Diffusion
source init.sh --install
You may need to adapt this to your own system setup, for example utilizing the heppy docker image.
To generate a new dataset and write it to file:

python analysis/steer_analysis.py --generate --write

To read an existing dataset and run the ML analysis:

python analysis/steer_analysis.py --read /path/to/training_data.h5 --analyze

To generate a dataset and run the ML analysis in one go:

python analysis/steer_analysis.py --generate --analyze
For reference, our reading and exercise list is linked here:
To begin, we need to set up a few things to be able to run our code and keep track of our changes with version control. Don't allow yourself to get stuck – if you are spending more than e.g. 10 minutes on a given step and are not sure what to do, ask one of us – don't hesitate.
We also encourage you to liberally use ChatGPT for software questions, both technical (e.g. "How do I navigate to a certain directory on a linux terminal?", "I got this error after trying to do X: ") and conceptual ("Why do I want to use version control when writing code?", "What is a python virtual environment?").
To start, do the following:
- Create a GitHub account
- We will create an account for you on the hiccup cluster, a local computing cluster that we will use this summer.
- Open a terminal on your laptop and try to log in:

ssh <user>@hic.lbl.gov

  - Your home directory (/home/<user>) is where you can store your code
  - The /rstorage directory should be used to store data that you generate from your analysis (e.g. ML training datasets)
- Generate an SSH key and upload it to your GitHub account
- Clone this repository:

git clone <url>

- On your laptop, download VSCode
  - Install the Remote-SSH extension – this will allow you to easily edit code on hiccup via your laptop's editor
  - Create a new workspace that connects via SSH to hiccup, and add the folder for this repository to the workspace
  - Now, try to open a file and check that you can edit it successfully (with the changes being reflected on hiccup)
Now we are ready to set up the specific environment for our analysis.
If you are using the terminal inside of VSCode, you can log on to the hiccupgpu node by installing the "Remote-SSH" extension in VSCode and adding a new remote server:
Host hic.lbl.gov
...
Hostname hic.lbl.gov
User <user>
Port 1142
Alternately, you can log directly onto the hiccup GPU node with:
ssh <user>@hic.lbl.gov -p 1142
Now we need to initialize the environment: load heppy (for Monte Carlo event generation and jet finding), set the python version, and create a virtual environment for python packages. We have set up an initialization script to take care of this. The first time you set up, you can do:
cd ML_Jets_Summer2023
source init.sh --install
On subsequent times, you don't need to pass the --install flag:
cd ML_Jets_Summer2023
source init.sh
Now we are ready to run our scripts.