This repository contains a proof-of-concept study using conditional diffusion models to learn the mapping of partons to hadrons in quantum chromodynamics (QCD). Understanding this hadronization process is one of the biggest open questions in fundamental physics: the confining interactions of QCD generate approximately 99% of the mass in the visible universe, and the problem is closely related to the Clay Millennium Prize Yang-Mills mass gap problem.
We implement:
- Generation of a data set of simulated high-energy electron-positron collisions, consisting of paired images (partons, hadrons) where each pixel location represents the angular coordinates of a particle and each pixel intensity represents its energy.
- Training of conditional diffusion models to learn the forward or inverse hadronization mapping.
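To make the image representation concrete, here is a minimal sketch of how particles could be binned into such an event image. The 32x32 grid size and the coordinate ranges are illustrative assumptions, not the repository's actual choices:

```python
import math

def make_event_image(particles, n_pix=32, eta_max=1.0):
    """Toy event image: pixel position = angular coordinates (eta, phi),
    pixel value = energy. particles: list of (eta, phi, energy) tuples."""
    image = [[0.0] * n_pix for _ in range(n_pix)]
    for eta, phi, energy in particles:
        # Map eta in [-eta_max, eta_max] and phi in [-pi, pi] to pixel indices
        i = int((eta + eta_max) / (2 * eta_max) * n_pix)
        j = int((phi + math.pi) / (2 * math.pi) * n_pix)
        if 0 <= i < n_pix and 0 <= j < n_pix:
            image[i][j] += energy  # particles sharing a pixel sum their energies
    return image

# Example: two particles land in the same pixel, one in a different pixel
img = make_event_image([(0.1, 0.2, 5.0), (0.1, 0.2, 3.0), (-0.5, -1.0, 2.0)])
print(sum(map(sum, img)))  # total pixel intensity equals total energy: 10.0
```

Summing intensities over pixels recovers the total energy in the event, which is a useful sanity check when generating the dataset.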
Using the Shift-DDPM implementation of conditional diffusion, we demonstrate as a proof of concept that the ML model can successfully map different parton images to their corresponding hadron images. This result is nontrivial since hadronization is stochastic: the parton-to-hadron mapping is one-to-many. We show that, in our simplified setup, we can robustly invert any sampled hadron image into its single corresponding parton image, as shown below.
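The conditioning mechanism can be illustrated with a toy forward process. In Shift-DDPM, the conditioning information shifts the mean of the diffused latents, so the model denoises toward a condition-dependent prior rather than a standard Gaussian. The linear schedule and scalar data below are deliberate simplifications, not the schedule or architecture used in this repository:

```python
import math, random

def q_sample(x0, shift, t, T, rng):
    """Toy forward diffusion sample:
    x_t = sqrt(abar_t)*x0 + (1 - sqrt(abar_t))*shift + sqrt(1 - abar_t)*eps.
    `shift` stands in for the condition-dependent shift of the latent mean
    (the defining idea of Shift-DDPM); the linear schedule is a simplification."""
    abar = 1.0 - t / T  # toy schedule: abar = 1 at t = 0, abar = 0 at t = T
    mean = math.sqrt(abar) * x0 + (1.0 - math.sqrt(abar)) * shift
    return mean + math.sqrt(1.0 - abar) * rng.gauss(0.0, 1.0)

rng = random.Random(0)
T = 1000
# At t = 0 the sample equals the data; at t = T it is a unit Gaussian
# centred on the condition-dependent shift.
x_start = q_sample(2.0, shift=5.0, t=0, T=T, rng=rng)
samples_end = [q_sample(2.0, shift=5.0, t=T, T=T, rng=rng) for _ in range(2000)]
mean_end = sum(samples_end) / len(samples_end)
print(x_start)   # 2.0 exactly
print(mean_end)  # close to the shift value 5.0
```

Because the fully-noised distribution already encodes the condition, the reverse (denoising) process is guided toward the correct conditional output from the very first sampling step.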
The data pipeline consists of the following steps:
- Create dataset
- Generate simulated electron-positron collisions. This requires installation of our heppy package (available via docker image), which provides a python interface for PYTHIA physics simulations and reconstruction.
- Perform jet reconstruction and record relevant particle information from each event.
- Load dataset and do ML training and analysis
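As a conceptual illustration of the jet-reconstruction step, here is a naive fixed-cone grouping in pure Python. This is a toy stand-in only: the actual pipeline uses heppy's interface to standard jet-finding algorithms, and the cone radius R = 0.4 is an illustrative assumption:

```python
import math

def cone_cluster(particles, R=0.4):
    """Naive fixed-cone grouping: a toy stand-in for real jet reconstruction
    (the repository's pipeline uses heppy, not this).
    particles: list of (eta, phi, energy) tuples."""
    jets = []
    remaining = sorted(particles, key=lambda p: -p[2])  # hardest particle first
    while remaining:
        seed = remaining[0]
        in_cone, out = [], []
        for p in remaining:
            deta = p[0] - seed[0]
            # wrap the phi difference into (-pi, pi]
            dphi = math.atan2(math.sin(p[1] - seed[1]), math.cos(p[1] - seed[1]))
            (in_cone if math.hypot(deta, dphi) < R else out).append(p)
        jets.append({"energy": sum(p[2] for p in in_cone), "constituents": in_cone})
        remaining = out
    return jets

jets = cone_cluster([(0.0, 0.0, 10.0), (0.1, 0.1, 5.0), (1.5, 2.0, 3.0)])
print(len(jets))          # 2 clusters: the two nearby particles merge
print(jets[0]["energy"])  # 15.0
```

The recorded per-jet constituent information (angles and energies) is what gets binned into the paired parton/hadron images for training.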
The pipeline is steered by the script `steer_analysis.py`, where you can specify which parts of the pipeline you want to run, along with a config file `config.yaml`.
Remember that you will first need to initialize the python virtual environment. An example of the required setup is:
cd /path/to/QCD-Diffusion
source init.sh --install
You may need to adapt this to your own system setup, for example utilizing the heppy docker image.
To generate a new dataset and write it to file:

python analysis/steer_analysis.py --generate --write

To read an existing dataset and run the ML analysis:

python analysis/steer_analysis.py --read /path/to/training_data.h5 --analyze

To generate a dataset and run the ML analysis in one go:

python analysis/steer_analysis.py --generate --analyze
For reference, our reading and exercise list is linked here:
To begin, we need to set up a few things to be able to run our code and keep track of our changes with version control. Don't allow yourself to get stuck – if you are spending more than e.g. 10 minutes on a given step and are not sure what to do, ask one of us – don't hesitate.
We also encourage you to liberally use ChatGPT for software questions, both technical (e.g. "How do I navigate to a certain directory on a linux terminal?", "I got this error after trying to do X: ") and conceptual ("Why do I want to use version control when writing code?", "What is a python virtual environment?").
To start, do the following:
- Create a GitHub account
- We will create an account for you on the hiccup cluster, a local computing cluster that we will use this summer.
- Open a terminal on your laptop and try to log in:

ssh <user>@hic.lbl.gov

  - Your home directory (/home/<user>) is where you can store your code
  - The /rstorage directory should be used to store data that you generate from your analysis (e.g. ML training datasets)
- Generate an SSH key and upload it to your GitHub account
- Clone this repository:

git clone <url>

- On your laptop, download VSCode
  - Install the Remote-SSH extension – this will allow you to easily edit code on hiccup via your laptop's editor
  - Create a new workspace that connects via SSH to hiccup, and add the folder for this repository to the workspace
  - Now, try to open a file and check that you can edit it successfully (with the changes being reflected on hiccup)
Now we are ready to set up the specific environment for our analysis.
If you are using the terminal inside of VSCode, you can log on to the hiccupgpu node by installing the "Remote-SSH" extension in VSCode and adding a new remote server:
Host hic.lbl.gov
...
Hostname hic.lbl.gov
User <user>
Port 1142
Alternately, you can log directly onto the hiccup GPU node with:
ssh <user>@hic.lbl.gov -p 1142
Now we need to initialize the environment: load heppy (for Monte Carlo event generation and jet finding), set the python version, and create a virtual environment for python packages. We have set up an initialization script to take care of this. The first time you set up, you can do:
cd ML_Jets_Summer2023
source init.sh --install
On subsequent times, you don't need to pass the --install flag:
cd ML_Jets_Summer2023
source init.sh
Now we are ready to run our scripts.