Commit cc8a5a9

BenediktAlkin committed Feb 14, 2024
1 parent 326f162 commit cc8a5a9
Showing 2 changed files with 47 additions and 142 deletions.
122 changes: 14 additions & 108 deletions README.md
@@ -1,121 +1,27 @@
# MLPlayground
# MIM-Refiner

## Setup
Pytorch implementation of MIM-Refiner.

#### Environment

`conda env create --file environment_<OS>.yml --name <NAME>`
This will most likely install PyTorch 2.0.0 with an old CUDA version, so install a build that matches your CUDA setup, e.g.:
`pip install torch==2.0.0+cu117 torchvision==0.15.0+cu117 --index-url https://download.pytorch.org/whl/cu117`
`pip install torch==2.0.0+cu118 torchvision==0.15.0+cu118 --index-url https://download.pytorch.org/whl/cu118`
`pip install torch==2.1.1+cu121 torchvision==0.16.1+cu121 --index-url https://download.pytorch.org/whl/cu121`
# Pre-trained Models

You can check the installed versions with:

```
import torch
print(torch.__version__)   # installed PyTorch version
print(torch.version.cuda)  # CUDA version this PyTorch build was compiled with
```
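
Or, as a one-liner from the shell:

`python -c "import torch; print(torch.__version__, torch.version.cuda)"`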

#### Optional: special libraries

- `pip install cyanure-mkl` (logistic regression; only available on linux; make sure you are on a GPU node to install)

#### Configuration

#### static_config.yaml

Choose one of the two options:

- copy a template and adjust values to your setup `cp template_static_config_iml.yaml static_config.yaml`
- create a file `static_config.yaml` with the first line `template: ${yaml:template_static_config_iml}`
- overwrite values in the template by adding lines of the format `template.<property_to_overwrite>: <value>`
- `template.account_name: <ACCOUNT_NAME>`
- `template.output_path: <OUTPUT_PATH>`
- add new values
- `model_path: <MODEL_PATH>`
- example:
```
template: ${yaml:template_static_config_iml}
template.account_name: <ACCOUNT_NAME>
template.output_path: /system/user/publicwork/ssl/save
template.local_dataset_path: /localdata
```

#### Optional configs
- create wandb config(s) (use via `--wandb_config <WANDB_CONFIG_NAME>` in CLI or `wandb: <WANDB_CONFIG_NAME>` in yaml; see the sketch after this list)
- `cp template_wandb_config.yaml wandb_configs/<WANDB_CONFIG_NAME>.yaml`
- adjust values to your setup
- create a default wandb config (this will be used when no wandb config is defined)
- `cp template_wandb_config.yaml wandb_config.yaml`
- adjust values to your setup
- create `sbatch_config.yaml` (only required for `main_sbatch.py` on slurm clusters)
- `cp template_config_sbatch.yaml sbatch_config.yaml`
- create `template_sbatch_nodes.yaml` (only required for running `main_sbatch.py --nodes <NODES>` on slurm clusters)
- `cp template_sbatch_nodes_<HPC>.yaml template_sbatch_nodes.yaml`
- create `template_sbatch_gpus.yaml` (only required for running `main_sbatch.py --gpus <GPUS>` on slurm clusters)
- `cp template_sbatch_gpus_<HPC>.yaml template_sbatch_gpus.yaml`
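
A minimal sketch of such a wandb config (the authoritative keys are in `template_wandb_config.yaml`; `entity`/`project` are the fields referenced in SETUP_CODE.md, `mode` is an assumption):

```
# wandb_configs/<WANDB_CONFIG_NAME>.yaml -- sketch only, see template_wandb_config.yaml for the real keys
entity: <WANDB_ENTITY>
project: <WANDB_PROJECT>
mode: online
```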

## Run

### Runs require the following arguments

- `--hp <YAML>` e.g. `--hp hyperparams.yaml` defines what to run
- `--devices <DEVICES>` e.g. `--devices 0` to run on GPU0 or `--devices 0,1,2,3` to run on 4 GPUs
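
For example, a minimal single-GPU run via the `main_train.py` entry point (named in SETUP_CODE.md), assuming a `hyperparams.yaml` in the working directory:

`python main_train.py --hp hyperparams.yaml --devices 0`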

### Run with SLURM

`python main_sbatch.py --time 2-00:00:00 --qos default --nodes 1 ADDITIONAL_ARGUMENTS`
`python main_sbatch.py --time 2-00:00:00 --qos default --nodes 1 --hp <HP> --name <NAME>`

### Optional arguments (most important ones)

- `--name <NAME>` what name to assign in wandb
- `--wandb_config <YAML>` what wandb configuration to use (by default the `wandb_config.yaml` in the MLPlayground
directory will be used)
  - only required if you have set `default_wandb_mode` to `online`/`offline` or pass `--wandb_mode <WANDB_MODE>` as `online`/`offline` (a warning will be logged if you specify it with `wandb_mode=disabled`)
- `--num_workers` specifies how many workers will be used for data loading
  - by default `num_workers` will be `number_of_cpus / number_of_gpus` (e.g. with 64 CPU cores and 8 GPUs, each GPU process gets 8 workers)

### Development arguments

- `--accelerator cpu` runs on cpu (can still use multiple devices for debugging multi-gpu runs but with cpu)
- `--mindatarun` adjusts dataset lengths, epochs, logger intervals and batch size to a minimum
- `--minmodelrun` replaces all values in the hp yaml that match the pattern `${select:model_key:${yaml:models/...}}` with `${select:debug:${yaml:models/...}}` (see the sketch after this list)
  - you can define your model size with a model key and it will automatically be replaced with a minimal model
  - e.g. `encoder_model_key: tiny` for a ViT-T as encoder with `encoder_params: ${select:${vars.encoder_model_key}:${yaml:models/vit}}` will select a very light-weight ViT
- `--testrun` combination of `--mindatarun` and `--minmodelrun`
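
A minimal sketch of this pattern in an hp yaml (only the `${select:...}`/`${yaml:...}` expressions are taken from above; the surrounding keys are illustrative):

```
vars:
  encoder_model_key: tiny
# resolves the "tiny" entry of models/vit;
# --minmodelrun swaps the key for "debug", i.e. ${select:debug:${yaml:models/vit}}
encoder_params: ${select:${vars.encoder_model_key}:${yaml:models/vit}}
```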

## Data setup
TODO

#### data_loading_mode == "local"

- ImageFolder datasets can be stored as zip files (see SETUP.md for creating these)
- 1 zip per split (slow unpacking): ImageNet/train -> ImageNet/train.zip
- 1 zip per class per split (fast unpacking): ImageNet/train/n1830348 -> ImageNet/train/n1830348.zip
- sync zipped folders to other servers `rsync -r /localdata/imagenet1k host:/data/`
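
For example, to create 1 zip per class for the train split (a sketch using the standard `zip` CLI; see SETUP.md for the authoritative procedure):

`cd ImageNet/train && for d in */; do zip -r -q "${d%/}.zip" "${d%/}"; done`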
# Train your own models

## Resume run
Instructions to set up the codebase in your own environment are provided in SETUP_CODE, SETUP_DATA and SETUP_MODELS.

### Via CLI
- `--resume_stage_id <STAGE_ID>` resume from `cp=latest`
- `--resume_stage_id <STAGE_ID> --resume_checkpoint E100` resume from epoch 100
- `--resume_stage_id <STAGE_ID> --resume_checkpoint U100` resume from update 100
- `--resume_stage_id <STAGE_ID> --resume_checkpoint S1024` resume from sample 1024
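
For example, resuming stage `<STAGE_ID>` from epoch 100 (the flags are appended to a regular run command):

`python main_train.py --hp <YAML> --devices 0 --resume_stage_id <STAGE_ID> --resume_checkpoint E100`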
### Via yaml

Add a resume initializer to the trainer:

```
trainer:
  ...
  initializer:
    kind: resume_initializer
    stage_id: ???
    checkpoint:
      epoch: 100
```

# Citation

If you like our work, please consider giving it a star :star: and cite us:

```
@article{alkin2024mimrefiner,
  title={TODO},
  author={TODO},
  journal={TODO},
  year={TODO}
}
```
67 changes: 33 additions & 34 deletions SETUP_CODE.md
@@ -59,59 +59,58 @@
yamls create a folder `wandb_configs`, copy the `template_wandb_config.yaml` into it, adjust the
`entity`/`project` in this file and rename it to `v4.yaml`.
Every run that defines `wandb: v4` will now fetch the details from this file and log your metrics to this W&B project.
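
For example, putting this line into a run's hp yaml selects that config:

```
wandb: v4
```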

## SLURM config

## Run
This codebase supports runs in SLURM environments. For this, you need to provide some additional configurations.
Copy the `template_sbatch_config.yaml`, rename it to `sbatch_config.yaml` and adjust the values to your setup.

### Runs require the following arguments
Copy the `template_sbatch_nodes_github.sh`, rename it to `template_sbatch_nodes.sh` and adjust the values to your setup.
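
In short (file names as given above):

`cp template_sbatch_config.yaml sbatch_config.yaml`
`cp template_sbatch_nodes_github.sh template_sbatch_nodes.sh`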

- `--hp <YAML>` e.g. `--hp hyperparams.yaml` defines what to run
- `--devices <DEVICES>` e.g. `--devices 0` to run on GPU0 or `--devices 0,1,2,3` to run on 4 GPUs

### Run with SLURM
## Start Runs

You can start runs with the `main_train.py` file; examples are given below.

`python main_sbatch.py --time 2-00:00:00 --qos default --nodes 1 ADDITIONAL_ARGUMENTS`
`python main_sbatch.py --time 2-00:00:00 --qos default --nodes 1 --hp <HP> --name <NAME>`
You can queue up runs in SLURM environments by running `python main_sbatch.py --hp <YAML> --time <TIME> --nodes <NODES>`,
which queues up a run on `<NODES>` nodes using the hyperparameters from `<YAML>`.

### Optional arguments (most important ones)

- `--name <NAME>` what name to assign in wandb
- `--wandb_config <YAML>` what wandb configuration to use (by default the `wandb_config.yaml` in the MLPlayground
directory will be used)
  - only required if you have set `default_wandb_mode` to `online`/`offline` or pass `--wandb_mode <WANDB_MODE>` as `online`/`offline` (a warning will be logged if you specify it with `wandb_mode=disabled`)
- `--num_workers` specifies how many workers will be used for data loading
  - by default `num_workers` will be `number_of_cpus / number_of_gpus`
## Run

All hyperparameters have to be defined in a yaml file that is passed via the `--hp <YAML>` CLI argument.
You can start runs on "normal" servers or SLURM environments.

### Development arguments
### Run on "Normal" Servers

Define how many (and which) GPUs you want to use with the `--devices` CLI argument
- `--devices 0` will start the run on the GPU with index 0
- `--devices 2` will start the run on the GPU with index 2
- `--devices 0,1,2,3` will start the run on 4 GPUs

Examples:
- `python main_train.py --devices 0,1,2,3 --hp yamls/stage2/l16_mae.yaml`
- `python main_train.py --devices 0,1,2,3,4,5,6,7 --hp yamls/stage3/l16_mae.yaml`

### Run with SLURM

- `--accelerator cpu` runs on cpu (can still use multiple devices for debugging multi-gpu runs but with cpu)
- `--mindatarun` adjusts dataset lengths, epochs, logger intervals and batch size to a minimum
- `--minmodelrun` replaces all values in the hp yaml that match the pattern `${select:model_key:${yaml:models/...}}` with `${select:debug:${yaml:models/...}}`
  - you can define your model size with a model key and it will automatically be replaced with a minimal model
  - e.g. `encoder_model_key: tiny` for a ViT-T as encoder with `encoder_params: ${select:${vars.encoder_model_key}:${yaml:models/vit}}` will select a very light-weight ViT
- `--testrun` combination of `--mindatarun` and `--minmodelrun`
To start runs in SLURM environments, you need to set up the SLURM configurations as outlined above.
Then start runs with the `main_sbatch.py` script.

## Data setup
Example:
- `python main_sbatch.py --time 24:00:00 --nodes 4 --hp yamls/stage3/l16_mae.yaml`

#### data_loading_mode == "local"

- ImageFolder datasets can be stored as zip files (see SETUP.md for creating these)
- 1 zip per split (slow unpacking): ImageNet/train -> ImageNet/train.zip
- 1 zip per class per split (fast unpacking): ImageNet/train/n1830348 -> ImageNet/train/n1830348.zip
- sync zipped folders to other servers `rsync -r /localdata/imagenet1k host:/data/`
#### Resume run

## Resume run
Add these flags to your `python main_train.py` or `python main_sbatch.py` command.

### Via CLI
- `--resume_stage_id <STAGE_ID>` resume from `cp=latest`
- `--resume_stage_id <STAGE_ID> --resume_checkpoint E100` resume from epoch 100
- `--resume_stage_id <STAGE_ID> --resume_checkpoint U100` resume from update 100
- `--resume_stage_id <STAGE_ID> --resume_checkpoint S1024` resume from sample 1024
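
For example, resuming from update 100 in a SLURM environment (flags appended to a regular `main_sbatch.py` command):

`python main_sbatch.py --time <TIME> --nodes <NODES> --hp <YAML> --resume_stage_id <STAGE_ID> --resume_checkpoint U100`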

### Via yaml
add a resume initializer to the trainer
#### Via yaml
Add a resume initializer to the trainer

```
trainer:
  ...
  initializer:
    kind: resume_initializer
    stage_id: ???
    checkpoint:
      epoch: 100
```
