llama-stack server segfaults when post_training is triggered for torchtune #1141

booxter opened this issue Feb 18, 2025 · 1 comment
Labels: bug

booxter commented Feb 18, 2025

System Info

It's a MacBook M3 Pro, hence no CUDA.

Information

  • The official example scripts
  • My own modified scripts

🐛 Describe the bug

When I trigger training, the llama-stack server crashes. I've found that a workaround is setting OMP_NUM_THREADS=1 in the server environment.

This seems to be a known issue in PyTorch (pytorch/pytorch#121101) and probably stems from the fact that PyTorch bundles libomp (as some other projects do), so there may be a conflict between different builds of the library. (This is just a theory, not a fact.)

At the very least, we should probably document the workaround, though maybe there's a better solution.
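
For reference, the workaround in shell form (a sketch; the yaml config path and port below are placeholders, and the exact invocation depends on how you start the server):

# Pin OpenMP to a single thread for the server process to avoid the crash.
# run.yaml and 5001 are placeholders; substitute whatever you normally pass.
OMP_NUM_THREADS=1 python -m llama_stack.distribution.server.server \
    --yaml-config run.yaml --port 5001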


The script I use (its training part):

from llama_stack_client.types import (
    post_training_supervised_fine_tune_params,
)

training_config = post_training_supervised_fine_tune_params.TrainingConfig(
    data_config=post_training_supervised_fine_tune_params.TrainingConfigDataConfig(
        batch_size=1,
        data_format="instruct",
        #data_format="dialog",
        dataset_id=simpleqa_dataset_id,
        shuffle=True,
    ),
    gradient_accumulation_steps=1,
    max_steps_per_epoch=1,
    max_validation_steps=1,
    n_epochs=1,
    optimizer_config=post_training_supervised_fine_tune_params.TrainingConfigOptimizerConfig(
        lr=2e-5,
        num_warmup_steps=1,
        optimizer_type="adam",
        weight_decay=0.01,
    ),
)

from llama_stack_client.types import (
    algorithm_config_param,
)

algorithm_config = algorithm_config_param.LoraFinetuningConfig(
    alpha=1,
    apply_lora_to_mlp=True,
    # Invalid value: apply_lora_to_output is currently not supporting in llama3.2 1b and 3b,as the projection layer weights are tied to the embeddings
    apply_lora_to_output=False,
    lora_attn_modules=['q_proj'], # todo?
    rank=1,
    type="LoRA",
)

response = client.post_training.supervised_fine_tune(
    job_uuid=job_uuid,
    logger_config={},
    model=training_model,
    hyperparam_search_config={},
    training_config=training_config,
    algorithm_config=algorithm_config,
    checkpoint_dir="null",  # API claims it's not needed, but the server returns 400 if it's not passed.
)

Error logs

On server console:

17:27:42.573 [START] /v1/post-training/supervised-fine-tune
DEBUG 2025-02-18 12:27:42,573 torchtune.utils._logging:60: Setting manual seed to local seed 429841145. Local seed is seed + rank = 429841145 + 0
INFO 2025-02-18 12:27:42,616 torchtune.utils._logging:64: Identified model_type = Llama3_2. Ignoring output.weight in checkpoint in favor of the tok_embedding.weight tied weights.
/Users/ihrachys/src/llama-stack/llama_stack/distribution/start_venv.sh: line 71: 86644 Segmentation fault: 11  python -m llama_stack.distribution.server.server --yaml-config "$yaml_config" --port "$port" $env_vars $other_args
++ error_handler 71
++ echo 'Error occurred in script at line: 71'
Error occurred in script at line: 71
++ exit 1

From the Console.app stack trace for the process:

Crashed Thread:        6

Exception Type:        EXC_BAD_ACCESS (SIGSEGV)
Exception Codes:       KERN_INVALID_ADDRESS at 0x0000000000000008
Exception Codes:       0x0000000000000001, 0x0000000000000008

Termination Reason:    Namespace SIGNAL, Code 11 Segmentation fault: 11
Terminating Process:   exc handler [86644]

[...]

Thread 6 Crashed:
0   libomp.dylib                  	       0x113a59828 void __kmp_suspend_64<false, true>(int, kmp_flag_64<false, true>*) + 44
1   libomp.dylib                  	       0x113c05520 kmp_flag_64<false, true>::wait(kmp_info*, int, void*) + 1880
2   libomp.dylib                  	       0x113c00560 __kmp_hyper_barrier_release(barrier_type, kmp_info*, int, int, int, void*) + 184
3   libomp.dylib                  	       0x113c040e8 __kmp_fork_barrier(int, int) + 628
4   libomp.dylib                  	       0x113be0e14 __kmp_launch_thread + 340
5   libomp.dylib                  	       0x113c1f00c __kmp_launch_worker(void*) + 280
6   libsystem_pthread.dylib       	       0x186c0c2e4 _pthread_start + 136
7   libsystem_pthread.dylib       	       0x186c070fc thread_start + 8


Thread 6 crashed with ARM Thread State (64-bit):
    x0: 0x0000000000000001   x1: 0x000000016ca7ab30   x2: 0xffffff766371a925   x3: 0x0000000fffffc088
    x4: 0x0000000000000001   x5: 0x0000000000000000   x6: 0x000000016ca80000   x7: 0x000000016b150850
    x8: 0x0000000000000000   x9: 0x000000007fffffff  x10: 0x00000000000003e8  x11: 0xde5cd20d8d5d00c5
   x12: 0x00000000016e3600  x13: 0x0000000000124df8  x14: 0x0000000000000000  x15: 0x000000017793c004
   x16: 0x0000000113a597fc  x17: 0x00000001f8bbdf28  x18: 0x0000000000000000  x19: 0x00000001373c2840
   x20: 0x0000000003a37139  x21: 0x000000016ca7ab30  x22: 0x0000000113c56c80  x23: 0x0000000113c4c5a8
   x24: 0x0000000000000001  x25: 0x0000000000000000  x26: 0x0000000113c4c548  x27: 0x00000001373c2d88
   x28: 0x0000000113c4f1e0   fp: 0x000000016ca7aa50   lr: 0x0000000113c05520
    sp: 0x000000016ca7a9b0   pc: 0x0000000113a59828 cpsr: 0x80001000
   far: 0x0000000000000008  esr: 0x92000006 (Data Abort) byte read Translation fault

Expected behavior

It doesn't crash. :)

booxter added the bug label Feb 18, 2025

booxter commented Feb 18, 2025

Looking at more info in the Console.app crash report, we can see that multiple libomp dylibs are loaded:

       0x13311c000 -        0x1331abfff libomp.dylib (*) <ee1d3262-a116-33e7-9d1d-386c0c4399b8> /Users/USER/*/libomp.dylib
       0x133500000 -        0x133587fff libomp.dylib (*) <f53b1e01-af16-30fc-8690-f7b131eb6ce5> /Users/USER/*/libomp.dylib
       0x15bb4c000 -        0x15bbbffff libomp.dylib (*) <e3a31ab3-3ae5-3371-87d0-7fd870a41a0d> /Users/USER/*/libomp.dylib

So yeah, probably a conflict between bundled library builds.
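
One quick way to confirm which copies are mapped into a live server process (a sketch; vmmap ships with the macOS developer tools, and <server-pid> is whatever PID the llama-stack server runs under):

# List memory-mapped regions of the server process and filter for libomp copies.
vmmap <server-pid> | grep libomp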
