llama-stack server segfaults when post_training is triggered for torchtune #1141

booxter opened this issue Feb 18, 2025 · 1 comment
Labels: bug

booxter commented Feb 18, 2025

System Info

It's a MacBook M3 Pro, hence no CUDA.

Information

  • The official example scripts
  • My own modified scripts

🐛 Describe the bug

When I trigger training, the llama-stack server crashes. I've found that a workaround is setting OMP_NUM_THREADS=1 in the server environment.

This seems to be a known issue in PyTorch (pytorch/pytorch#121101) and probably stems from the fact that PyTorch bundles libomp (as some other projects do), so there may be a conflict between different builds of the library. (This is just a theory, not a fact.)

At the very least, we should probably document the workaround, though maybe there's a better solution.
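
For reference, the workaround in shell form (a sketch; the yaml config path and port below are placeholders, and the exact invocation depends on how you start the server):

# Pin OpenMP to a single thread for the server process to avoid the crash.
# run.yaml and 5001 are placeholders; substitute whatever you normally pass.
OMP_NUM_THREADS=1 python -m llama_stack.distribution.server.server \
    --yaml-config run.yaml --port 5001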


The script I use (its training part):

from llama_stack_client.types import (
    post_training_supervised_fine_tune_params,
)

training_config = post_training_supervised_fine_tune_params.TrainingConfig(
    data_config=post_training_supervised_fine_tune_params.TrainingConfigDataConfig(
        batch_size=1,
        data_format="instruct",
        #data_format="dialog",
        dataset_id=simpleqa_dataset_id,
        shuffle=True,
    ),
    gradient_accumulation_steps=1,
    max_steps_per_epoch=1,
    max_validation_steps=1,
    n_epochs=1,
    optimizer_config=post_training_supervised_fine_tune_params.TrainingConfigOptimizerConfig(
        lr=2e-5,
        num_warmup_steps=1,
        optimizer_type="adam",
        weight_decay=0.01,
    ),
)

from llama_stack_client.types import (
    algorithm_config_param,
)

algorithm_config = algorithm_config_param.LoraFinetuningConfig(
    alpha=1,
    apply_lora_to_mlp=True,
    # Invalid value: apply_lora_to_output is currently not supporting in llama3.2 1b and 3b,as the projection layer weights are tied to the embeddings
    apply_lora_to_output=False,
    lora_attn_modules=['q_proj'], # todo?
    rank=1,
    type="LoRA",
)

response = client.post_training.supervised_fine_tune(
    job_uuid=job_uuid,
    logger_config={},
    model=training_model,
    hyperparam_search_config={},
    training_config=training_config,
    algorithm_config=algorithm_config,
    checkpoint_dir="null",  # API claims it's not needed, but the server returns 400 if it's not passed.
)

Error logs

On server console:

17:27:42.573 [START] /v1/post-training/supervised-fine-tune
DEBUG 2025-02-18 12:27:42,573 torchtune.utils._logging:60: Setting manual seed to local seed 429841145. Local seed is seed + rank = 429841145 + 0
INFO 2025-02-18 12:27:42,616 torchtune.utils._logging:64: Identified model_type = Llama3_2. Ignoring output.weight in checkpoint in favor of the tok_embedding.weight tied weights.
/Users/ihrachys/src/llama-stack/llama_stack/distribution/start_venv.sh: line 71: 86644 Segmentation fault: 11  python -m llama_stack.distribution.server.server --yaml-config "$yaml_config" --port "$port" $env_vars $other_args
++ error_handler 71
++ echo 'Error occurred in script at line: 71'
Error occurred in script at line: 71
++ exit 1

From the Console.app stack trace for the process:

Crashed Thread:        6

Exception Type:        EXC_BAD_ACCESS (SIGSEGV)
Exception Codes:       KERN_INVALID_ADDRESS at 0x0000000000000008
Exception Codes:       0x0000000000000001, 0x0000000000000008

Termination Reason:    Namespace SIGNAL, Code 11 Segmentation fault: 11
Terminating Process:   exc handler [86644]

[...]

Thread 6 Crashed:
0   libomp.dylib                  	       0x113a59828 void __kmp_suspend_64<false, true>(int, kmp_flag_64<false, true>*) + 44
1   libomp.dylib                  	       0x113c05520 kmp_flag_64<false, true>::wait(kmp_info*, int, void*) + 1880
2   libomp.dylib                  	       0x113c00560 __kmp_hyper_barrier_release(barrier_type, kmp_info*, int, int, int, void*) + 184
3   libomp.dylib                  	       0x113c040e8 __kmp_fork_barrier(int, int) + 628
4   libomp.dylib                  	       0x113be0e14 __kmp_launch_thread + 340
5   libomp.dylib                  	       0x113c1f00c __kmp_launch_worker(void*) + 280
6   libsystem_pthread.dylib       	       0x186c0c2e4 _pthread_start + 136
7   libsystem_pthread.dylib       	       0x186c070fc thread_start + 8


Thread 6 crashed with ARM Thread State (64-bit):
    x0: 0x0000000000000001   x1: 0x000000016ca7ab30   x2: 0xffffff766371a925   x3: 0x0000000fffffc088
    x4: 0x0000000000000001   x5: 0x0000000000000000   x6: 0x000000016ca80000   x7: 0x000000016b150850
    x8: 0x0000000000000000   x9: 0x000000007fffffff  x10: 0x00000000000003e8  x11: 0xde5cd20d8d5d00c5
   x12: 0x00000000016e3600  x13: 0x0000000000124df8  x14: 0x0000000000000000  x15: 0x000000017793c004
   x16: 0x0000000113a597fc  x17: 0x00000001f8bbdf28  x18: 0x0000000000000000  x19: 0x00000001373c2840
   x20: 0x0000000003a37139  x21: 0x000000016ca7ab30  x22: 0x0000000113c56c80  x23: 0x0000000113c4c5a8
   x24: 0x0000000000000001  x25: 0x0000000000000000  x26: 0x0000000113c4c548  x27: 0x00000001373c2d88
   x28: 0x0000000113c4f1e0   fp: 0x000000016ca7aa50   lr: 0x0000000113c05520
    sp: 0x000000016ca7a9b0   pc: 0x0000000113a59828 cpsr: 0x80001000
   far: 0x0000000000000008  esr: 0x92000006 (Data Abort) byte read Translation fault

Expected behavior

It doesn't crash. :)

booxter added the bug label Feb 18, 2025

booxter commented Feb 18, 2025

Looking at more info in the Console.app crash report, we can see that multiple libomp dylibs are loaded:

       0x13311c000 -        0x1331abfff libomp.dylib (*) <ee1d3262-a116-33e7-9d1d-386c0c4399b8> /Users/USER/*/libomp.dylib
       0x133500000 -        0x133587fff libomp.dylib (*) <f53b1e01-af16-30fc-8690-f7b131eb6ce5> /Users/USER/*/libomp.dylib
       0x15bb4c000 -        0x15bbbffff libomp.dylib (*) <e3a31ab3-3ae5-3371-87d0-7fd870a41a0d> /Users/USER/*/libomp.dylib

So yeah, probably a conflict between bundled library builds.
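
One quick way to confirm which copies are mapped into a live server process (a sketch; vmmap ships with the macOS developer tools, and <server-pid> is whatever PID the llama-stack server runs under):

# List memory-mapped regions of the server process and filter for libomp copies.
vmmap <server-pid> | grep libomp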
