🐛 Describe the bug
When I trigger training, the llama-stack server crashes. The workaround I've found is setting OMP_NUM_THREADS=1 in the server environment.
This seems to be a known PyTorch issue (pytorch/pytorch#121101) and probably stems from the fact that PyTorch bundles its own libomp (as some other projects do), so two different copies of the library can end up loaded in the same process and conflict. (This is just a theory, not an established fact.)
At the very least, we should probably document the workaround, though maybe there's a better solution.
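For reference, a minimal sketch of the workaround, assuming the server is started from a shell (the exact launch command depends on the setup; here it goes through start_venv.sh):
# workaround: pin OpenMP to a single thread in the environment of the llama-stack server process
export OMP_NUM_THREADS=1
# ...then start the llama-stack server as usual from this shell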
The script I use (its training part):
from llama_stack_client.types import (
    post_training_supervised_fine_tune_params,
)

# client, job_uuid, training_model and simpleqa_dataset_id are set up earlier in the script.
training_config = post_training_supervised_fine_tune_params.TrainingConfig(
    data_config=post_training_supervised_fine_tune_params.TrainingConfigDataConfig(
        batch_size=1,
        data_format="instruct",
        # data_format="dialog",
        dataset_id=simpleqa_dataset_id,
        shuffle=True,
    ),
    gradient_accumulation_steps=1,
    max_steps_per_epoch=1,
    max_validation_steps=1,
    n_epochs=1,
    optimizer_config=post_training_supervised_fine_tune_params.TrainingConfigOptimizerConfig(
        lr=2e-5,
        num_warmup_steps=1,
        optimizer_type="adam",
        weight_decay=0.01,
    ),
)

from llama_stack_client.types import (
    algorithm_config_param,
)

algorithm_config = algorithm_config_param.LoraFinetuningConfig(
    alpha=1,
    apply_lora_to_mlp=True,
    # Invalid value: apply_lora_to_output is currently not supported in llama3.2 1b and 3b,
    # as the projection layer weights are tied to the embeddings.
    apply_lora_to_output=False,
    lora_attn_modules=['q_proj'],  # todo?
    rank=1,
    type="LoRA",
)

response = client.post_training.supervised_fine_tune(
    job_uuid=job_uuid,
    logger_config={},
    model=training_model,
    hyperparam_search_config={},
    training_config=training_config,
    algorithm_config=algorithm_config,
    checkpoint_dir="null",  # API claims it's not needed, but the server returns 400 if it's not passed.
)
Error logs
On server console:
17:27:42.573 [START] /v1/post-training/supervised-fine-tune
DEBUG 2025-02-18 12:27:42,573 torchtune.utils._logging:60: Setting manual seed to local seed 429841145. Local seed is seed + rank = 429841145 + 0
INFO 2025-02-18 12:27:42,616 torchtune.utils._logging:64: Identified model_type = Llama3_2. Ignoring output.weight in checkpoint in favor of the tok_embedding.weight tied weights.
/Users/ihrachys/src/llama-stack/llama_stack/distribution/start_venv.sh: line 71: 86644 Segmentation fault: 11 python -m llama_stack.distribution.server.server --yaml-config "$yaml_config" --port "$port" $env_vars $other_args
++ error_handler 71
++ echo 'Error occurred in script at line: 71'
Error occurred in script at line: 71
++ exit 1
System Info
It's a MacBook M3 Pro, hence no CUDA.
From Console.app, there is also a stack trace for the crashed process.
Expected behavior
It doesn't crash. :)