Bug: llamafile won't load user defined system prompt file #698

Open
stonez56 opened this issue Feb 18, 2025 · 0 comments

Contact Details

[email protected]

What happened?

I would like to load my own system prompt file when starting llamafile.
I have tried both -spf FNAME and --system-prompt-file FNAME; neither works.

How did I verify?
I simply ask the assistant its name. With my prompt file loaded, it should reply "My name is Lisa".
After llamafile starts, it always says its name is Nova.
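
The same check could be scripted against the running server instead of typed into the chat prompt. The following is a minimal sketch, not a confirmed reproduction: it assumes the server at http://127.0.0.1:8080 exposes an OpenAI-style /v1/chat/completions endpoint, that the "model" value is a placeholder the server does not validate, and that a system prompt loaded at startup would apply to requests on this endpoint. The helper names build_request, mentions_name, and ask are hypothetical.

```python
import json
import urllib.request


def build_request(question):
    """Build a chat payload that contains only a user message.

    No system message is sent, so any persona in the reply has to come
    from the server side (for example, a prompt file given at startup).
    """
    return {
        "model": "LLaMA_CPP",  # placeholder; assumed not to be validated
        "messages": [{"role": "user", "content": question}],
    }


def mentions_name(reply, name):
    """Case-insensitive check that the reply contains the expected name."""
    return name.lower() in reply.lower()


def ask(base_url, payload, timeout=120):
    """POST the payload and return the assistant's reply text.

    Assumes an OpenAI-style response: choices[0].message.content.
    """
    request = urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `mentions_name(ask("http://127.0.0.1:8080", build_request("what's your name?")), "Lisa")` would be expected to return True only if the custom prompt took effect. The long default timeout is a guess to allow for slow generation on a Raspberry Pi.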

Here is the simple command I issued:
pi@raspberrypi:~/Downloads/llamafile $ ./Llama-3.2-1B-Instruct.Q6_K.llamafile --system-prompt-file ./myPrompt.json --verbose

Here is the content of my prompt file (myPrompt.json):
{ "system_prompt": { "prompt": "Your name is Lisa. You are a helpful, kind, and honest assistant. You provide short, concise, and useful answers. *Before answering, carefully consider the question and the available information. Ensure your response is accurate and well-reasoned.* If you are unsure of the answer or do not know the answer, say 'I do not know' or 'I am unsure.' Do not fabricate or invent information.", "anti_prompt": "User:", "assistant_name": "Lisa:" } }
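
To rule out a malformed file as the cause, a quick check can confirm that the file parses as JSON and contains the three fields shown above. This is a minimal sketch: the function name check_prompt_file is hypothetical, and the expected key layout is taken only from the file in this report, not from a confirmed llamafile schema.

```python
import json

# Keys expected inside the "system_prompt" object, as used in this report.
EXPECTED_KEYS = {"prompt", "anti_prompt", "assistant_name"}


def check_prompt_file(text):
    """Return a list of problems found in the prompt file text (empty if none)."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top level is not an object"]
    section = data.get("system_prompt")
    if not isinstance(section, dict):
        return ["missing 'system_prompt' object"]
    missing = sorted(EXPECTED_KEYS - section.keys())
    return [f"missing key: {key}" for key in missing]
```

For example, `check_prompt_file(open("myPrompt.json", encoding="utf-8").read())` returns an empty list when the file is well-formed in this sense. An empty list only shows that the file is readable JSON with these keys; it says nothing about whether llamafile actually applies it.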

Version

llamafile 0.9.0

What operating system are you seeing the problem on?

Linux

Relevant log output

██╗     ██╗      █████╗ ███╗   ███╗ █████╗ ███████╗██╗██╗     ███████╗
██║     ██║     ██╔══██╗████╗ ████║██╔══██╗██╔════╝██║██║     ██╔════╝
██║     ██║     ███████║██╔████╔██║███████║█████╗  ██║██║     █████╗
██║     ██║     ██╔══██║██║╚██╔╝██║██╔══██║██╔══╝  ██║██║     ██╔══╝
███████╗███████╗██║  ██║██║ ╚═╝ ██║██║  ██║██║     ██║███████╗███████╗
╚══════╝╚══════╝╚═╝  ╚═╝╚═╝     ╚═╝╚═╝  ╚═╝╚═╝     ╚═╝╚══════╝╚══════╝
note: if you have an AMD or NVIDIA GPU then you need to pass -ngl 9999 to enable GPU offloading
llama_model_loader: loaded meta data with 28 key-value pairs and 147 tensors from Llama-3.2-1B-Instruct.Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                         general.size_label str              = 1.2B
llama_model_loader: - kv   3:                            general.license str              = llama3.2
llama_model_loader: - kv   4:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   5:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   6:                          llama.block_count u32              = 16
llama_model_loader: - kv   7:                       llama.context_length u32              = 131072
llama_model_loader: - kv   8:                     llama.embedding_length u32              = 2048
llama_model_loader: - kv   9:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  10:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  11:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  12:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  13:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  14:                 llama.attention.key_length u32              = 64
llama_model_loader: - kv  15:               llama.attention.value_length u32              = 64
llama_model_loader: - kv  16:                          general.file_type u32              = 18
llama_model_loader: - kv  17:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  18:                 llama.rope.dimension_count u32              = 64
llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  22:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  23:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  24:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  25:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  26:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  27:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   34 tensors
llama_model_loader: - type q6_K:  113 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_layer          = 16
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q6_K
llm_load_print_meta: model params     = 1.24 B
llm_load_print_meta: model size       = 967.00 MiB (6.56 BPW)
llm_load_print_meta: general.name     = n/a
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.08 MiB
llm_load_tensors:        CPU buffer size =   967.00 MiB
.............................................................
INFO [              server_cli] build info | build=1500 commit="a30b324" tid="546162797472" timestamp=1739868432
INFO [              server_cli] system info | n_threads=4 n_threads_batch=4 system_info="AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="546162797472" timestamp=1739868432 total_threads=4
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
llama_new_context_with_model:        CPU compute buffer size =   544.01 MiB
llama_new_context_with_model: graph nodes  = 518
llama_new_context_with_model: graph splits = 1
INFO [              initialize] initializing slots | n_slots=1 tid="546162797472" timestamp=1739868432
INFO [              initialize] new slot | n_ctx_slot=8192 slot_id=0 tid="546162797472" timestamp=1739868432
INFO [              server_cli] model loaded | tid="546162797472" timestamp=1739868432

llama server listening at http://127.0.0.1:8080

software: llamafile 0.9.0
model:    Llama-3.2-1B-Instruct.Q6_K.gguf
INFO [              server_cli] HTTP server listening | hostname="127.0.0.1" port="8080" tid="546162797472" timestamp=1739868432 url_prefix=""
compute:  Raspberry Pi 4 Model B Rev 1.2
server:   http://127.0.0.1:8080/

llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 256
llama_new_context_with_model: n_ubatch   = 256
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
VERB [              start_loop] new task may arrive | tid="546162797472" timestamp=1739868432
VERB [              start_loop] callback_all_task_finished | tid="546162797472" timestamp=1739868432
INFO [            update_slots] updating system prompt | tid="546162797472" timestamp=1739868432
system prompt updated
VERB [              start_loop] wait for new task | tid="546162797472" timestamp=1739868433
llama_kv_cache_init:        CPU KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
llama_new_context_with_model:        CPU compute buffer size =   272.00 MiB
llama_new_context_with_model: graph nodes  = 518
llama_new_context_with_model: graph splits = 1
A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.
>>> what's your name?
Nice to meet you! My name is Nova, and I'm an AI assistant here to help answer any questions you may have. I'm a large language model, which means I've been trained on a vast amount of text data to provide accurate and helpful responses. I'm here to assist you with any topics you'd like to discuss, from science and history to entertainment and culture. How can I help you today?