can't run gguf which download from ollama #1137

Open
Sherlock-Holo opened this issue Feb 13, 2025 · 4 comments

Comments

@Sherlock-Holo

I want to use mistral.rs to run a GGUF model. I have downloaded DeepSeek R1 14B with Ollama, which stores the GGUF file as /var/lib/ollama/blobs/sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e
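Ollama's blob directory holds several kinds of files under opaque sha256 names, so it can help to confirm which one is the model. A minimal sketch, assuming only that a GGUF file starts with the four magic bytes `GGUF`; the function name `find_gguf_blobs` is made up for illustration:

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # the first four bytes of a GGUF file


def find_gguf_blobs(blob_dir):
    """Return the files in blob_dir whose first four bytes are the GGUF magic."""
    matches = []
    for path in sorted(Path(blob_dir).iterdir()):
        if not path.is_file():
            continue
        with path.open("rb") as fh:
            if fh.read(4) == GGUF_MAGIC:
                matches.append(path)
    return matches
```

Calling `find_gguf_blobs("/var/lib/ollama/blobs")` would return the candidate model files; the other blobs (templates, parameters, licenses) do not carry the magic.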

Then I cloned https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B without LFS. Since I already have the model file, I assumed I only need the other files in the repo.

Then I ran ~/git/mistral.rs/target/release/mistralrs-server -i gguf -m . -f /var/lib/ollama/blobs/sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e

The log is:

2025-02-13T11:20:15.283182Z  INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: true
2025-02-13T11:20:15.283206Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2025-02-13T11:20:15.283210Z  INFO mistralrs_server: Using flash attention.
2025-02-13T11:20:15.283230Z  WARN mistralrs_server: Using flash attention with a quantized model has no effect!
2025-02-13T11:20:15.283238Z  INFO mistralrs_server: Model kind is: gguf quantized from gguf (no adapters)
2025-02-13T11:20:15.283265Z  INFO candle_hf_hub: Token file not found "/home/sherlock/.cache/huggingface/token"
2025-02-13T11:20:15.283278Z  INFO mistralrs_core::utils::tokens: Could not load token at "/home/sherlock/.cache/huggingface/token", using no HF token.
2025-02-13T11:20:15.283332Z  INFO candle_hf_hub: Token file not found "/home/sherlock/.cache/huggingface/token"
2025-02-13T11:20:15.283338Z  INFO mistralrs_core::utils::tokens: Could not load token at "/home/sherlock/.cache/huggingface/token", using no HF token.
2025-02-13T11:20:15.283348Z  INFO mistralrs_core::pipeline::paths: Loading `/var/lib/ollama/blobs/sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e` locally at `/var/lib/ollama/blobs/sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e`
2025-02-13T11:20:15.283384Z  INFO mistralrs_core::pipeline::gguf: Loading `generation_config.json` at `.`
2025-02-13T11:20:15.283390Z  INFO mistralrs_core::pipeline::gguf: Loading `generation_config.json` locally at `./generation_config.json`
2025-02-13T11:20:15.283423Z  INFO mistralrs_core::pipeline::gguf: Prompt chunk size is 512.
2025-02-13T11:20:15.675548Z  INFO mistralrs_core::gguf::content: Model config:
general.architecture: qwen2
general.basename: DeepSeek-R1-Distill-Qwen
general.file_type: 15
general.name: DeepSeek R1 Distill Qwen 14B
general.quantization_version: 2
general.size_label: 14B
general.type: model
qwen2.attention.head_count: 40
qwen2.attention.head_count_kv: 8
qwen2.attention.layer_norm_rms_epsilon: 0.00001
qwen2.block_count: 48
qwen2.context_length: 131072
qwen2.embedding_length: 5120
qwen2.feed_forward_length: 13824
qwen2.rope.freq_base: 1000000
2025-02-13T11:20:15.686999Z  INFO mistralrs_core::utils::normal: Detected minimum CUDA compute capability 8.9
2025-02-13T11:20:15.751448Z  INFO mistralrs_core::utils::normal: DType selected is BF16.
Error: Cannot find tensor info for blk.0.attn_output.bias

What causes the error "Cannot find tensor info for blk.0.attn_output.bias"?
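The error message suggests the loader asked for a tensor name that is not in the file's tensor table. One way to check is to list the names stored in the file. The sketch below reads the header as laid out in GGUF v2/v3 (magic, version, tensor count, metadata count, metadata entries, then tensor infos); the helper names are hypothetical and this is not how mistral.rs itself reads the file:

```python
import struct

# GGUF metadata value type id -> struct format for fixed-size scalars
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _read(fh, fmt):
    """Read one little-endian scalar described by a struct format."""
    return struct.unpack(fmt, fh.read(struct.calcsize(fmt)))[0]


def _read_string(fh):
    """GGUF strings are a uint64 byte length followed by UTF-8 bytes."""
    length = _read(fh, "<Q")
    return fh.read(length).decode("utf-8", errors="replace")


def _read_value(fh, vtype):
    """Read (and thereby skip past) one metadata value of the given type."""
    if vtype in _SCALARS:
        return _read(fh, _SCALARS[vtype])
    if vtype == _STRING:
        return _read_string(fh)
    if vtype == _ARRAY:
        item_type = _read(fh, "<I")
        count = _read(fh, "<Q")
        return [_read_value(fh, item_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")


def list_tensor_names(path):
    """Return the tensor names recorded in a GGUF v2/v3 header."""
    with open(path, "rb") as fh:
        if fh.read(4) != b"GGUF":
            raise ValueError("not a GGUF file")
        version = _read(fh, "<I")
        if version < 2:
            raise ValueError("GGUF v1 uses a different header layout")
        n_tensors = _read(fh, "<Q")
        n_kv = _read(fh, "<Q")
        for _ in range(n_kv):
            _read_string(fh)                  # metadata key
            _read_value(fh, _read(fh, "<I"))  # value type, then value
        names = []
        for _ in range(n_tensors):
            names.append(_read_string(fh))
            n_dims = _read(fh, "<I")
            fh.read(8 * n_dims)               # dimensions, uint64 each
            fh.read(4 + 8)                    # ggml type (u32) + data offset (u64)
        return names
```

Evaluating `"blk.0.attn_output.bias" in list_tensor_names(path)` on the blob would show whether the tensor the loader wants is actually present.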

@Sherlock-Holo
Author

I am running c9ac321 with a small patch:

From ff7f52bf4ab3ed1b67bbc7fe240554fd1e792929 Mon Sep 17 00:00:00 2001
From: Sherlock Holo <[email protected]>
Date: Thu, 13 Feb 2025 15:56:04 +0800
Subject: [PATCH] fix cuda 12.8

---
 Cargo.lock                | 4 ++--
 Cargo.toml                | 4 ++++
 mistralrs-core/Cargo.toml | 1 +
 3 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/Cargo.lock b/Cargo.lock
index bfa84a5b5..1013af2ee 100644
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -811,8 +811,7 @@ dependencies = [
 [[package]]
 name = "cudarc"
 version = "0.13.4"
-source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "3b68d7c284d40d96a4251330ab583c2718b412f4fc53239d295b3a1f8735f426"
+source = "git+https://github.com/wcork/cudarc?branch=feat-cuda12080#a749c6c1bd1bd6d08c0d03cadb63b34585f6c7cc"
 dependencies = [
  "half",
  "libloading",
@@ -2388,6 +2387,7 @@ dependencies = [
  "chrono",
  "clap",
  "csv",
+ "cudarc",
  "derive-new",
  "derive_more",
  "dirs",
diff --git a/Cargo.toml b/Cargo.toml
index 188f2700f..de2e47382 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -11,6 +11,9 @@ members = [
 ]
 resolver = "2"

+[patch.crates-io]
+cudarc = { git = "https://github.com/wcork/cudarc", branch = "feat-cuda12080" }
+
 [workspace.package]
 version = "0.4.0"
 edition = "2021"
@@ -23,6 +26,7 @@ license = "MIT"
 rust-version = "1.82"

 [workspace.dependencies]
+cudarc = {version="0.13",features=["cuda-12080"]}
 anyhow = "1.0.80"
 candle-core = { git = "https://github.com/EricLBuehler/candle.git", version = "0.8.0", rev = "fb5cc8c" }
 candle-nn = { git = "https://github.com/EricLBuehler/candle.git", version = "0.8.0", rev = "fb5cc8c" }
diff --git a/mistralrs-core/Cargo.toml b/mistralrs-core/Cargo.toml
index 91ded26ef..861e88e75 100644
--- a/mistralrs-core/Cargo.toml
+++ b/mistralrs-core/Cargo.toml
@@ -12,6 +12,7 @@ license.workspace = true
 homepage.workspace = true

 [dependencies]
+cudarc.workspace = true
 anyhow.workspace = true
 candle-core.workspace = true
 candle-nn.workspace = true
--
2.48.1

The patch works around a CUDA problem: cudarc does not support CUDA 12.8.

@Sherlock-Holo
Author

llama-cli -ngl 99 -m sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e -cnv

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
register_backend: registered backend CUDA (1 devices)
register_device: registered device CUDA0 (NVIDIA GeForce RTX 4080)
register_backend: registered backend BLAS (1 devices)
register_device: registered device BLAS (OpenBLAS)
register_backend: registered backend RPC (0 devices)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (AMD Ryzen 9 5900X 12-Core Processor)
build: 4702 (a394039db) with cc (GCC) 14.2.1 20250207 for x86_64-pc-linux-gnu (debug)
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4080) - 15454 MiB free
llama_model_loader: loaded meta data with 26 key-value pairs and 579 tensors from sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Qwen 14B
llama_model_loader: - kv   3:                           general.basename str              = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv   4:                         general.size_label str              = 14B
llama_model_loader: - kv   5:                          qwen2.block_count u32              = 48
llama_model_loader: - kv   6:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv   7:                     qwen2.embedding_length u32              = 5120
llama_model_loader: - kv   8:                  qwen2.feed_forward_length u32              = 13824
llama_model_loader: - kv   9:                 qwen2.attention.head_count u32              = 40
llama_model_loader: - kv  10:              qwen2.attention.head_count_kv u32              = 8
llama_model_loader: - kv  11:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  12:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  13:                          general.file_type u32              = 15
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  18:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 151646
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  241 tensors
llama_model_loader: - type q4_K:  289 tensors
llama_model_loader: - type q6_K:   49 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 8.37 GiB (4.87 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 5120
print_info: n_layer          = 48
print_info: n_head           = 40
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 5
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 13824
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 14B
print_info: model params     = 14.77 B
print_info: general.name     = DeepSeek R1 Distill Qwen 14B
print_info: vocab type       = BPE
print_info: n_vocab          = 152064
print_info: n_merges         = 151387
print_info: BOS token        = 151646 '<|begin▁of▁sentence|>'
print_info: EOS token        = 151643 '<|end▁of▁sentence|>'
print_info: EOT token        = 151643 '<|end▁of▁sentence|>'
print_info: PAD token        = 151643 '<|end▁of▁sentence|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|end▁of▁sentence|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors:        CUDA0 model buffer size =  8148.38 MiB
load_tensors:   CPU_Mapped model buffer size =   417.66 MiB
..........................................................................................
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 4096
llama_init_from_model: n_ctx_per_seq = 4096
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 1000000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   768.00 MiB
llama_init_from_model: KV self size  =  768.00 MiB, K (f16):  384.00 MiB, V (f16):  384.00 MiB
llama_init_from_model:  CUDA_Host  output buffer size =     0.58 MiB
llama_init_from_model:      CUDA0 compute buffer size =   368.00 MiB
llama_init_from_model:  CUDA_Host compute buffer size =    18.01 MiB
llama_init_from_model: graph nodes  = 1686
llama_init_from_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 12
main: chat template example:
You are a helpful assistant

<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>

system_info: n_threads = 12 (n_threads_batch = 12) / 24 | CUDA : ARCHS = 500,600,700,800,900,1000,1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

main: interactive mode on.
sampler seed: 1528736906
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
 - Using default system message. To change it, set a different value via -p PROMPT or -f FILE argument.

You are a helpful assistant


>
llama_perf_sampler_print:    sampling time =       0.00 ms /     7 runs   (    0.00 ms per token, 3500000.00 tokens per second)
llama_perf_context_print:        load time =    1468.82 ms
llama_perf_context_print: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =    1039.66 ms /     2 tokens
Interrupted by user

llama.cpp can run the GGUF file, so I think the model file itself is fine.
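The tensor counts in the llama-cli log (579 tensors, 241 of them f32) can be cross-checked against the layer count. The sketch below assumes the usual llama.cpp tensor naming for Qwen2, with biases on the q/k/v projections only, and assumes that norm weights and biases are the tensors kept in f32; both lists are assumptions, not something read from this file:

```python
N_LAYERS = 48  # qwen2.block_count from the log

# Assumed per-layer tensors (llama.cpp naming for Qwen2, no attn_output.bias)
PER_LAYER = [
    "attn_norm.weight",
    "attn_q.weight", "attn_q.bias",
    "attn_k.weight", "attn_k.bias",
    "attn_v.weight", "attn_v.bias",
    "attn_output.weight",
    "ffn_norm.weight",
    "ffn_gate.weight", "ffn_up.weight", "ffn_down.weight",
]
# Assumed tensors that exist once per model
GLOBAL = ["token_embd.weight", "output_norm.weight", "output.weight"]


def expected_tensor_names(n_layers, per_layer=PER_LAYER):
    """Build the full tensor name list for a model with n_layers blocks."""
    names = list(GLOBAL)
    for i in range(n_layers):
        names += [f"blk.{i}.{t}" for t in per_layer]
    return names


names = expected_tensor_names(N_LAYERS)
total = len(names)  # 48 * 12 + 3 = 579
f32 = sum(n.endswith(("norm.weight", ".bias")) for n in names)  # 48 * 5 + 1 = 241

# If every layer also had an attn_output.bias, the total would be 48 higher.
with_output_bias = len(expected_tensor_names(N_LAYERS, PER_LAYER + ["attn_output.bias"]))  # 627

print(total, f32, with_output_bias)
```

Under these assumptions both numbers match the log, which is consistent with the file simply not containing any `blk.N.attn_output.bias` tensor rather than being damaged.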

@EricLBuehler
Owner

@Sherlock-Holo I pushed 87a7c23, which fixes the loading. It looks like the chat template for these models in the GGUF file might be incorrect, as it does not match the official one from Qwen. I'm not really sure what a good workaround is, other than creating your own or using ISQ.
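For anyone writing their own template, the prompt shape can be read off the "chat template example" that llama-cli printed earlier in this thread. The sketch below only reproduces that printed string in plain Python; it is not the official template, it is not a Jinja template that mistral.rs could load directly, and it leaves out the BOS token. The function name `render_prompt` is made up for illustration:

```python
EOS = "<|end▁of▁sentence|>"  # copied from the llama-cli log


def render_prompt(messages, add_generation_prompt=True):
    """Render chat messages in the shape of llama-cli's printed example."""
    parts = []
    for message in messages:
        role, content = message["role"], message["content"]
        if role == "system":
            # The printed example has a blank line after the system prompt.
            parts.append(content + "\n\n")
        elif role == "user":
            parts.append("<|User|>" + content)
        elif role == "assistant":
            parts.append("<|Assistant|>" + content + EOS)
        else:
            raise ValueError(f"unsupported role: {role}")
    if add_generation_prompt:
        # Leave the prompt open for the model's next reply.
        parts.append("<|Assistant|>")
    return "".join(parts)


example = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there"},
    {"role": "user", "content": "How are you?"},
]
print(render_prompt(example))
```

Comparing the rendered string against what a custom template produces is a quick way to spot a missing role tag or end-of-sentence token.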

@Sherlock-Holo
Author

I tried 87a7c23 and can confirm that it fixes the problem.
