Chat with Llama 3.2 1B #65

Open
shubhamgupto opened this issue Jan 17, 2025 · 3 comments

@shubhamgupto

Hello,

I want to run Llama 3.2 1B on my Jetson Orin Nano using the NanoLLM interface. I have been granted access to the Llama 3.2 models, but I'm not sure how to share my HF token with the container.

HUGGINGFACE_KEY=<> \
MLC_VERSION=0.1.2 \
jetson-containers run $(autotag nano_llm) \
  python3 -m nano_llm.chat --api=mlc \
    --model meta-llama/Llama-3.2-1B

Let me know if there's a better way to do this, thanks.

@shubhamgupto
Author

I was able to first run the container and do huggingface-cli login, and now I am able to download the weights. It would be nice if it were a single command like the rest of the models.
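
For anyone else hitting this, the interactive workaround was roughly the following (a sketch; it assumes the nano_llm container drops you into a shell when no command is appended, and that huggingface-cli is available inside it):

# start the container without a command to get an interactive shell
jetson-containers run $(autotag nano_llm)

# inside the container, authenticate once with your HF access token
huggingface-cli login

# then launch the chat as usual
python3 -m nano_llm.chat --api=mlc --model meta-llama/Llama-3.2-1B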

@dusty-nv
Owner

Hi @shubhamgupto, you can set it via the HUGGINGFACE_TOKEN environment variable, as shown here: https://www.jetson-ai-lab.com/tutorial_nano-llm.html#containers
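
For reference, a minimal sketch of the command from that tutorial, passing the token through docker's standard --env flag (hf_xyz123 is a placeholder for your actual access token):

jetson-containers run --env HUGGINGFACE_TOKEN=hf_xyz123 \
  $(autotag nano_llm) \
  python3 -m nano_llm.chat --api=mlc \
    --model meta-llama/Llama-3.2-1B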

@shubhamgupto
Author

Hey @dusty-nv, this is great, thanks. Any idea why the following error occurs?

Using path "/data/models/mlc/dist/models/Llama-3.2-3B-Instruct" for model "Llama-3.2-3B-Instruct"
Target configured: cuda -keys=cuda,gpu -arch=sm_87 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Automatically using target for weight quantization: cuda -keys=cuda,gpu -arch=sm_87 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Get old param:   0%|                                           | 0/173 [00:00<?, ?tensors/s]
Start computing and quantizing weights... This may take a while.
  0%|                                                          | 0/287 [00:00<?, ?tensors/s]
Get old param:   1%|▍                                          | 1/173 [00:02<07:10,  2.50s/tensors]
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/build.py", line 47, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/build.py", line 43, in main
    core.build_model_from_args(parsed_args)
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/core.py", line 909, in build_model_from_args
    params = utils.convert_weights(mod_transform, param_manager, params, args)
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/utils.py", line 285, in convert_weights
    vm["transform_params"]()
  File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
  File "/usr/local/lib/python3.10/dist-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
  File "tvm/_ffi/_cython/./packed_func.pxi", line 56, in tvm._ffi._cy3.core.tvm_callback
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/utils.py", line 48, in inner
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/relax_model/param_manager.py", line 622, in get_item
    for torch_binname in [
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/relax_model/param_manager.py", line 623, in <listcomp>
    self.torch_pname2binname[torch_pname] for torch_pname in torch_pnames
KeyError: 'lm_head.weight'
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/NanoLLM/nano_llm/chat/__main__.py", line 32, in <module>
    model = NanoLLM.from_pretrained(
  File "/opt/NanoLLM/nano_llm/nano_llm.py", line 91, in from_pretrained
    model = MLCModel(model_path, **kwargs)
  File "/opt/NanoLLM/nano_llm/models/mlc.py", line 60, in __init__
    quant = MLCModel.quantize(self.model_path, self.config, method=quantization, max_context_len=max_context_len, **kwargs)
  File "/opt/NanoLLM/nano_llm/models/mlc.py", line 276, in quantize
    subprocess.run(cmd, executable='/bin/bash', shell=True, check=True)  
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'python3 -m mlc_llm.build --model /data/models/mlc/dist/models/Llama-3.2-3B-Instruct --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 131072 --artifact-path /data/models/mlc/dist/Llama-3.2-3B-Instruct/ctx131072 --use-safetensors ' returned non-zero exit status 1.

Is there a list of supported Llama models? I'm trying to get the latest versions.
