Chat with Llama 3.2 1B #65

Open
shubhamgupto opened this issue Jan 17, 2025 · 3 comments

@shubhamgupto

Hello,

I want to run Llama 3.2 1B on my Jetson Orin Nano using the NanoLLM interface. I have been granted access to the Llama 3.2 models, but I'm not sure how to share my HF token with the container.

HUGGINGFACE_KEY=<> \
MLC_VERSION=0.1.2 \
jetson-containers run $(autotag nano_llm) \
  python3 -m nano_llm.chat --api=mlc \
    --model meta-llama/Llama-3.2-1B

Let me know if there's a better way to do this, thanks.

@shubhamgupto
Author

I was able to first run the container and do huggingface-cli login, and now I am able to download the weights. It would be nice if it were a single command like the rest of the models.
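
For anyone else hitting this, the interactive workaround was roughly the following (a sketch; it assumes the nano_llm container drops you into a shell when no command is appended, and that huggingface-cli is available inside it):

# start the container without a command to get an interactive shell
jetson-containers run $(autotag nano_llm)

# inside the container, authenticate once with your HF access token
huggingface-cli login

# then launch the chat as usual
python3 -m nano_llm.chat --api=mlc --model meta-llama/Llama-3.2-1B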

@dusty-nv
Owner

Hi @shubhamgupto, you can set it via the HUGGINGFACE_TOKEN environment variable, as shown here: https://www.jetson-ai-lab.com/tutorial_nano-llm.html#containers
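
For reference, a minimal sketch of the command from that tutorial, passing the token through docker's standard --env flag (hf_xyz123 is a placeholder for your actual access token):

jetson-containers run --env HUGGINGFACE_TOKEN=hf_xyz123 \
  $(autotag nano_llm) \
  python3 -m nano_llm.chat --api=mlc \
    --model meta-llama/Llama-3.2-1B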

@shubhamgupto
Author

Hey @dusty-nv, this is great, thanks. Any idea why the following error occurs?

Using path "/data/models/mlc/dist/models/Llama-3.2-3B-Instruct" for model "Llama-3.2-3B-Instruct"
Target configured: cuda -keys=cuda,gpu -arch=sm_87 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Automatically using target for weight quantization: cuda -keys=cuda,gpu -arch=sm_87 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Get old param:   0%|                                           | 0/173 [00:00<?, ?tensors/s]
Start computing and quantizing weights... This may take a while.
  0%|                                                          | 0/287 [00:00<?, ?tensors/s]
Get old param:   1%|▍                                          | 1/173 [00:02<07:10,  2.50s/tensors]
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/build.py", line 47, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/build.py", line 43, in main
    core.build_model_from_args(parsed_args)
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/core.py", line 909, in build_model_from_args
    params = utils.convert_weights(mod_transform, param_manager, params, args)
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/utils.py", line 285, in convert_weights
    vm["transform_params"]()
  File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
  File "/usr/local/lib/python3.10/dist-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
  File "tvm/_ffi/_cython/./packed_func.pxi", line 56, in tvm._ffi._cy3.core.tvm_callback
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/utils.py", line 48, in inner
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/relax_model/param_manager.py", line 622, in get_item
    for torch_binname in [
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/relax_model/param_manager.py", line 623, in <listcomp>
    self.torch_pname2binname[torch_pname] for torch_pname in torch_pnames
KeyError: 'lm_head.weight'
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/NanoLLM/nano_llm/chat/__main__.py", line 32, in <module>
    model = NanoLLM.from_pretrained(
  File "/opt/NanoLLM/nano_llm/nano_llm.py", line 91, in from_pretrained
    model = MLCModel(model_path, **kwargs)
  File "/opt/NanoLLM/nano_llm/models/mlc.py", line 60, in __init__
    quant = MLCModel.quantize(self.model_path, self.config, method=quantization, max_context_len=max_context_len, **kwargs)
  File "/opt/NanoLLM/nano_llm/models/mlc.py", line 276, in quantize
    subprocess.run(cmd, executable='/bin/bash', shell=True, check=True)  
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'python3 -m mlc_llm.build --model /data/models/mlc/dist/models/Llama-3.2-3B-Instruct --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 131072 --artifact-path /data/models/mlc/dist/Llama-3.2-3B-Instruct/ctx131072 --use-safetensors ' returned non-zero exit status 1.

Is there a list of supported Llama models? I'm trying to get the latest versions.
