[Lunar Lake] UR_RESULT_ERROR_DEVICE_LOST #780

rpolyano · 2025-02-04T13:07:15Z

Describe the bug

Trying to load the openbmb/MiniCPM-o-2_6 model results in

Native API failed. Native API returns: 20 (UR_RESULT_ERROR_DEVICE_LOST)
  File "...py/nightingale/server.py", line 49, in __init__
    self.model = self.model.to(self._device)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...py/nightingale/server_test.py", line 10, in <module>
    service = MiniCPMService()
              ^^^^^^^^^^^^^^^^
RuntimeError: Native API failed. Native API returns: 20 (UR_RESULT_ERROR_DEVICE_LOST)

If I call .eval() on the model before .to() (the commented-out line in the snippet below), my entire desktop session crashes and I am sent back to the login screen.

I also tried this in the docker.io/intel/intel-extension-for-pytorch:2.5.10-xpu Docker container, with the same result.

Full code snippet:

import enum
from io import BytesIO
from typing import NewType, TypeAlias, TypeVar
from grpc import ServicerContext
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer, AutoModel
from PIL import Image


def _select_device() -> torch.device:
    if torch.cuda.is_available():
        return torch.device("cuda:0")
    if torch.xpu.is_available():
        return torch.device("xpu")
    if torch.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

class MiniCPMService:

    def __init__(self) -> None:
        super().__init__()

        self._device = _select_device()
        self._torch_dtype = (
            torch.bfloat16 if self._device.type != "cpu" else torch.float16
        )

        print(f'Running on {self._device} with dtype {self._torch_dtype}')

        self.model = AutoModel.from_pretrained(
            'openbmb/MiniCPM-o-2_6',
            trust_remote_code=True,
            attn_implementation='sdpa', # sdpa or flash_attention_2
            torch_dtype=torch.bfloat16,
            init_vision=True,
            init_audio=False,
            init_tts=False
        )


        self.model = self.model.to(self._device)
        # self.model = self.model.eval().to(self._device) # This eval() results in a full system crash
        self.tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)

Versions

Traceback (most recent call last):
  File ".../collect_env.py", line 19, in <module>
    import intel_extension_for_pytorch as ipex
  File "/home/roman/.local/share/virtualenvs/nightingale-uqI8m8sk/lib/python3.12/site-packages/intel_extension_for_pytorch/__init__.py", line 147, in <module>
    from . import _dynamo
  File "/home/roman/.local/share/virtualenvs/nightingale-uqI8m8sk/lib/python3.12/site-packages/intel_extension_for_pytorch/_dynamo/__init__.py", line 4, in <module>
    from torch._inductor.compile_fx import compile_fx
  File "/home/roman/.local/share/virtualenvs/nightingale-uqI8m8sk/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 49, in <module>
    from torch._inductor.debug import save_args_for_compile_fx_inner
  File "/home/roman/.local/share/virtualenvs/nightingale-uqI8m8sk/lib/python3.12/site-packages/torch/_inductor/debug.py", line 26, in <module>
    from . import config, ir  # noqa: F811, this is needed
    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/roman/.local/share/virtualenvs/nightingale-uqI8m8sk/lib/python3.12/site-packages/torch/_inductor/ir.py", line 77, in <module>
    from .runtime.hints import ReductionHint
  File "/home/roman/.local/share/virtualenvs/nightingale-uqI8m8sk/lib/python3.12/site-packages/torch/_inductor/runtime/hints.py", line 36, in <module>
    attr_desc_fields = {f.name for f in fields(AttrsDescriptor)}
                                        ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/roman/.pyenv/versions/3.12.8/lib/python3.12/dataclasses.py", line 1289, in fields
    raise TypeError('must be called with a dataclass type or instance') from None
TypeError: must be called with a dataclass type or instance


louie-tsai · 2025-02-14T23:35:05Z

@rpolyano
Sorry for the late response.
Since you hit a device-lost error, could you use xpu-smi to check whether the right devices are visible inside the Docker container?
The instructions are here:
https://intel.github.io/xpumanager/smi_user_guide.html#discover-the-devices-in-this-machine
Thanks,
Louie
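
As a complement to xpu-smi, the same question can be asked from inside Python. The following is only a minimal sketch, not an official diagnostic: it assumes a PyTorch build that exposes the torch.xpu module (on older builds, intel_extension_for_pytorch may need to be imported first), and the helper name list_xpu_devices is hypothetical.

```python
def list_xpu_devices() -> list[str]:
    """Return the names of the XPU devices PyTorch can see (empty if none)."""
    try:
        import torch
    except ImportError:
        return []  # PyTorch is not installed in this environment

    # Assumption: torch.xpu exists only on XPU-enabled builds, so look it up defensively.
    xpu = getattr(torch, "xpu", None)
    if xpu is None or not xpu.is_available():
        return []

    return [xpu.get_device_name(i) for i in range(xpu.device_count())]


if __name__ == "__main__":
    names = list_xpu_devices()
    if names:
        for index, name in enumerate(names):
            print(f"xpu:{index} -> {name}")
    else:
        print("No XPU device is visible to PyTorch in this environment.")
```

If this prints no devices inside the container while xpu-smi on the host lists one, the device is probably not being passed through to the container.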

louie-tsai · 2025-02-26T21:56:36Z

@rpolyano
Moreover, your code needs flash-attn, which doesn't support XPU or CPU:
https://pypi.org/project/flash-attn/
Given that, we might not be able to run the code with its flash-attn package dependency.
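
To confirm whether flash-attn is actually present in the environment, a quick check like the one below may help. It is a sketch using only the standard library. The helper name is_package_installed is hypothetical, and flash_attn is the import name of the flash-attn package on PyPI.

```python
import importlib.util


def is_package_installed(name: str) -> bool:
    """Return True if a top-level package with this import name can be found."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    if is_package_installed("flash_attn"):
        print("flash_attn is installed")
    else:
        print("flash_attn is not installed in this environment")
```

This only reports whether the package can be imported. It says nothing about whether the model's remote code requires it on a given device.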

louie-tsai self-assigned this Feb 5, 2025
