[iGPU] The device does not have the ext_intel_free_memory aspect #1352

Open
Stonepia opened this issue Feb 10, 2025 · 4 comments
Assignees
Labels: client dependency, component: driver, hw : LNL, hw : MTL (MTL platform), module: dependency bug (Problem is not caused by us, but caused by the library we use), os: Windows (Windows Platform)
Milestone: 2.8

Comments

@Stonepia
Contributor

Stonepia commented Feb 10, 2025

🐛 Describe the bug

This issue only happens on iGPUs on Windows; it passes on BMG (discrete GPU). The error message is below:

>>> import torch
>>> torch.xpu.mem_get_info()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\sdp\miniforge3\envs\tongsu_stock_pt\lib\site-packages\torch\xpu\memory.py", line 194, in mem_get_info
    return torch._C._xpu_getMemoryInfo(device)
RuntimeError: The device does not have the ext_intel_free_memory aspect
>>> torch.version.xpu
'20250000'
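
Until the driver supports this query, a user-side workaround is to treat free memory as unavailable and fall back to the device's total memory. Below is a minimal sketch against the public torch.xpu API; the safe_mem_get_info helper and its (0, total_memory) fallback are illustrative assumptions, not PyTorch behavior:

import torch

def safe_mem_get_info(device=0):
    # Hypothetical helper: on iGPUs lacking the ext_intel_free_memory aspect,
    # the free-memory query raises a RuntimeError, so report (0, total_memory)
    # taken from the device properties instead.
    try:
        return torch.xpu.mem_get_info(device)
    except RuntimeError:
        total = torch.xpu.get_device_properties(device).total_memory
        return 0, total

print(safe_mem_get_info())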

Versions

PyTorch version: 2.7.0.dev20250209+xpu
XPU used to build PyTorch: 20250000
Is XPU available: True
Intel GPU driver version:
  • 32.0.101.6458 (20250110000000.***+)
Intel GPU models onboard:
  • Intel(R) Arc(TM) 140V GPU (16GB)
Intel GPU models detected:
  • [0] _XpuDeviceProperties(name='Intel(R) Arc(TM) 140V GPU (16GB)', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31441', total_memory=16900MB, max_compute_units=64, gpu_eu_count=64, gpu_subslice_count=8, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
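
Most of the fields above can also be re-collected directly from Python; this is a minimal sketch using the standard torch.xpu query helpers (the Intel GPU driver version itself is reported by the OS/driver stack, not by this API):

import torch

# Print the build/runtime versions and enumerate the detected XPU devices.
print("PyTorch:", torch.__version__)
print("XPU used to build PyTorch:", torch.version.xpu)
print("Is XPU available:", torch.xpu.is_available())
for i in range(torch.xpu.device_count()):
    print(f"[{i}]", torch.xpu.get_device_properties(i))
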
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this issue Feb 13, 2025
# Motivation
Friendly handle the runtime error message if the device doesn't support querying the available free memory. See intel/torch-xpu-ops#1352

Pull Request resolved: #146899
Approved by: https://github.com/EikanWang
@Stonepia
Contributor Author

More details for future reproduction:

// icpx demo.cpp -o demo.exe -fsycl
#include <sycl/sycl.hpp>

#include <cstdint>
#include <iostream>
#include <memory>
#include <vector>
 
namespace {
 
struct DevicePool {
  std::vector<std::unique_ptr<sycl::device>> devices;
  std::unique_ptr<sycl::context> context;
} gDevicePool;
 
void enumDevices(std::vector<std::unique_ptr<sycl::device>>& devices) {
  for (const auto& platform : sycl::platform::get_platforms()) {
    if (platform.get_backend() != sycl::backend::ext_oneapi_level_zero) {
        continue;
    }
    for (const auto& device : platform.get_devices()) {
      if (device.is_gpu()) {
        devices.push_back(std::make_unique<sycl::device>(device));
      }
    }
    break;
  }
}
 
inline void initGlobalDevicePoolState() {
  // Enumerate all GPU devices and record them.
  enumDevices(gDevicePool.devices);
  if (gDevicePool.devices.empty()) {
    return;
  }
  gDevicePool.context = std::make_unique<sycl::context>(
      gDevicePool.devices[0]->get_platform().ext_oneapi_get_default_context());
}
 
}
 
sycl::device& get_raw_device(int device) {
  return *gDevicePool.devices[device];
}
 
sycl::context& get_device_context() {
  return *gDevicePool.context;
}
 
int device_count() {
  return gDevicePool.devices.size();
}
 
int main() {
  initGlobalDevicePoolState();
  const auto count = device_count();
  std::cout << "device count is " << count << std::endl;
  if (count <= 0) {
    return 0;
  }
  for (auto i = 0; i < count; i++) {
    auto& device = get_raw_device(i);
    std::cout << i << "th device name is " << device.get_info<sycl::info::device::name>() << ", total memory is "
              << device.get_info<sycl::info::device::global_mem_size>()/1024./1024./1024. << " Gb.";
    if (device.has(sycl::aspect::ext_intel_free_memory)) {
      std::cout << " free device memory is " << device.get_info<sycl::ext::intel::info::device::free_memory>()/1024./1024./1024. << "Gb.";
    } else {
      // This happens on LNL, which lacks the sycl::aspect::ext_intel_free_memory aspect.
      std::cout << " ERROR: free device memory is not available.";
    }
    std::cout << std::endl;
  }
 
  std::cout << "finish!" << std::endl;
}

The output is below. You can see that the device does not support sycl::aspect::ext_intel_free_memory:

>demo.exe
device count is 1
0th device name is Intel(R) Arc(TM) 140V GPU (16GB), total memory is 16.5044 Gb. ERROR: free device memory is not available.
finish!

>sycl-ls
[level_zero:gpu][level_zero:0] Intel(R) oneAPI Unified Runtime over Level-Zero, Intel(R) Arc(TM) 140V GPU (16GB) 20.4.4 [1.6.31441]
[opencl:cpu][opencl:0] Intel(R) OpenCL, Intel(R) Core(TM) Ultra 9 288V 3.30GHz OpenCL 3.0 (Build 0) [2024.18.12.0.05_160000]
[opencl:gpu][opencl:1] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) 140V GPU (16GB) OpenCL 3.0 NEO  [32.0.101.6458]

The UR trace shows the following output:

ZE ---> zeCommandListCreateImmediate(ZeContext, Device->ZeDevice, &ZeCommandQueueDesc, &ZeCommandListInit)
(.DeviceCount = 1, .phDevices = {000001E0CD323980}, .pProperties = nullptr, .phContext = 000001E0CE08AB80 (000001E0CE0AEC20)) -> UR_RESULT_SUCCESS;
device count is 1
0th device name is ---> urDeviceGetInfo(.hDevice = 000001E0CD323980, .propName = UR_DEVICE_INFO_NAME, .propSize = 0, .pPropValue = nullptr, .pPropSizeRet = 00000026E23BFC30 (33)) -> UR_RESULT_SUCCESS;
---> urDeviceGetInfo(.hDevice = 000001E0CD323980, .propName = UR_DEVICE_INFO_NAME, .propSize = 33, .pPropValue = 000001E0CE099E50 (Intel(R) Arc(TM) 140V GPU (16GB)), .pPropSizeRet = nullptr) -> UR_RESULT_SUCCESS;
Intel(R) Arc(TM) 140V GPU (16GB), total memory is ---> urDeviceGetInfoZE ---> zeDeviceGetMemoryProperties(ZeDevice, &Count, nullptr)
ZE ---> zeDeviceGetMemoryProperties(ZeDevice, &Count, PropertiesVector.data())
(.hDevice = 000001E0CD323980, .propName = UR_DEVICE_INFO_GLOBAL_MEM_SIZE, .propSize = 8, .pPropValue = 00000026E23BFD50 (17721458688), .pPropSizeRet = nullptr) -> UR_RESULT_SUCCESS;
16.5044 Gb.---> urDeviceGetInfoZE ---> zesDeviceEnumMemoryModules(ZesDevice, &MemCount, nullptr)
(.hDevice = 000001E0CD323980, .propName = UR_DEVICE_INFO_GLOBAL_MEM_FREE, .propSize = 0, .pPropValue = nullptr, .pPropSizeRet = 00000026E23BFD18 (0)) -> UR_RESULT_ERROR_UNSUPPORTED_ENUMERATION;
 ERROR: free device memory is not available.
finish!

@Stonepia
Contributor Author

The root cause of this issue is that the driver does not support this query yet. See GSD-10758 for the internal tracker.

In summary, this is not supported on integrated platforms (like LNL) at this time. UR_DEVICE_INFO_GLOBAL_MEM_FREE requires the Sysman memory modules to be reported by the device, which they are not, hence the unsupported return code from UR.

UR detects that 0 memory modules are reported and returns this error.

@Stonepia added the hw : MTL (MTL platform), module: dependency bug (Problem is not caused by us, but caused by the library we use), and os: Windows (Windows Platform) labels on Feb 20, 2025
@Stonepia changed the title from "[LNL] The device does not have the ext_intel_free_memory aspect" to "[iGPU] The device does not have the ext_intel_free_memory aspect" on Feb 20, 2025
@Stonepia
Contributor Author

This should not be a blocking issue, since PyTorch already warns users when the query is unsupported, so the overall user experience is not affected:

pytorch/pytorch#146899

@Stonepia
Contributor Author

Stonepia commented Feb 21, 2025

Additional reference: intel/compute-runtime#742

@daisyden added this to the 2.8 milestone on Feb 24, 2025