Memory Bug #71

Open
raevillena opened this issue Jun 27, 2024 · 26 comments
Labels: aitce, question (Further information is requested)
@raevillena

raevillena commented Jun 27, 2024

I am having a memory issue. Everything works, except that training on larger data crashes the Jupyter notebook kernel.

System: desktop

Ubuntu 22.04 on WSL2
Host: Windows 11
32 GB RAM
AMD Ryzen 7 5700X CPU
Intel Arc A750 8GB

Setup:
miniconda3 with an itex environment

# pip list | grep tensorflow
intel_extension_for_tensorflow     2.15.0.0
intel_extension_for_tensorflow_lib 2.15.0.0.2
tensorflow                         2.15.0
tensorflow-datasets                4.9.3
tensorflow-estimator               2.15.0
tensorflow-io-gcs-filesystem       0.37.0
tensorflow-metadata                1.15.0

Running model.fit with the training data results in the following (especially with VGG; ResNet works fine):

2024-06-27 16:10:04.010287: I external/tsl/tsl/framework/bfc_allocator.cc:1122] Sum Total of in-use chunks: 513.70MiB
2024-06-27 16:10:04.010290: I external/tsl/tsl/framework/bfc_allocator.cc:1124] Total bytes in pool: 982550528 memory_limit_: 7487535513 available bytes: 6504984985 curr_region_allocation_bytes_: 14975071232
2024-06-27 16:10:04.010295: I external/tsl/tsl/framework/bfc_allocator.cc:1129] Stats:
Limit:                      7487535513
InUse:                       538648576
MaxInUse:                    956967680
NumAllocs:                         297
MaxAllocSize:                485714176
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0

It crashes no matter what I do, when it tries to allocate that 14 GB in curr_region_allocation_bytes_.

Global memory shows:

# clinfo | grep "Global memory size"
Global memory size                              16723046400 (15.57GiB)
Global memory size                              8319483904 (7.748GiB)

By the way, my version of ITEX didn't come with check_env.sh, so I can't run that; I only know it works because sometimes it does and sometimes it doesn't.

In Jupyter the device is recognized as:

1 Physical GPUs, [LogicalDevice(name='/device:XPU:0', device_type='XPU')]
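
For reference, a quick way to confirm what ITEX exposes to TensorFlow (a minimal sketch, assuming a working TF + ITEX install):

import tensorflow as tf
# ITEX registers the Arc GPU as a pluggable "XPU" device.
print(tf.config.list_physical_devices('XPU'))
print(tf.config.list_logical_devices('XPU'))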

Also, the other setups I can read about with BFC allocator issues use the allocator that came with TensorFlow, while mine comes from the ITEX build files.

I can see that the repo is available for rebuilding, and there might be a chance to find out what is happening there, but I don't have the time or ability to do so.

I just want to know what I am missing here, since it was able to allocate almost 8 GB of memory but unable to expand beyond it.

I also tried exporting this in the conda environment, with no effect:
export ITEX_LIMIT_MEMORY_SIZE_IN_MB=4096

I said earlier that it works: yes, I can train a ResNet model blazingly fast compared to a Tesla T4 in Colab, but running it twice gives the memory error.

What is consistent is that it tries to allocate that curr_region_allocation_bytes_: 14975071232. That value is very consistent, and I don't know why. It makes sense that the OOM happens with that, but why allocate 14 GB when TF doesn't even need that much for the current workload?

@raevillena (Author)

Update: OK, so I was able to solve it.

After reading all the issues and documents I could find, here is what I did, starting from a reboot of WSL.

In your terminal, do this without activating the virtual environment:

export ITEX_LIMIT_MEMORY_SIZE_IN_MB=1024

If you have FP64 issues, do this too:

export OverrideDefaultFP64Settings=1
export IGC_EnableDPEmulation=1

Then forcefully source your vars (still in the base environment):
source /opt/intel/oneapi/setvars.sh --force

You may now activate the conda environment and set all the variables again:

export OverrideDefaultFP64Settings=1
export IGC_EnableDPEmulation=1
export ITEX_LIMIT_MEMORY_SIZE_IN_MB=1024

You can check using:
printenv
This will list the variables in the conda environment.

Then you can start using TF. In my case:
jupyter notebook

All this happened without reinstalling my system.
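
An alternative to exporting in every shell is to set the same variables from Python; a minimal sketch, assuming ITEX reads them when the library is first imported, so they must be set before the import. Note this does not replace sourcing setvars.sh, which sets library paths for the process.

import os
# Values from the workaround above; must run before TensorFlow/ITEX is imported.
os.environ["OverrideDefaultFP64Settings"] = "1"
os.environ["IGC_EnableDPEmulation"] = "1"
os.environ["ITEX_LIMIT_MEMORY_SIZE_IN_MB"] = "1024"

import tensorflow as tf  # import only after the variables are set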

@guizili0 (Contributor)

@raevillena Can you help check whether our latest weekly release still has this issue? Thanks.

pip install --upgrade intel-extension-for-tensorflow-weekly[xpu] -f https://developer.intel.com/itex-whl-weekly

@raevillena (Author)

I just tried right now, without exporting any of the env variables I mentioned above, but it still gives me:

NotFoundError: libsycl.so.7: cannot open shared object file: No such file or directory
This can be solved using source /opt/intel/oneapi/setvars.sh --force.

Then I tried solving it with just setvars, without setting the memory limit, but no: the memory bug is still there.

But the FP64 emulation is now working without setting the env vars.
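
A minimal way to exercise the FP64 path and see whether emulation kicks in (a sketch, assuming the device is registered as /XPU:0 as shown earlier; Arc GPUs lack native FP64):

import tensorflow as tf
with tf.device('/XPU:0'):
    # A float64 op goes through the emulation path, or fails if
    # OverrideDefaultFP64Settings / IGC_EnableDPEmulation are not in effect.
    x = tf.constant([1.0, 2.0, 3.0], dtype=tf.float64)
    print(tf.reduce_sum(x * x).numpy())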

@guizili0 (Contributor)

Can you help share the result of:
pip list | grep intel_extension_for_tensorflow

@raevillena (Author)

raevillena commented Jun 28, 2024

Hi, here it is:

(itex) rae@DESKTOP-URAMFL5:~$ pip list | grep intel_extension_for_tensorflow
intel_extension_for_tensorflow            2.15.0.0
intel_extension_for_tensorflow_lib        2.15.0.0.2
intel_extension_for_tensorflow_lib_weekly 2.15.0.1.2.dev20240603
intel_extension_for_tensorflow_weekly     2.15.0.1.dev2024060

Is there another step needed so the newer library gets used by default, or was that it?

@guizili0 (Contributor)

Please help to remove "intel_extension_for_tensorflow" and "intel_extension_for_tensorflow_lib" (i.e., pip uninstall intel_extension_for_tensorflow intel_extension_for_tensorflow_lib).

yinghu5 added the "question" (Further information is requested) label on Jun 28, 2024
@raevillena (Author)

Hi, can I test that after doing some modelling first? It works (and not just sometimes) for now.

@raevillena (Author)

I can already tell the update makes the GPU hold memory but uses the CPU to process: CPU went up to 100% with 0% from the GPU, whereas the original build used the GPU as the XPU. But let me restart WSL to confirm everything. My models went from 5 s of training per epoch to 130 s, which is not what I expect.
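
One way to confirm which device the ops actually execute on is TF's standard device-placement logging (a minimal sketch):

import tensorflow as tf
tf.debugging.set_log_device_placement(True)  # log the device chosen for each op
a = tf.random.uniform((1000, 1000))
b = tf.matmul(a, a)  # the placement log should name XPU:0 if the GPU backend loaded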

@raevillena (Author)

raevillena commented Jun 28, 2024

The update was no longer using the GPU, though; this line was no longer in the logs:
[2024-06-28 19:43:19.472249: I itex/core/wrapper/itex_gpu_wrapper.cc:38] Intel Extension for Tensorflow* GPU backend is loaded.

So it was purely using the CPU now.

@raevillena (Author)

I removed all ITEX packages and installed just the weekly build. The GPU gets mounted again, but all the errors came back with it too. Back to square one.

@feng-intel (Contributor)

Hi @raevillena,
How can I reproduce your issue?

@raevillena (Author)

Hi @raevillena How can I reproduce your issue?

Hi @feng-intel, this is the summary.

Hardware setup:

Ubuntu 22.04 on WSL2
Host: Windows 11 enterprise
32 GB RAM (DDR4-3600)
AMD Ryzen 7 5700X CPU
Intel Arc A750 8GB

WSL2:

Ubuntu 22.04 official distro (this runs on Microsoft's special kernel).

Running uname -r:
5.15.153.1-microsoft-standard-WSL2

From a fresh installation, following the steps here:
https://github.com/intel/intel-extension-for-tensorflow/blob/main/docs/install/experimental/install_for_arc_gpu.md

sudo apt-get install -y gpg-agent wget
wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | sudo gpg --dearmor --output /usr/share/keyrings/intel-graphics.gpg
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu jammy/lts/2350 unified" | sudo tee /etc/apt/sources.list.d/intel-gpu-jammy.list
sudo apt-get update

then

sudo apt-get install \
    intel-igc-cm \
    intel-level-zero-gpu \
    intel-opencl-icd \
    level-zero \
    libigc1 \
    libigdfcl1 \
    libigdgmm12

I needed to install the whole oneAPI Base Kit because I needed setvars to source:

wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/fdc7a2bc-b7a8-47eb-8876-de6201297144/l_BaseKit_p_2024.1.0.596.sh
sudo sh ./l_BaseKit_p_2024.1.0.596.sh

then

source /opt/intel/oneapi/setvars.sh

setting up my conda environment: https://intel.github.io/intel-extension-for-tensorflow/latest/docs/install/experimental/install_for_gpu_conda.html

curl https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
conda update conda
conda create -n itex -c intel intelpython3_full python=3.9
# I removed the version pin; the original was: conda create -n itex -c intel intelpython3_full==2023.2.0 python=3.9

Activated my conda environment:
conda activate itex
Then proceeded as documented:

pip install --upgrade pip
pip install tensorflow==2.15.0
pip install intel-extension-for-tensorflow[xpu]
source /opt/intel/oneapi/compiler/latest/env/vars.sh
source /opt/intel/oneapi/mkl/latest/env/vars.sh
export path_to_site_packages=`python -c "import site; print(site.getsitepackages()[0])"`
bash ${path_to_site_packages}/intel_extension_for_tensorflow/tools/env_check.sh

But the output says there is no such file or directory for env_check.sh, because it isn't shipped in the latest version.

Then I installed Jupyter using these steps, from here:
https://www.intel.com/content/www/us/en/developer/articles/technical/running-tensorflow-stable-diffusion-on-intel-arc.html

pip install notebook
pip install keras tensorflow-datasets matplotlib ipywidgets
jupyter notebook

Here is the sample model:

import tensorflow as tf

base_model = tf.keras.applications.VGG16(include_top=False)
base_model.trainable = False
inputs = tf.keras.layers.Input(shape=(224, 224, 3), name="input_layer")
x = tf.keras.layers.experimental.preprocessing.Rescaling(1./255)(inputs)
x = base_model(inputs)
x = tf.keras.layers.GlobalAveragePooling2D(name="global_average_pooling_layer")(x)
outputs = tf.keras.layers.Dense(3, activation="softmax", name="output_layer")(x)
model_5 = tf.keras.Model(inputs, outputs)
model_5.compile(loss='categorical_crossentropy',
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])
history5 = model_5.fit(train_data_50_test,
                       epochs=10,
                       steps_per_epoch=len(train_data_50_test),
                       validation_data=val_data_50_test,
                       validation_steps=int(0.5 * len(val_data_50_test)))

Maybe you have data there; I cannot provide my own (see the synthetic stand-in sketch below).
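
A hypothetical stand-in for the missing datasets (random tensors, not real data, just enough to exercise the same code path; the names match the snippet above):

import tensorflow as tf

def fake_dataset(n=512, batch=32, classes=3):
    # Random 224x224 RGB images with one-hot labels for the 3-class head above.
    images = tf.random.uniform((n, 224, 224, 3), maxval=255.0)
    labels = tf.one_hot(tf.random.uniform((n,), maxval=classes, dtype=tf.int32), classes)
    return tf.data.Dataset.from_tensor_slices((images, labels)).batch(batch)

train_data_50_test = fake_dataset()
val_data_50_test = fake_dataset(n=128)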

Is there something I didn't mention, apart from the exact logs? I don't want to redo the setup; in the meantime I have switched to the CPU while waiting for developments on this.

yinghu5 assigned feng-intel and unassigned aice-support on Jul 1, 2024
@yinghu5

yinghu5 commented Jul 5, 2024

@raevillena,
Thank you for the details. If your environment is still up, could you please download env_check.py using:
wget https://raw.githubusercontent.com/intel/intel-extension-for-tensorflow/v2.15.0.0/tools/python/env_check.py
then run it and let us know the output?
python env_check.py

Thanks

@raevillena (Author)

Hi @yinghu5

Here is the result:

(itex) rae@DESKTOP-URAMFL5:~$ python env_check.py

Check Environment for Intel(R) Extension for TensorFlow*...

Check Python
         Python 3.9.19 is Supported.
Check Python Passed

Check OS
        OS ubuntu:22.04 is Supported
Check OS Passed

Check Tensorflow
        Tensorflow 2.15.0 is installed.
Check Tensorflow Passed

Check Intel GPU Driver
Package: intel-level-zero-gpu
Status: install ok installed
Priority: optional
Section: libs
Installed-Size: 28239
Maintainer: Intel Graphics Team <[email protected]>
Architecture: amd64
Source: intel-compute-runtime
Version: 1.3.27642.52-803~22.04
Depends: libc6 (>= 2.34), libgcc-s1 (>= 3.4), libigdgmm12 (>= 22.3.15), libstdc++6 (>= 12), libigc1 (>= 1.0.12812), libigdfcl1 (>= 1.0.12812), libnl-3-200, libnl-route-3-200
Description: Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
 Level Zero is the primary low-level interface for language and runtime
 libraries. Level Zero offers fine-grain control over accelerators
 capabilities, delivering a simplified and low-latency interface to
 hardware, and efficiently exposing hardware capabilities to applications.
Homepage: https://github.com/oneapi-src/level-zero
Original-Maintainer: Debian OpenCL Maintainers <[email protected]>
Package: intel-opencl-icd
Status: install ok installed
Priority: optional
Section: libs
Installed-Size: 23865
Maintainer: Intel Graphics Team <[email protected]>
Architecture: amd64
Source: intel-compute-runtime
Version: 23.43.27642.52-803~22.04
Replaces: intel-opencl
Provides: opencl-icd
Depends: libc6 (>= 2.34), libgcc-s1 (>= 3.4), libigdgmm12 (>= 22.3.15), libstdc++6 (>= 12), ocl-icd-libopencl1, libigc1 (>= 1.0.12812), libigdfcl1 (>= 1.0.12812)
Recommends: intel-igc-cm (>= 1.0.100)
Breaks: intel-opencl
Conffiles:
 /etc/OpenCL/vendors/intel.icd d0a34d0b4f75385c56ee357bb1b8e2d0
Description: Intel graphics compute runtime for OpenCL
 The Intel(R) Graphics Compute Runtime for OpenCL(TM) is a open source
 project to converge Intel's development efforts on OpenCL(TM) compute
 stacks supporting the GEN graphics hardware architecture.
 .
 Supported platforms:
 - Intel Core Processors with Gen8 GPU (Broadwell) - OpenCL 2.1
 - Intel Core Processors with Gen9 GPU (Skylake, Kaby Lake, Coffee Lake) - OpenCL 2.1
 - Intel Atom Processors with Gen9 GPU (Apollo Lake, Gemini Lake) - OpenCL 1.2
 - Intel Core Processors with Gen11 GPU (Ice Lake) - OpenCL 2.1
 - Intel Core Processors with Gen12 graphics devices (formerly Tiger Lake) - OpenCL 2.1
Homepage: https://github.com/intel/compute-runtime
Original-Maintainer: Debian OpenCL Maintainers <[email protected]>
Package: level-zero
Status: install ok installed
Priority: optional
Section: libs
Installed-Size: 1049
Maintainer: Intel Graphics Team <[email protected]>
Architecture: amd64
Source: level-zero-loader
Version: 1.14.0-744~22.04
Depends: libc6 (>= 2.34), libgcc-s1 (>= 3.3.1), libstdc++6 (>= 11)
Description: Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
 Level Zero is the primary low-level interface for language and runtime
 libraries. Level Zero offers fine-grain control over accelerators
 capabilities, delivering a simplified and low-latency interface to
 hardware, and efficiently exposing hardware capabilities to applications.
 .
 This package provides the loader for oneAPI Level Zero compute runtimes.
Homepage: https://github.com/oneapi-src/level-zero
Package: libigc1
Status: install ok installed
Priority: optional
Section: libs
Installed-Size: 86364
Maintainer: Intel Graphics Team <[email protected]>
Architecture: amd64
Source: intel-graphics-compiler
Version: 1.0.15468.29-803~22.04
Depends: libc6 (>= 2.34), libgcc-s1 (>= 3.4), libstdc++6 (>= 12), zlib1g (>= 1:1.2.2)
Description: Intel graphics compiler for OpenCL -- core libs
 The Intel(R) Graphics Compiler for OpenCL(TM) is an llvm based compiler
 for OpenCL(TM) targeting Intel Gen graphics hardware architecture.
 .
 This package includes the core libraries.
Homepage: https://github.com/intel/intel-graphics-compiler
Original-Maintainer: Debian OpenCL team <[email protected]>
Package: libigdfcl1
Status: install ok installed
Priority: optional
Section: libs
Installed-Size: 116046
Maintainer: Intel Graphics Team <[email protected]>
Architecture: amd64
Source: intel-graphics-compiler
Version: 1.0.15468.29-803~22.04
Depends: libc6 (>= 2.34), libgcc-s1 (>= 3.4), libstdc++6 (>= 11), zlib1g (>= 1:1.2.0), libz3-4 (>= 4.7.1)
Description: Intel graphics compiler for OpenCL -- OpenCL library
 The Intel(R) Graphics Compiler for OpenCL(TM) is an llvm based compiler
 for OpenCL(TM) targeting Intel Gen graphics hardware architecture.
 .
 This package includes the library for OpenCL.
Homepage: https://github.com/intel/intel-graphics-compiler
Original-Maintainer: Debian OpenCL team <[email protected]>
Package: libigdgmm12
Status: install ok installed
Priority: optional
Section: libs
Installed-Size: 648
Maintainer: Intel Graphics Team <[email protected]>
Architecture: amd64
Multi-Arch: same
Source: intel-gmmlib
Version: 22.3.15-803~22.04
Replaces: libigdgmm11
Depends: libc6 (>= 2.34), libgcc-s1 (>= 3.3.1), libstdc++6 (>= 4.1.1)
Description: Intel Graphics Memory Management Library -- shared library
 The Intel Graphics Memory Management Library provides device specific
 and buffer management for the Intel Graphics Compute Runtime for
 OpenCL and the Intel Media Driver for VAAPI.
 .
 This library is only useful for Broadwell and newer CPUs.
 .
 This package includes the shared library.
Homepage: https://github.com/intel/gmmlib
Original-Maintainer: Debian Multimedia Maintainers <[email protected]>
Check Intel GPU Driver Passsed

Check OneAPI
        Can't find dpcpp
 Check OneAPI Failed

at the same time:

(itex) rae@DESKTOP-URAMFL5:~$ sudo apt install intel-oneapi-runtime-dpcpp-cpp
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
intel-oneapi-runtime-dpcpp-cpp is already the newest version (2024.2.0-981).
0 upgraded, 0 newly installed, 0 to remove and 47 not upgraded.

Please enlighten me. Also, it works as long as I export these environment variables every time I open my WSL instance:

export OverrideDefaultFP64Settings=1
export IGC_EnableDPEmulation=1
export ITEX_LIMIT_MEMORY_SIZE_IN_MB=1024
source /opt/intel/oneapi/setvars.sh --force

@yinghu5

yinghu5 commented Jul 8, 2024

Hi @raevillena,
thank you!
From the result, it seems you have two versions of oneAPI in the environment: dpcpp 2024.1 and the newest dpcpp-cpp (2024.2.0-981).

Check OneAPI
Can't find dpcpp
Check OneAPI Failed

I recall you had installed it with:
wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/fdc7a2bc-b7a8-47eb-8876-de6201297144/l_BaseKit_p_2024.1.0.596.sh
sudo sh ./l_BaseKit_p_2024.1.0.596.sh

As the current ITEX 2.15.0 was tested with oneAPI 2024.1, could you please remove the intel-oneapi-runtime-dpcpp-cpp package, then run:
$ source /opt/intel/oneapi/setvars.sh --force
$ icx -V
$ sycl-ls
and show the output? Then run env_check again and see if the OneAPI error is gone?

Second, in the intel-opencl-icd package description (see the env_check output above) I saw "Breaks: intel-opencl", and a supported-platforms list that only covers Intel Core/Atom processors with Gen8 through Gen12 graphics. As we haven't tried an AMD machine, I am not sure if this is a real problem or not.

Third, about the new ITEX version: as Guizi mentioned, try the next release.
pip install --upgrade intel-extension-for-tensorflow-weekly[xpu] -f https://developer.intel.com/itex-whl-weekly
And could you please try the simple hello-world program and show the result?
$ wget https://raw.githubusercontent.com/oneapi-src/oneAPI-samples/master/AI-and-Analytics/Getting-Started-Samples/IntelTensorFlow_GettingStarted/TensorFlow_HelloWorld.py
$ python TensorFlow_HelloWorld.py

@raevillena (Author)

raevillena commented Jul 8, 2024

Hello @yinghu5

a) Yes, I installed a newer dpcpp, but I had encountered these results even before installing it; I installed it to see if that was the only reason.

b)

(itex) rae@DESKTOP-URAMFL5:~$ sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2024.17.3.0.08_160000]
[opencl:cpu:1] Intel(R) OpenCL, AMD Ryzen 7 5700X 8-Core Processor              OpenCL 3.0 (Build 0) [2024.17.3.0.08_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Graphics [0x56a1] OpenCL 3.0 NEO  [23.43.27642.52]
[opencl:cpu:3] Intel(R) OpenCL, AMD Ryzen 7 5700X 8-Core Processor              OpenCL 3.0 (Build 0) [2024.18.6.0.02_160000]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Graphics [0x56a1] 1.3 [1.3.27642]
(itex) rae@DESKTOP-URAMFL5:~$ icx -V
Intel(R) oneAPI DPC++/C++ Compiler for applications running on Intel(R) 64, Version 2024.1.0 Build 20240308
Copyright (C) 1985-2024 Intel Corporation. All rights reserved.

c)
The AMD platform is not included in your test devices?

d)

$ wget https://raw.githubusercontent.com/oneapi-src/oneAPI-samples/master/AI-and-Analytics/Getting-Started-Samples/IntelTensorFlow_GettingStarted/TensorFlow_HelloWorld.py
$ python TensorFlow_HelloWorld.py

As I said in previous replies, running simple commands does not produce this error; I can even run a simple 10-epoch transfer learning of an EfficientNetB0 model (which is lighter than VGG16).

@yinghu5

yinghu5 commented Jul 8, 2024

Hi @raevillena,
a) How about after $ source /opt/intel/oneapi/setvars.sh --force: does the error still persist?
b) It seems the oneAPI environment works fine on your machine.
c) Right, it is not included in the validated devices.
d) Do you have output? What is the output, and does it include some GPU log information?

Thanks

@raevillena (Author)

raevillena commented Jul 8, 2024

Hello @yinghu5,

a) How about after $ source /opt/intel/oneapi/setvars.sh --force: does the error still persist?

Nope, that is the fix that works for now.

Using this on each run solves most of the problem:

export OverrideDefaultFP64Settings=1
export IGC_EnableDPEmulation=1
export ITEX_LIMIT_MEMORY_SIZE_IN_MB=1024
source /opt/intel/oneapi/setvars.sh --force

b) It seems the oneAPI environment works fine on your machine.

I think so too; it works after forcing setvars, but it doesn't persist after a restart of the instance or console.

c) Right, it is not included in the validated devices.

cannot complain about that

d) Do you have output? What is the output, and does it include some GPU log information?

Nope; after sourcing the vars, the only log it echoes is the one at initialization. After that it works as intended.

@yinghu5

yinghu5 commented Jul 8, 2024

Got it, thanks for understanding :).
About d), do you see the log? Here is the log when the code runs on CPU and can't use the GPU:
(itex214) ~$ python hello_tf.py
2024-07-08 16:53:09.912092: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2024-07-08 16:53:09.949268: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-07-08 16:53:10.101459: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-08 16:53:10.101523: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-08 16:53:10.102364: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-07-08 16:53:10.198395: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-07-08 16:53:10.199198: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-07-08 16:53:10.841688: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-07-08 16:53:11.435196: W itex/core/wrapper/itex_gpu_wrapper.cc:32] Could not load dynamic library: libimf.so: cannot open shared object file: No such file or directory
2024-07-08 16:53:11.544927: I itex/core/wrapper/itex_cpu_wrapper.cc:70] Intel Extension for Tensorflow* AVX2 CPU backend is loaded.
2024-07-08 16:53:11.576732: E itex/core/wrapper/itex_gpu_wrapper.cc:49] Could not load Intel Extension for Tensorflow GPU backend, GPU will not be used*.
If you need help, create an issue at https://github.com/intel/intel-extension-for-tensorflow/issues
2024-07-08 16:53:11.577016: E itex/core/wrapper/itex_gpu_wrapper.cc:49] Could not load Intel Extension for Tensorflow GPU backend, GPU will not be used.*
If you need help, create an issue at https://github.com/intel/intel-extension-for-tensorflow/issues
WARNING:tensorflow:From /home/yhu5/miniconda3/envs/itex214/lib/python3.10/site-packages/tensorflow/python/compat/v2_compat.py:108: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
2024-07-08 16:53:11.725437: E itex/core/wrapper/itex_gpu_wrapper.cc:49] Could not load Intel Extension for Tensorflow* GPU backend, GPU will not be used.
If you need help, create an issue at https://github.com/intel/intel-extension-for-tensorflow/issues
2024-07-08 16:53:11.725575: E itex/core/wrapper/itex_gpu_wrapper.cc:49] Could not load Intel Extension for Tensorflow* GPU backend, GPU will not be used.
If you need help, create an issue at https://github.com/intel/intel-extension-for-tensorflow/issues
2024-07-08 16:53:11.725890: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:382] MLIR V1 optimization pass is not enabled
2024-07-08 16:53:11.728693: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type CPU is enabled.
0 0.43929783
1 0.36791593
2 0.34823328
3 0.33959246
4 0.33490422
Tensorflow HelloWorld Done!
[CODE_SAMPLE_COMPLETED_SUCCESFULLY]

And if I source:
source /opt/intel/oneapi/mkl/2024.1/env/vars.sh
source /opt/intel/oneapi/compiler/2024.1/env/vars.sh
then the code below will run on GPU:
(itex214) yhu5@arc770-tce:~$ python hello_tf.py
2024-07-08 16:56:20.460059: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2024-07-08 16:56:20.461009: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-07-08 16:56:20.476518: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-08 16:56:20.476533: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-08 16:56:20.476546: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-07-08 16:56:20.479603: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-07-08 16:56:20.479713: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-07-08 16:56:20.833677: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-07-08 16:56:21.834356: I itex/core/wrapper/itex_gpu_wrapper.cc:35] Intel Extension for Tensorflow GPU backend is loaded.*
2024-07-08 16:56:21.856422: I itex/core/wrapper/itex_cpu_wrapper.cc:70] Intel Extension for Tensorflow* AVX2 CPU backend is loaded.
2024-07-08 16:56:21.982129: I itex/core/devices/gpu/itex_gpu_runtime.cc:129] Selected platform: Intel(R) Level-Zero
2024-07-08 16:56:21.982435: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
2024-07-08 16:56:21.982447: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
WARNING:tensorflow:From /home/yhu5/miniconda3/envs/itex214/lib/python3.10/site-packages/tensorflow/python/compat/v2_compat.py:108: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
2024-07-08 16:56:22.085860: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform XPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-07-08 16:56:22.085881: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform XPU ID 1, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-07-08 16:56:22.085890: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:XPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: XPU, pci bus id: )
2024-07-08 16:56:22.086188: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:XPU:1 with 0 MB memory) -> physical PluggableDevice (device: 1, name: XPU, pci bus id: )
2024-07-08 16:56:22.086430: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform XPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-07-08 16:56:22.086437: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform XPU ID 1, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-07-08 16:56:22.086441: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:XPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: XPU, pci bus id: )
2024-07-08 16:56:22.086445: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:XPU:1 with 0 MB memory) -> physical PluggableDevice (device: 1, name: XPU, pci bus id: )
2024-07-08 16:56:22.086731: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:382] MLIR V1 optimization pass is not enabled
2024-07-08 16:56:22.087278: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type XPU is enabled.
0 0.37987733
1 0.34364864
2 0.33345342
3 0.32908753
4 0.32680914
Tensorflow HelloWorld Done!
[CODE_SAMPLE_COMPLETED_SUCCESFULLY]

@yinghu5

yinghu5 commented Jul 9, 2024

Hi @raevillena

Is there any update?
Back to the original question: the GPU memory limit is ~7.4 GB, and the CPU memory limit is about 14.9 GB. I did further investigation, and it seems the memory bug is still caused by OOM of the GPU memory.

Below is my test code. Could you please try it (without any environment variable setting), change datasize to 100 or 1000, and show the output?

On my machine with 8 GB of GPU memory, the code below runs with datasize=100, but it failed with datasize=1000. It even failed very early, at the call inside preprocess_image_input:
return tf.image.resize(output_ims, [224, 224])

Log info: ran out of XPU memory.
2024-07-09 11:16:07.554374: W external/tsl/tsl/framework/bfc_allocator.cc:500] Allocator (XPU_0_bfc) ran out of memory trying to allocate 573.64MiB (rounded to 601509888)requested by op
....

2024-07-09 11:10:44.057305: I external/tsl/tsl/framework/bfc_allocator.cc:1124] Total bytes in pool: 982310912 memory_limit_: 6039281664 available bytes: 5056970752 curr_region_allocation_bytes_: 12078563328

Thanks

python vgg16.py (without any environment variables):

import tensorflow as tf
from keras.datasets import cifar10
from keras.utils import to_categorical

(x_train, y_train), (x_test, y_test) = cifar10.load_data()  # x_train - training data (images), y_train - labels (digits)
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

def preprocess_image_input(input_images):
    # input_images = input_images.astype('float32')
    output_ims = tf.keras.applications.vgg16.preprocess_input(input_images)
    print('output_ims:', output_ims.shape)
    return tf.image.resize(output_ims, [224, 224])

nb_classes = 10
datasize = 100
y_train = to_categorical(y_train[1:datasize], nb_classes)
y_test = to_categorical(y_test[1:datasize], nb_classes)
print("Train shape", x_train.shape, y_train.shape)

train_data_50_test = preprocess_image_input(x_train[1:datasize])
val_data_50_test = preprocess_image_input(x_test[1:datasize])

print("================model training=============")
base_model = tf.keras.applications.VGG16(include_top=False)
base_model.trainable = False
inputs = tf.keras.layers.Input(shape=(224, 224, 3), name="input_layer")
x = tf.keras.layers.experimental.preprocessing.Rescaling(1./255)(inputs)
x = base_model(inputs)
x = tf.keras.layers.GlobalAveragePooling2D(name="global_average_pooling_layer")(x)
outputs = tf.keras.layers.Dense(nb_classes, activation="softmax", name="output_layer")(x)
model_5 = tf.keras.Model(inputs, outputs)
model_5.compile(loss='categorical_crossentropy',
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])
history5 = model_5.fit(train_data_50_test, y_train,
                       epochs=2, verbose=1,
                       batch_size=10,
                       # steps_per_epoch=len(train_data_50_test),
                       validation_data=(val_data_50_test, y_test),
                       # validation_steps=int(0.5 * len(val_data_50_test))
                       )
# result = model.evaluate(val_data_50_test)

datasize=100 output:

2024-07-09 11:03:48.833394: I itex/core/wrapper/itex_gpu_wrapper.cc:38] Intel Extension for Tensorflow* GPU backend is loaded.
2024-07-09 11:03:48.834617: I external/local_xla/xla/pjrt/pjrt_api.cc:67] PJRT_Api is set for device type xpu
2024-07-09 11:03:48.834660: I external/local_xla/xla/pjrt/pjrt_api.cc:72] PJRT plugin for XPU has PJRT API version 0.33. The framework PJRT API version is 0.34.
2024-07-09 11:03:49.216499: I external/intel_xla/xla/stream_executor/sycl/sycl_gpu_runtime.cc:134] Selected platform: Intel(R) Level-Zero
2024-07-09 11:03:49.216817: I external/intel_xla/xla/stream_executor/sycl/sycl_gpu_runtime.cc:159] number of sub-devices is zero, expose root device.
2024-07-09 11:03:49.220242: I external/xla/xla/service/service.cc:168] XLA service 0x5579164cd630 initialized for platform SYCL (this does not guarantee that XLA will be used). Devices:
2024-07-09 11:03:49.220280: I external/xla/xla/service/service.cc:176]   StreamExecutor device (0): Intel(R) Graphics [0x9a49], <undefined>
2024-07-09 11:03:49.221860: I itex/core/devices/gpu/itex_gpu_runtime.cc:130] Selected platform: Intel(R) Level-Zero
2024-07-09 11:03:49.222135: I itex/core/devices/gpu/itex_gpu_runtime.cc:155] number of sub-devices is zero, expose root device.
2024-07-09 11:03:49.225126: I external/intel_xla/xla/pjrt/se_xpu_pjrt_client.cc:97] Using BFC allocator.
2024-07-09 11:03:49.225174: I external/xla/xla/pjrt/gpu/gpu_helpers.cc:106] XLA backend allocating 6039281664 bytes on device 0 for BFCAllocator.
2024-07-09 11:03:49.227397: I external/local_xla/xla/pjrt/pjrt_c_api_client.cc:119] PjRtCApiClient created.
x_train shape: (50000, 32, 32, 3)
50000 train samples
10000 test samples
Train shape (50000, 32, 32, 3) (99, 10)
output_ims: (99, 32, 32, 3)
2024-07-09 11:03:50.833135: I tensorflow/core/common_runtime/next_pluggable_device/next_pluggable_device_factory.cc:118] Created 1 TensorFlow NextPluggableDevices. Physical device type: XPU
output_ims: (99, 32, 32, 3)
2024-07-09 11:04:14.986917: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type XPU is enabled.

10/10 [==============================] - 124s 1s/step - loss: 4.2215 - accuracy: 0.1818 - val_loss: 4.3642 - val_accuracy: 0.1616

datasize=1000 output:
2024-07-09 11:10:33.292887: I itex/core/wrapper/itex_gpu_wrapper.cc:38] Intel Extension for Tensorflow* GPU backend is loaded.
2024-07-09 11:10:33.293279: I external/local_xla/xla/pjrt/pjrt_api.cc:67] PJRT_Api is set for device type xpu
2024-07-09 11:10:33.293315: I external/local_xla/xla/pjrt/pjrt_api.cc:72] PJRT plugin for XPU has PJRT API version 0.33. The framework PJRT API version is 0.34.
2024-07-09 11:10:33.311334: I external/intel_xla/xla/stream_executor/sycl/sycl_gpu_runtime.cc:134] Selected platform: Intel(R) Level-Zero
2024-07-09 11:10:33.311711: I external/intel_xla/xla/stream_executor/sycl/sycl_gpu_runtime.cc:159] number of sub-devices is zero, expose root device.
2024-07-09 11:10:33.319263: I external/xla/xla/service/service.cc:168] XLA service 0x560623f3a810 initialized for platform SYCL (this does not guarantee that XLA will be used). Devices:
2024-07-09 11:10:33.319314: I external/xla/xla/service/service.cc:176] StreamExecutor device (0): Intel(R) Graphics [0x9a49],
2024-07-09 11:10:33.320949: I itex/core/devices/gpu/itex_gpu_runtime.cc:130] Selected platform: Intel(R) Level-Zero
2024-07-09 11:10:33.321240: I itex/core/devices/gpu/itex_gpu_runtime.cc:155] number of sub-devices is zero, expose root device.
2024-07-09 11:10:33.324133: I external/intel_xla/xla/pjrt/se_xpu_pjrt_client.cc:97] Using BFC allocator.
2024-07-09 11:10:33.324173: I external/xla/xla/pjrt/gpu/gpu_helpers.cc:106] XLA backend allocating 6039281664 bytes on device 0 for BFCAllocator.
2024-07-09 11:10:33.325522: I external/local_xla/xla/pjrt/pjrt_c_api_client.cc:119] PjRtCApiClient created.
x_train shape: (50000, 32, 32, 3)
50000 train samples
10000 test samples
Train shape (50000, 32, 32, 3) (999, 10)
output_ims: (999, 32, 32, 3)
2024-07-09 11:10:33.751766: I tensorflow/core/common_runtime/next_pluggable_device/next_pluggable_device_factory.cc:118] Created 1 TensorFlow NextPluggableDevices. Physical device type: XPU
output_ims: (999, 32, 32, 3)
2024-07-09 11:10:44.055860: W external/tsl/tsl/framework/bfc_allocator.cc:500] Allocator (XPU_0_bfc) ran out of memory trying to allocate 573.64MiB (rounded to 601509888)requested by op
2024-07-09 11:10:44.056023: I external/tsl/tsl/framework/bfc_allocator.cc:1054] BFCAllocator dump for XPU_0_bfc
2024-07-09 11:10:44.056044: I external/tsl/tsl/framework/bfc_allocator.cc:1061] Bin (256): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-07-09 11:10:44.056094: I external/tsl/tsl/framework/bfc_allocator.cc:1061] Bin (512): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-07-09 11:10:44.056124: I external/tsl/tsl/framework/bfc_allocator.cc:1061] Bin (1024): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-07-09 11:10:44.056177: I external/tsl/tsl/framework/bfc_allocator.cc:1061] Bin (2048): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-07-09 11:10:44.056238: I external/tsl/tsl/framework/bfc_allocator.cc:1061] Bin (4096): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-07-09 11:10:44.056285: I external/tsl/tsl/framework/bfc_allocator.cc:1061] Bin (8192): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-07-09 11:10:44.056371: I external/tsl/tsl/framework/bfc_allocator.cc:1061] Bin (16384): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-07-09 11:10:44.056391: I external/tsl/tsl/framework/bfc_allocator.cc:1061] Bin (32768): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-07-09 11:10:44.056406: I external/tsl/tsl/framework/bfc_allocator.cc:1061] Bin (65536): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-07-09 11:10:44.056419: I external/tsl/tsl/framework/bfc_allocator.cc:1061] Bin (131072): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-07-09 11:10:44.056434: I external/tsl/tsl/framework/bfc_allocator.cc:1061] Bin (262144): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-07-09 11:10:44.056446: I external/tsl/tsl/framework/bfc_allocator.cc:1061] Bin (524288): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-07-09 11:10:44.056455: I external/tsl/tsl/framework/bfc_allocator.cc:1061] Bin (1048576): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-07-09 11:10:44.056463: I external/tsl/tsl/framework/bfc_allocator.cc:1061] Bin (2097152): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-07-09 11:10:44.056472: I external/tsl/tsl/framework/bfc_allocator.cc:1061] Bin (4194304): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-07-09 11:10:44.056490: I external/tsl/tsl/framework/bfc_allocator.cc:1061] Bin (8388608): Total Chunks: 1, Chunks in use: 1. 11.71MiB allocated for chunks. 11.71MiB in use in bin. 11.71MiB client-requested in use in bin.
2024-07-09 11:10:44.056506: I external/tsl/tsl/framework/bfc_allocator.cc:1061] Bin (16777216): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-07-09 11:10:44.056572: I external/tsl/tsl/framework/bfc_allocator.cc:1061] Bin (33554432): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-07-09 11:10:44.056618: I external/tsl/tsl/framework/bfc_allocator.cc:1061] Bin (67108864): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-07-09 11:10:44.056673: I external/tsl/tsl/framework/bfc_allocator.cc:1061] Bin (134217728): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-07-09 11:10:44.056762: I external/tsl/tsl/framework/bfc_allocator.cc:1061] Bin (268435456): Total Chunks: 2, Chunks in use: 1. 925.10MiB allocated for chunks. 573.64MiB in use in bin. 573.64MiB client-requested in use in bin.
2024-07-09 11:10:44.056785: I external/tsl/tsl/framework/bfc_allocator.cc:1077] Bin for 573.64MiB was 256.00MiB, Chunk State:
2024-07-09 11:10:44.056817: I external/tsl/tsl/framework/bfc_allocator.cc:1083] Size: 351.45MiB | Requested Size: 0B | in_use: 0 | bin_num: 20, prev: Size: 573.64MiB | Requested Size: 573.64MiB | in_use: 1 | bin_num: -1
2024-07-09 11:10:44.056829: I external/tsl/tsl/framework/bfc_allocator.cc:1090] Next region of size 982310912
2024-07-09 11:10:44.056846: I external/tsl/tsl/framework/bfc_allocator.cc:1110] InUse at ffffb80200010000 of size 12275712 next 1
2024-07-09 11:10:44.056897: I external/tsl/tsl/framework/bfc_allocator.cc:1110] InUse at ffffb80200bc5000 of size 601509888 next 2
2024-07-09 11:10:44.056961: I external/tsl/tsl/framework/bfc_allocator.cc:1110] Free at ffffb8022496a000 of size 368525312 next 18446744073709551615
2024-07-09 11:10:44.057064: I external/tsl/tsl/framework/bfc_allocator.cc:1115] Summary of in-use Chunks by size:
2024-07-09 11:10:44.057102: I external/tsl/tsl/framework/bfc_allocator.cc:1118] 1 Chunks of size 12275712 totalling 11.71MiB
2024-07-09 11:10:44.057172: I external/tsl/tsl/framework/bfc_allocator.cc:1118] 1 Chunks of size 601509888 totalling 573.64MiB
2024-07-09 11:10:44.057250: I external/tsl/tsl/framework/bfc_allocator.cc:1122] Sum Total of in-use chunks: 585.35MiB
2024-07-09 11:10:44.057305: I external/tsl/tsl/framework/bfc_allocator.cc:1124] Total bytes in pool: 982310912 memory_limit_: 6039281664 available bytes: 5056970752 curr_region_allocation_bytes_: 12078563328
2024-07-09 11:10:44.057371: I external/tsl/tsl/framework/bfc_allocator.cc:1129] Stats:
Limit: 6039281664
InUse: 613785600
MaxInUse: 613785600
NumAllocs: 3
MaxAllocSize: 601509888
Reserved: 0
PeakReserved: 0
LargestFreeBlock: 0

2024-07-09 11:10:44.057461: W external/tsl/tsl/framework/bfc_allocator.cc:512] ***************************************************************_____________________________________
Segmentation fault
(itex) yhu5@rajeshch-desk89:~$

@raevillena (Author)

Hi @yinghu5,

The hello world runs fine:

2024-07-09 14:12:47.836324: I tensorflow/core/common_runtime/next_pluggable_device/next_pluggable_device_factory.cc:118] Created 1 TensorFlow NextPluggableDevices. Physical device type: XPU
2024-07-09 14:12:47.836850: I tensorflow/core/common_runtime/next_pluggable_device/next_pluggable_device_factory.cc:118] Created 1 TensorFlow NextPluggableDevices. Physical device type: XPU
0 0.40498763
1 0.3569235
2 0.34184435
3 0.33502558
4 0.33131737
Tensorflow HelloWorld Done!
[CODE_SAMPLE_COMPLETED_SUCCESFULLY]

Unfortunately, I cannot run the code you posted; it throws errors on my side, and I gave up after the 10th error.

@yinghu5

yinghu5 commented Jul 10, 2024

Sorry, I have changed the file format in the last comment; please try again. The format is like this: (screenshot)

@Zantares

It looks like an OOM error.

What is consistent is that it tries to allocate that curr_region_allocation_bytes_: 14975071232. That value is very consistent, and I don't know why. It makes sense that the OOM happens with that, but why allocate 14 GB when TF doesn't even need that much for the current workload?

Let me explain why it always tries to allocate that consistent number of bytes, 14975071232. ITEX has a memory allocator that creates the runtime memory pool. The pool is:

  • Initialized as (total_memory_size - reserved_memory_size) * 0.75. You can see the original memory_limit_: 7487535513 in the early log; that's it. NOTE: reserved_memory_size is for some HW-internal data, and the 0.75 ratio is inherited from the public community to ensure this process won't exhaust all HW resources.
  • Extended to current_size * 2 if any allocation fails. You can see the extended size curr_region_allocation_bytes_ is always ~2x the original memory_limit_; that's the reason.

Based on the above logic, it's easy to fail when the extend operation is triggered. The question is why extending is triggered even though the memory pool still has free space. Maybe the allocation needs more space, or the pool is fragmented.

We will have a deeper look and give more info later, thanks!
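
The numbers in the logs line up with that explanation; a minimal arithmetic check (assuming the leftover bytes are rounding to allocation granularity):

# memory_limit_ / curr_region_allocation_bytes_ from the A750 log and from @yinghu5's log
limit_a750, region_a750 = 7487535513, 14975071232
limit_yh, region_yh = 6039281664, 12078563328

print(region_a750 / limit_a750)      # ~2.0: the pool tries to double on a failed allocation
print(region_a750 - 2 * limit_a750)  # 206 bytes, presumably allocation-granularity rounding
print(region_yh == 2 * limit_yh)     # True: exactly 2x here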

@raevillena (Author)

Sorry, I have changed the file format in the last comment; please try again. The format is like this: (screenshot)

Hi @yinghu5, sorry; maybe I can't run it because I have upgraded my Keras version to v3, which gives me incompatibilities at the data-preparation stage (shape mismatches). I can't troubleshoot it or replicate a Linux instance with the same spec as yours for now, since I am doing some academic experiments. I'll do that after some time.

@raevillena (Author)

(quoting @Zantares' OOM explanation above)

Thanks for that in-depth information! I'll fully cooperate with it later too. Thanks!

@yinghu5

yinghu5 commented Jul 15, 2024

Right. FYI, I'm using:
intel_extension_for_tensorflow 2.15.0.0
intel_extension_for_tensorflow_lib 2.15.0.0.2
keras 2.15.0

thanks

yinghu5 added the "aitce" label on Jul 15, 2024