Memory Bug #71
Update: OK, so I was able to solve it after reading all the issues and documents I could. Here is what I did, coming from a reboot of WSL. In your terminal, do this without starting the virtual environment:
If you have FP64 issues, do this too:
Then forcefully source your vars (still in the main environment). You may now activate the conda environment and set the variables again, all of them if you can:
You can then check that the GPU is visible and start using TF. In my case, all of this happened without reinstalling my system.
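(The exact commands did not survive the copy into this thread. As an illustration only, assuming the ITEX XPU plugin is installed and using the FP64 emulation switches commonly used for Intel GPUs without native FP64 support, the check could look like this:)

```python
# Illustrative sketch only: the exact commands from the original comment were lost in the copy.
# Assumes the ITEX XPU plugin is installed and the oneAPI vars have already been sourced.
import os

# FP64 emulation switches for GPUs without native FP64 support
# (set before TensorFlow/ITEX is imported so the compute runtime sees them).
os.environ.setdefault("OverrideDefaultFP64Settings", "1")
os.environ.setdefault("IGC_EnableDPEmulation", "1")

import tensorflow as tf

# If the plugin loaded correctly, the GPU shows up as an XPU device.
print(tf.config.list_physical_devices("XPU"))
```
|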
@raevillena Can you help to check if our latest weekly release still has this issue? thanks.
|
I just tried right now without exporting any of the env variables I mentioned above, but it still gives me:
Now I tried solving it with just setvars, without setting the memory limit, but no, the memory bug is still there. The FP64 emulation, however, is now working without setting the env vars. |
Can you help to share the result of |
Hi, here it is:
Is there another step needed so that the newer library gets used by default, or was that it? |
Please remove the "intel_extension_for_tensorflow" and "intel_extension_for_tensorflow_lib" packages. |
Hi, can I test that after doing some modelling first? It works (and not just sometimes) for now. |
I can already tell the update made the GPU hold memory but use the CPU for processing: CPU went up to 100% with 0% from the GPU, whereas the original build was using the GPU as the XPU. But let me restart WSL to confirm everything. My models went from 5 seconds of training per epoch to 130 seconds, which is not what I expected. |
The update was no longer using the GPU, though, so it was purely using the CPU now. |
I removed all of ITEX and installed just the weekly build. The GPU gets mounted again, but all the errors came back with it too. Back to square one. |
Hi @raevillena |
Hi @feng-intel, this is the summary.
Hardware setup:
wsl2:
running uname -r from fresh installation:
then
I needed to install the whole oneAPI because I needed to source setvars.
then
setting up my conda environment: https://intel.github.io/intel-extension-for-tensorflow/latest/docs/install/experimental/install_for_gpu_conda.html
activated my conda
But the output said there is no such file or directory for env_check.sh, because it isn't included in the latest version. Then I installed Jupyter using these:
Here is the sample model:
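(The sample model itself was not captured here. A minimal, illustrative sketch of the kind of VGG16 transfer-learning model discussed in this thread, with placeholder data and hypothetical layer sizes, would be:)

```python
# Illustrative sketch only: the original sample model was not captured in this thread.
# It follows the VGG16 transfer-learning pattern discussed above; shapes and data are placeholders.
import numpy as np
import tensorflow as tf

base = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                   input_shape=(224, 224, 3))
base.trainable = False  # freeze the convolutional backbone

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# placeholder data just to exercise the training loop on the XPU
x = np.random.rand(256, 224, 224, 3).astype("float32")
y = np.random.randint(0, 10, size=(256,))
model.fit(x, y, batch_size=32, epochs=2)
```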
Maybe you have data there; I cannot provide my own. Is there something I didn't mention, apart from the exact logs? I don't want to redo the setup; in the meantime I have switched to using the CPU instead while waiting for developments on this. |
@raevillena, can you run it and let us know the output? Thanks
Hi @yinghu5, here is the result:
at the same time:
Please enlighten me.
|
Hi @raevillena. First, as the current ITEX 2.15.0 was tested with oneAPI 2024.1, could you please remove the package installed by sudo apt install intel-oneapi-runtime-dpcpp-cpp? Second, I saw "Breaks: intel-opencl".
Third, about the new ITEX version, etc.: like Guizi mentioned, try the next release. |
Hello @yinghu5, a) yes, I installed a newer dpcpp, but I had encountered these results even before installing it, so I could see whether that was the only reason; b)
c) d)
As I said in previous replies, running simple commands does not result in this error; I could even run a simple 10-epoch transfer learning of an EfficientNetB0 model (which is lighter than VGG16). |
Hi @raevillena , Thanks |
Hello @yinghu5
Using this on each run solves most of the problem.
I think so too; it works after forcing setvars but doesn't persist after a restart of the instance or console.
Cannot complain about that.
Nope, once the vars are sourced the only log it echoes is the one during initialization. After that it works as intended. |
Got it, thanks for understanding :). And if I source /opt/intel/oneapi/mkl/2024.1/env/vars.sh |
Hi @raevillena, is there any update? Below is my test code; could you please try it (without setting any environment variables)? On my machine with 8 GB of GPU memory, the code runs with datasize = 100 but fails when datasize = 1000. It even fails very early, at the call to preprocess_image_input, having run out of memory on the XPU. Log info:
2024-07-09 11:10:44.057305: I external/tsl/tsl/framework/bfc_allocator.cc:1124] Total bytes in pool: 982310912 memory_limit_: 6039281664 available bytes: 5056970752 curr_region_allocation_bytes_: 12078563328
Thanks.
python vgg16.py (without any environment variable)
datasize = 100 output:
datasize = 1000:
2024-07-09 11:10:44.057461: W external/tsl/tsl/framework/bfc_allocator.cc:512] ***************************************************************_____________________________________
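(The vgg16.py script was not captured in this thread. Based on the names mentioned above, preprocess_image_input and datasize, a hypothetical reconstruction of the preprocessing step, only to illustrate where a large up-front allocation can come from, might look like:)

```python
# Hypothetical reconstruction: the real vgg16.py was not included in the thread.
# The point is only to show how casting/resizing a whole dataset slice in one
# tensor op can produce a large up-front allocation on the device.
import tensorflow as tf

def preprocess_image_input(images):
    # cast + resize to the 224x224x3 input VGG16 expects, all in one go
    images = tf.cast(images, tf.float32)
    images = tf.image.resize(images, (224, 224))
    return tf.keras.applications.vgg16.preprocess_input(images)

(x_train, _), _ = tf.keras.datasets.cifar10.load_data()
for datasize in (100, 1000):
    batch = preprocess_image_input(x_train[:datasize])
    # 1000 x 224 x 224 x 3 float32 is ~0.6 GB for this tensor alone,
    # before any intermediate copies the ops may make on the device
    print(datasize, batch.shape)
```
|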
Hi @yinghu5, the hello example runs fine.
Unfortunately I cannot run the code you sent; it throws errors on my side and I just gave up after the 10th error. |
It looks like an OOM error.
Let me explain why it always tries to allocate such a consistent number of bytes, 14975071232: ITEX has a memory allocator that creates a runtime memory pool and extends it when an allocation cannot be satisfied from the space already reserved.
Based on the above logic, it is easy to fail when the extending operation is triggered. The question is: why is the extension triggered even though the memory pool still has free space? Maybe the allocation needs more space, or the pool is fragmented. We will take a deeper look and give more info later, thanks!
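(A quick, purely arithmetical check of the numbers quoted in this thread, not taken from the ITEX allocator source, shows the failing region size is twice the reported memory limit:)

```python
# Back-of-the-envelope check of the numbers quoted in this thread
# (illustrative only, not derived from the ITEX allocator source).
memory_limit = 6_039_281_664   # memory_limit_ from the 8 GB test machine's log
print(2 * memory_limit)        # 12078563328, matches curr_region_allocation_bytes_ in that log

# By the same ratio, the reporter's consistent 14975071232-byte request would
# correspond to a memory_limit_ of about 7.0 GiB, roughly the usable memory of an 8 GB card.
print(14_975_071_232 // 2)     # 7487535616 bytes, about 6.97 GiB
```
|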
Hi @yinghu5, sorry, maybe I can't run it because I have upgraded my Keras version to v3. It gives me incompatibilities in the data preparation stage, with shape mismatches. I can't troubleshoot it and replicate a Linux instance with the same spec as yours for now, since I am doing some academic experiments. I'll do that after some time. |
Thanks for that deep information! I'll fully cooperate with it later too. Thanks! |
right, FYI, I'm using thanks |
I am having a memory issue with this setup. Everything works, except that training on bigger data crashes the Jupyter notebook kernel.
System Desktop
Setup:
miniconda3 with an itex environment
Running model.fit with the training data results in the following (especially with VGG; ResNet works fine):
It crashes, no matter what I do, when it tries to allocate that 14 GB in curr_region_allocation.
Global mem shows:
By the way, my version of ITEX didn't come with check_env.sh, so I can't run that; I just know it works because sometimes it does and sometimes it doesn't. In Jupyter the device is recognized as this:
Also, the other setups I can read about with bfc_allocator issues use the allocator that came along with TensorFlow, while mine comes from the ITEX build files.
I can see that the repo is available for rebuilding, and there might be a chance to find out what is happening there, but I don't have the time or ability to do so.
I just want to know what I am missing here, since it was able to allocate almost 8 GB of memory but unable to expand it.
I also tried exporting this in the conda environment, with no effect:
export ITEX_LIMIT_MEMORY_SIZE_IN_MB=4096
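(One possible reason an exported variable appears to have no effect is that the Jupyter kernel is started without it. A sketch of setting the same limit from inside the notebook process, before TensorFlow is imported, assuming ITEX reads it at plugin load time:)

```python
# Sketch only: sets the same limit from inside the Python process, so it does not
# depend on how the Jupyter kernel was launched. Must run before TensorFlow is imported.
import os
os.environ["ITEX_LIMIT_MEMORY_SIZE_IN_MB"] = "4096"

import tensorflow as tf
print(tf.config.list_physical_devices("XPU"))
```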
I said earlier that it works: yes, I can train a ResNet model blazingly fast compared to a Tesla T4 in Colab, but running it twice gives the memory error.
What is consistent is that it tries to allocate that curr_region_allocation_bytes_: 14975071232.
That value is very consistent, and I don't know why. It makes sense that the OOM happens with that, but why allocate 14 GB when TF doesn't even need that much for the current workload?