Official sample demo fails with MemoryError: std::bad_alloc #3338

Open
ocivo opened this issue Jan 3, 2025 · 5 comments

ocivo commented Jan 3, 2025

The script was written directly from the official sample demo.

  1. PaddleClas version 2.6.0, paddlepaddle version 2.6.2
  2. Python 3.8
  3. The CPU and GPU builds of paddle report the same error
    python run.py
    2025-01-03 18:29:51 INFO: Loading faiss with AVX512 support.
    2025-01-03 18:29:51 INFO: Successfully loaded faiss with AVX512 support.
    [2025/01/03 18:29:51] ppcls WARNING: The current running environment does not support the use of GPU. CPU has been used instead.
    Traceback (most recent call last):
    File "run.py", line 2, in <module>
    model = paddleclas.PaddleClas(model_name="person_attribute")
    File "venv/lib/python3.8/site-packages/paddleclas/paddleclas.py", line 610, in __init__
    self.predictor = ClsPredictor(self._config)
    File "venv/lib/python3.8/site-packages/paddleclas/deploy/python/predict_cls.py", line 28, in __init__
    super().__init__(config["Global"])
    File "venv/lib/python3.8/site-packages/paddleclas/deploy/utils/predictor.py", line 37, in __init__
    self.predictor, self.config = self.create_paddle_predictor(
    File "venv/lib/python3.8/site-packages/paddleclas/deploy/utils/predictor.py", line 108, in create_paddle_predictor
    predictor = create_predictor(config)
    MemoryError: std::bad_alloc
@ocivo ocivo changed the title from "MemoryError: std::bad_alloc" to "Official sample demo fails with MemoryError: std::bad_alloc" Jan 3, 2025
@TingquanGao TingquanGao self-assigned this Jan 3, 2025
@TingquanGao
Collaborator

Please provide the complete launch command.


cainmagi commented Jan 15, 2025

@TingquanGao

Here is a way to reproduce the problem:

Reproduction steps

First, save the following test script, test.py, somewhere, for example /example/test.py

import paddleclas

model = paddleclas.PaddleClas(model_name="text_image_orientation")

Then run the following command to start a Docker container. The container is a clean Debian Python image.

docker run --gpus all -it --rm --shm-size=1g -v "/example:/example" python:3.10-slim bash

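Inside the container, install the required dependencies and run the test script. The CPU build of the dependencies is installed here, because the GPU build of Paddle is currently incompatible with other libraries I use.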

# Make the dependencies of OpenCV complete.
apt-get update
apt-get -y install libgomp1 libgl1-mesa-glx libglib2.0-0
# Install PaddleClas
pip install paddlepaddle paddleclas
# Run the test.
cd /example
python test.py

Error output

This produces the following error:

[2025/01/15 20:08:51] ppcls INFO: download https://paddleclas.bj.bcebos.com/models/PULC/inference/text_image_orientation_infer.tar to /root/.paddleclas/inference_model/PULC/text_image_orientation/text_image_orientation_infer.tar
100%|████████████████████████████████████████████████████████████████████████████| 7.40M/7.40M [00:18<00:00, 402kiB/s]
[2025/01/15 20:09:12] ppcls WARNING: The current running environment does not support the use of GPU. CPU has been used instead.
Traceback (most recent call last):
  File "/example/test.py", line 3, in <module>
    model = paddleclas.PaddleClas(model_name="text_image_orientation")
  File "/usr/local/lib/python3.10/site-packages/paddleclas/paddleclas.py", line 610, in __init__
    self.predictor = ClsPredictor(self._config)
  File "/usr/local/lib/python3.10/site-packages/paddleclas/deploy/python/predict_cls.py", line 28, in __init__
    super().__init__(config["Global"])
  File "/usr/local/lib/python3.10/site-packages/paddleclas/deploy/utils/predictor.py", line 37, in __init__
    self.predictor, self.config = self.create_paddle_predictor(
  File "/usr/local/lib/python3.10/site-packages/paddleclas/deploy/utils/predictor.py", line 108, in create_paddle_predictor
    predictor = create_predictor(config)
MemoryError: std::bad_alloc
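
Since the failure turns out to depend on which versions `pip` resolved, it may help to record them next to the traceback. The following is only a minimal sketch, assuming Python 3.8 or newer (for `importlib.metadata`); the helper name `installed_versions` is hypothetical and is not part of PaddleClas.

```python
from importlib import metadata


def installed_versions(packages=("paddlepaddle", "paddlepaddle-gpu", "paddleclas")):
    """Return {distribution name: version string, or None if not installed}."""
    report = {}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is simply absent from this environment.
            report[name] = None
    return report


if __name__ == "__main__":
    import sys

    print("python", sys.version.split()[0])
    for name, version in installed_versions().items():
        print(name, version or "not installed")
```

Running this in the same container before `python test.py` gives a short version report that can be pasted into the issue.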

Further testing

Is it a shared memory problem?

Could the shared memory be too small? I tried changing the container launch arguments to:

docker run --gpus all -it --rm --shm-size=16g -v "/example:/example" python:3.10-slim bash

This size is already larger than the one used in the tutorial.

The error remained the same.

Is it caused by the GPU mapping?

Try removing the --gpus all argument:

docker run -it --rm --shm-size=1g -v "/example:/example" python:3.10-slim bash

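The error remained the same.

Is the paddlepaddle installation broken, or is the CPU build of paddlepaddle unusable?

I happen to know that NVIDIA publishes a paddlepaddle image, so I ran it and repeated the test above: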

docker run --gpus all -it --rm --shm-size=1g -v "/example:/example" nvcr.io/nvidia/paddlepaddle:24.10-py3

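Why 24.10 rather than a newer release (for example 24.12)? Because paddleclas does not support Python 3.12, and image release 24.10 is the last one based on Ubuntu 22.04 and Python 3.10. Note that its CUDA version, 12.6, is still close to the latest, and that it already ships with paddlepaddle-gpu.

Because the image already contains the GPU build (paddlepaddle-gpu), the installation steps need a small change: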

# Make the dependencies of OpenCV complete.
# Note that this is an Ubuntu image.
apt-get update
apt-get -y install libgomp1 libegl1 libglu1-mesa-dev
# Do not need to install paddlepaddle because paddlepaddle-gpu already exists.
pip install paddleclas
# Run the test.
cd /example
python test.py

This time it still fails, but with a different error:

[2025/01/15 20:46:49] ppcls INFO: download https://paddleclas.bj.bcebos.com/models/PULC/inference/text_image_orientation_infer.tar to /root/.paddleclas/inference_model/PULC/text_image_orientation/text_image_orientation_infer.tar
100%|██████████████████████████████████████████████████████████████████████████████| 7.40M/7.40M [00:18<00:00, 406kiB/s]
Traceback (most recent call last):
  File "/example/test.py", line 3, in <module>
    model = paddleclas.PaddleClas(model_name="text_image_orientation")
  File "/usr/local/lib/python3.10/dist-packages/paddleclas/paddleclas.py", line 610, in __init__
    self.predictor = ClsPredictor(self._config)
  File "/usr/local/lib/python3.10/dist-packages/paddleclas/deploy/python/predict_cls.py", line 28, in __init__
    super().__init__(config["Global"])
  File "/usr/local/lib/python3.10/dist-packages/paddleclas/deploy/utils/predictor.py", line 37, in __init__
    self.predictor, self.config = self.create_paddle_predictor(
  File "/usr/local/lib/python3.10/dist-packages/paddleclas/deploy/utils/predictor.py", line 108, in create_paddle_predictor
    predictor = create_predictor(config)
ValueError: basic_string::_M_replace_aux

Using the official paddle image

Follow the tutorial strictly and use the official paddle image.

docker run --gpus all --name ppcls -it --rm -v "/example:/example" --shm-size=8G --network=host paddlepaddle/paddle:2.3.0-gpu-cuda10.2-cudnn7 /bin/bash

Once the container is running, install and test inside it:

# Install dependencies. Do not need to fix OpenCV issues.
pip install paddleclas
# Run the test.
cd /example
python test.py

It ran successfully.

grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
2025-01-15 21:16:35 INFO: Loading faiss with AVX2 support.
2025-01-15 21:16:35 INFO: Could not load library with AVX2 support due to:
ModuleNotFoundError("No module named 'faiss.swigfaiss_avx2'")
2025-01-15 21:16:35 INFO: Loading faiss.
2025-01-15 21:16:35 INFO: Successfully loaded faiss.
[2025/01/15 21:16:35] ppcls INFO: download https://paddleclas.bj.bcebos.com/models/PULC/inference/text_image_orientation_infer.tar to /root/.paddleclas/inference_model/PULC/text_image_orientation/text_image_orientation_infer.tar
100%|██████████████████████████████████████████████████████████████████████████████| 7.40M/7.40M [00:18<00:00, 410kiB/s]
/usr/local/python3.7.0/lib/python3.7/site-packages/paddleclas/deploy/python/preprocess.py:67: DeprecationWarning: NEAREST is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.NEAREST or Dither.NONE instead.
  'nearest': Image.NEAREST,
/usr/local/python3.7.0/lib/python3.7/site-packages/paddleclas/deploy/python/preprocess.py:68: DeprecationWarning: BILINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BILINEAR instead.
  'bilinear': Image.BILINEAR,
/usr/local/python3.7.0/lib/python3.7/site-packages/paddleclas/deploy/python/preprocess.py:69: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
  'bicubic': Image.BICUBIC,
/usr/local/python3.7.0/lib/python3.7/site-packages/paddleclas/deploy/python/preprocess.py:70: DeprecationWarning: BOX is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BOX instead.
  'box': Image.BOX,
/usr/local/python3.7.0/lib/python3.7/site-packages/paddleclas/deploy/python/preprocess.py:71: DeprecationWarning: LANCZOS is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.LANCZOS instead.
  'lanczos': Image.LANCZOS,
/usr/local/python3.7.0/lib/python3.7/site-packages/paddleclas/deploy/python/preprocess.py:72: DeprecationWarning: HAMMING is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.HAMMING instead.
  'hamming': Image.HAMMING,
/usr/local/python3.7.0/lib/python3.7/site-packages/paddleclas/deploy/python/preprocess.py:73: DeprecationWarning: BILINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BILINEAR instead.
  'random': (Image.BILINEAR, Image.BICUBIC)
/usr/local/python3.7.0/lib/python3.7/site-packages/paddleclas/deploy/python/preprocess.py:73: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
  'random': (Image.BILINEAR, Image.BICUBIC)

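Although it runs successfully, this image is based on Ubuntu 16.04 with Python 3.7. Using it would require a multi-container setup, which is far too cumbersome.

Using the official paddleclas image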

docker run --gpus all -it --rm -v "/example:/example" --shm-size=8G --network=host paddlecloud/paddleclas:2.4-gpu-cuda11.2-cudnn8-latest /bin/bash

Once the container is running, test directly inside it:

# Run the test.
cd /example
python test.py

This one also runs normally.

Is it necessary to go back to Python 3.7?

Try going back to a clean Debian image with Python 3.7:

docker run --gpus all -it --rm -v "/example:/example" --shm-size=8G --network=host python:3.7-slim bash

and repeat the earlier installation and test steps. The test passes here as well.

2025-01-15 21:33:18 INFO: Loading faiss with AVX2 support.
2025-01-15 21:33:18 INFO: Could not load library with AVX2 support due to:
ModuleNotFoundError("No module named 'faiss.swigfaiss_avx2'")
2025-01-15 21:33:18 INFO: Loading faiss.
2025-01-15 21:33:18 INFO: Successfully loaded faiss.
[2025/01/15 21:33:18] ppcls INFO: download https://paddleclas.bj.bcebos.com/models/PULC/inference/text_image_orientation_infer.tar to /root/.paddleclas/inference_model/PULC/text_image_orientation/text_image_orientation_infer.tar
100%|██████████████████████████████████████████████████████████████████████████████| 7.40M/7.40M [00:19<00:00, 380kiB/s]
[2025/01/15 21:33:39] ppcls WARNING: The current running environment does not support the use of GPU. CPU has been used instead.

Then can Python 3.8 be used?

Switch to the Python 3.8 image:

docker run --gpus all -it --rm -v "/example:/example" --shm-size=8G --network=host python:3.8-slim bash

and repeat the earlier installation and test steps. The test fails.

2025-01-15 21:38:40 INFO: Loading faiss with AVX512 support.
2025-01-15 21:38:40 INFO: Successfully loaded faiss with AVX512 support.
[2025/01/15 21:38:40] ppcls INFO: download https://paddleclas.bj.bcebos.com/models/PULC/inference/text_image_orientation_infer.tar to /root/.paddleclas/inference_model/PULC/text_image_orientation/text_image_orientation_infer.tar
100%|██████████████████████████████████████████████████████████████████████████████| 7.40M/7.40M [00:19<00:00, 383kiB/s]
[2025/01/15 21:39:02] ppcls WARNING: The current running environment does not support the use of GPU. CPU has been used instead.
Traceback (most recent call last):
  File "test-ori.py", line 3, in <module>
    model = paddleclas.PaddleClas(model_name="text_image_orientation")
  File "/usr/local/lib/python3.8/site-packages/paddleclas/paddleclas.py", line 610, in __init__
    self.predictor = ClsPredictor(self._config)
  File "/usr/local/lib/python3.8/site-packages/paddleclas/deploy/python/predict_cls.py", line 28, in __init__
    super().__init__(config["Global"])
  File "/usr/local/lib/python3.8/site-packages/paddleclas/deploy/utils/predictor.py", line 37, in __init__
    self.predictor, self.config = self.create_paddle_predictor(
  File "/usr/local/lib/python3.8/site-packages/paddleclas/deploy/utils/predictor.py", line 108, in create_paddle_predictor
    predictor = create_predictor(config)
MemoryError: std::bad_alloc

So, is the problem in the paddlepaddle and paddleclas versions?

Still on the Python 3.8 image:

docker run --gpus all -it --rm -v "/example:/example" --shm-size=8G --network=host python:3.8-slim bash

This time, pin paddlepaddle and paddleclas to older versions:

# Make the dependencies of OpenCV complete.
apt-get update
apt-get -y install libgomp1 libgl1-mesa-glx libglib2.0-0
# Install PaddleClas
pip install paddlepaddle==2.5.2 paddleclas==2.5.1
# Run the test.
cd /example
python test.py

It runs successfully:

2025-01-15 21:41:02 INFO: Loading faiss with AVX2 support.
2025-01-15 21:41:02 INFO: Could not load library with AVX2 support due to:
ModuleNotFoundError("No module named 'faiss.swigfaiss_avx2'")
2025-01-15 21:41:02 INFO: Loading faiss.
2025-01-15 21:41:02 INFO: Successfully loaded faiss.
[2025/01/15 21:41:02] ppcls WARNING: The current running environment does not support the use of GPU. CPU has been used instead.
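
The manual image-by-image tests above could be scripted. The sketch below only builds the `docker run` command lines for a few of the image and version combinations tried in this comment; it does not run them. The helper name `docker_argv` and the `MATRIX` list are hypothetical, and the apt package list is copied from the Debian steps above, so it is an assumption that it suits every image.

```python
import shlex

# apt packages taken from the Debian install steps earlier in this comment.
APT_DEPS = "libgomp1 libgl1-mesa-glx libglib2.0-0"


def docker_argv(image, pip_specs, host_dir="/example"):
    """Build a non-interactive `docker run` argv that installs and runs test.py."""
    inner = " && ".join([
        "apt-get update",
        f"apt-get -y install {APT_DEPS}",
        "pip install " + " ".join(shlex.quote(spec) for spec in pip_specs),
        "cd /example",
        "python test.py",
    ])
    return [
        "docker", "run", "--rm", "--shm-size=8G",
        "-v", f"{host_dir}:/example",
        image, "bash", "-c", inner,
    ]


# Combinations mirroring some of the tests described above.
MATRIX = [
    ("python:3.7-slim", ["paddlepaddle", "paddleclas"]),
    ("python:3.8-slim", ["paddlepaddle", "paddleclas"]),
    ("python:3.8-slim", ["paddlepaddle==2.5.2", "paddleclas==2.5.1"]),
    ("python:3.10-slim", ["paddlepaddle", "paddleclas"]),
]

if __name__ == "__main__":
    for image, specs in MATRIX:
        print(shlex.join(docker_argv(image, specs)))
```

Each printed line can be pasted into a shell, or the argv lists can be passed to `subprocess.run` to collect exit codes.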

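Conclusion

It is frustrating that PaddleClas fails to initialize in a variety of standard environments; it makes me wonder whether its developers are using amd64 CPUs at all.

It is confirmed that there is a compatibility problem between paddlepaddle and paddleclas versions. A suitable pair of versions must be pinned, neither too new nor too old.

I later ran further checks in a Python 3.8 environment: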

  • paddlepaddle 2.5.2 and paddleclas 2.5.1 are compatible.
  • paddlepaddle 2.5.2 and paddleclas 2.6.0 are compatible.
  • paddlepaddle 2.6.0 to 2.6.2 and paddleclas 2.6.0 are incompatible. They raise MemoryError: std::bad_alloc.
  • The latest paddlepaddle 3.0.0rc0 and paddleclas 2.6.0 are also incompatible. They raise a different error.
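
The findings in this list can be written down as a small lookup table, for example to warn early in a setup script. This is only a sketch of the results reported in this comment for Python 3.8; the names `REPORTED` and `reported_status` are hypothetical, and any pair not listed was simply not tested here.

```python
# (paddlepaddle version, paddleclas version) -> outcome reported in this thread
# for a Python 3.8 environment.
REPORTED = {
    ("2.5.2", "2.5.1"): "ok",
    ("2.5.2", "2.6.0"): "ok",
    ("2.6.0", "2.6.0"): "MemoryError: std::bad_alloc",
    ("2.6.1", "2.6.0"): "MemoryError: std::bad_alloc",
    ("2.6.2", "2.6.0"): "MemoryError: std::bad_alloc",
    ("3.0.0rc0", "2.6.0"): "other error",
}


def reported_status(paddle_version, paddleclas_version):
    """Look up the reported outcome for a version pair, or 'untested'."""
    return REPORTED.get((paddle_version, paddleclas_version), "untested")


if __name__ == "__main__":
    for pair, outcome in REPORTED.items():
        print(pair, "->", outcome)
```

A pair that works according to this table can be pinned as in the earlier step: `pip install paddlepaddle==2.5.2 paddleclas==2.5.1`.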

The environments I used are as follows:

  • CUDA (if used): Cuda compilation tools, release 12.6, V12.6.77
  • OS (in container python:3.10-slim): Debian GNU/Linux 12 (bookworm) (Python is 3.10, PaddlePaddle is 2.6.2, PaddleClas is 2.6.0)
  • OS (in container nvidia/paddlepaddle): Ubuntu 22.04.5 LTS (Python is 3.10, PaddlePaddle is 2.6.1, PaddleClas is 2.6.0)
  • OS (in container paddlepaddle/paddle): Ubuntu 16.04.7 LTS (Python is 3.7, PaddlePaddle is 2.3.0, PaddleClas is 2.5.1)
  • OS (in container paddlecloud/paddleclas): Ubuntu 18.04.5 LTS (Python is 3.7, PaddlePaddle is 2.3.0.post112, PaddleClas is 0.0.0 (actually it should be 2.4, so this seems to be a dev version))
  • OS (in container python:3.7-slim): Debian GNU/Linux 12 (bookworm) (Python is 3.7, PaddlePaddle is 2.5.2, PaddleClas is 2.5.1)
  • OS (in container python:3.8-slim): Debian GNU/Linux 12 (bookworm) (Python is 3.8, PaddlePaddle is 2.6.2, PaddleClas is 2.6.0)
  • OS (native device): Windows 11 Enterprise 24H2 (10.0.26100 Build 26100)
  • Docker version: 27.3.1, build ce12230
  • NVIDIA Driver: 566.03

What is certain is that, although the tests above repeatedly reported a memory error, my RAM was definitely not full when the script ran.
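
One way to back up this observation with a number is to record the process's peak resident memory at the moment the exception is caught. The sketch below assumes a Unix-like system (the standard `resource` module is not available on Windows, but it is inside the Linux containers used here); on Linux, `ru_maxrss` is reported in kibibytes. The helper names `peak_rss_kib` and `try_init` are hypothetical.

```python
import resource


def peak_rss_kib():
    """Peak resident set size of this process (kibibytes on Linux)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def try_init(factory):
    """Call factory(); return (error message or None, peak RSS afterwards)."""
    try:
        factory()
        error = None
    except MemoryError as exc:
        # std::bad_alloc from the C++ side surfaces in Python as MemoryError.
        error = str(exc)
    return error, peak_rss_kib()
```

For the reproduction in this thread, the call would be `try_init(lambda: paddleclas.PaddleClas(model_name="text_image_orientation"))` after `import paddleclas`. A small peak value next to a `std::bad_alloc` message would support the claim that the machine was not actually out of memory.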

@TingquanGao
Collaborator

Thank you for the feedback and the very detailed experiments! We will arrange an investigation of this issue.


drawyaW commented Feb 20, 2025

Great, a very detailed solution! I tried rolling back the paddlepaddle version, and it does run successfully!

@wang-kangkang

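This shows that when the paddle-related libraries are released, there is no mechanism that automatically runs the various test samples.
In practice this is not hard: collect the demo-level commands in one place and run them once. This is clearly the responsibility of the test team lead.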
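
The idea of collecting demo-level commands and running them once can be sketched as below. This is an illustration only and is not how the Paddle projects test their releases; the function name `run_smoke` and the command list are hypothetical, and `test.py` refers to the reproduction script from earlier in this thread.

```python
import subprocess
import sys


def run_smoke(commands, timeout=600):
    """Run each argv list once; return [(argv, reason)] for every failure."""
    failures = []
    for argv in commands:
        try:
            proc = subprocess.run(
                argv, capture_output=True, text=True, timeout=timeout
            )
        except subprocess.TimeoutExpired:
            failures.append((argv, "timeout"))
            continue
        if proc.returncode != 0:
            # Keep only the last stderr line, usually the exception message.
            lines = proc.stderr.strip().splitlines()
            reason = lines[-1] if lines else f"exit code {proc.returncode}"
            failures.append((argv, reason))
    return failures


if __name__ == "__main__":
    # Hypothetical demo list; each entry is one demo-level command.
    demos = [[sys.executable, "test.py"]]
    for argv, reason in run_smoke(demos):
        print("FAILED:", " ".join(argv), "->", reason)
```

Run in a clean container for each supported Python version before a release, a list like this would likely have surfaced the `std::bad_alloc` failure reported here.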
