[Build] ONNX Runtime build fails OOM (v1.20.0) #22859

mc-nv · 2024-11-15T23:04:09Z

Describe the issue

The build fails when compiling the rel-1.20.0 branch.
It runs out of memory on both Linux and Windows.

Windows config (64 GB RAM):

BUILDTOOLS_VERSION:17.12.35506.116 
CMAKE_VERSION:3.30.5 
CUDA_VERSION:12.6.2 
CUDNN_VERSION:9.5.1.17 
PYTHON_VERSION:3.12.3 
TENSORRT_VERSION:10.6.0.26 
VCPGK_VERSION:2024.03.19

Linux config (64 GB RAM):

CMAKE_VERSION:3.28.3
CUDA_VERSION:12.6.2 
CUDNN_VERSION:9.5.1.17 
PYTHON_VERSION:3.12.3 
TENSORRT_VERSION:10.6.0.26

Urgency

ASAP

Target platform

Linux, Windows

Build script

Windows:

onnxruntime/tools/ci_build/build.py `
   --cmake_generator "Visual Studio 17 2022" `
   --config Release `
   --cmake_extra_defines "CMAKE_CUDA_ARCHITECTURES=75;80;86;90" `
   --skip_submodule_sync `
   --parallel `
   --build_shared_lib `
   --compile_no_warning_as_error `
   --skip_tests `
   --update `
   --build `
   --build_dir /workspace/build `
   --use_cuda `
   --cuda_home ${env:CUDA_PATH} `
   --cudnn_home ${env:CUDA_PATH} `
   --use_tensorrt --tensorrt_home "/tensorrt" ; `

Linux:

./build.sh \
  --config Release \
  --skip_submodule_sync \
  --parallel \
  --build_shared_lib     \
  --compile_no_warning_as_error \
  --build_dir /workspace/build \
  --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES='75;80;86;90'  \
  --update \
  --build \
  --use_cuda \
  --cuda_home "/usr/local/cuda" \
  --cudnn_home "/usr" \
  --use_tensorrt \
  --use_tensorrt_builtin_parser \
  --tensorrt_home "/usr/src/tensorrt" \
  --allow_running_as_root \
  --use_openvino CPU

Error / output

No error message; the container fails because it runs out of memory.

Visual Studio Version

No response

GCC / Compiler Version

No response


mc-nv · 2024-11-15T23:04:26Z

@snnn for viz

snnn · 2024-11-15T23:15:38Z

Use "--parallel <n>" to reduce the parallelism.

snnn · 2024-11-15T23:16:24Z

It is more about how much memory you have for each CPU core than how much memory you have in total.

mc-nv · 2024-11-15T23:19:02Z

Note that the Linux build already uses --parallel, and it runs on powerful machines where we have never had a problem building ONNX Runtime before.

snnn · 2024-11-15T23:41:13Z

Sorry, part of my response was eaten by the formatting. I meant: put a number after "--parallel" to limit the number of concurrent processes. Say you have 64 GB of memory and 16 CPUs. By default, make/msbuild will create at most 16 subprocesses, which leaves about 4 GB for each. Since we do not know whether 4 GB is enough for one compiler process, we sometimes need to adjust the parallelism manually to avoid OOM.
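The arithmetic behind this point can be sketched in a few lines of Python. This is only an illustration: the helper name `memory_per_process_gb` is hypothetical, and the 64 GB / 16 CPU figures are the example numbers from the comment, not measurements of a real build.

```python
def memory_per_process_gb(total_memory_gb: float, processes: int) -> float:
    """Average memory available to each concurrent compiler process."""
    if processes < 1:
        raise ValueError("processes must be at least 1")
    return total_memory_gb / processes


# Example figures from the comment: 64 GB of RAM and up to 16 subprocesses.
for jobs in (16, 8, 4):
    share = memory_per_process_gb(64, jobs)
    print(f"--parallel {jobs}: about {share:.1f} GB per process")
```

Lowering the number passed to "--parallel" raises the share each process gets, which is why the total amount of RAM alone does not tell you whether the build will fit.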

mc-nv · 2024-11-15T23:47:50Z

Sounds like a suggestion to have 8 GB per process, am I right?

mc-nv · 2024-11-16T00:07:35Z

> Sorry my response was eaten by a part because of formatting. I meant, put a number there after "--parallel", to limit the number of concurrent processes. Let's say you have 64GB memory and 16 CPUs. By default make/msbuild will create at most 8 subprocesses. Since we do not know if 4GB is enough for one compiler process, sometimes we might need to manually adjust the parallelism to avoid OOM.

In my scenario we don't set a limit on parallel jobs, so we use the default, which is "1": https://github.com/microsoft/onnxruntime/blob/main/tools/ci_build/build.py#L171

Why would we set the limit to 2 or 4 if we are already failing with OOM using a single process?

snnn · 2024-11-16T00:22:46Z

Actually, the default is not one. If the optional value is 0 or unspecified, it is interpreted as the number of CPUs. Since you know how many CPUs the machine has, you can start by halving that number. For example, if the default value is 16, try 8 first. If the error persists, decrease it further. Eventually the build will pass, because 64 GB is definitely enough for a single compiler process.
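The halving strategy described here can be written down as a short Python sketch. The function name `parallel_candidates` is hypothetical and not part of build.py; it only lists the values one might pass to "--parallel" on successive attempts, assuming the default equals the CPU count.

```python
def parallel_candidates(cpu_count: int) -> list[int]:
    """Values to try for --parallel: half the CPU count, then keep halving down to 1."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    candidates = []
    jobs = max(cpu_count // 2, 1)  # start at half of the default (the CPU count)
    while True:
        candidates.append(jobs)
        if jobs == 1:
            break
        jobs //= 2  # the build still ran out of memory: halve again
    return candidates


print(parallel_candidates(16))  # [8, 4, 2, 1]
```

On a machine with only 2 CPUs the list collapses to a single entry, `[1]`, so the only remaining step is a single-process build.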

snnn · 2024-11-16T00:23:48Z

You may also need to tune the "--nvcc_threads" parameter. To be safe, you can set it to one.

mc-nv · 2024-11-16T00:24:18Z

My Windows build environment has 2 CPUs.

tianleiwu · 2024-11-16T01:26:36Z

Estimated memory usage is nvcc_threads * parallel * 8 GB, so you will need at least 16 GB of memory for --parallel 2 --nvcc_threads 1. If you have less than that, try --parallel 1 --nvcc_threads 1. If you do not set them, nvcc_threads = parallel = vCPU = 2, so you will need 32 GB.
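As a rough sketch, the estimate above can be turned into a small Python calculator. The function names and the constant are hypothetical, and the 8 GB figure is the rule of thumb quoted in this comment rather than a measured value, so treat the results as a starting point only.

```python
# Rule-of-thumb figure from the comment above; an estimate, not a measurement.
GB_PER_COMPILER_THREAD = 8


def estimated_memory_gb(parallel: int, nvcc_threads: int) -> int:
    """Estimated peak memory: nvcc_threads * parallel * 8 GB."""
    return parallel * nvcc_threads * GB_PER_COMPILER_THREAD


def max_parallel(total_memory_gb: int, nvcc_threads: int = 1) -> int:
    """Largest --parallel value whose estimate fits in memory (0 if none fits)."""
    return total_memory_gb // (nvcc_threads * GB_PER_COMPILER_THREAD)


# The three cases from the comment.
print(estimated_memory_gb(parallel=2, nvcc_threads=1))  # 16
print(estimated_memory_gb(parallel=1, nvcc_threads=1))  # 8
print(estimated_memory_gb(parallel=2, nvcc_threads=2))  # 32
```

Under the same assumption, `max_parallel(64, nvcc_threads=1)` gives 8, which would be one reasonable first value to try for "--parallel" on a 64 GB machine.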

mc-nv added the "build" label (build issues; typically submitted using template) on Nov 15, 2024

mc-nv changed the title from "[Build] ONNX Runtime build fails OOM" to "[Build] ONNX Runtime build fails OOM (v1.20.0)" on Nov 15, 2024
