InferenceSession - Catastrophic Error or Unspecified Error is thrown #22815

Open
saddam213 opened this issue Nov 12, 2024 · 6 comments
Labels
api:CSharp (issues related to the C# API), ep:DML (issues related to the DirectML execution provider)

Comments

saddam213 commented Nov 12, 2024

Describe the issue

Version 1.19.0
Sometimes, when creating an InferenceSession, a "Catastrophic Error" or "Unspecified Error" exception is thrown.

Once that happens, no other session will work at all until the application is restarted.

New Unrelated Issue from Version 1.20.0
[ErrorCode:Fail] Trying to add a domain to DomainToVersion map, but the domain is already exist with version range (1, 1000). domain: "com.microsoft.extensions"

This error is new in 1.20.0 and happens at random, like the other two. However, per the comments below, it seems to be unrelated to them. I upgraded to 1.20.0 to see whether the first two errors were resolved; they were not, and the upgrade introduced this new one.

To reproduce

new InferenceSession("Model.onnx") with a known working model

This is extremely hard to reproduce, but we are getting plenty of error reports. In most cases it happens on the first run after a system reboot; sometimes it just happens at random.

Urgency

Urgent, live application that has started failing globally

Platform

Windows

OS Version

10 & 11

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.19.0

ONNX Runtime API

C#

Architecture

X64

Execution Provider

DirectML

Execution Provider Library Version

1.19.0

github-actions bot added the api:CSharp and ep:DML labels on Nov 12, 2024
@saddam213
Author

saddam213 commented Nov 12, 2024

Bit more context:

I am the developer of Amuse.ai. Our app has been out for about a year, running DirectML inference without issue.

A few months back we upgraded from 1.18.1 to 1.19.0, and then we started getting a few error reports of "Catastrophic Error" when the user tried to load a model.

However, it is now 10-20 reports a day, so it's somehow getting worse. Possibly a Windows update?

After upgrading to 1.20.0 we now also get this new error. I am actually hoping it's the root cause of the Catastrophic Error, because I can't find that message anywhere in OnnxRuntime.

@saddam213
Author

2024-11-13 09:19:13.0746439 [E:onnxruntime:, inference_session.cc:2118 onnxruntime::InferenceSession::Initialize::<lambda_a18664140bfa1274480334618139aa6c>::operator ()] Exception during initialization: D:\a_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\DmlGraphFusionHelper.cpp(576)\onnxruntime.DLL!00007FFA288EE903: (caller: 00007FFA2886E449) Exception(1) tid(87c) 8000FFFF Catastrophic failure

2024-11-13 09:19:13.9065553 [E:onnxruntime:, inference_session.cc:2118 onnxruntime::InferenceSession::Initialize::<lambda_a18664140bfa1274480334618139aa6c>::operator ()] Exception during initialization: D:\a_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\DmlGraphFusionHelper.cpp(576)\onnxruntime.DLL!00007FFA288EE903: (caller: 00007FFA2886E449) Exception(2) tid(1a78) 80004005 Unspecified error
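
For triage, the two hexadecimal codes at the end of these log lines are standard COM HRESULT values: 8000FFFF is E_UNEXPECTED ("Catastrophic failure") and 80004005 is E_FAIL ("Unspecified error"). Both are generic failure codes, which may be why the message text does not appear anywhere in the OnnxRuntime sources. The following is a minimal sketch, not part of OnnxRuntime, for pulling the thread id and HRESULT out of such lines when sorting through user error reports. The function name `decode_hresult`, the regular expression, and the returned field names are hypothetical and assume the log tail keeps the shape shown above.

```python
import re

# Well-known COM HRESULT values from winerror.h; only the two seen in the logs above.
KNOWN_HRESULTS = {
    0x8000FFFF: "E_UNEXPECTED",
    0x80004005: "E_FAIL",
}

# Assumed shape of the log tail: "Exception(1) tid(87c) 8000FFFF Catastrophic failure".
LOG_TAIL = re.compile(r"Exception\(\d+\) tid\(([0-9A-Fa-f]+)\) ([0-9A-Fa-f]{8}) (.+)$")


def decode_hresult(line):
    """Return the decoded HRESULT fields of one log line, or None if it does not match."""
    match = LOG_TAIL.search(line.strip())
    if match is None:
        return None
    tid_hex, code_hex, message = match.groups()
    code = int(code_hex, 16)
    return {
        "thread_id": int(tid_hex, 16),            # tid is logged in hexadecimal
        "hresult": f"0x{code:08X}",
        "name": KNOWN_HRESULTS.get(code, "unknown"),
        "failure": bool(code & 0x80000000),       # severity bit set means failure
        "facility": (code >> 16) & 0x1FFF,        # same mask as the HRESULT_FACILITY macro
        "code": code & 0xFFFF,
        "message": message,
    }


if __name__ == "__main__":
    sample = "(caller: 00007FFA2886E449) Exception(1) tid(87c) 8000FFFF Catastrophic failure"
    print(decode_hresult(sample))
```

Both codes decode to facility 0 (FACILITY_NULL), so the HRESULT alone does not say which component raised the failure; the source location in the log line (DmlGraphFusionHelper.cpp, line 576) is the more useful pointer.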

@skottmckay
Contributor

AFAIK "com.microsoft.extensions" is used by onnxruntime-extensions. The extensions have to be manually registered by calling SessionOptions.RegisterOrtExtensions. If you call that multiple times you'll get an error about the DomainToVersion map.

However that seems completely unrelated to any DML issues.

@saddam213
Author

saddam213 commented Nov 13, 2024

> AFAIK "com.microsoft.extensions" is used by onnxruntime-extensions. The extensions have to be manually registered by calling SessionOptions.RegisterOrtExtensions. If you call that multiple times you'll get an error about the DomainToVersion map.
>
> However that seems completely unrelated to any DML issues.

OK, then that error is a new one and unrelated to the other two.

I was hoping this new exception was the cause, but it just looks like a brand-new issue that bricks OnnxRuntime, sigh.

We are unable to roll back to 1.18.1, as the Flux and SD3-Large models do not run on the lower opset.

saddam213 changed the title on Nov 13, 2024
From: [ErrorCode:Fail] Trying to add a domain to DomainToVersion map, but the domain is already exist with version range (1, 1000). domain: "com.microsoft.extensions"
To: InferenceSession - Catastrophic Error or Unspecified Error is thrown
@saddam213
Author

saddam213 commented Nov 13, 2024

It seems to be system dependent: some systems do it, some don't. We have about 3000 concurrent active users, and maybe 4% face this issue.

I only have one laptop that does it, and only sometimes, with no rhyme or reason: same OS, same everything.

There is no state stored by the app that would affect DirectML initialization. It just seems to be a race condition inside the DML EP during initialization.

@fdwr
Contributor

fdwr commented Nov 16, 2024

Some debugging questions:

  • Do you notice it on any particular GPU or driver version range? You mentioned, even for ORT 1.19.1, that "its somehow getting worse? windows update?". DirectML.dll in System32\ hasn't been updated for a while, as DirectML.dll is matched with the version of onnxruntime.dll, but driver updates could be a possibility.
  • If you keep the same version of ORT but use an older version of DirectML (https://www.nuget.org/packages/Microsoft.AI.DirectML), do the failures go away?
  • Is the model proprietary? If so, are there parts of the model that could be shared for repro purposes if the model weights were zeroed?
  • Do you get any more diagnostic information with the DML debug layer running? See: RUNTIME_EXCEPTION, 80070057 The parameter is incorrect in v1.17.3 #20464 (comment)

3 participants