ValueError: processor in session meta is not valid: <ErSessionMeta(id=202410250415447511850_nn_0_0_guest_10000, name=, status=KILLED, tag=, processors=[***, len=4], options=[{'eggroll.rollpair.inmemory_output': 'True', 'python.path': '/data/projects/fate/fate/python:/data/projects/fate/fate/python:/data/projects/fate/fateflow/python:/data/projects/fate/eggroll/python', 'eggroll.session.deploy.mode': 'cluster', 'eggroll.session.processors.per.node': '4', 'python.venv': '/data/projects/fate/common/python/venv'}]) at 0x7f14a43997c0>
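FATE 1.11.3: when running a custom model, this error occurs most of the time.
Running flow test toy -gid 10000 -hid 10000 produces the same error only very rarely.
Training sometimes succeeds.
Error from clustermanager.jvm.err.log: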
[ERROR][2124508][2024-10-25 04:10:46,885][grpc-server-4670-24,pid:3120,tid:113][c.w.e.c.e.h.DefaultLoggingErrorHandler:144] -
java.lang.reflect.InvocationTargetException: null
at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) ~[?:?]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_345]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_345]
at com.webank.eggroll.core.command.CommandRouter$$anonfun$register$3.apply(CommandRouter.scala:130) ~[eggroll-core-2.5.2.jar:?]
at com.webank.eggroll.core.command.CommandRouter$$anonfun$register$3.apply(CommandRouter.scala:124) ~[eggroll-core-2.5.2.jar:?]
at com.webank.eggroll.core.command.CommandRouter$.dispatch(CommandRouter.scala:139) ~[eggroll-core-2.5.2.jar:?]
at com.webank.eggroll.core.command.CommandService.com$webank$eggroll$core$command$CommandService$$run$body$1(CommandService.scala:47) ~[eggroll-core-2.5.2.jar:?]
at com.webank.eggroll.core.command.CommandService$$anonfun$1.run(CommandService.scala:41) ~[eggroll-core-2.5.2.jar:?]
at com.webank.eggroll.core.grpc.server.GrpcServerWrapper.wrapGrpcServerRunnable(GrpcServerWrapper.java:43) [eggroll-core-2.5.2.jar:?]
at com.webank.eggroll.core.command.CommandService.call(CommandService.scala:41) [eggroll-core-2.5.2.jar:?]
at com.webank.eggroll.core.command.CommandServiceGrpc$MethodHandlers.invoke(CommandServiceGrpc.java:257) [eggroll-core-2.5.2.jar:?]
at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182) [grpc-stub-1.55.1.jar:1.55.1]
at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:346) [grpc-core-1.55.1.jar:1.55.1]
at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:860) [grpc-core-1.55.1.jar:1.55.1]
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) [grpc-core-1.55.1.jar:1.55.1]
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133) [grpc-core-1.55.1.jar:1.55.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_345]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_345]
at java.lang.Thread.run(Thread.java:750) [?:1.8.0_345]
Caused by: com.webank.eggroll.core.error.ErSessionException: unable to start all processors for session id: '202410250359237753070_eval_0_0_host_10000'. Please check corresponding bootstrap logs at '/data/logs/fate/eggroll/202410250359237753070_eval_0_0_host_10000' to check the reasons. Details:
=================
total processors: 4,
started count: 0,
not started count: 4,
current active processors per node: Map(192.168.71.121 -> 0),
not started processors and their nodes: Map(218 -> 192.168.71.121, 220 -> 192.168.71.121, 217 -> 192.168.71.121, 219 -> 192.168.71.121)
at com.webank.eggroll.core.resourcemanager.SessionManagerService.getOrCreateSessionOld(SessionManager.scala:493) ~[eggroll-core-2.5.2.jar:?]
at com.webank.eggroll.core.resourcemanager.SessionManagerService.getOrCreateSession(SessionManager.scala:342) ~[eggroll-core-2.5.2.jar:?]
... 19 more
Is this a resource problem or a network problem?
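
The ErSessionException above points to the per-session bootstrap logs as the place to look for the reason the 4 processors did not start. Below is a minimal sketch for scanning that directory, assuming the bootstrap logs are plain-text files and the script runs on the node that hosts them (192.168.71.121 here). The function name scan_bootstrap_logs and the keyword list are hypothetical helpers for illustration, not part of FATE or Eggroll.

```python
import os
import re
from typing import Dict, List

# Hypothetical keyword list (an assumption); adjust it to what the logs actually contain.
ERROR_PATTERN = re.compile(
    r"error|exception|traceback|cannot|denied|killed", re.IGNORECASE
)


def scan_bootstrap_logs(log_dir: str) -> Dict[str, List[str]]:
    """Return {file path: [matching lines]} for every file under log_dir."""
    hits: Dict[str, List[str]] = {}
    for root, _dirs, files in os.walk(log_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            try:
                # errors="replace" keeps the scan going on undecodable bytes.
                with open(path, "r", errors="replace") as fh:
                    matched = [
                        line.rstrip("\n") for line in fh if ERROR_PATTERN.search(line)
                    ]
            except OSError as exc:
                matched = [f"<could not read file: {exc}>"]
            if matched:
                hits[path] = matched
    return hits


if __name__ == "__main__":
    # Directory taken from the ErSessionException message in the log above.
    session_log_dir = (
        "/data/logs/fate/eggroll/202410250359237753070_eval_0_0_host_10000"
    )
    for path, lines in scan_bootstrap_logs(session_log_dir).items():
        print(path)
        for line in lines:
            print("   ", line)
```

If the directory is empty or missing, the scan returns nothing, which would itself suggest the processor processes never got far enough to write a log.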