processor in session meta is not valid #731

Open
sealofyou opened this issue Oct 25, 2024 · 0 comments

sealofyou commented Oct 25, 2024

ValueError: processor in session meta is not valid: <ErSessionMeta(id=202410250415447511850_nn_0_0_guest_10000, name=, status=KILLED, tag=, processors=[***, len=4], options=[{'eggroll.rollpair.inmemory_output': 'True', 'python.path': '/data/projects/fate/fate/python:/data/projects/fate/fate/python:/data/projects/fate/fateflow/python:/data/projects/fate/eggroll/python', 'eggroll.session.deploy.mode': 'cluster', 'eggroll.session.processors.per.node': '4', 'python.venv': '/data/projects/fate/common/python/venv'}]) at 0x7f14a43997c0>

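On FATE 1.11.3, running a custom model fails with this error most of the time.
Running flow test toy -gid 10000 -hid 10000 triggers the same error only very rarely.
Training sometimes completes successfully.
clustermanager.jvm.err.log contains the following error: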

[ERROR][2124508][2024-10-25 04:10:46,885][grpc-server-4670-24,pid:3120,tid:113][c.w.e.c.e.h.DefaultLoggingErrorHandler:144] -
java.lang.reflect.InvocationTargetException: null
        at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) ~[?:?]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_345]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_345]
        at com.webank.eggroll.core.command.CommandRouter$$anonfun$register$3.apply(CommandRouter.scala:130) ~[eggroll-core-2.5.2.jar:?]
        at com.webank.eggroll.core.command.CommandRouter$$anonfun$register$3.apply(CommandRouter.scala:124) ~[eggroll-core-2.5.2.jar:?]
        at com.webank.eggroll.core.command.CommandRouter$.dispatch(CommandRouter.scala:139) ~[eggroll-core-2.5.2.jar:?]
        at com.webank.eggroll.core.command.CommandService.com$webank$eggroll$core$command$CommandService$$run$body$1(CommandService.scala:47) ~[eggroll-core-2.5.2.jar:?]
        at com.webank.eggroll.core.command.CommandService$$anonfun$1.run(CommandService.scala:41) ~[eggroll-core-2.5.2.jar:?]
        at com.webank.eggroll.core.grpc.server.GrpcServerWrapper.wrapGrpcServerRunnable(GrpcServerWrapper.java:43) [eggroll-core-2.5.2.jar:?]
        at com.webank.eggroll.core.command.CommandService.call(CommandService.scala:41) [eggroll-core-2.5.2.jar:?]
        at com.webank.eggroll.core.command.CommandServiceGrpc$MethodHandlers.invoke(CommandServiceGrpc.java:257) [eggroll-core-2.5.2.jar:?]
        at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182) [grpc-stub-1.55.1.jar:1.55.1]
        at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:346) [grpc-core-1.55.1.jar:1.55.1]
        at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:860) [grpc-core-1.55.1.jar:1.55.1]
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) [grpc-core-1.55.1.jar:1.55.1]
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133) [grpc-core-1.55.1.jar:1.55.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_345]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_345]
        at java.lang.Thread.run(Thread.java:750) [?:1.8.0_345]
Caused by: com.webank.eggroll.core.error.ErSessionException: unable to start all processors for session id: '202410250359237753070_eval_0_0_host_10000'. Please check corresponding bootstrap logs at '/data/logs/fate/eggroll/202410250359237753070_eval_0_0_host_10000' to check the reasons. Details:
=================
total processors: 4,
started count: 0,
not started count: 4,
current active processors per node: Map(192.168.71.121 -> 0),
not started processors and their nodes: Map(218 -> 192.168.71.121, 220 -> 192.168.71.121, 217 -> 192.168.71.121, 219 -> 192.168.71.121)
        at com.webank.eggroll.core.resourcemanager.SessionManagerService.getOrCreateSessionOld(SessionManager.scala:493) ~[eggroll-core-2.5.2.jar:?]
        at com.webank.eggroll.core.resourcemanager.SessionManagerService.getOrCreateSession(SessionManager.scala:342) ~[eggroll-core-2.5.2.jar:?]
        ... 19 more

Is this a resource problem or a network problem?
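
To make the "Details" block of the ErSessionException easier to compare across failed runs, here is a minimal Python sketch that parses it into a summary. The helper names (`parse_scala_map`, `summarize`) are hypothetical and not part of FATE or EggRoll, and the parsing assumes the field layout shown in the log excerpt above. It does not decide between a resource and a network cause; it only shows which processors failed on which node, so the matching bootstrap logs can be checked.

```python
import re

# Details block copied from the clustermanager.jvm.err.log excerpt in this issue.
DETAILS = """\
total processors: 4,
started count: 0,
not started count: 4,
current active processors per node: Map(192.168.71.121 -> 0),
not started processors and their nodes: Map(218 -> 192.168.71.121, 220 -> 192.168.71.121, 217 -> 192.168.71.121, 219 -> 192.168.71.121)
"""


def parse_scala_map(body):
    """Turn the inside of a printed Scala Map(k -> v, ...) into a dict of strings."""
    return dict(re.findall(r"([\w.]+)\s*->\s*([\w.]+)", body))


def summarize(details):
    """Hypothetical helper: extract counts and group failed processors by node."""
    total = int(re.search(r"^total processors:\s*(\d+)", details, re.M).group(1))
    started = int(re.search(r"^started count:\s*(\d+)", details, re.M).group(1))
    not_started = int(re.search(r"^not started count:\s*(\d+)", details, re.M).group(1))

    failed_body = re.search(
        r"not started processors and their nodes: Map\((.*?)\)", details
    ).group(1)

    # Group processor ids by the node they were scheduled on.
    failed_by_node = {}
    for processor_id, node in parse_scala_map(failed_body).items():
        failed_by_node.setdefault(node, []).append(processor_id)
    for ids in failed_by_node.values():
        ids.sort()

    return {
        "total": total,
        "started": started,
        "not_started": not_started,
        "failed_by_node": failed_by_node,
    }


if __name__ == "__main__":
    summary = summarize(DETAILS)
    print(summary)
```

In the excerpt above, none of the 4 processors started and all of them were assigned to 192.168.71.121, so the bootstrap logs under the directory named in the exception message are the next place to look.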
