Note: Panic! because we can't move test.glycam.org to actual until this is fixed.
Website: test.glycam.org. The developer environment and actual are fine; they do not show this behavior.
Build something like:
DGalpb1-3DGlcpNAcb1-6DGalpb1-6DGlcpNAcb1-6DManpa1-3[DGalpb1-3DGlcpNAcb1-3DGalpb1-4DGlcpNAcb1-2DManpa1-6]DManpb1-6DGlcpNAcb1-6DGlcpNAcb1-OH
You can change any b to a, or any 3 to a 6, etc., to trigger a new build. Request at least 48 structures. You will see that some of them do not build. This is not a polling issue: if I check the folders, the structures that show as "minimizing" forever are not finished. The slurm submission script was written but nothing after that. If I run squeue while the job is running, I see a bunch of things happening, but they all finish and nothing is left pending.
root@gw-slurm-head:/# <- This is where I did squeue.
If I tail the gems log:
> tail git-ignore-me_gemsDebug.log
gw-slurm-head 2024-06-26 09:06:51 AM - gemsModules.deprecated.batchcompute.slurm.receive - ERROR - Error type: <class 'FileNotFoundError'>
gw-slurm-head 2024-06-26 09:06:51 AM - gemsModules.deprecated.batchcompute.slurm.receive - ERROR - Traceback (most recent call last):
File "/programs/gems/gemsModules/deprecated/batchcompute/slurm/receive.py", line 67, in writeSlurmSubmissionScript
script = open(path, "w")
FileNotFoundError: [Errno 2] No such file or directory: '/website/userdata/sequence/cb/Builds/e55e1714-1e0d-42cc-a617-bc5a157362e2/New_Builds/1ogt_2ogg_3ogt_9ogg_11ogt/slurm_submit.sh'
gw-slurm-head 2024-06-26 09:06:51 AM - __main__ - ERROR - gRPC Slurm server caught an unknown error.
gw-slurm-head 2024-06-26 09:06:51 AM - __main__ - ERROR - For date-time stamp: 26/06/2024 09:06:51
gw-slurm-head 2024-06-26 09:06:51 AM - __main__ - ERROR - For this submission: /programs/gems/bin/slurmreceive
gw-slurm-head 2024-06-26 09:06:51 AM - __main__ - ERROR - This is the result: {"entity": {"type": "GRPC", "responses": [{"Error": {"respondingService": "GemsGrpcSlurmReceiver", "notice": {"type": "Exit", "code": "4", "brief": "UnknownError", "message": "There was an unknown fatal error."}, "options": {"osExitCode": "1", "theStandardError": "b'Traceback (most recent call last):\\n File \"/programs/gems/bin/slurmreceive\", line 72, in <module>\\n responseObjectString = manageIncomingString(jsonObjectString)\\n File \"/programs/gems/gemsModules/deprecated/batchcompute/slurm/receive.py\", line 156, in manageIncomingString\\n writeSlurmSubmissionScript(slurm_runscript_path, thisSlurmJobInfo)\\n File \"/programs/gems/gemsModules/deprecated/batchcompute/slurm/receive.py\", line 72, in writeSlurmSubmissionScript\\n raise error\\n File \"/programs/gems/gemsModules/deprecated/batchcompute/slurm/receive.py\", line 67, in writeSlurmSubmissionScript\\n script = open(path, \"w\")\\nFileNotFoundError: [Errno 2] No such file or directory: \\'/website/userdata/sequence/cb/Builds/e55e1714-1e0d-42cc-a617-bc5a157362e2/New_Builds/1ogt_2ogg_3ogt_9ogg_11ogt/slurm_submit.sh\\'\\n'", "theStandardOutput": "b''", "theExceptionError": "None"}}}]}}
(Wed Jun-6 11:47:55am)-(CPU 0.6%:0:Net 29)-(webdev@smanager04:/website/DOCKER/GLYCAM/hope-hockney/V_2/Web_Programs/gems/logs)-(204K:4)
>
The important part is here:
File "/programs/gems/gemsModules/deprecated/batchcompute/slurm/receive.py", line 67, in writeSlurmSubmissionScript
script = open(path, "w")
FileNotFoundError: [Errno 2] No such file or directory: '/website/userdata/sequence/cb/Builds/e55e1714-1e0d-42cc-a617-bc5a157362e2/New_Builds/1ogt_2ogg_3ogt_9ogg_11ogt/slurm_submit.sh'
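One way to read this traceback: in Python, `open(path, "w")` creates the file but never its parent directory, so Errno 2 here means the build folder itself was not visible to the process at that moment, not that `slurm_submit.sh` was missing. A minimal sketch of that behavior, using a temporary directory and made-up folder names rather than the real userdata path:

```python
import os
import tempfile

# Hypothetical stand-in for a build folder that has not been created yet.
base = tempfile.mkdtemp()
missing_dir = os.path.join(base, "New_Builds", "some_build")
script_path = os.path.join(missing_dir, "slurm_submit.sh")

# Opening for writing fails because the parent directory does not exist.
try:
    open(script_path, "w")
    raised = False
except FileNotFoundError:
    raised = True

# Once the directory exists, the same call succeeds.
os.makedirs(missing_dir, exist_ok=True)
with open(script_path, "w") as script:
    script.write("#!/bin/bash\n")
created = os.path.isfile(script_path)
```

If that reading is right, the question becomes why the `New_Builds/...` folder was not yet there (or not yet visible over the network) when `writeSlurmSubmissionScript` ran.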
Lachele thinks something isn't mounted to one of the workers, but we tried:
8183 ssh sworker02 "ls /website"
8184 ssh sworker02 "ls /website/DOCKER"
8185 ssh sworker01 "ls /website/DOCKER"
8186 ssh sworker03 "ls /website/DOCKER"
8187 ssh sworker04 "ls /website/DOCKER"
8188 ssh sworker05 "ls /website/DOCKER"
8189 ssh sworker06 "ls /website/DOCKER"
8190 ssh smanager02 "ls /website/DOCKER"
8191 ssh rime "ls /website/DOCKER"
8192 ssh smanager04 "ls /website/DOCKER"
8193 ssh smanager05 "ls /website/DOCKER"
8194 ssh smanager06 "ls /website/DOCKER"
And saw the contents of DOCKER every time.
L recommended asking Grayson.
I am uncertain. I have seen this exception before when the request is sent to the wrong grpc_server, creating the project in the wrong userdata folder. However, in those cases a submission would always fail. This is only a "should", but deprecated Sequence + BatchCompute should not behave any differently between test and actual.
I suspect a race condition involving the creation of 48 project folders and files across a network and the gRPC requests. I'm not familiar with what a finished Sequence project looks like, but I followed your directions and modified the sequence for a new build, which appears to have worked.
OK, yes, this is no longer reproducible. I was able to consistently generate ~3 projects like this over the course of a few hours before submitting this bug ticket on Wednesday (yesterday). Now I get perfect behavior... Should we add checks to the code that the slurm submission file exists, followed by waits that log that they are happening, and finally useful error throws if it doesn't appear within a second or so? We need to figure this out, as it could be intermittently messing up jobs without us being aware of it.
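A rough sketch of what such a check could look like, assuming the missing piece is the build folder that should contain `slurm_submit.sh`. The names `wait_for_directory` and `write_submission_script` are hypothetical and not part of gems; the one-second timeout is just the figure suggested above:

```python
import logging
import time
from pathlib import Path

log = logging.getLogger(__name__)


def wait_for_directory(directory, timeout=1.0, interval=0.1):
    """Poll until `directory` exists, logging each wait; raise if it never appears."""
    directory = Path(directory)
    deadline = time.monotonic() + timeout
    attempt = 0
    while not directory.is_dir():
        if time.monotonic() >= deadline:
            raise FileNotFoundError(
                f"Directory {directory} did not appear within {timeout} s; "
                "cannot write the slurm submission script."
            )
        attempt += 1
        log.warning("Waiting for %s to appear (attempt %d)", directory, attempt)
        time.sleep(interval)
    return directory


def write_submission_script(path, text, timeout=1.0):
    """Write the submission script once its parent directory is visible."""
    path = Path(path)
    wait_for_directory(path.parent, timeout=timeout)
    with open(path, "w") as script:
        script.write(text)
    return path
```

With something like this in place, an intermittent failure would at least leave "Waiting for ..." lines in the log, and a hard failure would name the missing directory instead of surfacing as an "UnknownError".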