
Carb builder: Some structures are not building when you request many #196

Open
gitoliver opened this issue Jun 26, 2024 · 3 comments
Labels: bug (Something isn't working), GEMS panic!

Comments

@gitoliver
Collaborator

Note: Panic! because we can't move test.glycam.org to actual until this is fixed.

Website: test.glycam.org. The developer environment and actual are fine; they do not show this behavior.

Build something like:
DGalpb1-3DGlcpNAcb1-6DGalpb1-6DGlcpNAcb1-6DManpa1-3[DGalpb1-3DGlcpNAcb1-3DGalpb1-4DGlcpNAcb1-2DManpa1-6]DManpb1-6DGlcpNAcb1-6DGlcpNAcb1-OH
You can change any b to a, or any 3 to a 6, etc., to trigger a new build. Request at least 48 structures and you will see that some of them do not build. This is not a polling issue: if I check the folders, the structures that show as "minimizing" forever are not actually finished. The slurm submission script was written, but nothing happened after that. If I run squeue while the job is running, I see a bunch of things happening, but they all finish and nothing is left pending.
root@gw-slurm-head:/# <- This is where I did squeue.

If I tail the gems log:

> tail git-ignore-me_gemsDebug.log 
gw-slurm-head 2024-06-26 09:06:51 AM - gemsModules.deprecated.batchcompute.slurm.receive - ERROR - Error type: <class 'FileNotFoundError'>
gw-slurm-head 2024-06-26 09:06:51 AM - gemsModules.deprecated.batchcompute.slurm.receive - ERROR - Traceback (most recent call last):
  File "/programs/gems/gemsModules/deprecated/batchcompute/slurm/receive.py", line 67, in writeSlurmSubmissionScript
    script = open(path, "w")
FileNotFoundError: [Errno 2] No such file or directory: '/website/userdata/sequence/cb/Builds/e55e1714-1e0d-42cc-a617-bc5a157362e2/New_Builds/1ogt_2ogg_3ogt_9ogg_11ogt/slurm_submit.sh'

gw-slurm-head 2024-06-26 09:06:51 AM - __main__ - ERROR - gRPC Slurm server caught an unknown error.
gw-slurm-head 2024-06-26 09:06:51 AM - __main__ - ERROR - For date-time stamp: 26/06/2024 09:06:51
gw-slurm-head 2024-06-26 09:06:51 AM - __main__ - ERROR - For this submission: /programs/gems/bin/slurmreceive
gw-slurm-head 2024-06-26 09:06:51 AM - __main__ - ERROR - This is the result: {"entity": {"type": "GRPC", "responses": [{"Error": {"respondingService": "GemsGrpcSlurmReceiver", "notice": {"type": "Exit", "code": "4", "brief": "UnknownError", "message": "There was an unknown fatal error."}, "options": {"osExitCode": "1", "theStandardError": "b'Traceback (most recent call last):\\n  File \"/programs/gems/bin/slurmreceive\", line 72, in <module>\\n    responseObjectString = manageIncomingString(jsonObjectString)\\n  File \"/programs/gems/gemsModules/deprecated/batchcompute/slurm/receive.py\", line 156, in manageIncomingString\\n    writeSlurmSubmissionScript(slurm_runscript_path, thisSlurmJobInfo)\\n  File \"/programs/gems/gemsModules/deprecated/batchcompute/slurm/receive.py\", line 72, in writeSlurmSubmissionScript\\n    raise error\\n  File \"/programs/gems/gemsModules/deprecated/batchcompute/slurm/receive.py\", line 67, in writeSlurmSubmissionScript\\n    script = open(path, \"w\")\\nFileNotFoundError: [Errno 2] No such file or directory: \\'/website/userdata/sequence/cb/Builds/e55e1714-1e0d-42cc-a617-bc5a157362e2/New_Builds/1ogt_2ogg_3ogt_9ogg_11ogt/slurm_submit.sh\\'\\n'", "theStandardOutput": "b''", "theExceptionError": "None"}}}]}}
(Wed Jun-6 11:47:55am)-(CPU 0.6%:0:Net 29)-(webdev@smanager04:/website/DOCKER/GLYCAM/hope-hockney/V_2/Web_Programs/gems/logs)-(204K:4)
> 

The important part is here:
File "/programs/gems/gemsModules/deprecated/batchcompute/slurm/receive.py", line 67, in writeSlurmSubmissionScript
script = open(path, "w")
FileNotFoundError: [Errno 2] No such file or directory: '/website/userdata/sequence/cb/Builds/e55e1714-1e0d-42cc-a617-bc5a157362e2/New_Builds/1ogt_2ogg_3ogt_9ogg_11ogt/slurm_submit.sh'
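In other words, open(path, "w") fails because the New_Builds/1ogt_... parent directory is not visible when receive.py tries to write slurm_submit.sh. One possible guard, sketched here with illustrative names rather than the actual GEMS code, is to create the parent directory (and log that it had to be created) before opening the file:

import logging
import os

log = logging.getLogger(__name__)

def write_slurm_submission_script(path, contents):
    # Illustrative sketch: if the parent directory is missing (not mounted yet,
    # or not yet created by whoever owns the Builds tree), create it and log
    # loudly so intermittent cases show up in the gems log.
    parent = os.path.dirname(path)
    if not os.path.isdir(parent):
        log.warning("Parent directory missing, creating it: %s", parent)
        os.makedirs(parent, exist_ok=True)
    with open(path, "w") as script:
        script.write(contents)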

Lachele thinks something isn't mounted to one of the workers, but we tried:
8183 ssh sworker02 "ls /website"
8184 ssh sworker02 "ls /website/DOCKER"
8185 ssh sworker01 "ls /website/DOCKER"
8186 ssh sworker03 "ls /website/DOCKER"
8187 ssh sworker04 "ls /website/DOCKER"
8188 ssh sworker05 "ls /website/DOCKER"
8189 ssh sworker06 "ls /website/DOCKER"
8190 ssh smanager02 "ls /website/DOCKER"
8191 ssh rime "ls /website/DOCKER"
8192 ssh smanager04 "ls /website/DOCKER"
8193 ssh smanager05 "ls /website/DOCKER"
8194 ssh smanager06 "ls /website/DOCKER"
And saw the contents of DOCKER every time.
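Since those checks only look at /website/DOCKER, a slightly deeper check could ask each host about the Builds path from the traceback. This is only a sketch: the hostnames are the ones above, and the path may sit under a different prefix on the workers.

import subprocess

# Hosts from the ssh checks above; the path is the one from the traceback and
# may well be mounted under a different prefix on each host.
HOSTS = ["sworker01", "sworker02", "sworker03", "sworker04", "sworker05",
         "sworker06", "smanager02", "smanager04", "smanager05", "smanager06", "rime"]
BUILDS_DIR = "/website/userdata/sequence/cb/Builds"

for host in HOSTS:
    # "test -d" exits non-zero when the directory is not visible on that host
    result = subprocess.run(["ssh", host, f"test -d {BUILDS_DIR}"])
    print(f"{host}: {BUILDS_DIR} {'visible' if result.returncode == 0 else 'MISSING'}")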

L recommended asking Grayson.

@gitoliver
Collaborator Author

Here is an example:
https://test.glycam.org/cb/download/324fb5bb-86c9-4739-a31b-eb78560ccb40/
Maybe overkill, but here is a jam you can watch:
https://jam.dev/c/ef155618-2224-439f-a26c-daed89821a5d

@GRAYgoose124
Member

GRAYgoose124 commented Jun 26, 2024

I am uncertain. I have seen this exception before when the request is sent to the wrong grpc_server, which creates the project in the wrong userdata folder; however, in those cases a submission would always fail. This is only a "should", but the deprecated Sequence + BatchCompute code should not behave any differently between test and actual.

I suspect a race condition between creating 48 project folders and files across the network and the gRPC requests. I'm not familiar with what a finished Sequence project looks like, but I followed your directions and modified the sequence for a new build, which appears to have worked.

https://test.glycam.org/cb/download/7f4aeca0-59ef-47ef-b4de-3bf332e69ff4/
Related project folder: /website/DOCKER/USERDATA/LiveTest/userdata/sequence/cb/Builds/d6714d49-9af0-4e1d-ad41-3d63497f7668

@gitoliver
Collaborator Author

Ok, yes, this is no longer reproducible. I was able to consistently generate ~3 projects like this over the course of a few hours before submitting this bug ticket on Wednesday (yesterday); now I get perfect behavior. Should we add checks to the code that the slurm submission file exists, followed by waits that log that they are happening, and finally useful errors if it doesn't appear within a second or so? We need to figure this out, because it could be intermittently messing up jobs without us being aware of it.
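A minimal sketch of what such a check might look like, assuming Python like the rest of receive.py; the function name, timeout, and call site are placeholders, not existing GEMS code:

import logging
import os
import time

log = logging.getLogger(__name__)

def wait_for_path(path, timeout_seconds=2.0, poll_interval=0.1):
    # Illustrative sketch: wait briefly for a path (the Builds directory or the
    # written slurm_submit.sh) to become visible, e.g. over a shared filesystem.
    # Logs each wait so intermittent delays show up in git-ignore-me_gemsDebug.log,
    # and raises a clear error instead of an unexplained failure.
    deadline = time.monotonic() + timeout_seconds
    while not os.path.exists(path):
        if time.monotonic() >= deadline:
            raise FileNotFoundError(
                f"Path still missing after {timeout_seconds}s: {path}"
            )
        log.info("Waiting for path to appear: %s", path)
        time.sleep(poll_interval)

Called on the New_Builds/... directory just before the open(path, "w") in writeSlurmSubmissionScript, something like this would turn an intermittent failure into either a short logged wait or an explicit, timestamped error.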
