Spawningworker: logging, check if process crashed #76

Open
yankovs opened this issue Jun 7, 2024 · 0 comments · May be fixed by #78

yankovs commented Jun 7, 2024

Hey!

I've tried the new spawning worker type, and I have some remarks about it:

In this section:

def _executeJobPayload(self, job_payload, job):
    # instead of execution within our own context, spawn a new process as worker for this job payload
    console_handle = subprocess.Popen(["python", "-m", "mcrit", "singlejobworker", "--job_id", str(job.job_id)], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # extract result_id from console_output
    result_id = None
    try:
        stdout_result, stderr_result = console_handle.communicate(timeout=self._queue_config.QUEUE_SPAWNINGWORKER_CHILDREN_TIMEOUT)
        stdout_result = stdout_result.strip().decode("utf-8")
        last_line = stdout_result.split("\n")[-1]
        # successful output should be just the result_id in a single line
        match = re.match("(?P<result_id>[0-9a-fA-F]{24})", last_line)
        if match:
            result_id = match.group("result_id")
    except subprocess.TimeoutExpired:
        LOGGER.error(f"Job {str(job.job_id)} running as child from SpawningWorker timed out during processing.")
    return result_id

It should be taken into account that the singlejobworker subprocess might crash. We've noticed cases where it does, and when that happens result_id = None is returned and "Finished Remote Job with result_id: None" is logged. This makes the job appear as finished in the web UI, but trying to access the result leads to an error page saying the result doesn't exist.
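For what it's worth, the parent could also inspect the child's exit code to tell a crash apart from a merely missing result_id. A minimal sketch, reusing the names from the snippet above (console_handle, LOGGER, job and the queue config as in _executeJobPayload):

stdout_result, stderr_result = console_handle.communicate(timeout=self._queue_config.QUEUE_SPAWNINGWORKER_CHILDREN_TIMEOUT)
if console_handle.returncode != 0:
    # non-zero exit code: the child crashed, so don't treat the job as finished
    LOGGER.error(
        "Child worker for job %s exited with code %s, stderr: %s",
        job.job_id,
        console_handle.returncode,
        stderr_result.decode("utf-8", errors="replace"),
    )
    return None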

I think this can be handled easily with something like the following (plus maybe some logic to mark the job as failed):

  with job as j:
      LOGGER.info("Processing Remote Job: %s", job)
      result_id = self._executeJobPayload(j["payload"], job)
-     # result should have already been persisted by the child process,we repeat it here to close the job for the queue
-     job.result = result_id
-     LOGGER.info("Finished Remote Job with result_id: %s", result_id)
+     if result_id:
+         # result should have already been persisted by the child process,we repeat it here to close the job for the queue
+         job.result = result_id
+         LOGGER.info("Finished Remote Job with result_id: %s", result_id)
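And for the "mark as failed" part, the else branch could look roughly like this (job.mark_failed() is just a placeholder, I haven't checked whether the queue/job API actually has such a method):

if result_id:
    # result should have already been persisted by the child process,
    # we repeat it here to close the job for the queue
    job.result = result_id
    LOGGER.info("Finished Remote Job with result_id: %s", result_id)
else:
    LOGGER.error("Remote Job %s produced no result_id, child worker likely crashed.", job.job_id)
    # hypothetical call, replace with whatever the queue offers to mark a job as failed:
    # job.mark_failed()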

One other thing I wanted to mention: logs from the child singlejobworker process don't show up when doing, for example, docker logs <worker-container-id>. Since you already capture the child's stdout and stderr, I think it's just a matter of logging them from the spawningworker parent process.
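A rough sketch of what I have in mind, assuming stdout_result and stderr_result are the raw bytes returned by communicate():

# relay the child's output through the parent's logger, so it shows up in
# `docker logs` of the worker container
for line in stdout_result.decode("utf-8", errors="replace").splitlines():
    LOGGER.info("[singlejobworker %s] %s", job.job_id, line)
for line in stderr_result.decode("utf-8", errors="replace").splitlines():
    LOGGER.warning("[singlejobworker %s] %s", job.job_id, line)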

Other than that, the feature seems to be working great. Thank you for taking the time to implement it :)
