Spawningworker: logging, check if process crashed #76

Open
yankovs opened this issue Jun 7, 2024 · 0 comments · May be fixed by #78

yankovs commented Jun 7, 2024

Hey!

I've tried the new spawning worker type, and I have some remarks about it:

In this section:

def _executeJobPayload(self, job_payload, job):
    # instead of execution within our own context, spawn a new process as worker for this job payload
    console_handle = subprocess.Popen(["python", "-m", "mcrit", "singlejobworker", "--job_id", str(job.job_id)], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # extract result_id from console_output
    result_id = None
    try:
        stdout_result, stderr_result = console_handle.communicate(timeout=self._queue_config.QUEUE_SPAWNINGWORKER_CHILDREN_TIMEOUT)
        stdout_result = stdout_result.strip().decode("utf-8")
        last_line = stdout_result.split("\n")[-1]
        # successful output should be just the result_id in a single line
        match = re.match("(?P<result_id>[0-9a-fA-F]{24})", last_line)
        if match:
            result_id = match.group("result_id")
    except subprocess.TimeoutExpired:
        LOGGER.error(f"Job {str(job.job_id)} running as child from SpawningWorker timed out during processing.")
    return result_id

It should be taken into account that the singlejobworker subprocess might crash. We've noticed cases where it does, and when that happens result_id = None is returned and "Finished Remote Job with result_id: None" is logged. This makes the job appear as finished in the web UI, but trying to access the result leads to an error page saying the result doesn't exist.
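For what it's worth, the parent could also inspect the child's exit code to tell a crash apart from a merely missing result_id. A minimal sketch, reusing the names from the snippet above (console_handle, LOGGER, job and the queue config as in _executeJobPayload):

stdout_result, stderr_result = console_handle.communicate(timeout=self._queue_config.QUEUE_SPAWNINGWORKER_CHILDREN_TIMEOUT)
if console_handle.returncode != 0:
    # non-zero exit code: the child crashed, so don't treat the job as finished
    LOGGER.error(
        "Child worker for job %s exited with code %s, stderr: %s",
        job.job_id,
        console_handle.returncode,
        stderr_result.decode("utf-8", errors="replace"),
    )
    return None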

I think this can be handled easily with something like the following (plus maybe some logic to mark the job as failed):

  with job as j:
      LOGGER.info("Processing Remote Job: %s", job)
      result_id = self._executeJobPayload(j["payload"], job)
-     # result should have already been persisted by the child process,we repeat it here to close the job for the queue
-     job.result = result_id
-     LOGGER.info("Finished Remote Job with result_id: %s", result_id)
+     if result_id:
+         # result should have already been persisted by the child process,we repeat it here to close the job for the queue
+         job.result = result_id
+         LOGGER.info("Finished Remote Job with result_id: %s", result_id)
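And for the "mark as failed" part, the else branch could look roughly like this (job.mark_failed() is just a placeholder, I haven't checked whether the queue/job API actually has such a method):

if result_id:
    # result should have already been persisted by the child process,
    # we repeat it here to close the job for the queue
    job.result = result_id
    LOGGER.info("Finished Remote Job with result_id: %s", result_id)
else:
    LOGGER.error("Remote Job %s produced no result_id, child worker likely crashed.", job.job_id)
    # hypothetical call, replace with whatever the queue offers to mark a job as failed:
    # job.mark_failed()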

One other thing I wanted to mention: logs from the child singlejobworker process don't show up when doing, for example, docker logs <worker-container-id>. Since you already capture the child's stdout and stderr, I think it's just a matter of logging them from the spawningworker parent process.
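A rough sketch of what I have in mind, assuming stdout_result and stderr_result are the raw bytes returned by communicate():

# relay the child's output through the parent's logger, so it shows up in
# `docker logs` of the worker container
for line in stdout_result.decode("utf-8", errors="replace").splitlines():
    LOGGER.info("[singlejobworker %s] %s", job.job_id, line)
for line in stderr_result.decode("utf-8", errors="replace").splitlines():
    LOGGER.warning("[singlejobworker %s] %s", job.job_id, line)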

Other than that, the feature seems to be working great. Thank you for taking the time to implement it :)
