PoC with run() & get_results() using K8S statuses mapped to v6 ones + Question about V6 statuses #15
Replies: 1 comment
-
Hi @hcadavid! Great to see you are making progress.
So indeed as you mention the The
Since more exceptions can be raised from the docker-py package, we decided to catch them all and report them as Unknown. Note that there is an
-
Hello all,
I recently pushed an update to the PoC. This update includes the implementation of the original 'run()' and 'get_result()' methods of v6's DockerManager (using the k8s API), including all the required status tracking and the mapping between the k8s job execution status and v6's TaskStatus. It is based not only on the original methods' specifications but also on each method's control flow.
The PoC's node now includes a thread that calls this (blocking) 'get_result()' method to show how the node can get a 'Result' object for each executed task once it has completed, either after a successful execution or after N failed attempts (the N attempts are handled by K8S based on the backoff_limit parameter). It also shows how the jobs and related pods are destroyed upon each task/job completion, while keeping the data on the host's filesystem, as discussed in another thread.
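As a rough illustration of that loop (not the PoC's actual code), the sketch below shows a worker thread blocking on 'get_result()' until finished jobs are reported. The names FakeContainerManager, report_finished and collect_results, and the fields of this Result class, are hypothetical stand-ins chosen for the example; the real manager would watch the k8s API instead of an in-memory queue.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass
class Result:
    # Hypothetical stand-in for the PoC's Result object.
    run_id: int
    status: str


class FakeContainerManager:
    """Hypothetical stand-in for the k8s-backed manager."""

    def __init__(self):
        self._finished = queue.Queue()

    def report_finished(self, result):
        # In the PoC this would be triggered by a job reaching a final state.
        self._finished.put(result)

    def get_result(self):
        # Blocks the calling thread until a finished job is available.
        return self._finished.get()


def collect_results(manager, expected, sink):
    # Runs in the node's listener thread: one blocking call per task.
    for _ in range(expected):
        sink.append(manager.get_result())


manager = FakeContainerManager()
collected = []
worker = threading.Thread(target=collect_results, args=(manager, 2, collected))
worker.start()

manager.report_finished(Result(run_id=1, status="COMPLETED"))
manager.report_finished(Result(run_id=2, status="CRASHED"))
worker.join(timeout=5)
```

Keeping the blocking call in its own thread means the node's main loop stays free to accept new tasks while results arrive.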
These 'Result' objects, for the moment, include:
The following are the potential V6-Statuses that can be reported in the process: ACTIVE, COMPLETED, CRASHED, NOT_ALLOWED
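To make the mapping idea concrete, here is a minimal sketch, assuming the job's final state is read from its conditions: k8s sets a condition of type "Complete" or "Failed" with status "True" once a job ends. The JobCondition and JobStatus classes are simplified stand-ins for the corresponding k8s client objects, and this TaskStatus enum is a placeholder for v6's own. NOT_ALLOWED is left out because the sketch assumes it is decided before a job exists.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class TaskStatus(Enum):
    # Placeholder for v6's TaskStatus; only the job-derived statuses.
    ACTIVE = auto()
    COMPLETED = auto()
    CRASHED = auto()


@dataclass
class JobCondition:
    # Simplified stand-in for a k8s job condition.
    type: str
    status: str


@dataclass
class JobStatus:
    # Simplified stand-in for a k8s job status.
    conditions: list = field(default_factory=list)


def to_v6_status(job_status):
    """Map a (simplified) k8s job status to a v6-style task status."""
    for cond in job_status.conditions:
        if cond.status != "True":
            continue
        if cond.type == "Complete":
            return TaskStatus.COMPLETED
        if cond.type == "Failed":
            # Reached after k8s gives up retrying (backoff limit exceeded).
            return TaskStatus.CRASHED
    # No final condition yet: the job is still pending or running.
    return TaskStatus.ACTIVE


running = JobStatus()
done = JobStatus([JobCondition("Complete", "True")])
failed = JobStatus([JobCondition("Failed", "True")])
```

With this shape, a job that is still retrying stays ACTIVE and only becomes CRASHED once k8s itself marks it as failed.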
I still need to explore how to intercept and properly report (1) an error caused by a non-existing image (NO_DOCKER_IMAGE), and (2) a failed k8s job execution caused by an 'unknown algorithm', which, if I understand correctly, you are reporting as TaskStatus.START_FAILED. For the latter, in the code, the chain of events that leads to this status (START_FAILED) starts when the Docker client raises any exception, so I'm not sure what an 'unknown algorithm' means, technically speaking. Is it a missing method or module within the image that causes the container execution to crash? (With the current PoC implementation, an event like that would end in a CRASHED status.)
The PoC also needs to properly include the parent_id, job_id and run_id values. I understand what the parent_id should be, but I'm still a bit confused about the difference between job_id and run_id. Is there something I can refer to in order to understand this better?
I have updated the steps on the README file if you want to give it a try and/or check the code of the above.