-
Notifications
You must be signed in to change notification settings - Fork 287
fix resource check in scanAll process #8
base: master
Are you sure you want to change the base?
Conversation
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me overall.
Can you please (1) add some checks to enforce the minimal number of cores and memory in the REST API, and the comments that describes the fields in the YAML file?
(2) an unit test validating the behavior?
Sure, will do |
Got a problem, how do we define minimum amount of memory? It seems like neither AthenaX nor Flink has such kind of property currently... |
@@ -150,7 +150,8 @@ static StateView computeState(Map<UUID, JobDefinition> jobs, Map<UUID, InstanceI | |||
|
|||
static JobDefinitionDesiredstate computeActualState(InstanceInfo info) { | |||
JobDefinitionResource r = new JobDefinitionResource() | |||
.memory(info.status().getAllocatedMB()) | |||
.memory(info.status().getAllocatedMB() / (info.status().getAllocatedVCores() != 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Taking a closer look I don't quite understand this. The memory should be the total amount of memory for the whole job instead of the memory used by each task manager.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the definition of JobConf, it seems like the amount of memory specified by the Web API parameters is exactly the memory used by each task manager instead of the whole job's memory consumption. And when I start a job with 2 cores and 2G memory, AthenaX starts a job with two TaskManager containers with 1 Core and 2G memory each, so the total amount of memory retrieved from YARN is 4G, which leads to the same issue of continuously restarting.
Maybe this is not the right place to fix this issue ?
fix the endless 'kill and restart' issue