Skip to content
This repository has been archived by the owner on Oct 15, 2020. It is now read-only.

fix resource check in scanAll process #8

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

innomentats
Copy link
Contributor

fix the endless 'kill and restart' issue

@CLAassistant
Copy link

CLAassistant commented Oct 30, 2017

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Copy link
Contributor

@haohui haohui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me overall.

Can you please (1) add some checks to enforce the minimal number of cores and memory in the REST API, and the comments that describes the fields in the YAML file?
(2) an unit test validating the behavior?

@innomentats
Copy link
Contributor Author

Sure, will do

@innomentats
Copy link
Contributor Author

Got a problem, how do we define minimum amount of memory? It seems like neither AthenaX nor Flink has such kind of property currently...

@@ -150,7 +150,8 @@ static StateView computeState(Map<UUID, JobDefinition> jobs, Map<UUID, InstanceI

static JobDefinitionDesiredstate computeActualState(InstanceInfo info) {
JobDefinitionResource r = new JobDefinitionResource()
.memory(info.status().getAllocatedMB())
.memory(info.status().getAllocatedMB() / (info.status().getAllocatedVCores() != 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking a closer look I don't quite understand this. The memory should be the total amount of memory for the whole job instead of the memory used by each task manager.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/uber/AthenaX/blob/9de54305f6dc198cbe176f64c8d6dddc6ab2e6ac/athenax-backend/src/main/java/com/uber/athenax/backend/server/yarn/JobConf.java#L48-L55

According to the definition of JobConf, it seems like the amount of memory specified by the Web API parameters is exactly the memory used by each task manager instead of the whole job's memory consumption. And when I start a job with 2 cores and 2G memory, AthenaX starts a job with two TaskManager containers with 1 Core and 2G memory each, so the total amount of memory retrieved from YARN is 4G, which leads to the same issue of continuously restarting.

Maybe this is not the right place to fix this issue ?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants