MCM does not report last operation for machines if credentials are invalid #455

Closed
rfranzke opened this issue Apr 29, 2020 · 6 comments
Labels
kind/bug Bug lifecycle/stale Nobody worked on this for 6 months (will further age)

Comments

@rfranzke
Member

What happened:
Trying to create/delete VMs with invalid credentials only produces a log message; no information is written to the .status.lastOperation of the Machine object.

What you expected to happen:
The log messages show:

I0429 06:54:56.244810       1 machine.go:547] Deleting Machine "shoot--foo--bar-cpu-worker-z1-5cdcb46f64-pxzp5"
I0429 06:54:56.244842       1 machine.go:659] Machine "shoot--foo--bar-cpu-worker-z1-5cdcb46f64-pxzp5" on deletion doesn't have a providerID attached to it. Checking for any VM linked to this machine object.
E0429 06:54:56.336545       1 driver_aws.go:409] AWS driver is returning error while describe instances request is sent: AuthFailure: AWS was not able to validate the provided access credentials
	status code: 401, request id: 6e99231c-654e-4b05-8801-310e3532b4e9
E0429 06:54:56.336595       1 machine.go:664] Failed to list VMs while deleting the machine "shoot--foo--bar-cpu-worker-z1-5cdcb46f64-pxzp5" AuthFailure: AWS was not able to validate the provided access credentials
	status code: 401, request id: 6e99231c-654e-4b05-8801-310e3532b4e9

However, the status does not indicate any problem:

status:
  currentStatus:
    lastUpdateTime: "2020-04-29T06:53:33Z"
    phase: Terminating
  lastOperation:
    description: Deleting machine from cloud provider
    lastUpdateTime: "2020-04-29T06:53:33Z"
    state: Processing
    type: Delete

Consequently, .status.failedMachines of the MachineDeployment is empty as well.

How to reproduce it (as minimally and precisely as possible):
Try to create/delete a machine with invalid credentials.
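
For illustration, the behaviour this report asks for could be sketched roughly as follows. This is a minimal sketch, not MCM's actual code: the `LastOperation` and `MachineStatus` types are simplified stand-ins for the status shown above, `recordOperationError` and `simulateDeleteWithInvalidCredentials` are hypothetical helpers, and the `"Failed"` state value and the description format are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// LastOperation is a simplified, hypothetical stand-in for the
// .status.lastOperation block shown in this issue.
type LastOperation struct {
	Description    string
	LastUpdateTime time.Time
	State          string // e.g. "Processing"; "Failed" is assumed here
	Type           string // e.g. "Create" or "Delete"
}

// MachineStatus is a trimmed-down, hypothetical machine status.
type MachineStatus struct {
	LastOperation LastOperation
}

// recordOperationError (hypothetical helper) copies a provider error into
// the status instead of only logging it. A nil error leaves the status as is.
func recordOperationError(status *MachineStatus, opType string, err error, now time.Time) {
	if err == nil {
		return
	}
	status.LastOperation = LastOperation{
		Description:    fmt.Sprintf("%s failed: %v", opType, err),
		LastUpdateTime: now,
		State:          "Failed",
		Type:           opType,
	}
}

// simulateDeleteWithInvalidCredentials starts from the status reported in
// this issue and applies the AuthFailure error seen in the logs.
func simulateDeleteWithInvalidCredentials() MachineStatus {
	status := MachineStatus{LastOperation: LastOperation{
		Description: "Deleting machine from cloud provider",
		State:       "Processing",
		Type:        "Delete",
	}}
	err := errors.New("AuthFailure: AWS was not able to validate the provided access credentials")
	recordOperationError(&status, "Delete", err, time.Now())
	return status
}

func main() {
	status := simulateDeleteWithInvalidCredentials()
	fmt.Printf("state=%s type=%s\n", status.LastOperation.State, status.LastOperation.Type)
	fmt.Println(status.LastOperation.Description)
}
```

With something along these lines, the credential error would be visible on the Machine object itself rather than only in the controller logs.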

@rfranzke rfranzke added the kind/bug Bug label Apr 29, 2020
@rfranzke
Member Author

Probably related to #453?

@ggaurav10
Contributor

This seems to at least partially contradict #456, where the status field can be seen carrying the credential-related error when trying to delete a machine. Or am I missing something here?
Do you mean that the machineDeployment reflects the error, but the machine object does not?

@rfranzke
Member Author

Why is it contradicting? This issue here is about writing the unauthorized error into the lastOperation in the first place. #456 is about resetting the .status.failedMachines after the error has been resolved, e.g., after the credentials have been fixed by the user and MCM can now properly create VMs.

@ggaurav10
Contributor

The failing status of individual machines appears in .status.failedMachines[] of the machineset and machineDeployment (as seen in #456). This is built by MCM using the machine's status itself, so at some point in time, the credential related error would have shown up in the machine's status itself.
I think what is happening here is that, when the machine delete is retried, that status is overwritten to what is reported in this issue:

  lastOperation:
    description: Deleting machine from cloud provider

I wonder whether we should overwrite the last error message, which is preserved in the machine set/deployment status anyway, or introduce a new field in the machine's status to save the last failed operation, so that we can still keep track of the in-flight last operation, e.g. "Deleting machine from cloud provider".
/cc @hardikdr @prashanth26
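
To make the second option concrete, here is a rough sketch of a status that keeps the in-flight operation and the last failure side by side. It is only an illustration under stated assumptions: the `LastFailedOperation` field, the `StartOperation`/`FailOperation` methods, and the `retryScenario` helper are all hypothetical names and do not exist in MCM as far as this thread shows.

```go
package main

import "fmt"

// Operation is a simplified, hypothetical version of a lastOperation entry.
type Operation struct {
	Description, State, Type string
}

// MachineStatus carries the in-flight operation plus a hypothetical
// LastFailedOperation field that survives retries.
type MachineStatus struct {
	LastOperation       Operation
	LastFailedOperation *Operation // hypothetical field, nil until a failure occurs
}

// StartOperation overwrites only the in-flight operation.
func (s *MachineStatus) StartOperation(opType, description string) {
	s.LastOperation = Operation{Description: description, State: "Processing", Type: opType}
}

// FailOperation marks the in-flight operation as failed and keeps a copy
// of it in LastFailedOperation so that a later retry does not erase it.
func (s *MachineStatus) FailOperation(message string) {
	s.LastOperation.Description = message
	s.LastOperation.State = "Failed"
	failed := s.LastOperation
	s.LastFailedOperation = &failed
}

// retryScenario replays the sequence discussed above: a delete fails with
// a credential error and is then retried.
func retryScenario() MachineStatus {
	var status MachineStatus
	status.StartOperation("Delete", "Deleting machine from cloud provider")
	status.FailOperation("AuthFailure: AWS was not able to validate the provided access credentials")
	status.StartOperation("Delete", "Deleting machine from cloud provider") // retry
	return status
}

func main() {
	status := retryScenario()
	fmt.Println("in-flight:", status.LastOperation.State, "-", status.LastOperation.Description)
	fmt.Println("last failure:", status.LastFailedOperation.Description)
}
```

The trade-off is an extra status field to maintain, in exchange for the retry no longer hiding the credential error.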

@prashanth26
Contributor

@rfranzke - You are partially right that this message is not being propagated to the machine status. However, this is something that was introduced in the last release with this change: https://github.com/gardener/machine-controller-manager/blob/master/pkg/controller/machine.go#L396-L399. cc @hardikdr

@ggaurav10 - I don't think it's the issue with machine deployment error propagation. That part seems fine to me.

Overall, this particular instance is a regression introduced with the last release. We need to fix this as well.

@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Jun 29, 2020
@prashanth26 prashanth26 removed the lifecycle/stale Nobody worked on this for 6 months (will further age) label Aug 16, 2020
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Oct 16, 2020
@hardikdr
Member

hardikdr commented Nov 6, 2020

This should be solved with #527
Please feel free to re-open if seen again.
/close
