[WIP] Forward ClusterIP traffic to inter-node communication port before CQL is ready #1614
base: master
Conversation
Force-pushed from 4dc2f8f to d77e0f4
Force-pushed from 8d9c038 to 28f6bd1
left a couple of comments, only skimmed through the tests for now
	},
)
if err != nil {
	objectErrs = append(objectErrs, err)
I know the other invocations don't wrap this error, but is there actually a reason for not wrapping it?
no reason, just consistency with others
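For illustration, the wrapped form the reviewer hints at could look like this (the message text is invented for the sketch, not the PR's actual wording):

if err != nil {
	objectErrs = append(objectErrs, fmt.Errorf("can't apply Service: %w", err))
}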
PublishNotReadyAddresses: true,
Type:                     corev1.ServiceTypeClusterIP,
Ports:                    servicePorts(sc),
Selector:                 nil,
nit: shall we have a comment explaining why there's no selector? Or maybe even a higher-level comment how we work around it? It's obvious in this PR's context, but probably not so much for someone looking at it in the future. On the other hand it can be backtracked to this commit, so I'm just leaving it as a nit.
added a comment
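For context on the mechanism: the built-in endpoints controllers only manage Endpoints/EndpointSlices for Services that have a selector, so leaving it nil hands endpoint management to the operator. A minimal self-contained sketch of such a Service (the function name and port are illustrative, not this PR's exact code):

package example

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// memberService builds a ClusterIP Service without a selector.
// kube-controller-manager only reconciles Endpoints/EndpointSlices for
// Services that have a selector, so a nil selector leaves endpoint
// management entirely to the operator.
func memberService(name, namespace string) *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
		Spec: corev1.ServiceSpec{
			Type:                     corev1.ServiceTypeClusterIP,
			PublishNotReadyAddresses: true,
			Selector:                 nil, // intentionally empty: the operator publishes its own EndpointSlices
			Ports: []corev1.ServicePort{
				// 7000 is Scylla's usual inter-node port, shown here for illustration.
				{Name: "inter-node", Port: 7000, Protocol: corev1.ProtocolTCP},
			},
		},
	}
}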
`,
},
ReadinessProbe: &corev1.Probe{
	TimeoutSeconds: int32(30),
nit: shall we have comments explaining these values?
they are not special, there's nothing to comment on.
	return endpointSlices, fmt.Errorf("can't get Pod %q: %w", naming.ManualRef(sc.Namespace, svc.Name), err)
}

// Don't publish endpoints for Pod that are being deleted.
Suggested change:
- // Don't publish endpoints for Pod that are being deleted.
+ // Don't publish endpoints for Pods that are being deleted.
done
// Don't publish endpoints for Pod that are being deleted.
// Removing endpoints prevents from new connections being established while still allowing
// for existing connections to survive and finish their requests.
// We need to do this early, as gap between when Pod is actually deleted and Endpoint
Suggested change:
- // We need to do this early, as gap between when Pod is actually deleted and Endpoint
+ // We need to do this early, as a gap between when Pod is actually deleted and Endpoint
done
}

for i := range tt {
	tc := tt[i]
no need for this anymore
done
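For context, the copy was the pre-Go 1.22 idiom for capturing range variables in subtest closures; since Go 1.22 each iteration gets a fresh variable, so ranging directly is safe. A minimal self-contained sketch:

package example

import "testing"

type testCase struct {
	name string
}

func TestExample(t *testing.T) {
	tt := []testCase{{name: "a"}, {name: "b"}}

	// Before Go 1.22 the range variable was shared across iterations,
	// so parallel subtests needed a per-iteration copy:
	//   for i := range tt { tc := tt[i]; ... }
	// With Go 1.22 loop-variable semantics this is no longer necessary:
	for _, tc := range tt {
		t.Run(tc.name, func(t *testing.T) {
			t.Parallel()
			if tc.name == "" {
				t.Error("empty name")
			}
		})
	}
}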
pkg/controllerhelpers/core_test.go (outdated)
}

for i := range tt {
	tc := tt[i]
ditto
done
}

for i := range tt {
	tc := tt[i]
ditto
done
pkg/controllerhelpers/core_test.go (outdated)
t.Errorf("expected found %#v, got %#v", tc.expectedFound, found) | ||
} | ||
if !reflect.DeepEqual(container, tc.expectedContainer) { | ||
t.Errorf("expected container %#v, got %#v", tc.expectedContainer, container) |
can we use cmp.Diff instead?
done
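For reference, a typical cmp.Diff assertion looks like this (a generic sketch, not the exact code that landed):

package example

import (
	"testing"

	"github.com/google/go-cmp/cmp"
)

func TestContainerMatch(t *testing.T) {
	type container struct{ Name string }
	expected := container{Name: "scylla"}
	got := container{Name: "scylla"}

	// cmp.Diff returns "" when the values are equal and a readable
	// line-by-line diff otherwise, which is easier to scan than two
	// %#v dumps side by side.
	if diff := cmp.Diff(expected, got); diff != "" {
		t.Errorf("expected and got containers differ (-expected +got):\n%s", diff)
	}
}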

collidedContainerName, collision := hashes[hash]
if collision {
	t.Errorf("found collision on container endpoint ports, both pod and it's controlled ports and %q container with its serving port generate same hash", collidedContainerName)
t.Errorf("found collision on container endpoint ports, both pod and it's controlled ports and %q container with its serving port generate same hash", collidedContainerName) | |
t.Errorf("found collision on container endpoint ports, both pod and its controlled ports and %q container with its serving port generate same hash", collidedContainerName) |
done
	t.Fatal(err)
}
if !apiequality.Semantic.DeepEqual(endpoints, tc.expectedEndpoints) {
	t.Errorf("expected and actual Endpoints(s) differ: %s", cmp.Diff(tc.expectedEndpoints, endpoints))
nit: Endpoints(s) doesn't really make sense
done
	},
},
{
	name: "EndpointSlices to Service ports backed by container having readiness probe are ready when container is ready regardless of Pod readiness",
name: "EndpointSlices to Service ports backed by container having readiness probe are ready when container is ready regardless of Pod readiness", | |
name: "EndpointSlices for Service ports backed by container having readiness probe are ready when container is ready regardless of Pod readiness", |
for consistency mostly
done
	},
},
{
	name: "EndpointSlice have IPV6 AddressType when PodIP from Pod.Status is an IPv6 address",
name: "EndpointSlice have IPV6 AddressType when PodIP from Pod.Status is an IPv6 address", | |
name: "EndpointSlice has IPV6 AddressType when PodIP from Pod.Status is an IPv6 address", |
done
for _, svcPort := range svc.Spec.Ports {
	containerServingPort, ok, err := controllerhelpers.FindContainerServingPort(svcPort, sts.Spec.Template.Spec.Containers)
	if err != nil {
		t.Fatal(err)
	}

	ep := discoveryv1.EndpointPort{
		Name:        pointer.Ptr(svcPort.Name),
		Port:        pointer.Ptr(svcPort.Port),
		Protocol:    pointer.Ptr(svcPort.Protocol),
		AppProtocol: svcPort.AppProtocol,
	}

	if ok && containerServingPort.ReadinessProbe != nil {
		containerPortsMap[containerServingPort.Name] = append(containerPortsMap[containerServingPort.Name], ep)
	} else {
		podPorts = append(podPorts, ep)
	}
}
could this be abstracted to a func shared between the test and implementation, so that we're sure they don't diverge?
done
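A shared helper along these lines could keep the test and the implementation from diverging. In this sketch, the helper name classifyServicePorts and the stubbed FindContainerServingPort body are assumptions; only the FindContainerServingPort signature comes from this PR:

package example

import (
	corev1 "k8s.io/api/core/v1"
	discoveryv1 "k8s.io/api/discovery/v1"
)

func ptr[T any](v T) *T { return &v }

// FindContainerServingPort is provided by the PR; it is stubbed here with
// simplified matching logic only so the sketch compiles on its own.
func FindContainerServingPort(svcPort corev1.ServicePort, containers []corev1.Container) (corev1.Container, bool, error) {
	for _, c := range containers {
		for _, p := range c.Ports {
			if p.ContainerPort == svcPort.Port {
				return c, true, nil
			}
		}
	}
	return corev1.Container{}, false, nil
}

// classifyServicePorts splits Service ports into ports gated by a dedicated
// container readiness probe and ports gated by whole-Pod readiness, giving
// the test and the implementation a single source of truth.
func classifyServicePorts(svc *corev1.Service, containers []corev1.Container) (podPorts []discoveryv1.EndpointPort, containerPorts map[string][]discoveryv1.EndpointPort, err error) {
	containerPorts = map[string][]discoveryv1.EndpointPort{}

	for _, svcPort := range svc.Spec.Ports {
		c, ok, err := FindContainerServingPort(svcPort, containers)
		if err != nil {
			return nil, nil, err
		}

		ep := discoveryv1.EndpointPort{
			Name:        ptr(svcPort.Name),
			Port:        ptr(svcPort.Port),
			Protocol:    ptr(svcPort.Protocol),
			AppProtocol: svcPort.AppProtocol,
		}

		// Ports served by a container with its own readiness probe are
		// gated on that container; everything else follows the Pod.
		if ok && c.ReadinessProbe != nil {
			containerPorts[c.Name] = append(containerPorts[c.Name], ep)
		} else {
			podPorts = append(podPorts, ep)
		}
	}

	return podPorts, containerPorts, nil
}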
	},
},
{
	name: "multi EndpointSlices per Service",
nit: multiple
done
t.Errorf("expected error %#v, got %#v", tc.expectedError, err) | ||
} | ||
if !reflect.DeepEqual(found, tc.expectedFound) { | ||
t.Errorf("expected found %#v, got %#v", tc.expectedFound, found) |
%t instead of %#v?
why? these two produce the same output
@zimnx can you please link the issue on the kubernetes repo we've talked about filing in the past that explains the problem and why the existing API concepts didn't cover it? Was there any feedback you got on the approach or alternative ideas from the community?
pod, err := podLister.Pods(sc.Namespace).Get(svc.Name)
if err != nil {
	if apierrors.IsNotFound(err) {
		continue
V4 log?
done
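Extending the quoted snippet, the V(4) log might look roughly like this (the message wording is an assumption; klog is what the operator uses for logging):

if apierrors.IsNotFound(err) {
	// Not finding the Pod is expected while it's being (re)created;
	// log at high verbosity and move on to the next member Service.
	klog.V(4).InfoS("Pod backing the member Service not found, skipping endpoint publication", "Service", klog.KObj(svc))
	continue
}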
pkg/controllerhelpers/core.go (outdated)
@@ -249,3 +251,38 @@ func GetScyllaContainerID(pod *corev1.Pod) (string, error) {

	return cs.ContainerID, nil
}

func FindContainerServingPort(port corev1.ServicePort, containers []corev1.Container) (corev1.Container, bool, error) {
Suggested change:
- func FindContainerServingPort(port corev1.ServicePort, containers []corev1.Container) (corev1.Container, bool, error) {
+ func FindContainerServingPort(svcPort corev1.ServicePort, containers []corev1.Container) (corev1.Container, bool, error) {

port is ambiguous
done
	return h, err
}

fnvHash := fnv.New32a()
encoding sha512 wouldn't solve it, as it's too long to be used as part of a name
I didn't get that. Encoding converts the char sets. Hashes (and their encodings) can be arbitrarily truncated as they have random distribution.
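To illustrate the truncation point in a standalone snippet (not the operator's code):

package main

import (
	"crypto/sha512"
	"encoding/base32"
	"fmt"
	"hash/fnv"
)

func main() {
	data := []byte("pod-ports\x00container-ports")

	// FNV-1a 32-bit: short enough for a name suffix, but only 32 bits
	// of collision resistance.
	f := fnv.New32a()
	f.Write(data)
	fmt.Printf("fnv32a:   %08x\n", f.Sum32())

	// SHA-512, encoded and then truncated: because a cryptographic hash
	// is uniformly distributed, a fixed-length prefix of its encoding
	// keeps that distribution while fitting name-length limits.
	sum := sha512.Sum512(data)
	enc := base32.HexEncoding.WithPadding(base32.NoPadding).EncodeToString(sum[:])
	fmt.Println("sha512/8:", enc[:8])
}

The prefix length can be chosen to fit whatever name budget remains, since any prefix of a uniformly distributed hash is itself uniformly distributed.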
Force-pushed from 1f80b92 to 48d9bb0
/assign @rzetelskik
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tnozicka, zimnx

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
Force-pushed from 48d9bb0 to cc8e792
Force-pushed from cc8e792 to b14579a
@zimnx: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests.

Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Marking as WIP, I need to triage those ClusterIP failures
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
The Scylla Operator project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to its lifecycle rules.

/lifecycle stale

/lifecycle rotten

/lifecycle stale
Kubernetes reconciles Endpoints and EndpointSlices for Services that have
a selector by taking the readiness of the entire Pod into account. Because
we need to allow inter-node communication before CQL traffic, we used
PublishNotReadyAddresses on Services to overcome this.
But at the same time, we would like to stop accepting new connections when
a Pod is tearing down. These two requirements contradict each other.
To satisfy both, the Operator will reconcile Endpoints and EndpointSlice
resources in a per-container way. If a Pod container specifies a port and
has its own readiness probe, the Endpoints/EndpointSlice will become ready
for this port when the given container is ready. If these two conditions
aren't met, the given port becomes ready when the entire Pod is ready.
The controller logic will consider a Pod being deleted (non-nil
deletionTimestamp) as fully non-ready, so that all endpoints for all ports
and containers will be removed.
An additional container added to the Scylla Pod controls the readiness of
the ports used for inter-node communication.
With the above logic, nodes observing another node going down won't be
able to reconnect until the restarted node opens its inter-node
communication port, because traffic is rejected immediately once the node
starts terminating. This fixes the traffic disruption during rolling
restarts on ScyllaClusters using ClusterIP for inter-node communication,
which was caused by nodes being stuck on reconnection attempts.
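A condensed sketch of the readiness decision described above (the function name and shape are assumptions for illustration; the actual controller code differs):

package example

import corev1 "k8s.io/api/core/v1"

// endpointReady decides whether a single endpoint port should be ready,
// following the per-container rules from the commit message.
func endpointReady(pod *corev1.Pod, servingContainer *corev1.Container) bool {
	// A deleting Pod is fully non-ready: all its endpoints are removed,
	// so new connections are rejected immediately.
	if pod.DeletionTimestamp != nil {
		return false
	}

	// Ports served by a container with its own readiness probe follow
	// that container's readiness, not the whole Pod's.
	if servingContainer != nil && servingContainer.ReadinessProbe != nil {
		for _, cs := range pod.Status.ContainerStatuses {
			if cs.Name == servingContainer.Name {
				return cs.Ready
			}
		}
		return false
	}

	// Everything else follows whole-Pod readiness.
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodReady {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}

The deletion check comes first so a terminating node is withdrawn from every port at once, which is what lets peers fail fast instead of hanging on reconnection attempts.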
Prerequisites:
Fixes #1077