Linkerd-destination: unable to connect to validator #11597

Closed
matthiasdeblock opened this issue Nov 9, 2023 · 19 comments · Fixed by linkerd/website#1794

@matthiasdeblock

What is the issue?

Hi

After installing linkerd-cni, the Linkerd pods are unable to start due to the following error:

Unable to connect to validator. Please ensure iptables rules are rewriting traffic as expected error=Host is unreachable (os error 113)

How can it be reproduced?

Install linkerd-cni and Linkerd on a Flatcar Kubernetes 1.28.3 cluster with Cilium as the CNI.

Logs, error output, etc

2023-11-09T11:42:46.686000Z  INFO linkerd_network_validator: Listening for connections on 0.0.0.0:4140
2023-11-09T11:42:46.686030Z DEBUG linkerd_network_validator: token="KXyajGp2VZRdLXMQEEAqBJoJUeNIUUUhajU7NmAqDTmCn9fcj9GyrFcDdlGURTo\n"
2023-11-09T11:42:46.686037Z  INFO linkerd_network_validator: Connecting to 1.1.1.1:20001
2023-11-09T11:42:47.586457Z ERROR linkerd_network_validator: Unable to connect to validator. Please ensure iptables rules are rewriting traffic as expected error=Host is unreachable (os error 113)
2023-11-09T11:42:47.586481Z ERROR linkerd_network_validator: error=Host is unreachable (os error 113)

output of linkerd check -o short

linkerd-existence
-----------------
- No running pods for "linkerd-destination" ^C

Environment

  • Kubernetes-version: 1.28.3
  • Cilium version: 1.14.3
  • Linkerd-cni-version: stable-2.14.3
  • Linkerd-version: stable-2.14.3
  • OS: Flatcar Openstack 3510.2.8

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

maybe

@mateiidavid
Member

@matthiasdeblock hi, it sounds like the validator is detecting an erroneous configuration in your network stack. The validator attempts to connect to a server it creates in order to test that iptables destination rewriting works as expected. I see that you're using Cilium. We have a cluster configuration section in our docs aimed at getting Linkerd to work with Cilium. Their socket-level load balancing capability can sometimes mess up routing for other services. Can you check if that's affecting you here?
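
For reference, a minimal sketch of how that Cilium setting is typically applied with Helm, assuming a Helm-managed Cilium release named "cilium" in kube-system and that the chart value socketLB.hostNamespaceOnly (which maps to the bpf-lb-sock-hostns-only flag) applies to your Cilium version:

# Restrict Cilium's socket-level load balancing to the host network namespace
# so it doesn't interfere with traffic redirection inside meshed pods.
helm upgrade cilium cilium/cilium \
  --namespace kube-system \
  --reuse-values \
  --set socketLB.hostNamespaceOnly=true

# Restart the Cilium agents so the new setting takes effect.
kubectl -n kube-system rollout restart daemonset/cilium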

@matthiasdeblock
Author

Hi @mateiidavid
I did set 'bpf-lb-sock-hostns-only: "true"' but that did not fix the issue here. Without linkerd-cni everything is working fine.

@mateiidavid
Member

If you think linkerd-cni is the culprit, I'd suggest having a look at some logs. Specifically:

  • Does the installer (linkerd-cni daemonset pod) report anything?
  • Can you get access to kubelet logs to verify whether any plugin runs have been unsuccessful?
  • Does your CNI host configuration file contain linkerd-cni's configuration?

I'd perhaps start with the last one if it's easy. It might be that the configuration wasn't appended properly for some reason.
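
Two quick checks for the first two items above, as a sketch (assumes the default linkerd-cni DaemonSet name and namespace, and that kubelet runs as a systemd unit, as on Flatcar):

# Check the linkerd-cni installer logs for errors.
kubectl -n linkerd-cni logs daemonset/linkerd-cni

# On the node, look for CNI plugin errors reported by kubelet.
journalctl -u kubelet | grep -i cni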

@matthiasdeblock
Author

I'll give it a retry next week. I did check all these but I'll give it another look:

  • The cni pods did not report any issues
  • The plugin installs correctly and is up and running within a couple of seconds
  • The CNI config file mentioned the location of the Cilium CNI plugin conf.

I'll verify this by the beginning of next week.

Regards

@kflynn
Member

kflynn commented Dec 18, 2023

@matthiasdeblock Any joy retrying this?

@kflynn
Member

kflynn commented Jan 4, 2024

@matthiasdeblock Happy new year! Still curious if you got a chance to retry things? 🙂

@matthiasdeblock
Author

Hi
Sorry for the delay; we will be testing again in the upcoming days.
Regards
Matthias

@Driesvanherpe

Hi,

As a colleague of @matthiasdeblock, I'd like to give some extra info about this issue.
Logs of the cni pod:

[2024-04-03 09:12:34] Wrote linkerd CNI binaries to /host/opt/cni/bin
[2024-04-03 09:12:34] Installing CNI configuration for /host/etc/cni/net.d/05-cilium.conflist
[2024-04-03 09:12:34] Using CNI config template from CNI_NETWORK_CONFIG environment variable.
      "k8s_api_root": "https://__KUBERNETES_SERVICE_HOST__:__KUBERNETES_SERVICE_PORT__",
      "k8s_api_root": "https://10.12.0.1:__KUBERNETES_SERVICE_PORT__",
[2024-04-03 09:12:34] CNI config: {
  "name": "linkerd-cni",
  "type": "linkerd-cni",
  "log_level": "info",
  "policy": {
      "type": "k8s",
      "k8s_api_root": "https://10.12.0.1:443",
      "k8s_auth_token": "__SERVICEACCOUNT_TOKEN__"
  },
  "kubernetes": {
      "kubeconfig": "/etc/cni/net.d/ZZZ-linkerd-cni-kubeconfig"
  },
  "linkerd": {
    "incoming-proxy-port": 4143,
    "outgoing-proxy-port": 4140,
    "proxy-uid": 2102,
    "ports-to-redirect": [],
    "inbound-ports-to-ignore": ["4191","4190"],
    "simulate": false,
    "use-wait-flag": false
  }
}
[2024-04-03 09:12:34] Created CNI config /host/etc/cni/net.d/05-cilium.conflist
Setting up watches.
Watches established.

It looks like the linkerd-cni config doesn't end up in the file. Contents of /etc/cni/net.d/05-cilium.conflist:

{
  "cniVersion": "0.3.1",
  "name": "cilium",
  "plugins": [
    {
       "type": "cilium-cni",
       "enable-debug": false,
       "log-file": "/var/run/cilium/cilium-cni.log"
    }
  ]
}
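
A quick way to confirm that on a node, as a sketch (run on the node itself or from a debug pod that mounts the host's /etc/cni/net.d, and assumes jq is available):

# List the plugin types in the conflist shown above; when the merge worked,
# this prints "cilium-cni" followed by "linkerd-cni".
jq '.plugins[].type' /etc/cni/net.d/05-cilium.conflist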

alpeb added a commit to linkerd/linkerd2-proxy-init that referenced this issue Apr 4, 2024
Fixes linkerd/linkerd2#11597

When the cni plugin is triggered, it validates that the proxy has been
injected into the pod before setting up the iptables rules. It does so
by looking for the "linkerd-proxy" container. However, when the proxy is
injected as a native sidecar, it gets added as an _init_ container, so
it was being disregarded here.

We don't have integration tests for validating native sidecars when
using linkerd-cni because [Calico doesn't work in k3s since k8s
1.27](k3d-io/k3d#1375), and we require k8s
1.29 for using native sidecars.
I did nevertheless successfully test this fix in an AKS cluster.

@matthiasdeblock
Author

Hi

As our cluster is air-gapped, I noticed that 1.1.1.1 isn't a usable connection address for us. I've fixed this in our Helm chart and we are now getting a bit further, but still running into an error:

flatcar-k8stest-master-01 ~ # kubectl -n linkerd logs linkerd-destination-56f777c8b6-8sw9c -c linkerd-network-validator

2024-04-05T05:33:28.979251Z  INFO linkerd_network_validator: Listening for connections on 0.0.0.0:4140
2024-04-05T05:33:28.979293Z DEBUG linkerd_network_validator: token="y3SgDWabwG6jtxhXFrYYBB4cSHHiSKjbSsaDV29f89tkwrWjmJXtvMz9lmyWb5p\n"
2024-04-05T05:33:28.979308Z  INFO linkerd_network_validator: Connecting to <kubernetes_api>:6443
2024-04-05T05:33:28.981087Z DEBUG connect: linkerd_network_validator: Connected client.addr=10.12.70.197:57332
2024-04-05T05:33:38.980507Z ERROR linkerd_network_validator: Failed to validate networking configuration. Please ensure iptables rules are rewriting traffic as expected. timeout=10s

(using the kubernetes api IP to connect to)

So it now connects but is still throwing an error.

Regards
Matthias
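
For reference, the address and timeout used by the network validator are exposed as chart values; a sketch of the kind of override described above, assuming a Helm install of the linkerd-control-plane chart and that the networkValidator.connectAddr / networkValidator.timeout values apply to the installed version:

# Point the validator at an address that is routable from an air-gapped
# cluster instead of the default 1.1.1.1:20001, and optionally widen the
# timeout. <kubernetes_api> is a placeholder for the API server address.
helm upgrade linkerd-control-plane linkerd/linkerd-control-plane \
  --namespace linkerd \
  --reuse-values \
  --set networkValidator.connectAddr="<kubernetes_api>:6443" \
  --set networkValidator.timeout=30s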

@alpeb
Member

alpeb commented Apr 10, 2024

Thanks @matthiasdeblock for circling back. With version 2.14.9 we have added a cni-repair-controller component that should detect race conditions between the cluster's CNI and linkerd-cni. You can enable it via the linkerd2-cni chart value repairController.enabled=true.
If that doesn't do the trick, there's another fix in linkerd/linkerd2-proxy-init#360 that might work for you, so please let me know and I can provide an image to test that out.
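
A sketch of enabling that value, assuming linkerd-cni was installed via the linkerd2-cni Helm chart as a release named "linkerd-cni" in the linkerd-cni namespace:

# Enable the cni-repair-controller (available from 2.14.9).
helm upgrade linkerd-cni linkerd/linkerd2-cni \
  --namespace linkerd-cni \
  --reuse-values \
  --set repairController.enabled=true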

@matthiasdeblock
Author

Hi

The cni-repair-controller just keeps restarting the linkerd control plane. This isn't fixing the issue.

You have linked linkerd/linkerd2-proxy-init#362 as well; could this be the issue we are running into?

Regards
Matthias

@alpeb
Member

alpeb commented Apr 11, 2024

I linked linkerd/linkerd2-proxy-init#362 by mistake. That should be unrelated unless you're using native sidecars too.
I've published the image ghcr.io/alpeb/cni-plugin:modify with the change from linkerd/linkerd2-proxy-init#360. It would be great if you could give that a try.
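
A sketch of pointing the linkerd2-cni chart at that test image; the image.name / image.version value names are an assumption here, so check them against the chart version you have installed:

# Override the cni-plugin image with the test build referenced above.
# The value names (image.name / image.version) are assumed, not verified.
helm upgrade linkerd-cni linkerd/linkerd2-cni \
  --namespace linkerd-cni \
  --reuse-values \
  --set image.name=ghcr.io/alpeb/cni-plugin \
  --set image.version=modify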

@matthiasdeblock
Author

Hi
I have tested the image you provided but it still throws me the same error:

flatcar-k8stest-master-01 ~ # kubectl -n linkerd logs linkerd-destination-6c49f479d8-946ww -c linkerd-network-validator
2024-04-19T09:51:52.016640Z  INFO linkerd_network_validator: Listening for connections on 0.0.0.0:4140
2024-04-19T09:51:52.016683Z DEBUG linkerd_network_validator: token="CMiU50KsdnCBztqVH5xUXcVHqbfhqE960BJEpwoj5GTJLiftg9qQJ3JmT6KLssx\n"
2024-04-19T09:51:52.016689Z  INFO linkerd_network_validator: Connecting to 172.24.214.93:6443
2024-04-19T09:51:52.018141Z DEBUG connect: linkerd_network_validator: Connected client.addr=10.12.71.143:44382
2024-04-19T09:52:02.017854Z ERROR linkerd_network_validator: Failed to validate networking configuration. Please ensure iptables rules are rewriting traffic as expected. timeout=10s

@matthiasdeblock
Author

@mateiidavid, any news on this one?

@mateiidavid
Member

@matthiasdeblock sorry, I think this was closed automatically when I hit the merge button on the PR above. Since it did not fix your issue, I'm going to re-open this.

@matthiasdeblock
Author

Hi @mateiidavid
Any news on this one?

I have changed the timeout from 10s to 60s and now I am getting a different error:

flatcar-k8stest-master-01 ~ # kubectl -n linkerd logs linkerd-destination-f7b89b9db-qjxb7 -c linkerd-network-validator -f
2024-06-06T09:52:05.455607Z  INFO linkerd_network_validator: Listening for connections on 0.0.0.0:4140
2024-06-06T09:52:05.455666Z DEBUG linkerd_network_validator: token="8NAaWTB0bQ7E5FcrUPpyWs8OOdpq1xnlMJElrWZ9RrN3ssRWdPSvVVBDwnykGOQ\n"
2024-06-06T09:52:05.455762Z  INFO linkerd_network_validator: Connecting to 172.24.214.93:6443
2024-06-06T09:52:05.456775Z DEBUG connect: linkerd_network_validator: Connected client.addr=10.12.69.107:47754
2024-06-06T09:52:37.458317Z DEBUG connect: linkerd_network_validator: Read message from server bytes=0
2024-06-06T09:52:37.458513Z DEBUG linkerd_network_validator: data="" size=0
2024-06-06T09:52:37.458543Z ERROR linkerd_network_validator: error=expected client to receive "8NAaWTB0bQ7E5FcrUPpyWs8OOdpq1xnlMJElrWZ9RrN3ssRWdPSvVVBDwnykGOQ\n"; got "" instead

So it is still connecting to the same address, 172.24.214.93:6443, which is our Kubernetes API, but it is now throwing a different error...

Thank you!
Regards
Matthias

@matthiasdeblock
Author

Hi

I have upgraded Linkerd to the latest edge-24.5.5 and the CNI plugin to 1.5.0, and have set the timeout to 30s. Still the same issue:

flatcar-k8stest-master-01 ~ # kubectl -n linkerd logs linkerd-destination-749d567f64-rnmhl -c linkerd-network-validator -f
2024-06-06T12:02:02.055672Z  INFO linkerd_network_validator: Listening for connections on 0.0.0.0:4140
2024-06-06T12:02:02.055715Z DEBUG linkerd_network_validator: token="FxnawK939yIxs5SAvEnQ9ii4QLecvKoWZRgGRMgOcrzwwRaWCyIbaxzorU79K5G\n"
2024-06-06T12:02:02.055729Z  INFO linkerd_network_validator: Connecting to 172.24.214.93:6443
2024-06-06T12:02:02.057521Z DEBUG connect: linkerd_network_validator: Connected client.addr=10.12.69.211:47982
2024-06-06T12:02:32.057580Z ERROR linkerd_network_validator: Failed to validate networking configuration. Please ensure iptables rules are rewriting traffic as expected. timeout=30s

@matthiasdeblock
Author

matthiasdeblock commented Jun 7, 2024

Hi

I've been looking into this myself a bit more, and I found the issue. It seems Cilium needed this config:

cni.exclusive=false

cni-exclusive: "false"

What this setting controls (from the Cilium docs): make Cilium take ownership over the /etc/cni/net.d directory on the node, renaming all non-Cilium CNI configurations to *.cilium_bak. This ensures no Pods can be scheduled using other CNI plugins during Cilium agent downtime. With exclusive mode enabled, Cilium re-asserts ownership of the conflist and the linkerd-cni entry gets dropped, which would explain why it never showed up in the file above; disabling it (false) fixes that.

Source: https://docs.cilium.io/en/stable/helm-reference/
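
A sketch of applying that setting, assuming a Helm-managed Cilium release named "cilium" in kube-system and the default linkerd-cni DaemonSet name:

# Stop Cilium from claiming exclusive ownership of /etc/cni/net.d, so the
# linkerd-cni configuration is no longer removed or renamed to *.cilium_bak.
helm upgrade cilium cilium/cilium \
  --namespace kube-system \
  --reuse-values \
  --set cni.exclusive=false

# Restart the agents so the conflist is rewritten with both plugins.
kubectl -n kube-system rollout restart daemonset/cilium
kubectl -n linkerd-cni rollout restart daemonset/linkerd-cni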

@alpeb
Member

alpeb commented Jun 25, 2024

Thanks for the feedback @matthiasdeblock! I've confirmed the fix and pushed some updates to our docs.

alpeb added a commit to linkerd/website that referenced this issue Jul 8, 2024
* Add notes about Cilium's exclusive mode

Closes linkerd/linkerd2#11597

Co-authored-by: Flynn <[email protected]>
Co-authored-by: William Morgan <[email protected]>
github-actions bot locked as resolved and limited conversation to collaborators Aug 8, 2024