RETRY_JOIN fails after server comes back up - it's always DNS! #1253
Comments
Details pushed to https://github.com/fopina/delme/tree/main/dkron_retry_join_does_not_reresolve_dns to make reproducing easier. I'd assume this won't happen with an HA / multi-server setup, as a server that changes IP will retry-join the other servers by itself (and the new IP will then be shared with everyone). I haven't tested that, though, and I think this is still a valid bug since the single-server setup is documented. |
after taking a look at But looking at
But now the agents will "see" the server, but they do not
|
@yvanoers just in case you're still around, would you have any comment on this one? I've tried debugging, but I believe the part that handles reconnection (and does not re-resolve DNS) is within the serf library, not dkron. I couldn't find any workaround at all... I've tried modifying |
I'm not that well-versed in the internals of serf, but you could very well be right that this is a serf-related issue. |
This is an old, known issue. It's caused by how Raft handles nodes, it affects any dynamic-IP system like k8s, and it should be fixable. I need to dig into it. It's really annoying, so expect that I'll try to allocate time for this soon. |
That's awesome! I made an attempt to trace it but failed... Even with multiple servers, the workaround I mentioned, it still doesn't work. Using a low serf reconnect timeout kicks the server out and never allows it back in. |
I took a deeper look into this. It's not related to Raft but to what you mentioned: Serf is not re-resolving the hostname but reusing the existing IP. It's always DNS :) I need to investigate a bit more to come up with a workaround that doesn't involve restarting the agents. |
Gentle reminder this is still happening 🩸 😄 |
Hi, did anyone find a solution to this? |
I didn't, and it's really annoying. I ended up setting up log alerts (as I have logs in Loki) and killing all agents when the issue starts popping up... A really bad workaround but, in my case, I prefer to break some ongoing jobs rather than run none until I restart manually... |
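For anyone wiring up a similar alert, here is a rough sketch of a matcher for the `removing server` line quoted in the bug description. The regular expression is an assumption based on that single log line, and `parseRemovedServer` is a hypothetical helper, not something dkron ships.

```go
package main

import (
	"fmt"
	"regexp"
)

// removedServer matches the agent log message emitted when a server is
// dropped. The pattern is derived from one sample line only.
var removedServer = regexp.MustCompile(`removing server (\S+) \(Addr: ([^)]+)\)`)

// parseRemovedServer extracts the server name and address from a log line.
// ok is false when the line is not a "removing server" message.
func parseRemovedServer(line string) (name, addr string, ok bool) {
	m := removedServer.FindStringSubmatch(line)
	if m == nil {
		return "", "", false
	}
	return m[1], m[2], true
}

func main() {
	line := `level=info msg="removing server dkron1 (Addr: 10.5.0.20:6868) (DC: dc1)" node=...`
	if name, addr, ok := parseRemovedServer(line); ok {
		fmt.Printf("server %s at %s was removed; agents may need a restart\n", name, addr)
	}
}
```

What happens after a match (alerting, restarting agents) is left out on purpose, since that depends entirely on the deployment.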
Thanks @fopina for your reply ! |
Agents have no data, it's all in the server(s). |
@fopina can you check against v4-beta? |
I already did @vcastellm : #1442 (comment) It didn't work though :/ |
Hi, |
@jaccky @vcastellm maybe that is what the other issue/PR refers to (missing leader elections), though the issue I opened is not about the leader. I have a single-server setup, and if the container restarts, the worker nodes will not reconnect (as the server changed IP but not hostname). The server itself comes back up and resumes as leader (as a single node). It sounds similar, but maybe it's in a slightly different place in the code? (One is about server nodes reconnecting to the one that changed IP, and mine is about worker nodes reconnecting to the same server that changed IP.) |
@fopina Hey there! Have you faced any issues when running more than one dkron server? AFAIK, retry join is a finite process in dkron. Here's what typically happens when deploying dkron in such a configuration:
While a DNS solution might work, there could be other approaches to consider. For example, if the agent receives a server leave event and there are no known dkron server nodes, it could initiate a retry-joining process on the dkron agent. I'm not very familiar with the dkron backend, so I'd like to ask @vcastellm to validate this information. |
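The event-driven idea above could look roughly like the sketch below. Everything here is hypothetical: `serverTracker` and its methods are illustrative names rather than dkron or serf types, and the `rejoin` callback stands in for whatever would restart the retry-join process.

```go
package main

import "fmt"

// serverTracker is a hypothetical helper that remembers which servers an
// agent currently knows about and triggers a rejoin when none are left.
type serverTracker struct {
	servers map[string]string // server name -> last known address
	rejoin  func()            // called when the last known server leaves
}

func newServerTracker(rejoin func()) *serverTracker {
	return &serverTracker{servers: make(map[string]string), rejoin: rejoin}
}

// onJoin records a server seen in a member-join or member-update event.
func (s *serverTracker) onJoin(name, addr string) {
	s.servers[name] = addr
}

// onLeave forgets a server and starts a rejoin if it was the last one.
func (s *serverTracker) onLeave(name string) {
	if _, known := s.servers[name]; !known {
		return
	}
	delete(s.servers, name)
	if len(s.servers) == 0 && s.rejoin != nil {
		s.rejoin()
	}
}

func main() {
	tracker := newServerTracker(func() {
		fmt.Println("no servers left, restarting retry-join")
	})
	tracker.onJoin("dkron1", "10.5.0.20:6868")
	tracker.onLeave("dkron1") // prints the message above
}
```

Ignoring leave events for unknown servers avoids triggering a rejoin more than once for the same departure.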
I believe that is not correct: the nodes do keep trying to rejoin at the serf layer, but they keep only the resolved IP and do not re-resolve. Regarding multiple server nodes: yes, I used to run a 3-server cluster, but the leader election / raft issues were so frequent that the HA setup had more downtime than a single server node hehe |
@fopina thanks for the reply! Just to clarify, retry join is not a feature of the serf layer itself. Instead, it's an abstraction within dkron. You can find the implementation details in the dkron source code at this link: retry_join. This method is invoked only when a dkron server or agent starts up.
So, I reproduced the issue in a k8s environment. I started one dkron server and one dkron agent, then removed the retry join property from the dkron server configuration. Here's how the configuration looked:
- "--retry-join=\"provider=k8s label_selector=\"\"app.kubernetes.io/instance={{ .Release.Name }}\"\" namespace=\"\"{{ .Release.Namespace }}\"\"\""
After removing the retry join property and restarting the dkron server, the dkron agent produced the following logs (like yours):
The issue is not reproducible when the retry join property is present in the dkron server configuration. With this property, the dkron server is able to discover the dkron agent. Consequently, the dkron agent receives an update event rather than only a member leave event. Below are the logs from the dkron agent:
It appears that you can try adding the dkron-agent DNS name to the retry-join configuration in the dkron-server as a workaround. |
@ivan-kripakov-m10 could you highlight the differences of your test with the configuration I posted in the issue itself? It's using retry-join and DNS name. |
@fopina no, the issue itself is not fixed in v4 yet :( services.server.environment.DKRON_RETRY_JOIN: {{dkron-agents-dns-names}} |
@ivan-kripakov-m10 oh got it! Good point, I’ll test in my setup, might be worth it even if it causes some network “noise”! |
So, I did a bit of digging into how serf works and if we can use DNS names with it. Here's what I found:
At first glance it seems that we can't solve this problem in the serf layer and have to implement something within dkron. |
@ivan-kripakov-m10 thank you very much! As I'm using docker swarm, adding @vcastellm I think this issue still makes sense (as agents DO retry to join but without re-resolving hostname, so looks like a bug) but feel free to close it, Ivan's workaround is more than acceptable |
Describe the bug
After both server and agents are up and cluster is running smoothly, if the server goes down and comes back up with a different IP (but same hostname), agents do not reconnect.
To Reproduce
docker-compose.yml
docker kill the server
level=info msg="removing server dkron1 (Addr: 10.5.0.20:6868) (DC: dc1)" node=...
10.5.0.22
and run docker compose up -d to re-create the server
Expected behavior
Agents would eventually retry joining on hostname, picking up the new IP.
Additional context
I understand serf or raft might be tricky with DNS, but in this case the server does start up with proper access to data/log and no corruption. And if I restart the agents, they reconnect just fine.
It seems the retry simply keeps using the IP from the first join instead of re-resolving the hostname.
To reproduce the issue, I'm forcing the IP change here, but when running in docker swarm (and I assume in k8s as well) a new IP upon service re-creation is expected when fixed IPs are not used.
Is this something easy to fix?
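To illustrate the symptom described above, here is a small sketch of how one might check whether the address an agent joined with still matches what the hostname resolves to. `addrIsStale` is a hypothetical helper, and the sample address is the one from the agent log in the reproduction steps; nothing here is taken from dkron or serf.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// addrIsStale reports whether the IP in joinedAddr (host:port form) is
// missing from the addresses the hostname currently resolves to.
func addrIsStale(joinedAddr string, resolved []string) (bool, error) {
	host, _, err := net.SplitHostPort(joinedAddr)
	if err != nil {
		return false, err
	}
	ip := net.ParseIP(host)
	if ip == nil {
		return false, fmt.Errorf("not an IP address: %q", host)
	}
	for _, r := range resolved {
		if ip.Equal(net.ParseIP(r)) {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// "localhost" is used only so the example runs anywhere; in the
	// reproduction setup this would be the server's hostname.
	resolved, err := net.DefaultResolver.LookupHost(ctx, "localhost")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	stale, err := addrIsStale("10.5.0.20:6868", resolved)
	fmt.Println("stale:", stale, "err:", err)
}
```

Comparing parsed IPs rather than raw strings avoids false positives when the same address is written in different textual forms.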