
[Help] lbagent occasionally fails with "port already in use" #22179

Closed
66545shiwo opened this issue Feb 25, 2025 · 7 comments
Labels
question Further information is requested

Comments


66545shiwo commented Feb 25, 2025

Load balancing: when a listener is created from the web console, default-lbagent occasionally reports that the port is already in use, which brings down the whole haproxy; restarting the pod restores service.
Has anyone run into this bug?

[info 2025-02-25 03:02:14 apihelper.(*APIHelper).doSync.func1(apihelper.go:156)] sync data done, changed: true, elapsed: 172.581971ms
[info 2025-02-25 03:02:14 lbagent.(*ApiHelper).Run(api.go:127)] got new data from api helper
[info 2025-02-25 03:02:14 lbagent.(*ApiHelper).doUseCorpus(api.go:418)] make effect new corpus and params
[info 2025-02-25 03:02:14 lbagent.(*HaproxyHelper).handleUseCorpusCmd.func1(haproxy.go:186)] GenKeepalivedConfigs /opt/cloud/workspace/lbagent/configs/20250225.030214.918.staging
[info 2025-02-25 03:02:14 lbagent.(*HaproxyHelper).reloadHaproxy(haproxy.go:334)] reloading haproxy
[error 2025-02-25 03:02:16 lbagent.(*HaproxyHelper).reloadHaproxy(haproxy.go:339)] reloading haproxy: haproxy: exit status 1
args: -D -p /opt/cloud/workspace/lbagent/run/haproxy.pid -C /opt/cloud/workspace/lbagent/configs/haproxy.conf.d -f /opt/cloud/workspace/lbagent/configs/haproxy.conf.d -sf 107400 -x /opt/cloud/workspace/lbagent/run/haproxy.sock
stdout: 
stderr: [NOTICE]   (108649) : haproxy version is 2.4.22-f8e3218
[ALERT]    (108649) : Starting proxy b6334d0f-dc8c-4050-8a7c-db36f54693de: cannot bind socket (Address in use) [169.254.0.107:30885]
[ALERT]    (108649) : [haproxy.main()] Some protocols failed to start their listeners! Exiting.

[error 2025-02-25 03:02:16 lbagent.(*HaproxyHelper).reloadHaproxy(haproxy.go:344)] killing old haproxy 107400
[info 2025-02-25 03:02:16 lbagent.(*HaproxyHelper).reloadHaproxy(haproxy.go:366)] restarting haproxy
[info 2025-02-25 03:02:16 lbagent.(*HaproxyHelper).reloadGobetween(haproxy.go:388)] stopping gobetween(107401)
[info 2025-02-25 03:02:16 lbagent.(*HaproxyHelper).reloadGobetween(haproxy.go:402)] starting gobetween
[error 2025-02-25 03:02:16 lbagent.(*HaproxyHelper).handleUseCorpusCmd(haproxy.go:226)] useConfigs: haproxy: exit status 1
args: -D -p /opt/cloud/workspace/lbagent/run/haproxy.pid -C /opt/cloud/workspace/lbagent/configs/haproxy.conf.d -f /opt/cloud/workspace/lbagent/configs/haproxy.conf.d
stdout: 
stderr: [NOTICE]   (108701) : haproxy version is 2.4.22-f8e3218
[ALERT]    (108701) : Starting proxy b6334d0f-dc8c-4050-8a7c-db36f54693de: cannot bind socket (Address in use) [169.254.0.107:30885]
[ALERT]    (108701) : [haproxy.main()] Some protocols failed to start their listeners! Exiting.

The puzzling part: if the port were genuinely held by haproxy, restarting the pod should fail the same way, yet after a restart everything works.

version: v3.10.15

66545shiwo added the question label on Feb 25, 2025
66545shiwo (Author) commented Feb 25, 2025

It looks like a port on the LB node's host collided with the listener port; that is what triggers this error. But the conflict is transient: after the lb pod reported the error, the port was no longer in use on the host.

yulongz commented Feb 25, 2025

> It looks like a port on the LB node's host collided with the listener port; that is what triggers this error. But the conflict is transient: after the lb pod reported the error, the port was no longer in use on the host.

This happens when multiple listeners are created serially through the API: the port-conflict error appears sporadically, during the haproxy reload that each creation triggers. Restarting lbagent with the following command brought it back to normal: kubectl rollout restart daemonset default-lbagent -n onecloud

1. lbagent starts cleanly after the restart, which at least shows the port is free by then, i.e. it is not a long-lived port held by some other host service (kubelet, etc.).
2. While listeners are being created through the API, the haproxy reload needs to bind a port that is held by something outside haproxy's control, so the port shows as in use and haproxy errors out.
3. On the error, lbagent kills the old haproxy and restarts it, but the port still shows as occupied. The restart fails, so every other proxied port becomes unreachable as well. (The reload-and-fallback sequence is sketched below.)
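
For reference, the flow that the log lines above trace is roughly: launch a new haproxy with -sf &lt;old pid&gt; and -x &lt;stats socket&gt; for a seamless takeover, and if that exits non-zero, kill the old process and cold-start with the base arguments. Below is a minimal Go sketch of that sequence, using the argument lists from the logs; it is an illustration, not the actual lbagent code.

package main

import (
	"fmt"
	"os/exec"
	"strconv"
	"syscall"
)

// reloadHaproxy sketches the reload-then-fallback flow visible in the
// lbagent logs: seamless reload first, then kill-and-restart on failure.
func reloadHaproxy(oldPid int) error {
	base := []string{
		"-D",
		"-p", "/opt/cloud/workspace/lbagent/run/haproxy.pid",
		"-C", "/opt/cloud/workspace/lbagent/configs/haproxy.conf.d",
		"-f", "/opt/cloud/workspace/lbagent/configs/haproxy.conf.d",
	}
	reload := append(append([]string{}, base...),
		"-sf", strconv.Itoa(oldPid),
		"-x", "/opt/cloud/workspace/lbagent/run/haproxy.sock")
	if err := exec.Command("haproxy", reload...).Run(); err == nil {
		return nil
	}
	// The seamless reload failed (e.g. "cannot bind socket"); fall back
	// to killing the old instance and cold-starting.
	_ = syscall.Kill(oldPid, syscall.SIGTERM)
	return exec.Command("haproxy", base...).Run()
}

func main() {
	// 107400 is the old pid from the logs above; purely illustrative.
	fmt.Println(reloadHaproxy(107400))
}

Note that the fallback inherits the same bind conflict: killing the old haproxy does not release a port held by a socket outside the new process's control, which matches point 3 above.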

yulongz commented Feb 25, 2025

If a load balancer node carries a large number of listeners (say, more than 2000), do the following parameters need tuning?

	ApiLbagentHbInterval          int `default:"10"`
	ApiLbagentHbTimeoutRelaxation int `default:"120" help:"If agent is to stale out in specified seconds in the future, consider it staled to avoid race condition when doing incremental api data fetch"`
	ApiSyncIntervalSeconds        int `default:"10"`
	ApiRunDelayMilliseconds       int `default:"10"`
	ApiListBatchSize              int `default:"1024"`
	DataPreserveN                 int `default:"8" help:"number of recent data to preserve on disk"`

66545shiwo (Author) commented Feb 25, 2025

> It looks like a port on the LB node's host collided with the listener port; that is what triggers this error. But the conflict is transient: after the lb pod reported the error, the port was no longer in use on the host.

Reproduced the problem (cc @swordqiu):

[error 2025-02-25 12:50:26 lbagent.(*HaproxyHelper).reloadHaproxy(haproxy.go:348)] restart haproxy: haproxy: exit status 1
args: -D -p /opt/cloud/workspace/lbagent/run/haproxy.pid -C /opt/cloud/workspace/lbagent/configs/haproxy.conf.d -f /opt/cloud/workspace/lbagent/configs/haproxy.conf.d
stdout:
stderr: [NOTICE]   (3269567) : haproxy version is 2.4.22-f8e3218
[ALERT]    (3269567) : Starting proxy 4fadf650-4fea-48b4-8abf-ac956e9fba69: cannot bind socket (Address in use) [169.254.0.101:58824]
[ALERT]    (3269567) : [haproxy.main()] Some protocols failed to start their listeners! Exiting.

Inside the container, an ESTABLISHED connection was found still holding the port:

bash-5.1# netstat -anp | grep 58824
tcp        0      0 169.254.0.101:58824     169.254.0.100:2774      ESTABLISHED 3243981/haproxy

bash-5.1# ps -ef|grep 3243981 
3243981 root      0:00 haproxy -D -p /opt/cloud/workspace/lbagent/run/haproxy.pid -C /opt/cloud/workspace/lbagent/configs/haproxy.conf.d -f /opt/cloud/workspace/lbagent/configs/haproxy.conf.d -sf 3243938 -x /opt/cloud/workspace/lbagent/run/haproxy.sock

As a result, haproxy kept failing to restart, over and over.
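
The netstat output explains the loop: port 58824 is held not by a listener but by an ESTABLISHED connection, here the old haproxy's own proxy-to-backend connection, whose local source port the kernel picked from the ephemeral range. On Linux, a port in use as the source of an outgoing connection cannot be bound again as a listener. A self-contained Go demonstration of the same "address already in use" failure (illustrative only, unrelated to the lbagent code):

package main

import (
	"fmt"
	"net"
)

func main() {
	// A throwaway server so the outbound dial below has a peer.
	srv, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer srv.Close()

	// Outbound connection: the kernel assigns an ephemeral source port,
	// just as it does for haproxy's connections to backend servers.
	out, err := net.Dial("tcp", srv.Addr().String())
	if err != nil {
		panic(err)
	}
	defer out.Close()
	port := out.LocalAddr().(*net.TCPAddr).Port
	fmt.Println("ephemeral source port:", port)

	// While that connection is ESTABLISHED, binding a listener to the
	// same address and port fails with "address already in use" --
	// the same ALERT haproxy printed for 169.254.0.101:58824.
	_, err = net.Listen("tcp", fmt.Sprintf("127.0.0.1:%d", port))
	fmt.Println("listen:", err)
}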

swordqiu (Member)

@yulongz Is this a VPC LB?

yulongz commented Feb 26, 2025

> @yulongz Is this a VPC LB?

It is the lbagent for the virtual machines inside the VPC.

yulongz commented Feb 27, 2025

Restricting the port range solved the problem. This issue can be closed.
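
The thread does not record exactly which knob was changed, but the standard Linux approach is to keep the ephemeral source-port range (net.ipv4.ip_local_port_range, optionally together with net.ipv4.ip_local_reserved_ports) disjoint from the listener ports, so an outgoing connection can never grab a port a listener will need. A small Go check along those lines, assuming the usual /proc path (a hypothetical helper, not part of lbagent):

package main

import (
	"fmt"
	"os"
)

func main() {
	// Read the kernel's ephemeral port range, e.g. "32768\t60999".
	data, err := os.ReadFile("/proc/sys/net/ipv4/ip_local_port_range")
	if err != nil {
		panic(err)
	}
	var lo, hi int
	if _, err := fmt.Sscanf(string(data), "%d %d", &lo, &hi); err != nil {
		panic(err)
	}

	// Listener ports from the logs above; any port inside [lo, hi]
	// can be taken as an outgoing source port and later fail to bind.
	for _, port := range []int{30885, 58824} {
		if port >= lo && port <= hi {
			fmt.Printf("port %d is inside ephemeral range %d-%d: collision possible\n", port, lo, hi)
		} else {
			fmt.Printf("port %d is outside ephemeral range %d-%d: safe\n", port, lo, hi)
		}
	}
}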
