Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1.10 版本消费kafka 一段时间后不继续消费报错 #239

Closed
zcola opened this issue Dec 14, 2023 · 21 comments
Closed

1.10 版本消费kafka 一段时间后不继续消费报错 #239

zcola opened this issue Dec 14, 2023 · 21 comments

Comments

@zcola
Copy link

zcola commented Dec 14, 2023

报错日志,1.10 版本自己开了cgo 编译的,为了用让自己写的pulsar input 插件,大概一年前的版本gohangout 消费kafka 没有任何问题我们用了很久了,pulsar input 插件我们也用了一段时间了没有问题,昨天问题发生再配置文件写了kafka input 和 pulsar input 情况 ,pulsar input 没有任何写入流量还没有切,重启gohangout 后kafka消费恢复

kafka 版本
1.1.1

I1213 20:14:16.105228       1 simple_consumer.go:135] coordinator for group[logstash__flink_yarn_logtail_20231206new]:canal-kafka05-kiel.:9092
E1213 20:14:16.105299       1 broker.go:241] write tcp 10.191.73.227:57628->7.32.139.221:9092: use of closed network connection
E1213 20:14:16.105322       1 brokers.go:267] get metadata of [kiel__flink_yarn_logtail] from canal-kafka14-kiel.:9092 error: requst of 3(1) to canal-kafka14-kiel.:9092 error: write tcp 10.191.73.227:57628->7.32.139.221:9092: use of closed network connection
E1213 20:14:16.105371       1 broker.go:241] write tcp 10.191.73.227:37138->7.32.139.228:9092: use of closed network connection
E1213 20:14:16.105420       1 brokers.go:267] get metadata of [kiel__flink_yarn_logtail] from canal-kafka21-kiel.:9092 error: requst of 3(1) to canal-kafka21-kiel.:9092 error: write tcp 10.191.73.227:37138->7.32.139.228:9092: use of closed network connection
E1213 20:14:16.105378       1 broker.go:241] write tcp 10.191.73.227:44158->7.32.139.210:9092: use of closed network connection
E1213 20:14:16.105444       1 brokers.go:267] get metadata of [kiel__flink_yarn_logtail] from canal-kafka03-kiel.:9092 error: requst of 3(1) to canal-kafka03-kiel.:9092 error: write tcp 10.191.73.227:44158->7.32.139.210:9092: use of closed network connection
I1213 20:14:16.107316       1 simple_consumer.go:158] leader ID of [kiel__flink_yarn_logtail][36] is 4
I1213 20:14:16.107335       1 simple_consumer.go:158] leader ID of [kiel__flink_yarn_logtail][32] is 24
I1213 20:14:16.107467       1 simple_consumer.go:158] leader ID of [kiel__flink_yarn_logtail][35] is 3
I1213 20:14:16.107490       1 simple_consumer.go:158] leader ID of [kiel__flink_yarn_logtail][34] is 2
I1213 20:14:16.107515       1 simple_consumer.go:158] leader ID of [kiel__flink_yarn_logtail][40] is 1
I1213 20:14:16.107494       1 simple_consumer.go:158] leader ID of [kiel__flink_yarn_logtail][30] is 22
I1213 20:14:16.109289       1 simple_consumer.go:158] leader ID of [kiel__flink_yarn_logtail][29] is 21
I1213 20:14:16.109405       1 simple_consumer.go:158] leader ID of [kiel__flink_yarn_logtail][28] is 13
I1213 20:14:16.111240       1 simple_consumer.go:158] leader ID of [kiel__flink_yarn_logtail][39] is 7
I1213 20:14:16.111301       1 simple_consumer.go:171] got leader broker 22:canal-kafka22-kiel.:9092 for kiel__flink_yarn_logtail-30
I1213 20:14:16.111467       1 simple_consumer.go:158] leader ID of [kiel__flink_yarn_logtail][31] is 23
I1213 20:14:16.113261       1 simple_consumer.go:158] leader ID of [kiel__flink_yarn_logtail][38] is 6
I1213 20:14:16.118258       1 simple_consumer.go:171] got leader broker 6:canal-kafka06-kiel.:9092 for kiel__flink_yarn_logtail-38
I1213 20:14:16.160326       1 simple_consumer.go:171] got leader broker 3:canal-kafka03-kiel.:9092 for kiel__flink_yarn_logtail-35
I1213 20:14:16.163384       1 simple_consumer.go:171] got leader broker 1:canal-kafka01-kiel.:9092 for kiel__flink_yarn_logtail-40
I1213 20:14:16.164828       1 simple_consumer.go:171] got leader broker 21:canal-kafka21-kiel.:9092 for kiel__flink_yarn_logtail-29
I1213 20:14:16.172641       1 simple_consumer.go:171] got leader broker 24:canal-kafka24-kiel.:9092 for kiel__flink_yarn_logtail-32
I1213 20:14:16.173089       1 simple_consumer.go:171] got leader broker 7:canal-kafka07-kiel.:9092 for kiel__flink_yarn_logtail-39
I1213 20:14:16.175015       1 simple_consumer.go:171] got leader broker 4:canal-kafka04-kiel.:9092 for kiel__flink_yarn_logtail-36
I1213 20:14:16.202612       1 simple_consumer.go:171] got leader broker 2:canal-kafka02-kiel.:9092 for kiel__flink_yarn_logtail-34
I1213 20:14:16.204897       1 simple_consumer.go:171] got leader broker 23:canal-kafka23-kiel.:9092 for kiel__flink_yarn_logtail-31
I1213 20:14:16.223705       1 simple_consumer.go:171] got leader broker 13:canal-kafka13-kiel.:9092 for kiel__flink_yarn_logtail-28
I1213 20:14:16.307683       1 simple_consumer.go:158] leader ID of [kiel__flink_yarn_logtail][33] is 18
I1213 20:14:16.307716       1 simple_consumer.go:158] leader ID of [kiel__flink_yarn_logtail][41] is 2
I1213 20:14:16.307973       1 simple_consumer.go:158] leader ID of [kiel__flink_yarn_logtail][37] is 5
I1213 20:14:16.336300       1 simple_consumer.go:171] got leader broker 5:canal-kafka05-kiel.:9092 for kiel__flink_yarn_logtail-37
I1213 20:14:16.387292       1 simple_consumer.go:171] got leader broker 2:canal-kafka02-kiel.:9092 for kiel__flink_yarn_logtail-41
I1213 20:14:16.406057       1 simple_consumer.go:171] got leader broker 18:canal-kafka18-kiel.:9092 for kiel__flink_yarn_logtail-33
E1213 20:15:15.360490       1 broker.go:241] write tcp 10.191.73.227:42444->7.32.139.223:9092: use of closed network connection
E1213 20:15:15.360566       1 brokers.go:267] get metadata of [kiel__flink_yarn_logtail] from canal-kafka16-kiel.:9092 error: requst of 3(1) to canal-kafka16-kiel.:9092 error: write tcp 10.191.73.227:42444->7.32.139.223:9092: use of closed network connection
E1213 20:15:15.560618       1 broker.go:241] write tcp 10.191.73.227:53490->7.32.139.214:9092: use of closed network connection
E1213 20:15:15.560663       1 brokers.go:267] get metadata of [kiel__flink_yarn_logtail] from canal-kafka07-kiel.:9092 error: requst of 3(1) to canal-kafka07-kiel.:9092 error: write tcp 10.191.73.227:53490->7.32.139.214:9092: use of closed network connection
E1213 20:15:15.565207       1 broker.go:241] write tcp 10.191.73.227:37138->7.32.139.228:9092: use of closed network connection
E1213 20:15:15.565273       1 brokers.go:267] get metadata of [kiel__flink_yarn_logtail] from canal-kafka21-kiel.:9092 error: requst of 3(1) to ca
@zcola
Copy link
Author

zcola commented Dec 14, 2023

#82 有点像这个问题

grep join|grep 20:
E1213 20:14:14.485455       1 group_consumer.go:461] failed to send heartbeat, restart: parse response of 12(0) from canal-kafka05-kieldsdet:9092 error: The group is rebalancing, so a rejoin is needed.
I1213 20:14:14.493263       1 group_consumer.go:174] try to join group logstash_gdc_flink_yarn_logtail_20231206new
E1213 20:32:47.451438       1 group_consumer.go:461] failed to send heartbeat, restart: parse response of 12(0) from canal-kafka05-kiel.i.ddset:9092 error: The group is rebalancing, so a rejoin is needed.
I1213 20:32:47.452590       1 group_consumer.go:174] try to join group logstash_gdc_flink_yarn_logtail_20231206new
E1213 20:33:16.708744       1 group_consumer.go:461] failed to send heartbeat, restart: parse response of 12(0) from canal-kafka05-kdsdnet:9092 error: The group is rebalancing, so a rejoin is needed.

@zcola
Copy link
Author

zcola commented Dec 25, 2023

1.8.2 没这个问题,我编译一个c go 凑合用着先

@XuHandsome
Copy link

我用master编译了还是会出现这个问题

E0403 10:27:31.638795       1 broker.go:241] write tcp 172.16.201.32:55248->192.168.2.2:9092: use of closed network connection
E0403 10:27:31.638872       1 brokers.go:267] get metadata of [filebeat-daemonset-java-app] from 10.88.11.142:9092 error: requst of 3(7) to 192.168.2.2:9092 error: write tcp 172.16.201.32:55248->192.168.2.2:9092: use of closed network connection

@childe
Copy link
Owner

childe commented Apr 3, 2024

@XuHandsome 给一些更多的报错我看一下?如果你没有使用更多参数,日志默认写到 /tmp 目录下面了。

@XuHandsome
Copy link

@XuHandsome 给一些更多的报错我看一下?如果你没有使用更多参数,日志默认写到 /tmp 目录下面了。

启动参数

/app/gohangout --config /app/config.yml -alsologtostderr --reload -prometheus 0.0.0.0:19119

复现问题

  1. kafka单机部署, topic单分片时未出现
  2. kafka三节点kraft集群部署, topic三分片,出现错误,经过排查网络没有问题
  3. 有个情况说明一下不知道是不是引发问题的原因,就是单机部署单分片的topic是断断续续有消息打进来的, 而集群部署三分片的topic中至今还没有开始往里面生产日志消息。

配置文件

inputs:
    - Kafka:
        topic:
            filebeat-daemonset-java-app: 1
        codec: json
        consumer_settings:
            bootstrap.servers: log-kafka.ohops.org:9092
            group.id: hangout.javapp
            from.beginning: 'true'

日志文件

/tmp打包见附件
logs.tgz

@XuHandsome
Copy link

@childe 找到了kafka-go项目中这个issue,不知道是不是一样的情况

@XuHandsome
Copy link

另外 healer可以关掉了, 我的问题重复了

@childe
Copy link
Owner

childe commented Apr 3, 2024

从日志看现象是这样的

  1. hearbeat失败,原因是 group 在 rebalance,所以我们这个消费者也要先停下,再重新加入 group,以便等待新的 partition assign
  2. join 成功,分到了新的 partition(因为其实只有一个消费者,所以分到了全部三个 partition)
  3. 准备开始消费
  4. heartbeat 又失败了,回到1,循环往复。

原因我还不能确定,明明已经重复 join & sync 成功返回了,下一次 heartbeat 又报错了。大概率是代码某些地方有逻辑问题。
你说的没有往三分片的 Topic 写数据,应该不是原因。

  1. 确认一下你的kafka 版本是什么?
  2. 启动的时候 ,添加 -v 5 打印更详细的日志看一下。

@XuHandsome
Copy link

  1. kafka_2.13-3.7.0
  2. 已经添加-v 5启动,后续有日志我会同步过来

想问下:
我的部署方案:topic三分片, gohangout 配置input线程为1, 容器化运行在k8s中,另外给pod配置hpa,允许当pod cpu或内存较高时横向弹出来新的示例, 最多三个,不知道这样实践是否有问题

@childe
Copy link
Owner

childe commented Apr 3, 2024

  1. kafka_2.13-3.7.0
  2. 已经添加-v 5启动,后续有日志我会同步过来

想问下: 我的部署方案:topic三分片, gohangout 配置input线程为1, 容器化运行在k8s中,另外给pod配置hpa,允许当pod cpu或内存较高时横向弹出来新的示例, 最多三个,不知道这样实践是否有问题

合理,没问题。

@XuHandsome
Copy link

从日志看现象是这样的

  1. hearbeat失败,原因是 group 在 rebalance,所以我们这个消费者也要先停下,再重新加入 group,以便等待新的 partition assign
  2. join 成功,分到了新的 partition(因为其实只有一个消费者,所以分到了全部三个 partition)
  3. 准备开始消费
  4. heartbeat 又失败了,回到1,循环往复。

原因我还不能确定,明明已经重复 join & sync 成功返回了,下一次 heartbeat 又报错了。大概率是代码某些地方有逻辑问题。 你说的没有往三分片的 Topic 写数据,应该不是原因。

  1. 确认一下你的kafka 版本是什么?
  2. 启动的时候 ,添加 -v 5 打印更详细的日志看一下。

@childe

启动命令:
/app/gohangout --config /app/config.yml -alsologtostderr --reload -prometheus 0.0.0.0:19119 -v 5 -log_dir /app/logs

又出现了报错

failed to send heartbeat, restart: parse response of 12(0) from 10.88.11.143:9092 error: The group is rebalancing, so a rejoin is needed

但是我看日志好像没有更详细。 附件ERROR级别日志中是有的, 而且我容器同步挂掉重启了一下, 这两天特别频繁
logdir.tgz

@XuHandsome
Copy link

看了下 这没有没有复现~ 只是正常的重新加入group,并且成功地继续消费日志了,。 但是我容器老莫名其妙就挂了, 看cpu内存也并没有出现问题,日志也没有打什么有用的信息 ,我再查查吧

@XuHandsome
Copy link

来了,这次是复现了, 运行一段时间就会报错,直至日志打满 pod重启,我启动了三个gohangout实例消费一个三分片topic,其中有一个或者多个实例会出现报错,日志见附件
logs.tgz

@XuHandsome
Copy link

重启实例后, 消息滞后持续上升, 但是实例消费速率一直上不去,就几条几条地跑,之前正常的时候遇到消费积压 是能跑满的
image
image

@childe
Copy link
Owner

childe commented Jun 11, 2024

来了,这次是复现了, 运行一段时间就会报错,直至日志打满 pod重启,我启动了三个gohangout实例消费一个三分片topic,其中有一个或者多个实例会出现报错,日志见附件 logs.tgz

这个BUG我看一下怎么回事

@childe
Copy link
Owner

childe commented Jun 11, 2024

重启实例后, 消息滞后持续上升, 但是实例消费速率一直上不去,就几条几条地跑,之前正常的时候遇到消费积压 是能跑满的 image image

  1. 看下gohangout配置
  2. 你看下gohangput CPU跑到多少,是不是Grok里面的正则把CPU打满了

@XuHandsome
Copy link

@childe 方便留个联系方式吗, 加交流群没通过

@xjxloveqsl
Copy link

我也遇到了同样的问题

@childe
Copy link
Owner

childe commented Sep 3, 2024

healer后来有升级,先关闭Issue。建议升一下。有需要再新开一个Issue看吧。

@childe childe closed this as completed Sep 3, 2024
@XuHandsome
Copy link

healer后来有升级,先关闭Issue。建议升一下。有需要再新开一个Issue看吧。

前面升级到1.10.5还是会有这个问题, 今天升到release1.10.7已经没再出现了

@zcola
Copy link
Author

zcola commented Sep 12, 2024

好,我有空升级下,先关了

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants