
Kafka Debezium connectors losing connection to Aurora DB, taking exactly 14 minutes to recover

0 votes
0 answers
10 views
I am running a Strimzi Kafka setup in production on Kubernetes: 5 Connect pods, 1 Strimzi operator, and ~9 connectors distributed across those 5 pods. Some pods log the error below, and only the connectors hosted on those pods go down; they recover on their own after exactly 14 minutes:

```
INFO: Keepalive: Trying to restore lost connection to aurora-db-prod.cluster-randomstring.us-east-1.rds.amazonaws.com:3306
```

Pods that do not log this message keep receiving messages just fine. Usually this hits 3 pods and ends up impacting 6-7 connectors; after exactly 14 minutes everything recovers without losing any data, resuming from the last committed offset. One of these connectors is business-critical, and we cannot afford it being down for more than 3 minutes, as that is customer-impacting.

ChatGPT recommended adding the settings below for faster recovery, but they didn't help either:

```yaml
# ---- Fast Recovery Timeouts ----
database.connectionTimeout.ms: 10000      # Fail connection attempts fast (default: 30000)
database.connect.backoff.max.ms: 30000    # Cap retry gap to 30s (default: 120000)

# ---- Connector-Level Retries ----
connect.max.retries: 30                   # 30 restart attempts (default: 3)
connect.backoff.initial.delay.ms: 1000    # Small delay before restart
connect.backoff.max.delay.ms: 8000        # Cap restart backoff to 8s (default: 60000)
retriable.restart.connector.wait.ms: 5000
```
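For what it's worth, I could not find most of those suggested property names in the Debezium MySQL connector documentation. The documented options that seem to cover the same ground are these (a sketch based on my reading of the 2.x docs, not yet tested in our cluster):

```yaml
# Sketch only: connection-handling options the Debezium MySQL docs actually list;
# names and defaults should be verified against the exact connector version in use.
config:
  connect.timeout.ms: 10000   # max wait when establishing the MySQL connection (default: 30000)
  connect.keep.alive: true    # separate keep-alive thread for the binlog connection (default: true)
```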
Below is the full config of one of my connectors. Note: except for the last few entries, the settings are identical across all connectors; only the last 8 are connector-specific.

```yaml
KafkaConnectors:
  dbc-mysql-tables-connector:
    enabled: true
    annotations: {}
    labels:
      strimzi.io/cluster: debezium-connect-cluster
    spec:
      class: io.debezium.connector.mysql.MySqlConnector
      tasksMax: 1
      autoRestart:
        enabled: true
        maxRestarts: 10
      config:
        database.server.name: mysql_prod_tables
        snapshot.mode: schema_only
        snapshot.locking.mode: none
        topic.creation.enable: true
        topic.creation.default.replication.factor: 3
        topic.creation.default.partitions: 1
        topic.creation.default.compression.type: snappy
        database.history.kafka.topic: schema-changes.prod.mysql
        database.include.list: prod
        snapshot.new.tables: parallel
        tombstones.on.delete: "false"
        topic.naming.strategy: io.debezium.schema.DefaultTopicNamingStrategy
        topic.prefix: main.mysql
        key.converter.schemas.enable: "false"
        value.converter.schemas.enable: "false"
        key.converter: org.apache.kafka.connect.json.JsonConverter
        value.converter: org.apache.kafka.connect.json.JsonConverter
        schema.history.internal.kafka.topic: schema-history.prod.mysql
        include.schema.changes: true
        message.key.columns: "prod.*:id"
        decimal.handling.mode: string
        producer.override.compression.type: zstd
        producer.override.batch.size: 800000
        producer.override.linger.ms: 5
        producer.override.max.request.size: 50000000
        database.history.kafka.recovery.poll.interval.ms: 60000
        schema.history.internal.kafka.recovery.poll.interval.ms: 30000
        errors.tolerance: all
        heartbeat.interval.ms: 30000              # 30 seconds, for example
        heartbeat.topics.prefix: debezium-heartbeat
        retry.backoff.ms: 800
        errors.retry.timeout: 120000
        errors.retry.delay.max.ms: 5000
        errors.log.enable: true
        errors.log.include.messages: true

        # ---- Fast Recovery Timeouts ----
        database.connectionTimeout.ms: 10000      # Fail connection attempts fast (default: 30000)
        database.connect.backoff.max.ms: 30000    # Cap retry gap to 30s (default: 120000)

        # secrets:
        database.host:
        database.port: 3306
        database.user:
        database.password:

        # ---- Connector-Level Retries ----
        connect.max.retries: 30                   # 30 restart attempts (default: 3)
        connect.backoff.initial.delay.ms: 1000    # Small delay before restart
        connect.backoff.max.delay.ms: 8000        # Cap restart backoff to 8s (default: 60000)
        retriable.restart.connector.wait.ms: 5000

        # The values below differ per connector; everything above is shared by all connectors.
        database.server.name: mysql_prod_tables
        snapshot.mode: schema_only
        database.include.list: prod
        message.key.columns: "prod.*:id"
        database.server.id: 5434535
        table.exclude.list:
        table.include.list: ""
        errors.deadletterqueue.topic.name: dlq.prod.mysql.tables
```

Also note: I am running Debezium connector 2.7.4.Final, which is about 4-5 versions behind. Could this be a bug in that older version that was fixed later? From what I checked online, I couldn't find anything that confirms my suspicion. Please help; this is hurting our SLAs with customers almost every other day, and I am still a rookie with Kafka and Strimzi.
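One theory I would like to rule out: since the outage lasts almost exactly 14 minutes every time, it looks like a hung TCP connection that only dies once the kernel gives up retransmitting. If that is right, lowering the JDBC timeouts through Debezium's `driver.` pass-through prefix might help, though as far as I understand these only affect the JDBC connections Debezium opens (snapshots, metadata queries), not the binlog reader itself. A sketch from the docs, untested:

```yaml
# Sketch, untested: Debezium forwards "driver."-prefixed properties to MySQL
# Connector/J, so these should cap how long a dead JDBC connection can hang.
config:
  driver.connectTimeout: 10000   # Connector/J TCP connect timeout in ms (default: 0 = OS default)
  driver.socketTimeout: 60000    # Connector/J socket read timeout in ms (default: 0 = infinite)
```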
Asked by SafiJunaid (101 rep)
Aug 10, 2025, 01:00 PM
Last activity: Aug 10, 2025, 09:48 PM