Kafka Debezium connectors losing connection to Aurora DB, taking exactly 14 minutes to recover
0 votes | 0 answers | 10 views
I am running a Strimzi Kafka setup on Kubernetes in production: 5 Connect pods, 1 Strimzi operator, and ~9 connectors distributed across those 5 pods.
Some pods log the error below, and only the connectors hosted on those pods go down, recovering on their own after exactly 14 minutes:
INFO: Keepalive: Trying to restore lost connection to aurora-db-prod.cluster-randomstring.us-east-1.rds.amazonaws.com:3306
However, the pods that did not log this message continue to receive messages just fine.
This usually happens on 3 pods and ends up impacting 6-7 connectors; after exactly 14 minutes they recover without losing any data, resuming from the last committed offset.
One of these connectors is customer-impacting and business-critical, so we cannot afford for it to be down for more than 3 minutes.
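What puzzles me most is the exact 14-minute window. It is suspiciously close to the default Linux TCP retransmission give-up time (net.ipv4.tcp_retries2 = 15, roughly 13-16 minutes), so my working theory is that the pod only notices the dead Aurora connection once the kernel abandons the socket. If that theory holds, the sysctls could be lowered on the Connect pods through the Strimzi pod template. A sketch only, assuming the kernel exposes these per network namespace and the kubelet allows them via --allowed-unsafe-sysctls (tcp_retries2 is not in the Kubernetes safe-sysctl list):
# Sketch: detect dead sockets in seconds instead of ~15 minutes.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnect
metadata:
  name: debezium-connect-cluster
spec:
  template:
    pod:
      securityContext:
        sysctls:
          - name: net.ipv4.tcp_retries2        # default 15 (~13-16 min before give-up)
            value: "8"                         # abandon a dead socket after ~100 s
          - name: net.ipv4.tcp_keepalive_time  # default 7200 s
            value: "60"                        # probe idle sockets after 60 s (only if SO_KEEPALIVE is set)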
ChatGPT recommended adding the settings below for faster recovery, but they didn't help either:
# ---- Fast Recovery Timeouts ----
database.connectionTimeout.ms: 10000 # Fail connection attempts fast (default: 30000)
database.connect.backoff.max.ms: 30000 # Cap retry gap to 30s (default: 120000)
# ---- Connector-Level Retries ----
connect.max.retries: 30 # 30 restart attempts (default: 3)
connect.backoff.initial.delay.ms: 1000 # Small delay before restart
connect.backoff.max.delay.ms: 8000 # Cap restart backoff to 8s (default: 60000)
retriable.restart.connector.wait.ms: 5000
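For what it's worth, I could not find database.connectionTimeout.ms, database.connect.backoff.max.ms, or connect.max.retries in the Debezium MySQL reference, so they may be silently ignored. The documented knobs I can find for the 2.7 line look like this (a sketch; the defaults in comments are from my reading of the docs, and as far as I can tell connect.keep.alive is what produces the "Trying to restore lost connection" message above):
connect.timeout.ms: 10000                  # max wait for a connection attempt (default 30000)
connect.keep.alive: true                   # binlog-client keepalive thread (default true)
connect.keep.alive.interval.ms: 30000      # how often the keepalive checks the connection (default 60000)
retriable.restart.connector.wait.ms: 5000  # wait before restart after a retriable error (default 10000)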
Below is the full config of one of my connectors.
## NOTE: Except for the last few entries, all configs are the same for every connector; only the last 8 are connector-specific.
KafkaConnectors:
  dbc-mysql-tables-connector:
    enabled: true
    annotations: {}
    labels:
      strimzi.io/cluster: debezium-connect-cluster
    spec:
      class: io.debezium.connector.mysql.MySqlConnector
      tasksMax: 1
      autoRestart:
        enabled: true
        maxRestarts: 10
      config:
        database.server.name: mysql_prod_tables
        snapshot.mode: schema_only
        snapshot.locking.mode: none
        topic.creation.enable: true
        topic.creation.default.replication.factor: 3
        topic.creation.default.partitions: 1
        topic.creation.default.compression.type: snappy
        database.history.kafka.topic: schema-changes.prod.mysql
        database.include.list: prod
        snapshot.new.tables: parallel
        tombstones.on.delete: "false"
        topic.naming.strategy: io.debezium.schema.DefaultTopicNamingStrategy
        topic.prefix: main.mysql
        key.converter.schemas.enable: "false"
        value.converter.schemas.enable: "false"
        key.converter: org.apache.kafka.connect.json.JsonConverter
        value.converter: org.apache.kafka.connect.json.JsonConverter
        schema.history.internal.kafka.topic: schema-history.prod.mysql
        include.schema.changes: true
        message.key.columns: "prod.*:id"
        decimal.handling.mode: string
        producer.override.compression.type: zstd
        producer.override.batch.size: 800000
        producer.override.linger.ms: 5
        producer.override.max.request.size: 50000000
        database.history.kafka.recovery.poll.interval.ms: 60000
        schema.history.internal.kafka.recovery.poll.interval.ms: 30000
        errors.tolerance: all
        heartbeat.interval.ms: 30000 # 30 seconds, for example
        heartbeat.topics.prefix: debezium-heartbeat
        retry.backoff.ms: 800
        errors.retry.timeout: 120000
        errors.retry.delay.max.ms: 5000
        errors.log.enable: true
        errors.log.include.messages: true
        # ---- Fast Recovery Timeouts ----
        database.connectionTimeout.ms: 10000 # Fail connection attempts fast (default: 30000)
        database.connect.backoff.max.ms: 30000 # Cap retry gap to 30s (default: 120000)
        # secrets:
        database.host:
        database.port: 3306
        database.user:
        database.password:
        # ---- Connector-Level Retries ----
        connect.max.retries: 30 # 30 restart attempts (default: 3)
        connect.backoff.initial.delay.ms: 1000 # Small delay before restart
        connect.backoff.max.delay.ms: 8000 # Cap restart backoff to 8s (default: 60000)
        retriable.restart.connector.wait.ms: 5000
        # Below values differ per connector; everything above is the same for all connectors.
        database.server.name: mysql_prod_tables
        snapshot.mode: schema_only
        database.include.list: prod
        message.key.columns: "prod.*:id"
        database.server.id: 5434535
        table.exclude.list:
        table.include.list: ""
        errors.deadletterqueue.topic.name: dlq.prod.mysql.tables
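Tangential, but since most of these keys are identical across all 9 connectors, I am considering factoring the shared block out with a YAML anchor in the values file, so a fix like this only has to be applied once. A sketch (the anchor name is mine; merge keys work in plain YAML and Helm values):
commonConnectorConfig: &commonConnectorConfig
  snapshot.locking.mode: none
  tombstones.on.delete: "false"
  decimal.handling.mode: string
  # ...remaining shared keys...

KafkaConnectors:
  dbc-mysql-tables-connector:
    enabled: true
    spec:
      class: io.debezium.connector.mysql.MySqlConnector
      config:
        <<: *commonConnectorConfig  # merge the shared keys
        database.server.name: mysql_prod_tables
        database.server.id: 5434535
        table.include.list: ""
        errors.deadletterqueue.topic.name: dlq.prod.mysql.tables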
## Also NOTE:
I am using Debezium connector 2.7.4.Final, which is 4-5 versions behind. Could this be a bug in that older version that was fixed later? From what I checked online, I couldn't find anything to confirm my suspicion. Please help; this impacts our customer SLAs almost every other day, and I am still a rookie with Kafka and Strimzi.
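In case the answer turns out to be "upgrade", my understanding is that Strimzi can rebuild the Connect image with a newer Debezium plugin declaratively. A sketch, with a hypothetical registry and an example newer version that would need to be verified against the Debezium release notes:
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnect
metadata:
  name: debezium-connect-cluster
spec:
  build:
    output:
      type: docker
      image: registry.example.com/debezium-connect:latest  # hypothetical registry
    plugins:
      - name: debezium-connector-mysql
        artifacts:
          - type: maven
            group: io.debezium
            artifact: debezium-connector-mysql
            version: 3.0.8.Final  # example newer release; verify before use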
Asked by SafiJunaid
(101 rep)
Aug 10, 2025, 01:00 PM
Last activity: Aug 10, 2025, 09:48 PM