Kafka Debezium connectors losing connection to Aurora DB, taking exactly 14 minutes to recover
0 votes | 0 answers | 10 views
I am running a Strimzi Kafka setup on Kubernetes in production: 5 Connect pods, 1 Strimzi operator, and ~9 connectors distributed across those 5 pods.
Some pods log the error below, and only the connectors hosted on those pods go down, recovering on their own after exactly 14 minutes:
INFO: Keepalive: Trying to restore lost connection to aurora-db-prod.cluster-randomstring.us-east-1.rds.amazonaws.com:3306
However, the pods that did not log this message continue to receive messages just fine.
This usually happens on 3 pods and ends up impacting 6-7 connectors; after exactly 14 minutes they recover without losing any data, resuming from the last committed offset.
One of these connectors is customer-impacting and business-critical, so we cannot afford for it to be down for more than 3 minutes.
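What puzzles me most is the exact 14-minute window. It is suspiciously close to the default Linux TCP retransmission give-up time (net.ipv4.tcp_retries2 = 15, roughly 13-16 minutes), so my working theory is that the pod only notices the dead Aurora connection once the kernel abandons the socket. If that theory holds, the sysctls could be lowered on the Connect pods through the Strimzi pod template. A sketch only, assuming the kernel exposes these per network namespace and the kubelet allows them via --allowed-unsafe-sysctls (tcp_retries2 is not in the Kubernetes safe-sysctl list):
# Sketch: detect dead sockets in seconds instead of ~15 minutes.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnect
metadata:
  name: debezium-connect-cluster
spec:
  template:
    pod:
      securityContext:
        sysctls:
          - name: net.ipv4.tcp_retries2        # default 15 (~13-16 min before give-up)
            value: "8"                         # abandon a dead socket after ~100 s
          - name: net.ipv4.tcp_keepalive_time  # default 7200 s
            value: "60"                        # probe idle sockets after 60 s (only if SO_KEEPALIVE is set)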
ChatGPT recommended adding the settings below for faster recovery, but they didn't help either:
# ---- Fast Recovery Timeouts ----
database.connectionTimeout.ms: 10000 # Fail connection attempts fast (default: 30000)
database.connect.backoff.max.ms: 30000 # Cap retry gap to 30s (default: 120000)
# ---- Connector-Level Retries ----
connect.max.retries: 30 # 30 restart attempts (default: 3)
connect.backoff.initial.delay.ms: 1000 # Small delay before restart
connect.backoff.max.delay.ms: 8000 # Cap restart backoff to 8s (default: 60000)
retriable.restart.connector.wait.ms: 5000
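For what it's worth, I could not find database.connectionTimeout.ms, database.connect.backoff.max.ms, or connect.max.retries in the Debezium MySQL reference, so they may be silently ignored. The documented knobs I can find for the 2.7 line look like this (a sketch; the defaults in comments are from my reading of the docs, and as far as I can tell connect.keep.alive is what produces the "Trying to restore lost connection" message above):
connect.timeout.ms: 10000                  # max wait for a connection attempt (default 30000)
connect.keep.alive: true                   # binlog-client keepalive thread (default true)
connect.keep.alive.interval.ms: 30000      # how often the keepalive checks the connection (default 60000)
retriable.restart.connector.wait.ms: 5000  # wait before restart after a retriable error (default 10000)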
Below is the full config of one of my connectors.
## NOTE: Except for the last few entries, all configs are the same for every connector; only the last 8 are connector-specific.
KafkaConnectors:
  dbc-mysql-tables-connector:
    enabled: true
    annotations: {}
    labels:
      strimzi.io/cluster: debezium-connect-cluster
    spec:
      class: io.debezium.connector.mysql.MySqlConnector
      tasksMax: 1
      autoRestart:
        enabled: true
        maxRestarts: 10
      config:
        database.server.name: mysql_prod_tables
        snapshot.mode: schema_only
        snapshot.locking.mode: none
        topic.creation.enable: true
        topic.creation.default.replication.factor: 3
        topic.creation.default.partitions: 1
        topic.creation.default.compression.type: snappy
        database.history.kafka.topic: schema-changes.prod.mysql
        database.include.list: prod
        snapshot.new.tables: parallel
        tombstones.on.delete: "false"
        topic.naming.strategy: io.debezium.schema.DefaultTopicNamingStrategy
        topic.prefix: main.mysql
        key.converter.schemas.enable: "false"
        value.converter.schemas.enable: "false"
        key.converter: org.apache.kafka.connect.json.JsonConverter
        value.converter: org.apache.kafka.connect.json.JsonConverter
        schema.history.internal.kafka.topic: schema-history.prod.mysql
        include.schema.changes: true
        message.key.columns: "prod.*:id"
        decimal.handling.mode: string
        producer.override.compression.type: zstd
        producer.override.batch.size: 800000
        producer.override.linger.ms: 5
        producer.override.max.request.size: 50000000
        database.history.kafka.recovery.poll.interval.ms: 60000
        schema.history.internal.kafka.recovery.poll.interval.ms: 30000
        errors.tolerance: all
        heartbeat.interval.ms: 30000 # 30 seconds, for example
        heartbeat.topics.prefix: debezium-heartbeat
        retry.backoff.ms: 800
        errors.retry.timeout: 120000
        errors.retry.delay.max.ms: 5000
        errors.log.enable: true
        errors.log.include.messages: true
        # ---- Fast Recovery Timeouts ----
        database.connectionTimeout.ms: 10000 # Fail connection attempts fast (default: 30000)
        database.connect.backoff.max.ms: 30000 # Cap retry gap to 30s (default: 120000)
        # secrets:
        database.host:
        database.port: 3306
        database.user:
        database.password:
        # ---- Connector-Level Retries ----
        connect.max.retries: 30 # 30 restart attempts (default: 3)
        connect.backoff.initial.delay.ms: 1000 # Small delay before restart
        connect.backoff.max.delay.ms: 8000 # Cap restart backoff to 8s (default: 60000)
        retriable.restart.connector.wait.ms: 5000
        # Below values differ per connector; everything above is the same for all connectors.
        database.server.name: mysql_prod_tables
        snapshot.mode: schema_only
        database.include.list: prod
        message.key.columns: "prod.*:id"
        database.server.id: 5434535
        table.exclude.list:
        table.include.list: ""
        errors.deadletterqueue.topic.name: dlq.prod.mysql.tables
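Tangential, but since most of these keys are identical across all 9 connectors, I am considering factoring the shared block out with a YAML anchor in the values file, so a fix like this only has to be applied once. A sketch (the anchor name is mine; merge keys work in plain YAML and Helm values):
commonConnectorConfig: &commonConnectorConfig
  snapshot.locking.mode: none
  tombstones.on.delete: "false"
  decimal.handling.mode: string
  # ...remaining shared keys...

KafkaConnectors:
  dbc-mysql-tables-connector:
    enabled: true
    spec:
      class: io.debezium.connector.mysql.MySqlConnector
      config:
        <<: *commonConnectorConfig  # merge the shared keys
        database.server.name: mysql_prod_tables
        database.server.id: 5434535
        table.include.list: ""
        errors.deadletterqueue.topic.name: dlq.prod.mysql.tables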
## Also NOTE:
I am using Debezium connector 2.7.4.Final, which is 4-5 versions behind. Could this be a bug in that older version that was fixed later? From what I checked online, I couldn't find anything to confirm my suspicion. Please help; this impacts our customer SLAs almost every other day, and I am still a rookie with Kafka and Strimzi.
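In case the answer turns out to be "upgrade", my understanding is that Strimzi can rebuild the Connect image with a newer Debezium plugin declaratively. A sketch, with a hypothetical registry and an example newer version that would need to be verified against the Debezium release notes:
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnect
metadata:
  name: debezium-connect-cluster
spec:
  build:
    output:
      type: docker
      image: registry.example.com/debezium-connect:latest  # hypothetical registry
    plugins:
      - name: debezium-connector-mysql
        artifacts:
          - type: maven
            group: io.debezium
            artifact: debezium-connector-mysql
            version: 3.0.8.Final  # example newer release; verify before use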
Asked by SafiJunaid
(101 rep)
Aug 10, 2025, 01:00 PM
Last activity: Aug 10, 2025, 09:48 PM