
Percona MySQL XtraDB Cluster doesn't start properly and node restarts don't work

2 votes
1 answer
2284 views
**tl;dr** When starting a fresh Percona cluster of 3 Kubernetes pods, the `grastate.dat` seqno is set to -1 and never changes. On deleting one pod and watching it restart, expecting it to rejoin the cluster, it sets its initial position to `00000000-0000-0000-0000-000000000000:-1` and tries to connect to itself (its former IP), maybe because it had been the first pod in the cluster? It then times out on that erroneous connection to itself:

```
2017-03-26T08:38:05.374058Z 0 [Note] WSREP: (b7571ff8, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://10.52.0.26:4567 timed out, no messages seen in PT3S
```

**The cluster doesn't get started properly and I'm unable to successfully restart pods in the cluster.**

**Full**

When I start the cluster from scratch, with blank data directories and a fresh etcd cluster, everything seems to come up. However, when I look at `grastate.dat` I find that the seqno for each pod is -1:

```
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-0/grastate.dat
# GALERA saved state
version: 2.1
uuid:    a91f70f2-11f8-11e7-8f3d-86c2e58790ac
seqno:   -1
safe_to_bootstrap: 0
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-1/grastate.dat
# GALERA saved state
version: 2.1
uuid:    a91f70f2-11f8-11e7-8f3d-86c2e58790ac
seqno:   -1
safe_to_bootstrap: 0
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-2/grastate.dat
# GALERA saved state
version: 2.1
uuid:    a91f70f2-11f8-11e7-8f3d-86c2e58790ac
seqno:   -1
safe_to_bootstrap: 0
```

At this point I can connect with `mysql -h percona -u wordpress -p`, and WordPress works too.

Scenario: I have 3 Percona pods:

```
jonathan@ubuntu:~/Projects/k8wp$ kubectl get pods
NAME        READY     STATUS    RESTARTS   AGE
etcd-0      1/1       Running   1          12h
etcd-1      1/1       Running   0          12h
etcd-2      1/1       Running   3          12h
etcd-3      1/1       Running   1          12h
percona-0   1/1       Running   0          8m
percona-1   1/1       Running   0          57m
percona-2   1/1       Running   0          57m
```

When I try to restart percona-0, it gets kicked out of the cluster on restarting. percona-0's `gvwstate.dat` file shows:

```
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-0/gvwstate.dat
my_uuid: b7571ff8-11f8-11e7-bd2d-8b50487e1523
#vwbeg
view_id: 3 b7571ff8-11f8-11e7-bd2d-8b50487e1523 3
bootstrap: 0
member: b7571ff8-11f8-11e7-bd2d-8b50487e1523 0
member: bd05a643-11f8-11e7-9dab-1b4fc20eaf6a 0
member: c33d6a73-11f8-11e7-9e86-fe1cf3d3367a 0
#vwend
```

The other 2 pods in the cluster show:

```
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-1/gvwstate.dat
my_uuid: bd05a643-11f8-11e7-9dab-1b4fc20eaf6a
#vwbeg
view_id: 3 bd05a643-11f8-11e7-9dab-1b4fc20eaf6a 4
bootstrap: 0
member: bd05a643-11f8-11e7-9dab-1b4fc20eaf6a 0
member: c33d6a73-11f8-11e7-9e86-fe1cf3d3367a 0
#vwend
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-2/gvwstate.dat
my_uuid: c33d6a73-11f8-11e7-9e86-fe1cf3d3367a
#vwbeg
view_id: 3 bd05a643-11f8-11e7-9dab-1b4fc20eaf6a 4
bootstrap: 0
member: bd05a643-11f8-11e7-9dab-1b4fc20eaf6a 0
member: c33d6a73-11f8-11e7-9e86-fe1cf3d3367a 0
#vwend
```
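(For reference, this is roughly how I've been checking what each node thinks the cluster looks like. It's just a sketch for my setup; the root credentials and the `MYSQL_ROOT_PASSWORD` variable are placeholders.)

```
# Ask each pod for its view of the cluster (adjust credentials to your setup).
for pod in percona-0 percona-1 percona-2; do
  echo "== $pod =="
  kubectl exec "$pod" -- mysql -u root -p"$MYSQL_ROOT_PASSWORD" -e \
    "SHOW GLOBAL STATUS WHERE Variable_name IN
     ('wsrep_cluster_status','wsrep_cluster_size',
      'wsrep_local_state_comment','wsrep_incoming_addresses');"
done
```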
Here are what I think are the relevant errors from percona-0's startup:

```
2017-03-26T08:37:58.370605Z 0 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
2017-03-26T08:37:58.372537Z 0 [Note] WSREP: gcomm: connecting to group 'wordpress-001', peer '10.52.0.26:'
2017-03-26T08:38:01.373345Z 0 [Note] WSREP: (b7571ff8, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://10.52.0.26:4567 timed out, no messages seen in PT3S
2017-03-26T08:38:01.373682Z 0 [Warning] WSREP: no nodes coming from prim view, prim not possible
2017-03-26T08:38:01.373750Z 0 [Note] WSREP: view(view_id(NON_PRIM,b7571ff8,5) memb { b7571ff8,0 } joined { } left { } partitioned { })
2017-03-26T08:38:01.373838Z 0 [Note] WSREP: gcomm: connected
2017-03-26T08:38:01.373872Z 0 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
2017-03-26T08:38:01.373987Z 0 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
2017-03-26T08:38:01.374012Z 0 [Note] WSREP: Opened channel 'wordpress-001'
2017-03-26T08:38:01.374108Z 0 [Note] WSREP: Waiting for SST to complete.
2017-03-26T08:38:01.374417Z 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2017-03-26T08:38:01.374469Z 0 [Note] WSREP: Flow-control interval: [16, 16]
2017-03-26T08:38:01.374491Z 0 [Note] WSREP: Received NON-PRIMARY.
2017-03-26T08:38:01.374560Z 1 [Note] WSREP: New cluster view: global state: :-1, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version -1
```

The IP it's trying to connect to, `10.52.0.26` in

```
2017-03-26T08:37:58.372537Z 0 [Note] WSREP: gcomm: connecting to group 'wordpress-001', peer '10.52.0.26:'
```

is actually that pod's previous IP. Here's the listing of keys in etcd I did before deleting percona-0:

```
/ # etcdctl ls --recursive
/pxc-cluster
/pxc-cluster/wordpress
/pxc-cluster/queue
/pxc-cluster/queue/wordpress
/pxc-cluster/queue/wordpress-001
/pxc-cluster/wordpress-001
/pxc-cluster/wordpress-001/10.52.1.46
/pxc-cluster/wordpress-001/10.52.1.46/ipaddr
/pxc-cluster/wordpress-001/10.52.1.46/hostname
/pxc-cluster/wordpress-001/10.52.2.33
/pxc-cluster/wordpress-001/10.52.2.33/ipaddr
/pxc-cluster/wordpress-001/10.52.2.33/hostname
/pxc-cluster/wordpress-001/10.52.0.26
/pxc-cluster/wordpress-001/10.52.0.26/hostname
/pxc-cluster/wordpress-001/10.52.0.26/ipaddr
```

After `kubectl delete pods/percona-0`:

```
/ # etcdctl ls --recursive
/pxc-cluster
/pxc-cluster/queue
/pxc-cluster/queue/wordpress
/pxc-cluster/queue/wordpress-001
/pxc-cluster/wordpress-001
/pxc-cluster/wordpress-001/10.52.1.46
/pxc-cluster/wordpress-001/10.52.1.46/ipaddr
/pxc-cluster/wordpress-001/10.52.1.46/hostname
/pxc-cluster/wordpress-001/10.52.2.33
/pxc-cluster/wordpress-001/10.52.2.33/ipaddr
/pxc-cluster/wordpress-001/10.52.2.33/hostname
/pxc-cluster/wordpress
```

Also, during the restart percona-0 tried to register with etcd using:

```
{"action":"create","node":{"key":"/pxc-cluster/queue/wordpress-001/00000000000000009886","value":"10.52.0.27","expiration":"2017-03-26T08:38:57.980325718Z","ttl":60,"modifiedIndex":9886,"createdIndex":9886}}
{"action":"set","node":{"key":"/pxc-cluster/wordpress-001/10.52.0.27/ipaddr","value":"10.52.0.27","expiration":"2017-03-26T08:38:28.01814818Z","ttl":30,"modifiedIndex":9887,"createdIndex":9887}}
{"action":"set","node":{"key":"/pxc-cluster/wordpress-001/10.52.0.27/hostname","value":"percona-0","expiration":"2017-03-26T08:38:28.037188157Z","ttl":30,"modifiedIndex":9888,"createdIndex":9888}}
{"action":"update","node":{"key":"/pxc-cluster/wordpress-001/10.52.0.27","dir":true,"expiration":"2017-03-26T08:38:28.054726795Z","ttl":30,"modifiedIndex":9889,"createdIndex":9887},"prevNode":{"key":"/pxc-cluster/wordpress-001/10.52.0.27","dir":true,"modifiedIndex":9887,"createdIndex":9887}}
```

which doesn't work.
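One thing I'm considering as a workaround (untested, and assuming the same etcd v2 layout shown above) is explicitly dropping the old pod's keys before it re-registers, so discovery can't hand percona-0 its previous address:

```
# From a container with etcdctl access: drop the stale entry for the
# old pod IP (10.52.0.26 here) so discovery doesn't return it again.
etcdctl rm --recursive /pxc-cluster/wordpress-001/10.52.0.26

# Sanity check: only the live members should remain.
etcdctl ls --recursive /pxc-cluster/wordpress-001
```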
From the second member of the cluster, percona-1:

```
2017-03-26T08:37:44.069583Z 0 [Note] WSREP: (bd05a643, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.52.0.26:4567
2017-03-26T08:37:45.069756Z 0 [Note] WSREP: (bd05a643, 'tcp://0.0.0.0:4567') reconnecting to b7571ff8 (tcp://10.52.0.26:4567), attempt 0
2017-03-26T08:37:48.570332Z 0 [Note] WSREP: (bd05a643, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://10.52.0.26:4567 timed out, no messages seen in PT3S
2017-03-26T08:37:49.605089Z 0 [Note] WSREP: evs::proto(bd05a643, GATHER, view_id(REG,b7571ff8,3)) suspecting node: b7571ff8
2017-03-26T08:37:49.605276Z 0 [Note] WSREP: evs::proto(bd05a643, GATHER, view_id(REG,b7571ff8,3)) suspected node without join message, declaring inactive
2017-03-26T08:37:50.104676Z 0 [Note] WSREP: declaring c33d6a73 at tcp://10.52.2.33:4567 stable
```

**New Info:** I restarted percona-0 again, and this time it somehow came up! After a few tries I realised the pod needs to be restarted twice: after deleting it the first time it comes up with the above errors, and after deleting it a second time it comes up okay and syncs with the other members. Could this be because it was the first pod in the cluster? I've tested deleting the other pods and they all come back up fine; the issue only lies with percona-0.

Also: taking down all the pods at once (the situation if my node were to crash) is where the pods don't come back up at all! I suspect it's because no state is saved to `grastate.dat`, i.e. the seqno remains -1 even though the global ID may change. The pods exit with `mysqld shutdown` and the following errors:

```
jonathan@ubuntu:~/Projects/k8wp$ kubectl logs percona-2 | grep ERROR
2017-03-26T11:20:25.795085Z 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
2017-03-26T11:20:25.795276Z 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
2017-03-26T11:20:25.795544Z 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1437: Failed to open channel 'wordpress-001' at 'gcomm://10.52.2.36': -110 (Connection timed out)
2017-03-26T11:20:25.795618Z 0 [ERROR] WSREP: gcs connect failed: Connection timed out
2017-03-26T11:20:25.795645Z 0 [ERROR] WSREP: wsrep::connect(gcomm://10.52.2.36) failed: 7
2017-03-26T11:20:25.795693Z 0 [ERROR] Aborting

jonathan@ubuntu:~/Projects/k8wp$ kubectl logs percona-1 | grep ERROR
2017-03-26T11:20:27.093780Z 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
2017-03-26T11:20:27.093977Z 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
2017-03-26T11:20:27.094145Z 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1437: Failed to open channel 'wordpress-001' at 'gcomm://10.52.1.49': -110 (Connection timed out)
2017-03-26T11:20:27.094200Z 0 [ERROR] WSREP: gcs connect failed: Connection timed out
2017-03-26T11:20:27.094227Z 0 [ERROR] WSREP: wsrep::connect(gcomm://10.52.1.49) failed: 7
2017-03-26T11:20:27.094247Z 0 [ERROR] Aborting

jonathan@ubuntu:~/Projects/k8wp$ kubectl logs percona-0 | grep ERROR
2017-03-26T11:20:52.040214Z 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
2017-03-26T11:20:52.040279Z 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
2017-03-26T11:20:52.040385Z 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1437: Failed to open channel 'wordpress-001' at 'gcomm://10.52.2.36': -110 (Connection timed out)
2017-03-26T11:20:52.040437Z 0 [ERROR] WSREP: gcs connect failed: Connection timed out
2017-03-26T11:20:52.040471Z 0 [ERROR] WSREP: wsrep::connect(gcomm://10.52.2.36) failed: 7
2017-03-26T11:20:52.040508Z 0 [ERROR] Aborting
```
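My understanding of the standard Galera recovery after a full outage is something like the sketch below (untested in my setup; the `/var/lib/mysql` path is a placeholder for my gluster-backed data directories), but I don't see how to fit it into the pod startup:

```
# 1. With mysqld stopped, recover the last committed position from the data dir.
mysqld --user=mysql --datadir=/var/lib/mysql --wsrep-recover
#    Look for "WSREP: Recovered position: <uuid>:<seqno>" in the error log.

# 2. On the node with the highest recovered seqno, mark it safe to bootstrap.
sed -i 's/^safe_to_bootstrap: 0/safe_to_bootstrap: 1/' /var/lib/mysql/grastate.dat

# 3. Start that node as a new cluster, then start the other nodes normally.
mysqld --wsrep-new-cluster
```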
`grastate.dat` after deleting all the pods:

```
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-0/grastate.dat
# GALERA saved state
version: 2.1
uuid:    a91f70f2-11f8-11e7-8f3d-86c2e58790ac
seqno:   -1
safe_to_bootstrap: 0
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-1/grastate.dat
# GALERA saved state
version: 2.1
uuid:    a91f70f2-11f8-11e7-8f3d-86c2e58790ac
seqno:   -1
safe_to_bootstrap: 0
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-2/grastate.dat
# GALERA saved state
version: 2.1
uuid:    a91f70f2-11f8-11e7-8f3d-86c2e58790ac
seqno:   -1
safe_to_bootstrap: 0
```

No gvwstate.dat files exist at this point.
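If it matters: as far as I understand, `grastate.dat` only gets a real seqno written on a clean mysqld shutdown; while a node is running, the live position can be checked with something like this (root credentials assumed, as above):

```
# Shows the current replication position even while grastate.dat still says -1.
kubectl exec percona-1 -- mysql -u root -p"$MYSQL_ROOT_PASSWORD" -e \
  "SHOW GLOBAL STATUS LIKE 'wsrep_last_committed';"
```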
Asked by Jonathan (121 rep)
Mar 26, 2017, 09:18 AM
Last activity: Jul 23, 2025, 03:01 AM