
Percona MySQL XtraDB Cluster doesn't start properly and node restarts don't work

2 votes
1 answer
2284 views
**tl;dr** When starting a fresh Percona cluster of 3 Kubernetes pods, the `grastate.dat` seqno is set to -1 and never changes. On deleting one pod and watching it restart, expecting it to rejoin the cluster, it sets its initial position to `00000000-0000-0000-0000-000000000000:-1` and tries to connect to itself (its former IP), maybe because it had been the first pod in the cluster? It then times out on that erroneous connection to itself:

```
2017-03-26T08:38:05.374058Z 0 [Note] WSREP: (b7571ff8, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://10.52.0.26:4567 timed out, no messages seen in PT3S
```

**The cluster doesn't get started properly and I'm unable to successfully restart pods in the cluster.**

**Full**

When I start the cluster from scratch, with blank data directories and a fresh etcd cluster, everything seems to come up. However, when I look at `grastate.dat` I find that the seqno for each pod is -1:

```
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-0/grastate.dat
# GALERA saved state
version: 2.1
uuid:    a91f70f2-11f8-11e7-8f3d-86c2e58790ac
seqno:   -1
safe_to_bootstrap: 0
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-1/grastate.dat
# GALERA saved state
version: 2.1
uuid:    a91f70f2-11f8-11e7-8f3d-86c2e58790ac
seqno:   -1
safe_to_bootstrap: 0
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-2/grastate.dat
# GALERA saved state
version: 2.1
uuid:    a91f70f2-11f8-11e7-8f3d-86c2e58790ac
seqno:   -1
safe_to_bootstrap: 0
```

At this point I can connect with `mysql -h percona -u wordpress -p`, and WordPress works too.

Scenario: I have 3 Percona pods:

```
jonathan@ubuntu:~/Projects/k8wp$ kubectl get pods
NAME        READY     STATUS    RESTARTS   AGE
etcd-0      1/1       Running   1          12h
etcd-1      1/1       Running   0          12h
etcd-2      1/1       Running   3          12h
etcd-3      1/1       Running   1          12h
percona-0   1/1       Running   0          8m
percona-1   1/1       Running   0          57m
percona-2   1/1       Running   0          57m
```

When I try to restart percona-0, it gets kicked out of the cluster on restarting. percona-0's `gvwstate.dat` file shows:

```
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-0/gvwstate.dat
my_uuid: b7571ff8-11f8-11e7-bd2d-8b50487e1523
#vwbeg
view_id: 3 b7571ff8-11f8-11e7-bd2d-8b50487e1523 3
bootstrap: 0
member: b7571ff8-11f8-11e7-bd2d-8b50487e1523 0
member: bd05a643-11f8-11e7-9dab-1b4fc20eaf6a 0
member: c33d6a73-11f8-11e7-9e86-fe1cf3d3367a 0
#vwend
```

The other 2 pods in the cluster show:

```
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-1/gvwstate.dat
my_uuid: bd05a643-11f8-11e7-9dab-1b4fc20eaf6a
#vwbeg
view_id: 3 bd05a643-11f8-11e7-9dab-1b4fc20eaf6a 4
bootstrap: 0
member: bd05a643-11f8-11e7-9dab-1b4fc20eaf6a 0
member: c33d6a73-11f8-11e7-9e86-fe1cf3d3367a 0
#vwend
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-2/gvwstate.dat
my_uuid: c33d6a73-11f8-11e7-9e86-fe1cf3d3367a
#vwbeg
view_id: 3 bd05a643-11f8-11e7-9dab-1b4fc20eaf6a 4
bootstrap: 0
member: bd05a643-11f8-11e7-9dab-1b4fc20eaf6a 0
member: c33d6a73-11f8-11e7-9e86-fe1cf3d3367a 0
#vwend
```
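(For reference, this is roughly how I've been checking what each node thinks the cluster looks like. It's just a sketch for my setup; the root credentials and the `MYSQL_ROOT_PASSWORD` variable are placeholders.)

```
# Ask each pod for its view of the cluster (adjust credentials to your setup).
for pod in percona-0 percona-1 percona-2; do
  echo "== $pod =="
  kubectl exec "$pod" -- mysql -u root -p"$MYSQL_ROOT_PASSWORD" -e \
    "SHOW GLOBAL STATUS WHERE Variable_name IN
     ('wsrep_cluster_status','wsrep_cluster_size',
      'wsrep_local_state_comment','wsrep_incoming_addresses');"
done
```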
Here are what I think are the relevant errors from percona-0's startup:

```
2017-03-26T08:37:58.370605Z 0 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
2017-03-26T08:37:58.372537Z 0 [Note] WSREP: gcomm: connecting to group 'wordpress-001', peer '10.52.0.26:'
2017-03-26T08:38:01.373345Z 0 [Note] WSREP: (b7571ff8, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://10.52.0.26:4567 timed out, no messages seen in PT3S
2017-03-26T08:38:01.373682Z 0 [Warning] WSREP: no nodes coming from prim view, prim not possible
2017-03-26T08:38:01.373750Z 0 [Note] WSREP: view(view_id(NON_PRIM,b7571ff8,5) memb { b7571ff8,0 } joined { } left { } partitioned { })
2017-03-26T08:38:01.373838Z 0 [Note] WSREP: gcomm: connected
2017-03-26T08:38:01.373872Z 0 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
2017-03-26T08:38:01.373987Z 0 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
2017-03-26T08:38:01.374012Z 0 [Note] WSREP: Opened channel 'wordpress-001'
2017-03-26T08:38:01.374108Z 0 [Note] WSREP: Waiting for SST to complete.
2017-03-26T08:38:01.374417Z 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2017-03-26T08:38:01.374469Z 0 [Note] WSREP: Flow-control interval: [16, 16]
2017-03-26T08:38:01.374491Z 0 [Note] WSREP: Received NON-PRIMARY.
2017-03-26T08:38:01.374560Z 1 [Note] WSREP: New cluster view: global state: :-1, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version -1
```

The IP it's trying to connect to, `10.52.0.26` in

```
2017-03-26T08:37:58.372537Z 0 [Note] WSREP: gcomm: connecting to group 'wordpress-001', peer '10.52.0.26:'
```

is actually that pod's previous IP. Here's the listing of keys in etcd I did before deleting percona-0:

```
/ # etcdctl ls --recursive
/pxc-cluster
/pxc-cluster/wordpress
/pxc-cluster/queue
/pxc-cluster/queue/wordpress
/pxc-cluster/queue/wordpress-001
/pxc-cluster/wordpress-001
/pxc-cluster/wordpress-001/10.52.1.46
/pxc-cluster/wordpress-001/10.52.1.46/ipaddr
/pxc-cluster/wordpress-001/10.52.1.46/hostname
/pxc-cluster/wordpress-001/10.52.2.33
/pxc-cluster/wordpress-001/10.52.2.33/ipaddr
/pxc-cluster/wordpress-001/10.52.2.33/hostname
/pxc-cluster/wordpress-001/10.52.0.26
/pxc-cluster/wordpress-001/10.52.0.26/hostname
/pxc-cluster/wordpress-001/10.52.0.26/ipaddr
```

After `kubectl delete pods/percona-0`:

```
/ # etcdctl ls --recursive
/pxc-cluster
/pxc-cluster/queue
/pxc-cluster/queue/wordpress
/pxc-cluster/queue/wordpress-001
/pxc-cluster/wordpress-001
/pxc-cluster/wordpress-001/10.52.1.46
/pxc-cluster/wordpress-001/10.52.1.46/ipaddr
/pxc-cluster/wordpress-001/10.52.1.46/hostname
/pxc-cluster/wordpress-001/10.52.2.33
/pxc-cluster/wordpress-001/10.52.2.33/ipaddr
/pxc-cluster/wordpress-001/10.52.2.33/hostname
/pxc-cluster/wordpress
```

Also, during the restart percona-0 tried to register with etcd using:

```
{"action":"create","node":{"key":"/pxc-cluster/queue/wordpress-001/00000000000000009886","value":"10.52.0.27","expiration":"2017-03-26T08:38:57.980325718Z","ttl":60,"modifiedIndex":9886,"createdIndex":9886}}
{"action":"set","node":{"key":"/pxc-cluster/wordpress-001/10.52.0.27/ipaddr","value":"10.52.0.27","expiration":"2017-03-26T08:38:28.01814818Z","ttl":30,"modifiedIndex":9887,"createdIndex":9887}}
{"action":"set","node":{"key":"/pxc-cluster/wordpress-001/10.52.0.27/hostname","value":"percona-0","expiration":"2017-03-26T08:38:28.037188157Z","ttl":30,"modifiedIndex":9888,"createdIndex":9888}}
{"action":"update","node":{"key":"/pxc-cluster/wordpress-001/10.52.0.27","dir":true,"expiration":"2017-03-26T08:38:28.054726795Z","ttl":30,"modifiedIndex":9889,"createdIndex":9887},"prevNode":{"key":"/pxc-cluster/wordpress-001/10.52.0.27","dir":true,"modifiedIndex":9887,"createdIndex":9887}}
```

which doesn't work.
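One thing I'm considering as a workaround (untested, and assuming the same etcd v2 layout shown above) is explicitly dropping the old pod's keys before it re-registers, so discovery can't hand percona-0 its previous address:

```
# From a container with etcdctl access: drop the stale entry for the
# old pod IP (10.52.0.26 here) so discovery doesn't return it again.
etcdctl rm --recursive /pxc-cluster/wordpress-001/10.52.0.26

# Sanity check: only the live members should remain.
etcdctl ls --recursive /pxc-cluster/wordpress-001
```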
From the second member of the cluster, percona-1:

```
2017-03-26T08:37:44.069583Z 0 [Note] WSREP: (bd05a643, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.52.0.26:4567
2017-03-26T08:37:45.069756Z 0 [Note] WSREP: (bd05a643, 'tcp://0.0.0.0:4567') reconnecting to b7571ff8 (tcp://10.52.0.26:4567), attempt 0
2017-03-26T08:37:48.570332Z 0 [Note] WSREP: (bd05a643, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://10.52.0.26:4567 timed out, no messages seen in PT3S
2017-03-26T08:37:49.605089Z 0 [Note] WSREP: evs::proto(bd05a643, GATHER, view_id(REG,b7571ff8,3)) suspecting node: b7571ff8
2017-03-26T08:37:49.605276Z 0 [Note] WSREP: evs::proto(bd05a643, GATHER, view_id(REG,b7571ff8,3)) suspected node without join message, declaring inactive
2017-03-26T08:37:50.104676Z 0 [Note] WSREP: declaring c33d6a73 at tcp://10.52.2.33:4567 stable
```

**New Info:** I restarted percona-0 again, and this time it somehow came up! After a few tries I realised the pod needs to be restarted twice: after deleting it the first time it comes up with the above errors, and after deleting it a second time it comes up okay and syncs with the other members. Could this be because it was the first pod in the cluster? I've tested deleting the other pods and they all come back up fine; the issue only lies with percona-0.

Also: taking down all the pods at once (the situation if my node were to crash) is where the pods don't come back up at all! I suspect it's because no state is saved to `grastate.dat`, i.e. the seqno remains -1 even though the global ID may change. The pods exit with `mysqld shutdown` and the following errors:

```
jonathan@ubuntu:~/Projects/k8wp$ kubectl logs percona-2 | grep ERROR
2017-03-26T11:20:25.795085Z 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
2017-03-26T11:20:25.795276Z 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
2017-03-26T11:20:25.795544Z 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1437: Failed to open channel 'wordpress-001' at 'gcomm://10.52.2.36': -110 (Connection timed out)
2017-03-26T11:20:25.795618Z 0 [ERROR] WSREP: gcs connect failed: Connection timed out
2017-03-26T11:20:25.795645Z 0 [ERROR] WSREP: wsrep::connect(gcomm://10.52.2.36) failed: 7
2017-03-26T11:20:25.795693Z 0 [ERROR] Aborting

jonathan@ubuntu:~/Projects/k8wp$ kubectl logs percona-1 | grep ERROR
2017-03-26T11:20:27.093780Z 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
2017-03-26T11:20:27.093977Z 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
2017-03-26T11:20:27.094145Z 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1437: Failed to open channel 'wordpress-001' at 'gcomm://10.52.1.49': -110 (Connection timed out)
2017-03-26T11:20:27.094200Z 0 [ERROR] WSREP: gcs connect failed: Connection timed out
2017-03-26T11:20:27.094227Z 0 [ERROR] WSREP: wsrep::connect(gcomm://10.52.1.49) failed: 7
2017-03-26T11:20:27.094247Z 0 [ERROR] Aborting

jonathan@ubuntu:~/Projects/k8wp$ kubectl logs percona-0 | grep ERROR
2017-03-26T11:20:52.040214Z 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
2017-03-26T11:20:52.040279Z 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
2017-03-26T11:20:52.040385Z 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1437: Failed to open channel 'wordpress-001' at 'gcomm://10.52.2.36': -110 (Connection timed out)
2017-03-26T11:20:52.040437Z 0 [ERROR] WSREP: gcs connect failed: Connection timed out
2017-03-26T11:20:52.040471Z 0 [ERROR] WSREP: wsrep::connect(gcomm://10.52.2.36) failed: 7
2017-03-26T11:20:52.040508Z 0 [ERROR] Aborting
```
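My understanding of the standard Galera recovery after a full outage is something like the sketch below (untested in my setup; the `/var/lib/mysql` path is a placeholder for my gluster-backed data directories), but I don't see how to fit it into the pod startup:

```
# 1. With mysqld stopped, recover the last committed position from the data dir.
mysqld --user=mysql --datadir=/var/lib/mysql --wsrep-recover
#    Look for "WSREP: Recovered position: <uuid>:<seqno>" in the error log.

# 2. On the node with the highest recovered seqno, mark it safe to bootstrap.
sed -i 's/^safe_to_bootstrap: 0/safe_to_bootstrap: 1/' /var/lib/mysql/grastate.dat

# 3. Start that node as a new cluster, then start the other nodes normally.
mysqld --wsrep-new-cluster
```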
`grastate.dat` after deleting all the pods:

```
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-0/grastate.dat
# GALERA saved state
version: 2.1
uuid:    a91f70f2-11f8-11e7-8f3d-86c2e58790ac
seqno:   -1
safe_to_bootstrap: 0
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-1/grastate.dat
# GALERA saved state
version: 2.1
uuid:    a91f70f2-11f8-11e7-8f3d-86c2e58790ac
seqno:   -1
safe_to_bootstrap: 0
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-2/grastate.dat
# GALERA saved state
version: 2.1
uuid:    a91f70f2-11f8-11e7-8f3d-86c2e58790ac
seqno:   -1
safe_to_bootstrap: 0
```

No gvwstate.dat files exist at this point.
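If it matters: as far as I understand, `grastate.dat` only gets a real seqno written on a clean mysqld shutdown; while a node is running, the live position can be checked with something like this (root credentials assumed, as above):

```
# Shows the current replication position even while grastate.dat still says -1.
kubectl exec percona-1 -- mysql -u root -p"$MYSQL_ROOT_PASSWORD" -e \
  "SHOW GLOBAL STATUS LIKE 'wsrep_last_committed';"
```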
Asked by Jonathan (121 rep)
Mar 26, 2017, 09:18 AM
Last activity: Jul 23, 2025, 03:01 AM