
PCS stonith (fencing) will kill a two-node cluster if the first node is down

2 votes
1 answer
3242 views
I have configured a two-node physical server cluster (HP ProLiant DL560 Gen8) using pcs (corosync/pacemaker/pcsd), and I have also configured fencing on it using fence_ilo4. The weird thing is that if one node goes down (by "down" I mean powered off), the second node dies as well: fencing kills it, leaving both servers offline. How do I correct this behavior?

What I have tried is adding "wait_for_all: 0" and "expected_votes: 1" to the quorum section of /etc/corosync/corosync.conf, but the surviving node still gets killed. At some point maintenance will have to be performed on one of these servers and it will have to be shut down, and I don't want the other node to go down when that happens.

Here are some outputs:

    [root@kvm_aquila-02 ~]# pcs quorum status
    Quorum information
    ------------------
    Date:             Fri Jun 28 09:07:18 2019
    Quorum provider:  corosync_votequorum
    Nodes:            2
    Node ID:          2
    Ring ID:          1/284
    Quorate:          Yes

    Votequorum information
    ----------------------
    Expected votes:   2
    Highest expected: 2
    Total votes:      2
    Quorum:           1
    Flags:            2Node Quorate

    Membership information
    ----------------------
        Nodeid      Votes    Qdevice Name
             1          1         NR kvm_aquila-01
             2          1         NR kvm_aquila-02 (local)

    [root@kvm_aquila-02 ~]# pcs config show
    Cluster Name: kvm_aquila

    Corosync Nodes:
     kvm_aquila-01 kvm_aquila-02
    Pacemaker Nodes:
     kvm_aquila-01 kvm_aquila-02

    Resources:
     Clone: dlm-clone
      Meta Attrs: interleave=true ordered=true
      Resource: dlm (class=ocf provider=pacemaker type=controld)
       Operations: monitor interval=30s on-fail=fence (dlm-monitor-interval-30s)
                   start interval=0s timeout=90 (dlm-start-interval-0s)
                   stop interval=0s timeout=100 (dlm-stop-interval-0s)
     Clone: clvmd-clone
      Meta Attrs: interleave=true ordered=true
      Resource: clvmd (class=ocf provider=heartbeat type=clvm)
       Operations: monitor interval=30s on-fail=fence (clvmd-monitor-interval-30s)
                   start interval=0s timeout=90s (clvmd-start-interval-0s)
                   stop interval=0s timeout=90s (clvmd-stop-interval-0s)
     Group: test_VPS
      Resource: test (class=ocf provider=heartbeat type=VirtualDomain)
       Attributes: config=/shared/xml/test.xml hypervisor=qemu:///system migration_transport=ssh
       Meta Attrs: allow-migrate=true is-managed=true priority=100 target-role=Started
       Utilization: cpu=4 hv_memory=4096
       Operations: migrate_from interval=0 timeout=120s (test-migrate_from-interval-0)
                   migrate_to interval=0 timeout=120 (test-migrate_to-interval-0)
                   monitor interval=10 timeout=30 (test-monitor-interval-10)
                   start interval=0s timeout=300s (test-start-interval-0s)
                   stop interval=0s timeout=300s (test-stop-interval-0s)

    Stonith Devices:
     Resource: kvm_aquila-01 (class=stonith type=fence_ilo4)
      Attributes: ipaddr=10.0.4.39 login=fencing passwd=0ToleranciJa pcmk_host_list="kvm_aquila-01 kvm_aquila-02"
      Operations: monitor interval=60s (kvm_aquila-01-monitor-interval-60s)
     Resource: kvm_aquila-02 (class=stonith type=fence_ilo4)
      Attributes: ipaddr=10.0.4.49 login=fencing passwd=0ToleranciJa pcmk_host_list="kvm_aquila-01 kvm_aquila-02"
      Operations: monitor interval=60s (kvm_aquila-02-monitor-interval-60s)
    Fencing Levels:

    Location Constraints:
    Ordering Constraints:
      start dlm-clone then start clvmd-clone (kind:Mandatory)
    Colocation Constraints:
      clvmd-clone with dlm-clone (score:INFINITY)
    Ticket Constraints:

    Alerts:
     No alerts defined

    Resources Defaults:
     No defaults set
    Operations Defaults:
     No defaults set

    Cluster Properties:
     cluster-infrastructure: corosync
     cluster-name: kvm_aquila
     dc-version: 1.1.19-8.el7_6.4-c3c624ea3d
     have-watchdog: false
     last-lrm-refresh: 1561619537
     no-quorum-policy: ignore
     stonith-enabled: true

    Quorum:
      Options:
        wait_for_all: 0

    [root@kvm_aquila-02 ~]# pcs cluster status
    Cluster Status:
     Stack: corosync
     Current DC: kvm_aquila-02 (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with quorum
     Last updated: Fri Jun 28 09:14:11 2019
     Last change: Thu Jun 27 16:23:44 2019 by root via cibadmin on kvm_aquila-01
     2 nodes configured
     7 resources configured

    PCSD Status:
      kvm_aquila-02: Online
      kvm_aquila-01: Online

    [root@kvm_aquila-02 ~]# pcs status
    Cluster name: kvm_aquila
    Stack: corosync
    Current DC: kvm_aquila-02 (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with quorum
    Last updated: Fri Jun 28 09:14:31 2019
    Last change: Thu Jun 27 16:23:44 2019 by root via cibadmin on kvm_aquila-01

    2 nodes configured
    7 resources configured

    Online: [ kvm_aquila-01 kvm_aquila-02 ]

    Full list of resources:

     kvm_aquila-01    (stonith:fence_ilo4):    Started kvm_aquila-01
     kvm_aquila-02    (stonith:fence_ilo4):    Started kvm_aquila-02
     Clone Set: dlm-clone [dlm]
         Started: [ kvm_aquila-01 kvm_aquila-02 ]
     Clone Set: clvmd-clone [clvmd]
         Started: [ kvm_aquila-01 kvm_aquila-02 ]
     Resource Group: test_VPS
         test    (ocf::heartbeat:VirtualDomain):    Started kvm_aquila-01

    Daemon Status:
      corosync: active/enabled
      pacemaker: active/enabled
      pcsd: active/enabled
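For reference, the quorum-section edit mentioned above would look roughly like this in /etc/corosync/corosync.conf (a sketch only: the provider line is taken from the pcs quorum status output, the two_node setting is an assumption inferred from the "2Node" flag shown there, and the last two lines are the values I added):

    quorum {
        provider: corosync_votequorum
        two_node: 1        # assumed: consistent with the "2Node" flag reported by pcs quorum status
        wait_for_all: 0    # added by me
        expected_votes: 1  # added by me, though pcs quorum status still reports "Expected votes: 2"
    }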
Asked by Marko Todoric (437 rep)
Jun 28, 2019, 07:14 AM
Last activity: Jun 28, 2019, 02:38 PM