Unix & Linux Stack Exchange
Q&A for users of Linux, FreeBSD and other Unix-like operating systems
Latest Questions
0
votes
1
answers
2303
views
Secondary DRBD node does not auto-start in Pacemaker+Corosync setup
I am trying to set up a 2-PC cluster with shared resources: `ClusterIP`, `ClusterSamba`, `ClusterNFS`, `DRBD` (cloned resource), and a `DRBDFS`.
The beginning of the project followed the [Clusters from Scratch](https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Clusters_from_Scratch/index.html) guide. When everything in this guide is done, it works without problems.
So, I wanted to use parts of that guide and build my own setup:
I created one shared IP (`ClusterIP`) that is automatically assigned to one node, and (here is where it gets tricky) on that node, I mount my `/dev/drbd1` device to `/exports` and then share this mount through **SAMBA** and **NFS**.
When I start the cluster, all resources come up as they should, _but DRBD does not go up on the secondary node_ (`Primary/Unknown`). If I bring it up manually, it syncs and works. Also, when I stop the cluster (or forcibly reboot the first node), all resources transfer to the other node and everything works, _except DRBD on the other node goes into an Unknown state_.
### So now, here is the problem:
**Why does DRBD go down on the secondary node when I stop the cluster? Or why doesn't it start in the Secondary role on the secondary node?**
Sorry if my description is bad.
---
## Here are the commands I used
# apt install -y pacemaker pcs psmisc policycoreutils-python-utils drbd-utils samba nfs-kernel-server
# systemctl start pcsd.service
# systemctl enable pcsd.service
# passwd hacluster
# pcs host auth alice bob
# pcs cluster setup myCluster alice bob --force
# pcs cluster start --all
# pcs property set stonith-enabled=false
# pcs property set no-quorum-policy=ignore
# modprobe drbd
# echo drbd >/etc/modules-load.d/drbd.conf
# drbdadm create-md r0
# drbdadm up r0
# drbdadm primary r0 --force
# mkfs.ext4 /dev/drbd1
# systemctl disable smbd
# systemctl disable nfs-kernel-server.service
# mkdir /exports
# vi /etc/samba/smb.conf
# vi /etc/exports
# pcs resource create ClusterIP ocf:heartbeat:IPaddr2 ip=10.1.1.30 cidr_netmask=24 op monitor interval=30s
# pcs resource defaults resource-stickiness=100
# pcs resource op defaults timeout=240s
# pcs resource create ClusterSamba lsb:smbd op monitor interval=60s
# pcs resource create ClusterNFS ocf:heartbeat:nfsserver op monitor interval=60s
# pcs resource create DRBD ocf:linbit:drbd drbd_resource=r0 op monitor interval=60s
# pcs resource promotable DRBD promoted-max=1 promoted-node-max=1 clone-max=2 clone-node-max=1 notify=true
# pcs resource create DRBDFS Filesystem device="/dev/drbd1" directory="/exports" fstype="ext4"
# pcs constraint order ClusterIP then ClusterNFS
# pcs constraint order ClusterNFS then ClusterSamba
# pcs constraint order promote DRBD-clone then start DRBDFS
# pcs constraint order DRBDFS then ClusterNFS
# pcs constraint order ClusterIP then DRBD-clone
# pcs constraint colocation ClusterSamba with ClusterIP
# pcs constraint colocation add ClusterSamba with ClusterIP
# pcs constraint colocation add ClusterNFS with ClusterIP
# pcs constraint colocation add DRBDFS with DRBD-clone INFINITY with-rsc-role=Master
# pcs constraint colocation add DRBD-clone with ClusterIP
# pcs cluster stop --all && sleep 2 && pcs cluster start --all
---
## Configs and stats
### /etc/drbd.d/r0.res
resource r0 {
    device      /dev/drbd1;
    disk        /dev/sdb;
    meta-disk   internal;
    net {
        allow-two-primaries;
    }
    on alice {
        address 10.1.1.31:7788;
    }
    on bob {
        address 10.1.1.32:7788;
    }
}
---
### /etc/corosync/corosync.conf
totem {
    version: 2
    cluster_name: myCluster
    transport: knet
    crypto_cipher: aes256
    crypto_hash: sha256
}
nodelist {
    node {
        ring0_addr: alice
        name: alice
        nodeid: 1
    }
    node {
        ring0_addr: bob
        name: bob
        nodeid: 2
    }
}
quorum {
    provider: corosync_votequorum
    two_node: 1
}
logging {
    to_logfile: yes
    logfile: /var/log/corosync/corosync.log
    to_syslog: yes
    timestamp: on
}
---
### pcs status
Cluster name: myCluster
Stack: corosync
Current DC: alice (version 2.0.1-9e909a5bdd) - partition with quorum
Last updated: Fri May 15 12:28:30 2020
Last change: Fri May 15 11:04:50 2020 by root via cibadmin on bob
2 nodes configured
6 resources configured
Online: [ alice bob ]
Full list of resources:
ClusterIP (ocf::heartbeat:IPaddr2): Started alice
ClusterSamba (lsb:smbd): Started alice
ClusterNFS (ocf::heartbeat:nfsserver): Started alice
Clone Set: DRBD-clone [DRBD] (promotable)
Masters: [ alice ]
Stopped: [ bob ]
DRBDFS (ocf::heartbeat:Filesystem): Started alice
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
---
### pcs constraint --full
Location Constraints:
Ordering Constraints:
start ClusterIP then start ClusterNFS (kind:Mandatory) (id:order-ClusterIP-ClusterNFS-mandatory)
start ClusterNFS then start ClusterSamba (kind:Mandatory) (id:order-ClusterNFS-ClusterSamba-mandatory)
promote DRBD-clone then start DRBDFS (kind:Mandatory) (id:order-DRBD-clone-DRBDFS-mandatory)
start DRBDFS then start ClusterNFS (kind:Mandatory) (id:order-DRBDFS-ClusterNFS-mandatory)
start ClusterIP then start DRBD-clone (kind:Mandatory) (id:order-ClusterIP-DRBD-clone-mandatory)
start ClusterIP then promote DRBD-clone (kind:Mandatory) (id:order-ClusterIP-DRBD-clone-mandatory-1)
Colocation Constraints:
ClusterSamba with ClusterIP (score:INFINITY) (id:colocation-ClusterSamba-ClusterIP-INFINITY)
ClusterNFS with ClusterIP (score:INFINITY) (id:colocation-ClusterNFS-ClusterIP-INFINITY)
DRBDFS with DRBD-clone (score:INFINITY) (with-rsc-role:Master) (id:colocation-DRBDFS-DRBD-clone-INFINITY)
DRBD-clone with ClusterIP (score:INFINITY) (id:colocation-DRBD-clone-ClusterIP-INFINITY)
Ticket Constraints:
---
### /proc/drbd
version: 8.4.10 (api:1/proto:86-101)
srcversion: 983FCB77F30137D4E127B83
1: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----
ns:0 nr:4 dw:8 dr:17 al:1 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:4
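A hedged reading of the constraint list above: `DRBD-clone with ClusterIP (score:INFINITY)` applies to every instance of the clone, so the Secondary instance is never allowed to run on the node that does not hold `ClusterIP`, which matches the `Stopped: [ bob ]` output. A sketch of restricting the dependency to the promoted role instead (pcs 0.10 syntax assumed; the constraint ID is taken from the listing above):

    # sketch: let the clone run on both nodes and make the IP follow the promoted instance
    pcs constraint remove colocation-DRBD-clone-ClusterIP-INFINITY
    pcs constraint colocation add ClusterIP with DRBD-clone INFINITY with-rsc-role=Master

The two `start ClusterIP then ... DRBD-clone` order constraints would probably need the same treatment, since they also force the clone to wait for the IP.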
Miki
(31 rep)
May 15, 2020, 11:12 AM
• Last activity: Jun 19, 2025, 10:03 PM
2
votes
1
answers
2579
views
After failover Pacemaker moves resource back when node comes back
I'm using Pacemaker & Corosync for my cluster.
When a node dies, Pacemaker moves my resources to another online node. Everything is OK here.
But when the dead node comes back, Pacemaker moves the resources back.
I don't have any "location" line in my config, and I also tried the "unmove" command, but nothing changed.
I must have gone wrong somewhere and need to find the reason.
**crm configure sh**
node 1: DEV1
node 2: DEV2
primitive poolip IPaddr2 \
params ip=10.1.60.33 nic=enp2s0f0 cidr_netmask=24 \
meta migration-threshold=2 target-role=Started \
op monitor interval=20 timeout=20 on-fail=restart
primitive gui systemd:gui \
op monitor interval=20s \
meta target-role=Started
primitive gui-ip IPaddr2 \
params ip=10.1.60.35 nic=enp2s0f0 cidr_netmask=24 \
meta migration-threshold=2 target-role=Started \
op monitor interval=20 timeout=20 on-fail=restart
colocation cluster-gui inf: gui gui-ip
order gui-after-ip Mandatory: gui-ip gui
property cib-bootstrap-options: \
have-watchdog=false \
dc-version=2.0.0-1-8cf3fe749e \
cluster-infrastructure=corosync \
cluster-name=mycluster \
stonith-enabled=false \
no-quorum-policy=ignore \
last-lrm-refresh=1545920437
rsc_defaults rsc-options: \
migration-threshold=10 \
resource-stickiness=100
**pcs resource defaults**
migration-threshold=10
resource-stickiness=100
**pcs resource show gui**
Resource: gui (class=systemd type=gui)
Meta Attrs: target-role=Started
Operations: monitor interval=20s (gui-monitor-20s)
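For reference, a hedged diagnostic sketch (assuming crmsh and the usual Pacemaker tooling on this 2.0.0 install): the fail-back decision comes down to per-node allocation scores, which can be dumped from the live CIB, and stickiness can be raised above whatever is pulling the resource back:

    crm_simulate -sL | grep -E 'gui|poolip'                  # show allocation scores per node
    crm configure rsc_defaults resource-stickiness=INFINITY  # make the current placement always win

If the scores show something other than stickiness losing out, for example a leftover `cli-prefer-*` location entry created by an earlier `move`, that would point at the real cause.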
Ozbit
(439 rep)
Jan 2, 2019, 08:58 AM
• Last activity: Jun 14, 2025, 09:07 PM
1
votes
2
answers
8573
views
pcs stonith not working
I have 2 virtual CentOS 7 nodes; root can log in passwordlessly between them.
I have configured STONITH like this, but the services are not coming up and fencing is not happening. I'm new to this; could someone help me rectify the issue?
[root@node1 cluster]# pcs stonith create nub1 fence_virt pcmk_host_list="node1"
[root@node1 cluster]# pcs stonith create nub2 fence_virt pcmk_host_list="node2"
[root@node1 cluster]# pcs stonith show
nub1 (stonith:fence_virt): Stopped
nub2 (stonith:fence_virt): Stopped
[root@node1 cluster]#
[root@node1 cluster]# pcs status
Cluster name: mycluster
Stack: corosync
Current DC: node2 (version 1.1.15-11.el7_3.5-e174ec8) - partition with quorum
Last updated: Tue Jul 25 07:03:37 2017 Last change: Tue Jul 25 07:02:00 2017 by root via cibadmin on node1
2 nodes and 3 resources configured
Online: [ node1 node2 ]
Full list of resources:
ClusterIP (ocf::heartbeat:IPaddr2): Started node1
nub1 (stonith:fence_virt): Stopped
nub2 (stonith:fence_virt): Stopped
Failed Actions:
* nub1_start_0 on node1 'unknown error' (1): call=56, status=Error, exitreason='none',
last-rc-change='Tue Jul 25 07:01:34 2017', queued=0ms, exec=7006ms
* nub2_start_0 on node1 'unknown error' (1): call=58, status=Error, exitreason='none',
last-rc-change='Tue Jul 25 07:01:42 2017', queued=0ms, exec=7009ms
* nub1_start_0 on node2 'unknown error' (1): call=54, status=Error, exitreason='none',
last-rc-change='Tue Jul 25 07:01:26 2017', queued=0ms, exec=7010ms
* nub2_start_0 on node2 'unknown error' (1): call=60, status=Error, exitreason='none',
last-rc-change='Tue Jul 25 07:01:34 2017', queued=0ms, exec=7013ms
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
[root@node1 cluster]# pcs stonith fence node2
Error: unable to fence 'node2'
Command failed: No route to host
[root@node1 cluster]# pcs stonith fence nub2
Error: unable to fence 'nub2'
Command failed: No such device
[root@node1 cluster]# ping node2
PING node2 (192.168.100.102) 56(84) bytes of data.
64 bytes from node2 (192.168.100.102): icmp_seq=1 ttl=64 time=0.247 ms
64 bytes from node2 (192.168.100.102): icmp_seq=2 ttl=64 time=0.304 ms
^C
--- node2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.247/0.275/0.304/0.032 ms
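A hedged note on `fence_virt`: the agent only forwards fence requests to a `fence_virtd` daemon that has to be installed and configured on the virtualization host (usually answering multicast from the guests, with a shared key such as `/etc/cluster/fence_xvm.key`); if that daemon is unreachable, the stonith resources fail to start much like shown above. A quick check from the cluster nodes:

    fence_xvm -o list                # should list the VMs fence_virtd manages; a timeout means no daemon is reachable
    pcs stonith describe fence_virt  # shows which parameters (key file, multicast address, port/domain) the agent expects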
Mohammed Ali
(691 rep)
Jul 25, 2017, 11:10 AM
• Last activity: Feb 10, 2024, 02:01 AM
0
votes
1
answers
11525
views
DRBD - 'node1' not defined in your config (for this host) - Error when setting Primary
I am getting the following error when trying to set the Primary node for DRBD:
'node1' not defined in your config (for this host).
I know this is related to DNS/hostname/hosts and the config clusterdb.res. I know this because I originally got an error when trying to start clusterdb.res if node1 didn't resolve correctly. So what confuses me is that I can start clusterdb.res if I either use:
*I have used this command on the hosts*
hostnamectl set-hostname $(uname -n | sed s/\\..*//)
to make the hostname resolve to node1 instead of node1.localdomain,
or add node1.localdomain to the config; either works. But I have tried all combinations and can't seem to get this command to take:
drbdadm primary --force node1 && cat /proc/drbd
**My Configs**
/etc/drbd.d/clusterdb.res
resource clusterdb {
    protocol C;
    meta-disk internal;
    device /dev/drbd0;
    startup {
        wfc-timeout 30;
        outdated-wfc-timeout 20;
        degr-wfc-timeout 30;
    }
    net {
        cram-hmac-alg sha1;
        shared-secret sync_disk;
    }
    syncer {
        rate 10M;
        al-extents 257;
        on-no-data-accessible io-error;
        verify-alg sha1;
    }
    on node1 {
        disk /dev/sda3;
        address 192.168.1.216:7788;
    }
    on node2 {
        disk /dev/sda3;
        address 192.168.1.217:7788;
    }
}
/etc/hosts :
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.1.216 node1
192.168.1.217 node2
/etc/hostname
node#
My full write up ATM (wip)
**Edits :**
[root@node1 ~]# hostname
node1
[root@node1 ~]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
127.0.1.1 node1
192.168.1.216 node1
192.168.1.217 node2
[root@node1 ~]#
Update: I have gotten this to work with LVM following this guide exactly, so I think my issue actually lies with the following lines of code. But for now I think I will stick with LVM since it works, unless somebody else really wants to work on this. (My working LVM writeup)
device /dev/drbd0;
or
device /dev/drbd0;
The reason I say this is that I used the same hosts/hostname/shortname/IP address setup but with LVM, and it worked; then again, maybe I missed something the first time that I fixed in my new VM template (I started from scratch to build the LVM setup).
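One hedged observation about the failing command itself: `drbdadm primary` takes the *resource* name as its argument, not the node name, so with the config above it would be `clusterdb` rather than `node1` (while the `on node1 { ... }` stanza has to match `uname -n` on that host). A sketch:

    uname -n                           # must match the "on node1 { ... }" section on this host
    drbdadm dump clusterdb             # confirms the config parses and which host section is picked up
    drbdadm primary --force clusterdb  # promote the resource (argument is the resource, not the node)
    cat /proc/drbd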
FreeSoftwareServers
(2682 rep)
May 1, 2016, 01:59 AM
• Last activity: Mar 8, 2023, 02:50 AM
0
votes
1
answers
195
views
How to increase a number of nfsd threads in pcs/corosync environment?
I have an NFS server bound with *pcs/corosync* to provide stability and HA. The default number of nfsd threads is 8, which I find too low, as I can observe that all 8 are about 90% busy all the time.
The nfs/corosync/pcs system consists of 3 servers and a dist storage. How can I increase the number of threads: should I modify the */etc/sysconfig/nfs* file on all NFS nodes, or should I make some changes somewhere else?
Sorry for the newbie question, but I have nobody to ask. I appreciate any help, thank you.
OS: Centos 7.
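A hedged sketch for CentOS 7 (stock nfs-utils packaging assumed): the thread count is taken from `RPCNFSDCOUNT` in `/etc/sysconfig/nfs`, so it should be changed on every node that can run the NFS resource, and the running server can be resized on the fly with `rpc.nfsd`:

    # on every cluster node that can host the NFS server resource
    sed -i 's/^#\?RPCNFSDCOUNT=.*/RPCNFSDCOUNT=16/' /etc/sysconfig/nfs
    # apply immediately on the node currently serving (or let the next restart/failover pick it up)
    rpc.nfsd 16
    cat /proc/fs/nfsd/threads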
Mirimat
(33 rep)
Sep 13, 2022, 09:16 AM
• Last activity: Oct 4, 2022, 01:28 PM
0
votes
1
answers
200
views
Convert puppet manifest config to hiera
I installed a corosync-pacemaker cluster via Puppet. Now I would like to keep my data in a Hiera file. How should I convert the cs_primitive section into a YAML file?
cs_primitive { 'nfsshare_fs':
primitive_class => 'ocf',
primitive_type => 'Filesystem',
provided_by => 'heartbeat',
parameters => { 'device' => '/dev/disk/lvname', 'directory' => '/share', 'fstype' => 'ext4' },
}->
I tried the below code but it didn't work.
corosync::cs_primitive:
'nfsshare_fs':
primitive_class: 'ocf'
primitive_type: 'Filesystem'
provided_by: 'heartbeat'
parameters:
device: '/dev/disk/by-id/lvname'
directory: '/share'
fstype: 'ext4'
Thanks.
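A hedged note (Puppet 4+ assumed): Hiera only binds data automatically to *class parameters*, so a key for a resource type like `cs_primitive` is not consumed by itself; the manifest still needs something like `create_resources` or an explicit `lookup` to turn the hash into resources. Whether the key resolves at all can be checked with:

    # key name taken from the attempt above; the node name is a placeholder
    puppet lookup corosync::cs_primitive --node web01.example.com --explain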
fortunate1357
(1 rep)
Apr 4, 2022, 06:21 PM
• Last activity: Jul 14, 2022, 07:27 PM
0
votes
1
answers
26
views
VM managed by corosync not detecting new CPUs
I have an HA cluster managed by corosync, and I need to increase the CPU allocation to one of the VMs.
I have done the following:
* `pcs resource disable myVM`
* Wait for the VM to stop
* Edit the xml file (confirmed the correct file by `pcs resource show --full`) - within the `cpu` section I changed the entry: `` to change the number of cores to 8.
* Make sure that xml file is synced across all physical hosts
* `pcs resource enable myVM`
But when the VM comes back up, `/proc/cpuinfo` shows that it still has only 4 cores (I don't have hot plug CPUs enabled / am not sure how to enable this). There are plenty of CPU cores available on the physical hosts.
Can anyone tell me what I'm doing wrong that's preventing the VM from starting up with 8 cores instead of 4? It must be something obvious but I can't see it!
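A hedged guess worth ruling out (plain libvirt semantics, nothing Pacemaker-specific): the number of CPUs a guest boots with comes from the `<vcpu>` element of the domain XML, not only from the `<cpu><topology .../></cpu>` section, so editing just the topology can leave the guest at 4 vCPUs. A quick check, with `myVM` and the XML path standing in for the real names:

    pcs resource config myVM                         # (or "pcs resource show myVM" on older pcs) - which XML file is configured?
    grep '<vcpu' /path/to/myVM.xml                   # hypothetical path: the <vcpu> count is what the guest boots with
    virsh dumpxml myVM | grep -E '<vcpu|topology'    # what the running domain actually received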
Phil Evans
(101 rep)
Jun 9, 2022, 10:57 AM
• Last activity: Jun 17, 2022, 07:55 AM
0
votes
0
answers
459
views
HA-Cluster / corosync / pacemaker: Active-Active cluster with service ip / service ip is not switching
How do I configure crm to migrate the ServiceIP if one service has failed?
node 1: web01a \
attributes standby=off
node 2: web01b \
attributes standby=off
primitive Apache2 systemd:apache2 \
operations $id=Apache2-operations \
op start interval=0 timeout=100 \
op stop interval=0 timeout=100 \
op monitor interval=15 timeout=100 start-delay=15 \
meta
primitive PHP-FPM systemd:php7.4-fpm \
operations $id=PHP-FPM-operations \
op start interval=0 timeout=100 \
op stop interval=0 timeout=100 \
op monitor interval=15 timeout=100 start-delay=15 \
meta
primitive Redis systemd:redis-server \
operations $id=Redis-operations \
op start interval=0 timeout=100 \
op stop interval=0 timeout=100 \
op monitor interval=15 timeout=100 start-delay=15 \
meta
primitive ServiceIP IPaddr2 \
params ip=1.2.3.4 \
operations $id=ServiceIP-operations \
op monitor interval=10 timeout=20 start-delay=0 \
op_params migration-threshold=1 \
meta
primitive lsyncd systemd:lsyncd \
op start interval=0 timeout=100 \
op stop interval=0 timeout=100 \
op monitor interval=15 timeout=100 start-delay=15 \
meta target-role=Started
group ActiveNode ServiceIP lsyncd
group WebServer Apache2 PHP-FPM Redis
clone cl_WS WebServer \
meta clone-max=2 notify=true interleave=true
colocation col_cl_WS_ActiveNode 100: cl_WS ActiveNode
property cib-bootstrap-options: \
have-watchdog=false \
dc-version=2.0.3-4b1f869f0f \
cluster-infrastructure=corosync \
cluster-name=debian \
stonith-enabled=false \
no-quorum-policy=ignore \
startup-fencing=false \
maintenance-mode=false \
last-lrm-refresh=1622628525 \
start-failure-is-fatal=true
These services should always be started
- Apache2
- PHP-FPM
- Redis
If one of these services is not running, the node is unhealthy.
The **ServiceIP** and **lsyncd** should switch to a healthy node.
When I kill the apache2 process, the IP is not switched.
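For reference, a hedged sketch in crmsh (resource names taken from the config above): the only link between the web-server clone and the `ActiveNode` group is an advisory score of 100 in one direction, so a dead Apache2 never forces `ServiceIP`/`lsyncd` off the node. One common approach is to make the group require a healthy clone instance:

    # place ActiveNode (ServiceIP + lsyncd) only where a WebServer clone instance is running
    crm configure colocation col_ActiveNode_with_WS inf: ActiveNode cl_WS

How quickly a single failed monitor then evicts the node depends on the `migration-threshold`/`on-fail` settings of the clone's primitives.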
FaxMax
(726 rep)
Jun 2, 2021, 12:29 PM
2
votes
1
answers
5355
views
Pacemaker - Corosync - HA - Simple Custom Resource Testing - Status flapping - Started - Failed - Stopped - Started
I am testing using the OCF:Heartbeat:Dummy script and I want to make a very basic setup just to know it works and build on that.
The only information I can find was this web blog here.
https://raymii.org/s/tutorials/Corosync_Pacemaker_-_Execute_a_script_on_failover.html
It has some typos but basically worked for me.
The script currently just contains the following :
sudo nano /usr/local/bin/failover.sh && sudo chmod +x /usr/local/bin/failover.sh
#!/bin/sh
touch /tmp/testfailover.sh
Here is my setup :
cp /usr/lib/ocf/resource.d/heartbeat/Dummy /usr/lib/ocf/resource.d/heartbeat/FailOverScript
sudo nano /usr/lib/ocf/resource.d/heartbeat/FailOverScript
dummy_start() {
    dummy_monitor
    /usr/local/bin/failover.sh
    if [ $? = $OCF_SUCCESS ]; then
        return $OCF_SUCCESS
    fi
    touch ${OCF_RESKEY_state}
}
sed -i 's/Dummy/FailOverScript/g' /usr/lib/ocf/resource.d/heartbeat/FailOverScript
sed -i 's/dummy/FailOverScript/g' /usr/lib/ocf/resource.d/heartbeat/FailOverScript
pcs resource create FailOverScript ocf:heartbeat:FailOverScript op monitor interval="30"
The only testing I can really do :
[root@node2 ~]# /usr/lib/ocf/resource.d/heartbeat/FailOverScript start ; echo $?
DEBUG: default start : 0
0
ocf-tester doesn't seem to exist in the latest HA software suite, and I'm not really sure how to install it manually, but the script "half works".
**The script doesn't need monitoring; it's supposed to be very basic, but it seems to be flapping and giving me the following error. Any ideas what to do?**
FailOverScript (ocf::heartbeat:FailOverScript): Started node2
Failed Actions:
* FailOverScript_monitor_30000 on node2 'not running' (7): call=24423, status=complete, exitreason='none',
    last-rc-change='Tue Aug 16 15:53:50 2016', queued=0ms, exec=9ms
**Example of what I want to do:**
Cluster start
Script runs "start.sh"
Cluster fails over to node2.
On node1 script runs "fail.sh"
On node2 script runs "start.sh"
and vice versa if it fails over in the other direction.
Note: The script does work; I get /tmp/testfailover.sh. I even tried putting another script under dummy_stop to remove the file, and that worked, but it just keeps flapping along, removing/adding the file and starting/failing/stopping/starting, etc.
Thanks for reading!
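A hedged reading of the flapping, based on the stock Dummy agent's convention of using `${OCF_RESKEY_state}` as its "I am running" marker: in the modified start function above, whenever `failover.sh` exits 0 (which equals `$OCF_SUCCESS`) the early `return` is taken and the state file is never created, so the next monitor reports "not running" and Pacemaker restarts the resource forever. A sketch of a start that always leaves the marker behind (post-`sed` function names):

    FailOverScript_start() {
        if FailOverScript_monitor; then
            return $OCF_SUCCESS        # already running, nothing to do
        fi
        /usr/local/bin/failover.sh     # run the failover hook
        touch "${OCF_RESKEY_state}"    # marker the monitor action checks for
        return $OCF_SUCCESS
    }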
FreeSoftwareServers
(2682 rep)
Aug 16, 2016, 07:56 PM
• Last activity: Dec 21, 2020, 06:56 AM
1
votes
0
answers
603
views
Cannot seem to start pcs cluster (NFS Cluster) disk_fencing trouble
For the life of me, I can't find a clear answer on how to start my NFS active / passive cluster. I have two nodes, node1 and node2 and followed the guide here: https://www.linuxtechi.com/configure-nfs-server-clustering-pacemaker-centos-7-rhel-7/
Here are my logs:
May 25 10:35:59 node1 stonith-ng: notice: Couldn't find anyone to fence (on) node1 with any device
May 25 10:35:59 node1 stonith-ng: error: Operation on of node1 by for crmd.3928@node1.97f683f8: No route to host
May 25 10:35:59 node1 crmd: notice: Stonith operation 142/2:72:0:f3e078bf-24f5-4160-95c1-0eeeea0e5e12: No route to host (-113)
May 25 10:35:59 node1 crmd: notice: Stonith operation 142 for node1 failed (No route to host): aborting transition.
May 25 10:35:59 node1 crmd: warning: Too many failures (71) to fence node1, giving up
May 25 10:35:59 node1 crmd: notice: Transition aborted: Stonith failed
May 25 10:35:59 node1 crmd: error: Unfencing of node1 by failed: No route to host (-113)
May 25 10:35:59 node1 stonith-ng: notice: Couldn't find anyone to fence (on) node2 with any device
May 25 10:35:59 node1 stonith-ng: error: Operation on of node2 by for crmd.3928@node1.2680795a: No route to host
May 25 10:35:59 node1 crmd: notice: Stonith operation 143/1:72:0:f3e078bf-24f5-4160-95c1-0eeeea0e5e12: No route to host (-113)
May 25 10:35:59 node1 crmd: notice: Stonith operation 143 for node2 failed (No route to host): aborting transition.
May 25 10:35:59 node1 crmd: warning: Too many failures (71) to fence node2, giving up
May 25 10:35:59 node1 crmd: error: Unfencing of node2 by failed: No route to host (-113)
Here is the status:
[root@node1 ~]# pcs status
Cluster name: nfs_cluster
Stack: corosync
Current DC: node1 (version 1.1.20-5.amzn2.0.2-3c4c782f70) - partition with quorum
Last updated: Mon May 25 10:45:56 2020
Last change: Sun May 24 21:04:55 2020 by root via cibadmin on node1
2 nodes configured
5 resources configured
Online: [ node1 node2 ]
Full list of resources:
disk_fencing (stonith:fence_scsi): Stopped
Resource Group: nfsgrp
nfsshare (ocf::heartbeat:Filesystem): Stopped
nfsd (ocf::heartbeat:nfsserver): Stopped
nfsroot (ocf::heartbeat:exportfs): Stopped
nfsip (ocf::heartbeat:IPaddr2): Stopped
Failed Fencing Actions:
* unfencing of node2 failed: delegate=, client=crmd.3928, origin=node1,
last-failed='Mon May 25 10:35:59 2020'
* unfencing of node1 failed: delegate=, client=crmd.3928, origin=node1,
last-failed='Mon May 25 10:35:59 2020'
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
[root@node1 ~]#
The disk_fencing resource is set to fence_scsi, but I'm not sure that is the best option for two AWS EC2 instances. Perhaps I can't get disk_fencing to work, so it can't start? I can ping node1 from node2 and vice versa. Open to ideas...
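A hedged note on the fencing choice: `fence_scsi` relies on SCSI-3 persistent reservations on a disk shared by both nodes (normally passed via its `devices` parameter), which ordinary per-instance EBS volumes do not provide, and that would explain the agent never starting. Some things to check, with the device path being a placeholder:

    pcs stonith describe fence_scsi                  # required parameters (devices=, ...)
    sg_persist --in --read-keys --device=/dev/xvdf   # hypothetical shared device: does it accept SCSI-3 PR at all?
    # On EC2, an instance-level agent such as fence_aws (fence-agents-aws) is the usual alternative.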
jasontt33
(11 rep)
May 25, 2020, 10:49 AM
0
votes
0
answers
685
views
Unmount data volume hosting NFS exports
So I've got a customer with two NFS servers set up in a Pacemaker active/standby cluster configuration. It is a legacy system running RHEL 6. On the servers there is /mnt/data1, which is an xfs mountpoint on a drbd-mirrored disk. The mount is active on one node at a time and controlled by pacemaker (so is drbd, for that matter).
My problem is in critical cases where I need to move the active services to the other server without shutting down the NFS clients first. I can shut down the NFS services, but no matter what I try, I can't unmount the /mnt/data1 filesystem as it reports as 'busy'.
I tried changing the daemon stop sequence on the nodes. Right now I have the following sequence:
- rpc.mountd
- nfsd
- exportfs -au
- rpc.statd
Both 'lsof /mnt/data1' and 'fuser -mv /mnt/data1' do not report any open files on the mountpoint and I can verify no terminal sessions are there either. Short of having to shutdown the box (which kills any debugging I would like to do), I can't get the filesystem unmounted to allow pacemaker to cleanly move the filesystem mount to the other node. I assume that there are some hanging file locks, but I'm not sure how else to kill them.
Any ideas are appreciated.
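A hedged sketch of a stop sequence to try (RHEL 6 tooling assumed): with kernel NFS, the export can be held busy by kernel nfsd/lockd threads rather than by userspace processes, which would explain why `lsof` and `fuser` come back clean; unexporting and stopping the kernel services first, with a lazy unmount only as a last resort, sometimes frees the filesystem:

    exportfs -ua              # withdraw all exports
    service nfs stop          # stop rpc.mountd and the kernel nfsd threads
    service nfslock stop      # drop lockd/statd state held against the export
    umount /mnt/data1 || umount -l /mnt/data1   # lazy unmount only as a last resort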
pbrunnen
(113 rep)
Nov 10, 2019, 05:36 PM
• Last activity: Nov 10, 2019, 05:41 PM
1
votes
1
answers
2580
views
Pacemaker: Primary node is rebooted and comes back is primary instead of standby
We are using Pacemaker and Corosync to automate failovers. We noticed one behaviour: when the primary node is rebooted, the standby node takes over as primary, which is fine.
When the node comes back online and services are started on it, it takes back the role of primary. It should ideally start as standby.
Are we missing any configuration?
> pcs resource defaults
O/p:
resource-stickiness: INFINITY
migration-threshold: 0
Stickiness is set to INFINITY. Please suggest.
Adding Config details:
======================
[root@Node1 heartbeat]# pcs config show –l
Cluster Name: cluster1
Corosync Nodes:
 Node1 Node2
Pacemaker Nodes:
 Node1 Node2
Resources:
 Master: msPostgresql
  Meta Attrs: master-node-max=1 clone-max=2 notify=true master-max=1 clone-node-max=1
  Resource: pgsql (class=ocf provider=heartbeat type=pgsql)
   Attributes: master_ip=10.70.10.1 node_list="Node1 Node2" pgctl=/usr/pgsql-9.6/bin/pg_ctl pgdata=/var/lib/pgsql/9.6/data/ primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" psql=/usr/pgsql-9.6/bin/psql rep_mode=async restart_on_promote=true restore_command="cp /var/lib/pgsql/9.6/data/archivedir/%f %p"
   Meta Attrs: failure-timeout=60
   Operations: demote interval=0s on-fail=stop timeout=60s (pgsql-demote-interval-0s)
               methods interval=0s timeout=5s (pgsql-methods-interval-0s)
               monitor interval=4s on-fail=restart timeout=60s (pgsql-monitor-interval-4s)
               monitor interval=3s on-fail=restart role=Master timeout=60s (pgsql-monitor-interval-3s)
               notify interval=0s timeout=60s (pgsql-notify-interval-0s)
               promote interval=0s on-fail=restart timeout=60s (pgsql-promote-interval-0s)
               start interval=0s on-fail=restart timeout=60s (pgsql-start-interval-0s)
               stop interval=0s on-fail=block timeout=60s (pgsql-stop-interval-0s)
 Group: master-group
  Resource: vip-master (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: cidr_netmask=24 ip=10.70.10.2
   Operations: monitor interval=10s on-fail=restart timeout=60s (vip-master-monitor-interval-10s)
               start interval=0s on-fail=restart timeout=60s (vip-master-start-interval-0s)
               stop interval=0s on-fail=block timeout=60s (vip-master-stop-interval-0s)
  Resource: vip-rep (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: cidr_netmask=24 ip=10.70.10.1
   Meta Attrs: migration-threshold=0
   Operations: monitor interval=10s on-fail=restart timeout=60s (vip-rep-monitor-interval-10s)
               start interval=0s on-fail=stop timeout=60s (vip-rep-start-interval-0s)
               stop interval=0s on-fail=ignore timeout=60s (vip-rep-stop-interval-0s)
Stonith Devices:
Fencing Levels:
Location Constraints:
Ordering Constraints:
 promote msPostgresql then start master-group (score:INFINITY) (non-symmetrical)
 demote msPostgresql then stop master-group (score:0) (non-symmetrical)
Colocation Constraints:
 master-group with msPostgresql (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master)
Ticket Constraints:
Alerts:
 No alerts defined
Resources Defaults:
 resource-stickiness: INFINITY
 migration-threshold: 0
Operations Defaults:
 No defaults set
Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: cluster1
 cluster-recheck-interval: 60
 dc-version: 1.1.19-8.el7-c3c624ea3d
 have-watchdog: false
 no-quorum-policy: ignore
 start-failure-is-fatal: false
 stonith-enabled: false
Node Attributes:
 Node1: pgsql-data-status=STREAMING|ASYNC
 Node2: pgsql-data-status=LATEST
Quorum:
 Options:
Thanks!
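A hedged diagnostic sketch (not taken from the configuration above): `resource-stickiness` only decides where an instance keeps *running*; which instance gets *promoted* is driven by the master scores the pgsql resource agent derives from replication state (the `pgsql-data-status` node attributes shown above), so comparing the scores after the old primary rejoins usually shows why it wins the promotion back:

    crm_simulate -sL | grep -iE 'promotion|pgsql'   # allocation and promotion scores from the live CIB
    crm_mon -A1                                     # one-shot status including node attributes such as pgsql-data-status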
User2019
(11 rep)
Sep 12, 2019, 09:30 AM
• Last activity: Sep 16, 2019, 06:18 PM
0
votes
1
answers
887
views
corosync 2Node vs two_node flags
Is the **`2Node`** flag equivalent to the **`two_node`** flag? Is it the same?
[root@srv1 ~]# corosync-quorumtool -s
Quorum information
------------------
Date: Wed Mar 20 04:49:10 2019
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 1
Ring ID: 1/464
Quorate: Yes
Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 2
Quorum: 1
Flags: 2Node Quorate WaitForAll
Membership information
----------------------
Nodeid Votes Name
1 1 srv1cr1 (local)
2 1 srv2cr1
http://people.redhat.com/ccaulfie/docs/Votequorum_Intro.pdf
I was not able to find an answer to my question, and the documentation only references **`two_node`**.
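A hedged way to confirm the correspondence on a running cluster (standard corosync tooling): `two_node: 1` in `corosync.conf` is what `corosync-quorumtool` reports back as the `2Node` flag, and it implies `wait_for_all` unless that is explicitly disabled, which matches the `WaitForAll` flag in the output above:

    grep -A3 '^quorum' /etc/corosync/corosync.conf   # look for "two_node: 1"
    corosync-quorumtool -s | grep Flags              # reported as "2Node Quorate WaitForAll"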
blabla_trace
(385 rep)
Mar 20, 2019, 08:57 AM
• Last activity: Aug 9, 2019, 02:42 PM
2
votes
1
answers
3241
views
PCS Stonith (fencing) will kill two node cluster if first is down
I have configured a two node physical server cluster (HP ProLiant DL560 Gen8) using pcs (corosync/pacemaker/pcsd). I have also configured fencing on them using fence_ilo4.
The weird thing happens if one node goes down (by DOWN I mean powered OFF): the second node will die as well. Fencing will kill it, causing both servers to be offline.
How do I correct this behavior?
The thing I tried is to add `wait_for_all: 0` and `expected_votes: 1` in `/etc/corosync/corosync.conf` under the `quorum` section. But it still kills it.
At some point, some maintenance will have to be performed on one of those servers, and it will have to be shut down. I don't want the other node to go down if this happens.
Here are some outputs
[root@kvm_aquila-02 ~]# pcs quorum status
Quorum information
------------------
Date: Fri Jun 28 09:07:18 2019
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 2
Ring ID: 1/284
Quorate: Yes
Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 2
Quorum: 1
Flags: 2Node Quorate
Membership information
----------------------
Nodeid Votes Qdevice Name
1 1 NR kvm_aquila-01
2 1 NR kvm_aquila-02 (local)
[root@kvm_aquila-02 ~]# pcs config show
Cluster Name: kvm_aquila
Corosync Nodes:
kvm_aquila-01 kvm_aquila-02
Pacemaker Nodes:
kvm_aquila-01 kvm_aquila-02
Resources:
Clone: dlm-clone
Meta Attrs: interleave=true ordered=true
Resource: dlm (class=ocf provider=pacemaker type=controld)
Operations: monitor interval=30s on-fail=fence (dlm-monitor-interval-30s)
start interval=0s timeout=90 (dlm-start-interval-0s)
stop interval=0s timeout=100 (dlm-stop-interval-0s)
Clone: clvmd-clone
Meta Attrs: interleave=true ordered=true
Resource: clvmd (class=ocf provider=heartbeat type=clvm)
Operations: monitor interval=30s on-fail=fence (clvmd-monitor-interval-30s)
start interval=0s timeout=90s (clvmd-start-interval-0s)
stop interval=0s timeout=90s (clvmd-stop-interval-0s)
Group: test_VPS
Resource: test (class=ocf provider=heartbeat type=VirtualDomain)
Attributes: config=/shared/xml/test.xml hypervisor=qemu:///system migration_transport=ssh
Meta Attrs: allow-migrate=true is-managed=true priority=100 target-role=Started
Utilization: cpu=4 hv_memory=4096
Operations: migrate_from interval=0 timeout=120s (test-migrate_from-interval-0)
migrate_to interval=0 timeout=120 (test-migrate_to-interval-0)
monitor interval=10 timeout=30 (test-monitor-interval-10)
start interval=0s timeout=300s (test-start-interval-0s)
stop interval=0s timeout=300s (test-stop-interval-0s)
Stonith Devices:
Resource: kvm_aquila-01 (class=stonith type=fence_ilo4)
Attributes: ipaddr=10.0.4.39 login=fencing passwd=0ToleranciJa pcmk_host_list="kvm_aquila-01 kvm_aquila-02"
Operations: monitor interval=60s (kvm_aquila-01-monitor-interval-60s)
Resource: kvm_aquila-02 (class=stonith type=fence_ilo4)
Attributes: ipaddr=10.0.4.49 login=fencing passwd=0ToleranciJa pcmk_host_list="kvm_aquila-01 kvm_aquila-02"
Operations: monitor interval=60s (kvm_aquila-02-monitor-interval-60s)
Fencing Levels:
Location Constraints:
Ordering Constraints:
start dlm-clone then start clvmd-clone (kind:Mandatory)
Colocation Constraints:
clvmd-clone with dlm-clone (score:INFINITY)
Ticket Constraints:
Alerts:
No alerts defined
Resources Defaults:
No defaults set
Operations Defaults:
No defaults set
Cluster Properties:
cluster-infrastructure: corosync
cluster-name: kvm_aquila
dc-version: 1.1.19-8.el7_6.4-c3c624ea3d
have-watchdog: false
last-lrm-refresh: 1561619537
no-quorum-policy: ignore
stonith-enabled: true
Quorum:
Options:
wait_for_all: 0
[root@kvm_aquila-02 ~]# pcs cluster status
Cluster Status:
Stack: corosync
Current DC: kvm_aquila-02 (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with quorum
Last updated: Fri Jun 28 09:14:11 2019
Last change: Thu Jun 27 16:23:44 2019 by root via cibadmin on kvm_aquila-01
2 nodes configured
7 resources configured
PCSD Status:
kvm_aquila-02: Online
kvm_aquila-01: Online
[root@kvm_aquila-02 ~]# pcs status
Cluster name: kvm_aquila
Stack: corosync
Current DC: kvm_aquila-02 (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with quorum
Last updated: Fri Jun 28 09:14:31 2019
Last change: Thu Jun 27 16:23:44 2019 by root via cibadmin on kvm_aquila-01
2 nodes configured
7 resources configured
Online: [ kvm_aquila-01 kvm_aquila-02 ]
Full list of resources:
kvm_aquila-01 (stonith:fence_ilo4): Started kvm_aquila-01
kvm_aquila-02 (stonith:fence_ilo4): Started kvm_aquila-02
Clone Set: dlm-clone [dlm]
Started: [ kvm_aquila-01 kvm_aquila-02 ]
Clone Set: clvmd-clone [clvmd]
Started: [ kvm_aquila-01 kvm_aquila-02 ]
Resource Group: test_VPS
test (ocf::heartbeat:VirtualDomain): Started kvm_aquila-01
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
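For the planned-maintenance case, a hedged sketch (pcs 0.9 syntax assumed): shutting a node down cleanly through the cluster should not trigger fencing at all, and a static `delay` on one of the fence devices is the usual way to keep the two nodes from fencing each other after a real split:

    pcs cluster standby kvm_aquila-01    # move resources off the node first
    pcs cluster stop kvm_aquila-01       # clean stop: the peer sees a graceful leave, no fencing is requested
    # stagger one device so only one node can win a post-split fence race:
    pcs stonith update kvm_aquila-01 delay=15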
Marko Todoric
(437 rep)
Jun 28, 2019, 07:14 AM
• Last activity: Jun 28, 2019, 02:38 PM
1
votes
2
answers
9752
views
Corosync error "No interfaces defined" in a cluster member
I am having an error starting corosync on a cluster member:
May 16 00:53:32 neftis corosync: [MAIN ] Corosync Cluster Engine ('2.3.4'): started and ready to provide service.
May 16 00:53:32 neftis corosync: [MAIN ] Corosync built-in features: dbus systemd xmlconf snmp pie relro bindnow
May 16 00:53:32 neftis corosync: [MAIN ] parse error in config: No interfaces defined
May 16 00:53:32 neftis corosync: [MAIN ] Corosync Cluster Engine exiting with status 8 at main.c:1278.
May 16 00:53:32 neftis corosync: Starting Corosync Cluster Engine (corosync): [FALLÓ]
May 16 00:53:32 neftis systemd: corosync.service: control process exited, code=exited status=1
May 16 00:53:32 neftis systemd: Failed to start Corosync Cluster Engine.
May 16 00:53:32 neftis systemd: Unit corosync.service entered failed state.
May 16 00:53:32 neftis systemd: corosync.service failed.
May 16 00:54:06 neftis systemd: Cannot add dependency job for unit firewalld.service, ignoring: Unit firewalld.service is masked.
May 16 00:54:06 neftis systemd: Starting Corosync Cluster Engine...
May 16 00:54:06 neftis corosync: [MAIN ] Corosync Cluster Engine ('2.3.4'): started and ready to provide service.
May 16 00:54:06 neftis corosync: [MAIN ] Corosync built-in features: dbus systemd xmlconf snmp pie relro bindnow
May 16 00:54:06 neftis corosync: [MAIN ] parse error in config: No interfaces defined
May 16 00:54:06 neftis corosync: [MAIN ] Corosync Cluster Engine exiting with status 8 at main.c:1278.
May 16 00:54:06 neftis corosync: Starting Corosync Cluster Engine (corosync): [FALLÓ]
May 16 00:54:06 neftis systemd: corosync.service: control process exited, code=exited status=1
May 16 00:54:06 neftis systemd: Failed to start Corosync Cluster Engine.
May 16 00:54:06 neftis systemd: Unit corosync.service entered failed state.
Here is my config on the three nodes, but it is failing just on netfis, which I added recently.
totem {
    version: 2
    secauth: off
    cluster_name: cluster-osiris
    transport: udpu
}
nodelist {
    node {
        ring0_addr: isis.localdoamin
        nodeid: 1
    }
    node {
        ring0_addr: horus.localdoamin
        nodeid: 2
    }
    node {
        ring0_addr: netfis.localdoamin
        nodeid: 3
    }
}
quorum {
    provider: corosync_votequorum
}
logging {
    to_syslog: yes
}
I am running a pacemaker, corosync, pcs cluster on 64-bit CentOS 7.1.
I searched on the internet, but it is not clear what is going on.
Could you help me?
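A hedged observation drawn from the post itself: with `transport: udpu`, corosync has to resolve one `ring0_addr` from the nodelist to an address that is configured locally on the node it starts on, and the failing host calls itself `neftis` in the logs while the nodelist entry reads `netfis.localdoamin`, so a resolution mismatch would produce exactly "parse error in config: No interfaces defined". A quick check on the failing node:

    hostname                          # what this node calls itself (the logs say "neftis")
    getent hosts netfis.localdoamin   # does the nodelist entry resolve at all?
    ip -o -4 addr show                # is the resolved address actually configured on a local interface?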
mijhael3000
(85 rep)
May 16, 2016, 04:28 AM
• Last activity: Jun 18, 2019, 08:42 AM
1
votes
0
answers
788
views
Adding (cifs, samba)Filesystem resource to PCS
I am trying to create a _PCS_ _Filesystem_ resource on a Samba share (cifs) filesystem type.
Here is the resource I have created:
root@shaunak-VirtualBox:~# pcs resource show SMBDiskResourceName
Resource: SMBDiskResourceName (class=ocf provider=heartbeat type=Filesystem)
Attributes: device=//192.168.1.6/my_data_share directory=/var/opt/my/data fstype=cifs options="vers=3.0,username=myuser,password=myuser,uid= 998,gid=998,file_mode=0777,dir_mode=0777"
Here is the error I am getting while starting the resource:
root@shaunak-VirtualBox:~# pcs resource debug-start SMBDiskResourceName
Error performing operation: Operation not permitted
Operation start for SMBDiskResourceName (ocf:heartbeat:Filesystem) returned 1
> stderr: INFO: Running start for //192.168.1.6/mssql_data_share on /var/opt/mssql/data
> stderr:
> stderr: Usage:
> stderr: mount [-lhV]
> stderr: mount -a [options]
> stderr: mount [options] [--source] | [--target]
> stderr: mount [options]
> stderr: mount []
> stderr:
> stderr: Mount a filesystem.
> stderr:
.....
> stderr:
> stderr: For more details see mount(8).
> stderr: ocf-exit-reason:Couldn't mount filesystem //192.168.1.6/my_data_share on /var/opt/my/data
The above error says the mount is not happening for //192.168.1.6/my_data_share,
but when I run the mount command manually, I am able to mount it.
Not sure what I am missing here. Here is my mount command, which I executed successfully:
mount -t cifs //192.168.1.6/my_data_share /var/opt/my/data -o vers=3.0,username=myuser,password=myuser,uid=998,gid=998,file_mode=0777,dir_mode=0777
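A hedged observation about the configured options: the `Attributes` line above contains `uid= 998` with a space after the equals sign; if that space is really stored in the CIB (and is not just a paste artifact), it splits the `-o` argument and makes `mount` print its usage text, which is exactly what the debug output shows. Re-creating the option string without the space is worth trying:

    pcs resource update SMBDiskResourceName options="vers=3.0,username=myuser,password=myuser,uid=998,gid=998,file_mode=0777,dir_mode=0777"
    pcs resource debug-start SMBDiskResourceName --full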
Shaunak Patel
(111 rep)
Mar 26, 2019, 05:24 PM
• Last activity: Mar 26, 2019, 06:34 PM
3
votes
1
answers
15321
views
quorum in a two-node cluster with pacemaker
I have a two-node active-passive cluster.
From the Clusters from Scratch guide:
> If a cluster splits into two (or more) groups of nodes that can no
> longer communicate with each other (aka. partitions), quorum is used
> to prevent resources from starting on more nodes than desired, which
> would risk data corruption. A cluster has quorum when more than half
> of all known nodes are online in the same partition
>
> By the above definition, a two-node cluster would only have quorum
> when both nodes are running. This would make the creation of a
> two-node cluster pointless, but corosync has the ability to treat
> two-node clusters as if only one node is required for quorum. The pcs
> cluster setup command will automatically configure two_node: 1 in
> corosync.conf, so a two-node cluster will "just work".
Here's my config:
So how can the cluster now decide which one has quorum?
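For what it's worth, a hedged illustration of what `two_node: 1` changes (standard votequorum behaviour): each node stays quorate with its single vote, and the implied `wait_for_all` keeps a node that boots alone from starting resources until it has seen its peer at least once, so resolving an actual split is effectively left to fencing:

    corosync-quorumtool -s | grep -E 'Flags|Quorum:'   # expect "Flags: 2Node Quorate WaitForAll"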

blabla_trace
(385 rep)
Feb 22, 2019, 09:09 PM
• Last activity: Feb 22, 2019, 10:53 PM
1
votes
1
answers
306
views
When I start corosync all servers panics with core dumps
I upgraded my servers. Then I started the corosync service one by one on my servers. I started it first on 3 servers and waited 5 minutes. Then I started the next 4 corosync instances on the other servers, and 7 servers crashed at the same time.
I have been using corosync for 5 years. I was using:
Kernel: 4.14.32-1-lts
Corosync 2.4.2-1
Pacemaker 1.1.18-1
and I never saw this before.
I guess something is really badly broken in the new corosync version!
Kernel: 4.14.70-1-lts
Corosync 2.4.4-3
Pacemaker 2.0.0-1
-
**This is my corosync.conf: https://paste.ubuntu.com/p/7KCq8pHKn3/**
**Can you tell me how can I find the reason of the problem?**
Sep 25 08:56:03 SRV-2 corosync: [TOTEM ] A new membership (10.10.112.10:56) was formed. Members joined: 7
Sep 25 08:56:03 SRV-2 corosync: [VOTEQ ] Waiting for all cluster members. Current votes: 7 expected_votes: 28
Sep 25 08:56:03 SRV-2 corosync: [VOTEQ ] Waiting for all cluster members. Current votes: 7 expected_votes: 28
Sep 25 08:56:03 SRV-2 corosync: [VOTEQ ] Waiting for all cluster members. Current votes: 7 expected_votes: 28
Sep 25 08:56:03 SRV-2 corosync: [VOTEQ ] Waiting for all cluster members. Current votes: 7 expected_votes: 28
Sep 25 08:56:03 SRV-2 corosync: [QUORUM] Members: 1 2 3 4 5 6 7
Sep 25 08:56:03 SRV-2 corosync: [MAIN ] Completed service synchronization, ready to provide service.
Sep 25 08:56:03 SRV-2 corosync: [VOTEQ ] Waiting for all cluster members. Current votes: 7 expected_votes: 28
Sep 25 08:56:03 SRV-2 systemd: Created slice system-systemd\x2dcoredump.slice.
Sep 25 08:56:03 SRV-2 systemd: Started Process Core Dump (PID 43798/UID 0).
Sep 25 08:56:03 SRV-2 systemd: corosync.service: Main process exited, code=dumped, status=11/SEGV
Sep 25 08:56:03 SRV-2 systemd: corosync.service: Failed with result 'core-dump'.
Sep 25 08:56:03 SRV-2 kernel: watchdog: watchdog0: watchdog did not stop!
Sep 25 08:56:03 SRV-2 systemd-coredump: Process 29089 (corosync) of user 0 dumped core.
Stack trace of thread 29089:
#0 0x0000000000000000 n/a (n/a)
Write failed: Broken pipe
coredumpctl info
PID: 23658 (corosync)
UID: 0 (root)
GID: 0 (root)
Signal: 11 (SEGV)
Timestamp: Mon 2018-09-24 09:50:58 +03 (1 day 3h ago)
Command Line: corosync
Executable: /usr/bin/corosync
Control Group: /system.slice/corosync.service
Unit: corosync.service
Slice: system.slice
Boot ID: 79d67a83f83c4804be6ded8e6bd5f54d
Machine ID: 9b1ca27d3f4746c6bcfcdb93b83f3d45
Hostname: SRV-1
Storage: /var/lib/systemd/coredump/core.corosync.0.79d67a83f83c4804be6ded8e6bd5f54d.23658.153777185>
Message: Process 23658 (corosync) of user 0 dumped core.
Stack trace of thread 23658:
#0 0x0000000000000000 n/a (n/a)
PID: 5164 (corosync)
UID: 0 (root)
GID: 0 (root)
Signal: 11 (SEGV)
Timestamp: Tue 2018-09-25 08:56:03 +03 (4h 9min ago)
Command Line: corosync
Executable: /usr/bin/corosync
Control Group: /system.slice/corosync.service
Unit: corosync.service
Slice: system.slice
Boot ID: 2f49ec6cdcc144f0a8eb712bbfbd7203
Machine ID: 9b1ca27d3f4746c6bcfcdb93b83f3d45
Hostname: SRV-1
Storage: /var/lib/systemd/coredump/core.corosync.0.2f49ec6cdcc144f0a8eb712bbfbd7203.5164.1537854963>
Message: Process 5164 (corosync) of user 0 dumped core.
Stack trace of thread 5164:
#0 0x0000000000000000 n/a (n/a)
I can't find any more logs, so I can't dig into the problem.
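A hedged sketch for squeezing more out of what systemd already captured (Arch-style packaging assumed from the version strings): loading the dump into gdb gives a full backtrace, which is what upstream will ask for, and it is also worth confirming that every node was running the same corosync build during the rolling restart:

    coredumpctl list corosync    # pick the newest dump
    coredumpctl gdb corosync     # open it in gdb, then: bt full / info threads
    pacman -Q corosync libqb     # same versions on every node?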
Ozbit
(439 rep)
Sep 25, 2018, 10:03 AM
• Last activity: Oct 11, 2018, 12:29 PM
1
votes
1
answers
848
views
How to install debug symbols for corosync package on CentOS?
I got a crash in `corosync` which I would like to view in gdb. However, currently the core dump shows me only this much info:
Debug logs for core.1385 (Generated on Jul 26 10:17 BST)
[Thread debugging using libthread_db enabled]
Core was generated by `corosync -f'.
Program terminated with signal 6, Aborted.
#0 0x00007f68b2783495 in raise () from /lib64/libc.so.6
#0 0x00007f68b2783495 in raise () from /lib64/libc.so.6
#1 0x00007f68b2784c75 in abort () from /lib64/libc.so.6
#2 0x00007f68b277c60e in __assert_fail_base () from /lib64/libc.so.6
#3 0x00007f68b277c6d0 in __assert_fail () from /lib64/libc.so.6
#4 0x00007f68b3530f2c in ?? () from /usr/lib64/libtotem_pg.so.4
#5 0x00007f68b3534eaf in ?? () from /usr/lib64/libtotem_pg.so.4
#6 0x00007f68b3535259 in ?? () from /usr/lib64/libtotem_pg.so.4
#7 0x00007f68b352f108 in rrp_deliver_fn () from /usr/lib64/libtotem_pg.so.4
#8 0x00007f68b352be2a in ?? () from /usr/lib64/libtotem_pg.so.4
#9 0x00007f68b3524482 in poll_run () from /usr/lib64/libtotem_pg.so.4
#10 0x00000000004079b6 in main ()
I guess I need to install the debug info packages for `corosync` and whatever provides `libtotem_pg.so.4`. How do I do this?
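A hedged sketch for CentOS (yum-utils assumed available): `debuginfo-install` pulls the matching `*-debuginfo` packages from the debuginfo repository, and `libtotem_pg.so.4` ships in `corosynclib`, so its symbols come with that package's debuginfo:

    yum install -y yum-utils
    debuginfo-install -y corosync corosynclib   # needs the *-debuginfo repo reachable
    gdb /usr/sbin/corosync core.1385            # re-open the dump; the libtotem_pg frames should now resolve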
Serge Rogatch
(167 rep)
Jul 26, 2018, 04:05 PM
• Last activity: Jul 26, 2018, 04:59 PM
2
votes
1
answers
2841
views
Unable to mount gfs2 file system on Debian Stretch, probable dlm mis-config?
I am experimenting with gfs2 on Debian Stretch, and having some difficulties. I am a reasonably experienced Linux admin, but new to shared-disk and parallel file systems.
My immediate project is to mount a gfs2-formatted iscsi-exported device on multiple clients as a shared file system. For the moment, I am not interested in HA or fencing, although this may be important later on.
The iscsi part is fine, I am able to log in to the target, format it as an xfs file system, and also mount it on multiple clients and verify that it shows up with the same blkid.
To do the gfs2 business, I am following the scheme on the Debian stretch "gfs2" man page, modified for my config, and embellished slightly by various searches and so forth.
Man page is here:
https://manpages.debian.org/stretch/gfs2-utils/gfs2.5.en.html
The actual error is, when I attempt to mount my gfs2 file system, the mount command returns with
mount: mount(2) failed: /mnt: No such file or directory
... where /mnt is the desired mount point, which certainly does
exist. (If you attempt to mount to a nonexistent mount point the
error is "mount: mount point /wrong does not exist").
Related, at each mount attempt, dmesg reports:
gfs2: can't find protocol lock_dlm
I briefly went down the path of assuming the problem was that Debian packages do not provide "/sbin/mount.gfs2", and looked for that, but I think that was an incorrect guess.
I have a five-machine cluster (of Raspberry Pis, in case it matters), named, somewhat idiosyncratically, pio, pi, pj, pk, and pl. They all have fixed static IP addresses, and there's no domain.
I have installed the Debian gfs2, corosync, and dlm-controld packages.
For the corosync step, my corosync config is (e.g. for pio, intended to be the master of the cluster):
totem {
    version: 2
    cluster_name: rpitest
    token: 3000
    token_retransmits_before_loss_const: 10
    clear_node_high_bit: yes
    crypto_cipher: none
    crypto_hash: none
    nodeid: 17
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.0.17
        mcastport: 5405
        ttl: 1
    }
}
nodelist {
    node {
        ring0_addr: 192.168.0.17
        nodeid: 17
    }
    node {
        ring0_addr: 192.168.0.11
        nodeid: 1
    }
    node {
        ring0_addr: 192.168.0.12
        nodeid: 2
    }
    node {
        ring0_addr: 192.168.0.13
        nodeid: 3
    }
    node {
        ring0_addr: 192.168.0.14
        nodeid: 4
    }
}
logging {
    fileline: off
    to_stderr: no
    to_logfile: no
    to_syslog: yes
    syslog_facility: daemon
    debug: off
    timestamp: on
    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}
quorum {
    provider: corosync_votequorum
    expected_votes: 5
}
This file is present on all the nodes, with appropriate node-specific changes to the nodeid and bindnetaddr fields in the totem section.
The corosync tool starts without error on all nodes, and all the
nodes also have sane-looking output from corosync-quorumtool, thus:
root@pio:~# corosync-quorumtool
Quorum information
------------------
Date: Sun Apr 22 11:04:13 2018
Quorum provider: corosync_votequorum
Nodes: 5
Node ID: 17
Ring ID: 1/124
Quorate: Yes
Votequorum information
----------------------
Expected votes: 5
Highest expected: 5
Total votes: 5
Quorum: 3
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
1 1 192.168.0.11
2 1 192.168.0.12
3 1 192.168.0.13
4 1 192.168.0.14
17 1 192.168.0.17 (local)
The dlm-controld package was installed, and /etc/dlm/dlm.conf created with
the following simple config. Again, I am skipping fencing for now.
The dlm.conf file is the same on all the nodes.
enable_fencing=0
lockspace rpitest nodir=1
master rpitest node=17
I am unclear on whether or not the DLM "lockspace" name is supposed to match the corosync cluster name or not. I see the same behavior either way.
The dlm-controld service starts without errors, and the output of "dlm_tool status" appears sane:
root@pio:~# dlm_tool status
cluster nodeid 17 quorate 1 ring seq 124 124
daemon now 1367 fence_pid 0
node 1 M add 31 rem 0 fail 0 fence 0 at 0 0
node 2 M add 31 rem 0 fail 0 fence 0 at 0 0
node 3 M add 31 rem 0 fail 0 fence 0 at 0 0
node 4 M add 31 rem 0 fail 0 fence 0 at 0 0
node 17 M add 7 rem 0 fail 0 fence 0 at 0 0
The gfs2 file system was created by:
mkfs -t gfs2 -p lock_dlm -j 5 -t rpitest:one /path/to/device
Subsequent to this, "blkid /path/to/device" reports:
/path/to/device: LABEL="rpitest:one" UUID= TYPE="gfs2"
It looks the same on all the iscsi clients.
At this point, I feel like I should be able to mount the gfs2 file system on any/all of the clients, but here is where I get the error above -- the mount command reports a "no such file or directory", and dmesg and syslog report "gfs2: can't find protocol lock_dlm".
There are several other gfs2 guides out there, but many of them seem to be RH/CentOS specific, and for other cluster-management schemes besides corosync, like cman or pacemaker. Those aren't necessarily deal-breakers, but it's high-value to me to have this work on nearly-stock Debian Stretch.
It also seems likely to me that this is probably a pretty simple dlm misconfiguration, but I can't seem to nail it down.
Additional clues: When I try to "join" a lockspace via `dlm_tool join`, I get a dmesg output:
dlm cluster name 'rpitest' is being used without an application provided cluster name
This happens independently of whether the lockspace I am joining is "rpitest" or not. This suggests that lockspace names and cluster names are indeed the same thing, and/but that the dlm is evidently not aware of the corosync config?
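A hedged check that is quick to rule in or out on a Raspberry Pi kernel: "can't find protocol lock_dlm" comes from the kernel side, and on a stock Debian/Raspbian kernel the GFS2/DLM locking support may simply not be built, in which case no amount of corosync or dlm_controld configuration will help:

    modprobe gfs2 && modprobe dlm && echo "modules loaded"
    find /lib/modules/$(uname -r) -name 'gfs2.ko*' -o -name 'dlm.ko*'   # are the modules shipped at all?
    grep -E 'CONFIG_GFS2_FS_LOCKING_DLM|CONFIG_DLM=' /boot/config-$(uname -r) 2>/dev/null   # if a config file is installed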
Andrew Reid
(53 rep)
Apr 22, 2018, 04:44 PM
• Last activity: Apr 24, 2018, 06:09 AM
Showing page 1 of 20 total questions