
Unix & Linux Stack Exchange

Q&A for users of Linux, FreeBSD and other Unix-like operating systems

Latest Questions

0 votes
1 answer
2303 views
Secondary DRBD node does not auto-start in Pacemaker+Corosync setup
I am trying to set up a 2-PC cluster with shared resources: ClusterIP, ClusterSamba, ClusterNFS, DRBD (cloned resource), and a DRBDFS. The beginning of the project followed the [Clusters from Scratch](https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Clusters_from_Scratch/index.html) guide. When everything in this guide is done, it works without problems.

So, I wanted to use parts of that guide and build my own setup: I created one shared IP (ClusterIP) that is automatically assigned to one node, and (here is where it gets tricky) on that node, I mount my /dev/drbd1 device to /exports and then share this mount through **SAMBA** and **NFS**.

When I start the cluster, all resources come up as they should, _but DRBD does not go up on the secondary node_ (Primary/Unknown). If I bring it up manually, it syncs and works. Also, when I stop the cluster (or forcibly reboot the first node), all resources transfer to the other node and everything works, _except DRBD on the other node goes into an Unknown state_.

### So now, here is the problem:

**Why does DRBD go down on the secondary node when I stop the cluster? Or why doesn't it start in the Secondary role on the secondary node?**

Sorry if my description is bad.

---

## Here are the commands I used
# apt install -y pacemaker pcs psmisc policycoreutils-python-utils drbd-utils samba nfs-kernel-server 
# systemctl start pcsd.service
# systemctl enable pcsd.service
# passwd hacluster
# pcs host auth alice bob
# pcs cluster setup myCluster alice bob --force
# pcs cluster start --all
# pcs property set stonith-enabled=false
# pcs property set no-quorum-policy=ignore
# modprobe drbd
# echo drbd >/etc/modules-load.d/drbd.conf
# drbdadm create-md r0
# drbdadm up r0
# drbdadm primary r0 --force
# mkfs.ext4 /dev/drbd1
# systemctl disable smbd
# systemctl disable nfs-kernel-server.service 
# mkdir /exports
# vi /etc/samba/smb.conf 
# vi /etc/exports 
# pcs resource create ClusterIP ocf:heartbeat:IPaddr2 ip=10.1.1.30 cidr_netmask=24 op monitor interval=30s
# pcs resource defaults resource-stickiness=100
# pcs resource op defaults timeout=240s
# pcs resource create ClusterSamba lsb:smbd op monitor interval=60s
# pcs resource create ClusterNFS ocf:heartbeat:nfsserver op monitor interval=60s
# pcs resource create DRBD ocf:linbit:drbd drbd_resource=r0 op monitor interval=60s
# pcs resource promotable DRBD promoted-max=1 promoted-node-max=1 clone-max=2 clone-node-max=1 notify=true
# pcs resource create DRBDFS Filesystem device="/dev/drbd1" directory="/exports" fstype="ext4"
# pcs constraint order ClusterIP then ClusterNFS
# pcs constraint order ClusterNFS then ClusterSamba
# pcs constraint order promote DRBD-clone then start DRBDFS
# pcs constraint order DRBDFS then ClusterNFS
# pcs constraint order ClusterIP then DRBD-clone
# pcs constraint colocation ClusterSamba with ClusterIP
# pcs constraint colocation add ClusterSamba with ClusterIP
# pcs constraint colocation add ClusterNFS with ClusterIP
# pcs constraint colocation add DRBDFS with DRBD-clone INFINITY with-rsc-role=Master
# pcs constraint colocation add DRBD-clone with ClusterIP
# pcs cluster stop --all && sleep 2 && pcs cluster start --all
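For comparison, a colocation can also bind just the floating IP to whichever node holds the promoted DRBD role, using the same with-rsc-role form as the DRBDFS constraint above; a hedged sketch, not a verified fix for this configuration:

# colocate the IP with the DRBD Master role instead of pinning the whole clone to the IP's node
# pcs constraint colocation add ClusterIP with DRBD-clone INFINITY with-rsc-role=Master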
---
## Configs and stats
### /etc/drbd.d/r0.res
resource r0 {
 device /dev/drbd1;
 disk /dev/sdb;
 meta-disk internal;
 net {
  allow-two-primaries;
 }
 on alice {
  address 10.1.1.31:7788;
 }
 on bob {
  address 10.1.1.32:7788;
 } 
}
---
### /etc/corosync/corosync.conf
totem {
    version: 2
    cluster_name: myCluster
    transport: knet
    crypto_cipher: aes256
    crypto_hash: sha256
}

nodelist {
    node {
        ring0_addr: alice
        name: alice
        nodeid: 1
    }

    node {
        ring0_addr: bob
        name: bob
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}

logging {
    to_logfile: yes
    logfile: /var/log/corosync/corosync.log
    to_syslog: yes
    timestamp: on
}
---
### pcs status
Cluster name: myCluster
Stack: corosync
Current DC: alice (version 2.0.1-9e909a5bdd) - partition with quorum
Last updated: Fri May 15 12:28:30 2020
Last change: Fri May 15 11:04:50 2020 by root via cibadmin on bob

2 nodes configured
6 resources configured

Online: [ alice bob ]

Full list of resources:

 ClusterIP      (ocf::heartbeat:IPaddr2):       Started alice
 ClusterSamba   (lsb:smbd):     Started alice
 ClusterNFS     (ocf::heartbeat:nfsserver):     Started alice
 Clone Set: DRBD-clone [DRBD] (promotable)
     Masters: [ alice ]
     Stopped: [ bob ]
 DRBDFS (ocf::heartbeat:Filesystem):    Started alice

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
---
### pcs constraint --full
Location Constraints:

Ordering Constraints:
  start ClusterIP then start ClusterNFS (kind:Mandatory) (id:order-ClusterIP-ClusterNFS-mandatory)
  start ClusterNFS then start ClusterSamba (kind:Mandatory) (id:order-ClusterNFS-ClusterSamba-mandatory)
  promote DRBD-clone then start DRBDFS (kind:Mandatory) (id:order-DRBD-clone-DRBDFS-mandatory)
  start DRBDFS then start ClusterNFS (kind:Mandatory) (id:order-DRBDFS-ClusterNFS-mandatory)
  start ClusterIP then start DRBD-clone (kind:Mandatory) (id:order-ClusterIP-DRBD-clone-mandatory)
  start ClusterIP then promote DRBD-clone (kind:Mandatory) (id:order-ClusterIP-DRBD-clone-mandatory-1)

Colocation Constraints:
  ClusterSamba with ClusterIP (score:INFINITY) (id:colocation-ClusterSamba-ClusterIP-INFINITY)
  ClusterNFS with ClusterIP (score:INFINITY) (id:colocation-ClusterNFS-ClusterIP-INFINITY)
  DRBDFS with DRBD-clone (score:INFINITY) (with-rsc-role:Master) (id:colocation-DRBDFS-DRBD-clone-INFINITY)
  DRBD-clone with ClusterIP (score:INFINITY) (id:colocation-DRBD-clone-ClusterIP-INFINITY)

Ticket Constraints:
---
### /proc/drbd
version: 8.4.10 (api:1/proto:86-101)
srcversion: 983FCB77F30137D4E127B83 

 1: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----
    ns:0 nr:4 dw:8 dr:17 al:1 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:4
Miki (31 rep)
May 15, 2020, 11:12 AM • Last activity: Jun 19, 2025, 10:03 PM
2 votes
1 answer
2579 views
After failover Pacemaker moves resource back when node comes back
I'm using Pacemaker & Corosync for my cluster. When a node dies, Pacemaker moves my resources to another online node. Everything is fine so far. But when the dead node comes back, Pacemaker moves the resources back to it. I don't have any "location" line in my config, and I also tried the "unmove" command, but nothing changed. I must have missed something and need to find the reason.

**crm configure sh**

node 1: DEV1
node 2: DEV2
primitive poolip IPaddr2 \
    params ip=10.1.60.33 nic=enp2s0f0 cidr_netmask=24 \
    meta migration-threshold=2 target-role=Started \
    op monitor interval=20 timeout=20 on-fail=restart
primitive gui systemd:gui \
    op monitor interval=20s \
    meta target-role=Started
primitive gui-ip IPaddr2 \
    params ip=10.1.60.35 nic=enp2s0f0 cidr_netmask=24 \
    meta migration-threshold=2 target-role=Started \
    op monitor interval=20 timeout=20 on-fail=restart
colocation cluster-gui inf: gui gui-ip
order gui-after-ip Mandatory: gui-ip gui
property cib-bootstrap-options: \
    have-watchdog=false \
    dc-version=2.0.0-1-8cf3fe749e \
    cluster-infrastructure=corosync \
    cluster-name=mycluster \
    stonith-enabled=false \
    no-quorum-policy=ignore \
    last-lrm-refresh=1545920437
rsc_defaults rsc-options: \
    migration-threshold=10 \
    resource-stickiness=100

**pcs resource defaults**

migration-threshold=10
resource-stickiness=100

**pcs resource show gui**

Resource: gui (class=systemd type=gui)
 Meta Attrs: target-role=Started
 Operations: monitor interval=20s (gui-monitor-20s)
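For reference, one commonly adjusted knob for unwanted fail-back is the default stickiness; a minimal sketch of raising it with pcs (illustrative only; the config above already sets 100, and whether a higher value is the actual issue here is not confirmed):

pcs resource defaults resource-stickiness=INFINITY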
Ozbit (439 rep)
Jan 2, 2019, 08:58 AM • Last activity: Jun 14, 2025, 09:07 PM
1 vote
2 answers
8573 views
pcs stonith not working
I have two virtual CentOS 7 nodes; root can log in passwordless between them. I have configured STONITH as shown below, but the fencing resources are not coming up and fencing is not happening. I'm new to this; could someone help me rectify the issue?

[root@node1 cluster]# pcs stonith create nub1 fence_virt pcmk_host_list="node1"
[root@node1 cluster]# pcs stonith create nub2 fence_virt pcmk_host_list="node2"
[root@node1 cluster]# pcs stonith show
 nub1   (stonith:fence_virt):   Stopped
 nub2   (stonith:fence_virt):   Stopped
[root@node1 cluster]# pcs status
Cluster name: mycluster
Stack: corosync
Current DC: node2 (version 1.1.15-11.el7_3.5-e174ec8) - partition with quorum
Last updated: Tue Jul 25 07:03:37 2017
Last change: Tue Jul 25 07:02:00 2017 by root via cibadmin on node1

2 nodes and 3 resources configured

Online: [ node1 node2 ]

Full list of resources:

 ClusterIP      (ocf::heartbeat:IPaddr2):       Started node1
 nub1   (stonith:fence_virt):   Stopped
 nub2   (stonith:fence_virt):   Stopped

Failed Actions:
* nub1_start_0 on node1 'unknown error' (1): call=56, status=Error, exitreason='none', last-rc-change='Tue Jul 25 07:01:34 2017', queued=0ms, exec=7006ms
* nub2_start_0 on node1 'unknown error' (1): call=58, status=Error, exitreason='none', last-rc-change='Tue Jul 25 07:01:42 2017', queued=0ms, exec=7009ms
* nub1_start_0 on node2 'unknown error' (1): call=54, status=Error, exitreason='none', last-rc-change='Tue Jul 25 07:01:26 2017', queued=0ms, exec=7010ms
* nub2_start_0 on node2 'unknown error' (1): call=60, status=Error, exitreason='none', last-rc-change='Tue Jul 25 07:01:34 2017', queued=0ms, exec=7013ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@node1 cluster]# pcs stonith fence node2
Error: unable to fence 'node2'
Command failed: No route to host
[root@node1 cluster]# pcs stonith fence nub2
Error: unable to fence 'nub2'
Command failed: No such device
[root@node1 cluster]# ping node2
PING node2 (192.168.100.102) 56(84) bytes of data.
64 bytes from node2 (192.168.100.102): icmp_seq=1 ttl=64 time=0.247 ms
64 bytes from node2 (192.168.100.102): icmp_seq=2 ttl=64 time=0.304 ms
^C
--- node2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.247/0.275/0.304/0.032 ms
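For context (a sketch, with no claim that it matches this hypervisor setup): fence_virt/fence_xvm normally talks to a fence_virtd daemon on the virtualization host over multicast with a shared key, so a quick reachability test from a cluster node is:

fence_xvm -o list    # should print the guest names known to fence_virtd on the host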
Mohammed Ali (691 rep)
Jul 25, 2017, 11:10 AM • Last activity: Feb 10, 2024, 02:01 AM
0 votes
1 answer
11525 views
DRBD - 'node1' not defined in your config (for this host) - Error when setting Primary
I am getting the following error when trying to set the primary node for DRBD:

'node1' not defined in your config (for this host).

I know this is related to DNS/hostname/hosts and the clusterdb.res config. I know this because I originally got an error when trying to start clusterdb.res if node1 didn't resolve correctly. So what confuses me is that I can start clusterdb.res if I either use this command on the hosts:

hostnamectl set-hostname $(uname -n | sed s/\\..*//)

to make the hostname resolve to node1 instead of node1.localdomain, or add node1.localdomain to the config; either works. But I have tried all combinations and can't seem to get this command to take:

drbdadm primary --force node1 && cat /proc/drbd

**My Configs**

/etc/drbd.d/clusterdb.res

resource clusterdb{
 protocol C;
 meta-disk internal;
 device /dev/drbd0;
 startup {
  wfc-timeout 30;
  outdated-wfc-timeout 20;
  degr-wfc-timeout 30;
 }
 net {
  cram-hmac-alg sha1;
  shared-secret sync_disk;
 }
 syncer {
  rate 10M;
  al-extents 257;
  on-no-data-accessible io-error;
  verify-alg sha1;
 }
 on node1 {
  disk /dev/sda3;
  address 192.168.1.216:7788;
 }
 on node2 {
  disk /dev/sda3;
  address 192.168.1.217:7788;
 }
}

/etc/hosts:

127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.1.216 node1
192.168.1.217 node2

/etc/hostname

node#

My full write-up ATM (wip)

**Edits:**

[root@node1 ~]# hostname
node1
[root@node1 ~]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
127.0.1.1 node1
192.168.1.216 node1
192.168.1.217 node2
[root@node1 ~]#

Update: I have gotten this to work with LVM by following this guide exactly, so I think my issue actually lies with the following lines of code. But for now I will stick with LVM since it works, unless somebody else really wants to work on this. (My working LVM writeup)

device /dev/drbd0;

or

device /dev/drbd0;

The reason I say this is that I used the same hosts/hostname/shortname/ip_addr with LVM and it worked, but maybe I missed something the first time that I fixed in my new VM template (I started from scratch to build the LVM setup).
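One hedged note on syntax (not presented as the confirmed fix): drbdadm primary expects the DRBD resource name from the .res file rather than a hostname, so with the config above the invocation would be:

drbdadm primary --force clusterdb && cat /proc/drbd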
FreeSoftwareServers (2682 rep)
May 1, 2016, 01:59 AM • Last activity: Mar 8, 2023, 02:50 AM
0 votes
1 answer
195 views
How to increase the number of nfsd threads in a pcs/corosync environment?
I have an NFS server bound with *pcs/corosync* to provide stability and HA. The default number of nfsd threads is 8, which I find too low, as I can observe that all 8 are about 90% busy all the time. The nfs/corosync/pcs system consists of 3 servers and a distributed storage. How can I increase the number of threads: should I modify the */etc/sysconfig/nfs* file on all NFS nodes, or should I make changes somewhere else? Sorry for the newbie question, but I have nobody to ask. I appreciate any help, thank you. OS: CentOS 7.
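A minimal sketch of the sysconfig route on CentOS 7 (assuming the clustered nfsserver resource starts nfsd with the distro defaults; 16 is only an example value, and the file has to be changed on every node that can host the NFS resource before the resource is restarted):

# /etc/sysconfig/nfs
RPCNFSDCOUNT=16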
Mirimat (33 rep)
Sep 13, 2022, 09:16 AM • Last activity: Oct 4, 2022, 01:28 PM
0 votes
1 answer
200 views
Convert puppet manifest config to hiera
I installed a corosync-pacemaker cluster via Puppet. Now I would like to keep my data in a Hiera file. How should I convert the cs_primitive section into YAML?

cs_primitive { 'nfsshare_fs':
  primitive_class => 'ocf',
  primitive_type  => 'Filesystem',
  provided_by     => 'heartbeat',
  parameters      => { 'device' => '/dev/disk/lvname', 'directory' => '/share', 'fstype' => 'ext4' },
}->

I tried the code below, but it didn't work.

corosync::cs_primitive:
  'nfsshare_fs':
    primitive_class: 'ocf'
    primitive_type: 'Filesystem'
    provided_by: 'heartbeat'
    parameters:
      device: '/dev/disk/by-id/lvname'
      directory: '/share'
      fstype: 'ext4'

Thanks.
fortunate1357 (1 rep)
Apr 4, 2022, 06:21 PM • Last activity: Jul 14, 2022, 07:27 PM
0 votes
1 answer
26 views
VM managed by corosync not detecting new CPUs
I have an HA cluster managed by corosync, and I need to increase the CPU allocation to one of the VMs. I have done the following:

* `pcs resource disable myVM`
* Wait for the VM to stop
* Edit the XML file (confirmed the correct file by `pcs sources show --full`) - within the `cpu` section I changed the entry `` to change the number of cores to 8.
* Make sure that the XML file is synced across all physical hosts
* `pcs resource enable myVM`

But when the VM comes back up, /proc/cpuinfo shows that it still has only 4 cores (I don't have hot-plug CPUs enabled / am not sure how to enable this). There are plenty of CPU cores available on the physical hosts. Can anyone tell me what I'm doing wrong that's preventing the VM from starting up with 8 cores instead of 4? It must be something obvious, but I can't see it!
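As a sanity check (a sketch, not a diagnosis; myVM stands for the domain name used above), the vcpu count libvirt actually booted with can be compared against the edited file:

virsh dumpxml myVM | grep -i vcpu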
Phil Evans (101 rep)
Jun 9, 2022, 10:57 AM • Last activity: Jun 17, 2022, 07:55 AM
0 votes
0 answers
459 views
HA-Cluster / corosync / pacemaker: Active-Active cluster with service ip / service ip is not switching
How do I configure crm to migrate the ServiceIP if one of the services fails?

node 1: web01a \
    attributes standby=off
node 2: web01b \
    attributes standby=off
primitive Apache2 systemd:apache2 \
    operations $id=Apache2-operations \
    op start interval=0 timeout=100 \
    op stop interval=0 timeout=100 \
    op monitor interval=15 timeout=100 start-delay=15 \
    meta
primitive PHP-FPM systemd:php7.4-fpm \
    operations $id=PHP-FPM-operations \
    op start interval=0 timeout=100 \
    op stop interval=0 timeout=100 \
    op monitor interval=15 timeout=100 start-delay=15 \
    meta
primitive Redis systemd:redis-server \
    operations $id=Redis-operations \
    op start interval=0 timeout=100 \
    op stop interval=0 timeout=100 \
    op monitor interval=15 timeout=100 start-delay=15 \
    meta
primitive ServiceIP IPaddr2 \
    params ip=1.2.3.4 \
    operations $id=ServiceIP-operations \
    op monitor interval=10 timeout=20 start-delay=0 \
    op_params migration-threshold=1 \
    meta
primitive lsyncd systemd:lsyncd \
    op start interval=0 timeout=100 \
    op stop interval=0 timeout=100 \
    op monitor interval=15 timeout=100 start-delay=15 \
    meta target-role=Started
group ActiveNode ServiceIP lsyncd
group WebServer Apache2 PHP-FPM Redis
clone cl_WS WebServer \
    meta clone-max=2 notify=true interleave=true
colocation col_cl_WS_ActiveNode 100: cl_WS ActiveNode
property cib-bootstrap-options: \
    have-watchdog=false \
    dc-version=2.0.3-4b1f869f0f \
    cluster-infrastructure=corosync \
    cluster-name=debian \
    stonith-enabled=false \
    no-quorum-policy=ignore \
    startup-fencing=false \
    maintenance-mode=false \
    last-lrm-refresh=1622628525 \
    start-failure-is-fatal=true

These services should always be started:

- Apache2
- PHP-FPM
- Redis

If one of these services is not running, the node is unhealthy. The **ServiceIP** and **lsyncd** should switch to a healthy node. When I kill the apache2 process, the IP does not switch.
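A hedged sketch of the colocation direction that is usually intended in this pattern: make the ActiveNode group (ServiceIP plus lsyncd) follow a node where the cloned WebServer group is running, with a mandatory score. This assumes crmsh syntax and is not a verified fix for this exact configuration:

crm configure colocation col_ActiveNode_with_WS inf: ActiveNode cl_WS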
FaxMax (726 rep)
Jun 2, 2021, 12:29 PM
2 votes
1 answer
5355 views
Pacemaker - Corosync - HA - Simple Custom Resource Testing - Status flapping - Started - Failed - Stopped - Started
I am testing with the OCF:Heartbeat:Dummy script, and I want to make a very basic setup just to know it works and build on that. The only information I could find was this blog post: https://raymii.org/s/tutorials/Corosync_Pacemaker_-_Execute_a_script_on_failover.html It has some typos but basically worked for me. The script currently just contains the following:

sudo nano /usr/local/bin/failover.sh && sudo chmod +x /usr/local/bin/failover.sh

#!/bin/sh
touch /tmp/testfailover.sh

Here is my setup:

cp /usr/lib/ocf/resource.d/heartbeat/Dummy /usr/lib/ocf/resource.d/heartbeat/FailOverScript
sudo nano /usr/lib/ocf/resource.d/heartbeat/FailOverScript

dummy_start() {
    dummy_monitor
    /usr/local/bin/failover.sh
    if [ $? = $OCF_SUCCESS ]; then
        return $OCF_SUCCESS
    fi
    touch ${OCF_RESKEY_state}
}

sed -i 's/Dummy/FailOverScript/g' /usr/lib/ocf/resource.d/heartbeat/FailOverScript
sed -i 's/dummy/FailOverScript/g' /usr/lib/ocf/resource.d/heartbeat/FailOverScript
pcs resource create FailOverScript ocf:heartbeat:FailOverScript op monitor interval="30"

The only testing I can really do:

[root@node2 ~]# /usr/lib/ocf/resource.d/heartbeat/FailOverScript start ; echo $?
DEBUG: default start : 0
0

ocf-tester doesn't seem to exist in the latest HA software suite, and I'm not really sure how to install it manually, but the script "half works".

**The script doesn't need monitoring, it's supposed to be very basic, but it seems to be flapping and giving me the following error. Any ideas what to do?**

FailOverScript  (ocf::heartbeat:FailOverScript):        Started node2

Failed Actions:
* FailOverScript_monitor_30000 on node2 'not running' (7): call=24423, status=complete, exitreason='none', last-rc-change='Tue Aug 16 15:53:50 2016', queued=0ms, exec=9ms

**Example of what I want to do:**

Cluster start
Script runs "start.sh"
Cluster fails over to node2.
On node1 script runs "fail.sh"
On node2 script runs "start.sh"

and vice versa if it fails in the other direction. Note: the script does work; I get /tmp/testfailover.sh. I even tried putting another script under dummy_stop to remove the file, and that worked, but it just keeps flapping along, removing/adding the file and starting/failing/stopping/starting, etc. Thanks for reading!
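For contrast, a minimal sketch of start/stop handlers that keep the monitor happy (state file created on start, removed on stop) while still running hook scripts; start.sh and fail.sh are the hypothetical hooks from the example above, and this is not the stock Dummy agent code:

FailOverScript_start() {
    FailOverScript_monitor
    if [ $? = $OCF_SUCCESS ]; then
        return $OCF_SUCCESS
    fi
    touch "${OCF_RESKEY_state}"    # create the state file first so the next monitor reports "running"
    /usr/local/bin/start.sh        # hypothetical hook
    return $OCF_SUCCESS
}

FailOverScript_stop() {
    rm -f "${OCF_RESKEY_state}"    # monitor now reports "not running", which is expected when stopped
    /usr/local/bin/fail.sh         # hypothetical hook
    return $OCF_SUCCESS
}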
FreeSoftwareServers (2682 rep)
Aug 16, 2016, 07:56 PM • Last activity: Dec 21, 2020, 06:56 AM
1 vote
0 answers
603 views
Cannot seem to start pcs cluster (NFS Cluster) disk_fencing trouble
For the life of me, I can't find a clear answer on how to start my NFS active/passive cluster. I have two nodes, node1 and node2, and followed the guide here: https://www.linuxtechi.com/configure-nfs-server-clustering-pacemaker-centos-7-rhel-7/

Here are my logs:

May 25 10:35:59 node1 stonith-ng: notice: Couldn't find anyone to fence (on) node1 with any device
May 25 10:35:59 node1 stonith-ng: error: Operation on of node1 by for crmd.3928@node1.97f683f8: No route to host
May 25 10:35:59 node1 crmd: notice: Stonith operation 142/2:72:0:f3e078bf-24f5-4160-95c1-0eeeea0e5e12: No route to host (-113)
May 25 10:35:59 node1 crmd: notice: Stonith operation 142 for node1 failed (No route to host): aborting transition.
May 25 10:35:59 node1 crmd: warning: Too many failures (71) to fence node1, giving up
May 25 10:35:59 node1 crmd: notice: Transition aborted: Stonith failed
May 25 10:35:59 node1 crmd: error: Unfencing of node1 by failed: No route to host (-113)
May 25 10:35:59 node1 stonith-ng: notice: Couldn't find anyone to fence (on) node2 with any device
May 25 10:35:59 node1 stonith-ng: error: Operation on of node2 by for crmd.3928@node1.2680795a: No route to host
May 25 10:35:59 node1 crmd: notice: Stonith operation 143/1:72:0:f3e078bf-24f5-4160-95c1-0eeeea0e5e12: No route to host (-113)
May 25 10:35:59 node1 crmd: notice: Stonith operation 143 for node2 failed (No route to host): aborting transition.
May 25 10:35:59 node1 crmd: warning: Too many failures (71) to fence node2, giving up
May 25 10:35:59 node1 crmd: error: Unfencing of node2 by failed: No route to host (-113)

Here is the status:

[root@node1 ~]# pcs status
Cluster name: nfs_cluster
Stack: corosync
Current DC: node1 (version 1.1.20-5.amzn2.0.2-3c4c782f70) - partition with quorum
Last updated: Mon May 25 10:45:56 2020
Last change: Sun May 24 21:04:55 2020 by root via cibadmin on node1

2 nodes configured
5 resources configured

Online: [ node1 node2 ]

Full list of resources:

 disk_fencing   (stonith:fence_scsi):   Stopped
 Resource Group: nfsgrp
     nfsshare   (ocf::heartbeat:Filesystem):    Stopped
     nfsd       (ocf::heartbeat:nfsserver):     Stopped
     nfsroot    (ocf::heartbeat:exportfs):      Stopped
     nfsip      (ocf::heartbeat:IPaddr2):       Stopped

Failed Fencing Actions:
* unfencing of node2 failed: delegate=, client=crmd.3928, origin=node1, last-failed='Mon May 25 10:35:59 2020'
* unfencing of node1 failed: delegate=, client=crmd.3928, origin=node1, last-failed='Mon May 25 10:35:59 2020'

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
[root@node1 ~]#

disk_fencing is set to fence_scsi, but I'm not sure that is the best option for two AWS EC2 instances. Perhaps I can't get disk_fencing to work, so nothing else can start? I can ping node1 from node2 and vice versa. Open to ideas...
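For reference, a sketch of how fence_scsi is typically declared when it applies (it requires shared storage with SCSI-3 persistent reservations; the devices= path below is a placeholder, not taken from this setup, and on EC2 an agent such as fence_aws is the more usual choice):

pcs stonith create disk_fencing fence_scsi \
    pcmk_host_list="node1 node2" \
    devices="/dev/disk/by-id/REPLACE_WITH_SHARED_DISK" \
    meta provides=unfencing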
jasontt33 (11 rep)
May 25, 2020, 10:49 AM
0 votes
0 answers
685 views
Unmount data volume hosting NFS exports
So I've got a customer with two NFS servers set up in a Pacemaker cluster in an active/standby configuration. It is a legacy system running RHEL 6. On the servers there is /mnt/data1, which is an xfs mountpoint on a drbd-mirrored disk. The mount is active on one node at a time and controlled by Pacemaker (so is drbd, for that matter).

My problem is in critical cases where I need to move active services to the other server without shutting down the NFS clients first. I can shut down the NFS services, but no matter what I try, I can't unmount the /mnt/data1 filesystem as it reports as 'busy'. I tried changing the daemon stop sequence on the nodes. Right now I have the following sequence:

- rpc.mountd
- nfsd
- exportfs -au
- rpc.statd

Both 'lsof /mnt/data1' and 'fuser -mv /mnt/data1' do not report any open files on the mountpoint, and I can verify no terminal sessions are there either. Short of having to shut down the box (which kills any debugging I would like to do), I can't get the filesystem unmounted to allow Pacemaker to cleanly move the filesystem mount to the other node. I assume that there are some hanging file locks, but I'm not sure how else to kill them. Any ideas are appreciated.
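A couple of generic checks that sometimes apply here (a sketch, not specific to this cluster): the in-kernel NFS server can hold a filesystem busy even when lsof/fuser show nothing, so fully withdrawing exports and stopping nfsd before the unmount, with a lazy unmount only as a last resort, looks like:

exportfs -ua                                 # withdraw all exports
service nfs stop                             # RHEL 6: stop the kernel nfsd threads
umount /mnt/data1 || umount -l /mnt/data1    # lazy unmount as a last resort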
pbrunnen (113 rep)
Nov 10, 2019, 05:36 PM • Last activity: Nov 10, 2019, 05:41 PM
1 vote
1 answer
2580 views
Pacemaker: Primary node is rebooted and comes back as primary instead of standby
We are using Pacemaker and Corosync to automate failovers. We noticed one behaviour: when the primary node is rebooted, the standby node takes over as primary, which is fine. But when the node comes back online and services are started on it, it takes back the role of primary. It should ideally come back as standby. Are we missing any configuration?

> pcs resource defaults
resource-stickiness: INFINITY
migration-threshold: 0

Stickiness is set to INFINITY. Please suggest.

Adding config details:
[root@Node1 heartbeat]# pcs config show -l
Cluster Name: cluster1
Corosync Nodes:
 Node1 Node2
Pacemaker Nodes:
 Node1 Node2

Resources:
 Master: msPostgresql
  Meta Attrs: master-node-max=1 clone-max=2 notify=true master-max=1 clone-node-max=1
  Resource: pgsql (class=ocf provider=heartbeat type=pgsql)
   Attributes: master_ip=10.70.10.1 node_list="Node1 Node2" pgctl=/usr/pgsql-9.6/bin/pg_ctl pgdata=/var/lib/pgsql/9.6/data/ primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" psql=/usr/pgsql-9.6/bin/psql rep_mode=async restart_on_promote=true restore_command="cp /var/lib/pgsql/9.6/data/archivedir/%f %p"
   Meta Attrs: failure-timeout=60
   Operations: demote interval=0s on-fail=stop timeout=60s (pgsql-demote-interval-0s)
               methods interval=0s timeout=5s (pgsql-methods-interval-0s)
               monitor interval=4s on-fail=restart timeout=60s (pgsql-monitor-interval-4s)
               monitor interval=3s on-fail=restart role=Master timeout=60s (pgsql-monitor-interval-3s)
               notify interval=0s timeout=60s (pgsql-notify-interval-0s)
               promote interval=0s on-fail=restart timeout=60s (pgsql-promote-interval-0s)
               start interval=0s on-fail=restart timeout=60s (pgsql-start-interval-0s)
               stop interval=0s on-fail=block timeout=60s (pgsql-stop-interval-0s)
 Group: master-group
  Resource: vip-master (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: cidr_netmask=24 ip=10.70.10.2
   Operations: monitor interval=10s on-fail=restart timeout=60s (vip-master-monitor-interval-10s)
               start interval=0s on-fail=restart timeout=60s (vip-master-start-interval-0s)
               stop interval=0s on-fail=block timeout=60s (vip-master-stop-interval-0s)
  Resource: vip-rep (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: cidr_netmask=24 ip=10.70.10.1
   Meta Attrs: migration-threshold=0
   Operations: monitor interval=10s on-fail=restart timeout=60s (vip-rep-monitor-interval-10s)
               start interval=0s on-fail=stop timeout=60s (vip-rep-start-interval-0s)
               stop interval=0s on-fail=ignore timeout=60s (vip-rep-stop-interval-0s)

Stonith Devices:
Fencing Levels:

Location Constraints:
Ordering Constraints:
  promote msPostgresql then start master-group (score:INFINITY) (non-symmetrical)
  demote msPostgresql then stop master-group (score:0) (non-symmetrical)
Colocation Constraints:
  master-group with msPostgresql (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master)
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 resource-stickiness: INFINITY
 migration-threshold: 0
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: cluster1
 cluster-recheck-interval: 60
 dc-version: 1.1.19-8.el7-c3c624ea3d
 have-watchdog: false
 no-quorum-policy: ignore
 start-failure-is-fatal: false
 stonith-enabled: false
Node Attributes:
 Node1: pgsql-data-status=STREAMING|ASYNC
 Node2: pgsql-data-status=LATEST

Quorum:
  Options:

Thanks !
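One generic way to see why Pacemaker keeps choosing the returning node as master is to dump the allocation and promotion scores from the live CIB (purely diagnostic, a sketch):

crm_simulate -sL    # -s shows scores, -L uses the live CIB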
User2019 (11 rep)
Sep 12, 2019, 09:30 AM • Last activity: Sep 16, 2019, 06:18 PM
0 votes
1 answer
887 views
corosync 2Node vs two_node flags
Is the **2Node** flag equivalent to the **two_node** option? Are they the same thing?

[root@srv1 ~]# corosync-quorumtool -s
Quorum information
------------------
Date:             Wed Mar 20 04:49:10 2019
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          1
Ring ID:          1/464
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           1
Flags:            2Node Quorate WaitForAll

Membership information
----------------------
    Nodeid      Votes Name
         1          1 srv1cr1 (local)
         2          1 srv2cr1

http://people.redhat.com/ccaulfie/docs/Votequorum_Intro.pdf

I was not able to find an answer to my question, and the documentation references **two_node** only.
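For context, a minimal corosync.conf quorum stanza of the kind that produces the 2Node flag shown above (a sketch; the actual corosync.conf is not shown in the question):

quorum {
    provider: corosync_votequorum
    two_node: 1
}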
blabla_trace (385 rep)
Mar 20, 2019, 08:57 AM • Last activity: Aug 9, 2019, 02:42 PM
2 votes
1 answer
3241 views
PCS Stonith (fencing) will kill two node cluster if first is down
I have configured a two-node physical server cluster (HP ProLiant DL560 Gen8) using pcs (corosync/pacemaker/pcsd). I have also configured fencing on them using fence_ilo4. The weird thing is that if one node goes down (by "down" I mean powered off), the second node dies as well: fencing kills it, leaving both servers offline. How do I correct this behavior?

What I tried is adding "wait_for_all: 0" and "expected_votes: 1" in /etc/corosync/corosync.conf under the quorum section, but it still kills it. At some point, maintenance will have to be performed on one of those servers and it will have to be shut down; I don't want the other node to go down when that happens. Here are some outputs:

[root@kvm_aquila-02 ~]# pcs quorum status
Quorum information
------------------
Date:             Fri Jun 28 09:07:18 2019
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          2
Ring ID:          1/284
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           1
Flags:            2Node Quorate

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
         1          1         NR kvm_aquila-01
         2          1         NR kvm_aquila-02 (local)

[root@kvm_aquila-02 ~]# pcs config show
Cluster Name: kvm_aquila
Corosync Nodes:
 kvm_aquila-01 kvm_aquila-02
Pacemaker Nodes:
 kvm_aquila-01 kvm_aquila-02

Resources:
 Clone: dlm-clone
  Meta Attrs: interleave=true ordered=true
  Resource: dlm (class=ocf provider=pacemaker type=controld)
   Operations: monitor interval=30s on-fail=fence (dlm-monitor-interval-30s)
               start interval=0s timeout=90 (dlm-start-interval-0s)
               stop interval=0s timeout=100 (dlm-stop-interval-0s)
 Clone: clvmd-clone
  Meta Attrs: interleave=true ordered=true
  Resource: clvmd (class=ocf provider=heartbeat type=clvm)
   Operations: monitor interval=30s on-fail=fence (clvmd-monitor-interval-30s)
               start interval=0s timeout=90s (clvmd-start-interval-0s)
               stop interval=0s timeout=90s (clvmd-stop-interval-0s)
 Group: test_VPS
  Resource: test (class=ocf provider=heartbeat type=VirtualDomain)
   Attributes: config=/shared/xml/test.xml hypervisor=qemu:///system migration_transport=ssh
   Meta Attrs: allow-migrate=true is-managed=true priority=100 target-role=Started
   Utilization: cpu=4 hv_memory=4096
   Operations: migrate_from interval=0 timeout=120s (test-migrate_from-interval-0)
               migrate_to interval=0 timeout=120 (test-migrate_to-interval-0)
               monitor interval=10 timeout=30 (test-monitor-interval-10)
               start interval=0s timeout=300s (test-start-interval-0s)
               stop interval=0s timeout=300s (test-stop-interval-0s)

Stonith Devices:
 Resource: kvm_aquila-01 (class=stonith type=fence_ilo4)
  Attributes: ipaddr=10.0.4.39 login=fencing passwd=0ToleranciJa pcmk_host_list="kvm_aquila-01 kvm_aquila-02"
  Operations: monitor interval=60s (kvm_aquila-01-monitor-interval-60s)
 Resource: kvm_aquila-02 (class=stonith type=fence_ilo4)
  Attributes: ipaddr=10.0.4.49 login=fencing passwd=0ToleranciJa pcmk_host_list="kvm_aquila-01 kvm_aquila-02"
  Operations: monitor interval=60s (kvm_aquila-02-monitor-interval-60s)
Fencing Levels:

Location Constraints:
Ordering Constraints:
  start dlm-clone then start clvmd-clone (kind:Mandatory)
Colocation Constraints:
  clvmd-clone with dlm-clone (score:INFINITY)
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 No defaults set
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: kvm_aquila
 dc-version: 1.1.19-8.el7_6.4-c3c624ea3d
 have-watchdog: false
 last-lrm-refresh: 1561619537
 no-quorum-policy: ignore
 stonith-enabled: true

Quorum:
  Options:
    wait_for_all: 0

[root@kvm_aquila-02 ~]# pcs cluster status
Cluster Status:
 Stack: corosync
 Current DC: kvm_aquila-02 (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with quorum
 Last updated: Fri Jun 28 09:14:11 2019
 Last change: Thu Jun 27 16:23:44 2019 by root via cibadmin on kvm_aquila-01
 2 nodes configured
 7 resources configured

PCSD Status:
  kvm_aquila-02: Online
  kvm_aquila-01: Online

[root@kvm_aquila-02 ~]# pcs status
Cluster name: kvm_aquila
Stack: corosync
Current DC: kvm_aquila-02 (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with quorum
Last updated: Fri Jun 28 09:14:31 2019
Last change: Thu Jun 27 16:23:44 2019 by root via cibadmin on kvm_aquila-01

2 nodes configured
7 resources configured

Online: [ kvm_aquila-01 kvm_aquila-02 ]

Full list of resources:

 kvm_aquila-01  (stonith:fence_ilo4):   Started kvm_aquila-01
 kvm_aquila-02  (stonith:fence_ilo4):   Started kvm_aquila-02
 Clone Set: dlm-clone [dlm]
     Started: [ kvm_aquila-01 kvm_aquila-02 ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ kvm_aquila-01 kvm_aquila-02 ]
 Resource Group: test_VPS
     test       (ocf::heartbeat:VirtualDomain): Started kvm_aquila-01

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
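One pattern often used to keep the two nodes from fencing each other at the same moment (a sketch, not a verified fix for this setup) is to give one of the fence devices a static delay, so that one side always wins the fencing race:

pcs stonith update kvm_aquila-01 delay=10    # the delayed device pauses before firing, letting the other side win a tie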
Marko Todoric (437 rep)
Jun 28, 2019, 07:14 AM • Last activity: Jun 28, 2019, 02:38 PM
1 vote
2 answers
9752 views
Corosync error "No interfaces defined" in a cluster member
I am having an error starting corosync on a cluster member:

May 16 00:53:32 neftis corosync: [MAIN ] Corosync Cluster Engine ('2.3.4'): started and ready to provide service.
May 16 00:53:32 neftis corosync: [MAIN ] Corosync built-in features: dbus systemd xmlconf snmp pie relro bindnow
May 16 00:53:32 neftis corosync: [MAIN ] parse error in config: No interfaces defined
May 16 00:53:32 neftis corosync: [MAIN ] Corosync Cluster Engine exiting with status 8 at main.c:1278.
May 16 00:53:32 neftis corosync: Starting Corosync Cluster Engine (corosync): [FALL�]
May 16 00:53:32 neftis systemd: corosync.service: control process exited, code=exited status=1
May 16 00:53:32 neftis systemd: Failed to start Corosync Cluster Engine.
May 16 00:53:32 neftis systemd: Unit corosync.service entered failed state.
May 16 00:53:32 neftis systemd: corosync.service failed.
May 16 00:54:06 neftis systemd: Cannot add dependency job for unit firewalld.service, ignoring: Unit firewalld.service is masked.
May 16 00:54:06 neftis systemd: Starting Corosync Cluster Engine...
May 16 00:54:06 neftis corosync: [MAIN ] Corosync Cluster Engine ('2.3.4'): started and ready to provide service.
May 16 00:54:06 neftis corosync: [MAIN ] Corosync built-in features: dbus systemd xmlconf snmp pie relro bindnow
May 16 00:54:06 neftis corosync: [MAIN ] parse error in config: No interfaces defined
May 16 00:54:06 neftis corosync: [MAIN ] Corosync Cluster Engine exiting with status 8 at main.c:1278.
May 16 00:54:06 neftis corosync: Starting Corosync Cluster Engine (corosync): [FALL�]
May 16 00:54:06 neftis systemd: corosync.service: control process exited, code=exited status=1
May 16 00:54:06 neftis systemd: Failed to start Corosync Cluster Engine.
May 16 00:54:06 neftis systemd: Unit corosync.service entered failed state.

Here is my config, which is the same on all three nodes, but it fails only on the node I added recently:

totem {
    version: 2
    secauth: off
    cluster_name: cluster-osiris
    transport: udpu
}

nodelist {
    node {
        ring0_addr: isis.localdoamin
        nodeid: 1
    }
    node {
        ring0_addr: horus.localdoamin
        nodeid: 2
    }
    node {
        ring0_addr: netfis.localdoamin
        nodeid: 3
    }
}

quorum {
    provider: corosync_votequorum
}

logging {
    to_syslog: yes
}

I am running a pacemaker/corosync/pcs cluster on CentOS 7.1, 64-bit. I searched the internet but it is not clear what is going on. Could you help me?
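A generic check that is sometimes relevant with transport: udpu plus a nodelist (a sketch, not a confirmed diagnosis): each node must find exactly one ring0_addr that resolves to an address configured locally, so the resolved nodelist names can be compared against the local interfaces with:

for h in isis.localdoamin horus.localdoamin netfis.localdoamin; do getent hosts "$h"; done
ip -o addr show | awk '{print $2, $4}'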
mijhael3000 (85 rep)
May 16, 2016, 04:28 AM • Last activity: Jun 18, 2019, 08:42 AM
1 vote
0 answers
788 views
Adding a (cifs, samba) Filesystem resource to PCS
I am trying to create a _PCS_ _Filesystem_ resource on a Samba (cifs) share. Here is the resource I created (shown via pcs resource show):
root@shaunak-VirtualBox:~# pcs resource show SMBDiskResourceName
 Resource: SMBDiskResourceName (class=ocf provider=heartbeat type=Filesystem)
  Attributes: device=//192.168.1.6/my_data_share directory=/var/opt/my/data fstype=cifs options="vers=3.0,username=myuser,password=myuser,uid= 998,gid=998,file_mode=0777,dir_mode=0777"
Here is the error I am getting while starting the resource:
root@shaunak-VirtualBox:~# pcs resource debug-start SMBDiskResourceName
Error performing operation: Operation not permitted
Operation start for SMBDiskResourceName (ocf:heartbeat:Filesystem) returned 1
 >  stderr: INFO: Running start for //192.168.1.6/mssql_data_share on /var/opt/mssql/data
 >  stderr: 
 >  stderr: Usage:
 >  stderr:  mount [-lhV]
 >  stderr:  mount -a [options]
 >  stderr:  mount [options] [--source]  | [--target] 
 >  stderr:  mount [options]  
 >  stderr:  mount   []
 >  stderr: 
 >  stderr: Mount a filesystem.
 >  stderr: 
.....
 >  stderr: 
 >  stderr: For more details see mount(8).
 >  stderr: ocf-exit-reason:Couldn't mount filesystem //192.168.1.6/my_data_share on /var/opt/my/data
The error above says the mount is not happening on //192.168.1.6/my_data_share, but when I run the mount command manually, I am able to mount it. Not sure what I am missing here. Here is my mount command, which executed successfully:
mount -t  cifs //192.168.1.6/my_data_share /var/opt/my/data -o vers=3.0,username=myuser,password=myuser,uid=998,gid=998,file_mode=0777,dir_mode=0777
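As a generic aside (a sketch under assumptions, not a confirmed fix): ocf:heartbeat:Filesystem shells out to mount, so the cifs userspace helper has to be installed on every cluster node, and a credentials file avoids the quoting pitfalls of long inline username=/password= option strings. The package command assumes a Debian/Ubuntu-style node and the credentials path is hypothetical:

apt-get install -y cifs-utils    # provides /sbin/mount.cifs on every node
# /etc/samba/cluster-creds (chmod 600), containing:
#   username=myuser
#   password=myuser
# then use: options="vers=3.0,credentials=/etc/samba/cluster-creds,uid=998,gid=998,file_mode=0777,dir_mode=0777"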
Shaunak Patel (111 rep)
Mar 26, 2019, 05:24 PM • Last activity: Mar 26, 2019, 06:34 PM
3 votes
1 answer
15321 views
quorum in a two-node cluster with pacemaker
I have a two-node active-passive cluster. From the *Clusters from Scratch* guide:

> If a cluster splits into two (or more) groups of nodes that can no longer communicate with each other (aka. partitions), quorum is used to prevent resources from starting on more nodes than desired, which would risk data corruption. A cluster has quorum when more than half of all known nodes are online in the same partition.
>
> By the above definition, a two-node cluster would only have quorum when both nodes are running. This would make the creation of a two-node cluster pointless, but corosync has the ability to treat two-node clusters as if only one node is required for quorum. The pcs cluster setup command will automatically configure two_node: 1 in corosync.conf, so a two-node cluster will "just work".

Here's my config (screenshot omitted). So how can the cluster now decide which one has quorum?
blabla_trace (385 rep)
Feb 22, 2019, 09:09 PM • Last activity: Feb 22, 2019, 10:53 PM
1 vote
1 answer
306 views
When I start corosync, all servers panic with core dumps
I upgraded my servers, then started the corosync service one by one. I started it on 3 servers first and waited 5 minutes. Then I started the next 4 corosync instances on the other servers, and 7 servers crashed at the same time. I have been using corosync for 5 years. I was using:

Kernel: 4.14.32-1-lts
Corosync 2.4.2-1
Pacemaker 1.1.18-1

and I never saw this before. I guess something is badly broken in the new corosync version! After the upgrade I am on:

Kernel: 4.14.70-1-lts
Corosync 2.4.4-3
Pacemaker 2.0.0-1

- **This is my corosync.conf: https://paste.ubuntu.com/p/7KCq8pHKn3/**

**Can you tell me how I can find the reason for the problem?**

Sep 25 08:56:03 SRV-2 corosync: [TOTEM ] A new membership (10.10.112.10:56) was formed. Members joined: 7
Sep 25 08:56:03 SRV-2 corosync: [VOTEQ ] Waiting for all cluster members. Current votes: 7 expected_votes: 28
Sep 25 08:56:03 SRV-2 corosync: [VOTEQ ] Waiting for all cluster members. Current votes: 7 expected_votes: 28
Sep 25 08:56:03 SRV-2 corosync: [VOTEQ ] Waiting for all cluster members. Current votes: 7 expected_votes: 28
Sep 25 08:56:03 SRV-2 corosync: [VOTEQ ] Waiting for all cluster members. Current votes: 7 expected_votes: 28
Sep 25 08:56:03 SRV-2 corosync: [QUORUM] Members: 1 2 3 4 5 6 7
Sep 25 08:56:03 SRV-2 corosync: [MAIN ] Completed service synchronization, ready to provide service.
Sep 25 08:56:03 SRV-2 corosync: [VOTEQ ] Waiting for all cluster members. Current votes: 7 expected_votes: 28
Sep 25 08:56:03 SRV-2 systemd: Created slice system-systemd\x2dcoredump.slice.
Sep 25 08:56:03 SRV-2 systemd: Started Process Core Dump (PID 43798/UID 0).
Sep 25 08:56:03 SRV-2 systemd: corosync.service: Main process exited, code=dumped, status=11/SEGV
Sep 25 08:56:03 SRV-2 systemd: corosync.service: Failed with result 'core-dump'.
Sep 25 08:56:03 SRV-2 kernel: watchdog: watchdog0: watchdog did not stop!
Sep 25 08:56:03 SRV-2 systemd-coredump: Process 29089 (corosync) of user 0 dumped core.
                                        Stack trace of thread 29089:
                                        #0  0x0000000000000000 n/a (n/a)
Write failed: Broken pipe

coredumpctl info

           PID: 23658 (corosync)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 11 (SEGV)
     Timestamp: Mon 2018-09-24 09:50:58 +03 (1 day 3h ago)
  Command Line: corosync
    Executable: /usr/bin/corosync
 Control Group: /system.slice/corosync.service
          Unit: corosync.service
         Slice: system.slice
       Boot ID: 79d67a83f83c4804be6ded8e6bd5f54d
    Machine ID: 9b1ca27d3f4746c6bcfcdb93b83f3d45
      Hostname: SRV-1
       Storage: /var/lib/systemd/coredump/core.corosync.0.79d67a83f83c4804be6ded8e6bd5f54d.23658.153777185>
       Message: Process 23658 (corosync) of user 0 dumped core.
                Stack trace of thread 23658:
                #0  0x0000000000000000 n/a (n/a)

           PID: 5164 (corosync)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 11 (SEGV)
     Timestamp: Tue 2018-09-25 08:56:03 +03 (4h 9min ago)
  Command Line: corosync
    Executable: /usr/bin/corosync
 Control Group: /system.slice/corosync.service
          Unit: corosync.service
         Slice: system.slice
       Boot ID: 2f49ec6cdcc144f0a8eb712bbfbd7203
    Machine ID: 9b1ca27d3f4746c6bcfcdb93b83f3d45
      Hostname: SRV-1
       Storage: /var/lib/systemd/coredump/core.corosync.0.2f49ec6cdcc144f0a8eb712bbfbd7203.5164.1537854963>
       Message: Process 5164 (corosync) of user 0 dumped core.
                Stack trace of thread 5164:
                #0  0x0000000000000000 n/a (n/a)

I can't find more logs, so I can't dig into the problem.
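A generic next step for a SEGV like this (a sketch; if this is Arch, getting symbols may require rebuilding the package with debug info): open the most recent dump in gdb via coredumpctl and take a full backtrace:

coredumpctl gdb corosync    # opens the newest corosync core dump in gdb
# then inside gdb:
#   (gdb) bt full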
Ozbit (439 rep)
Sep 25, 2018, 10:03 AM • Last activity: Oct 11, 2018, 12:29 PM
1 vote
1 answer
848 views
How to install debug symbols for corosync package on CentOS?
I got a crash in corosync which I would like to view in gdb. However, currently the core dump shows me only this much info:

Debug logs for core.1385 (Generated on Jul 26 10:17 BST)
[Thread debugging using libthread_db enabled]
Core was generated by `corosync -f'.
Program terminated with signal 6, Aborted.
#0  0x00007f68b2783495 in raise () from /lib64/libc.so.6
#0  0x00007f68b2783495 in raise () from /lib64/libc.so.6
#1  0x00007f68b2784c75 in abort () from /lib64/libc.so.6
#2  0x00007f68b277c60e in __assert_fail_base () from /lib64/libc.so.6
#3  0x00007f68b277c6d0 in __assert_fail () from /lib64/libc.so.6
#4  0x00007f68b3530f2c in ?? () from /usr/lib64/libtotem_pg.so.4
#5  0x00007f68b3534eaf in ?? () from /usr/lib64/libtotem_pg.so.4
#6  0x00007f68b3535259 in ?? () from /usr/lib64/libtotem_pg.so.4
#7  0x00007f68b352f108 in rrp_deliver_fn () from /usr/lib64/libtotem_pg.so.4
#8  0x00007f68b352be2a in ?? () from /usr/lib64/libtotem_pg.so.4
#9  0x00007f68b3524482 in poll_run () from /usr/lib64/libtotem_pg.so.4
#10 0x00000000004079b6 in main ()

I guess I need to install the debuginfo packages for corosync and for whatever provides libtotem_pg.so.4. How do I do this?
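On CentOS/RHEL the usual route is the debuginfo packages (a sketch; it needs the debuginfo repositories enabled, and assumes libtotem_pg.so.4 comes from the corosynclib package, which is the common case):

yum install yum-utils                  # provides debuginfo-install
debuginfo-install corosync corosynclib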
Serge Rogatch (167 rep)
Jul 26, 2018, 04:05 PM • Last activity: Jul 26, 2018, 04:59 PM
2 votes
1 answer
2841 views
Unable to mount gfs2 file system on Debian Stretch, probable dlm mis-config?
I am experimenting with gfs2 on Debian Stretch and having some difficulties. I am a reasonably experienced Linux admin, but new to shared-disk and parallel file systems. My immediate project is to mount a gfs2-formatted iscsi-exported device on multiple clients as a shared file system. For the moment, I am not interested in HA or fencing, although this may be important later on.

The iscsi part is fine: I am able to log in to the target, format it as an xfs file system, and also mount it on multiple clients and verify that it shows up with the same blkid.

To do the gfs2 business, I am following the scheme on the Debian Stretch "gfs2" man page, modified for my config and embellished slightly by various searches and so forth. The man page is here: https://manpages.debian.org/stretch/gfs2-utils/gfs2.5.en.html

The actual error is that when I attempt to mount my gfs2 file system, the mount command returns with

mount: mount(2) failed: /mnt: No such file or directory

where /mnt is the desired mount point, which certainly does exist. (If you attempt to mount to a nonexistent mount point, the error is "mount: mount point /wrong does not exist".) Also, at each mount attempt, dmesg reports:

gfs2: can't find protocol lock_dlm

I briefly went down the path of assuming the problem was that Debian packages do not provide "/sbin/mount.gfs2", and looked for that, but I think that was an incorrect guess.

I have a five-machine cluster (of Raspberry Pis, in case it matters), named, somewhat idiosyncratically, pio, pi, pj, pk, and pl. They all have fixed static IP addresses, and there's no domain.

I have installed the Debian gfs2, corosync, and dlm-controld packages. For the corosync step, my corosync config is (e.g. for pio, intended to be the master of the cluster):

totem {
    version: 2
    cluster_name: rpitest
    token: 3000
    token_retransmits_before_loss_const: 10
    clear_node_high_bit: yes
    crypto_cipher: none
    crypto_hash: none
    nodeid: 17
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.0.17
        mcastport: 5405
        ttl: 1
    }
}

nodelist {
    node {
        ring0_addr: 192.168.0.17
        nodeid: 17
    }
    node {
        ring0_addr: 192.168.0.11
        nodeid: 1
    }
    node {
        ring0_addr: 192.168.0.12
        nodeid: 2
    }
    node {
        ring0_addr: 192.168.0.13
        nodeid: 3
    }
    node {
        ring0_addr: 192.168.0.14
        nodeid: 4
    }
}

logging {
    fileline: off
    to_stderr: no
    to_logfile: no
    to_syslog: yes
    syslog_facility: daemon
    debug: off
    timestamp: on
    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}

quorum {
    provider: corosync_votequorum
    expected_votes: 5
}

This file is present on all the nodes, with appropriate node-specific changes to the nodeid and bindnetaddr fields in the totem section. Corosync starts without error on all nodes, and all the nodes also have sane-looking output from corosync-quorumtool:

root@pio:~# corosync-quorumtool
Quorum information
------------------
Date:             Sun Apr 22 11:04:13 2018
Quorum provider:  corosync_votequorum
Nodes:            5
Node ID:          17
Ring ID:          1/124
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   5
Highest expected: 5
Total votes:      5
Quorum:           3
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
         1          1 192.168.0.11
         2          1 192.168.0.12
         3          1 192.168.0.13
         4          1 192.168.0.14
        17          1 192.168.0.17 (local)

The dlm-controld package was installed, and /etc/dlm/dlm.conf created with the following simple config. Again, I am skipping fencing for now. The dlm.conf file is the same on all the nodes:

enable_fencing=0
lockspace rpitest nodir=1
master rpitest node=17

I am unclear on whether or not the DLM "lockspace" name is supposed to match the corosync cluster name; I see the same behavior either way. The dlm-controld service starts without errors, and the output of "dlm_tool status" appears sane:

root@pio:~# dlm_tool status
cluster nodeid 17 quorate 1 ring seq 124 124
daemon now 1367 fence_pid 0
node 1 M add 31 rem 0 fail 0 fence 0 at 0 0
node 2 M add 31 rem 0 fail 0 fence 0 at 0 0
node 3 M add 31 rem 0 fail 0 fence 0 at 0 0
node 4 M add 31 rem 0 fail 0 fence 0 at 0 0
node 17 M add 7 rem 0 fail 0 fence 0 at 0 0

The gfs2 file system was created by:

mkfs -t gfs2 -p lock_dlm -j 5 -t rpitest:one /path/to/device

Subsequent to this, "blkid /path/to/device" reports:

/path/to/device: LABEL="rpitest:one" UUID= TYPE="gfs2"

It looks the same on all the iscsi clients.

At this point, I feel like I should be able to mount the gfs2 file system on any/all of the clients, but here is where I get the error above -- the mount command reports "no such file or directory", and dmesg and syslog report "gfs2: can't find protocol lock_dlm".

There are several other gfs2 guides out there, but many of them seem to be RH/CentOS specific and written for other cluster-management schemes besides corosync, like cman or pacemaker. Those aren't necessarily deal-breakers, but it's high-value to me to have this work on nearly-stock Debian Stretch.

It also seems likely to me that this is probably a pretty simple dlm misconfiguration, but I can't seem to nail it down.

Additional clues: when I try to "join" a lockspace via dlm_tool join ..., dmesg reports:

dlm cluster name 'rpitest' is being used without an application provided cluster name

This happens independently of whether the lockspace I am joining is "rpitest" or not. This suggests that lockspace names and cluster names are indeed the same thing, and/but that the dlm is evidently not aware of the corosync config?
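One low-risk check when the kernel reports it cannot find lock_dlm (a sketch, not a diagnosis): confirm the dlm and gfs2 modules are actually loaded on the client attempting the mount, and that dlm_controld answers queries:

lsmod | grep -E 'gfs2|dlm'
dlm_tool ls     # should run without error; lockspaces appear once a gfs2 mount begins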
Andrew Reid (53 rep)
Apr 22, 2018, 04:44 PM • Last activity: Apr 24, 2018, 06:09 AM