Unix & Linux Stack Exchange
Q&A for users of Linux, FreeBSD and other Unix-like operating systems
Latest Questions
0
votes
1
answers
2303
views
Secondary DRBD node does not auto-start in Pacemaker+Corosync setup
I am trying to set up a 2-PC cluster with shared resources: `ClusterIP`, `ClusterSamba`, `ClusterNFS`, `DRBD` (cloned resource), and a `DRBDFS`.
The beginning of the project followed the [Clusters from Scratch](https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Clusters_from_Scratch/index.html) guide. When everything in this guide is done, it works without problems.
So, I wanted to use parts of that guide and build my own setup:
I created one shared IP (`ClusterIP`) that is automatically assigned to one node, and (here is where it gets tricky) on that node, I mount my `/dev/drbd1` device to `/exports` and then share this mount through **SAMBA** and **NFS**.
When I start the cluster, all resources come up as they should, _but DRBD does not go up on the secondary node_ (`Primary/Unknown`). If I bring it up manually, it syncs and works. Also, when I stop the cluster (or forcibly reboot the first node), all resources transfer to the other node and everything works, _except DRBD on the other node goes into an Unknown state_.
### So now, here is the problem:
**Why does DRBD go down on the secondary node when I stop the cluster? Or why doesn't it start in the Secondary role on the secondary node?**
Sorry if my description is bad.
---
## Here are the commands I used
# apt install -y pacemaker pcs psmisc policycoreutils-python-utils drbd-utils samba nfs-kernel-server
# systemctl start pcsd.service
# systemctl enable pcsd.service
# passwd hacluster
# pcs host auth alice bob
# pcs cluster setup myCluster alice bob --force
# pcs cluster start --all
# pcs property set stonith-enabled=false
# pcs property set no-quorum-policy=ignore
# modprobe drbd
# echo drbd >/etc/modules-load.d/drbd.conf
# drbdadm create-md r0
# drbdadm up r0
# drbdadm primary r0 --force
# mkfs.ext4 /dev/drbd1
# systemctl disable smbd
# systemctl disable nfs-kernel-server.service
# mkdir /exports
# vi /etc/samba/smb.conf
# vi /etc/exports
# pcs resource create ClusterIP ocf:heartbeat:IPaddr2 ip=10.1.1.30 cidr_netmask=24 op monitor interval=30s
# pcs resource defaults resource-stickiness=100
# pcs resource op defaults timeout=240s
# pcs resource create ClusterSamba lsb:smbd op monitor interval=60s
# pcs resource create ClusterNFS ocf:heartbeat:nfsserver op monitor interval=60s
# pcs resource create DRBD ocf:linbit:drbd drbd_resource=r0 op monitor interval=60s
# pcs resource promotable DRBD promoted-max=1 promoted-node-max=1 clone-max=2 clone-node-max=1 notify=true
# pcs resource create DRBDFS Filesystem device="/dev/drbd1" directory="/exports" fstype="ext4"
# pcs constraint order ClusterIP then ClusterNFS
# pcs constraint order ClusterNFS then ClusterSamba
# pcs constraint order promote DRBD-clone then start DRBDFS
# pcs constraint order DRBDFS then ClusterNFS
# pcs constraint order ClusterIP then DRBD-clone
# pcs constraint colocation ClusterSamba with ClusterIP
# pcs constraint colocation add ClusterSamba with ClusterIP
# pcs constraint colocation add ClusterNFS with ClusterIP
# pcs constraint colocation add DRBDFS with DRBD-clone INFINITY with-rsc-role=Master
# pcs constraint colocation add DRBD-clone with ClusterIP
# pcs cluster stop --all && sleep 2 && pcs cluster start --all
---
## Configs and stats
### /etc/drbd.d/r0.res
resource r0 {
    device      /dev/drbd1;
    disk        /dev/sdb;
    meta-disk   internal;
    net {
        allow-two-primaries;
    }
    on alice {
        address 10.1.1.31:7788;
    }
    on bob {
        address 10.1.1.32:7788;
    }
}
---
### /etc/corosync/corosync.conf
totem {
    version: 2
    cluster_name: myCluster
    transport: knet
    crypto_cipher: aes256
    crypto_hash: sha256
}
nodelist {
    node {
        ring0_addr: alice
        name: alice
        nodeid: 1
    }
    node {
        ring0_addr: bob
        name: bob
        nodeid: 2
    }
}
quorum {
    provider: corosync_votequorum
    two_node: 1
}
logging {
    to_logfile: yes
    logfile: /var/log/corosync/corosync.log
    to_syslog: yes
    timestamp: on
}
---
### pcs status
Cluster name: myCluster
Stack: corosync
Current DC: alice (version 2.0.1-9e909a5bdd) - partition with quorum
Last updated: Fri May 15 12:28:30 2020
Last change: Fri May 15 11:04:50 2020 by root via cibadmin on bob
2 nodes configured
6 resources configured
Online: [ alice bob ]
Full list of resources:
ClusterIP (ocf::heartbeat:IPaddr2): Started alice
ClusterSamba (lsb:smbd): Started alice
ClusterNFS (ocf::heartbeat:nfsserver): Started alice
Clone Set: DRBD-clone [DRBD] (promotable)
Masters: [ alice ]
Stopped: [ bob ]
DRBDFS (ocf::heartbeat:Filesystem): Started alice
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
---
### pcs constraint --full
Location Constraints:
Ordering Constraints:
start ClusterIP then start ClusterNFS (kind:Mandatory) (id:order-ClusterIP-ClusterNFS-mandatory)
start ClusterNFS then start ClusterSamba (kind:Mandatory) (id:order-ClusterNFS-ClusterSamba-mandatory)
promote DRBD-clone then start DRBDFS (kind:Mandatory) (id:order-DRBD-clone-DRBDFS-mandatory)
start DRBDFS then start ClusterNFS (kind:Mandatory) (id:order-DRBDFS-ClusterNFS-mandatory)
start ClusterIP then start DRBD-clone (kind:Mandatory) (id:order-ClusterIP-DRBD-clone-mandatory)
start ClusterIP then promote DRBD-clone (kind:Mandatory) (id:order-ClusterIP-DRBD-clone-mandatory-1)
Colocation Constraints:
ClusterSamba with ClusterIP (score:INFINITY) (id:colocation-ClusterSamba-ClusterIP-INFINITY)
ClusterNFS with ClusterIP (score:INFINITY) (id:colocation-ClusterNFS-ClusterIP-INFINITY)
DRBDFS with DRBD-clone (score:INFINITY) (with-rsc-role:Master) (id:colocation-DRBDFS-DRBD-clone-INFINITY)
DRBD-clone with ClusterIP (score:INFINITY) (id:colocation-DRBD-clone-ClusterIP-INFINITY)
Ticket Constraints:
---
### /proc/drbd
version: 8.4.10 (api:1/proto:86-101)
srcversion: 983FCB77F30137D4E127B83
1: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----
ns:0 nr:4 dw:8 dr:17 al:1 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:4
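A hedged reading of the constraint list above: `DRBD-clone with ClusterIP (score:INFINITY)` applies to every instance of the clone, so the Secondary instance is never allowed to run on the node that does not hold `ClusterIP`, which matches the `Stopped: [ bob ]` output. A sketch of restricting the dependency to the promoted role instead (pcs 0.10 syntax assumed; the constraint ID is taken from the listing above):

    # sketch: let the clone run on both nodes and make the IP follow the promoted instance
    pcs constraint remove colocation-DRBD-clone-ClusterIP-INFINITY
    pcs constraint colocation add ClusterIP with DRBD-clone INFINITY with-rsc-role=Master

The two `start ClusterIP then ... DRBD-clone` order constraints would probably need the same treatment, since they also force the clone to wait for the IP.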
Miki
(31 rep)
May 15, 2020, 11:12 AM
• Last activity: Jun 19, 2025, 10:03 PM
2
votes
1
answers
2579
views
After failover Pacemaker moves resource back when node comes back
I'm using Pacemaker & Corosync for my cluster.
When a node dies, Pacemaker moves my resources to another online node. Everything is OK here.
But when the dead node comes back, Pacemaker moves the resources back.
I don't have any "location" line in my config, and I also tried the "unmove" command, but nothing changed.
I must have gone wrong somewhere and need to find the reason.
**crm configure sh**
node 1: DEV1
node 2: DEV2
primitive poolip IPaddr2 \
params ip=10.1.60.33 nic=enp2s0f0 cidr_netmask=24 \
meta migration-threshold=2 target-role=Started \
op monitor interval=20 timeout=20 on-fail=restart
primitive gui systemd:gui \
op monitor interval=20s \
meta target-role=Started
primitive gui-ip IPaddr2 \
params ip=10.1.60.35 nic=enp2s0f0 cidr_netmask=24 \
meta migration-threshold=2 target-role=Started \
op monitor interval=20 timeout=20 on-fail=restart
colocation cluster-gui inf: gui gui-ip
order gui-after-ip Mandatory: gui-ip gui
property cib-bootstrap-options: \
have-watchdog=false \
dc-version=2.0.0-1-8cf3fe749e \
cluster-infrastructure=corosync \
cluster-name=mycluster \
stonith-enabled=false \
no-quorum-policy=ignore \
last-lrm-refresh=1545920437
rsc_defaults rsc-options: \
migration-threshold=10 \
resource-stickiness=100
**pcs resource defaults**
migration-threshold=10
resource-stickiness=100
**pcs resource show gui**
Resource: gui (class=systemd type=gui)
Meta Attrs: target-role=Started
Operations: monitor interval=20s (gui-monitor-20s)
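For reference, a hedged diagnostic sketch (assuming crmsh and the usual Pacemaker tooling on this 2.0.0 install): the fail-back decision comes down to per-node allocation scores, which can be dumped from the live CIB, and stickiness can be raised above whatever is pulling the resource back:

    crm_simulate -sL | grep -E 'gui|poolip'                  # show allocation scores per node
    crm configure rsc_defaults resource-stickiness=INFINITY  # make the current placement always win

If the scores show something other than stickiness losing out, for example a leftover `cli-prefer-*` location entry created by an earlier `move`, that would point at the real cause.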
Ozbit
(439 rep)
Jan 2, 2019, 08:58 AM
• Last activity: Jun 14, 2025, 09:07 PM
1
votes
2
answers
8573
views
pcs stonith not working
I have 2 virtual CentOS 7 nodes; root can log in passwordlessly between them.
I have configured STONITH like this, but the services are not coming up and fencing is not happening. I'm new to this; could someone help me rectify the issue?
[root@node1 cluster]# pcs stonith create nub1 fence_virt pcmk_host_list="node1"
[root@node1 cluster]# pcs stonith create nub2 fence_virt pcmk_host_list="node2"
[root@node1 cluster]# pcs stonith show
nub1 (stonith:fence_virt): Stopped
nub2 (stonith:fence_virt): Stopped
[root@node1 cluster]#
[root@node1 cluster]# pcs status
Cluster name: mycluster
Stack: corosync
Current DC: node2 (version 1.1.15-11.el7_3.5-e174ec8) - partition with quorum
Last updated: Tue Jul 25 07:03:37 2017 Last change: Tue Jul 25 07:02:00 2017 by root via cibadmin on node1
2 nodes and 3 resources configured
Online: [ node1 node2 ]
Full list of resources:
ClusterIP (ocf::heartbeat:IPaddr2): Started node1
nub1 (stonith:fence_virt): Stopped
nub2 (stonith:fence_virt): Stopped
Failed Actions:
* nub1_start_0 on node1 'unknown error' (1): call=56, status=Error, exitreason='none',
last-rc-change='Tue Jul 25 07:01:34 2017', queued=0ms, exec=7006ms
* nub2_start_0 on node1 'unknown error' (1): call=58, status=Error, exitreason='none',
last-rc-change='Tue Jul 25 07:01:42 2017', queued=0ms, exec=7009ms
* nub1_start_0 on node2 'unknown error' (1): call=54, status=Error, exitreason='none',
last-rc-change='Tue Jul 25 07:01:26 2017', queued=0ms, exec=7010ms
* nub2_start_0 on node2 'unknown error' (1): call=60, status=Error, exitreason='none',
last-rc-change='Tue Jul 25 07:01:34 2017', queued=0ms, exec=7013ms
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
[root@node1 cluster]# pcs stonith fence node2
Error: unable to fence 'node2'
Command failed: No route to host
[root@node1 cluster]# pcs stonith fence nub2
Error: unable to fence 'nub2'
Command failed: No such device
[root@node1 cluster]# ping node2
PING node2 (192.168.100.102) 56(84) bytes of data.
64 bytes from node2 (192.168.100.102): icmp_seq=1 ttl=64 time=0.247 ms
64 bytes from node2 (192.168.100.102): icmp_seq=2 ttl=64 time=0.304 ms
^C
--- node2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.247/0.275/0.304/0.032 ms
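A hedged note on `fence_virt`: the agent only forwards fence requests to a `fence_virtd` daemon that has to be installed and configured on the virtualization host (usually answering multicast from the guests, with a shared key such as `/etc/cluster/fence_xvm.key`); if that daemon is unreachable, the stonith resources fail to start much like shown above. A quick check from the cluster nodes:

    fence_xvm -o list                # should list the VMs fence_virtd manages; a timeout means no daemon is reachable
    pcs stonith describe fence_virt  # shows which parameters (key file, multicast address, port/domain) the agent expects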
Mohammed Ali
(691 rep)
Jul 25, 2017, 11:10 AM
• Last activity: Feb 10, 2024, 02:01 AM
0
votes
1
answers
11525
views
DRBD - 'node1' not defined in your config (for this host) - Error when setting Primary
I am getting the following error when trying to set the Primary node for DRBD:
'node1' not defined in your config (for this host).
I know this is related to DNS/hostname/hosts and the config clusterdb.res. I know this because I originally got an error when trying to start clusterdb.res if node1 didn't resolve correctly. So what confuses me is that I can start clusterdb.res if I either use:
*I have used this command on the hosts*
hostnamectl set-hostname $(uname -n | sed s/\\..*//)
to make the hostname resolve to node1 instead of node1.localdomain,
or add node1.localdomain to the config; either works. But I have tried all combinations and can't seem to get this command to take:
drbdadm primary --force node1 && cat /proc/drbd
**My Configs**
/etc/drbd.d/clusterdb.res
resource clusterdb {
    protocol C;
    meta-disk internal;
    device /dev/drbd0;
    startup {
        wfc-timeout 30;
        outdated-wfc-timeout 20;
        degr-wfc-timeout 30;
    }
    net {
        cram-hmac-alg sha1;
        shared-secret sync_disk;
    }
    syncer {
        rate 10M;
        al-extents 257;
        on-no-data-accessible io-error;
        verify-alg sha1;
    }
    on node1 {
        disk /dev/sda3;
        address 192.168.1.216:7788;
    }
    on node2 {
        disk /dev/sda3;
        address 192.168.1.217:7788;
    }
}
/etc/hosts :
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.1.216 node1
192.168.1.217 node2
/etc/hostname
node#
My full write up ATM (wip)
**Edits :**
[root@node1 ~]# hostname
node1
[root@node1 ~]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
127.0.1.1 node1
192.168.1.216 node1
192.168.1.217 node2
[root@node1 ~]#
Update: I have gotten this to work with LVM following this guide exactly, so I think my issue actually lies with the following lines of code. But for now I think I will stick with LVM since it works, unless somebody else really wants to work on this. (My working LVM writeup)
device /dev/drbd0;
or
device /dev/drbd0;
The reason I say this is that I used the same hosts/hostname/shortname/IP address setup but with LVM, and it worked; then again, maybe I missed something the first time that I fixed in my new VM template (I started from scratch to build the LVM setup).
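One hedged observation about the failing command itself: `drbdadm primary` takes the *resource* name as its argument, not the node name, so with the config above it would be `clusterdb` rather than `node1` (while the `on node1 { ... }` stanza has to match `uname -n` on that host). A sketch:

    uname -n                           # must match the "on node1 { ... }" section on this host
    drbdadm dump clusterdb             # confirms the config parses and which host section is picked up
    drbdadm primary --force clusterdb  # promote the resource (argument is the resource, not the node)
    cat /proc/drbd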
FreeSoftwareServers
(2682 rep)
May 1, 2016, 01:59 AM
• Last activity: Mar 8, 2023, 02:50 AM
0
votes
1
answers
195
views
How to increase a number of nfsd threads in pcs/corosync environment?
I have an NFS server bound with *pcs/corosync* to provide stability and HA. The default number of nfsd threads is 8, which I find too low, as I can observe that all 8 are about 90% busy all the time.
The nfs/corosync/pcs system consists of 3 servers and a dist storage. How can I increase the number of threads: should I modify the */etc/sysconfig/nfs* file on all NFS nodes, or should I make some changes somewhere else?
Sorry for the newbie question, but I have nobody to ask. I appreciate any help, thank you.
OS: Centos 7.
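A hedged sketch for CentOS 7 (stock nfs-utils packaging assumed): the thread count is taken from `RPCNFSDCOUNT` in `/etc/sysconfig/nfs`, so it should be changed on every node that can run the NFS resource, and the running server can be resized on the fly with `rpc.nfsd`:

    # on every cluster node that can host the NFS server resource
    sed -i 's/^#\?RPCNFSDCOUNT=.*/RPCNFSDCOUNT=16/' /etc/sysconfig/nfs
    # apply immediately on the node currently serving (or let the next restart/failover pick it up)
    rpc.nfsd 16
    cat /proc/fs/nfsd/threads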
Mirimat
(33 rep)
Sep 13, 2022, 09:16 AM
• Last activity: Oct 4, 2022, 01:28 PM
0
votes
1
answers
200
views
Convert puppet manifest config to hiera
I installed a corosync-pacemaker cluster via Puppet. Now I would like to keep my data in a Hiera file. How should I convert the cs_primitive section into a YAML file?
cs_primitive { 'nfsshare_fs':
primitive_class => 'ocf',
primitive_type => 'Filesystem',
provided_by => 'heartbeat',
parameters => { 'device' => '/dev/disk/lvname', 'directory' => '/share', 'fstype' => 'ext4' },
}->
I tried the below code but it didn't work.
corosync::cs_primitive:
'nfsshare_fs':
primitive_class: 'ocf'
primitive_type: 'Filesystem'
provided_by: 'heartbeat'
parameters:
device: '/dev/disk/by-id/lvname'
directory: '/share'
fstype: 'ext4'
Thanks.
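A hedged note (Puppet 4+ assumed): Hiera only binds data automatically to *class parameters*, so a key for a resource type like `cs_primitive` is not consumed by itself; the manifest still needs something like `create_resources` or an explicit `lookup` to turn the hash into resources. Whether the key resolves at all can be checked with:

    # key name taken from the attempt above; the node name is a placeholder
    puppet lookup corosync::cs_primitive --node web01.example.com --explain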
fortunate1357
(1 rep)
Apr 4, 2022, 06:21 PM
• Last activity: Jul 14, 2022, 07:27 PM
0
votes
1
answers
26
views
VM managed by corosync not detecting new CPUs
I have an HA cluster managed by corosync, and I need to increase the CPU allocation to one of the VMs.
I have done the following:
* `pcs resource disable myVM`
* Wait for the VM to stop
* Edit the xml file (confirmed the correct file by `pcs resource show --full`) - within the `cpu` section I changed the entry: `` to change the number of cores to 8.
* Make sure that xml file is synced across all physical hosts
* `pcs resource enable myVM`
But when the VM comes back up, `/proc/cpuinfo` shows that it still has only 4 cores (I don't have hot plug CPUs enabled / am not sure how to enable this). There are plenty of CPU cores available on the physical hosts.
Can anyone tell me what I'm doing wrong that's preventing the VM from starting up with 8 cores instead of 4? It must be something obvious but I can't see it!
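A hedged guess worth ruling out (plain libvirt semantics, nothing Pacemaker-specific): the number of CPUs a guest boots with comes from the `<vcpu>` element of the domain XML, not only from the `<cpu><topology .../></cpu>` section, so editing just the topology can leave the guest at 4 vCPUs. A quick check, with `myVM` and the XML path standing in for the real names:

    pcs resource config myVM                         # (or "pcs resource show myVM" on older pcs) - which XML file is configured?
    grep '<vcpu' /path/to/myVM.xml                   # hypothetical path: the <vcpu> count is what the guest boots with
    virsh dumpxml myVM | grep -E '<vcpu|topology'    # what the running domain actually received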
Phil Evans
(101 rep)
Jun 9, 2022, 10:57 AM
• Last activity: Jun 17, 2022, 07:55 AM
0
votes
0
answers
459
views
HA-Cluster / corosync / pacemaker: Active-Active cluster with service ip / service ip is not switching
How do I configure crm to migrate the ServiceIP if one service has failed?
node 1: web01a \
attributes standby=off
node 2: web01b \
attributes standby=off
primitive Apache2 systemd:apache2 \
operations $id=Apache2-operations \
op start interval=0 timeout=100 \
op stop interval=0 timeout=100 \
op monitor interval=15 timeout=100 start-delay=15 \
meta
primitive PHP-FPM systemd:php7.4-fpm \
operations $id=PHP-FPM-operations \
op start interval=0 timeout=100 \
op stop interval=0 timeout=100 \
op monitor interval=15 timeout=100 start-delay=15 \
meta
primitive Redis systemd:redis-server \
operations $id=Redis-operations \
op start interval=0 timeout=100 \
op stop interval=0 timeout=100 \
op monitor interval=15 timeout=100 start-delay=15 \
meta
primitive ServiceIP IPaddr2 \
params ip=1.2.3.4 \
operations $id=ServiceIP-operations \
op monitor interval=10 timeout=20 start-delay=0 \
op_params migration-threshold=1 \
meta
primitive lsyncd systemd:lsyncd \
op start interval=0 timeout=100 \
op stop interval=0 timeout=100 \
op monitor interval=15 timeout=100 start-delay=15 \
meta target-role=Started
group ActiveNode ServiceIP lsyncd
group WebServer Apache2 PHP-FPM Redis
clone cl_WS WebServer \
meta clone-max=2 notify=true interleave=true
colocation col_cl_WS_ActiveNode 100: cl_WS ActiveNode
property cib-bootstrap-options: \
have-watchdog=false \
dc-version=2.0.3-4b1f869f0f \
cluster-infrastructure=corosync \
cluster-name=debian \
stonith-enabled=false \
no-quorum-policy=ignore \
startup-fencing=false \
maintenance-mode=false \
last-lrm-refresh=1622628525 \
start-failure-is-fatal=true
These services should always be started
- Apache2
- PHP-FPM
- Redis
If one of these services is not running, the node is unhealthy.
The **ServiceIP** and **lsyncd** should switch to a healthy node.
When I kill the apache2 process, the IP is not switched.
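For reference, a hedged sketch in crmsh (resource names taken from the config above): the only link between the web-server clone and the `ActiveNode` group is an advisory score of 100 in one direction, so a dead Apache2 never forces `ServiceIP`/`lsyncd` off the node. One common approach is to make the group require a healthy clone instance:

    # place ActiveNode (ServiceIP + lsyncd) only where a WebServer clone instance is running
    crm configure colocation col_ActiveNode_with_WS inf: ActiveNode cl_WS

How quickly a single failed monitor then evicts the node depends on the `migration-threshold`/`on-fail` settings of the clone's primitives.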
FaxMax
(726 rep)
Jun 2, 2021, 12:29 PM
2
votes
1
answers
5355
views
Pacemaker - Corosync - HA - Simple Custom Resource Testing - Status flapping - Started - Failed - Stopped - Started
I am testing using the OCF:Heartbeat:Dummy script and I want to make a very basic setup just to know it works and build on that.
The only information I can find was this web blog here.
https://raymii.org/s/tutorials/Corosync_Pacemaker_-_Execute_a_script_on_failover.html
It has some typos but basically worked for me.
The script currently just contains the following :
sudo nano /usr/local/bin/failover.sh && sudo chmod +x /usr/local/bin/failover.sh
#!/bin/sh
touch /tmp/testfailover.sh
Here is my setup :
cp /usr/lib/ocf/resource.d/heartbeat/Dummy /usr/lib/ocf/resource.d/heartbeat/FailOverScript
sudo nano /usr/lib/ocf/resource.d/heartbeat/FailOverScript
dummy_start() {
    dummy_monitor
    /usr/local/bin/failover.sh
    if [ $? = $OCF_SUCCESS ]; then
        return $OCF_SUCCESS
    fi
    touch ${OCF_RESKEY_state}
}
sed -i 's/Dummy/FailOverScript/g' /usr/lib/ocf/resource.d/heartbeat/FailOverScript
sed -i 's/dummy/FailOverScript/g' /usr/lib/ocf/resource.d/heartbeat/FailOverScript
pcs resource create FailOverScript ocf:heartbeat:FailOverScript op monitor interval="30"
The only testing I can really do :
[root@node2 ~]# /usr/lib/ocf/resource.d/heartbeat/FailOverScript start ; echo $?
DEBUG: default start : 0
0
ocf-tester doesn't seem to exist in the latest HA software suite, and I'm not really sure how to install it manually, but the script "half works".
**The script doesn't need monitoring; it's supposed to be very basic, but it seems to be flapping and giving me the following error. Any ideas what to do?**
FailOverScript (ocf::heartbeat:FailOverScript): Started node2
Failed Actions:
* FailOverScript_monitor_30000 on node2 'not running' (7): call=24423, status=complete, exitreason='none',
    last-rc-change='Tue Aug 16 15:53:50 2016', queued=0ms, exec=9ms
**Example of what I want to do:**
Cluster start
Script runs "start.sh"
Cluster fails over to node2.
On node1 script runs "fail.sh"
On node2 script runs "start.sh"
and vice versa if it fails over in the other direction.
Note: The script does work; I get /tmp/testfailover.sh. I even tried putting another script under dummy_stop to remove the file, and that worked, but it just keeps flapping along, removing/adding the file and starting/failing/stopping/starting, etc.
Thanks for reading!
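A hedged reading of the flapping, based on the stock Dummy agent's convention of using `${OCF_RESKEY_state}` as its "I am running" marker: in the modified start function above, whenever `failover.sh` exits 0 (which equals `$OCF_SUCCESS`) the early `return` is taken and the state file is never created, so the next monitor reports "not running" and Pacemaker restarts the resource forever. A sketch of a start that always leaves the marker behind (post-`sed` function names):

    FailOverScript_start() {
        if FailOverScript_monitor; then
            return $OCF_SUCCESS        # already running, nothing to do
        fi
        /usr/local/bin/failover.sh     # run the failover hook
        touch "${OCF_RESKEY_state}"    # marker the monitor action checks for
        return $OCF_SUCCESS
    }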
FreeSoftwareServers
(2682 rep)
Aug 16, 2016, 07:56 PM
• Last activity: Dec 21, 2020, 06:56 AM
1
votes
0
answers
603
views
Cannot seem to start pcs cluster (NFS Cluster) disk_fencing trouble
For the life of me, I can't find a clear answer on how to start my NFS active / passive cluster. I have two nodes, node1 and node2 and followed the guide here: https://www.linuxtechi.com/configure-nfs-server-clustering-pacemaker-centos-7-rhel-7/
Here are my logs:
May 25 10:35:59 node1 stonith-ng: notice: Couldn't find anyone to fence (on) node1 with any device
May 25 10:35:59 node1 stonith-ng: error: Operation on of node1 by for crmd.3928@node1.97f683f8: No route to host
May 25 10:35:59 node1 crmd: notice: Stonith operation 142/2:72:0:f3e078bf-24f5-4160-95c1-0eeeea0e5e12: No route to host (-113)
May 25 10:35:59 node1 crmd: notice: Stonith operation 142 for node1 failed (No route to host): aborting transition.
May 25 10:35:59 node1 crmd: warning: Too many failures (71) to fence node1, giving up
May 25 10:35:59 node1 crmd: notice: Transition aborted: Stonith failed
May 25 10:35:59 node1 crmd: error: Unfencing of node1 by failed: No route to host (-113)
May 25 10:35:59 node1 stonith-ng: notice: Couldn't find anyone to fence (on) node2 with any device
May 25 10:35:59 node1 stonith-ng: error: Operation on of node2 by for crmd.3928@node1.2680795a: No route to host
May 25 10:35:59 node1 crmd: notice: Stonith operation 143/1:72:0:f3e078bf-24f5-4160-95c1-0eeeea0e5e12: No route to host (-113)
May 25 10:35:59 node1 crmd: notice: Stonith operation 143 for node2 failed (No route to host): aborting transition.
May 25 10:35:59 node1 crmd: warning: Too many failures (71) to fence node2, giving up
May 25 10:35:59 node1 crmd: error: Unfencing of node2 by failed: No route to host (-113)
Here is the status:
[root@node1 ~]# pcs status
Cluster name: nfs_cluster
Stack: corosync
Current DC: node1 (version 1.1.20-5.amzn2.0.2-3c4c782f70) - partition with quorum
Last updated: Mon May 25 10:45:56 2020
Last change: Sun May 24 21:04:55 2020 by root via cibadmin on node1
2 nodes configured
5 resources configured
Online: [ node1 node2 ]
Full list of resources:
disk_fencing (stonith:fence_scsi): Stopped
Resource Group: nfsgrp
nfsshare (ocf::heartbeat:Filesystem): Stopped
nfsd (ocf::heartbeat:nfsserver): Stopped
nfsroot (ocf::heartbeat:exportfs): Stopped
nfsip (ocf::heartbeat:IPaddr2): Stopped
Failed Fencing Actions:
* unfencing of node2 failed: delegate=, client=crmd.3928, origin=node1,
last-failed='Mon May 25 10:35:59 2020'
* unfencing of node1 failed: delegate=, client=crmd.3928, origin=node1,
last-failed='Mon May 25 10:35:59 2020'
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
[root@node1 ~]#
The disk_fencing resource is set to fence_scsi, but I'm not sure that is the best option for two AWS EC2 instances. Perhaps I can't get disk_fencing to work, so it can't start? I can ping node1 from node2 and vice versa. Open to ideas...
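A hedged note on the fencing choice: `fence_scsi` relies on SCSI-3 persistent reservations on a disk shared by both nodes (normally passed via its `devices` parameter), which ordinary per-instance EBS volumes do not provide, and that would explain the agent never starting. Some things to check, with the device path being a placeholder:

    pcs stonith describe fence_scsi                  # required parameters (devices=, ...)
    sg_persist --in --read-keys --device=/dev/xvdf   # hypothetical shared device: does it accept SCSI-3 PR at all?
    # On EC2, an instance-level agent such as fence_aws (fence-agents-aws) is the usual alternative.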
jasontt33
(11 rep)
May 25, 2020, 10:49 AM
0
votes
0
answers
685
views
Unmount data volume hosting NFS exports
So I've got a customer with two NFS servers set up in a Pacemaker active/standby cluster configuration. It is a legacy system running RHEL 6. On the servers there is /mnt/data1, which is an xfs mountpoint on a drbd-mirrored disk. The mount is active on one node at a time and controlled by pacemaker (so is drbd, for that matter).
My problem is in critical cases where I need to move the active services to the other server without shutting down the NFS clients first. I can shut down the NFS services, but no matter what I try, I can't unmount the /mnt/data1 filesystem as it reports as 'busy'.
I tried changing the daemon stop sequence on the nodes. Right now I have the following sequence:
- rpc.mountd
- nfsd
- exportfs -au
- rpc.statd
Both 'lsof /mnt/data1' and 'fuser -mv /mnt/data1' do not report any open files on the mountpoint and I can verify no terminal sessions are there either. Short of having to shutdown the box (which kills any debugging I would like to do), I can't get the filesystem unmounted to allow pacemaker to cleanly move the filesystem mount to the other node. I assume that there are some hanging file locks, but I'm not sure how else to kill them.
Any ideas are appreciated.
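A hedged sketch of a stop sequence to try (RHEL 6 tooling assumed): with kernel NFS, the export can be held busy by kernel nfsd/lockd threads rather than by userspace processes, which would explain why `lsof` and `fuser` come back clean; unexporting and stopping the kernel services first, with a lazy unmount only as a last resort, sometimes frees the filesystem:

    exportfs -ua              # withdraw all exports
    service nfs stop          # stop rpc.mountd and the kernel nfsd threads
    service nfslock stop      # drop lockd/statd state held against the export
    umount /mnt/data1 || umount -l /mnt/data1   # lazy unmount only as a last resort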
pbrunnen
(113 rep)
Nov 10, 2019, 05:36 PM
• Last activity: Nov 10, 2019, 05:41 PM
1
votes
1
answers
2580
views
Pacemaker: Primary node is rebooted and comes back is primary instead of standby
We are using Pacemaker and Corosync to automate failovers. We noticed one behaviour: when the primary node is rebooted, the standby node takes over as primary, which is fine.
When the node comes back online and services are started on it, it takes back the role of primary. It should ideally start as standby.
Are we missing any configuration?
> pcs resource defaults
O/p:
resource-stickiness: INFINITY
migration-threshold: 0
Stickiness is set to INFINITY. Please suggest.
Adding Config details:
======================
[root@Node1 heartbeat]# pcs config show –l
Cluster Name: cluster1
Corosync Nodes:
 Node1 Node2
Pacemaker Nodes:
 Node1 Node2
Resources:
 Master: msPostgresql
  Meta Attrs: master-node-max=1 clone-max=2 notify=true master-max=1 clone-node-max=1
  Resource: pgsql (class=ocf provider=heartbeat type=pgsql)
   Attributes: master_ip=10.70.10.1 node_list="Node1 Node2" pgctl=/usr/pgsql-9.6/bin/pg_ctl pgdata=/var/lib/pgsql/9.6/data/ primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" psql=/usr/pgsql-9.6/bin/psql rep_mode=async restart_on_promote=true restore_command="cp /var/lib/pgsql/9.6/data/archivedir/%f %p"
   Meta Attrs: failure-timeout=60
   Operations: demote interval=0s on-fail=stop timeout=60s (pgsql-demote-interval-0s)
               methods interval=0s timeout=5s (pgsql-methods-interval-0s)
               monitor interval=4s on-fail=restart timeout=60s (pgsql-monitor-interval-4s)
               monitor interval=3s on-fail=restart role=Master timeout=60s (pgsql-monitor-interval-3s)
               notify interval=0s timeout=60s (pgsql-notify-interval-0s)
               promote interval=0s on-fail=restart timeout=60s (pgsql-promote-interval-0s)
               start interval=0s on-fail=restart timeout=60s (pgsql-start-interval-0s)
               stop interval=0s on-fail=block timeout=60s (pgsql-stop-interval-0s)
 Group: master-group
  Resource: vip-master (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: cidr_netmask=24 ip=10.70.10.2
   Operations: monitor interval=10s on-fail=restart timeout=60s (vip-master-monitor-interval-10s)
               start interval=0s on-fail=restart timeout=60s (vip-master-start-interval-0s)
               stop interval=0s on-fail=block timeout=60s (vip-master-stop-interval-0s)
  Resource: vip-rep (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: cidr_netmask=24 ip=10.70.10.1
   Meta Attrs: migration-threshold=0
   Operations: monitor interval=10s on-fail=restart timeout=60s (vip-rep-monitor-interval-10s)
               start interval=0s on-fail=stop timeout=60s (vip-rep-start-interval-0s)
               stop interval=0s on-fail=ignore timeout=60s (vip-rep-stop-interval-0s)
Stonith Devices:
Fencing Levels:
Location Constraints:
Ordering Constraints:
 promote msPostgresql then start master-group (score:INFINITY) (non-symmetrical)
 demote msPostgresql then stop master-group (score:0) (non-symmetrical)
Colocation Constraints:
 master-group with msPostgresql (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master)
Ticket Constraints:
Alerts:
 No alerts defined
Resources Defaults:
 resource-stickiness: INFINITY
 migration-threshold: 0
Operations Defaults:
 No defaults set
Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: cluster1
 cluster-recheck-interval: 60
 dc-version: 1.1.19-8.el7-c3c624ea3d
 have-watchdog: false
 no-quorum-policy: ignore
 start-failure-is-fatal: false
 stonith-enabled: false
Node Attributes:
 Node1: pgsql-data-status=STREAMING|ASYNC
 Node2: pgsql-data-status=LATEST
Quorum:
 Options:
Thanks!
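A hedged diagnostic sketch (not taken from the configuration above): `resource-stickiness` only decides where an instance keeps *running*; which instance gets *promoted* is driven by the master scores the pgsql resource agent derives from replication state (the `pgsql-data-status` node attributes shown above), so comparing the scores after the old primary rejoins usually shows why it wins the promotion back:

    crm_simulate -sL | grep -iE 'promotion|pgsql'   # allocation and promotion scores from the live CIB
    crm_mon -A1                                     # one-shot status including node attributes such as pgsql-data-status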
User2019
(11 rep)
Sep 12, 2019, 09:30 AM
• Last activity: Sep 16, 2019, 06:18 PM
0
votes
1
answers
887
views
corosync 2Node vs two_node flags
Is the **`2Node`** flag equivalent to the **`two_node`** flag? Is it the same?
[root@srv1 ~]# corosync-quorumtool -s
Quorum information
------------------
Date: Wed Mar 20 04:49:10 2019
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 1
Ring ID: 1/464
Quorate: Yes
Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 2
Quorum: 1
Flags: 2Node Quorate WaitForAll
Membership information
----------------------
Nodeid Votes Name
1 1 srv1cr1 (local)
2 1 srv2cr1
http://people.redhat.com/ccaulfie/docs/Votequorum_Intro.pdf
I was not able to find an answer to my question, and the documentation only references **`two_node`**.
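A hedged way to confirm the correspondence on a running cluster (standard corosync tooling): `two_node: 1` in `corosync.conf` is what `corosync-quorumtool` reports back as the `2Node` flag, and it implies `wait_for_all` unless that is explicitly disabled, which matches the `WaitForAll` flag in the output above:

    grep -A3 '^quorum' /etc/corosync/corosync.conf   # look for "two_node: 1"
    corosync-quorumtool -s | grep Flags              # reported as "2Node Quorate WaitForAll"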
blabla_trace
(385 rep)
Mar 20, 2019, 08:57 AM
• Last activity: Aug 9, 2019, 02:42 PM
2
votes
1
answers
3241
views
PCS Stonith (fencing) will kill two node cluster if first is down
I have configured a two node physical server cluster (HP ProLiant DL560 Gen8) using pcs (corosync/pacemaker/pcsd). I have also configured fencing on them using fence_ilo4.
The weird thing happens if one node goes down (by DOWN I mean powered OFF): the second node will die as well. Fencing will kill it, causing both servers to be offline.
How do I correct this behavior?
The thing I tried is to add `wait_for_all: 0` and `expected_votes: 1` in `/etc/corosync/corosync.conf` under the `quorum` section. But it still kills it.
At some point, some maintenance will have to be performed on one of those servers, and it will have to be shut down. I don't want the other node to go down if this happens.
Here are some outputs
[root@kvm_aquila-02 ~]# pcs quorum status
Quorum information
------------------
Date: Fri Jun 28 09:07:18 2019
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 2
Ring ID: 1/284
Quorate: Yes
Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 2
Quorum: 1
Flags: 2Node Quorate
Membership information
----------------------
Nodeid Votes Qdevice Name
1 1 NR kvm_aquila-01
2 1 NR kvm_aquila-02 (local)
[root@kvm_aquila-02 ~]# pcs config show
Cluster Name: kvm_aquila
Corosync Nodes:
kvm_aquila-01 kvm_aquila-02
Pacemaker Nodes:
kvm_aquila-01 kvm_aquila-02
Resources:
Clone: dlm-clone
Meta Attrs: interleave=true ordered=true
Resource: dlm (class=ocf provider=pacemaker type=controld)
Operations: monitor interval=30s on-fail=fence (dlm-monitor-interval-30s)
start interval=0s timeout=90 (dlm-start-interval-0s)
stop interval=0s timeout=100 (dlm-stop-interval-0s)
Clone: clvmd-clone
Meta Attrs: interleave=true ordered=true
Resource: clvmd (class=ocf provider=heartbeat type=clvm)
Operations: monitor interval=30s on-fail=fence (clvmd-monitor-interval-30s)
start interval=0s timeout=90s (clvmd-start-interval-0s)
stop interval=0s timeout=90s (clvmd-stop-interval-0s)
Group: test_VPS
Resource: test (class=ocf provider=heartbeat type=VirtualDomain)
Attributes: config=/shared/xml/test.xml hypervisor=qemu:///system migration_transport=ssh
Meta Attrs: allow-migrate=true is-managed=true priority=100 target-role=Started
Utilization: cpu=4 hv_memory=4096
Operations: migrate_from interval=0 timeout=120s (test-migrate_from-interval-0)
migrate_to interval=0 timeout=120 (test-migrate_to-interval-0)
monitor interval=10 timeout=30 (test-monitor-interval-10)
start interval=0s timeout=300s (test-start-interval-0s)
stop interval=0s timeout=300s (test-stop-interval-0s)
Stonith Devices:
Resource: kvm_aquila-01 (class=stonith type=fence_ilo4)
Attributes: ipaddr=10.0.4.39 login=fencing passwd=0ToleranciJa pcmk_host_list="kvm_aquila-01 kvm_aquila-02"
Operations: monitor interval=60s (kvm_aquila-01-monitor-interval-60s)
Resource: kvm_aquila-02 (class=stonith type=fence_ilo4)
Attributes: ipaddr=10.0.4.49 login=fencing passwd=0ToleranciJa pcmk_host_list="kvm_aquila-01 kvm_aquila-02"
Operations: monitor interval=60s (kvm_aquila-02-monitor-interval-60s)
Fencing Levels:
Location Constraints:
Ordering Constraints:
start dlm-clone then start clvmd-clone (kind:Mandatory)
Colocation Constraints:
clvmd-clone with dlm-clone (score:INFINITY)
Ticket Constraints:
Alerts:
No alerts defined
Resources Defaults:
No defaults set
Operations Defaults:
No defaults set
Cluster Properties:
cluster-infrastructure: corosync
cluster-name: kvm_aquila
dc-version: 1.1.19-8.el7_6.4-c3c624ea3d
have-watchdog: false
last-lrm-refresh: 1561619537
no-quorum-policy: ignore
stonith-enabled: true
Quorum:
Options:
wait_for_all: 0
[root@kvm_aquila-02 ~]# pcs cluster status
Cluster Status:
Stack: corosync
Current DC: kvm_aquila-02 (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with quorum
Last updated: Fri Jun 28 09:14:11 2019
Last change: Thu Jun 27 16:23:44 2019 by root via cibadmin on kvm_aquila-01
2 nodes configured
7 resources configured
PCSD Status:
kvm_aquila-02: Online
kvm_aquila-01: Online
[root@kvm_aquila-02 ~]# pcs status
Cluster name: kvm_aquila
Stack: corosync
Current DC: kvm_aquila-02 (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with quorum
Last updated: Fri Jun 28 09:14:31 2019
Last change: Thu Jun 27 16:23:44 2019 by root via cibadmin on kvm_aquila-01
2 nodes configured
7 resources configured
Online: [ kvm_aquila-01 kvm_aquila-02 ]
Full list of resources:
kvm_aquila-01 (stonith:fence_ilo4): Started kvm_aquila-01
kvm_aquila-02 (stonith:fence_ilo4): Started kvm_aquila-02
Clone Set: dlm-clone [dlm]
Started: [ kvm_aquila-01 kvm_aquila-02 ]
Clone Set: clvmd-clone [clvmd]
Started: [ kvm_aquila-01 kvm_aquila-02 ]
Resource Group: test_VPS
test (ocf::heartbeat:VirtualDomain): Started kvm_aquila-01
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
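For the planned-maintenance case, a hedged sketch (pcs 0.9 syntax assumed): shutting a node down cleanly through the cluster should not trigger fencing at all, and a static `delay` on one of the fence devices is the usual way to keep the two nodes from fencing each other after a real split:

    pcs cluster standby kvm_aquila-01    # move resources off the node first
    pcs cluster stop kvm_aquila-01       # clean stop: the peer sees a graceful leave, no fencing is requested
    # stagger one device so only one node can win a post-split fence race:
    pcs stonith update kvm_aquila-01 delay=15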
Marko Todoric
(437 rep)
Jun 28, 2019, 07:14 AM
• Last activity: Jun 28, 2019, 02:38 PM
1
votes
2
answers
9752
views
Corosync error "No interfaces defined" in a cluster member
I am having an error starting corosync on a cluster member:
May 16 00:53:32 neftis corosync: [MAIN ] Corosync Cluster Engine ('2.3.4'): started and ready to provide service.
May 16 00:53:32 neftis corosync: [MAIN ] Corosync built-in features: dbus systemd xmlconf snmp pie relro bindnow
May 16 00:53:32 neftis corosync: [MAIN ] parse error in config: No interfaces defined
May 16 00:53:32 neftis corosync: [MAIN ] Corosync Cluster Engine exiting with status 8 at main.c:1278.
May 16 00:53:32 neftis corosync: Starting Corosync Cluster Engine (corosync): [FALLÓ]
May 16 00:53:32 neftis systemd: corosync.service: control process exited, code=exited status=1
May 16 00:53:32 neftis systemd: Failed to start Corosync Cluster Engine.
May 16 00:53:32 neftis systemd: Unit corosync.service entered failed state.
May 16 00:53:32 neftis systemd: corosync.service failed.
May 16 00:54:06 neftis systemd: Cannot add dependency job for unit firewalld.service, ignoring: Unit firewalld.service is masked.
May 16 00:54:06 neftis systemd: Starting Corosync Cluster Engine...
May 16 00:54:06 neftis corosync: [MAIN ] Corosync Cluster Engine ('2.3.4'): started and ready to provide service.
May 16 00:54:06 neftis corosync: [MAIN ] Corosync built-in features: dbus systemd xmlconf snmp pie relro bindnow
May 16 00:54:06 neftis corosync: [MAIN ] parse error in config: No interfaces defined
May 16 00:54:06 neftis corosync: [MAIN ] Corosync Cluster Engine exiting with status 8 at main.c:1278.
May 16 00:54:06 neftis corosync: Starting Corosync Cluster Engine (corosync): [FALLÓ]
May 16 00:54:06 neftis systemd: corosync.service: control process exited, code=exited status=1
May 16 00:54:06 neftis systemd: Failed to start Corosync Cluster Engine.
May 16 00:54:06 neftis systemd: Unit corosync.service entered failed state.
Here is my config on the three nodes, but it is failing just on netfis, which I added recently.
totem {
    version: 2
    secauth: off
    cluster_name: cluster-osiris
    transport: udpu
}
nodelist {
    node {
        ring0_addr: isis.localdoamin
        nodeid: 1
    }
    node {
        ring0_addr: horus.localdoamin
        nodeid: 2
    }
    node {
        ring0_addr: netfis.localdoamin
        nodeid: 3
    }
}
quorum {
    provider: corosync_votequorum
}
logging {
    to_syslog: yes
}
I am running a pacemaker, corosync, pcs cluster on 64-bit CentOS 7.1.
I searched on the internet, but it is not clear what is going on.
Could you help me?
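A hedged observation drawn from the post itself: with `transport: udpu`, corosync has to resolve one `ring0_addr` from the nodelist to an address that is configured locally on the node it starts on, and the failing host calls itself `neftis` in the logs while the nodelist entry reads `netfis.localdoamin`, so a resolution mismatch would produce exactly "parse error in config: No interfaces defined". A quick check on the failing node:

    hostname                          # what this node calls itself (the logs say "neftis")
    getent hosts netfis.localdoamin   # does the nodelist entry resolve at all?
    ip -o -4 addr show                # is the resolved address actually configured on a local interface?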
mijhael3000
(85 rep)
May 16, 2016, 04:28 AM
• Last activity: Jun 18, 2019, 08:42 AM
1
votes
0
answers
788
views
Adding (cifs, samba)Filesystem resource to PCS
I am trying to create a _PCS_ _Filesystem_ resource on a Samba share (cifs) filesystem type.
Here is the resource I have created:
root@shaunak-VirtualBox:~# pcs resource show SMBDiskResourceName
Resource: SMBDiskResourceName (class=ocf provider=heartbeat type=Filesystem)
Attributes: device=//192.168.1.6/my_data_share directory=/var/opt/my/data fstype=cifs options="vers=3.0,username=myuser,password=myuser,uid= 998,gid=998,file_mode=0777,dir_mode=0777"
Here is the error I am getting while starting the resource:
root@shaunak-VirtualBox:~# pcs resource debug-start SMBDiskResourceName
Error performing operation: Operation not permitted
Operation start for SMBDiskResourceName (ocf:heartbeat:Filesystem) returned 1
> stderr: INFO: Running start for //192.168.1.6/mssql_data_share on /var/opt/mssql/data
> stderr:
> stderr: Usage:
> stderr: mount [-lhV]
> stderr: mount -a [options]
> stderr: mount [options] [--source] | [--target]
> stderr: mount [options]
> stderr: mount []
> stderr:
> stderr: Mount a filesystem.
> stderr:
.....
> stderr:
> stderr: For more details see mount(8).
> stderr: ocf-exit-reason:Couldn't mount filesystem //192.168.1.6/my_data_share on /var/opt/my/data
The above error says the mount is not happening for //192.168.1.6/my_data_share,
but when I run the mount command manually, I am able to mount it.
Not sure what I am missing here. Here is my mount command, which I executed successfully:
mount -t cifs //192.168.1.6/my_data_share /var/opt/my/data -o vers=3.0,username=myuser,password=myuser,uid=998,gid=998,file_mode=0777,dir_mode=0777
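A hedged observation about the configured options: the `Attributes` line above contains `uid= 998` with a space after the equals sign; if that space is really stored in the CIB (and is not just a paste artifact), it splits the `-o` argument and makes `mount` print its usage text, which is exactly what the debug output shows. Re-creating the option string without the space is worth trying:

    pcs resource update SMBDiskResourceName options="vers=3.0,username=myuser,password=myuser,uid=998,gid=998,file_mode=0777,dir_mode=0777"
    pcs resource debug-start SMBDiskResourceName --full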
Shaunak Patel
(111 rep)
Mar 26, 2019, 05:24 PM
• Last activity: Mar 26, 2019, 06:34 PM
3
votes
1
answers
15321
views
quorum in a two-node cluster with pacemaker
I have a two-node active-passive cluster.
From the Clusters from Scratch guide:
> If a cluster splits into two (or more) groups of nodes that can no
> longer communicate with each other (aka. partitions), quorum is used
> to prevent resources from starting on more nodes than desired, which
> would risk data corruption. A cluster has quorum when more than half
> of all known nodes are online in the same partition
>
> By the above definition, a two-node cluster would only have quorum
> when both nodes are running. This would make the creation of a
> two-node cluster pointless, but corosync has the ability to treat
> two-node clusters as if only one node is required for quorum. The pcs
> cluster setup command will automatically configure two_node: 1 in
> corosync.conf, so a two-node cluster will "just work".
Here's my config:
So how can the cluster now decide which one has quorum?
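For what it's worth, a hedged illustration of what `two_node: 1` changes (standard votequorum behaviour): each node stays quorate with its single vote, and the implied `wait_for_all` keeps a node that boots alone from starting resources until it has seen its peer at least once, so resolving an actual split is effectively left to fencing:

    corosync-quorumtool -s | grep -E 'Flags|Quorum:'   # expect "Flags: 2Node Quorate WaitForAll"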

blabla_trace
(385 rep)
Feb 22, 2019, 09:09 PM
• Last activity: Feb 22, 2019, 10:53 PM
1
votes
1
answers
306
views
When I start corosync all servers panics with core dumps
I upgraded my servers. Then I started the corosync service one by one on my servers. I started it first on 3 servers and waited 5 minutes. Then I started the next 4 corosync instances on the other servers, and 7 servers crashed at the same time.
I have been using corosync for 5 years. I was using:
Kernel: 4.14.32-1-lts
Corosync 2.4.2-1
Pacemaker 1.1.18-1
and I never saw this before.
I guess something is really badly broken in the new corosync version!
Kernel: 4.14.70-1-lts
Corosync 2.4.4-3
Pacemaker 2.0.0-1
-
**This is my corosync.conf: https://paste.ubuntu.com/p/7KCq8pHKn3/**
**Can you tell me how can I find the reason of the problem?**
Sep 25 08:56:03 SRV-2 corosync: [TOTEM ] A new membership (10.10.112.10:56) was formed. Members joined: 7
Sep 25 08:56:03 SRV-2 corosync: [VOTEQ ] Waiting for all cluster members. Current votes: 7 expected_votes: 28
Sep 25 08:56:03 SRV-2 corosync: [VOTEQ ] Waiting for all cluster members. Current votes: 7 expected_votes: 28
Sep 25 08:56:03 SRV-2 corosync: [VOTEQ ] Waiting for all cluster members. Current votes: 7 expected_votes: 28
Sep 25 08:56:03 SRV-2 corosync: [VOTEQ ] Waiting for all cluster members. Current votes: 7 expected_votes: 28
Sep 25 08:56:03 SRV-2 corosync: [QUORUM] Members: 1 2 3 4 5 6 7
Sep 25 08:56:03 SRV-2 corosync: [MAIN ] Completed service synchronization, ready to provide service.
Sep 25 08:56:03 SRV-2 corosync: [VOTEQ ] Waiting for all cluster members. Current votes: 7 expected_votes: 28
Sep 25 08:56:03 SRV-2 systemd: Created slice system-systemd\x2dcoredump.slice.
Sep 25 08:56:03 SRV-2 systemd: Started Process Core Dump (PID 43798/UID 0).
Sep 25 08:56:03 SRV-2 systemd: corosync.service: Main process exited, code=dumped, status=11/SEGV
Sep 25 08:56:03 SRV-2 systemd: corosync.service: Failed with result 'core-dump'.
Sep 25 08:56:03 SRV-2 kernel: watchdog: watchdog0: watchdog did not stop!
Sep 25 08:56:03 SRV-2 systemd-coredump: Process 29089 (corosync) of user 0 dumped core.
Stack trace of thread 29089:
#0 0x0000000000000000 n/a (n/a)
Write failed: Broken pipe
coredumpctl info
PID: 23658 (corosync)
UID: 0 (root)
GID: 0 (root)
Signal: 11 (SEGV)
Timestamp: Mon 2018-09-24 09:50:58 +03 (1 day 3h ago)
Command Line: corosync
Executable: /usr/bin/corosync
Control Group: /system.slice/corosync.service
Unit: corosync.service
Slice: system.slice
Boot ID: 79d67a83f83c4804be6ded8e6bd5f54d
Machine ID: 9b1ca27d3f4746c6bcfcdb93b83f3d45
Hostname: SRV-1
Storage: /var/lib/systemd/coredump/core.corosync.0.79d67a83f83c4804be6ded8e6bd5f54d.23658.153777185>
Message: Process 23658 (corosync) of user 0 dumped core.
Stack trace of thread 23658:
#0 0x0000000000000000 n/a (n/a)
PID: 5164 (corosync)
UID: 0 (root)
GID: 0 (root)
Signal: 11 (SEGV)
Timestamp: Tue 2018-09-25 08:56:03 +03 (4h 9min ago)
Command Line: corosync
Executable: /usr/bin/corosync
Control Group: /system.slice/corosync.service
Unit: corosync.service
Slice: system.slice
Boot ID: 2f49ec6cdcc144f0a8eb712bbfbd7203
Machine ID: 9b1ca27d3f4746c6bcfcdb93b83f3d45
Hostname: SRV-1
Storage: /var/lib/systemd/coredump/core.corosync.0.2f49ec6cdcc144f0a8eb712bbfbd7203.5164.1537854963>
Message: Process 5164 (corosync) of user 0 dumped core.
Stack trace of thread 5164:
#0 0x0000000000000000 n/a (n/a)
I can't find any more logs, so I can't dig into the problem.
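A hedged sketch for squeezing more out of what systemd already captured (Arch-style packaging assumed from the version strings): loading the dump into gdb gives a full backtrace, which is what upstream will ask for, and it is also worth confirming that every node was running the same corosync build during the rolling restart:

    coredumpctl list corosync    # pick the newest dump
    coredumpctl gdb corosync     # open it in gdb, then: bt full / info threads
    pacman -Q corosync libqb     # same versions on every node?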
Ozbit
(439 rep)
Sep 25, 2018, 10:03 AM
• Last activity: Oct 11, 2018, 12:29 PM
1
votes
1
answers
848
views
How to install debug symbols for corosync package on CentOS?
I got a crash in `corosync` which I would like to view in gdb. However, currently the core dump shows me only this much info:
Debug logs for core.1385 (Generated on Jul 26 10:17 BST)
[Thread debugging using libthread_db enabled]
Core was generated by `corosync -f'.
Program terminated with signal 6, Aborted.
#0 0x00007f68b2783495 in raise () from /lib64/libc.so.6
#0 0x00007f68b2783495 in raise () from /lib64/libc.so.6
#1 0x00007f68b2784c75 in abort () from /lib64/libc.so.6
#2 0x00007f68b277c60e in __assert_fail_base () from /lib64/libc.so.6
#3 0x00007f68b277c6d0 in __assert_fail () from /lib64/libc.so.6
#4 0x00007f68b3530f2c in ?? () from /usr/lib64/libtotem_pg.so.4
#5 0x00007f68b3534eaf in ?? () from /usr/lib64/libtotem_pg.so.4
#6 0x00007f68b3535259 in ?? () from /usr/lib64/libtotem_pg.so.4
#7 0x00007f68b352f108 in rrp_deliver_fn () from /usr/lib64/libtotem_pg.so.4
#8 0x00007f68b352be2a in ?? () from /usr/lib64/libtotem_pg.so.4
#9 0x00007f68b3524482 in poll_run () from /usr/lib64/libtotem_pg.so.4
#10 0x00000000004079b6 in main ()
I guess I need to install the debug info packages for `corosync` and whatever provides `libtotem_pg.so.4`. How do I do this?
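A hedged sketch for CentOS (yum-utils assumed available): `debuginfo-install` pulls the matching `*-debuginfo` packages from the debuginfo repository, and `libtotem_pg.so.4` ships in `corosynclib`, so its symbols come with that package's debuginfo:

    yum install -y yum-utils
    debuginfo-install -y corosync corosynclib   # needs the *-debuginfo repo reachable
    gdb /usr/sbin/corosync core.1385            # re-open the dump; the libtotem_pg frames should now resolve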
Serge Rogatch
(167 rep)
Jul 26, 2018, 04:05 PM
• Last activity: Jul 26, 2018, 04:59 PM
2
votes
1
answers
2841
views
Unable to mount gfs2 file system on Debian Stretch, probable dlm mis-config?
I am experimenting with gfs2 on Debian Stretch, and having some difficulties. I am a reasonably experienced Linux admin, but new to shared-disk and parallel file systems.
My immediate project is to mount a gfs2-formatted iscsi-exported device on multiple clients as a shared file system. For the moment, I am not interested in HA or fencing, although this may be important later on.
The iscsi part is fine, I am able to log in to the target, format it as an xfs file system, and also mount it on multiple clients and verify that it shows up with the same blkid.
To do the gfs2 business, I am following the scheme on the Debian stretch "gfs2" man page, modified for my config, and embellished slightly by various searches and so forth.
Man page is here:
https://manpages.debian.org/stretch/gfs2-utils/gfs2.5.en.html
The actual error is, when I attempt to mount my gfs2 file system, the mount command returns with
mount: mount(2) failed: /mnt: No such file or directory
... where /mnt is the desired mount point, which certainly does
exist. (If you attempt to mount to a nonexistent mount point the
error is "mount: mount point /wrong does not exist").
Related, at each mount attempt, dmesg reports:
gfs2: can't find protocol lock_dlm
I briefly went down the path of assuming the problem was that Debian packages do not provide "/sbin/mount.gfs2", and looked for that, but I think that was an incorrect guess.
I have a five-machine cluster (of Raspberry Pis, in case it matters), named, somewhat idiosyncratically, pio, pi, pj, pk, and pl. They all have fixed static IP addresses, and there's no domain.
I have installed the Debian gfs2, corosync, and dlm-controld packages.
For the corosync step, my corosync config is (e.g. for pio, intended to be the master of the cluster):
totem {
    version: 2
    cluster_name: rpitest
    token: 3000
    token_retransmits_before_loss_const: 10
    clear_node_high_bit: yes
    crypto_cipher: none
    crypto_hash: none
    nodeid: 17
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.0.17
        mcastport: 5405
        ttl: 1
    }
}
nodelist {
    node {
        ring0_addr: 192.168.0.17
        nodeid: 17
    }
    node {
        ring0_addr: 192.168.0.11
        nodeid: 1
    }
    node {
        ring0_addr: 192.168.0.12
        nodeid: 2
    }
    node {
        ring0_addr: 192.168.0.13
        nodeid: 3
    }
    node {
        ring0_addr: 192.168.0.14
        nodeid: 4
    }
}
logging {
    fileline: off
    to_stderr: no
    to_logfile: no
    to_syslog: yes
    syslog_facility: daemon
    debug: off
    timestamp: on
    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}
quorum {
    provider: corosync_votequorum
    expected_votes: 5
}
This file is present on all the nodes, with appropriate node-specific changes to the nodeid and bindnetaddr fields in the totem section.
The corosync tool starts without error on all nodes, and all the
nodes also have sane-looking output from corosync-quorumtool, thus:
root@pio:~# corosync-quorumtool
Quorum information
------------------
Date: Sun Apr 22 11:04:13 2018
Quorum provider: corosync_votequorum
Nodes: 5
Node ID: 17
Ring ID: 1/124
Quorate: Yes
Votequorum information
----------------------
Expected votes: 5
Highest expected: 5
Total votes: 5
Quorum: 3
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
1 1 192.168.0.11
2 1 192.168.0.12
3 1 192.168.0.13
4 1 192.168.0.14
17 1 192.168.0.17 (local)
The dlm-controld package was installed, and /etc/dlm/dlm.conf created with
the following simple config. Again, I am skipping fencing for now.
The dlm.conf file is the same on all the nodes.
enable_fencing=0
lockspace rpitest nodir=1
master rpitest node=17
I am unclear on whether or not the DLM "lockspace" name is supposed to match the corosync cluster name or not. I see the same behavior either way.
The dlm-controld service starts without errors, and the output of "dlm_tool status" appears sane:
root@pio:~# dlm_tool status
cluster nodeid 17 quorate 1 ring seq 124 124
daemon now 1367 fence_pid 0
node 1 M add 31 rem 0 fail 0 fence 0 at 0 0
node 2 M add 31 rem 0 fail 0 fence 0 at 0 0
node 3 M add 31 rem 0 fail 0 fence 0 at 0 0
node 4 M add 31 rem 0 fail 0 fence 0 at 0 0
node 17 M add 7 rem 0 fail 0 fence 0 at 0 0
The gfs2 file system was created by:
mkfs -t gfs2 -p lock_dlm -j 5 -t rpitest:one /path/to/device
Subsequent to this, "blkid /path/to/device" reports:
/path/to/device: LABEL="rpitest:one" UUID= TYPE="gfs2"
It looks the same on all the iscsi clients.
At this point, I feel like I should be able to mount the gfs2 file system on any/all of the clients, but here is where I get the error above -- the mount command reports a "no such file or directory", and dmesg and syslog report "gfs2: can't find protocol lock_dlm".
There are several other gfs2 guides out there, but many of them seem to be RH/CentOS specific, and for other cluster-management schemes besides corosync, like cman or pacemaker. Those aren't necessarily deal-breakers, but it's high-value to me to have this work on nearly-stock Debian Stretch.
It also seems likely to me that this is probably a pretty simple dlm misconfiguration, but I can't seem to nail it down.
Additional clues: When I try to "join" a lockspace via `dlm_tool join`, I get a dmesg output:
dlm cluster name 'rpitest' is being used without an application provided cluster name
This happens independently of whether the lockspace I am joining is "rpitest" or not. This suggests that lockspace names and cluster names are indeed the same thing, and/but that the dlm is evidently not aware of the corosync config?
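A hedged check that is quick to rule in or out on a Raspberry Pi kernel: "can't find protocol lock_dlm" comes from the kernel side, and on a stock Debian/Raspbian kernel the GFS2/DLM locking support may simply not be built, in which case no amount of corosync or dlm_controld configuration will help:

    modprobe gfs2 && modprobe dlm && echo "modules loaded"
    find /lib/modules/$(uname -r) -name 'gfs2.ko*' -o -name 'dlm.ko*'   # are the modules shipped at all?
    grep -E 'CONFIG_GFS2_FS_LOCKING_DLM|CONFIG_DLM=' /boot/config-$(uname -r) 2>/dev/null   # if a config file is installed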
Andrew Reid
(53 rep)
Apr 22, 2018, 04:44 PM
• Last activity: Apr 24, 2018, 06:09 AM
Showing page 1 of 20 total questions