
Unable to mount gfs2 file system on Debian Stretch, probable dlm mis-config?

2 votes
1 answer
2842 views
I am experimenting with gfs2 on Debian Stretch, and having some difficulties. I am a reasonably experienced Linux admin, but new to shared-disk and parallel file systems. My immediate project is to mount a gfs2-formatted, iscsi-exported device on multiple clients as a shared file system. For the moment I am not interested in HA or fencing, although that may become important later on.

The iscsi part is fine: I am able to log in to the target, format it as an xfs file system, mount it on multiple clients, and verify that it shows up with the same blkid on each of them.

For the gfs2 part, I am following the scheme from the Debian Stretch gfs2(5) man page, modified for my config and embellished slightly by various searches and so forth. The man page is here: https://manpages.debian.org/stretch/gfs2-utils/gfs2.5.en.html

The actual error: when I attempt to mount my gfs2 file system, the mount command returns

    mount: mount(2) failed: /mnt: No such file or directory

where /mnt is the desired mount point, which certainly does exist. (If you attempt to mount on a nonexistent mount point, the error is instead "mount: mount point /wrong does not exist".) Relatedly, at each mount attempt dmesg reports:

    gfs2: can't find protocol lock_dlm

I briefly went down the path of assuming the problem was that the Debian packages do not provide /sbin/mount.gfs2, and looked for that, but I think that was an incorrect guess.

I have a five-machine cluster (of Raspberry Pis, in case it matters) named, somewhat idiosyncratically, pio, pi, pj, pk, and pl. They all have fixed static IP addresses, and there is no domain. I have installed the Debian gfs2-utils, corosync, and dlm-controld packages.

For the corosync step, my corosync config is as follows (this is the one for pio, which is intended to be the master of the cluster):

    totem {
        version: 2
        cluster_name: rpitest
        token: 3000
        token_retransmits_before_loss_const: 10
        clear_node_high_bit: yes
        crypto_cipher: none
        crypto_hash: none
        nodeid: 17
        interface {
            ringnumber: 0
            bindnetaddr: 192.168.0.17
            mcastport: 5405
            ttl: 1
        }
    }

    nodelist {
        node {
            ring0_addr: 192.168.0.17
            nodeid: 17
        }
        node {
            ring0_addr: 192.168.0.11
            nodeid: 1
        }
        node {
            ring0_addr: 192.168.0.12
            nodeid: 2
        }
        node {
            ring0_addr: 192.168.0.13
            nodeid: 3
        }
        node {
            ring0_addr: 192.168.0.14
            nodeid: 4
        }
    }

    logging {
        fileline: off
        to_stderr: no
        to_logfile: no
        to_syslog: yes
        syslog_facility: daemon
        debug: off
        timestamp: on
        logger_subsys {
            subsys: QUORUM
            debug: off
        }
    }

    quorum {
        provider: corosync_votequorum
        expected_votes: 5
    }

This file is present on all the nodes, with appropriate node-specific changes to the nodeid and bindnetaddr fields in the totem section.

corosync starts without error on all nodes, and all the nodes also have sane-looking output from corosync-quorumtool:

    root@pio:~# corosync-quorumtool
    Quorum information
    ------------------
    Date:             Sun Apr 22 11:04:13 2018
    Quorum provider:  corosync_votequorum
    Nodes:            5
    Node ID:          17
    Ring ID:          1/124
    Quorate:          Yes

    Votequorum information
    ----------------------
    Expected votes:   5
    Highest expected: 5
    Total votes:      5
    Quorum:           3
    Flags:            Quorate

    Membership information
    ----------------------
        Nodeid      Votes Name
             1          1 192.168.0.11
             2          1 192.168.0.12
             3          1 192.168.0.13
             4          1 192.168.0.14
            17          1 192.168.0.17 (local)
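
Since all of that looks healthy, I am assuming corosync itself is fine. For completeness, the cluster name and node list that corosync is actually advertising can also be read back out of its runtime cmap database; the key names below are what I would expect from a stock corosync 2.x install (Stretch ships 2.4), so treat this as a sketch rather than verified output:

    # Read back the cluster name and node list from corosync's runtime database.
    # Key names (totem.cluster_name, nodelist.node.*) assume stock corosync 2.x.
    corosync-cmapctl | grep -E 'totem\.cluster_name|nodelist\.node'
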
The dlm-controld package was installed, and /etc/dlm/dlm.conf was created with the following simple config. Again, I am skipping fencing for now. The dlm.conf file is the same on all the nodes:

    enable_fencing=0
    lockspace rpitest nodir=1
    master rpitest node=17

I am unclear on whether or not the DLM "lockspace" name is supposed to match the corosync cluster name; I see the same behavior either way.

The dlm-controld service starts without errors, and the output of "dlm_tool status" appears sane:

    root@pio:~# dlm_tool status
    cluster nodeid 17 quorate 1 ring seq 124 124
    daemon now 1367 fence_pid 0
    node 1 M add 31 rem 0 fail 0 fence 0 at 0 0
    node 2 M add 31 rem 0 fail 0 fence 0 at 0 0
    node 3 M add 31 rem 0 fail 0 fence 0 at 0 0
    node 4 M add 31 rem 0 fail 0 fence 0 at 0 0
    node 17 M add 7 rem 0 fail 0 fence 0 at 0 0

The gfs2 file system was created with:

    mkfs -t gfs2 -p lock_dlm -j 5 -t rpitest:one /path/to/device

Subsequent to this, "blkid /path/to/device" reports:

    /path/to/device: LABEL="rpitest:one" UUID= TYPE="gfs2"

It looks the same on all the iscsi clients.

At this point, I feel like I should be able to mount the gfs2 file system on any or all of the clients, but this is where I get the error above: the mount command reports "no such file or directory", and dmesg and syslog report "gfs2: can't find protocol lock_dlm".

There are several other gfs2 guides out there, but many of them seem to be RH/CentOS-specific, or written for cluster-management schemes other than corosync, like cman or pacemaker. Those aren't necessarily deal-breakers, but it is high-value to me to have this work on nearly-stock Debian Stretch. It also seems likely to me that this is a pretty simple dlm misconfiguration, but I can't seem to nail it down.

Additional clue: when I try to "join" a lockspace via "dlm_tool join ...", I get this dmesg output:

    dlm cluster name 'rpitest' is being used without an application provided cluster name

This happens independently of whether the lockspace I am joining is "rpitest" or not. It suggests that lockspace names and cluster names are indeed the same thing, but also that the dlm is evidently not aware of the corosync config?
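
Presumably the next thing to check is what cluster name dlm_controld itself picked up from corosync. Something along these lines should show it (a sketch of what I would look at next; dlm_tool dump prints dlm_controld's debug buffer, and dlm_tool ls lists the lockspaces it knows about):

    # What cluster name did dlm_controld read from corosync? (debug buffer)
    dlm_tool dump | grep -i cluster
    # Which lockspaces does dlm_controld currently know about?
    dlm_tool ls
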
Asked by Andrew Reid (53 rep)
Apr 22, 2018, 04:44 PM
Last activity: Apr 24, 2018, 06:09 AM