20120730

Cluster Building, Ubuntu 12.04 - REVISED

This is an updated post about building a Pacemaker server on Ubuntu 12.04 LTS.

I've learned a great deal since my last post, as many intervening posts will demonstrate.  Most of my machines are still on 11.10.  I have finally found some time to work on getting 12.04 to cooperate.

Our goals today will be a Pacemaker+CMAN cluster running DRBD and OCFS2.  This should cover most of the "difficult" stuff that I know anything about.

For those who have tried and failed to get a stable Pacemaker cluster running on 12.04: having the DLM managed by Pacemaker is not just inadvisable, it isn't allowed.  I filed a formal bug report and was informed that the DLM is already managed by CMAN; configuring Pacemaker to manage it as well caused various crashes every time I put a node into standby.

Installation


Start with a clean, new Ubuntu 12.04 Server and make sure everything is up-to-date.
A few packages are for the good of the nodes themselves:
apt-get install ntp

Pull down the necessary packages for the cluster:
apt-get install cman pacemaker fence-agents openais

and the necessary packages for DRBD:
apt-get install drbd8-utils

and the necessary packages for OCFS2:
apt-get install ocfs2-tools ocfs2-tools-cman ocfs2-tools-pacemaker


Configuration, Part 1

CMAN

Configure CMAN not to wait for quorum at startup - useful for a two-node cluster, or any time you don't want the service to block waiting for quorum:

echo "CMAN_QUORUM_TIMEOUT=0" >> /etc/default/cman

For the cluster.conf, there are some good things to know:
  • The cluster multicast address is, by default, generated as a hash of the cluster name - make the name unique if you run multiple clusters on the same subnet.  You can also set the address manually, though I have not yet tried that.
  • The interface element under the totem element appears to be "broken," or at least useless: the Ubuntu docs suggest that anything specified there will be overruled by whatever is under the clusternodes element.  Don't bother trying to set the bind address there for the time being.
  • If you specify host names for each cluster node, reverse resolution will be used to determine the bind address.  This will bind to the loopback adapter unless you either (a) use IP addresses instead of the node names, or (b) remove the 127.0.1.1 address line from /etc/hosts.  A symptom of this condition is that you bring both nodes up and each node thinks it's all alone.
  • The two_node="1" attribute reportedly causes CMAN to ignore a loss of quorum for two-node clusters.
  • For added security, generate a keyfile with corosync-keygen and configure CMAN to pass it to Corosync - make sure to distribute it to all member nodes.
  • Always run ccs_config_validate before trying to launch the cman service.
  • Refer to /usr/share/cluster/cluster.rng for more (extremely detailed) info about cluster.conf

I wanted to put my cluster.conf here, but the XML is raising hell with Blogger.  Anyone who really wants to see it may email me.
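
In lieu of the real thing, here is a bare-bones sketch of the shape a two-node cluster.conf takes - the cluster name here is made up, fencing is omitted entirely (don't do that for real), and anything you build from it should be checked with ccs_config_validate:

<?xml version="1.0"?>
<cluster name="pcmk-l9l10" config_version="1">
  <!-- two_node/expected_votes let a two-node cluster run without quorum -->
  <cman two_node="1" expected_votes="1"/>
  <clusternodes>
    <!-- host names only work if 127.0.1.1 doesn't resolve to them; otherwise use IPs -->
    <clusternode name="l9" nodeid="1"/>
    <clusternode name="l10" nodeid="2"/>
  </clusternodes>
  <fencedevices/>
</cluster>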

Corosync

The Corosync config file (/etc/corosync/corosync.conf) is ignored when Corosync is launched via CMAN; cluster.conf is where those options live now.

 

Configuration, Part 2

By this time, if you have started CMAN and Pacemaker (in that order), both nodes should be visible to one another and should show up in crm_mon.  Make sure there are no monitor failures - at this stage they usually mean you're missing some packages on the reported node(s).
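
For reference, that start-up order and a quick status check look like this on each node:

service cman start
service pacemaker start
crm_mon -1    # one-shot status; both nodes should be listed as Online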

DRBD

I tend to place as much as I can into the /etc/drbd.d/global_common.conf, so as to save a lot of extra typing when creating new resources on my cluster.  This may not be best practice, but it works for me.  For my experimental cluster, I have two nodes: l9 and l10.  Here's a slimmed-down global_common.conf, and a single resource called "share".

/etc/drbd.d/global_common.conf
global {
    usage-count no;
}

common {
    protocol C;

    handlers {
        pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
    }

    startup {
        wfc-timeout      15;
        degr-wfc-timeout 60;
    }

    disk {
        on-io-error detach;
        fencing     resource-only;
    }

    net {
        data-integrity-alg sha1;
        cram-hmac-alg      sha1;
        # This isn't the secret you're looking for...
        shared-secret      "234141231231234551";

        sndbuf-size 0;

        allow-two-primaries;

        ### Configure automatic split-brain recovery.
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }

    syncer {
        rate       35M;
        use-rle;
        verify-alg sha1;
        csums-alg  sha1;
    }
}
 
/etc/drbd.d/share.res
resource share  {
  device             /dev/drbd0;
  meta-disk          internal;

  on l9   {
    address   172.18.1.9:7788;
    disk      /dev/l9/share;
  }

  on l10  {
    address   172.18.1.10:7788;
    disk      /dev/l10/share;
  }
}
 
Those of you with a keen eye will note I've used an LVM volume as my backing storage device for DRBD.  Use whatever works for you.
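
If you want to mirror that layout, the backing device is just an ordinary logical volume - here's a sketch, assuming a volume group named l9 (l10 on the other node) with enough free space; adjust the size and names to taste:

lvcreate --name share --size 20G l9

With backing storage in place, on both nodes: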

drbdadm create-md share
drbdadm up share

And on only one node:
drbdadm -- -o primary share
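
You can keep an eye on the initial sync from either node:

cat /proc/drbd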

It's probably best to let the sync finish, but I'm in a rush, so...on both nodes:

drbdadm down share
service drbd stop
update-rc.d drbd disable

The last command is particularly important: DRBD cannot be allowed to crank up on its own - it will be Pacemaker's job to do that for us.  The same goes for O2CB and OCFS2:

update-rc.d o2cb disable
update-rc.d ocfs2 disable

OCFS2 also requires a couple of kernel parameters to be set.  Apply these to /etc/sysctl.conf:

echo "kernel.panic = 30" >> /etc/sysctl.conf
echo "kernel.panic_on_oops = 1" >> /etc/sysctl.conf
sysctl -p

With that done, we can go into crm and start configuring our resources.  What follows is a fairly run-of-the-mill configuration for a dual-primary resource - YMMV.  I have used both single-primary and dual-primary configurations; use whichever suits the need.  Here is a basic cluster configuration that will let me format my OCFS2 target:

node l10 \
        attributes standby="off"
node l9 \
        attributes standby="off"
primitive p_drbd_share ocf:linbit:drbd \
        params drbd_resource="share" \
        op monitor interval="15s" role="Master" timeout="20s" \
        op monitor interval="20s" role="Slave" timeout="20s" \
        op start interval="0" timeout="240s" \
        op stop interval="0" timeout="100s"
primitive p_o2cb ocf:pacemaker:o2cb \
        params stack="cman" \
        op start interval="0" timeout="90" \
        op stop interval="0" timeout="100" \
        op monitor interval="10" timeout="20"
ms ms_drbd_share p_drbd_share \
        meta master-max="2" notify="true" interleave="true" clone-max="2"
clone cl_o2cb p_o2cb \
        meta interleave="true" globally-unique="false"
property $id="cib-bootstrap-options" \
        dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
        cluster-infrastructure="cman" \
        stonith-enabled="false" \
        no-quorum-policy="ignore"
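
If you're wondering how to feed a configuration like this to the cluster, one way (assuming the crm shell that ships with 12.04; crm configure edit works too) is the interactive configure mode:

crm configure
# paste or type the configuration above at the crm(live)configure# prompt, then:
verify
commit
quit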

Of special note: we must specify the stack="cman" parameter for o2cb to function properly, otherwise you will see startup failures for that resource.  To round out this example, a usable store would help.  Format the volume and create a mount point:

mkfs.ocfs2 /dev/drbd/by-res/share
mkdir /srv/share

Our mount target will be /srv/share - make sure to create this directory on both/all applicable nodes.  Here is the configuration again, modified to add the OCFS2 file system resource (p_fs_share, its clone cl_fs_share, and the accompanying colocation and ordering constraints):
node l10 \
    attributes standby="off"
node l9 \
    attributes standby="off"
primitive p_drbd_share ocf:linbit:drbd \
    params drbd_resource="share" \
    op monitor interval="15s" role="Master" timeout="20s" \
    op monitor interval="20s" role="Slave" timeout="20s" \
    op start interval="0" timeout="240s" \
    op stop interval="0" timeout="100s"
primitive p_fs_share ocf:heartbeat:Filesystem \
    params device="/dev/drbd/by-res/share" directory="/srv/share" fstype="ocfs2" \
    op start interval="0" timeout="60" \
    op stop interval="0" timeout="60" \
    op monitor interval="20" timeout="40"
primitive p_o2cb ocf:pacemaker:o2cb \
    params stack="cman" \
    op start interval="0" timeout="90" \
    op stop interval="0" timeout="100" \
    op monitor interval="10" timeout="20"
ms ms_drbd_share p_drbd_share \
    meta master-max="2" notify="true" interleave="true" clone-max="2"
clone cl_fs_share p_fs_share \
    meta interleave="true" notify="true" globally-unique="false"
clone cl_o2cb p_o2cb \
    meta interleave="true" globally-unique="false"
colocation colo_share inf: cl_fs_share ms_drbd_share:Master cl_o2cb
order o_o2cb inf: cl_o2cb cl_fs_share
order o_share inf: ms_drbd_share:promote cl_fs_share
property $id="cib-bootstrap-options" \
    dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
    cluster-infrastructure="cman" \
    stonith-enabled="false" \
    no-quorum-policy="ignore"

A couple of notes here, as well: not ordering the handling of O2CB correctly can wreak havoc when putting nodes into standby.  In this case I've ordered it relative to the file system mount, but a different approach may be more appropriate if you have multiple OCFS2 file systems to deal with.  Toying with the ordering of the colocation constraints may also have an effect.  Read up on all applicable Pacemaker documentation.

To test my cluster, I put each node in standby and brought it back a few times, then put the whole cluster in standby and rebooted all the nodes (all two of them).  Bringing them all back online should happen without incident.  In my case, I had to make one change:

order o_share inf: ms_drbd_share:promote cl_fs_share:start
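
For the record, the standby exercise itself is just a matter of commands like these, run against each node in turn:

crm node standby l9
# wait for resources to stop cleanly (watch crm_mon), then:
crm node online l9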


Finally, the one missing piece of this configuration is a proper set of STONITH devices and primitives.  These are a MUST for OCFS2, even if you're running it across virtual machines: a single downed node will hang the entire cluster until it is fenced.  Adding fencing is left as an exercise for the reader, though I will be sharing my own experiences very soon.
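
To give a feel for what that looks like, here is a sketch of a STONITH primitive for a libvirt/KVM setup - the fence agent, hypervisor host, credentials, and location rule are all illustrative assumptions, not something I'm running:

primitive p_fence_l9 stonith:fence_virsh \
        params ipaddr="kvm-host" login="root" identity_file="/root/.ssh/id_rsa" \
        port="l9" secure="1" pcmk_host_list="l9" \
        op monitor interval="60s"
location loc_fence_l9 p_fence_l9 -inf: l9

One primitive per node to be fenced, a location rule to keep each fencing resource off the node it fences, and stonith-enabled flipped back to "true" in the cluster properties.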
