Gluster cluster split in two?


It can happen that a Gluster cluster gets divided into two parts. I'm not talking about a volume split-brain here, but about the whole cluster. Something might have gone wrong when probing a node. Or, as in our case, the peer info file was corrupted when we added aliases for nodes (there seems to be a maximum name length for nodes), which caused some nodes to believe they were in another cluster.

The solution is to first decide which nodes you consider to be the proper cluster. Running gluster peer status shows which other nodes are considered to be in the same group as the node you run the command on. Nodes in the "Peer Rejected" state might think they are part of another cluster. If most of the nodes show up as "Peer Rejected", you should probably run the command on one of those rejected nodes instead; there you will likely see most of the nodes in an OK state.
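To compare the view from several nodes, it can help to pull the rejected peers out of the status output programmatically. The following is a minimal Python sketch, assuming the output consists of "Hostname:", "Uuid:" and "State:" lines per peer. The function names, host names and UUIDs in the sample are made up for illustration.

```python
def parse_peer_status(text):
    """Parse the text of `gluster peer status` into a list of dicts.

    Assumes each peer is described by "Hostname:", "Uuid:" and "State:"
    lines; any other line is ignored.
    """
    peers = []
    current = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or a line without "key: value" shape
        key, value = key.strip(), value.strip()
        if key == "Hostname":
            current = {"hostname": value}
            peers.append(current)
        elif current is not None and key in ("Uuid", "State"):
            current[key.lower()] = value
    return peers


def rejected_peers(peers):
    """Return the host names whose state mentions 'Rejected'."""
    return [p["hostname"] for p in peers if "Rejected" in p.get("state", "")]


# Illustrative sample only: hypothetical hosts and placeholder UUIDs.
SAMPLE = """\
Number of Peers: 3

Hostname: node2
Uuid: 00000000-0000-0000-0000-000000000002
State: Peer in Cluster (Connected)

Hostname: node3
Uuid: 00000000-0000-0000-0000-000000000003
State: Peer Rejected (Connected)

Hostname: node4
Uuid: 00000000-0000-0000-0000-000000000004
State: Peer Rejected (Disconnected)
"""

for host in rejected_peers(parse_peer_status(SAMPLE)):
    print("rejected:", host)
```

On a real node the text could come from subprocess.run(["gluster", "peer", "status"], capture_output=True, text=True).stdout instead of SAMPLE.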

On all the nodes in the rejected state, run the following procedure:

  1. Stop glusterd
  2. Remove all files from /var/lib/glusterd except the glusterd.info file
  3. Start glusterd again
  4. Run gluster peer probe against a node that is a member of the proper cluster
  5. Restart glusterd again
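The steps above can be sketched in Python as follows. This is only an outline under a few assumptions: glusterd is managed by systemd (so systemctl is used to stop and start it), the script runs as root, and the helper names clean_state_dir and rejoin_cluster are invented here. As a safety measure that goes beyond the procedure itself, the cleanup refuses to run if glusterd.info is missing.

```python
import shutil
import subprocess
from pathlib import Path


def clean_state_dir(state_dir, keep="glusterd.info"):
    """Delete everything directly under state_dir except the `keep` file.

    Returns the sorted names of the removed entries. Raises
    FileNotFoundError if the file to keep is not there, so the node's
    identity is never thrown away by accident.
    """
    state_dir = Path(state_dir)
    if not (state_dir / keep).is_file():
        raise FileNotFoundError(f"{keep} not found in {state_dir}")
    removed = []
    for entry in state_dir.iterdir():
        if entry.name == keep:
            continue
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed.append(entry.name)
    return sorted(removed)


def rejoin_cluster(member, state_dir="/var/lib/glusterd", service="glusterd"):
    """Run steps 1 to 5 on a rejected node. `member` is a healthy node.

    Assumption: the service is controlled with systemctl.
    """
    def run(*cmd):
        subprocess.run(cmd, check=True)  # abort on the first failing step

    run("systemctl", "stop", service)        # 1. stop glusterd
    clean_state_dir(state_dir)               # 2. keep only glusterd.info
    run("systemctl", "start", service)       # 3. start glusterd
    run("gluster", "peer", "probe", member)  # 4. probe a member node
    run("systemctl", "restart", service)     # 5. restart glusterd
```

Nothing runs on import; on a rejected node you would call something like rejoin_cluster("node2"), where node2 stands for any node in the proper cluster.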

Other lessons learned:

Do make sure that you keep the glusterd.info file. If you don't, a new one will be created and you will effectively be creating a new node with the same name. To solve this, stop the glusterd daemon on all nodes, remove the faulty UUID from /var/lib/glusterd/peers and restart glusterd on all nodes.
I did not find this error immediately, and I was struggling with a lot of locking errors in the glusterd.log file; any "gluster volume status" command would just hang forever.
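One way to spot such a faulty UUID is to look for a host name that appears under more than one UUID in the peers directory. The sketch below assumes that each file in that directory is a small key=value file with a "uuid" key and one or more "hostname..." keys. That layout is an assumption, so check it against the files on your own nodes. The function names are made up, and the code only reports; it deletes nothing.

```python
from collections import defaultdict
from pathlib import Path


def read_peer_file(path):
    """Read a key=value peer file into a dict (assumed file layout)."""
    fields = {}
    for line in Path(path).read_text().splitlines():
        key, sep, value = line.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def duplicate_hostnames(peers_dir):
    """Map each host name known under several UUIDs to those UUIDs."""
    by_host = defaultdict(set)
    for path in Path(peers_dir).iterdir():
        if not path.is_file():
            continue
        fields = read_peer_file(path)
        # Fall back to the file name if the file has no uuid key.
        uuid = fields.get("uuid", path.name)
        for key, value in fields.items():
            if key.startswith("hostname"):
                by_host[value].add(uuid)
    return {host: sorted(uuids)
            for host, uuids in by_host.items() if len(uuids) > 1}
```

Calling duplicate_hostnames("/var/lib/glusterd/peers") returns an empty dict when every host name maps to a single UUID. Which of two reported UUIDs is the stale one still has to be decided by hand, for example by comparing against the glusterd.info file on the node in question.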