In an active/passive environment with two nodes, under normal operating conditions, all incoming requests are served by the master, while the slave keeps its content synchronized with the master but does not itself serve any requests. Ideally, if the master node fails, the slave, having content identical to the master's, would automatically take over and start serving requests, with no noticeable downtime.
Of course, there are many cases where this ideal behavior is not possible and manual intervention by an administrator is required. The reason is that, in general, the slave cannot know with certainty what type of failure has occurred, and the type of failure (there are many) dictates the appropriate response. Hence, intelligent intervention is usually required.
For example, imagine a two-node active/passive cluster with master A and slave B. A and B keep track of each other's state through a "heartbeat" signal: they periodically ping each other and wait for a response.
Now, if B finds that its pings are going unanswered for some relatively long period, there are, generally speaking, two possible reasons:
- A is not responding because it is inoperable.
- A is still operating normally and the lack of response is due to some other reason, most likely a failure of the network connection between the two nodes.
If (1) is the case, then the logical thing is for B to become the master and for requests to be redirected to and served by B. In this situation, if B does not take on A's former role, the service will be down; this is an emergency situation.
But if (2) is the case, then the logical thing is for B to simply wait and continue pinging until the network connection is restored, at which point it resynchronizes with the master, applying all changes that occurred on the master during the downtime. In this situation, if B instead assumes that A is down and takes over, there would be two functioning master nodes; in other words, a "split-brain" situation in which the two nodes may become desynchronized.
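The ambiguity is easy to see from the slave's perspective. The following minimal Python sketch (with a hypothetical host name, port, and thresholds; this is an illustration, not CRX's actual heartbeat implementation) shows that all the slave ever observes is a run of missed pings:

```python
import socket
import time

# Hypothetical values; CRX's real heartbeat mechanism and tunables differ.
MASTER_ADDR = ("master-a.example.com", 8088)
HEARTBEAT_INTERVAL = 5   # seconds between pings
MAX_MISSED = 5           # consecutive misses tolerated before raising the alarm

def master_responds(timeout=2.0):
    """Attempt a TCP connection to the master; True if it answers."""
    try:
        with socket.create_connection(MASTER_ADDR, timeout=timeout):
            return True
    except OSError:
        return False

missed = 0
while missed < MAX_MISSED:
    if master_responds():
        missed = 0
    else:
        # All B observes is a missed ping; at this point it cannot tell
        # scenario (1) from scenario (2).
        missed += 1
    time.sleep(HEARTBEAT_INTERVAL)

print("Master unresponsive; by default the slave stays a slave.")
```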
Because these two scenarios call for conflicting responses and the slave cannot distinguish between them, the default setting in CRX is that, upon repeated failures to reconnect to a non-responsive master, the slave does not become the master. However, see below for a case where you may wish to alter this default behavior.
Assuming this default behavior, at this point manual intervention is needed.
Your first priority is to ensure that at least one node is up and serving requests. Once you have accomplished this, you can worry about starting the other nodes, synchronizing them, and so forth. Here is the procedure to follow:
- First, determine which of the scenarios above applies; a scripted check is sketched below, after this procedure.
- If the master node A is still responding and serving requests, then you have scenario (2) and the problem lies in the connection between the master and the slave.
- In this case no emergency measures need to be taken, since the service is still functional; you simply need to reestablish the cluster.
- First, stop the slave B.
- Ensure that the network problem has been solved.
- Restart B. It should automatically rejoin the cluster. If it is out of sync, you can troubleshoot it according to the tips given in Recovering an Out-Of-Sync Cluster Node.
- On the other hand, if A is down, then you have scenario (1). In this case your first priority should be to get a functioning instance up and running and serving requests as soon as possible.
- You have two choices: restart A and keep it as master, or redirect requests to B and make it the new master. Which one you choose should depend on which can be achieved most quickly and with the highest likelihood of success.
- If the problem with A is easy to fix, restart it and ensure that it is functioning properly. Then restart B and ensure that it properly rejoins the cluster.
- If the problem with the master looks more involved, it may be easier to redirect incoming requests to the slave and restart the slave, making it the new master.
- To do this you must first stop B and remove the file crx-quickstart/repository/clustered.txt. This is a flag file that CRX creates to keep track of whether a restarted instance should regard itself as master or slave. The presence of the file indicates to the instance that, before the restart, it was a slave, so it will attempt to automatically rejoin its cluster. The absence of the file indicates that the instance was, before the restart, a master (or a lone, un-clustered instance), in which case it does not attempt to rejoin any cluster. (A small helper script illustrating this is sketched after this procedure.)
- Now restart B.
- Once you have confirmed that B is up and running and serving requests, you can work on fixing A. When A is in working order, you can join it to B, except this time A will be the slave and B the master. Alternatively, you may switch back to the original arrangement. Just don't forget about the clustered.txt file!
- A may now report that it is out of sync with cluster node B. See Recovering an Out-Of-Sync Cluster Node for information on fixing this.
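As referenced in the first step above, here is a minimal sketch, in Python, of how you might check from a third machine whether the master is still serving requests and hence which scenario applies. The URL and port are hypothetical placeholders; substitute the master's real address and treat this as an illustration rather than a definitive diagnostic:

```python
from urllib.request import urlopen

# Hypothetical master address; substitute the real one for your cluster.
MASTER_URL = "http://master-a.example.com:4502/"

def master_is_serving(url=MASTER_URL, timeout=5):
    """True if the master answers an HTTP request within the timeout."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # Connection refused, DNS failure, timeout, or an HTTP error.
        return False

if master_is_serving():
    print("Scenario (2): master is up; fix the network link, then restart B.")
else:
    print("Scenario (1): master appears down; consider failing over to B.")
```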
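And here is the small helper mentioned in the clustered.txt step, assuming the default crx-quickstart installation layout (the path may differ in your installation). It simply removes the flag file so that a stopped slave restarts as master; run it only while the instance is stopped:

```python
from pathlib import Path

# Default CRX layout; adjust if your installation lives elsewhere.
FLAG_FILE = Path("crx-quickstart/repository/clustered.txt")

def promote_to_master():
    """Remove the flag file so the stopped instance restarts as master."""
    if FLAG_FILE.exists():
        FLAG_FILE.unlink()
        print(f"Removed {FLAG_FILE}; the instance will restart as master.")
    else:
        print(f"{FLAG_FILE} not found; the instance would restart as master anyway.")

if __name__ == "__main__":
    promote_to_master()
```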