Issues:
The SQL failover cluster is failing over automatically.
Windows Failover Cluster Log:
- Cluster node xxxxx was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
- The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk.
Windows Event Viewer:
- The Cluster Service service terminated unexpectedly. It has done this 2 time(s). The following corrective action will be taken in 15000 milliseconds: Restart the service.
- The Cluster Service service terminated with the following service-specific error: A quorum of cluster nodes was not present to form a cluster.
- Unable to move the replacement file to the file to be replaced. The file to be replaced has been renamed using the backup name.
- The description for Event ID 1135 from source Microsoft-Windows-FailoverClustering cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer. If the event originated on another computer, the display information had to be saved with the event.
Cluster Validation Report:
- Failed while verifying removal of any Persistent Reservation on physical disk at node xxxxxx.
- The cluster is not configured with a quorum witness. As a best practice, configure a quorum witness to help achieve the highest availability of the cluster. This quorum configuration can be changed using the Configure Cluster Quorum wizard. This wizard can be started from the Failover Cluster Manager console by selecting the cluster name in the left hand pane, then in the right "actions" pane selecting "More Actions..." and then selecting "Configure Cluster Quorum Settings...".
- This resource does not have all the nodes of the cluster listed as Possible Owners. The clustered role that this resource is a member of will not be able to start on any node that is not listed as a Possible Owner.
- This resource is marked with a state of 'Failed' instead of 'Online'. This failed state indicates that the resource had a problem either coming online or had a failure while it was online. The event logs and cluster logs may have information that is helpful in identifying the cause of the failure.
- The following servers have updates applied which are pending a reboot to take effect. It is recommended to reboot the servers to complete the patching process.
Cause:
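The cluster logs indicate that DB01 is being removed from the active cluster membership after losing communication with the other nodes, which leads to the automatic failovers. The logs attribute this to the Cluster service on the node shutting down because quorum was lost. The validation report also shows that no quorum witness is configured, which makes the cluster more likely to lose quorum when node communication is interrupted.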
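The quorum arithmetic behind this can be sketched in a few lines. This is a simplified static-majority model, not the cluster's actual algorithm: it ignores dynamic quorum adjustments, the function names are hypothetical, and it assumes a two-node cluster (DB01 and DB02), which the logs above do not confirm.

```python
def votes_needed(total_votes: int) -> int:
    """Smallest strict majority of the configured votes."""
    return total_votes // 2 + 1


def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    """True when the reachable voters form a strict majority."""
    return reachable_votes >= votes_needed(total_votes)


# Assumption: two voting nodes (DB01, DB02) and no witness.
# A node that can only see itself holds 1 of 2 votes.
print(has_quorum(reachable_votes=1, total_votes=2))

# Same assumed cluster with a witness adding a third vote.
# A node that can still reach the witness holds 2 of 3 votes.
print(has_quorum(reachable_votes=2, total_votes=3))
```

Under these assumptions, an isolated node without a witness cannot reach a majority, while a node that can still reach a witness can. This is the reasoning behind the witness recommendation in the validation report.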
Recommendations:
Based on our analysis of the cluster validation report:
- As a best practice, configure a quorum witness and make sure it remains accessible to all nodes at all times.
- List all nodes (DB01 and DB02) as Possible Owners of the affected resource so that the clustered role can come online on either node after a failover.
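- The validation report shows a failure while verifying removal of a Persistent Reservation on a physical disk at DB01; a reboot may be required to resolve this, or the reservation can be cleared from the command line (for example, with the Clear-ClusterDiskReservation PowerShell cmdlet).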
- The report also shows that the nodes have updates applied that are pending a reboot; reboot the nodes to complete the patching process.