Fixing WSFC Errors for Microsoft SQL Server AlwaysOn (SQL Clustering)

As a SQL Server DBA, there will be times when you cross over into the system administrator (SA) realm of responsibilities, or at a minimum tell the SA how to fix errors so your SQL environment runs better. For an AlwaysOn (AO) Availability Group (AG), a WSFC (Windows Server Failover Cluster) is set up, but without shared disk resources. If the WSFC is having issues, your Availability Group will not function properly, or you will face a lot of heartache trying to figure out why you have so many issues.

Most of the time, WSFC errors do not show up until AO is set up; however, you should make sure no errors exist in the WSFC logs before setting up AO. You can check the Event Viewer or look within Failover Cluster Manager for errors. Fix any errors, or have the SA fix them, before setting up AO.

Only add nodes in Failover Cluster Manager that will take part in the AlwaysOn Availability Group failover. Servers that are not part of the AG will cause issues for the cluster if they have problems. If other servers are already part of the WSFC, make sure they do not have a separate AG that is part of the WSFC. If they do, the AG will have to be deleted (verify the AG name is no longer listed under Roles in Failover Cluster Manager for the cluster) and the nodes evicted from the WSFC. After that is done, a new WSFC will have to be created and the AG recreated. If those servers do not have an AG, they should simply be evicted from the WSFC. Do this during a maintenance window in case something goes wrong.

WSFC – Here are some common errors and how to fix them.

Error: The file share witness resource failed to arbitrate for the file share “\\servername\share”. Please ensure that file share “\\servername\share” exists and is accessible by the cluster.

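Fix: An admin needs to grant Everyone Full Control on the share \\servername\share. The cluster uses this share as its witness within WSFC and needs access to it. Nothing else is stored in this share. Granting Full Control only to the cluster computer account (the cluster name followed by $) is a tighter alternative if your security policy does not allow Everyone.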

Error: The cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster or a failover of the witness disk…

Fix: This can be fixed by changing the cluster threshold and delay settings. More details on how to change this can be found here – https://virtual-dba.com/alwayson-changing-cluster-configuration/
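
To see why the threshold and delay settings matter, it helps to do the arithmetic: a node is treated as down after a number of consecutive missed heartbeats, so the tolerance window is roughly the heartbeat delay multiplied by the threshold. The Python sketch below only does that calculation. It assumes the settings in question are the cluster properties SameSubnetDelay and SameSubnetThreshold (or their CrossSubnet counterparts), and the numbers are illustrative, not recommendations. Read the real values from your own cluster before changing anything.

```python
def heartbeat_tolerance_seconds(delay_ms: int, threshold: int) -> float:
    """Approximate time a node can miss heartbeats before it is considered down.

    delay_ms  -- interval between heartbeats in milliseconds
                 (assumed to correspond to SameSubnetDelay / CrossSubnetDelay)
    threshold -- number of missed heartbeats tolerated
                 (assumed to correspond to SameSubnetThreshold / CrossSubnetThreshold)
    """
    if delay_ms <= 0 or threshold <= 0:
        raise ValueError("delay_ms and threshold must be positive")
    return delay_ms * threshold / 1000.0


# Illustrative values only; substitute what your cluster actually reports.
before = heartbeat_tolerance_seconds(delay_ms=1000, threshold=5)
after = heartbeat_tolerance_seconds(delay_ms=2000, threshold=10)
print(f"before: {before:.0f}s, after: {after:.0f}s")
```

A longer window makes the cluster more forgiving of short network blips, at the cost of slower detection when a node really is down.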

Error(s): Cluster is offline.

Clustered role ‘Cluster Group’ has exceeded its failover threshold. It has exhausted the configured number of failover attempts within the failover period of time allotted to it and will be left in a failed state. No additional attempts will be made to bring the role online or fail it over to another node in the cluster…

The Cluster service failed to bring clustered role ‘Cluster Group’ completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role.

Cluster resource ‘Cluster IP Address XXX.XXX.XXX.XXX’ of type ‘IP Address’ in clustered role ‘Cluster Group’ failed…

Encountered a failure when attempting to create new NetBIOS interface while bringing resource ‘Cluster IP Address XXX.XXX.XXX.XXX’ online (error code ‘1450’). The maximum number of NetBIOS names may have been exceeded.

Fix: In this case, validation showed the WSFC itself had no errors; the problem was a duplicate IP address. The SA needs to resolve the conflict. Verify that DNS has the correct IP address for the cluster node. If the IP address was changed, make sure DNS is updated to match. If the IP does not respond to ping, flush the ARP cache to remove stale information, or remove just the one bad entry.

How to flush the whole ARP cache or just remove one bad entry: http://www.techrepublic.com/blog/windows-and-office/quick-tips-flush-the-arp-cache-in-windows-7/
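
The two checks above (does DNS return the address you expect, and which command clears the ARP data) can be sketched in Python as follows. This is a minimal sketch, not a full diagnostic: the function names are made up for illustration, the IP address is a placeholder, and the commands it builds (netsh interface ip delete arpcache and arp -d) are the standard Windows ones, which need an elevated prompt to run.

```python
import ipaddress
import socket


def dns_matches(hostname, expected_ip, resolver=socket.gethostbyname_ex):
    """Return True if DNS resolves hostname to expected_ip (IPv4 only)."""
    expected = str(ipaddress.ip_address(expected_ip))  # also validates the input
    _name, _aliases, addresses = resolver(hostname)
    return expected in addresses


def arp_cleanup_command(bad_ip=None):
    """Build the Windows command that clears ARP data.

    With bad_ip, only that one entry is removed; otherwise the whole
    cache is flushed. The command is returned, not executed.
    """
    if bad_ip is None:
        return ["netsh", "interface", "ip", "delete", "arpcache"]
    return ["arp", "-d", str(ipaddress.ip_address(bad_ip))]


# Placeholder address; substitute the one from your environment.
print(" ".join(arp_cleanup_command("10.0.0.15")))
print(" ".join(arp_cleanup_command()))
```

Passing the resulting list to subprocess.run on the affected node would execute it. Building the command separately keeps the sketch safe to run anywhere and makes it easy to review before anything is deleted.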

Error: The computer object associated with cluster network name resource ‘’ could not be updated. The cluster identity ‘Name$’ may lack permissions required to update the object. Please work with your domain administrator to ensure that the cluster identity can update computer objects in the domain.

Fix: https://support.microsoft.com/en-us/help/2770582/event-id-1222-when-you-create-a-windows-server-2012-failover-cluster

Error: Cluster network name resource ‘SQL Network Name (SQLClusterName)’ failed registration of one or more associated DNS name(s) for the following reason: DNS operation refused. Ensure that the network adapters associated with dependent IP address resources are configured with at least one accessible DNS server.

Fix:

Open DNS Manager and find the Host (A) record for the SQLClusterName resource.
Go to Properties for that record.
In the Security tab, make sure the WindowsClusterName is included; if not, add it.
Make sure the WindowsClusterName (it will have a $ after the name) has Write, Read and Special permissions checked under Allow.
Click Advanced, locate WindowsClusterName, and click Edit.
Make sure that Write all properties, Read permissions and All validated writes are selected.
Click OK three times to exit.

Error: Cluster network name resource ‘name’ cannot be brought online. The computer object associated with the resource could not be updated in ‘domainname’ for the following reason: Unable to update password for computer account… The cluster identity ‘windowsclustername$’ may lack permissions required to update the object. Please work with your domain administrator to ensure that the cluster identity can update computer objects in the domain.

Fix:

Within Active Directory (AD), look for the computer object with the listener name.
Go to Properties of that computer (listener name), then click the Security tab.
Note: If you do not see the Security tab, close the Properties window for the listener, click View, then check Advanced Features. This will allow you to see the Security tab of the listener within Computers.
On the Security tab, give the WindowsClusterName (it will have a $ after the name) Full Control permissions.

Error: No matching network interface found for resource ‘AGName_XXX.XXX.XXX.XXX’ IP address ‘XXX.XXX.XXX.XXX’ (return code was ‘5035’). If your cluster nodes span different subnets, this may be normal.

The Cluster service failed to bring clustered role ‘AGName’ completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role.

Cluster resource ‘AGName_XXX.XXX.XXX.XXX’ of type ‘IP Address’ in clustered role ‘AGName’ failed. Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

Fix: You will see these errors when you try to configure your listener and its IP address does not line up with a network the cluster nodes are attached to. This can happen whether your servers sit on different subnets or all on the same subnet.

If the servers that you want to be part of the AG are on the same subnet, make sure the subnet mask on the primary NIC is the same on every server (check the settings on all NICs). Once you fix that, you will be able to create a listener.
If you have multiple servers on different subnets, make sure the listener has an IP address for every subnet your servers are attached to.
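
Both checks come down to working out which subnet each node's address and mask actually produce. The sketch below does that with Python's standard ipaddress module. The node names, addresses and function names are hypothetical; substitute the values you see in each server's NIC settings.

```python
import ipaddress


def subnets_in_use(node_interfaces):
    """Map each subnet to the nodes attached to it.

    node_interfaces -- dict of node name -> address with mask, for example
                       "10.0.1.11/255.255.255.0" or "10.0.1.11/24"
    """
    subnets = {}
    for node, cidr in node_interfaces.items():
        network = ipaddress.ip_interface(cidr).network
        subnets.setdefault(network, []).append(node)
    return subnets


def missing_listener_subnets(node_interfaces, listener_ips):
    """Return the subnets that have nodes but no listener IP address."""
    addresses = [ipaddress.ip_address(ip) for ip in listener_ips]
    return [
        net for net in subnets_in_use(node_interfaces)
        if not any(addr in net for addr in addresses)
    ]


# Same-subnet case: a mismatched mask makes one subnet look like two.
same_site = {"SQL1": "10.0.1.11/255.255.255.0", "SQL2": "10.0.1.12/255.255.0.0"}
for net, nodes in subnets_in_use(same_site).items():
    print(net, nodes)

# Multi-subnet case: the listener needs an address in each subnet.
two_sites = {"SQL1": "10.0.1.11/24", "SQL2": "10.0.2.11/24"}
print(missing_listener_subnets(two_sites, ["10.0.1.50"]))
```

If the same-subnet case reports more than one subnet, a mask is set differently on one of the servers. If the multi-subnet case returns anything, those are the subnets still waiting for a listener IP.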