The curious case of a Failover Cluster

                                A good Vishu morning, apart from the fact that I had to dog-walk my bike a couple of kilometres to get the puncture fixed, only to realize that I was tricked/lured into buying a new tube for the rear. I log in to Skype after an hour-long dreaded journey in the company bus, and get a casual 'Hey man' from a second-line engineer saying Drain Roles isn't working completely. Yeah, just one VM fails to get migrated to another host. Checking the resources, the destination host has enough RAM left.
Oops, forgot the intro, like every other time. The infra in question is a 4-node Windows Server 2012 R2 Datacenter Hyper-V cluster (HP ProLiant DL380 Gen9, 192 GB memory, Xeon E5-2650 @ 2.3 GHz). The VMs were heavily mismanaged, and I had to spend a good two days restructuring them with a logical preferred-owner setup. Each Hyper-V host runs a 32 GB Exchange server (DAG'd), at least one 24 GB RDS server (Connection Broker or Gateway), and some 8 GB file & application servers.

Cut back to the issue: The VM is fairly new, and was built/commissioned by me. Two big clues here: One, the VM might never have live-migrated to another host before, so most probably it's not an issue with the host, but something to do with configuration. Two, built by me, so definitely something I missed/messed up.

Failover cluster events from the FCM console gives me absolutely no hint; Just two events:
  1. Event ID 1155, a Warning 'A pending move for the role 'xx' did not complete'
  2. Event ID 1137, an Error 'Move of cluster role 'xx' for drain could not be completed.  The operation failed with error code 0x32.'

Google mama failed to answer these events.
Checking the Failover Clustering operational logs gives me some idea:
'Cluster resource 'Virtual Machine xx' in clustered role 'xx' rejected a move request to node 'Node-2'. The error code was '0x300035'.  Cluster resource 'Virtual Machine xx' may be busy or in a state where it cannot be moved.  The cluster service may automatically retry the move.'
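For hunting these down without clicking through Event Viewer, the operational channel can be queried from PowerShell; a minimal sketch (the channel name is the standard one, the filter and event count are illustrative):

```powershell
# Pull recent non-informational events from the Failover Clustering
# operational channel (the same log browsed above)
Get-WinEvent -LogName 'Microsoft-Windows-FailoverClustering/Operational' -MaxEvents 100 |
    Where-Object { $_.LevelDisplayName -ne 'Information' } |
    Format-Table TimeCreated, Id, Message -Wrap
```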


When something is busy, the general rule of thumb is to retry. I did that just to realize the general rules don't apply.
Time for some sanity:
  • Create empty role, live migrate it freely around nodes - ✅
  • Live migrate another role to the affected node (not really), and back - ✅
  • Check VM settings - ✅
Ah, there's a snapshot, which might have hindered the failover. I deleted the snapshot without thinking much, and while the merge was in progress, I started questioning myself: the VM is on a CSV, and so are the snaps. Why would that stop the VM from live migrating to another host?

Checking the VM settings pointed me back to two of my design/commissioning flaws:
The Checkpoint File location & Smart Paging file location were at 'C:\ProgramData\Microsoft\Windows\Hyper-V', the default location assigned by the Hyper-V host.

By now, the snapshot merge had completed. I immediately powered down the VM, and updated the checkpoint & paging file locations to point to the CSV.
Powered up the VM, and confidently tried live migrating it to host-2, only to see it fail again. Now that's a setback indeed!
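For the record, the same relocation can be scripted; a sketch with the VM name and CSV path purely as placeholders (the VM should be powered off before changing these paths):

```powershell
# Placeholder VM name 'xx' and CSV path; substitute your own
Stop-VM -Name 'xx'
Set-VM -Name 'xx' `
    -SnapshotFileLocation 'C:\ClusterStorage\Volume1\xx' `
    -SmartPagingFilePath  'C:\ClusterStorage\Volume1\xx'
Start-VM -Name 'xx'
```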
I could literally feel moving parts inside my brain churning out grey-matter.

At this point, the issue has now changed its course to a challenge!

Do or DO:

Removed the role from the cluster, added it back (yes, it's a production environment, but I need to walk the path to know the path).
Walking the path did nothing; the live migration still fails, and the pile-up of cluster events makes my knees weaker.
So, at this point, I realize it's something to do with the cluster settings of the VM, and not with the Failover Cluster or the actual VM.
Applying the rule of thumb again, but calmer, with attention to detail (Breathe in, Breathe out):

  • Remove the role from cluster - ✅
  • B-I-B-O - ✅
  • Add the role back to the cluster - ✅
  • Before casually hitting the 'OK' button, pause and read between the lines - ✅✅
Ah, there's a warning my brain ignored the last time. That's a sign.
I open the report, and find something fishy!


Apparently, the VM config files are still at the default location on the Hyper-V host, and not on a CSV.
It's a shame that adding such a VM to the cluster ended with 'just' a warning, and not a big red error message; that would have saved me hours.
Microsoft, that's lame 💁

Furthermore:

So, there seems to be no way to find the location of the VM configuration file from the GUI.

Enter the dragon Powershell:
“There's something PS can't do, and that something is literally nothing” - Author, a PS fanboy

Code Snippet: (Get-VM <VM_name>).ConfigurationLocation
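That one-liner generalizes nicely into a sweep of the whole host; a sketch, assuming the CSVs are mounted under the usual C:\ClusterStorage root:

```powershell
# Flag any VM whose config files live outside the assumed CSV root
Get-VM | Where-Object { $_.ConfigurationLocation -notlike 'C:\ClusterStorage\*' } |
    Select-Object Name, ConfigurationLocation, SnapshotFileLocation, SmartPagingFilePath
```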
Tada! So there's that. Now it's all about damage control, and not a mere remediation.

The Fix:

  1. Open the Failover Cluster Manager
  2. Under the VM, select Move > Virtual Machine Storage
  3. DRAG-AND-DROP the Current Configuration to the required location in CSV
It took me some time to realize it was 'Drag and drop'! Some good time.
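If drag-and-drop feels too fiddly to trust, the same storage move can be scripted with Move-VMStorage; a sketch with placeholder names (and since this is a storage migration, it can run while the VM is online):

```powershell
# Move the VM configuration files to the CSV; 'xx' and the path are placeholders
Move-VMStorage -VMName 'xx' -VirtualMachinePath 'C:\ClusterStorage\Volume1\xx'

# Verify the new location
(Get-VM 'xx').ConfigurationLocation
```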

Now I was able to migrate the VM to another host, smooth as butter.
A good day, with many more takeaways.

Moral:
  1. Never neglect those warnings. They may not be as tiny as they look.
  2. Read everything before hitting 'Next' (of course, unless it's the 'I have read the agreement' stuff)
  3. Blog your experiences, and help others.

Best
Arjun

Comments

  1. There you go! Congrats on your first write up. Cheers!

  2. Wonderful write up Arjun. Keep it going.

