
VMware vSphere HA Mode for vGPU

The following post was written and scenario tested by my colleagues and friends Keith Keogh and Andrew Breedy of the Limerick, Ireland CCC engineering team. I helped with editing and art. Enjoy.

 

VMware has added High Availability (HA) support for virtual machines (VMs) with an NVIDIA GRID vGPU shared pass-through device in vSphere 6.5. This feature allows any vGPU VM to automatically restart on another available ESXi host with an identical NVIDIA GRID vGPU profile if the VM's hosting server fails.
To better understand how this feature works in practice, we decided to create some failure scenarios. We tested how HA for vGPU virtual machines performed in the following situations:
1)  Network outage: Network cables get disconnected from the virtual machine host.
2)  Graceful shutdown: A host has a controlled shut down via the ESXi console.
3)  Power outage: A host has an uncontrolled shutdown, i.e. the server loses AC power.
4)  Maintenance mode: A host is put into maintenance mode.


The architecture of the scenario used for testing was laid out as follows:
  


 

Viewed another way, we have 3 hosts, 2 with GPUs, configured in an HA vSAN cluster with a pool of Instant Clones configured in Horizon.

 

1)  For this scenario, we used three Dell vSAN Ready Node R730 servers configured per our C7 spec, with the ESXi 6.5 hypervisor installed, clustered and managed through a VMware vCenter 6.5 appliance.
2)  VMware vSAN 6.5 was configured and enabled to provide shared storage.
3)  Two of the three servers in the cluster had an NVIDIA M60 GPU card and the appropriate drivers installed. The third compute host did not have a GPU card.
4)  16 vGPU virtual machines were placed on one host. The vGPU VMs were a pool of linked clones created using VMware Horizon 7.1. The Windows 10 master image had the appropriate NVIDIA drivers installed and an NVIDIA GRID vGPU shared PCI pass-through device added.
5)  All 4 GB of allocated memory was reserved on the VMs, and the M60-1Q profile was assigned, which allows a maximum of 8 vGPUs per physical GPU. Two vCPUs were assigned to each VM.
6)  High Availability was enabled on the vSphere cluster (a quick PowerCLI check for this is sketched below).
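As a quick sanity check, here is a minimal PowerCLI sketch to confirm that HA is enabled and that the vGPU VMs reserve all of their memory. The vCenter and cluster names below are hypothetical placeholders; substitute your own.

Connect-VIServer -Server vcenter.lab.local
# HAEnabled should read True for the cluster hosting the vGPU VMs
Get-Cluster -Name "vGPU-Cluster" | Select-Object Name, HAEnabled
# Each vGPU VM should show a memory reservation equal to its allocated memory
Get-Cluster -Name "vGPU-Cluster" | Get-VM |
    Get-VMResourceConfiguration | Select-Object VM, MemReservationMB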


Note that we intentionally didn't install an M60 GPU card in the third server of our cluster. This allowed us to test the mechanism that restarts the vGPU VMs only on a host with the correct GPU hardware installed and does not attempt to restart the VMs on a host without a GPU. In all cases this worked flawlessly; the VMs always restarted on the correct backup host with the M60 GPU installed.

 


Scenario One


To test the first scenario, a network outage, we simply unplugged the network cables from the vGPU VMs' host server. The host showed as not responding in the vSphere client and the VMs showed as disconnected within a minute of unplugging the cables. After approximately another 30 seconds, all the VMs came back online and restarted normally on the other host in the cluster that contained the M60 GPU card.

Scenario Two


A controlled shutdown of the host through ESXi worked in a similar manner. The host showed as not responding in the vSphere client and the VMs showed as disconnected as the ESXi host began stopping the vSAN services as part of its shutdown procedure. Once the host had shut down, the VMs restarted quite quickly on the backup host. In total, the time from starting the shutdown of the host until the VMs restarted on the backup host was approximately a minute and a half.

Scenario Three


To simulate a power outage, we simply pulled the power cord of the vGPU VMs' host. After approximately one minute the host showed as not responding and the VMs showed as disconnected in the vSphere client. Roughly 10 seconds later, the vGPU VMs had all restarted on the backup host.

Scenario Four


Placing the host in maintenance mode worked as expected, in that any powered-on VMs must first be powered off on that host before maintenance mode can complete. Once powered off, the VMs were then moved to the backup host but not powered on (the 'Move powered-off and suspended virtual machines to other hosts in the cluster' option was left ticked when entering maintenance mode). The vGPU-enabled VMs were moved to the correct host, the one with the M60 GPU card, each time we tested this.
In conclusion, the HA failover of vGPU virtual machines worked flawlessly in all the cases we tested. In terms of speed of recovery, we made the following observations. The unplanned power outage test recovered the quickest at approximately 1 minute 10 seconds, with the controlled shutdown and the network outage both taking approximately 1 minute 30 seconds.

Failure Scenario        Approximate Recovery Time
Network Outage          1 min, 30 sec
Controlled Shutdown     1 min, 30 sec
Power Outage            1 min, 10 sec


In a real-world scenario, HA failover for vGPU VMs would of course require an unused GPU in a backup server. In most vGPU environments you would probably want to fill the GPU to its maximum capacity of vGPU-based VMs to gain maximum value from your hardware. Providing a backup for a fully populated vGPU host would therefore mean having a fully unused server and GPU card lying idle until circumstances required them. It is also possible to split the vGPU VMs between two GPU hosts, placing, for example, 50% of the VMs on each host. In the event of a failure, the VMs will fail over to the other host; in this scenario, though, each GPU would be used to only 50% of its capacity.


An unused backup GPU host may be acceptable in larger or mission-critical deployments but may not be very cost effective for smaller ones. For smaller deployments, we would recommend leaving a resource buffer on each GPU server so that, if a host fails, there is room on the remaining GPU-enabled servers to accommodate the vGPU-enabled VMs.

The Native MSFT Stack: S2D & RDS – Part 3

Part 1: Intro and Architecture

Part 2: Prep & Configuration

Part 3: Performance & Troubleshooting (you are here)

Make sure to check out my series on RDS in Server 2016 for more detailed information on designing and deploying RDSH or RDVH.

 

Performance

This section will give an idea of disk activity and performance given this specific configuration as it relates to S2D and RDS.

Here is the 3-way mirror in action during the collection provisioning activity: 3 data copies, 1 per host, with all hosts active, as expected:

 

Real-time disk and RDMA stats during collection provisioning:

 

RDS provisioning isn’t the speediest solution in the world by default as it creates VMs one by one. But it’s fairly predictable at 1 VM per minute, so plan accordingly.

 

Optionally, you can adjust the concurrency level to control the number of VMs RDS creates in parallel by using the Set-RDVirtualDesktopConcurrency cmdlet. Server 2012 R2 supported a max of 20 concurrent operations, but make sure the infrastructure can handle whatever you set here.

Set-RDVirtualDesktopConcurrency -ConnectionBroker "RDCB name" -ConcurrencyFactor 20

 

To illustrate the capacity impact of 3-way mirroring, take my lab configuration, which provides my 3-node cluster with 14TB of total capacity, each node contributing 4.6TB. I have 3 fixed-provisioned volumes totaling 4TB, with just under 2TB of raw capacity remaining (4TB of volumes consumes roughly 12TB of raw capacity at three copies). The math here is that roughly 30% of my raw capacity is usable, as expected. This equation gets better at larger scales, starting with 4 nodes, so keep this in mind when building your deployment.
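To check these numbers on your own pool, a quick sketch (this assumes the default pool name S2D creates, which starts with "S2D"):

Get-StoragePool S2D* | Select-Object FriendlyName,
    @{N='SizeTB'; E={[math]::Round($_.Size / 1TB, 1)}},
    @{N='AllocatedTB'; E={[math]::Round($_.AllocatedSize / 1TB, 1)}}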

 

Here is the disk performance view from one of the Win10 VMs within the pool (VDI-23) running CrystalDiskMark, which runs Diskspd commands behind the scenes. Read IOPS are respectable at an observed max of 25K during the sequential test.


 

Write IOPS are also respectable at an observed max of nearly 18K during the sequential test.


 

Writing to the same volume, running the same benchmark but from the host itself, I saw some wildly differing results. Similar bandwidth numbers but poorer showing on most tests for some reason. One run yielded some staggering numbers on the random 4K tests which I was unable to reproduce reliably.


Running Diskspd tells a slightly different story with more real-world inputs specific to VDI: write-intensive. Here, given the punishing workload, we see impressive disk performance numbers and almost zero latency. The following results were generated from a run against a 500MB file, using 4K blocks weighted at 70% random writes, 4 threads, and cache enabled.

Diskspd.exe -b4k -d60 -L -o2 -t4 -r -w70 -c500M c:\ClusterStorage\Volume1\RDSH\io.dat


 

Just for comparison, here is another run with cache disabled. As expected, not so impressive.


 

Troubleshooting

Here are a few things to watch out for along the way.

When deploying your desktop collection, if you get an error about a node not having a virtual switch configured:

 

One possible cause is that the default SETSwitch was not configured for RDMA on that particular host. Confirm and then enable per the steps above using Enable-NetAdapterRDMA.
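To confirm, something along these lines works (the vNIC names assume the SMB_1/SMB_2 naming used in Part 2):

# Enabled should read True for both vNICs; if not, enable RDMA on them
Get-NetAdapterRdma -Name "vEthernet (SMB_1)", "vEthernet (SMB_2)"
Enable-NetAdapterRdma -Name "vEthernet (SMB_1)", "vEthernet (SMB_2)"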

 

On the topic of VM creation within the cluster, here is something kooky to watch out for. If you are working on one of your cluster nodes locally and try to add a new VM to be hosted on a different node, you may have trouble when you get to the OS install portion where the UI offers to attach an ISO to boot from. If you point to an ISO sitting on a network share, you may get the following error: Failing to add the DVD device due to lack of permissions to open attachment.

 

You will also be denied if you try to access any of the local paths such as desktop or downloads.

 

The reason for this is that the dialog uses remote file browsing on the server where the VM is being created. Counter-intuitive perhaps, but the workaround is to either log into the host where the VM is to be created directly or copy your ISO locally to that host.
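For the second option, staging the ISO on the target node's administrative share works; the paths below are hypothetical examples:

# Copy the install ISO to the node that will own the new VM, then attach it from a local path
Copy-Item -Path \\fileserver\isos\WS2016.iso -Destination \\Node2\c$\ISO\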

 

Watch this space, more to come…

 

Part 1: Intro and Architecture

Part 2: Prep & Configuration

Part 3: Performance & Troubleshooting (you are here)

 

Resources

S2D in Server 2016

S2D Overview

Working with volumes in S2D

Cache in S2D

The Native MSFT Stack: S2D & RDS – Part 2

Part 1: Intro and Architecture

Part 2: Prep & Configuration (you are here)

Part 3: Performance & Troubleshooting

Make sure to check out my series on RDS in Server 2016 for more detailed information on designing and deploying RDSH or RDVH.

 

Prep

The number 1 rule of clustering with Microsoft is homogeneity: all nodes within a cluster must have identical hardware and patch levels. This is very important. The first step is to check all disks that will participate in the storage pool, bring them all online, and initialize them. This can be done via PowerShell or the UI.
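A minimal sketch of the bring-online step (initialization can then be done from Disk Management, or with Initialize-Disk if you prefer):

# Bring any offline data disks online so they can be validated and pooled
Get-Disk | Where-Object IsOffline -eq $true | Set-Disk -IsOffline:$false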

Once initialized, confirm that all disks can pool:

Get-PhysicalDisk -CanPool $true | sort model

 

Install Hyper-V and Failover Clustering on all nodes:

Install-WindowsFeature -name Hyper-V, Failover-Clustering -IncludeManagementTools -ComputerName InsertName -Restart

 

Run cluster validation; note that this command does not exist until the Failover Clustering feature has been installed. Replace the placeholder values with your own inputs. Hardware configuration and patch level should be identical between nodes or you will be warned in this report.

Test-Cluster -Node node1, node2, etc -include "Storage Spaces Direct", Inventory, Network, "System Configuration"

 

The report will be stored in c:\users\<username>\AppData\Local\Temp

Networking & Cluster Configuration

We'll be running the new Switch Embedded Teaming (SET) vSwitch feature, new for Hyper-V in Server 2016.

 

Repeat the following steps on all nodes in your cluster. First, I recommend renaming the interfaces you plan to use for S2D to something easy and meaningful. I'm running Mellanox cards so I called mine Mell1 and Mell2. There should be no NIC teaming applied at this point!
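For example (the original adapter names below are hypothetical; substitute whatever Get-NetAdapter reports on your hosts):

Rename-NetAdapter -Name "Ethernet 3" -NewName "Mell1"
Rename-NetAdapter -Name "Ethernet 4" -NewName "Mell2"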

 

Configure the QoS policy. First, enable the Data Center Bridging (DCB) feature, which will allow us to prioritize certain services that traverse the Ethernet fabric.

Install-WindowsFeature -Name Data-Center-Bridging

 

Create a new policy for SMB Direct with a priority value of 3 which marks this service as “critical” per the 802.1p standard:

New-NetQosPolicy "SMBDirect" -NetDirectPortMatchCondition 445 -PriorityValue8021Action 3

 

Allocate at least 30% of the available bandwidth to SMB for the S2D solution:

New-NetQosTrafficClass "SMBDirect" -Priority 3 -BandwidthPercentage 30 -Algorithm ETS

 

Create the SET switch using the adapter names you specified previously. Adjust the names to match your own choices or naming standards. Note that this SETSwitch is ultimately a vNIC and can receive an IP address itself:

New-VMSwitch -Name SETSwitch -NetAdapterName "Mell1","Mell2" -EnableEmbeddedTeaming $true

 

Add host vNICs to the SET switch you just created; these will be used by the management OS. Adjust the names to match your own standards. Assign static IPs to these vNICs as required, as these interfaces are where RDMA will be enabled.

Add-VMNetworkAdapter -SwitchName SETSwitch -Name SMB_1 -ManagementOS

Add-VMNetworkAdapter -SwitchName SETSwitch -Name SMB_2 -ManagementOS

 

Once created, the new virtual interfaces will be visible in the network adapter list by running Get-NetAdapter.


 

Optional: Configure VLANs for the new vNICs, which can be the same or different, but IP them uniquely. If you don't intend to tag VLANs in your cluster or have a flat network with one subnet, skip this step.

Set-VMNetworkAdapterVlan -VMNetworkAdapterName "SMB_1" -VlanId 00 -Access -ManagementOS

Set-VMNetworkAdapterVlan -VMNetworkAdapterName "SMB_2" -VlanId 00 -Access -ManagementOS

 

Verify your VLANs and vNICs are correct. Notice mine are all untagged since this demo is in a flat network.

Get-VMNetworkAdapterVlan -ManagementOS


 

Restart each vNIC to activate the VLAN assignment.

Restart-NetAdapter "vEthernet (SMB_1)"

Restart-NetAdapter "vEthernet (SMB_2)"

 

Enable RDMA on each vNIC.

Enable-NetAdapterRDMA "vEthernet (SMB_1)", "vEthernet (SMB_2)"

 

Next each vNIC should be tied directly to a preferential physical interface within the SET switch. In this example we have 2 vNICs and 2 physical NICs for a 1:1 mapping. Important to note: Although this operation essentially designates a vNIC assignment to a preferential pNIC, should the assigned pNIC fail, the SET switch will still load balance vNIC traffic across the surviving pNICs. It may not be immediately obvious that this is the resulting and expected behavior.

Set-VMNetworkAdapterTeamMapping -VMNetworkAdapterName "SMB_1" -ManagementOS -PhysicalNetAdapterName "Mell1"

Set-VMNetworkAdapterTeamMapping -VMNetworkAdapterName "SMB_2" -ManagementOS -PhysicalNetAdapterName "Mell2"

 

To quickly prove this, I have my vNIC “SMB_1” preferentially tied to the pNIC “Mell1”. SMB_1 has the IP address 10.50.88.82


 

Notice that even though I have manually disabled Mell1, the IP still responds to a ping from another host as SMB_1’s traffic is temporarily traversing Mell2:
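To reproduce that test, roughly (run the ping from a different host; the interface name and IP match the ones above):

Disable-NetAdapter -Name "Mell1" -Confirm:$false     # fail the preferred pNIC
Test-Connection -ComputerName 10.50.88.82 -Count 4   # SMB_1 should still answer via Mell2
Enable-NetAdapter -Name "Mell1" -Confirm:$false      # restore the pNIC when done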


 

Verify the RDMA capabilities of the new vNICs and associated physical interfaces. RDMA Capable should read true.

Get-SMBClientNetworkInterface


 

Build the cluster with a dedicated IP and a subnet mask in slash notation. This command defaults to /24 but may still fail unless the mask is explicitly specified.

New-Cluster -name "Cluster Name" -Node Node1, Node2, etc -StaticAddress 0.0.0.0/24

 

Checking the report output stored in c:\windows\cluster\reports\, I saw it flagged the lack of a suitable disk witness. This will auto-resolve later once the cluster is up.

 

The cluster will come up with no disks claimed, and if the disks carry any prior formatting they must first be wiped and prepared. In PowerShell ISE, run the following script, substituting your cluster name.

# Wipe and reset every non-boot, non-system disk on all cluster nodes so S2D can claim them
icm (Get-Cluster -Name S2DCluster | Get-ClusterNode) {
    Update-StorageProviderCache
    # Remove any existing non-primordial storage pools and their virtual disks
    Get-StoragePool | ? IsPrimordial -eq $false | Set-StoragePool -IsReadOnly:$false -ErrorAction SilentlyContinue
    Get-StoragePool | ? IsPrimordial -eq $false | Get-VirtualDisk | Remove-VirtualDisk -Confirm:$false -ErrorAction SilentlyContinue
    Get-StoragePool | ? IsPrimordial -eq $false | Remove-StoragePool -Confirm:$false -ErrorAction SilentlyContinue
    Get-PhysicalDisk | Reset-PhysicalDisk -ErrorAction SilentlyContinue
    # Clear any disk that still has a partition style, then return it to an offline, read-only state
    Get-Disk | ? Number -ne $null | ? IsBoot -ne $true | ? IsSystem -ne $true | ? PartitionStyle -ne RAW | % {
        $_ | Set-Disk -IsOffline:$false
        $_ | Set-Disk -IsReadOnly:$false
        $_ | Clear-Disk -RemoveData -RemoveOEM -Confirm:$false
        $_ | Set-Disk -IsReadOnly:$true
        $_ | Set-Disk -IsOffline:$true
    }
    # Report the now-RAW disks, grouped by model, for each node
    Get-Disk | ? Number -ne $null | ? IsBoot -ne $true | ? IsSystem -ne $true | ? PartitionStyle -eq RAW | Group -NoElement -Property FriendlyName
} | Sort -Property PsComputerName, Count

Once successfully completed, you will see an output with all nodes and all disk types accounted for.

 

Finally, it's time to enable S2D. Run the following command, substituting your cluster name, and select "Yes to All" when prompted.

Enable-ClusterStorageSpacesDirect -CimSession S2DCluster

 

Storage Configuration

Once S2D is successfully enabled, there will be a new storage pool created and visible within Failover Cluster Manager. The next step is to create volumes within this pool. S2D will make some resiliency choices for you depending on how many nodes are in your cluster: 2 nodes = 2-way mirroring, 3 nodes = 3-way mirroring, and with 4 or more nodes you can specify mirror or parity. When using a hybrid configuration of 2 disk types (HDD and SSD), the volumes reside within the HDDs, as the SSDs simply provide caching for reads and writes. In an all-flash configuration only the writes are cached.

Cache drive bindings are automatic and will adjust based on the number of each disk type in place. In my case, I have 4 SSDs + 5 HDDs per host. This nets a 1:1 cache:capacity mapping for 3 pairs of disks and a 1:2 ratio for the remaining 3 disks. Microsoft's recommendation is to make the number of capacity drives a multiple of the number of cache drives, for simple symmetry. If a host experiences a cache drive failure, the cache-to-capacity mapping will readjust to heal. This is why a minimum of 2 cache drives per host is recommended.
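To see how your disks were claimed, including which ones are serving as cache (their Usage shows as Journal), something like this works against the default pool name:

Get-StoragePool S2D* | Get-PhysicalDisk |
    Group-Object MediaType, Usage -NoElement | Sort-Object Name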

 

Volumes can be created using PowerShell or Failover Cluster Manager by selecting the “new disk” option. This one simple PowerShell command does three things: creates the virtual disk, places a new volume on it and makes it a cluster shared volume. For PowerShell the syntax is as follows:

New-Volume -FriendlyName "Volume Name" -FileSystem CSVFS_ReFS -StoragePoolFriendlyName S2D* -Size xTB

 

CSVFS_ReFS is recommended, but CSVFS_NTFS may also be used. Once created, these disks will be visible under the Disks selection within Failover Cluster Manager. Disk creation within the GUI is a much longer process but gives meaningful information along the way, such as showing that these disks are created using the capacity media (HDD) and that the default 3-way mirror for a 3-server cluster is being used.

 

The UI also shows us the remaining pool capacity and that storage tiering is enabled.

 

Once the virtual disk is created, we next need to create a volume within it using another wizard. Select the vDisk created in the last step:

 

Storage tiering dictates that the volume size must match the size of the vDisk:

 

Skip the assignment of a drive letter, assign a label if you desire, confirm and create.

 

 

The final step is to add this new disk to a Cluster Shared Volume via Failover Cluster Manager. You’ll notice that the owner node will be automatically assigned:
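If you prefer PowerShell for this step, the equivalent looks something like the line below (the resource name is an example; use the name Failover Cluster Manager shows for your virtual disk):

Add-ClusterSharedVolume -Name "Cluster Virtual Disk (Volume1)"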

 

Repeat this process to create as many volumes as required.

 

RDS Deployment

Install the RDS roles and features required and build a collection. See this post for more information on RDS installation and configuration. There is nothing terribly special that changes this process for an S2D cluster. As far as RDS is concerned, this is an ordinary Failover Cluster with Cluster Shared Volumes. The fact that RDS is running on S2D and "HCI" is truly inconsequential.

As long as the RDVH or RDSH roles are properly installed and enabled within the RDS deployment, the RDCB will be able to deploy and broker connections to them. One of the most important configuration items is ensuring that all servers, virtual and physical, that participate in your RDS deployment are listed in the Servers tab. RDS is reliant on Server Manager, and Server Manager has to know who the players are. This is how you tell it.


 

To use the S2D volumes for RDS, point your VM deployments at the proper CSVs within the C:\ClusterStorage paths. When building your collection, make sure to have the target folder already created within the CSV or collection creation will fail. Per the example below, the folder “Pooled1” needed to be manually created prior to collection creation.
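Pre-creating that folder is a one-liner; the volume path below matches my lab layout, so adjust it to your own CSV:

New-Item -ItemType Directory -Path C:\ClusterStorage\Volume1\Pooled1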

 

During provisioning, you can select a specific balance of VMs to be deployed across the RDVH-enabled hosts you select, as can be seen below.

 

RDS is cluster-aware in that the VMs it creates can be made highly available within the cluster, and the RD Connection Broker (RDCB) remains aware of which host is running which VMs should they move. In the event of a failure, VMs will be moved to a surviving host by the cluster service and the RDCB will keep track.

 

Part 1: Intro and Architecture

Part 2: Prep & Configuration (you are here)

Part 3: Performance & Troubleshooting

 

Resources

S2D in Server 2016

S2D Overview

Working with volumes in S2D

Cache in S2D
