Migrate an existing cluster to Network ATC


Since Azure Stack HCI 21H2, customers have used Network ATC to:

Reduce host networking deployment time, complexity, and errors

Deploy the latest Microsoft validated and supported best practices

Ensure configuration consistency across the cluster

Eliminate configuration drift

Network ATC has led to HUGE reductions in customer support cases, which means increased uptime for your business applications and fewer headaches for you! But what if you already deployed your cluster? How do you take advantage now that you’ve travelled through that trepidatious train of thought against taking on new technology?

With minimal alliteration, this article will show you how to migrate an existing cluster to Network ATC so you can take advantage of all the benefits mentioned above. Once completed, you can easily cookie-cut this configuration across all new deployments using our previous blog, so this is a one-time migration and all new clusters will gain the benefits!

Before you begin

Since this is a live cluster with running VMs, we’ll take some precautions to ensure we’re never working on a host with a running VM on it. If you don’t have running workloads on these nodes, you don’t need these instructions. Just add your intent command as if this were a brand-new cluster.

As some background, Network ATC stores information in the cluster database, which is then replicated to the other nodes in the cluster. The Network ATC service on each of the other nodes sees the change in the cluster database and implements the new intent. So we set up the cluster to receive a new intent, but we can also control the rollout of the new intent by stopping or disabling the Network ATC service on nodes that have virtual machines on them.

Procedure

Step 1: Install the Network ATC feature

First, let’s install Network ATC on EVERY node in the cluster using the following command. This does not require a reboot.

Install-WindowsFeature -Name NetworkATC
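If you would rather not log on to each node in turn, the same install can be pushed from one machine. The following is a minimal sketch, assuming it is run on a cluster node with the FailoverClusters module available and PowerShell remoting enabled between the nodes; neither assumption comes from the original steps.

```powershell
# Assumption: run on a cluster node, with PowerShell remoting enabled between nodes.
$nodes = (Get-ClusterNode).Name

# Install the Network ATC feature on every node in the cluster.
Invoke-Command -ComputerName $nodes -ScriptBlock {
    Install-WindowsFeature -Name NetworkATC
}

# Confirm the feature reports as installed on each node.
Invoke-Command -ComputerName $nodes -ScriptBlock {
    Get-WindowsFeature -Name NetworkATC
} | Select-Object PSComputerName, Name, InstallState
```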

Step 2: Pause one node in the cluster

Pause one node in the cluster. This node will be migrated to Network ATC. We’ll repeat this step later for other nodes in the cluster too. As a result of this pause, all workloads will migrate to other nodes in the cluster leaving this machine available for changes. To do this, you can use the command:

Suspend-ClusterNode -Drain
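A slightly fuller sketch of the pause, with a check that the node really is empty before you change it. The node name 'Node01' is hypothetical, and the Get-VM check assumes the Hyper-V PowerShell module is available where you run it.

```powershell
# 'Node01' is a hypothetical node name; substitute your own.
$node = 'Node01'

# -Drain moves clustered roles off the node; -Wait blocks until the drain finishes.
Suspend-ClusterNode -Name $node -Drain -Wait

# The node should now report a State of Paused.
Get-ClusterNode -Name $node | Select-Object Name, State

# This should return nothing if no running VMs remain on the node.
Get-VM -ComputerName $node | Where-Object State -eq 'Running'
```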

Step 3: Stop the Network ATC service

For all nodes that are not paused, stop and disable the Network ATC service. As a reminder, this is to prevent Network ATC from implementing the intent while there are running virtual machines. To do this, you can use the commands:

Set-Service  -Name NetworkATC -StartupType Disabled
Stop-Service -Name NetworkATC
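To apply this to every node except the paused one in a single pass, something like the following could work. It is a sketch that assumes PowerShell remoting is enabled and uses a hypothetical node name for the paused node.

```powershell
# 'Node01' is a hypothetical name for the node paused in step 2.
$pausedNode = 'Node01'
$otherNodes = (Get-ClusterNode).Name | Where-Object { $_ -ne $pausedNode }

# Stop and disable the Network ATC service on every node that still hosts workloads.
Invoke-Command -ComputerName $otherNodes -ScriptBlock {
    Set-Service  -Name NetworkATC -StartupType Disabled
    Stop-Service -Name NetworkATC
}

# Verify: each node should show Status = Stopped and StartType = Disabled.
Invoke-Command -ComputerName $otherNodes -ScriptBlock {
    Get-Service -Name NetworkATC
} | Select-Object PSComputerName, Status, StartType
```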

Step 4: Remove existing configuration

Next, we’ll remove any previous configurations that might interfere with Network ATC’s ability to implement the intent. An example of this might be a Data Center Bridging (NetQos) policy for RDMA traffic. Network ATC will also deploy this, and if it sees a conflicting policy, Network ATC is wise enough not to interfere with it until you make it clear which policies you want to keep. While Network ATC will attempt to “adopt” the existing configuration if the names match (whether it be NetQos or other settings) it’s far simpler to just remove the existing configuration and let Network ATC redeploy.

Network ATC deploys a lot more than these items, but these are the items that need to be resolved before implementing the new intent.
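Before removing anything, it may be worth recording what is currently configured on the paused node so you have something to refer back to. This is an optional sketch, not part of the original procedure; the output folder is a hypothetical path, and the exported files are for reference only, not a restore mechanism.

```powershell
# Hypothetical output folder; pick any location you prefer.
$backup = "C:\Temp\PreATC-$env:COMPUTERNAME"
New-Item -Path $backup -ItemType Directory -Force | Out-Null

# Record the objects that the following sections remove.
Get-VMSwitch           | Export-Clixml -Path "$backup\VMSwitch.xml"
Get-NetQosTrafficClass | Export-Clixml -Path "$backup\NetQosTrafficClass.xml"
Get-NetQosPolicy       | Export-Clixml -Path "$backup\NetQosPolicy.xml"
Get-NetQosFlowControl  | Export-Clixml -Path "$backup\NetQosFlowControl.xml"
Get-NetLbfoTeam        | Export-Clixml -Path "$backup\NetLbfoTeam.xml"
```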

VMSwitch

If you have more than one VMSwitch on this system, ensure you specify the switch attached to the adapters that will be used in this intent.

# Replace 'VMSwitchName' with the name of the switch you want to remove.
Get-VMSwitch -Name 'VMSwitchName' | Remove-VMSwitch -Force

Data Center Bridging Configuration

Remove the existing DCB Configurations.

Get-NetQosTrafficClass | Remove-NetQosTrafficClass
Get-NetQosPolicy       | Remove-NetQosPolicy -Confirm:$false
Get-NetQosFlowControl  | Disable-NetQosFlowControl

LBFO

If you accidentally deployed an LBFO team, we’ll need to remove that as well. As you might have read, LBFO is not supported on Azure Stack HCI at all. Don’t worry, Network ATC will prevent these types of accidental oversights in the future as it will never deploy a solution that we do not support.

Get-NetLBFOTeam | Remove-NetLBFOTeam -Confirm:$true

SCVMM

If the nodes were configured via VMM, these configuration objects may need to be removed from VMM as well.

Step 5: Add the Network ATC intent

It’s now time to add a Network ATC intent. You’ll only need to do this once since Network ATC intents are implemented cluster wide. However, we have taken some precautions to control the speed of the rollout. In step 2, we paused this node so there are no running workloads on it. In step 3, we stopped and disabled the Network ATC service on nodes where there are running workloads.

If you stopped and disabled the Network ATC service, you should start this service on this node only. To do this, run the following command:

Set-Service   -Name NetworkATC -StartupType Automatic
Start-Service -Name NetworkATC

Now, add your Network ATC intent(s). There are some example intents listed in our documentation here.
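As one illustration, a fully converged intent might look like the sketch below. The intent name and adapter names are hypothetical, and your intent types and adapters will depend on your design. Depending on your OS version, you may also need to add the -ClusterName parameter.

```powershell
# Hypothetical intent name and adapter names; substitute the physical adapters
# that were attached to the configuration you removed in step 4.
Add-NetIntent -Name 'ConvergedIntent' -Management -Compute -Storage -AdapterName 'pNIC01', 'pNIC02'

# List the intents that the cluster now knows about.
Get-NetIntent
```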

Step 6: Verify deployment on one node

To verify that the node has successfully deployed the intent submitted in step 5, use the Get-NetIntentStatus command as shown below.

# Replace 'IntentName' with the name of the intent you added in step 5.
Get-NetIntentStatus -Name 'IntentName'

The Get-NetIntentStatus command will show the deployment status of the requested intents. Eventually, there will be one object per intent returned from each node in the cluster. As a simple example, if you had a 3-node cluster with 2 intents, you would see 6 objects returned by this command, each with their own status.

Before moving on from this step, ensure that each intent you added has an entry for the host you’re working on, and that the ConfigurationStatus shows Success. If the ConfigurationStatus shows Failed, check whether the Error message indicates why it failed. We have some quick resolutions listed in our documentation here.
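Because the intent takes a little while to apply, you may prefer to poll the status rather than rerun the command by hand. The following is a rough sketch, assuming it runs on the paused node and that the status objects expose Host, IntentName, ConfigurationStatus, ProvisioningStatus, and Error properties. The intent name, polling interval, and attempt limit are arbitrary illustrative choices.

```powershell
$intent      = 'ConvergedIntent'   # hypothetical intent name from step 5
$node        = $env:COMPUTERNAME   # assumes this runs on the paused node
$maxAttempts = 40                  # arbitrary limit so the loop cannot run forever
$attempt     = 0

do {
    Start-Sleep -Seconds 15
    $attempt++
    # Keep only the status object for this host.
    $status = Get-NetIntentStatus -Name $intent | Where-Object Host -eq $node
    $status | Select-Object Host, IntentName, ConfigurationStatus, ProvisioningStatus | Out-Host
} until (($status -and $status.ConfigurationStatus -in 'Success', 'Failed') -or $attempt -ge $maxAttempts)

# On failure, surface the error text for troubleshooting.
if ($status -and $status.ConfigurationStatus -eq 'Failed') {
    $status | Select-Object Host, IntentName, Error | Format-List | Out-Host
}
```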

Step 7: Rename the VMSwitch on other nodes

Now that one node is deployed with Network ATC, we’ll get ready to move on to the next node. To do this, we’ll migrate the VMs off the next node. This requires that the nodes have the same VMSwitch name as the node deployed with Network ATC. This is a non-disruptive change and can be done on all nodes at the same time.

Rename-VMSwitch -Name 'ExistingName' -NewName 'NewATCName'
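To avoid typing the new name by hand, you could read it from the migrated node and push the rename to the remaining nodes. This sketch assumes the migrated node has exactly one VMSwitch, that every remaining node uses the same existing switch name, and that PowerShell remoting is enabled; the node name is hypothetical.

```powershell
$migratedNode = 'Node01'        # hypothetical name of the node already on Network ATC
$oldName      = 'ExistingName'  # current switch name on the nodes not yet migrated

# Read the name Network ATC gave the switch (assumes a single VMSwitch on that node).
$newName = (Get-VMSwitch -ComputerName $migratedNode).Name

# Rename the switch on every other node so the names match across the cluster.
$remaining = (Get-ClusterNode).Name | Where-Object { $_ -ne $migratedNode }
Invoke-Command -ComputerName $remaining -ScriptBlock {
    Rename-VMSwitch -Name $using:oldName -NewName $using:newName
}
```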

Why don’t we change the Network ATC VMSwitch name instead? Two reasons. The first is that Network ATC ensures that all nodes in the cluster have the same VMSwitch name to ensure live migrations and symmetry. The second is that you really shouldn’t need to worry about the VMSwitch name. This is simply a configuration artifact and just one more thing you’d need to ensure is perfectly deployed. Instead, Network ATC implements and controls the names of configuration objects.

Step 8: Resume the cluster node

This node is now ready to re-enter the cluster. Run this command to put it back into service:

Resume-ClusterNode
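A minimal sketch with an explicit node name and a state check afterwards. The node name is hypothetical, and the commented-out failback variant is an optional choice rather than part of the original step.

```powershell
# 'Node01' is a hypothetical node name; substitute your own.
$node = 'Node01'

# Bring the node back into service.
Resume-ClusterNode -Name $node

# Optional alternative: also move the drained workloads back right away.
# Resume-ClusterNode -Name $node -Failback Immediate

# The node should now report a State of Up.
Get-ClusterNode -Name $node | Select-Object Name, State
```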

Step 9: Rinse and Repeat

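Each node will need to go through the procedure outlined above. To complete the migration to Network ATC across the cluster, repeat steps 1 – 4, 6 and 8 on each remaining node. You don’t need to add the intent again, but if you stopped and disabled the Network ATC service on that node in step 3, start it again as shown in step 5 so the node can implement the intent.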

Summary

Migrating your existing clusters to Network ATC can be a game-changer for your cluster infrastructure and management. By automating and simplifying your network management, Network ATC can help you save time, increase efficiency, improve overall performance and avoid cluster downtime.

If you have any further questions or would like to learn more about Network ATC, please don’t hesitate to reach out to us!

Dan “Advanced Technology Coordinator” Cuomo
