I don’t know about you, but vSphere vMotion has been my favourite vSphere feature since 2003! The first live vMotion was just stupefying, I mean breathtaking! If you want to know the whole story of vMotion, go and check out "VMotion, the story and confessions" from Duncan Epping.
With vSphere 7, VMware needed to review the vMotion process and look closely at how to improve vMotion to support today’s workloads.
For instance, VMs with a large memory and CPU footprint, such as monster VMs and SAP HANA or Oracle database backends, had real challenges being live-migrated with vMotion, notably the performance impact during the migration and the switch-over time.
At a high level, vMotion is comprised of several processes.
So, several of those processes have been improved to mitigate vMotion issues for larger VMs.
To start, you have to understand the foundations of vMotion to figure out what VMware has changed.
This is the basic vMotion workflow:
The time between steps 3 and 5 is what is called the switch-over time, or stun time. The goal is to keep it under 1 second!
For a more detailed, illustrated walkthrough, go to the blog post "The vMotion Process Under the Hood" (posted July 2019). It is a comprehensive, in-depth dive into the vMotion process.
Back to what’s been improved in vSphere 7. Two of those processes have been revamped, with new logic behind them, to mitigate vMotion issues for larger VMs: memory paging and memory copy.
The memory paging process uses page tracers, with which vMotion keeps track of memory paging activity during a migration. In other words, it keeps track of all the memory pages that are changing, the so-called page fires, which occur whenever the guest OS writes or tries to write to memory. This is explained in detail in this video.
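To make the idea of tracking changed pages a bit more concrete, here is a minimal toy sketch in Python. It is not how ESXi implements page tracing: the `PageTracer` class, its method names and the 4 KiB page size are all assumptions made up for illustration. It only shows the principle that every guest write marks a page as dirty so it can be sent again later.

```python
from __future__ import annotations


class PageTracer:
    """Toy dirty-page tracker (hypothetical, for illustration only)."""

    PAGE_SIZE = 4096  # assumed page size in bytes

    def __init__(self, num_pages: int) -> None:
        # One flag per guest memory page: 0 = clean, 1 = written since last copy.
        self.dirty = bytearray(num_pages)

    def on_guest_write(self, address: int) -> None:
        """Simulates a 'page fire': the guest wrote to this address."""
        self.dirty[address // self.PAGE_SIZE] = 1

    def collect_dirty_pages(self) -> list[int]:
        """Return the pages that must be re-sent, then reset the tracking."""
        pages = [i for i, flag in enumerate(self.dirty) if flag]
        self.dirty = bytearray(len(self.dirty))
        return pages


tracer = PageTracer(num_pages=8)
for addr in (0, 100, 4096, 20000):  # simulated guest writes
    tracer.on_guest_write(addr)
print(tracer.collect_dirty_pages())  # [0, 1, 4]
```

Two writes land in page 0, one in page 1 and one in page 4, so only those three pages would need to be copied again.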
Now, to install the page tracer, the vCPU has to be stopped. We’re talking microseconds, but it still has to stop, so during that time the VM, and the application running inside it, can no longer use that specific vCPU.
Prior to vSphere 7, page tracing happened on each and every vCPU within a VM. As a result, the migration itself could leave the VM and its workload resource constrained.
With vSphere 7, a single dedicated vCPU is used for page tracing. Put differently, the VM and its applications can keep working on the other vCPUs while the vMotion processes are occurring. No more stopping every vCPU!
Another process that was improved is the memory copy. When nearly all memory has been copied to the destination host, vMotion is ready to switch over to the destination ESXi host.
In this last phase of vMotion, the VM on the source ESXi host is suspended and the checkpoint data is sent to the destination host. Remember that we want the switch-over or stun time to be kept under 1 second.
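Why does "nearly all memory" matter? A rough way to picture it is as a loop of copy passes: each pass sends the pages that changed during the previous one, until what is left is small enough to send while the VM is suspended. The sketch below is a simplified model under stated assumptions, not VMware's algorithm: the function name, the fixed `dirty_percent` ratio and the `switch_threshold` are hypothetical.

```python
from __future__ import annotations


def precopy_passes(total_pages: int, dirty_percent: int,
                   switch_threshold: int, max_passes: int = 50
                   ) -> tuple[list[int], int]:
    """Model the copy passes before switch-over.

    dirty_percent: assumed share (0-99) of the pages sent in one pass that
    the guest dirties again while that pass is running.
    Returns (pages sent per pass, pages left for the stun phase).
    """
    sent: list[int] = []
    to_send = total_pages  # the first pass copies all memory
    for _ in range(max_passes):
        if to_send <= switch_threshold:
            break  # small enough: suspend the VM and switch over
        sent.append(to_send)
        # Pages dirtied during this pass have to be sent again.
        to_send = to_send * dirty_percent // 100
    return sent, to_send


passes, remaining = precopy_passes(1_000_000, dirty_percent=10,
                                   switch_threshold=1_000)
print(passes, remaining)  # [1000000, 100000, 10000] 1000
```

In this model, a VM that dirties memory faster (a higher `dirty_percent`) needs more passes, which is one way to see why large, busy VMs are the hard case.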
Sending the memory bitmap alone, as part of that data, could take 2 seconds or more, depending on network latency, etc. So this can quickly become a challenge with monster VMs.
Now with vSphere 7.0, only a compacted memory bitmap needs to be transferred. This results in a reduced stun time, even for the largest workloads, up to VMs with 24 TB of memory!
In an upcoming update release of vSphere 7, memory maximums will reach 24 TB. Right! That means vMotion is already prepared for that new sizing!
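To get a feel for the numbers, here is a small back-of-the-envelope sketch in Python. It assumes one bit per 4 KiB memory page, which is an assumption on my side, and the compaction scheme shown (keeping only the chunks that contain a set bit) is purely illustrative, since the actual format used by vSphere 7 is not described here. The helper names are hypothetical too. The intuition is that once nearly all memory has been copied, most of the bitmap should be empty, so there is little left to send.

```python
from __future__ import annotations

PAGE_SIZE = 4096  # assumed bytes per memory page
TIB = 2**40


def bitmap_bytes(memory_bytes: int, page_size: int = PAGE_SIZE) -> int:
    """Size of a plain bitmap holding one bit per memory page."""
    pages = memory_bytes // page_size
    return (pages + 7) // 8


def compact(bitmap: bytes, chunk: int = 64) -> dict[int, bytes]:
    """Keep only the chunks with at least one set bit (illustrative scheme)."""
    return {
        offset: bitmap[offset:offset + chunk]
        for offset in range(0, len(bitmap), chunk)
        if any(bitmap[offset:offset + chunk])
    }


def expand(chunks: dict[int, bytes], size: int) -> bytes:
    """Rebuild the full bitmap on the receiving side."""
    out = bytearray(size)
    for offset, data in chunks.items():
        out[offset:offset + len(data)] = data
    return bytes(out)


# Plain bitmap for a 24 TiB VM under the assumptions above.
print(bitmap_bytes(24 * TIB) // 2**20, "MiB")  # 768 MiB

# Small demo: a 1 GiB VM with only two dirty regions left.
bitmap = bytearray(bitmap_bytes(2**30))  # 32768 bytes
bitmap[5] = 0b00000100
bitmap[20000] = 0b00000001
sparse = compact(bytes(bitmap))
print(sorted(sparse), sum(len(c) for c in sparse.values()))  # [0, 19968] 128
assert expand(sparse, len(bitmap)) == bytes(bitmap)
```

In the demo, 128 bytes of payload (plus two offsets) stand in for a 32 KiB bitmap, and the receiver can still rebuild it exactly.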
Recap: What has changed for vMotion in vSphere 7
Now there should be much less hesitancy, and a lot more confidence, when it comes to vMotioning monster VMs.