Over at vinternals, Stu asks whether linked clones are the panacea for the VDI storage problem that a lot of people are claiming them to be. I say yes; however, we are moving from designing for capacity to designing for performance, and VMware have given us some good tools to manage it. Let me explain a bit further.
Stu essentially raises two issues.
First, delta disks grow more than you think. Stu considers that growth is going to be a lot more than people expect, citing that NTFS typically writes to zeroed blocks before reusing deleted ones, and that there is lots of activity on the system disk even if you have done a reasonable job of locking it down.
Second, SCSI reservations. People are paranoid about SCSI reservations and avoid snapshot longevity as much as possible. With a datastore full of delta disks that continually grow, are we setting ourselves up for an "epic fail"?
These are good questions. I think what this highlights is that, with Composer, the focus for VDI storage has shifted from capacity management to performance management. Where before we were concerned with how to deliver a couple of TB of data, now we are concerned with how to deliver a few hundred GB of data at a suitable rate.
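To put some very rough numbers on that shift, here's a back-of-the-envelope sketch in Python. Every figure in it is a made-up assumption for illustration, not a measurement or a VMware recommendation; the point is simply that the spindle count ends up being driven by IOPS rather than by GB.

```python
# Rough sketch of why the constraint flips from capacity to performance.
# All numbers are hypothetical assumptions for illustration only.

DESKTOPS = 500
LINKED_CLONE_DATA_GB = 400   # "a few hundred GB" of deltas plus replicas
IOPS_PER_DESKTOP = 8         # assumed steady-state average per desktop
DISK_SIZE_GB = 300           # one 15K spindle
DISK_IOPS = 150              # rule-of-thumb IOPS for that spindle

spindles_for_capacity = LINKED_CLONE_DATA_GB / DISK_SIZE_GB
spindles_for_performance = DESKTOPS * IOPS_PER_DESKTOP / DISK_IOPS

print(f"Spindles to hold the data:  {spindles_for_capacity:.1f}")
print(f"Spindles to serve the IOPS: {spindles_for_performance:.1f}")
# A couple of spindles hold the data, but dozens are needed to serve it.
```

With those invented numbers, a spindle or two would hold the data but nearly thirty are needed to serve it. Capacity stops being the design driver; rate of delivery takes over.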
Regarding the delta disk growth issue: yes, these disks are going to grow, but this is why we have the automated desktop refresh to take the machine back to a clean delta disk. The refresh can be performed on demand, as a timed event, or when the delta disk reaches a certain size. What this means is that the problem can be easily managed and designed for. We can plan for storage overcommit and set the pools up to manage themselves.
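To make those three triggers concrete, here is a minimal sketch of that policy logic in Python. It is purely illustrative; it is not the View Composer interface, and the thresholds are assumptions I have picked for the example.

```python
from datetime import datetime, timedelta

# Illustrative refresh-policy check only; not the View Composer API.
MAX_DELTA_GB = 4              # assumed size trigger for a refresh
MAX_AGE = timedelta(days=5)   # assumed timed trigger

def needs_refresh(delta_gb, last_refresh, on_demand=False):
    """Return True if the desktop should go back to a clean delta disk."""
    if on_demand:                                   # administrator asked
        return True
    if delta_gb >= MAX_DELTA_GB:                    # size-based trigger
        return True
    if datetime.now() - last_refresh >= MAX_AGE:    # timed trigger
        return True
    return False

# A 2 GB delta last refreshed a week ago gets refreshed on age alone.
print(needs_refresh(2.0, datetime.now() - timedelta(days=7)))  # True
```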
To me the big storage problem we had was preparing for the worst-case scenario. Every desktop would consume either 10GB or 20GB even though most consumed much less than 10GB. Why? Just in case! Just in case one or two machines do lots of activity, and because we had NO easy means of resizing them we also had to be conservative about the starting point. With Composer we can start with a 10GB image but only allocate used space. If we install new applications and decide we really do need the capacity to grow to 12GB, we can create a new master and perform a recomposition of the machines. Now we are no longer building for the worst case but managing for used space only. This is a significant shift.
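As a worked example of what that shift buys you, here is a hypothetical comparison in Python between provisioning every desktop for the worst case and provisioning for used space plus some growth headroom. The per-desktop figures and the headroom factor are assumptions, not guidance.

```python
# Hypothetical sizing comparison; all figures are assumptions.

DESKTOPS = 500
WORST_CASE_GB = 20       # "just in case" allocation per full desktop
AVG_USED_GB = 3          # what a freshly refreshed delta typically uses
GROWTH_HEADROOM = 1.5    # 50% headroom for deltas growing between refreshes
REPLICA_GB = 10          # one shared base image for the pool

worst_case_gb = DESKTOPS * WORST_CASE_GB
used_space_gb = DESKTOPS * AVG_USED_GB * GROWTH_HEADROOM + REPLICA_GB

print(f"Provision for worst case: {worst_case_gb / 1024:.1f} TB")
print(f"Provision for used space: {used_space_gb / 1024:.1f} TB")
print(f"Effective overcommit:     {worst_case_gb / used_space_gb:.1f}:1")
```

With those invented inputs you drop from roughly 10TB to a bit over 2TB, an overcommit of around 4:1, and the refresh policy is what keeps that ratio honest.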
As it happens, today there was a blog posting about Project Minty Fresh. This installation has a problem with maintaining the integrity of their desktops. As a result they are putting a policy in place to refresh the OS every 5 days. This will not only maintain their SOE integrity but also keep their storage overcommit in check.
In regards to SCSI reservations: I do believe that the delta disks still grow in 16MB increments and not some larger size. So when the delta disks are growing there will be reservations, and you will have many delta disks on the one datastore. Is this a problem? I think not.
In the VMware world we have always been concerned about SCSI reservations because of server workloads. For server workloads we want to ensure fast and, more importantly, predictable performance. If we have lots of snapshots, that SQL database system which usually runs fine may start to behave a little differently. Predictability or consistency in performance is sometimes more important than the actual speed. My estimation is that desktop workloads are going to be quite different. In our favor we have concurrency and users. All those users are going to have a lower concurrency of activity; given the right balance we should have a manageable amount of SCSI reservations, and if not we rebalance our datastores: same space, just more LUNs. Also, unlike with servers, will users even be able to perceive any SCSI reservation hits as they go about their activity? Given the nature of users' work profiles, and that any large IOs should be redirected not into the OS disk but into their network shares or user drives, the problem may not be as relevant as we might expect.
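To give a rough feel for the "same space, just more LUNs" rebalance, here is a small Python sketch. The 16MB grow increment is my understanding of the delta disk behaviour rather than a published figure, the growth rate is an invented assumption, and the count is only a crude proxy for reservation pressure, since other metadata operations also take reservations.

```python
# Crude proxy for delta-disk grow operations per datastore per hour.
# The growth rate is an invented assumption; 16 MB is my understanding
# of the grow increment, not a published figure.

GROW_INCREMENT_MB = 16
DELTA_GROWTH_MB_PER_HOUR = 30     # assumed average growth per desktop
DESKTOPS_TOTAL = 500

def grow_ops_per_hour(desktops_on_lun):
    """Grow operations per hour landing on one datastore."""
    per_desktop = DELTA_GROWTH_MB_PER_HOUR / GROW_INCREMENT_MB
    return desktops_on_lun * per_desktop

for luns in (4, 8, 16):
    per_lun = DESKTOPS_TOTAL / luns
    print(f"{luns:2d} LUNs -> {per_lun:5.1f} desktops/LUN, "
          f"~{grow_ops_per_hour(per_lun):4.0f} grow ops/hour each")
```

Same total space either way; spreading the desktops across more, smaller LUNs simply thins out how often any one datastore has to take a reservation for a delta grow.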
What Stu did not mention, and what we do need to be careful of because it can be the elephant in the room, is IO storms. This is where we really do have some potential risk. If a particular activity causes a high concurrency of IO activity, things could get very interesting.
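A crude way to see why IO storms matter is to compare steady-state aggregate IOPS with what happens when a slice of the pool does the same heavy thing at once (boot, logon, an AV scan). Every number below is invented for illustration only.

```python
# Illustrative IO storm arithmetic; all figures are assumptions.

DESKTOPS = 500
STEADY_IOPS = 8            # per desktop during normal work
STORM_IOPS = 60            # per desktop while booting / logging on / scanning
ARRAY_IOPS_BUDGET = 8000   # assumed deliverable IOPS behind the pool

def aggregate_iops(fraction_in_storm):
    storming = DESKTOPS * fraction_in_storm
    return storming * STORM_IOPS + (DESKTOPS - storming) * STEADY_IOPS

for frac in (0.0, 0.1, 0.25, 0.5):
    total = aggregate_iops(frac)
    print(f"{frac:4.0%} of desktops storming -> {total:6.0f} IOPS "
          f"({total / ARRAY_IOPS_BUDGET:.0%} of budget)")
```

With those numbers, just a quarter of the pool storming at once blows past the budget, which is why spreading out whatever that particular activity is matters so much.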
Lastly, as Stu points out, statelessness is the goal for VDI deployments. Using application virtualisation, locking down the OS to a suitable level and redirecting file activity to appropriate user or networked storage is going to make a big impact on the IO profile. These are activities we want to undertake in any event, so the effort has multiple benefits.
I too believe you need to try this out in your environment, not just for the storage requirements, but also for the CPU, user experience, device capabilities and operational changes. VDI has come a long way with this release and I do strongly believe it will enable impactful storage savings.
What I really want is for the offline feature to become supported rather than just being experimental. Plus I want it to support the Composer-based pools. There is no reason why it can't, and until then there is still some way to go before we can address the full breadth of use cases. However, there are plenty of use cases now, which form the bulk, to sink our teeth into.
Rodos
Good article Rodos. Although I would still like to see an environment with 2000+ VDI desktops and check their storage setup when using linked clones. If you want to prevent these problems you will need to have a really good design and well balanced VMFS/VM ratio.
Yeh, another nice post Rod :-). In line with Duncan's comment, I'd also like to see those 2000+ VMs painlessly taken back to a clean state on a schedule to "clean up the deltas". We considered this approach for the array-based snaps too; it turns out the support cost associated with re-personalisation every 9-12 months would've wiped out any storage savings. And that's what this is all about - the overall cost of VDI, not just the storage cost. The cost in lost productivity is even more, for example if you have thousands of 3rd party developers accessing VDI VMs in your environment. I have images for my machines at home; whenever I rebuild them the imaging process itself takes about 2 minutes, but the app reinstallation etc. takes hours. The idea of refresh is nowhere near as trivial as simply hitting a reset button. As we say, statelessness is the only thing that will allow this, but it's a few years away yet.
What is the support cost of a refresh when it's automated and transparent to the user (it just happens alongside a reboot)? There should be no app reinstallation! The apps should already be in the base image or be running virtually via ThinApp. It would be painful to deploy them via, say, Altiris and then have a refresh of any frequency. I think good enough "statelessness" is just about here. It's not for all use cases; in typical environments you will still have your pools of fully thick-provisioned machines.
But as we all say, you will need to test and walk with open eyes in your own environment.
I agree that there _should_ be no app reinstall, but that's nowhere near reality in the enterprise yet. There are still waaay too many apps that either don't work or aren't supported when virtualised (VI client for example ;-) for it to be a viable proposition in the sorts of companies I have worked for / talk to regularly.
It may well be totally different in smaller companies (I have only worked for 1 company that had less than 10K users and that was a loooong time ago). I completely agree the technology for good enough statelessness is available, but in the enterprise it was a good 2 years away from wide deployment _before_ the global recession. It's probably a lot further away now :(
I have to agree with Stu. There are far too many applications out there that write data and registry settings for customizations to non-user-specific areas. Unfortunately many still write files out to a Windows sub-folder or to HKLM. Any refresh would cause those settings to be lost.
I work for a hospital where we currently have over 1000 VDI sessions. Licensing alone prevents us from having one master image to work from, along with over 400 applications across our environment - many of which do not work together.
ThinApp has some promise, but it is still not faster than a local install. It also uses a snapshot-style technology to determine the changes an install makes and then deploy them. There are many application installers out there that write unique settings (such as registration information) based on the hardware they are installed to or on user information. ThinApp would not work for these installations.
VMware also now recommends pushing all of these user settings to a network share or other storage. Many companies use the same storage array for their VMware environment as they do for their file servers. This method just brings in a file server to also be blamed for performance issues when the contention is still back at the array level. With this strategy, if you are holding 10 pounds of crap in one hand and then move 5 pounds to the other, you are still holding 10 pounds of crap...
@Anonymous. Let's not throw the baby out with the bath water. If the storage is not sufficient in the environment then that's something to fix, not a reason to avoid clones. For 1000 desktops and lots of applications there will be groups that will work well as clones and other groups where they won't. Where it does not work you can use fat-provisioned machines. However, many of the application issues raised will remain whichever way you do it. I think many of the issues being raised are more general VDI issues than issues with clones. But it does sound like we are all moving towards the same goal.
Chad Sakac posted a follow-up article to this on his blog with some interesting information that's worth reading for the continuing discussion:
http://virtualgeek.typepad.com/virtual_geek/2008/12/vmware-view-composerlinked-clones---they-are-not-a-panacea.html