Several years ago I worked for EMC as a vSpecialist, which meant I played an evangelist role: a pre-sales resource who helped the account and marketing teams. I would travel to customers, trade shows, user groups and whoever else wanted to hear the gospel according to EMC, and pontificate on the benefits of running VMware virtualization technology on top of EMC storage products as opposed to the other guys' platforms. Now there's nothing wrong with that; in fact it's a great way to make a living, especially if you happened to believe (as I did) that a) EMC DID in fact understand what people needed to succeed in the world of virtualization, and b) EMC wanted to give it to them. And because I'm a reasonably good nerd with plenty of hands-on experience and a love of the spotlight, it wasn't very hard to stand up in front of an audience and talk convincingly about what makes a great infrastructure and how to architect it. It's never hard to talk about something you're passionate about and believe in, and I loved the way EMC focused on selling the strengths of its own solutions rather than the weaknesses of the other guy's.
Now fast forward a few years, and some of the concepts we used to demo are coming to light as actual products, not just from EMC but from myriad vendors in multiple flavors. Software-defined networking, mobile computing and storage tiering are all available as off-the-shelf components ready to be bolted onto infrastructures large and small. But one of the coolest of these is "long-distance vMotion", otherwise known as stretched clusters.
The idea behind stretched clusters is simple. You get two sites with a really fat pipe between them and put half your compute nodes on one side and half on the other. You put all these nodes in a single cluster that spans both sites and bingo! You can vMotion between sites and enable VMware HA for increased availability. And you know what? It works! And it does exactly what the vendor described!
The key that really unlocked the potential of this capability was the introduction of a storage virtualization layer that abstracted away the details of the underlying storage, similar to what products like FalconStor did years ago. But unlike previous products, the new generation could not only hide the details of heterogeneous arrays behind the presentation layer; it could also replicate data between sites and use advanced caching so that the virtualization layer believed the storage it was accessing was local to the current site. Even if your data was in fact at the other end of the WAN pipe, the new tools could produce response times low enough and throughput high enough that having the execution context at one site and the disk files at another wasn't a problem anymore.
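To make the caching idea concrete, here's a toy sketch in Python. This is not any vendor's actual implementation — the classes, latency figures, and behavior are all invented for illustration — but it shows the basic mechanism: reads that hit the local cache skip the WAN round trip entirely, so most I/O looks local even though the authoritative copy is remote.

```python
# Toy model of a caching storage-virtualization layer (hypothetical classes;
# latency numbers are assumptions, not measurements of any real product).

class RemoteArray:
    """The array at the far end of the WAN pipe."""
    WAN_LATENCY_MS = 5.0  # assumed round-trip cost of a remote read

    def __init__(self, blocks):
        self.blocks = blocks

    def read(self, lba):
        # Every read from here pays the full WAN round trip.
        return self.blocks[lba], self.WAN_LATENCY_MS


class CachingVirtualizer:
    """Presentation layer that makes remote storage look local."""
    LOCAL_LATENCY_MS = 0.2  # assumed cost of a local cache hit

    def __init__(self, backend):
        self.backend = backend
        self.cache = {}

    def read(self, lba):
        if lba in self.cache:  # cache hit: no WAN trip at all
            return self.cache[lba], self.LOCAL_LATENCY_MS
        data, cost = self.backend.read(lba)  # miss: fetch from remote array
        self.cache[lba] = data               # keep a local copy for next time
        return data, cost


array = RemoteArray({0: b"boot", 1: b"data"})
layer = CachingVirtualizer(array)

_, first_read = layer.read(0)   # cold: pays the WAN penalty
_, second_read = layer.read(0)  # warm: served locally
print(first_read, second_read)  # 5.0 0.2
```

The point of the sketch is the asymmetry: once the working set is warm, the compute side stops noticing where the data physically lives — which is exactly why stretched clusters feel seamless in steady state.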
Awesome… so why write a blog post?
Because the devil’s in the details and I’m seeing folks who glossed over that part.
First, the context. I've had two clients in the last quarter looking to implement stretched clusters either in place of SRM or alongside it. That's led to deeper discussions about disaster recovery versus disaster avoidance and where each technology fits. The second half of that conversation is about the details. Lots of details...
SRM or Stretched Clusters. Pick one
Right off the bat, it's important to understand that the design paradigms of Site Recovery Manager and stretched clusters are mutually exclusive, meaning you protect a workload with one technology or the other, but not both simultaneously. That's because the SRM architecture dictates that each site has its own vCenter+SRM pair and the two sites communicate with each other. Clusters, in contrast, require that all their nodes be under the management of a single vCenter instance. Think about it... what's the boundary of vMotion? The Datacenter! Do Datacenter objects in vSphere span multiple vCenters? NO! So the nodes in a cluster have to live entirely within a single vCenter, which means a stretched cluster could theoretically be used with SRM as the resources on either the "logical" protected or recovery side, but we can't divide it and make half the cluster appear as protected and half as recovery. Once inside a vCenter instance, those resources are atomic and indivisible from an SRM perspective.
Disaster Recovery versus Disaster Avoidance
Stretched clusters and long-distance vMotion work wonderfully for clients who see the disaster coming. If you're standing on the breakwater watching the storm clouds roll in, long-distance vMotion can definitely save your bacon. What it typically can't do is recover well from a smoking hole in the ground you didn't know was coming. And that's because the typical stretched cluster lacks a few key components.
Most stretched clusters don't maintain full copies of data at both sites simultaneously. Your plan may be different, but consider for a moment the implications:
- Synchronous replication is still subject to the laws of physics, meaning the greatest practical distance between two sites is still about a hundred miles (the answer varies +/- 50% with the scenario). You may be able to mask the penalty during normal use via advanced caching algorithms, but if the data hasn't made it across when the disaster happens, you still don't have an RPO of zero.
- The enterprise must maintain a second array with sufficient capacity and resist the temptation to consume part of that capacity. While many organizations commit to this concept at the time of purchase, I see time and time again that storage or VMware admins have consumed that space because 'they had no choice'.
- Migrating the execution context from a server at one site to a server at the second site doesn't necessarily migrate the 'authoritative' data source along with it. In fact, unless you explicitly configure the storage to follow the access point across sites, there's a distinct possibility that nothing changes and all storage I/O operations still occur at the original array. Now, before anyone screams that their product doesn't work like this, let me point out that the behavior is product and implementation dependent, so proof-of-concept testing is the only real way to know how your environment will react — and none of the clients I've been to in the last year have conducted a PoC.
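The hundred-mile figure in the first point above is just speed-of-light arithmetic, and it's worth seeing why. Light in fiber travels at roughly two-thirds of c, about 200 km per millisecond; a synchronous write can't be acknowledged until it has made the round trip. A back-of-the-envelope sketch (the propagation constant is the only assumption, and real links add switch and protocol overhead on top):

```python
# Best-case round-trip propagation delay for a synchronous write ack.
# Assumption: light in fiber covers ~200 km per millisecond (~2/3 c).
# Real-world latency is higher once equipment and protocol overhead are added.
SPEED_IN_FIBER_KM_PER_MS = 200.0

def round_trip_ms(distance_km: float) -> float:
    """Time before a remote array can acknowledge a synchronous write."""
    return 2 * distance_km / SPEED_IN_FIBER_KM_PER_MS

for km in (50, 160, 500):  # 160 km is roughly 100 miles
    print(f"{km:4d} km -> {round_trip_ms(km):.1f} ms round trip")
```

At 160 km every single write picks up 1.6 ms of unavoidable latency before any array or network overhead, which is why synchronous replication distances top out around the hundred-mile mark.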
Ok, so let's assume for a minute that your organization has met all the basic requirements: two arrays, all the capacity reserved, replication enabled, and a storage virtualization layer that masks the current data location from the stretched cluster. You've cleaned up your networking by stretching the guest networks and the vSphere management networks across sites, and you've confirmed everything works exactly as you want. Now can you ditch SRM?
Most folks want to ditch SRM in favor of stretched-cluster HA because SRM is a pain to configure and maintain. Every time workloads are added or removed, an admin has to go into the tool, update the protection groups and maybe edit the recovery plan. SRM used to play poorly with Storage vMotion and Storage DRS, so architects had to pick one or the other and live with their choice, and even with that fixed, folks still complain about having to maintain this "thing they never use". My recent experience in the field is that customers by and large perceive stretched clusters as a set-and-forget solution to disaster recovery.
I don’t believe that’s the case at all.
Remember that SRM is primarily an orchestration engine, meaning you define what's going to happen BEFORE it happens. VMware HA, on the other hand, has almost no orchestration abilities. Oh sure, you can specify which workloads to recover if HA is invoked, but can you do any of the following?
- Build dependencies such that the application server doesn’t start until the database is up?
- Build Recovery plans to bring the highest priority tier one workloads online before recovering the secondary systems?
- Shut down or suspend workloads already present at the recovery site to make space for the about-to-be-recovered production systems?
- Re-address workloads and connect them to different port groups if the enterprise elects not to stretch its networks across sites?
- Produce an audit-acceptable test result which will prove the business requirements have been met?
If you haven't addressed these very real requirements, I suspect that any actual disaster would result in general panic and mayhem as engineers attempt to sort out these issues on the fly. Furthermore, I would respectfully suggest that even if you could architect your environment to meet all of the above criteria in advance of a disaster, the time and effort involved in building and maintaining the resulting fragile, complex, Rube Goldberg-style infrastructure would far exceed the effort required to implement and maintain an SRM-based solution.
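The first two items in that list — dependencies and tiered recovery priority — boil down to computing a boot order over a dependency graph, which is exactly the kind of thing an orchestration engine does and plain HA doesn't. A minimal sketch using Python's standard-library `graphlib` (the workload names and dependencies are hypothetical, and real SRM recovery plans are far richer than a topological sort):

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workloads and their start-up dependencies:
# each key must wait for everything in its value set to be up first.
# HA alone has no notion of "don't start the app tier until the DB answers".
dependencies = {
    "web-tier":   {"app-server"},
    "app-server": {"database"},    # app needs the database up first
    "database":   {"dns"},         # everything needs name resolution
    "dns":        set(),
}

boot_order = list(TopologicalSorter(dependencies).static_order())
print(boot_order)  # dns comes first, web-tier last
```

An orchestration engine precomputes and tests this ordering before the disaster; without one, your engineers are deriving it by hand at 3 a.m. while the business waits.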
Does that mean solutions such as Peer Motion and VPLEX have been oversold by sales teams? Not at all. These technologies have tremendous benefits and can dramatically improve availability and performance. They can most definitely be used for both disaster avoidance and disaster recovery, but only if architected properly. Simply drop-kicking them into the datacenter and walking away isn't likely to produce the desired result, which is why I advocate that clients commit to a bake-off-style proof of concept: not for the purpose of choosing stretched clusters OR Site Recovery Manager as the winner, but so that they understand factually what these two technologies do, what their requirements are, and how they are complementary rather than competing solutions.