Disaster Recovery (DR) is very quickly becoming a common use case for leveraging the cloud, especially for companies that aren’t currently hosting any critical apps or services there. To some degree, DR is an organization’s insurance policy; you hope you never have to use it, but you definitely want to have it in an emergency. The challenge for most companies, however, is that they must procure additional hardware and software to facilitate recovery of systems in a disaster. Oh, and usually a second remote site (possibly renting space in a colo facility), plus multiple internet or fiber circuits for data replication. This insurance policy has a pretty steep premium!
The cloud offers so many new and exciting ways to consume IT services, including the ability to only pay for those services as you use them. This fact alone makes DR in the cloud extremely appealing. Instead of using precious capital expenditure dollars on redundant hardware, software, circuitry and facilities, organizations can now simply pay just for the amount of storage replicated and the compute cycles used during a DR test or failover. It’s like you spent 15 minutes to save $15,000 a month on your policy!
Recently, a customer of ours decided this type of DR strategy was just what they needed, so they came to us for help getting DR in Azure set up for them. But along with a strong desire to begin using the cloud, they also brought with them some interesting challenges.
Their biggest challenge: The need to not only replicate systems, but also IP address space
One of the foundations of Disaster Recovery is the ability to leverage DNS so that systems can communicate with each other once they’ve been restored in a DR site. This is primarily because DR locations will normally leverage a different IP block (CIDR) than that of the primary site. So, when servers are recovered into the DR site, they will receive new IP addresses assigned from the new DR CIDR block. If applications are configured to use DNS properly, they’ll not only be able to update the DNS server records with their new address, they’ll also be able to look up other servers and their addresses from those same DNS servers. This greatly simplifies the failover process, allowing you to quickly bring up systems in a completely different CIDR range.
But what if your applications don’t leverage DNS very well? What if they either are unable to use DNS, or the applications use hard-coded IP addresses? What if you aren’t sure? This is a conundrum many companies are faced with when planning their DR. If you answered yes to one of those questions, it normally means that you will need to configure your DR network with the same IP CIDR as your primary data center network. On the surface, this may not seem like a big issue, but there is one big reason why it very well might be.
Duplicate IPs = Network Nightmare
Users typically have a standard connection method to the data center where applications are hosted. This connection method might leverage redundant fiber circuits, VPN tunnels or even an MPLS WAN. In the case of a DR situation, we want to restore functionality as quickly as possible, so it makes sense to connect the DR network to users in the same standard way. However, since these topologies are typically interconnected and routable, having identical CIDR blocks on the same network is a well-known network no-no: routers would have two different destinations for the same addresses.
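To make the duplicate-range problem concrete, here is a minimal Python sketch that checks an address plan for overlapping CIDR blocks. It uses only the standard library. The network names and ranges are hypothetical and were not taken from the customer's environment.

```python
import ipaddress


def find_overlaps(networks):
    """Return sorted pairs of network names whose CIDR blocks overlap."""
    parsed = {name: ipaddress.ip_network(cidr) for name, cidr in networks.items()}
    names = sorted(parsed)
    conflicts = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if parsed[first].overlaps(parsed[second]):
                conflicts.append((first, second))
    return conflicts


# Hypothetical address plan: the DR network reuses the on-premise range.
routed_networks = {
    "on_prem": "10.10.0.0/16",
    "dr_vnet": "10.10.0.0/16",
    "branch_office": "192.168.50.0/24",
}

print(find_overlaps(routed_networks))
```

Any pair this reports must not be reachable from the same routed network at the same time.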
DR Dilemma Demystified
So, the dilemma is this: How do we keep users quickly connected to both the primary and the DR network simultaneously without exposing them to duplicate IP ranges?
The answer: A transitive network, a DR network, and a vNet peer.
Without going too deep into how networking in Azure works, I’ll give you a small glimpse into how we solved this problem for our customer in need.
Step 1: Create an Azure vNet to match the on-premise network
The first thing we did for our customer was to create a new vNet in their Azure subscription. This new vNet used the same CIDR block that they were using in their on-premise data center. Azure allows customers to create their own virtual networks (vNets) that leverage standard RFC 1918 private IP address ranges (https://en.wikipedia.org/wiki/Private_network). Any number of those networks can exist for customers in Azure because A) they are Layer 3 overlay networks, not Layer 2, and B) none of them are routable to each other over Microsoft’s Azure hosted network. The only way customers can gain access to those networks is by assigning hosted VMs publicly accessible IP addresses, or by establishing VPN or private circuit (ExpressRoute) connectivity to the network. So, by simply creating a DR vNet in Azure with the same IP range for hosting VMs, but not connecting it to the customer’s on-premise network, we’ve done no harm.
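As a quick sanity check for this step, the sketch below verifies that a planned DR range falls inside one of the three RFC 1918 private blocks and that it matches the on-premise range exactly. The function names and the example ranges are assumptions made for illustration. This is a planning aid, not a call into Azure.

```python
import ipaddress

# The three private IPv4 blocks defined by RFC 1918.
RFC1918_BLOCKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(cidr):
    """True if the IPv4 CIDR block sits entirely inside an RFC 1918 range."""
    network = ipaddress.IPv4Network(cidr)
    return any(network.subnet_of(block) for block in RFC1918_BLOCKS)


def dr_matches_on_prem(on_prem_cidr, dr_cidr):
    """True if the DR vNet would use exactly the on-premise address space."""
    return ipaddress.IPv4Network(on_prem_cidr) == ipaddress.IPv4Network(dr_cidr)


# Hypothetical values for illustration only.
on_prem = "10.10.0.0/16"
dr_vnet = "10.10.0.0/16"

print(is_rfc1918(dr_vnet), dr_matches_on_prem(on_prem, dr_vnet))
```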
Step 2: Create a Transitive vNet
The Transitive vNet in Azure is another new vNet in the customer’s subscription, also leveraging an RFC 1918 private IP range, but not in use anywhere else in the customer’s network. This Transitive vNet will not host any VMs in a DR scenario and, for all intents and purposes, will be void of any resources. The Transitive network will operate like a connection endpoint to Azure from the on-premise network. Any locations needing access to failed-over resources during a DR will establish a connection to this vNet, either through a VPN tunnel or an ExpressRoute direct circuit to Azure.
The main benefit of the Transitive network is that it allows users to be connected to both the on-premise network and Azure simultaneously without running into an IP CIDR block conflict. Also, because the Transitive vNet is basically just a termination point in the network, and won’t host any resources, it can be very small in size.
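Choosing the Transitive range comes down to finding a small block that nothing else uses. The sketch below walks a candidate pool and returns the first subnet that does not overlap any range already in use. The pool, the in-use list and the /27 size are all hypothetical. The real minimum size depends on what the gateway you deploy requires.

```python
import ipaddress


def pick_transit_block(candidate_pool, in_use, prefix_len=27):
    """Return the first subnet of the pool that overlaps nothing in use.

    Returns None when the pool has no free subnet of the requested size.
    """
    pool = ipaddress.ip_network(candidate_pool)
    used = [ipaddress.ip_network(cidr) for cidr in in_use]
    for subnet in pool.subnets(new_prefix=prefix_len):
        if not any(subnet.overlaps(block) for block in used):
            return subnet
    return None


# Hypothetical ranges: the on-premise/DR block plus one branch allocation.
ranges_in_use = ["10.10.0.0/16", "172.16.0.0/26"]

print(pick_transit_block("172.16.0.0/24", ranges_in_use))
```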
But you may be asking yourself, “If my users are always connected to an empty Transitive network, how do I get them connected to my DR network in the case of a disaster?”
The answer is vNet peering. A vNet peer allows two disparate Azure vNets to communicate and route traffic between themselves. The peered vNets do not need matching or similar CIDR ranges; their address spaces simply cannot overlap, which is one more reason the Transitive vNet uses a range not found anywhere else. The peer enables the virtual networks to appear as one, and traffic is routed through the Microsoft backbone infrastructure. So, in our example, once the Transitive vNet and DR vNet are peered together, traffic is seamlessly routed between the networks. And if remote offices or locations have active VPN connections to the Transitive vNet, they can also reach systems in the DR network once the vNet peer is established.
When to peer for DR
But this brings up an interesting point: the peer between the Transitive vNet and the DR vNet should not be established until it is confirmed that the on-premise network is no longer online or available to remote or on-site users. If the vNet peer were established between DR and Transitive while the on-premise network was still online, any users connected to both on-premise and Azure would receive conflicting routes to the same IP range. This is because Azure uses BGP route propagation to advertise what routes are available to users connected to it.
BGP, however, is also what makes the failover process so easy to manage. Automatic route propagation via BGP means that no static routes have to be created or managed and then updated in the case of DR. By simply invoking a vNet peer from the DR vNet to the Transitive vNet, users connected to the Transitive vNet will automatically receive new routes to get to the same IP range, even though it is being hosted in Azure instead of on-premise. Pretty nifty, eh?
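The effect of the peer on what a remote office learns can be shown with a toy model. This is not BGP and it does not talk to Azure. It simply lists, for each prefix, the next hops a remote office would hear about in each state. The CIDR values and next-hop labels are hypothetical.

```python
def advertised_routes(on_prem_online, peer_connected,
                      shared_cidr="10.10.0.0/16",
                      transit_cidr="172.16.0.64/27"):
    """Map each prefix to the next hops a remote office would learn."""
    # The Transitive vNet is always reachable through the Azure connection.
    routes = {transit_cidr: ["azure-gateway"]}
    if on_prem_online:
        routes.setdefault(shared_cidr, []).append("on-prem-datacenter")
    if peer_connected:
        # Peering exposes the DR vNet's range through the same Azure connection.
        routes.setdefault(shared_cidr, []).append("azure-gateway")
    return routes


def conflicting_prefixes(routes):
    """Prefixes advertised from more than one place."""
    return [prefix for prefix, hops in routes.items() if len(hops) > 1]


normal_ops = advertised_routes(on_prem_online=True, peer_connected=False)
dr_event = advertised_routes(on_prem_online=False, peer_connected=True)
premature = advertised_routes(on_prem_online=True, peer_connected=True)

print(conflicting_prefixes(normal_ops))
print(conflicting_prefixes(dr_event))
print(conflicting_prefixes(premature))
```

Only the third state, where the peer is connected while on-premise is still online, reports a conflicting prefix.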
So, as discussed, keeping the vNet peer between DR and Transitive disconnected during normal operations, and ONLY connecting it during an actual DR event, is crucial to ensuring smooth operations. There are several methods of establishing the peer during a DR event, including manually via the portal or through a PowerShell script. The key is leveraging security and identity management to ensure that A) only those with the knowledge of what the peer does can connect it and B) it is treated as a “break glass in case of emergency” type of action.
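One way to picture the “break glass” rule is as a guard that runs before any script connects the peer. The sketch below is a hypothetical pre-check, with made-up operator names. In practice the first condition would be enforced by your identity and access controls rather than a hard-coded set.

```python
from dataclasses import dataclass

# Hypothetical list of people allowed to connect the peer.
BREAK_GLASS_OPERATORS = frozenset({"dr-lead", "network-oncall"})


@dataclass(frozen=True)
class PeerRequest:
    requested_by: str
    on_prem_confirmed_offline: bool


def authorize_peering(request):
    """Return (allowed, reason) for a request to connect the DR peer."""
    if request.requested_by not in BREAK_GLASS_OPERATORS:
        return False, "requester is not a break-glass operator"
    if not request.on_prem_confirmed_offline:
        return False, "on-premise network has not been confirmed offline"
    return True, "peering may be connected"


print(authorize_peering(PeerRequest("dr-lead", on_prem_confirmed_offline=True)))
print(authorize_peering(PeerRequest("dr-lead", on_prem_confirmed_offline=False)))
```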
I hope that this post either gave you some specific actions you can take in configuring your DR networking in Azure, or at least sparked some creative ways you can duplicate IP addressing between on-premise and the cloud. And we’re always here if you need beginner or advanced DR consulting for the cloud!