MVP Technical Series
Surviving Disaster: Building Site
Resiliency for Exchange 2013
Manu Philip (Exchange Server MVP)
MVP Technical Series
• Today – Surviving Disaster: Building Site Resiliency for Exchange 2013,
Manu Philip
• 14th Mar – Lync 2010 Design Essentials, David
• 4th Apr – Ready for Your Cloud: Windows Server 2012’s Way Designed for
You, Erdal
Manu Philip
Microsoft Most Valuable
Professional (MVP): Exchange Server
• Microsoft Most Valuable Professional
(MVP) for Exchange (for the past 3
• Exchange Solution Expert, contracting
with Global enterprise clients to assist
in their Exchange deployments
• Founder of:
• "Moderator" of most of the Microsoft
TechNet Community Exchange Server
• MCC (Microsoft Community
Contributor) of Microsoft TechNet
Community Exchange Server Forums
The fundamental success of any business depends on a highly
resilient messaging system
Planning for
Resiliency in
The design involves identifying a highly resilient solution that meets the
SLA targets.
Recovery Time Objective (RTO): How long it takes to restore the service to
the users.
Recovery Point Objective (RPO): How current the data is after the recovery
operation has completed.
SLA factors determine the site resilience questions:
If the primary datacenter fails, what level of service is required?
After the failure, can users manage with only messaging services for a time
period?
How many users should be covered by the resiliency plan?
How will the data be made available to the users?
How soon should the standby datacenter be activated?
How will the service be moved back to the primary datacenter?
General Considerations:
Servers in the secondary datacenter must be capable of hosting the failed
The necessary network configuration must be in place to support the
datacenter switchover.
A testing method should also be defined in the SLA. This should be validated
Site resilience is much better in Exchange 2013 because it has
been simplified
Exchange 2013 has undergone significant architectural changes
from the ground up in order to enhance its Site Resilience
Site resiliency features can be classified under:
Storage/Database Architecture
Client Access Architecture
Transport/Routing Architecture
With the namespace simplification, Exchange 2013 provides
new site resilience options, such as the ability to use a single
global namespace.
In addition, Exchange 2013 also provides the ability to
configure the messaging service for automatic failover in
response to failures that required manual intervention in
Exchange 2010.
What’s in
2013 to
• Multi-site DAG configuration
• Datacenter Activation Coordination (DAC) mode
• Lagged Replication Copies
• Single Global namespace (DNS)
• Safety Net
• Multi-site SSL
New DAG version number: 2.0.
Database Availability Group (DAG) provides both High Availability
and Site Resilience features.
Multi-site DAG
• DAG Improvements:
– The configuration of a DAG is now easier, and
its stability has been improved as well.
– The transaction log creation code has
been completely rewritten.
– Many new and enhanced PowerShell cmdlets are
available to perform DAG operations in
various situations.
– The DAG relies on Windows Server failover
clustering and uses a quorum witness to act
as a tie-breaker.
Quorum for
Server 2013
Quorum can be treated as a voting process in which a majority of
voting members must be present to make a decision. For a DAG, the
decision is essentially whether the DAG should be online or
offline.
Because a majority of votes is required for quorum, one of two
quorum models is used, depending on how many DAG members you
have.
For a DAG with an odd number of members, the Node Majority
quorum mode is used.
For a DAG with an even number of members, the Node and File
Share Majority quorum mode is used. This mode involves an
additional server referred to as the File Share Witness. It is typically
another Exchange server located in the same site as the DAG
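The majority rule described above can be sketched in a few lines of Python. This is only an illustrative model of the vote counting, not how Windows failover clustering is implemented; the function names are hypothetical.

```python
def quorum_model(member_count: int) -> str:
    """Return the quorum model used for a DAG with this many members."""
    if member_count < 1:
        raise ValueError("a DAG needs at least one member")
    if member_count % 2 == 1:
        return "Node Majority"
    return "Node and File Share Majority"


def has_quorum(member_count: int, members_online: int,
               witness_online: bool = True) -> bool:
    """Check whether the votes that are present form a majority.

    The File Share Witness only counts as a voter when the DAG has an
    even number of members.
    """
    uses_witness = member_count % 2 == 0
    total_votes = member_count + (1 if uses_witness else 0)
    votes_present = members_online
    if uses_witness and witness_online:
        votes_present += 1
    # A majority is strictly more than half of all possible votes.
    return votes_present > total_votes // 2
```

For a four-member DAG, two surviving members keep quorum only while the witness is reachable: `has_quorum(4, 2, True)` is true, `has_quorum(4, 2, False)` is false.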
Multi-site DAG
A DAG deployed for site resilience will span multiple datacenters.
The objectives of a Database Availability Group deployed for site
resilience are usually to provide availability of mailbox services
after the complete failure of the primary datacenter. In other
words, a true disaster.
• Update-MailboxDatabaseCopy
• Set-MailboxDatabase
• Add-DatabaseAvailabilityGroupServer
• Get-MailboxDatabaseCopyStatus
Split-Brain Syndrome: DAC mode prevents a DAG that spans multiple
datacenters from automatically mounting databases after an outage,
such as a network failure, in which DAG members can't receive
heartbeat signals from each other.
Example: The first datacenter contains two DAG members and the
witness server, and the second datacenter contains two other DAG
members with an alternate witness server. If the first datacenter is
restored without network connectivity to the second datacenter, the
active databases within the DAG may enter a split-brain condition.
DAC mode controls the database mount behavior of a DAG at startup
after a catastrophic failure.
DAC mode is disabled by default.
DAC uses a protocol called Datacenter Activation Coordination
Protocol (DACP).
How DAC works:
• After a catastrophic failure, DACP is used to determine the current
state of the DAG and whether Active Manager should attempt to
mount the databases.
• Active Manager stores a bit in memory (either a 0 or a 1) that tells the
DAG whether it's allowed to mount local databases that are assigned
as active on the server.
• When a DAG is running in DAC mode, each time Active Manager starts
up the bit is set to 0, meaning it isn't allowed to mount databases.
• In DAC mode, the server must try to communicate with all other
members of the DAG that it knows of, to get another DAG member to
tell it whether it can mount local databases that are assigned as
active to it.
• If another server responds that its bit is set to 1, it means servers are
allowed to mount databases, so the starting server sets its own bit to 1
and mounts its databases.
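The DACP bit exchange above can be illustrated with a small Python model. It is a simplified sketch that covers only the rule stated in these bullets (a starting member may mount once a reachable member reports a bit of 1); the server names and the `try_mount` helper are hypothetical.

```python
def try_mount(server: str, bits: dict, reachable: set) -> bool:
    """Decide whether `server` may mount its active databases.

    bits      -- maps each DAG member name to its in-memory DACP bit (0 or 1)
    reachable -- names of the DAG members that `server` can contact right now
    """
    for other in reachable:
        if other != server and bits.get(other) == 1:
            # Another member already has permission, so this one may mount too.
            bits[server] = 1
            return True
    # Nobody reachable has a bit of 1: stay dismounted.
    return False


# Split-brain example: MBX1 and MBX2 restart in the first datacenter (bit 0),
# while MBX3 and MBX4 in the second datacenter are already serving (bit 1).
bits = {"MBX1": 0, "MBX2": 0, "MBX3": 1, "MBX4": 1}
without_wan = try_mount("MBX1", bits, {"MBX2"})               # WAN still down
with_wan = try_mount("MBX1", bits, {"MBX2", "MBX3", "MBX4"})  # WAN restored
```

While the WAN link is down, MBX1 can only reach MBX2, whose bit is also 0, so nothing is mounted in the restored datacenter and a split-brain condition is avoided.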
DAC mode also enables the use of the built-in site resilience
cmdlets used to perform datacenter switchovers.
Enabling DAC mode
DAC mode can be enabled only from the Exchange Management Shell,
by using the Set-DatabaseAvailabilityGroup cmdlet.
For example:
Set-DatabaseAvailabilityGroup -Identity DAG2 -DatacenterActivationMode DagOnly
o A lagged database copy is one that is not updated by replaying
transactions as they become available. Instead, the transaction
logs are kept for a certain period and are then replayed.
o A lagged database copy remains a preset time period behind the
live database (up to 14 days) and provides a recovery option in
the event that the active mailbox copy encounters corruption.
o An organization can enhance the resiliency of its database
solution by employing a combination of non-lagged and
lagged database copies in a DAG.
o The best thing about a DAG is that you can achieve resilience
against failure by creating multiple copies of databases that
Exchange will keep up to date through log shipping. However,
some vendors advise that the second passive copy should
be lagged.
• Lagged copy enhancements include integration with
Safety Net (a feature similar to the transport dumpster in
Exchange 2010). Activating a lagged database copy
becomes significantly easier because it uses Safety Net.
Lagged copy
in Exchange
For example, consider a lagged copy that has a 2-day
replay lag. In that case, you would configure Safety
Net for a period of 2 days. If you encounter a
situation in which you need to use your lagged copy:
• Mount the lagged copy
• This will trigger an automatic request to
Safety Net to redeliver the last two days of mail
• You get the last two days of mail, minus the data
ordinarily lost on a lossy failover.
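The sizing rule in this example (Safety Net must hold mail at least as long as the replay lag) can be written as a quick sanity check. This is a minimal sketch with hypothetical function names; it does not read any real Exchange configuration.

```python
from datetime import datetime, timedelta


def safety_net_covers_lag(replay_lag: timedelta,
                          safety_net_hold: timedelta) -> bool:
    """True if Safety Net keeps messages long enough to refill a lagged copy."""
    return safety_net_hold >= replay_lag


def redelivery_window_start(activated_at: datetime,
                            replay_lag: timedelta) -> datetime:
    """Earliest point in time from which mail must be redelivered."""
    return activated_at - replay_lag
```

With a 2-day replay lag, a 2-day hold time passes the check, while a 1-day hold time would leave a one-day gap that redelivery could not fill.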
Lagged copy
in Exchange
• Lagged copies can now care for themselves by invoking
automatic log replay to play down the log files in certain
• When a low disk space threshold is reached
• When the lagged copy has physical corruption and needs
to be page patched
• When there are fewer than three available healthy copies
(active or passive) for more than 24 hours
• Lagged copy play down behavior is disabled by default, and
can be enabled by running the following command.
Set-DatabaseAvailabilityGroup <DAGName> -ReplayLagManagerEnabled $true
• After being enabled, play down occurs when there are fewer
than 3 copies. You can change the default value of 3, by
modifying the following registry value :
• To enable play down for low disk space thresholds:
• The transport dumpster in Exchange 2010 has been improved
in Exchange 2013 and is now called Safety Net.
• Safety Net itself is redundant, and is no longer a single point of
failure. Safety Net is a queue that stores copies of
messages that were successfully processed by the server, in
case the processed message is corrupted in transit or fails to
reach its destination. A Shadow Safety Net is a redundant copy
of the Primary Safety Net, stored on another Mailbox
server in the same site. It provides further redundancy and takes
over when the Primary Safety Net is unavailable.
• You can specify how long Safety Net stores copies of the
successfully processed messages before they expire and are
automatically deleted. The default is 2 days.
• For Mailbox servers that don't belong to a DAG, Safety Net
stores copies of the delivered messages on other Mailbox
servers in the local Active Directory site.
• You can't specify a maximum size limit for Safety Net. You can
only specify how long Safety Net stores messages before
they're automatically deleted.
Parameter: SafetyNetHoldTime on …
Default: 2 days
Description: The length of time successfully processed primary messages
are stored in Primary Safety Net, and acknowledged shadow messages are
stored in Shadow Safety Net.
Parameter: ReplayLagTime on Set-MailboxDatabaseCopy
Default: Not configured
Description: The amount of time that the Microsoft Exchange Replication
service should wait before replaying log files that have been copied to the
passive database copy.
Parameter: …out on Set…
Default: 2 days
Description: How long a message can remain in a queue before it expires.
Parameter: …bled on Set…
Default: $true
Description: $true enables shadow redundancy on all transport servers in
the organization. $false disables shadow redundancy on all transport
servers in the organization.
1. Mailbox01 receives a message from an SMTP server.
2. Mailbox01 initiates a new SMTP session to Mailbox03, and Mailbox03
makes a shadow copy of the message.
3. The Mailbox Transport service delivers the message to the local
mailbox database.
4. Mailbox01 queues a discard status for Mailbox03 that indicates
the primary message was successfully processed, and Mailbox01
moves a copy of the primary message into Primary Safety Net.
5. Mailbox03 periodically polls Mailbox01 for the discard status of the
primary message.
6. When Mailbox03 determines that Mailbox01 successfully processed the
primary message, Mailbox03 moves the shadow message into the Shadow
Safety Net.
7. The message is retained in Primary Safety Net and Shadow
Safety Net until it expires based on the configured hold time.
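The numbered flow above can be traced with a toy Python model. It is only a sketch of the bookkeeping between the two servers, under the assumption that queues can be represented as dictionaries; the `TransportServer` class and its attributes are hypothetical and do not correspond to any Exchange API.

```python
class TransportServer:
    """Toy model of one Mailbox server's transport queues."""

    def __init__(self, name: str):
        self.name = name
        self.shadow_queue = {}        # shadow copies held for another server
        self.discard_status = {}      # message id -> True once processed here
        self.primary_safety_net = {}  # copies of messages this server delivered
        self.shadow_safety_net = {}   # shadow copies confirmed as delivered

    def receive(self, msg_id: str, body: str, shadow: "TransportServer") -> None:
        # Steps 1-2: accept the message and let the shadow server keep a copy.
        shadow.shadow_queue[msg_id] = body
        # Step 3: delivery to the local mailbox database (not modeled here).
        # Step 4: queue the discard status and keep a copy in Primary Safety Net.
        self.discard_status[msg_id] = True
        self.primary_safety_net[msg_id] = body

    def poll(self, primary: "TransportServer") -> None:
        # Steps 5-6: ask the primary server which messages were processed and
        # move those shadow copies into Shadow Safety Net.
        for msg_id in list(self.shadow_queue):
            if primary.discard_status.get(msg_id):
                self.shadow_safety_net[msg_id] = self.shadow_queue.pop(msg_id)
```

After `mailbox01.receive(...)` followed by `mailbox03.poll(mailbox01)`, the same message sits in Primary Safety Net on Mailbox01 and in Shadow Safety Net on Mailbox03, which is the state described in step 7. Expiry after the hold time is left out of the model.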
Single Global Namespace
• Datacenter switchovers in Exchange 2010 are operationally
complex because recovery of mailbox data (DAG) and client
access (namespace) are tied together.
• In Exchange 2010, if you lost all or a significant portion of your
CAS, or the VIP for the array, or a significant portion of your DAG,
you needed to do a datacenter switchover.
• In Exchange 2013, a client can receive multiple IP
addresses from DNS for a given FQDN. Since almost all
client access in Exchange 2013 now relies on HTTP (Outlook,
Outlook Anywhere, EAS, EWS, OWA, and EAC), if the first IP
address fails, the HTTP client tries the next one,
and so on. If a virtual IP (VIP) of a CAS array fails, the client
can automatically connect to the other IPs to access the same
service in a matter of seconds, instead of waiting minutes for
DNS to fail over. For example, if a client tries one address and it
fails, it waits about 20 seconds and then tries the next one in the
list. Thus, if you lose the VIP for the Client Access server array,
recovery for the clients happens automatically, and in about
21 seconds.
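The client-side behavior described here, trying each address returned by DNS in turn, can be sketched as follows. This is a simplified model rather than real HTTP stack code: the `connect` helper is hypothetical, the addresses come from the reserved documentation ranges, and the 20-second wait per failed address is taken from the text above.

```python
def connect(addresses, is_up, timeout_s=20):
    """Try each address in DNS order.

    Returns (address_used, seconds_spent_waiting_on_failed_addresses).
    address_used is None if every address failed.
    """
    waited = 0
    for addr in addresses:
        if is_up(addr):
            return addr, waited
        # The attempt timed out; move on to the next address in the list.
        waited += timeout_s
    return None, waited


# Hypothetical VIPs for the same namespace in two datacenters.
PRIMARY_VIP = "192.0.2.10"
SECONDARY_VIP = "198.51.100.10"

used, waited = connect([PRIMARY_VIP, SECONDARY_VIP],
                       is_up=lambda addr: addr != PRIMARY_VIP)
```

With the primary VIP down, the client ends up on the secondary VIP after one timeout, without any DNS change or namespace switchover.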
[Diagram: if one VIP fails, clients connect to the alternate VIP in the
secondary datacenter; removing an IP from DNS puts you in control.]
• If you lose your CAS array, you don't need to perform a
datacenter switchover. Clients are automatically redirected to
a second datacenter that has operating Client Access servers,
which remains unaffected by the outage (because you don't
do a switchover). Instead of working to recover service, the
service recovers itself and you can focus on fixing the core
• If you lose the load balancer in your primary site, you simply
turn it off (or maybe turn off the VIP) and repair or replace it.
Clients that aren't already using the VIP in the secondary
datacenter will automatically fail over to the secondary VIP
without any change of namespace, and without any change in
DNS. Not only does that mean you no longer have to perform
a switchover, but it also means that none of the time normally
associated with a datacenter switchover recovery is spent.
In Exchange 2013, you don't need to do a switchover because you
get fast failover (about 20 seconds) of the namespace between VIPs.
Example DAG Design: Because you can fail over the namespace
between datacenters, all that's needed to achieve a datacenter
failover is a mechanism for failover of the Mailbox server role
across datacenters. To get automatic failover for the DAG, you
simply architect a solution where the DAG is evenly split between
two datacenters, and then place the witness server in a third
location so that it can be arbitrated by DAG members in either
datacenter, regardless of the state of the network between the
datacenters that contain the DAG members.
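Why the witness location matters in this design can be checked with a short vote count. This is a minimal sketch, assuming an even total number of DAG members so that the witness holds one vote; the site names and the `survives_site_loss` helper are hypothetical.

```python
def survives_site_loss(members_by_site: dict, witness_site: str,
                       failed_site: str) -> bool:
    """True if the remaining voters still form a majority after a site fails.

    members_by_site -- number of DAG members in each datacenter
    witness_site    -- where the witness server lives
    failed_site     -- the datacenter that is lost completely
    """
    total_votes = sum(members_by_site.values()) + 1  # members plus witness
    surviving = sum(count for site, count in members_by_site.items()
                    if site != failed_site)
    if witness_site != failed_site:
        surviving += 1
    return surviving > total_votes // 2


dag = {"DC1": 2, "DC2": 2}
third_site = survives_site_loss(dag, witness_site="DC3", failed_site="DC1")
same_site = survives_site_loss(dag, witness_site="DC1", failed_site="DC1")
```

With the witness in a third location, the DAG keeps quorum whichever datacenter fails. With the witness in DC1, losing DC1 leaves two of five votes, so the surviving members cannot stay online automatically.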
Deal with intermittent failures: An intermittent failure requires
some sort of extra administrative action to be taken because it
might be the result of a replacement device being put into service.
In this scenario, the administrator can perform a namespace
switchover by simply removing the VIP for the device being
replaced from DNS. Then during that service period, no clients will
be trying to connect to it. After the replacement process has
completed, the administrator can add the VIP back to DNS, and
clients will eventually start using it.
Multi-site SSL
• Minimize the number of certificates you use for the
Exchange servers and reverse proxies; use a single certificate
for all the services. This minimizes the cost and complexity of
the solution.
• Use a single SAN certificate for each datacenter (include multiple
host names in the certificate).
• Use the same certificate principal name on each certificate to
ensure Outlook Anywhere connectivity after a failover has
• Configure the Outlook provider configuration object in AD
with the same principal name in Microsoft-Standard Form
(msstd), e.g. Set-OutlookProvider EXPR -CertPrincipalName
The DAG continues to play a big role in supporting the High Availability
and Site Resiliency features of Exchange Server.
Integrating lagged copies with Safety Net (formerly the transport
dumpster) is a good move by the Exchange Server product development team.
Single global namespace support is the biggest and most important feature
of Exchange Server 2013 for its site-wide resiliency capability.
While the Exchange 2010 design guidelines can apply to an
Exchange 2013 organization, the additional enhanced Exchange
2013 design guidelines cannot be applied to Exchange 2010. All
of the new behaviours and design options apply to Exchange 2013