AlwaysOn Availability Groups Setup Checklist
BEFORE YOU RUN THE WIZARD...
The power of Microsoft SQL Server’s new AlwaysOn Availability Groups can give you both high availability and
disaster recovery, but with power comes...well, complex configuration. We’ll help you get started.
Our clients have deployed SQL Server 2012 and 2014 to get the new reliability features. Instead
of using a complex combination of clustering, mirroring, log shipping, replication, and other
technologies, you can get both high availability and disaster recovery in a single tool.
That’s the good news.
The more challenging news is that AlwaysOn Availability Groups are brand spankin’ new, and
we’re still getting the kinks ironed out. The setup isn’t straightforward, and we have to
consider a whole lot of gotchas long before we kick off the wizard.
This checklist is very much a work in progress. We’ll be the first to admit that it’s just a
launching pad for discussion; some lines are just a minute’s worth of work, and others
represent hours of planning and labor. This checklist isn’t a substitute for experience with
Windows clustering, networking, SQL Server backups, and the other technologies that AlwaysOn
requires. In this early version of the checklist, the target reader should already have
experience with those technologies.
You can see the version date in the footer below. If someone gave you an out-of-date one, go
download the latest version - it’s totally free.
We’re Here to Help
If you’re overwhelmed, email us at [email protected]. We offer consulting and implementation
services to relieve SQL Server, SAN, VMware, and AWS pain.
Page [1] - July 2016 Edition - Download the Latest and Learn More at BrentOzar.com/go/alwayson
Big-Picture Planning
Discuss the issues below with the business users before you get started so that the
solution we build matches their requirements.
1. Choose the number & location of replicas. For each replica, decide:
1.1. Synchronous or asynchronous replication? We get up to three synchronous
replicas (a primary and two secondaries). The rest of the secondaries (up to four
secondaries in total on SQL Server 2012, or eight on SQL Server 2014) will be asynchronous.
1.2. Automatic failover to this node? We only get two automatic failover servers
- a primary and a secondary.
1.3. Whether you want to pay for your licensing. Just kidding - you’re paying.
2. Choose a quorum model based on your chosen number of replicas. This decision
is driven by an even versus odd number of nodes, how many nodes need to be up
for the cluster to stay online, and which nodes should vote. For an introduction, see
http://technet.microsoft.com/en-us/library/cc770620%28v=ws.10%29.
3. Design your Availability Group granularity.
3.1. How many databases do you want in each group?
3.2. Do you want to move groups around for load balancing?
3.3. Do you need to separate reporting databases from the OLTP databases so
the report data can be built on a read-only replica of production OLTP?
3.4. If just one database on an instance fails, what do you want to happen?
3.5. Are there some critical databases that need to be in their own AG with
synchronous replication & automatic failover, while others can be async
with manual failover?
4. Install SSMS on the DBA and developer workstations. The SQL Server 2012/2014
tools don’t work on Windows XP, so if you’re still in the dark ages, you’ll need a
server you can remote desktop into for the admin tools.
5. If any other servers require restores from your new 2012/2014 instance, they should
get that newer version first. (Think development, disaster recovery, or reporting.)
6. If you’re transitioning from an existing SQL Server, review non-database stuff
installed on those servers. Plan for SSIS, SSAS, SSRS, DTS packages, logins,
Agent jobs, etc.
10. If a shared storage failover cluster instance (FCI) will be involved, things get more
complex. We won’t touch on an entire cluster setup here, but these two are critical:
10.1. Reserve IPs for the cluster management, SQL Server instance, and DTC.
10.2. Decide how to configure DTC on the cluster. Cindy Gross has a great post on this:
http://blogs.msdn.com/b/cindygross/archive/2009/02/22/how-to-configure-dtc-for-sql-server-in-a-windows-2008-cluster.aspx
11. Decide how you'll achieve network redundancy. You no longer need a separate
heartbeat network, but we'd recommend against using just a single network card in
a cluster environment. In each node, either use multiple network cards with
individual IP addresses, or use multiple network cards with software teaming and a
single IP address. Cluster validation will throw a warning if there's only one active
network card detected (teams are considered a single card), but it'll let you install
anyway. Plan for this redundancy before you install SQL.
12. Decide where backups will be performed, and the target location. Choose a
preferred and a secondary server to run the full and transaction log backups - the
secondary will be used if the preferred one is down. Make this choice for both the
primary and DR data centers, too - you need to know where backups will run if we
fail over to the DR data center.
13. Decide whether to use Windows Core. In Windows 2008R2, it's all or nothing:
every node in the cluster must be configured the same way. We don't recommend
this for Win2008R2 unless you're extremely comfortable troubleshooting via
PowerShell and cluster.exe during an outage. Thankfully, Win2012/2012R2 allows
mixed cluster nodes and lets us add/remove the GUI on the fly.
14. Get the right installation bits for SQL Server 2012/2014 (the key is embedded) and
download the latest cumulative update (CU). A good guide for the latest builds is
http://SQLServerBuilds.blogspot.com - it’s usually up to date.
15. If you’ll be installing the Availability Group without AD admin privileges, consider
pre-staging the cluster name: http://technet.microsoft.com/en-us/library/
cc731002%28WS.10%29.aspx#BKMK_steps_precreating
Windows Installation
After the Windows installation, this section's steps don’t have to be done in exact order.
For example, you may run CPU-Z to check power saving after you’ve enabled Instant
File Initialization or joined the domain. All steps must be done on every node, though.
16. Install Windows Server on each node. If you’re using Windows Server 2008R2,
clustering is only included in Enterprise Edition. (Your company may have a
standard install checklist like configuring antivirus exclusions and Windows firewall.)
17. Join all of the nodes to the same domain. Clustering requires the nodes to be in the
same domain, even if they're in a different data center. They can be in different
subnets, though, and that’s not as painful as it used to be with Windows 2003.
18. Apply all Windows updates, but make sure Windows isn’t set to automatically apply
updates as they come in. You don’t want your cluster restarting on its own.
19. Install the Windows recommended hotfixes for clustering. For Windows 2008R2
SP1: http://support.microsoft.com/kb/2545685 For Windows 2012:
http://support.microsoft.com/kb/2784261
20. Install Windows prerequisite hotfixes for Availability Groups -
http://msdn.microsoft.com/en-us/library/ff878487.aspx#WinHotfixes
WHAT WE DO (when we’re not helping out)
We love giving back to the community. We share what we’ve learned by blogging at
BrentOzar.com, hosting free webcasts every week, and curating our email newsletter of links.
You can catch up with us at all the major SQL Server conferences like the PASS Summit, SQL
Intersection, SQL Rally, SQLSaturdays, and local user groups.
Catch our upcoming events at BrentOzar.com/go/live.
21. If you’re using Windows 2008R2, install the hotfix to allow nodes with 0 votes -
http://support.microsoft.com/kb/2494036 - We’re listing this separately because it's
23. Do a quick performance measurement:
23.1. Run CPU-Z from http://www.CPUID.com on each node to verify that power
saving is disabled.
23.2. Run CrystalDiskMark from http://crystalmark.info/software/CrystalDiskMark/index-e.html/
(making sure you’re getting CrystalDiskMark, not CrystalDiskInfo) on each node on each
RAID array to get a sanity check on throughput. Do 5 test passes, 4000MB test file. After
each test finishes, click Edit, Copy and put the text results in files organized by node name.
24. Enable Instant File Initialization. On each node, go into secpol.msc, Local Policies,
User Rights Assignment, Perform Volume Maintenance Tasks, and add the AD
service account you created in the Planning step.
DESIGNS
Client X: Easy High Availability
One of our clients is a high-traffic web site that needed easy failover between their primary
and disaster recovery data center. Despite serving millions of pages per day from hundreds of
databases, they don’t have a full time database administrator. They wanted an easy way to move
databases around without changing the app. AlwaysOn Availability Groups delivered.
Client Y: Scale-Out Reads
Another client was frustrated because they were paying for hardware in two data centers, but
one data center was always sitting around idle doing nothing. By implementing AlwaysOn
Availability Groups, we were able to leverage the disaster recovery data center for
near-real-time reports directly from the production database schema. Instead of waiting for
nightly data warehouse loads, they could get the data they wanted faster - without spending
more on licensing or ETL processes. They could use all the reports they’d already written to
go against the DR copy of production - without the pains of querying the production server.
Cluster Installation
We’d recommend finishing all of the above steps before setting up clustering. Once the
cluster has passed validation, we’re not too fond of tweaking its configuration.
25. In Failover Cluster Manager, validate the proposed cluster config. You can ignore
most of the storage warnings if you're not using shared storage clustering (FCI), but
don’t ignore the network errors about redundancy. Network connectivity is
extremely important for AlwaysOn Availability Groups. Save the cluster validation
report to disk to cover your rear later (especially if the Windows installation is being
done by a separate team).
26. Create the cluster. It's a link in the last step in the validation wizard.
27. If you want to override the automatic quorum configuration, now is the time to
remove votes - http://msdn.microsoft.com/en-us/library/hh270281.aspx
28. Document what will happen if various nodes go offline and quorum goes down. All
teams need to understand the ramifications of rebooting machines when patching.
29. If an FCI is involved, configure DTC per your Planning section decisions.
SQL Server Installation
This section assumes we're not doing a shared storage failover cluster instance (FCI).
FCIs add a lot of complexity that we’re not covering in this short (relatively!) checklist.
30. In Failover Cluster Manager, validate the cluster and save the cluster validation
report to disk. We do this again here because sometimes the Windows cluster and
SQL cluster are done by different teams, and we need to all be completely
comfortable with any validation warnings before & after installation. If you're not
comfortable with the warnings, stop here and get all teams involved. You only get
one chance to do this right.
31. Install a standalone instance of SQL Server on each node.
31.1. Install a default instance, not a named instance. (Not a requirement, just
making administration easier.)
31.2. Use the AD service account you created during the Planning section.
31.3. Use the same collation on all nodes.
31.4. For the data & log file directories, choose the drive letters & path you picked during
the Planning section. The default will be a long X:\MSSQL.SOMEPATH.SQLSERVER\MSSQL\DATA -
consider just doing X:\MSSQL\DATA, but don't use the root directory outright like X:\.
32. Install the latest SQL Server cumulative update that you downloaded earlier during the
planning steps.
33. Enable AlwaysOn HADR - on each node, go into SQL Server Configuration Manager, Services,
SQL Server Service, and on the AlwaysOn tab, check the box for Enable HADR. Click OK, and
restart the SQL Server instance.
34. Follow our setup best practices checklist at http://www.BrentOzar.com/go/setup for things
like memory and TempDB configuration.
40. After all nodes & databases are online, configure read-only routing. There's no GUI for
this - only T-SQL. http://msdn.microsoft.com/en-us/library/hh710054.aspx
41. Test read-only routing with SQLCMD -K ReadOnly.
42. Configure the preferred backup servers according to the Planning section decisions.
42.1. Configure the preferred servers in SSMS -
http://msdn.microsoft.com/en-us/library/hh710053.aspx#SSMSProcedure
42.2. Configure the backup scripts - maintenance plans and http://Ola.Hallengren.com
expect the jobs to be run on all nodes, and they'll automatically use
sys.fn_hadr_backup_is_preferred_replica to decide on a database-by-database level whether
they're the preferred backup server, and if so, back up that
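Steps 40 through 42 can be sketched in T-SQL. Treat this as a minimal sketch rather than a finished script: the availability group name AG1, the replica names SQL1 and SQL2, the domain contoso.local, port 1433, the listener name AG1-Listener, and the database SalesDB are all hypothetical placeholders that aren't part of this checklist. Swap in your own names and run the ALTER statements on the primary replica.

```sql
-- ASSUMPTION: AG1, SQL1, SQL2, contoso.local, port 1433, AG1-Listener, and SalesDB
-- are made-up names for illustration. Run the ALTER statements on the primary.

-- Allow read-intent connections whenever a replica is in the secondary role.
ALTER AVAILABILITY GROUP [AG1]
MODIFY REPLICA ON N'SQL1'
WITH (SECONDARY_ROLE (ALLOW_CONNECTIONS = READ_ONLY));

ALTER AVAILABILITY GROUP [AG1]
MODIFY REPLICA ON N'SQL2'
WITH (SECONDARY_ROLE (ALLOW_CONNECTIONS = READ_ONLY));

-- Give each replica the URL that read-intent clients get redirected to.
ALTER AVAILABILITY GROUP [AG1]
MODIFY REPLICA ON N'SQL1'
WITH (SECONDARY_ROLE (READ_ONLY_ROUTING_URL = N'TCP://SQL1.contoso.local:1433'));

ALTER AVAILABILITY GROUP [AG1]
MODIFY REPLICA ON N'SQL2'
WITH (SECONDARY_ROLE (READ_ONLY_ROUTING_URL = N'TCP://SQL2.contoso.local:1433'));

-- For each replica, list where reads should go while that replica is primary.
-- The first reachable entry in the list wins.
ALTER AVAILABILITY GROUP [AG1]
MODIFY REPLICA ON N'SQL1'
WITH (PRIMARY_ROLE (READ_ONLY_ROUTING_LIST = (N'SQL2', N'SQL1')));

ALTER AVAILABILITY GROUP [AG1]
MODIFY REPLICA ON N'SQL2'
WITH (PRIMARY_ROLE (READ_ONLY_ROUTING_LIST = (N'SQL1', N'SQL2')));

-- Review the resulting routing lists.
SELECT pr.replica_server_name AS when_primary,
       rl.routing_priority,
       sr.replica_server_name AS routes_reads_to,
       sr.read_only_routing_url
FROM sys.availability_read_only_routing_lists AS rl
JOIN sys.availability_replicas AS pr ON rl.replica_id = pr.replica_id
JOIN sys.availability_replicas AS sr ON rl.read_only_replica_id = sr.replica_id
ORDER BY pr.replica_server_name, rl.routing_priority;

-- Step 41, from a command prompt (not T-SQL). Connect through the listener and
-- name a database in the AG; the server name returned should be a secondary:
--   sqlcmd -S AG1-Listener -d SalesDB -K ReadOnly -Q "SELECT @@SERVERNAME"

-- Step 42.2: the check a backup job makes on every node before backing up a database.
IF sys.fn_hadr_backup_is_preferred_replica(N'SalesDB') = 1
    PRINT 'This replica is the preferred backup replica for SalesDB.';
ELSE
    PRINT 'Skip: another replica is preferred for SalesDB backups.';
```

Read-only routing only kicks in when the client connects to the listener, specifies a database in the availability group, and asks for read-only application intent, which is why the sqlcmd test passes all three.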
5. The primary replica has to be able to bring the listener online. The listener is the
network name you’ll normally be connecting to when running queries.
Each of these can have its own dependencies, too. For example, a failover clustered instance
requires the ability to bring its IP addresses, name, and shared drives online before it
starts SQL Server. This document doesn’t go into the details of troubleshooting each of those
parts, but we need to mention it here because you should troubleshoot each part as the cluster
comes online.
Performing a Planned Failover
The difference between a planned failover and an unplanned failover is that during a planned
failover, a lot of our dependencies are still online. The cluster’s online with all voting
members able to talk to each other, and the primary replica is still accepting queries. In
these cases, a failover is pretty easy. Here’s how to do it with the GUI:
1. Paranoid Pre-Planning: if you haven’t done this in a while and you want more confidence
that things will work as planned, go into Failover Cluster Manager and validate the cluster.
If someone’s been monkeying around with the hardware, networking, storage, or OS, you might
catch issues here before they go horribly wrong.
2. Connect to the primary replica with SSMS.
3. In Object Explorer, go into AlwaysOn, Availability Groups, right-click on the availability
group, and click Properties.
4. Make sure that the primary server and the soon-to-be-primary server both have Synchronous
replication, not Asynchronous. (If not, change them both to synchronous and watch the
AlwaysOn dashboard until they catch up.)
5. Right-click on the availability group and click Failover to start the wizard. Choose the
replica you’d like to fail over to, and make sure it doesn’t have warnings.
6. After the failover completes, note any warnings (especially quorum warnings). Verify that
you can connect to the listener.
7. If you’re using read-only routing so that clients can connect to replicas for read-only
queries, consider checking those replicas with tools like sp_who or DMVs to see if the
replicas are getting queries.
8. Review your SQL Agent jobs on all servers to make sure the right jobs are running on the
right servers. If you’ve hard-coded your backup, index maintenance, or user database jobs to
run on specific replicas, you may have to revisit those jobs. If you’re doing application
logic (like repopulating report tables) in a SQL Agent job, you might consider running those
on a central job server instead, and point the jobs at the listener name. Another option is
to use Windows Task Scheduler because those tasks can fail over with the cluster name.
9. If you’re doing disaster recovery tests, check additional steps in the unplanned failover
section below. Your company may want to do things like move quorum votes over to the disaster
recovery site in preparation for disconnecting the primary data center altogether.
Reacting to an Unplanned Failover
If the cluster or the SQL Server instance went bump in the night without the courtesy of
calling you beforehand, then we’ve got additional work to do.
How much data will we lose? If you're failing over to an asynchronous replica, you're
probably going to lose data. Build a set of queries that show exactly how old the replica's
data is - but here's the catch: you can't rely on being able to query the primary replica.
Instead, look at timestamp-based tables. For example, if you've got a Sales table with a
SalesDateTime field, run this query on the replica:
    SELECT TOP 10 * FROM dbo.Sales
    ORDER BY SalesDateTime DESC
If the newest sales are 20 minutes old, you can tell management that they stand to lose up to
20 minutes of data if we fail over now. You may still be able to recover the data from the
primary - more on that later. The amount of potential data loss helps you with the rest of the
decisions you're
about to make.
How long will we troubleshoot? If we stand to lose 20 minutes of data, then it helps to be
able to tell an executive, "We can either go live with our DR replica right now and lose 20
minutes of data, or I can spend 30 minutes troubleshooting to find out how bad the situation
is. What would you like to do?" Put that decision in management's hands as fast as you can -
preferably after you've done just five minutes of troubleshooting. Within five minutes, you
should know if it's an easy problem or a challenging one, and you want to let management
decide when the outage should end.
Who's allowed to make the go/no-go decision? If the DBA, dev manager, or IT manager aren't
around, who's responsible for the call to fail over to the asynch replica and lose data?
Ideally, we've got as large of a list as possible here so that the company can react fast
without waiting around for a long phone tree. This list of people must be firmed up ahead of
time.
Who's allowed to perform the failover? AlwaysOn Availability Group failover can be scripted
out with PowerShell and T-SQL, and these scripts can be given to a 24x7 help desk rotation. I
know, as a DBA, it can be scary to hand out this kind of permission, but I'd rather empower
someone else - when armed with a manager who can make go/no-go decisions - so that the company
can be back up and running faster.
Once failover starts, who's got quorum votes? If groups of servers, or heaven forbid, an
entire data center is offline, failover starts to get more complex. The team will need to
force the quorum online without enough votes, and then reassign voting rights using PowerShell
or cluster.exe. We'll probably need to change the quorum model as well, perhaps going from
node majority to node-and-file-share majority. These are complex topics that should be thought
through ahead of time rather than busting out Books Online.
Do jobs need to be enabled/disabled? If we've turned a secondary replica into a primary, how
does this affect our SQL Server Agent jobs that were running on the old replica? Do we need to
change any of our backup jobs? In theory, the built-in backup preferences settings for
AlwaysOn Availability Groups will cover this, but complex replica scenarios may have multiple
backup jobs to maintain both onsite and offsite backups. We might also have aggregation jobs
or ISV jobs that must always run on the primary, and we need to ensure those get enabled.
Who will troubleshoot connectivity problems? When there's been a massive failure, I like
having an open conference bridge with all hands on deck - but not necessarily all hands should
focus on the call. For example, in multi-DBA teams, I prefer to designate just one DBA team
member as the primary point of contact on the call. They handle all questions from developers
and end users while the other DBAs focus on making sure the replica stays healthy.
What do we do if we have multiple primary replicas? In a split-brain cluster scenario, we can
have two database servers that both think they're the primary replica for the same database.
We need a written checklist that we can give to someone who will drive to the
not-supposed-to-be-primary data center, shut the servers down, and stay there until the
network comes back online to bring the server up gracefully.
Will we try to recover the lost data? When the formerly-primary server comes back online, the
databases will still be readable. We can use data comparison tools to compare the two replicas
and generate insert/update/delete scripts to bring the data back into sync - in theory. In
practice, identity fields can make this impractical or impossible. I'd start by comparing
timestamps in a few key tables just to see how much data is at stake, and then give management
a rough time estimate on what it'd take to recover that data. In order to give a good
estimate, you'll want to try this approach ahead of time (long before an outage) using a pair
of development or QA database servers. Get a feeling for how the tables are related and how
much work would be involved to sync them.
There are a lot of questions here. When your team is armed with the answers ahead of time,
it'll make failovers much less painful, and you'll look like the smartest, fastest-reacting
ninjas in the company.
Clusters, Failures, and GUIs
In theory, in theory and in practice are the same.
In practice, they are different.
The AlwaysOn features (both failover clustering and availability groups) are very complex
machines with a lot of moving parts. The GUI is pretty good, but it’s a good idea to
understand the underlying machinery because it doesn’t handle all of the moving parts. If
everything’s working perfectly, then so will your failover - but then you wouldn’t need a
failover if things were working perfectly, now, would you?
In this guide, we only address doing things in the GUI, but if you want to be able to react
quickly and smoothly during a failure, you’d be wise to learn the PowerShell and T-SQL
commands involved with these activities. Script out your actions ahead of time so you’ll be
able to troubleshoot faster.
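As one possible starting point for scripting out the failover ahead of time, here is a minimal T-SQL sketch of the planned failover steps the GUI walks through. The availability group name AG1 and the replica names SQL1 (current primary) and SQL2 (failover target) are hypothetical placeholders, and this is a sketch under those assumptions, not a tested runbook. Pay attention to which server each statement runs on.

```sql
-- ASSUMPTION: AG1, SQL1 (current primary), and SQL2 (failover target) are
-- made-up names for illustration.

-- 1. On the primary (SQL1): make both replicas synchronous before failing over.
ALTER AVAILABILITY GROUP [AG1]
MODIFY REPLICA ON N'SQL1' WITH (AVAILABILITY_MODE = SYNCHRONOUS_COMMIT);

ALTER AVAILABILITY GROUP [AG1]
MODIFY REPLICA ON N'SQL2' WITH (AVAILABILITY_MODE = SYNCHRONOUS_COMMIT);

-- 2. On the primary: re-run this until every database on SQL2 reports SYNCHRONIZED.
--    last_commit_time also gives a rough idea of how current each replica is.
SELECT ar.replica_server_name,
       adc.database_name,
       drs.synchronization_state_desc,
       drs.last_commit_time
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
     ON drs.replica_id = ar.replica_id
JOIN sys.availability_databases_cluster AS adc
     ON drs.group_database_id = adc.group_database_id
ORDER BY ar.replica_server_name, adc.database_name;

-- 3. On the failover target (SQL2), NOT on the primary: planned failover.
ALTER AVAILABILITY GROUP [AG1] FAILOVER;

-- Unplanned failover only, and only after the go/no-go decision has been made:
-- forcing failover to an asynchronous replica can lose data. Run on the target.
-- ALTER AVAILABILITY GROUP [AG1] FORCE_FAILOVER_ALLOW_DATA_LOSS;
```

The forced failover is left commented out on purpose so that nobody runs it by pasting the whole script. If these scripts go to a help desk rotation, keeping the planned and forced versions in separate files is probably safer still.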
environment can lead to unexpected performance changes like poor performance or outages. Without a clear
use is forbidden without express written permission from Brent Ozar Unlimited.
Email: [email protected]
Phone: 773-420-9626