The Bathtub Curve and Data Center Equipment Reliability
datacenterfrontier.com/bathtub-curve-data-center-equipment/
Voices of the Industry
When it’s a question of spending tens of thousands of dollars on a refresh, you should
evaluate your needs and assess the facts to make the right decision for your
environment. (Photo: Service Express)
Jake Blough, Chief Technology Officer for Service Express, explores the Bathtub Curve
theory, its limitations, and data center equipment reliability and maintenance.
When digging into reliability engineering theories, you will quickly find the widely
used Bathtub Curve. According to this theory, when a product is new to the market,
there are substantial rates of early failures, which commonly result from errors in
handling or installation. These early failures taper off into a long middle period of
low, roughly constant failure rates. As the end of product life approaches, the rate
increases again with a second and final wave of wear-out failures. Although the
Bathtub Curve, pictured below, accurately reflects the failure behavior of many
products, we have found it does not universally apply to data center equipment.
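As a rough illustration of the theory itself (not of the article's data), the bathtub-shaped hazard rate is often modeled as the sum of a decreasing Weibull hazard (infant mortality), a constant rate (useful life) and an increasing Weibull hazard (wear-out). All parameters below are hypothetical:

```python
# Illustrative bathtub-shaped hazard rate. Parameters are made up for
# the sketch; they do not come from Service Express data.

def weibull_hazard(t, shape, scale):
    """Weibull hazard h(t) = (shape/scale) * (t/scale)**(shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def bathtub_hazard(t):
    infant = weibull_hazard(t, shape=0.5, scale=2.0)    # falls as t grows
    constant = 0.02                                     # flat useful-life rate
    wearout = weibull_hazard(t, shape=4.0, scale=12.0)  # rises as t grows
    return infant + constant + wearout

for years in (0.25, 1, 5, 10, 14):
    print(f"year {years:>5}: hazard = {bathtub_hazard(years):.3f}")
```

With a shape parameter below 1 the first Weibull term is front-loaded and a shape above 1 is back-loaded, which is exactly the high-low-high profile the curve's name describes.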
Examining Reliability Data
At Service Express, we’ve collected more than 15 years of equipment data from over
half a million devices. The data tracks when equipment breaks, how it breaks and how
often it breaks. The common assumption is that these devices should have a higher
failure rate in their infancy and then again toward end of life. However, looking at
non-critical and critical server and storage failures, our data shows that equipment
failure rates do not follow the Bathtub Curve as expected.
Critical Server Failures
A critical failure occurs when something like a CPU or system board fails. Critical
server failures result in the loss of access to applications or data, impacting
business productivity. In the graph below, you will see that most machines exhibit a
failure rate between 0% and 0.2%, with one outlier at 0.3% due to an early production
issue. These rates stay almost identical over a 10-15-year life span.
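To put those rates in perspective, here is a quick survival sketch. It assumes the percentages are monthly rates, an assumption, though the storage discussion later in the article quotes failures "per month":

```python
# With a steady monthly critical-failure rate r, the probability that a
# given server goes n months without a critical failure is (1 - r)**n.
# Reading the article's percentages as monthly rates is an assumption.

def p_no_failure(monthly_rate, months):
    return (1 - monthly_rate) ** months

for rate in (0.001, 0.002):
    for years in (10, 15):
        p = p_no_failure(rate, years * 12)
        print(f"{rate:.1%}/month over {years} years: "
              f"{p:.1%} of servers see no critical failure")
```

Under that reading, even at the top of the 0-0.2% band, most servers would run a full decade without a single critical failure.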
Non-Critical Server Failures
A non-critical failure occurs when a component like a disk drive or power supply fails.
Modern data center equipment has built-in redundancy for these components, so no loss
of data or access occurs in these instances. In the graph below, you will see a data
set tracking non-critical server failures across several models over 13 years.
You can see that non-critical failures barely increase over time, with a failure rate
of less than 0.5%; this is consistent with the number of components installed in the
system. The more components in a system, the more chances for a part to fail. The
slight increase in failures toward end of life seen here is attributable to component
count rather than to the wear-out factor associated with the Bathtub Curve. Systems in
a blade form factor show far fewer non-critical incidents than large 4U, four-CPU
systems.
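The component-count explanation can be made concrete: if each of n components fails independently with some small probability p in a given period, the chance that at least one fails grows with n. The per-component probability and the component counts below are hypothetical stand-ins for blade versus large 4U systems:

```python
# More components means more chances for some part to fail, even when
# each individual part is equally reliable. Numbers are hypothetical.

def p_any_component_fails(p, n):
    """Probability at least one of n independent components fails."""
    return 1 - (1 - p) ** n

p = 0.001  # assumed per-component failure probability per period
for n in (8, 32, 128):  # e.g. small blade vs. large 4U parts counts
    print(f"{n:>3} components: {p_any_component_fails(p, n):.2%}")
```

This is the same observation the data makes: the larger chassis fails more often simply because it contains more parts, not because any part is wearing out faster.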
Critical & Non-Critical Storage Failures
Storage devices comprise three types of components: critical, non-critical and disk
drives. Critical parts typically include storage processors, whereas non-critical
parts include cache batteries, power supplies and fans.
Storage systems are built to be incredibly resilient, tolerating multiple failures
before data is impacted. We consider storage processors the most critical components,
as the loss of one affects overall performance. In the graph above, you can see
critical, non-critical and drive failures for a popular OEM storage system. Note that
over five years, critical storage failures occur at between 0.1% and 0.2%, resulting
in about one failure out of 1,000 systems per month. Non-critical faults are typically
caused by cache battery sets, which must be replaced every 3-5 years.
Disk Drive Failures
The graph above represents data for all disk drive failures over six years. You can
see that disk drives experience a failure rate between 0.2% and 0.3%. This means that,
over time, disk drives are far more resilient than “common knowledge” would have you
believe.
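Read as monthly rates, which is an assumption, though it is consistent with the "per month" phrasing in the storage section, those drive figures annualize like this:

```python
# Converting an assumed monthly failure rate to an annualized failure
# rate (AFR): a drive survives the year only if it survives all 12
# months, so AFR = 1 - (1 - monthly_rate)**12.

def annualized_failure_rate(monthly_rate):
    return 1 - (1 - monthly_rate) ** 12

for r in (0.002, 0.003):
    print(f"{r:.1%}/month -> AFR {annualized_failure_rate(r):.2%}")
```

That works out to an AFR in the low single digits, roughly 2.4% to 3.5% under this reading.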
The long-term equipment reliability illustrated by the data is good news for IT
departments. This failure data counters the traditional recommendation for a hardware
refresh based on the expectation of increased failures as equipment ages. You can
factor longer equipment reliability, and the cost savings it enables, into the timing
of your refresh.
Your Next Data Center Refresh
Of course, there are valid reasons for taking on the cost and
time of a hardware refresh. Primary factors that should
determine when a hardware upgrade is needed include:
Software compatibility
Hardware compatibility between devices
Exceeded performance capacity
If your equipment is meeting your immediate needs,
consider delaying your refresh instead. Delaying an unneeded refresh can help you
reduce your CapEx spend and improve the value of your original investment.
When it’s a question of spending tens of thousands of dollars on a refresh, you should
evaluate your needs and assess the facts to make the right decision for your
environment. Based on our reliability data, which shows stable failure rates over time
for server and storage equipment, we recommend a refresh every 7-10 years. Your
refresh cycle should always be driven by compatibility, capacity and reliability.
Jake Blough is the Chief Technology Officer for Service Express.