Mean time between failures, or MTBF, is the average time between repairable failures of a product or system. It’s a key metric for gauging how often a system fails and for getting an overview of its reliability.

MTBF can be used to determine how successful your team is at preventing or reducing potential incidents. The longer the time between failures, the more reliable the system is.

MTBF plays a role in tracking both the reliability and availability of a component or system.

Reliability is the probability that a system or component will perform as designed over a specific period without failure. MTBF is a basic measure of a system’s reliability—the higher the MTBF, the higher the reliability of the product. Using MTBF with other failure metrics and maintenance strategies makes it easier to predict asset failures, as teams can better determine how and when to implement preventative measures before a failure occurs.

Availability is the ability of a system or component to operate as designed when needed. MTBF combined with mean time to restore (MTTR) determines the proportion of time a system can be expected to be operational. The availability of a system can be calculated by dividing the MTBF by the sum of MTBF and MTTR.

Availability = MTBF / (MTBF + MTTR)
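The availability formula can be sketched in a few lines of Python. The function name `availability` and the sample figures (7 hours between failures, 1 hour to restore) are illustrative assumptions, not values from a real system.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Return availability as a fraction between 0 and 1.

    Implements Availability = MTBF / (MTBF + MTTR).
    Both inputs must use the same time unit (hours here).
    """
    if mtbf_hours < 0 or mttr_hours < 0:
        raise ValueError("MTBF and MTTR must be non-negative")
    total = mtbf_hours + mttr_hours
    if total == 0:
        raise ValueError("MTBF and MTTR cannot both be zero")
    return mtbf_hours / total


# Hypothetical figures: 7 hours between failures, 1 hour to restore.
print(f"{availability(7, 1):.1%}")  # 87.5%
```

Because the result is a ratio, the unit does not matter as long as MTBF and MTTR share it.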

MTBF is calculated by dividing the total operational time for a specific period by the number of failures during the same period. Here’s how it’s calculated:

To determine the total operational time of a system, you’ll need to monitor the system for a specific period of time.

- The total operational time is the total time the system was up and running during the specified period, excluding downtime.
- The total number of failures is the number of times the system has failed within the specified period.

As an example, let’s say that during a 24-hour time frame, a system experiences three hours of downtime that occur during three separate incidents.

- Total uptime = (24 - 3) = 21 hours
- Total number of incidents = 3
- MTBF = total uptime / number of incidents
- MTBF = 21/3 = 7 hours
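The same calculation can be expressed as a small Python helper. This is a minimal sketch: the function name `mtbf` and its parameters are assumptions made for illustration, and it treats MTBF as undefined when no failures were observed.

```python
def mtbf(observed_hours: float, downtime_hours: float, failures: int) -> float:
    """Return MTBF in hours: total uptime divided by the number of failures."""
    if failures <= 0:
        raise ValueError("MTBF is undefined when no failures were observed")
    if not 0 <= downtime_hours <= observed_hours:
        raise ValueError("downtime must be between 0 and the observed period")
    uptime_hours = observed_hours - downtime_hours
    return uptime_hours / failures


# The 24-hour example: 3 hours of downtime across 3 incidents.
print(mtbf(observed_hours=24, downtime_hours=3, failures=3))  # 7.0
```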

As described above, MTBF can be calculated by dividing total uptime by the number of failures recorded. Failure rate, on the other hand, is the inverse of MTBF and is calculated by dividing the number of failures by the total uptime.

MTBF can be calculated from the failure rate as follows: MTBF = 1 / failure rate

For instance:

- Failure rate = 25 failures / 1,000 hours of uptime
- Failure rate = 0.025 failures per hour
- MTBF = 1 / 0.025
- MTBF = 40 hours
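The inverse relationship between failure rate and MTBF can be checked with a short Python sketch. The helper names `failure_rate` and `mtbf_from_failure_rate` are hypothetical and chosen only for this example.

```python
def failure_rate(failures: int, uptime_hours: float) -> float:
    """Return the failure rate in failures per hour of uptime."""
    if uptime_hours <= 0:
        raise ValueError("uptime must be positive")
    return failures / uptime_hours


def mtbf_from_failure_rate(rate_per_hour: float) -> float:
    """Return MTBF in hours as the inverse of the failure rate."""
    if rate_per_hour <= 0:
        raise ValueError("failure rate must be positive")
    return 1 / rate_per_hour


rate = failure_rate(failures=25, uptime_hours=1000)
print(f"{rate:.3f} failures per hour")               # 0.025 failures per hour
print(f"{mtbf_from_failure_rate(rate):.1f} hours")   # 40.0 hours
```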

Since the time between failures for a system or component can depend on factors such as configurations, operating conditions, age, and other external factors, there isn’t one “good” MTBF metric. Instead, MTBF should be calculated for your specific assets and will become more accurate as you collect more data on them.

Of course, while there may not be a universally accepted target MTBF, it’s still true that the higher the MTBF, the better. A high MTBF shows that your system or component is highly reliable and will have fewer problems over its lifetime—and having fewer incidents tends to translate to reduced downtime and lower costs.

A low MTBF means that your system is likely to fail more frequently and the reliability of your system needs to be reviewed. A good preventative maintenance plan and the implementation of tools to monitor MTBF and other failure metrics can help improve system reliability.

Next, let’s consider some examples of low, average, and high MTBF related to a production system operating over the course of 30 days.

Let’s say the system goes down six times within 30 days (720 hours) for four hours each time, for a total disruption time of 24 hours.

- Total uptime = (720 - 24) = 696 hours
- Total number of incidents = 6
- MTBF = total uptime / number of incidents
- MTBF = 696 / 6 = 116 hours (approximately 5 days)

An outage every five days indicates an extremely unreliable system that will frequently impact business operations and customers.

Now, imagine that the system only goes down two times within the same 30 days (720 hours) for two hours each time, for a total disruption time of four hours.

- Total uptime = (720 - 4) = 716 hours
- Total number of incidents = 2
- MTBF = total uptime / number of incidents
- MTBF = 716 / 2 = 358 hours (approximately 15 days)

While this might not be an extremely high MTBF, one failure every 15 days can be acceptable for some business use cases.

Finally, consider a system that only goes down once within 30 days (720 hours) for two hours.

- Total uptime = (720 - 2) = 718 hours
- Total number of incidents = 1
- MTBF = total uptime / number of incidents
- MTBF = 718 / 1 = 718 hours (approximately 30 days)

Compared to the other scenarios described here, one failure every 30 days can be considered a high MTBF, indicating that the system is highly reliable.
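The three 30-day scenarios can be reproduced side by side with a short Python sketch. The dictionary layout and the labels "low", "average", and "high" are assumptions made for this illustration, and the script assumes every failure in a scenario causes the same amount of downtime.

```python
HOURS_IN_PERIOD = 30 * 24  # 720 hours

# label -> (number of failures, hours of downtime per failure)
scenarios = {
    "low": (6, 4.0),
    "average": (2, 2.0),
    "high": (1, 2.0),
}


def mtbf_hours(period_hours: float, failures: int, downtime_per_failure: float) -> float:
    """Return MTBF in hours for a period with equally long outages."""
    uptime = period_hours - failures * downtime_per_failure
    return uptime / failures


for label, (failures, downtime) in scenarios.items():
    hours = mtbf_hours(HOURS_IN_PERIOD, failures, downtime)
    print(f"{label}: {hours:.0f} hours (~{hours / 24:.1f} days)")

# low: 116 hours (~4.8 days)
# average: 358 hours (~14.9 days)
# high: 718 hours (~29.9 days)
```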

MTBF is a useful reliability metric in several areas of technology. Let’s consider some scenarios for cybersecurity, incident response, and DevOps.

In cybersecurity, a declining MTBF can indicate that a system is nearing the end of its life and that the risk of a critical outage is increasing.

For example, imagine that a cybersecurity system is observed over a 48-hour period. During that time, the system fails five times for a total downtime of eight hours or a total operational time of 40 hours.

MTBF = 40 / 5 = 8 hours

The following month, the system is again observed over 48 hours. This time, there are eight failures for a total downtime of 12 hours or a total operational time of 36 hours. The system’s MTBF is now 4.5 hours.

MTBF = 36 / 8 = 4.5 hours

If MTBF continues to fall during subsequent observations, this could suggest that an area in the system—or the entire system itself—needs to be replaced or hardened.
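One way to watch for this kind of decline is to compute MTBF for each observation window and compare consecutive values. The sketch below is an assumption-laden illustration: the `Observation` class and the `is_declining` check are hypothetical, and it assumes every window recorded at least one failure.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    window_hours: float    # length of the observation window
    downtime_hours: float  # total downtime within the window
    failures: int          # number of failures (assumed to be at least 1)

    @property
    def mtbf(self) -> float:
        """MTBF in hours: operational time divided by number of failures."""
        return (self.window_hours - self.downtime_hours) / self.failures


def is_declining(observations: list[Observation]) -> bool:
    """Return True if MTBF dropped in every consecutive pair of windows."""
    values = [o.mtbf for o in observations]
    return len(values) >= 2 and all(
        later < earlier for earlier, later in zip(values, values[1:])
    )


# The two 48-hour observations from the example above.
history = [Observation(48, 8, 5), Observation(48, 12, 8)]
print([o.mtbf for o in history])  # [8.0, 4.5]
print(is_declining(history))      # True
```

A strict drop in every window is a deliberately simple rule. A real monitoring setup would likely tolerate noise, for example by comparing moving averages.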

MTBF can also help determine how effective your incident response team is at minimizing and preventing incidents. If MTBF is too low or trending downward, the team should analyze incident data to discover recurring outages and concerning trends.

MTBF in DevOps is a measure of the frequency of failures for a feature or single component, allowing teams to predict the reliability and availability levels of a service. In this way, it can highlight weaknesses in a component’s design or the testing and maintenance process.

By monitoring MTBF, DevOps teams can discover and eliminate inefficiencies and bottlenecks that could lead to failure by improving processes and system infrastructure. As teams make improvements, MTBF increases, indicating a more reliable system.

For instance, consider a code integration pipeline that ran for a total of 100 hours over a five-day week, during which four failures occurred.

- Total operation time = 100 hours
- Total number of failures = 4
- MTBF = total operation time / number of failures
- MTBF = 100 / 4 = 25 hours

With the right tools, you can boost MTBF and other maintenance metrics. These tools include infrastructure monitoring tools, service monitoring, visualization tools, application performance monitoring tools, cross-platform and data aggregation tools, and project management tools.

However, all these tools require fast, high-performance storage that can handle massive amounts of data without slowing down. With Pure Storage® FlashBlade®, you can create a robust, high-performance storage solution to support the advanced monitoring and observability tools needed to help you boost your MTBF metrics.

MTBF and mean time to failure (MTTF) both use elapsed time to evaluate the performance of a system or component, though they are applied differently: MTBF applies to repairable systems, while MTTF applies to components that are replaced rather than repaired.
