Backblaze probes increased annualized failure rate for its 240,940 HDDs

Backblaze’s quarterly updates on annualized failure rates (AFRs) for its arsenal of hard disk drives (HDDs) have provided unique insight into long-term storage use for over 10 years. Today, the backup and cloud storage company released Q2 2023 data, which explores an intriguing increase in AFRs.

Today’s blog post details data for 240,940 HDDs that Backblaze uses for data storage around the world. There are 31 different models, and Backblaze’s Andy Klein, who authored the blog, estimated in an email to Ars Technica that 15 percent of the HDDs in the dataset, including some of the 4, 6, and 8TB drives, are consumer-grade. The dataset doesn’t include boot drives, drives in commission for testing purposes, or drive models for which Backblaze didn’t have at least 60 units.

HDD models need at least 50,000 drive days in order for Backblaze to consider them statistically relevant.

One of the biggest revelations from examining the drives from April 1, 2023, through June 30, 2023, was an increase in AFR from Q1 2023 (1.54 percent) to Q2 2023 (2.28 percent). Backblaze’s Q1 dataset examined 237,278 HDDs across 30 models.

Of course, that AFR increase alone isn’t enough to warrant any panic. Since quarterly AFR numbers are “volatile,” Klein told Ars Technica, Backblaze further evaluates both quarter-to-quarter and lifetime trends “to see if what happened was an anomaly or something more.”
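For readers who want to reproduce figures like these from the public dataset, here is a minimal sketch of an AFR calculation. The article does not spell out the formula, so the sketch assumes the commonly used definition: failures per drive-year, expressed as a percentage. The function names and the sample numbers are hypothetical; only the 50,000 drive-day threshold comes from the text above.

```python
# Assumed definition: AFR (%) = failures / (drive_days / 365) * 100.
# Function names and sample inputs are illustrative, not Backblaze's code.

MIN_DRIVE_DAYS = 50_000  # relevance threshold cited in the article


def annualized_failure_rate(failures: int, drive_days: int) -> float:
    """Return failures per drive-year as a percentage."""
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    drive_years = drive_days / 365
    return failures / drive_years * 100


def is_statistically_relevant(drive_days: int) -> bool:
    """True when a model has accumulated enough drive days to report."""
    return drive_days >= MIN_DRIVE_DAYS


# Hypothetical model: 5 failures over 100,000 drive days.
example_afr = annualized_failure_rate(failures=5, drive_days=100_000)
print(f"AFR: {example_afr:.3f} percent")
```

With few drive days in a quarter, a single extra failure moves the result noticeably, which is one way to see why quarterly numbers swing more than lifetime ones.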

So, Klein started digging further by grouping the drives by capacity. This is because, as Klein explained to Ars:

A Backblaze storage vault consists of 1,200 drives of the same size, with 60 drives in 20 storage servers. If we grouped the drives strictly by age and wanted to replace just the oldest drives in a given Backblaze vault, we would only replace those drives in the vault that met the old age criteria, not all the drives. Then, a year from now, we’d do it again, and the year after that, etc. By using the average age by drive size, we can, as appropriate, replace/upgrade all of the drives in a vault at once.

After eliminating drives that Backblaze considered young (under 5 years old), Backblaze plotted quarterly AFRs for its 4, 6, 8, and 10TB HDDs on a line graph. On that chart, the lines for the 10 and 8TB models stand out.

Moving to lifetime AFRs

Digging even deeper to see whether the 8TB and 10TB drives are truly what is pushing up the overall AFR, Backblaze turned to lifetime AFRs, which cover drives up to 10 years, 2 months, and 10 days old. The oldest drive, a 6TB Seagate ST6000DX000, is about 10 years, 2 months old.

The lifetime AFR for Backblaze’s hard drives increased 0.05 percentage points from the preceding quarter (1.4 percent) to now (1.45 percent). The 10TB HDDs, along with the 8TB ones, were big drivers of that change.
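The gap between 1.4 percent and 1.45 percent is 0.05 percentage points, which works out to a relative increase of roughly 3.6 percent. The short sketch below separates the two measures using the figures quoted in this article; the function names are hypothetical and this is only an illustration of the arithmetic, not Backblaze's own analysis.

```python
# Illustrative helpers for comparing two failure rates given in percent.


def point_change(old_pct: float, new_pct: float) -> float:
    """Absolute difference, in percentage points."""
    return new_pct - old_pct


def relative_change(old_pct: float, new_pct: float) -> float:
    """Difference relative to the old rate, as a percentage."""
    if old_pct == 0:
        raise ValueError("old_pct must be non-zero")
    return (new_pct - old_pct) / old_pct * 100


# Figures quoted in the article: (label, previous AFR, current AFR).
comparisons = [
    ("lifetime", 1.40, 1.45),
    ("quarterly", 1.54, 2.28),
]

for label, old, new in comparisons:
    print(
        f"{label}: {point_change(old, new):+.2f} points, "
        f"{relative_change(old, new):+.1f} percent relative"
    )
```

Keeping the two measures apart matters when reading the per-model comparison later in the article, where increases are quoted in relative terms.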

Backblaze has far more 8TB drives (24,891) than 10TB ones (1,124), so Klein grouped the 8TB drives by model. Klein told Ars that each of the three 8TB models examined had over 50,000 drive days for the quarter and over 2 million drive days in its lifetime.

“For all three models, the increase of the lifetime annualized failure rate from Q1 to Q2 is 10 percent or more, which is statistically similar to the 12 percent increase for all of the 8TB drive models. If you had to select one drive model to focus on for migration, any of the three would be a good candidate,” Klein’s blog says.

What have we learned?

The executive told Ars that years of data collection like this have taught Backblaze that the failure rate of a given model does not predict the failure rate of other models of the same size or by the same manufacturer:

That’s why once we identified the 8TB drives as the potential problem, we had to dig into the model-specific numbers. In this case, all of the models were similar in their increase in failure rates, but it could have been just as likely that they weren’t.

One thing to remember is that we are looking at the change in failure rates over time, not the actual failure rates themselves. We are looking for unusual changes outside of what we would expect.

Looking at detailed drive data like this gives Backblaze an intimate look into its storage environment so it can make any necessary adjustments.

“We have a drive migration program to move from smaller drives to larger drives to improve storage density in a given Backblaze vault. For economic reasons, we start with the smallest drives and then consider other details, such as failure rates, in the process. To that end, the analysis we did is being used to help prioritize which Backblaze vaults are upgraded,” Klein told Ars.

For consumers considering a new HDD for personal use (but not as part of a RAID array), Klein advised seeking a model they “believe fails the least.”

He added:

But the difference between a 1 percent and 2 percent failure rate is moot if you don’t back up your stuff somewhere else. Relying on a single drive, HDD or SSD, as your sole source of data storage is a ticking time bomb. Whether a drive lasts 2 years or 10 years, it will fail.

Backblaze’s complete dataset is available to the public for free on its website.