Metastable Failures in Distributed Systems
Discovered: Mar 26, 2022. 21:37 Bronson, Charapko, Aghayev, Zhu:Metastable Failures in Distributed Systems QUOTE Robustness is a fundamental goal of distributed systems research. Yet despite years of advances, there are still many
system outages in the wild. By reviewing experiences from
a decade of operating hyperscale distributed systems, we
identify a class of failures that can disrupt them, even when
there are no hardware failures, configuration errors or software bugs. These metastable failures have caused
widespread outages at large internet companies, lasting from
minutes to hours. Paradoxically, the root cause of these failures is often features that improve the efficiency or reliability
of the system.... Conclusion: Metastable failures are a class of failures that impact distributed systems. They naturally arise from optimizations
and policies that improve behavior in the common case. They
are an emergent behavior rather than a logic bug—one cannot write a unit or integration test to trigger them. As such,
they are rare, but can have catastrophic effects.
END QUOTE