While cloud-native microservice architectures have revolutionized software development, their inherent operational complexity makes failure Root Cause Analysis (RCA) a critical yet challenging task. Numerous data-driven RCA models have been proposed to address this challenge. However, we find that the benchmarks used to evaluate these models are often too simple to reflect real-world scenarios. Our preliminary study reveals that simple rule-based methods can match or even surpass state-of-the-art (SOTA) models on four widely used public benchmarks. This finding suggests that the oversimplification of existing benchmarks might lead to an overestimation of the performance of RCA methods.
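For intuition, the sketch below shows what such a rule-based baseline could look like: it simply ranks services by how strongly their metrics deviate from a rolling baseline. This is an illustrative assumption, not the paper's exact rules; the column names `service`, `error_rate`, and `latency_p99` are hypothetical.

```python
import pandas as pd

def rule_based_rca(metrics: pd.DataFrame, top_k: int = 5) -> list[str]:
    """Rank services by how far error rate and tail latency deviate from a
    rolling per-service baseline; the highest-deviation services are returned
    as root-cause candidates."""
    scores = {}
    for service, df in metrics.groupby("service"):
        signals = df[["error_rate", "latency_p99"]]
        baseline = signals.rolling(window=30, min_periods=5).median()
        # Normalized deviation from the baseline; take the worst spike seen.
        deviation = ((signals - baseline).abs() / (baseline + 1e-9)).max().max()
        scores[service] = float(deviation)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

A baseline of roughly this complexity requires no training and runs in seconds, which is what makes its competitiveness with SOTA models on existing benchmarks notable.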
To further investigate the oversimplification issue, we conduct a systematic analysis of popular public RCA benchmarks, identifying key limitations in their fault injection strategies, call graph structures, and telemetry signal patterns. Based on these insights, we propose an automated framework for generating more challenging and comprehensive benchmarks that include complex fault propagation scenarios. Our new dataset contains 1,430 validated failure cases from 9,152 fault injections, covering 25 fault types across 6 categories, dynamic workloads, and hierarchical ground-truth labels that map failures from services down to code-level causes. Crucially, to ensure the failure cases are relevant to IT operations, each case is validated to have a discernible impact on user-facing SLIs.
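As a rough sketch of what the SLI-based validation step could look like (the drop-based criterion and the 5% threshold are assumptions for illustration, not the paper's actual rule):

```python
import statistics

def has_sli_impact(sli_before: list[float], sli_during: list[float],
                   min_relative_drop: float = 0.05) -> bool:
    """Keep a failure case only if a user-facing SLI (e.g., request success
    rate) degrades by more than `min_relative_drop` relative to the
    pre-injection window."""
    baseline = statistics.mean(sli_before)
    observed = statistics.mean(sli_during)
    return baseline > 0 and (baseline - observed) / baseline > min_relative_drop
```

A filter of this kind explains why only 1,430 of the 9,152 injections survive as validated failure cases: injections with no discernible user-facing impact are discarded.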
Our re-evaluation of 11 SOTA models on this new benchmark shows that they achieve low Top@1 accuracy, averaging 0.21 with the best-performing model reaching only 0.37, while execution times escalate from seconds to hours. From this analysis, we identify three critical failure patterns common to current RCA models: scalability issues, observability blind spots, and modeling bottlenecks. Based on these findings, we provide actionable guidelines for future RCA research, emphasizing the need for robust algorithms and the co-development of challenging benchmarks. To facilitate further research, we publicly release our benchmark generation framework, the new dataset, and our implementations of the evaluated SOTA models.
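For reference, the Top@1 accuracy reported above can be computed as in the minimal sketch below, assuming each model outputs a ranked list of candidate root causes per failure case (the service-level granularity here is an assumption; the benchmark also supports finer-grained labels).

```python
def top_k_accuracy(predictions: dict[str, list[str]],
                   ground_truth: dict[str, str], k: int = 1) -> float:
    """Fraction of failure cases whose true root cause appears among the
    model's top-k ranked candidates (Top@1 when k == 1)."""
    if not predictions:
        return 0.0
    hits = sum(ground_truth[case] in ranked[:k]
               for case, ranked in predictions.items())
    return hits / len(predictions)
```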
A systematic analysis reveals that existing public benchmarks (e.g., Nezha, Eadro, RCAEval) lack the complexity and scale required to meaningfully differentiate sophisticated RCA models.
The research identifies three critical limitations in current public benchmarks, concerning their fault injection strategies, call graph structures, and telemetry signal patterns.
The framework automates the generation, collection, and validation of failure scenarios at scale, using Train Ticket (50+ services) as the system foundation.
Fig. 3: The six-stage pipeline for our benchmark dataset construction, from system selection to the final validated and labeled failure cases.
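To make the pipeline concrete, the sketch below illustrates what its injection-planning stage could look like; the function, the dictionary schema, and the specific fault-type strings are hypothetical, not the framework's actual API, and the listed service names are merely examples of Train Ticket services.

```python
import itertools
import random

def generate_injection_plan(services: list[str], fault_types: list[str],
                            n_cases: int, duration_s: int = 300) -> list[dict]:
    """Enumerate (service, fault type) combinations and sample an injection
    plan; downstream pipeline stages run the workload, collect telemetry, and
    keep only the cases with a discernible SLI impact."""
    pairs = list(itertools.product(services, fault_types))
    random.shuffle(pairs)
    return [
        {"target_service": svc, "fault_type": fault, "duration_s": duration_s}
        for svc, fault in pairs[:n_cases]
    ]

# Example: a tiny plan over a few Train Ticket services and fault types.
plan = generate_injection_plan(
    services=["ts-order-service", "ts-travel-service", "ts-auth-service"],
    fault_types=["cpu_stress", "packet_loss", "http_delay"],
    n_cases=5,
)
```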
The research re-engineers and evaluates 11 recent SOTA RCA methods on the new benchmark.
The study identifies three critical failure patterns in current RCA models: scalability issues, observability blind spots, and modeling bottlenecks.
The study concludes that current RCA progress is inflated by simplistic benchmarks, urging a shift toward co-developing challenging benchmarks and robust algorithms. Key insights emphasize the need for fault-aware models that handle incomplete observability and complex causality, as well as metrics beyond accuracy, such as diagnostic coherence.
To facilitate community progress, the research artifacts are publicly released: the benchmark generation framework, the new dataset, and the implementations of the 11 evaluated SOTA models.