Rethinking the Evaluation of Microservice RCA with a Fault Propagation-Aware Benchmark

Aoyang Fang, Songhan Zhang, Yifan Yang, Haotong Wu, Junjielong Xu, Xuyang Wang, Rui Wang, Manyi Wang, Qisheng Lu, Pinjia He
The Chinese University of Hong Kong, Shenzhen

News

  • [2025-12-23] We are currently reorganizing the code to facilitate easy reproduction. You are also welcome to utilize the artifacts on Zenodo. If you encounter any issues, please feel free to contact the authors. Feedback is highly appreciated!

Abstract

While cloud-native microservice architectures have revolutionized software development, their inherent operational complexity makes failure Root Cause Analysis (RCA) a critical yet challenging task. Numerous data-driven RCA models have been proposed to address this challenge. However, we find that the benchmarks used to evaluate these models are often too simple to reflect real-world scenarios. Our preliminary study reveals that simple rule-based methods can achieve performance comparable to or even surpassing state-of-the-art (SOTA) models on four widely used public benchmarks. This finding suggests that the oversimplification of existing benchmarks might lead to an overestimation of the performance of RCA methods.

To further investigate the oversimplification issue, we conduct a systematic analysis of popular public RCA benchmarks, identifying key limitations in their fault injection strategies, call graph structures, and telemetry signal patterns. Based on these insights, we propose an automated framework for generating more challenging and comprehensive benchmarks that include complex fault propagation scenarios. Our new dataset contains 1,430 validated failure cases from 9,152 fault injections, spanning 25 fault types across 6 categories, collected under dynamic workloads, and annotated with hierarchical ground-truth labels that map failures from the service level down to code-level causes. Crucially, to ensure the failure cases are relevant to IT operations, each case is validated to have a discernible impact on user-facing SLIs.

Our re-evaluation of 11 SOTA models on this new benchmark shows that they achieve low Top@1 accuracies, averaging 0.21, with the best-performing model reaching merely 0.37, and execution times escalating from seconds to hours. From this analysis, we identify three critical failure patterns common to current RCA models: scalability issues, observability blind spots, and modeling bottlenecks. Based on these findings, we provide actionable guidelines for future RCA research. We emphasize the need for robust algorithms and the co-development of challenging benchmarks. To facilitate further research, we publicly release our benchmark generation framework, the new dataset, and our implementations of the evaluated SOTA models.

Motivation & Framework

1. Preliminary Study: The Simplicity of Current Benchmarks

A systematic analysis reveals that existing public benchmarks (e.g., Nezha, Eadro, RCAEval) lack the complexity and scale required to meaningfully differentiate sophisticated RCA models.

  • Key Finding: Rule-based Methods vs. SOTA. To quantify benchmark simplicity, a heuristic method called SimpleRCA, which uses only basic threshold-based anomaly detection, was compared against state-of-the-art (SOTA) models; it achieved performance comparable to, and in some cases surpassing, the SOTA models on four widely used public benchmarks (a minimal sketch of the idea follows this list).
  • The "Type I" Fault Issue. Analysis shows that 86% of cases in existing datasets exhibit "Type I" or "Type II" patterns, where fault symptoms are either overly localized to the root-cause service or too weak to be observed in the telemetry.

2. Identifying the "Three Sins" of Existing Datasets

The research identifies three critical limitations in current public benchmarks:

  • Observability Blind Spots: 99% of cases lack essential telemetry types (e.g., missing metrics or logs for specific fault types).
  • Shallow Call Graphs: Most benchmarks feature a maximum call depth of only 2–3 hops (one way to measure depth from trace spans is sketched after this list).
  • Static Workloads: Request patterns often lack the dynamic variability found in real-world systems.
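
To make the call-depth limitation concrete, the sketch below computes the longest downstream call chain in a trace from parent-child span links; the span schema (span_id/parent_id dictionaries) is an assumption chosen for illustration and may differ from the benchmarks' actual trace formats.

```python
def max_call_depth(spans: list[dict]) -> int:
    """Longest downstream call chain in a trace, counted in hops (edges)."""
    children: dict[str, list[str]] = {}
    roots = []
    for span in spans:
        parent = span.get("parent_id")
        if parent is None:
            roots.append(span["span_id"])
        else:
            children.setdefault(parent, []).append(span["span_id"])

    def hops(span_id: str) -> int:
        return max((1 + hops(child) for child in children.get(span_id, [])), default=0)

    return max((hops(root) for root in roots), default=0)

# frontend -> order -> payment is a 2-hop chain, about as deep as most public benchmarks get.
trace = [
    {"span_id": "frontend", "parent_id": None},
    {"span_id": "order", "parent_id": "frontend"},
    {"span_id": "payment", "parent_id": "order"},
]
print(max_call_depth(trace))  # 2
```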

3. The Proposed Framework: A Closed-Loop Pipeline

The framework automates the generation, collection, and validation of failure scenarios at scale, using Train Ticket (50+ services) as the system foundation.

  • State-Machine Based Workload: Models user workflows as directed graphs to create recursive dependencies and combinatorial execution paths.
  • Impact-Driven Validation: Uses a pragmatic oracle to filter out "silent faults." Only injections causing measurable degradation in user-facing SLIs (Success Rate, Latency) are included; a minimal sketch of such an oracle follows this list.
  • Hierarchical Labels: Provides ground truth mapping from the service level down to specific code functions.
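
A minimal sketch of such an impact-driven oracle is shown below; the SLIWindow structure, threshold values, and function name are illustrative assumptions rather than the framework's exact implementation.

```python
from dataclasses import dataclass

@dataclass
class SLIWindow:
    success_rate: float    # fraction of successful requests in the window
    p95_latency_ms: float  # 95th-percentile request latency in the window

def has_user_facing_impact(baseline: SLIWindow, faulted: SLIWindow,
                           min_sr_drop: float = 0.05,
                           min_latency_inflation: float = 1.5) -> bool:
    """Keep a fault injection only if user-facing SLIs degrade measurably."""
    sr_degraded = faulted.success_rate < baseline.success_rate - min_sr_drop
    latency_degraded = faulted.p95_latency_ms > baseline.p95_latency_ms * min_latency_inflation
    return sr_degraded or latency_degraded

# A success-rate drop from 0.99 to 0.88 is counted as a valid (non-silent) failure case.
print(has_user_facing_impact(SLIWindow(0.99, 120.0), SLIWindow(0.88, 130.0)))  # True
```

Injections that pass this check become validated failure cases; the rest are discarded as silent faults, keeping the dataset focused on failures an operator would actually need to diagnose.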

Fig. 3: The six-stage pipeline for our benchmark dataset construction, from system selection to the final validated and labeled failure cases.

Results

1. Re-evaluation of 11 SOTA Models

The research re-engineers and evaluates 11 recent SOTA RCA methods on the new benchmark.

  • Overall Performance: When confronted with complex fault propagation, the performance of existing models degrades significantly. The average Top@1 accuracy across all models is only 0.21, compared to values above 0.90 reported on simpler datasets (the Top@k metric is sketched after this list).
  • Fault-Specific Performance: Models show distinct blind spots, particularly in DNS, TimeSkew, and Network Corrupt scenarios.
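
For reference, Top@k accuracy is the fraction of cases whose true root-cause service appears among a model's k highest-ranked candidates; a minimal sketch with hypothetical service names follows.

```python
def top_k_accuracy(rankings: list[list[str]], ground_truth: list[str], k: int = 1) -> float:
    """Fraction of cases whose true root-cause service appears in the top-k ranking."""
    hits = sum(gt in ranked[:k] for ranked, gt in zip(rankings, ground_truth))
    return hits / len(ground_truth)

# Hypothetical example: two of three cases rank the true root cause first.
rankings = [["cart", "order"], ["payment", "cart"], ["order", "user"]]
truth = ["cart", "payment", "user"]
print(round(top_k_accuracy(rankings, truth, k=1), 2))  # 0.67
```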

2. Systematic Failure Mode Analysis

The study identifies three critical modeling bottlenecks:

  • Scalability Issues: Execution times for many models escalate from seconds to hours as data volume increases.
  • Observability Blind Spots: Models fail to interpret the cessation of telemetry (e.g., PodKill) or contradictory signals across logs and traces.
  • Modeling Bottlenecks: Rigid assumptions (e.g., frequency-based fault manifestation) are invalidated by complex scenarios.

Insight & Resources

The study concludes that current RCA progress is inflated by simplistic benchmarks, urging a shift toward co-developing challenging benchmarks and robust algorithms. Key insights emphasize the need for fault-aware models that handle incomplete observability and complex causality, as well as metrics beyond accuracy, such as diagnostic coherence.

To facilitate community progress, the research artifacts are publicly released:

  • Benchmark Framework: Automated toolchain for fault injection and data collection.
  • New Dataset: 1,430 validated failure cases covering 25 fault types.
  • Unified Evaluation Suite: Re-engineered and containerized versions of 11 SOTA RCA models.