Application Availability and Resilience
Resolved
Feb 23, 2026 at 5:00pm UTC
Application Availability (Uptime): The percentage of time an application is accessible and functioning correctly for users. It focuses on preventing downtime through redundancy, such as deploying across multiple servers or data centers.
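As a rough illustration of the arithmetic behind an uptime figure, the sketch below computes availability as a percentage and the downtime a given target leaves room for. The function names and the 30-day window are assumptions chosen for the example, not part of any particular monitoring tool.

```python
def availability_percent(total_seconds: float, downtime_seconds: float) -> float:
    """Return uptime as a percentage of the measurement window."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    if not 0 <= downtime_seconds <= total_seconds:
        raise ValueError("downtime_seconds must be within the window")
    return 100.0 * (total_seconds - downtime_seconds) / total_seconds


def allowed_downtime_seconds(target_percent: float, total_seconds: float) -> float:
    """Return how much downtime a target availability permits in the window."""
    if not 0 <= target_percent <= 100:
        raise ValueError("target_percent must be between 0 and 100")
    return total_seconds * (100.0 - target_percent) / 100.0


if __name__ == "__main__":
    window = 30 * 24 * 60 * 60  # hypothetical 30-day window, in seconds
    print(availability_percent(window, 1800))       # 30 minutes of downtime
    print(allowed_downtime_seconds(99.9, window))   # budget for a 99.9% target
```

Under these assumptions, a 99.9% target over 30 days leaves roughly 43 minutes of downtime.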
Application Resilience (Recovery): The ability of a system to continue operating, or to gracefully degrade, during unplanned disruptions (e.g., component failures, network outages) and to recover normal functionality quickly.
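One common way to get this behavior is to retry a failing dependency a few times and then fall back to a reduced result instead of failing the whole request. The following is a minimal sketch under that assumption; `call_with_fallback` and the example functions are hypothetical names, and a real system would usually add backoff delays and narrower exception handling.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    attempts: int = 3,
) -> T:
    """Try `primary` up to `attempts` times, then degrade to `fallback`."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for _ in range(attempts):
        try:
            return primary()
        except Exception:
            # Treat any error as a transient failure and try again.
            continue
    # All attempts failed: serve the reduced result rather than raising.
    return fallback()


if __name__ == "__main__":
    def fetch_recommendations() -> list:
        # Hypothetical non-essential feature that is currently unavailable.
        raise ConnectionError("recommendation service unavailable")

    def no_recommendations() -> list:
        # Degraded result: the page still renders, just without suggestions.
        return []

    print(call_with_fallback(fetch_recommendations, no_recommendations))
```

The caller still gets a usable value, so core functionality carries on while the non-essential feature is down.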
Key Aspects:
- Fault Tolerance: The design of a system to handle errors without interrupting the user experience.
- Graceful Degradation: Maintaining core functionality even when non-essential features fail.
- Automation: Using tools (like Kubernetes) to automatically self-heal and re-route traffic without manual intervention.
- Core Metrics: Tracked through Service Level Objectives (SLOs), Recovery Time Objectives (RTO: the target time within which service must be restored), and Mean Time to Recovery (MTTR: the average time actually taken to restore service).
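To make the recovery metrics concrete, the sketch below derives MTTR from a list of incident start and end times and flags incidents that took longer than a recovery time objective. The incident data, the 15-minute objective, and the function names are illustrative assumptions only.

```python
from datetime import datetime, timedelta

Incident = tuple[datetime, datetime]  # (outage start, service restored)


def mean_time_to_recovery(incidents: list[Incident]) -> timedelta:
    """Average duration between outage start and restoration."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)


def rto_breaches(incidents: list[Incident], rto: timedelta) -> list[Incident]:
    """Return the incidents whose recovery took longer than the objective."""
    return [(start, end) for start, end in incidents if end - start > rto]


if __name__ == "__main__":
    # Hypothetical incident log for illustration.
    log = [
        (datetime(2026, 1, 5, 9, 0), datetime(2026, 1, 5, 9, 10)),
        (datetime(2026, 2, 1, 14, 0), datetime(2026, 2, 1, 14, 30)),
    ]
    print(mean_time_to_recovery(log))
    print(rto_breaches(log, timedelta(minutes=15)))
```

Comparing the two outputs shows the difference between the metrics: MTTR summarizes how recovery actually went on average, while the RTO check tests each incident against a target set in advance.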