DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

Remetric: find waste in self-hosted Prometheus, Grafana, and Loki

6 min read
The Degradation Ladder: How Systems Fail Before They Fail

5 min read
AI SRE and AI DevOps: different problems, one reliability stack

6 min read
Your Agent Acts Without Checking Your Error Budget — That's the Failure Mode Nobody Is Tracking

6 min read
How We Killed Our Worst Alert (And What We Learned)

2 min read
The Hidden Cost of Downtime: How SRE Error Budgets Protect National Economic Infrastructure

11 min read
Why Backup Success Does Not Mean Database Recoverability

2 min read
Game day on our build cluster: killing an AZ to test LLM flake detection

4 min read
I got tired of writing post-mortems — so I built RCAi for SREs

1 min read
Diagnosing KubeAPIErrorBudgetBurn: When a 7-Year-Old Disk Takes Down Your Control Plane

5 min read
The Reliability Roadmap: A 90-Day Plan for New SRE Teams

2 min read
I Got Tired of 35-Minute Incident Reviews — So I Built an AI SRE Copilot

2 min read
Scaling On-Call When You Only Have 5 Engineers

2 min read
A note on building reliability infrastructure for AI agents and why post-incident debugging matters more than pre-flight validation.

4 min read
Observability in 2026: Distributed Tracing Replaced Logs, and OpenTelemetry Won

5 min read