DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

The Future of SRE: What the Next 5 Years Look Like

3 min read
Why Setting Up Observability Takes Forever (And What To Do About It)

4 min read
Railway vs AWS: When Leaving Railway Means Owning Reliability

14 min read
Runbooks in Minutes: An On-Call Incident Copilot with HazelJS

4 min read
Great Stack to Doesn't Work #10 — Season Finale: "When PagerDuty Calls at 3 AM"

12 min read
Stop breaking production: a migration path to unified platforms 🛠️

1 min read
Building a Career in SRE: From Junior to Staff

2 min read
The Human-in-the-Loop SRE: Designing Automation Escalation Policies for AI-Assisted Operations

15 min read
CPU and DB were bored, yet every site timed out: a slow-read bot that starved Apache's workers

5 min read
I'm building a read-only context engine for Kubernetes and AI agents

6 min read
The Post-Mortem That Taught My System How to Fix Itself Using Hindsight

7 min read
I Asked Claude to Map My Infrastructure. Then I Asked a Purpose-Built Tool.

4 min read
Surviving the region you run in: failover on Aurora DSQL, and what the demo proves

5 min read
What is SRE? A Beginner's Guide to Site Reliability Engineering

5 min read
Ongrid : open-source ops AI agent for RCA and remediation from chat

1 min read