DEV Community

# reliability

General discussions on building and maintaining reliable software systems.

Posts

The Fire That Reached the Backups: The OVHcloud Strasbourg Data-Centre Fire, 2021
7 min read
Automatic Error Recovery in AI Agent Networks
2 min read
Claude Code for Canary Deployments: How I Ship to 1% of Users Before Breaking Everything
9 min read
The Hidden Cost of Downtime: How SRE Error Budgets Protect National Economic Infrastructure
11 min read
Eleven silent-failure modes across 36 agent platforms, and the structural feature they share
5 min read
How we survived 218 network transitions with zero data loss: ALEF's self-healing architecture
2 min read
Grafana 'No Data' after migration: 7 reconcilers we had to kill first
8 min read
The Silent Outage: Monitoring What You Can't See
2 min read
The silent sequential skip: a failure class every AI pipeline should name
5 min read
Energy Grid Observability: What the Power Sector Can Learn from Google SRE
12 min read
The 80% Problem: Why Getting an LLM System to 'Works in Demo' Is 20% of the Work
14 min read
Datadog Went Dark for 24 Hours and Came Back With a Different Philosophy
13 min read
How to Fix Slow DNS Lookup: A Complete Troubleshooting Guide
10 min read
Slack Built a Big Red Button to Drain an Entire Data Center in Five Minutes
12 min read
Discord Killed the MacBook Dev Environment and Never Looked Back
11 min read