
SRE Runbook

34 min read
Femi Adigun
Senior Software Engineer & Coach

Table of Contents

  1. Service Outage / High Error Rate
  2. High Latency / Performance Degradation
  3. Database Connection Pool Exhaustion
  4. Memory Leak / OOM Kills
  5. Disk Space Exhaustion
  6. Certificate Expiration
  7. DDoS Attack / Traffic Surge
  8. Kubernetes Pod CrashLoopBackOff
  9. Message Queue Backup / Consumer Lag
  10. Database Replication Lag
  11. Cache Invalidation / Cache Storm
  12. Failed Deployment / Rollback
  13. Security Incident / Breach Detection
  14. Data Corruption
  15. DNS Resolution Failures

1. Service Outage / High Error Rate

Metadata

  • Severity: SEV-1 (Critical)
  • MTTR Target: < 30 minutes
  • On-Call Team: SRE, Platform Engineering
  • Escalation Path: SRE → Engineering Manager → VP Engineering
  • Last Updated: 2024-11-27
  • Owner: SRE Team

Symptoms

  • High 5xx error rate (>1% of requests)
  • Service returning errors instead of successful responses
  • Health check endpoints failing
  • Customer reports of service unavailability
  • Spike in error monitoring alerts

Detection

Automated Alerts:

Alert: ServiceHighErrorRate
Severity: Critical
Condition: error_rate > 1% for 2 minutes
Dashboard: https://grafana.company.com/service-health
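
If you want the number behind the alert rather than eyeballing the dashboard, the error ratio can be queried directly from Prometheus. A minimal sketch, assuming the usual http_requests_total counter with a status label (adjust metric and label names to your instrumentation):

# Hypothetical query - 5xx requests as a fraction of all requests over the last 5 minutes
curl -sG "http://prometheus:9090/api/v1/query" \
--data-urlencode 'query=sum(rate(http_requests_total{service="api-service",status=~"5.."}[5m])) / sum(rate(http_requests_total{service="api-service"}[5m]))'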

Manual Checks:

# Check service health
curl -i https://api.company.com/health

# Check error rate in last 5 minutes
kubectl logs -n production -l app=api-service --tail=1000 --since=5m | grep ERROR | wc -l

# Check pod status
kubectl get pods -n production -l app=api-service

Triage Steps

Step 1: Establish Incident Context (2 minutes)

# Check current time and impact window
date

# Check error rate trend
# View Grafana dashboard - is error rate increasing or stable?

# Identify scope
# All services or specific service?
# All regions or specific region?
# All users or subset of users?

# Recent changes
# Check the last few deployments
kubectl rollout history deployment/api-service -n production | tail -5

# Check recent config changes
kubectl get configmap api-config -n production -o yaml | grep -A 2 "last-applied"

Record in incident doc:

Start Time: [TIMESTAMP]
Error Rate: [X%]
Affected Service: [SERVICE_NAME]
Affected Users: [ALL/SUBSET]
Recent Changes: [YES/NO - DETAILS]
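
To save a minute under pressure, most of those fields can be pre-filled with a small helper. A sketch (the service name, labels, and output file are illustrative; error rate and user scope still need a human):

#!/usr/bin/env bash
# Illustrative sketch - capture incident context into a local notes file
svc=api-service; ns=production
{
echo "Start Time: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
echo "Affected Service: $svc"
echo "Recent Rollouts:"
kubectl rollout history deployment/$svc -n $ns | tail -5
echo "Pod Status:"
kubectl get pods -n $ns -l app=$svc --no-headers | awk '{print $1, $3}'
} > incident-context-$(date +%Y%m%d-%H%M).txt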

Step 2: Immediate Mitigation (5 minutes)

Option A: Recent Deployment - Rollback

# If deployment in last 30 minutes, rollback immediately
kubectl rollout undo deployment/api-service -n production

# Monitor rollback progress
kubectl rollout status deployment/api-service -n production

# Watch error rate
watch -n 5 'curl -s https://api.company.com/metrics | grep error_rate'

Option B: Scale Up (If Traffic Related)

# Check current replica count
kubectl get deployment api-service -n production

# Scale up by 50%
current_replicas=$(kubectl get deployment api-service -n production -o jsonpath='{.spec.replicas}')
new_replicas=$((current_replicas * 3 / 2))
kubectl scale deployment api-service -n production --replicas=$new_replicas

# Enable HPA if not already
kubectl autoscale deployment api-service -n production --min=10 --max=50 --cpu-percent=70

Option C: Circuit Breaker (If Dependency Down)

# If error logs show dependency timeouts
# Enable circuit breaker via feature flag
curl -X POST https://feature-flags.company.com/api/flags/circuit-breaker-enable \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-d '{"enabled": true, "service": "downstream-api"}'

# Or update config map
kubectl patch configmap api-config -n production \
--type merge -p '{"data":{"CIRCUIT_BREAKER_ENABLED":"true"}}'

# Restart pods to pick up config
kubectl rollout restart deployment/api-service -n production

Step 3: Root Cause Investigation (15 minutes)

Check Logs:

# Recent errors
kubectl logs deployment/api-service -n production --tail=500 --since=10m | grep -i error

# Stack traces
kubectl logs deployment/api-service -n production --tail=1000 | grep -A 10 "Exception"

# All logs from failing pods
failing_pods=$(kubectl get pods -n production -l app=api-service --field-selector=status.phase!=Running -o name)
for pod in $failing_pods; do
echo "=== Logs from $pod ==="
kubectl logs $pod -n production --tail=100
done

Check Metrics:

# CPU usage
kubectl top pods -n production -l app=api-service

# Memory usage
kubectl top pods -n production -l app=api-service --sort-by=memory

# Request rate
curl -s "http://prometheus:9090/api/v1/query?query=rate(http_requests_total{service='api-service'}[5m])"

# Database connections
curl -s "http://prometheus:9090/api/v1/query?query=db_connections_active{service='api-service'}"

Check Dependencies:

# Test database connectivity
kubectl run -i --tty --rm debug --image=postgres:latest --restart=Never -- \
psql -h postgres.production.svc.cluster.local -U appuser -d appdb -c "SELECT 1;"

# Test Redis
kubectl run -i --tty --rm debug --image=redis:latest --restart=Never -- \
redis-cli -h redis.production.svc.cluster.local ping

# Test external API
curl -i -m 5 https://external-api.partner.com/health

Check Network:

# DNS resolution
nslookup api-service.production.svc.cluster.local

# Network policies
kubectl get networkpolicies -n production

# Service endpoints
kubectl get endpoints api-service -n production

Step 4: Resolution Actions

Common Root Causes & Fixes:

A. Database Connection Pool Exhaustion

# Increase pool size (if safe)
kubectl set env deployment/api-service -n production DB_POOL_SIZE=50

# Or restart pods to reset connections
kubectl rollout restart deployment/api-service -n production

B. Memory Leak / OOM

# Increase memory limits temporarily
kubectl set resources deployment api-service -n production \
--limits=memory=4Gi --requests=memory=2Gi

# Enable heap dump on OOM (Java)
kubectl set env deployment/api-service -n production \
JAVA_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heapdump.hprof"

# Restart rolling
kubectl rollout restart deployment/api-service -n production

C. External Dependency Failure

# Enable graceful degradation
# Update feature flag to bypass failing service
curl -X POST https://feature-flags.company.com/api/flags/use-fallback-service \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-d '{"enabled": true}'

# Or enable cached responses
kubectl set env deployment/api-service -n production ENABLE_CACHE_FALLBACK=true

D. Configuration Error

# Revert config change (ConfigMaps have no rollout history - re-apply the last known-good manifest from version control)
kubectl apply -f api-config.yaml   # illustrative path to the previous known-good manifest

# Restart to pick up old config
kubectl rollout restart deployment/api-service -n production

Verification Steps

# 1. Check error rate returned to normal (<0.1%)
curl -s https://api.company.com/metrics | grep error_rate

# 2. Verify all pods healthy
kubectl get pods -n production -l app=api-service | grep -c Running
expected_count=$(kubectl get deployment api-service -n production -o jsonpath='{.spec.replicas}')
echo "Expected: $expected_count"

# 3. Test end-to-end
curl -i -X POST https://api.company.com/v1/test \
-H "Content-Type: application/json" \
-d '{"test": "data"}'

# 4. Check dependent services
curl -i https://api.company.com/health/dependencies

# 5. Monitor for 15 minutes
watch -n 30 'date && curl -s https://api.company.com/metrics | grep -E "error_rate|latency_p99"'

Communication Templates

Initial Announcement (Slack/Status Page):

🚨 INCIDENT: API Service Experiencing High Error Rate

Status: Investigating
Impact: ~40% of API requests failing
Affected: api.company.com
Started: [TIMESTAMP]
Team: Investigating root cause
ETA: 15 minutes for initial mitigation

Updates: Will provide update in 10 minutes
War Room: #incident-2024-1127-001

Update:

📊 UPDATE: API Service Incident

Status: Mitigation Applied
Action: Rolled back deployment v2.3.5
Result: Error rate decreased from 40% to 2%
Next: Monitoring for stability, investigating root cause
ETA: Full resolution in 10 minutes

Resolution:

✅ RESOLVED: API Service Incident

Status: Resolved
Duration: 27 minutes (10:15 AM - 10:42 AM ET)
Root Cause: Database connection pool exhaustion from v2.3.5 config change
Resolution: Rolled back to v2.3.4
Impact: ~2,400 failed requests during incident window
Postmortem: Will be published within 48 hours

Thank you for your patience.

Escalation Criteria

Escalate to Engineering Manager if:

  • MTTR exceeds 30 minutes
  • Impact >50% of users
  • Data loss suspected
  • Security implications identified

Escalate to VP Engineering if:

  • MTTR exceeds 1 hour
  • Major customer impact
  • Media/PR implications
  • Regulatory reporting required

Contact:

Primary On-Call SRE: [Use PagerDuty]
Engineering Manager: [Slack: @eng-manager] [Phone: XXX-XXX-XXXX]
VP Engineering: [Slack: @vp-eng] [Phone: XXX-XXX-XXXX]
Security Team: security@company.com [Slack: #security-incidents]

Post-Incident Actions

Immediate (Same Day):

  • Update incident timeline in documentation
  • Notify all stakeholders of resolution
  • Begin postmortem document
  • Capture all logs, metrics, traces for analysis
  • Take database/system snapshots if relevant

Within 48 Hours:

  • Complete blameless postmortem
  • Identify action items with owners
  • Schedule postmortem review meeting
  • Update runbook with lessons learned

Within 1 Week:

  • Implement quick wins from action items
  • Add monitoring/alerting to prevent recurrence
  • Share learnings with broader team

2. High Latency / Performance Degradation

Metadata

  • Severity: SEV-2 (High)
  • MTTR Target: < 1 hour
  • On-Call Team: SRE, Backend Engineering
  • Last Updated: 2024-11-27
  • Owner: SRE Team

Symptoms

  • P95/P99 latency exceeding SLO
  • User complaints about slow responses
  • Timeouts in dependent services
  • Increased request queue depth
  • Slow database queries

Detection

Automated Alerts:

Alert: HighLatencyP99
Severity: Warning
Condition: p99_latency > 500ms for 5 minutes
SLO: p99 < 200ms
Dashboard: https://grafana.company.com/latency
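
Besides the p99 value itself, it helps to know what fraction of requests are breaching the 200ms SLO. A sketch, assuming a standard http_request_duration_seconds histogram with a 0.2s bucket (adjust metric and label names to your instrumentation):

# Fraction of requests slower than 200ms over the last 5 minutes (hypothetical metric names)
curl -sG "http://prometheus:9090/api/v1/query" \
--data-urlencode 'query=1 - (sum(rate(http_request_duration_seconds_bucket{service="api",le="0.2"}[5m])) / sum(rate(http_request_duration_seconds_count{service="api"}[5m])))'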

Triage Steps

Step 1: Quantify Impact (2 minutes)

# Check current latency
curl -s https://api.company.com/metrics | grep -E "latency_p50|latency_p95|latency_p99"

# Get latency percentiles from Prometheus
curl -s "http://prometheus:9090/api/v1/query?query=histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{service='api'}[5m]))"

# Affected endpoints
kubectl logs deployment/api-service -n production --tail=1000 | \
awk '{print $7}' | sort | uniq -c | sort -rn | head -10

Document:

Current P99: [XXX ms] (SLO: 200ms)
Current P95: [XXX ms]
Affected Endpoints: [LIST]
User Reports: [NUMBER]

Step 2: Identify Bottleneck (10 minutes)

Check Application Performance:

# CPU usage
kubectl top pods -n production -l app=api-service

# Check for CPU throttling
kubectl describe pods -n production -l app=api-service | grep -A 5 "cpu"

# Memory pressure
kubectl top pods -n production -l app=api-service --sort-by=memory

# Thread dumps (Java applications)
kubectl exec -it deployment/api-service -n production -- jstack 1 > thread-dump.txt

# Profile CPU (if profiling enabled); drop -t so the binary profile is not mangled by the TTY
kubectl exec deployment/api-service -n production -- \
curl -s http://localhost:6060/debug/pprof/profile?seconds=30 > cpu-profile.out

Check Database:

# Active queries
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 seconds'
ORDER BY duration DESC;
"

# Slow query log
kubectl exec -it postgres-0 -n production -- tail -100 /var/log/postgresql/slow-query.log

# Database connections
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT count(*) as connection_count, state
FROM pg_stat_activity
GROUP BY state;
"

# Lock waits
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT blocked_locks.pid AS blocked_pid,
blocking_locks.pid AS blocking_pid,
blocked_activity.query AS blocked_query,
blocking_activity.query AS blocking_query
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
"

Check Cache:

# Redis latency
kubectl exec -it redis-0 -n production -- redis-cli --latency-history

# Cache stats
kubectl exec -it redis-0 -n production -- redis-cli INFO stats | grep -E "hit|miss"

# Memory usage
kubectl exec -it redis-0 -n production -- redis-cli INFO memory | grep used_memory_human

# Slow log
kubectl exec -it redis-0 -n production -- redis-cli SLOWLOG GET 10

Check Network:

# Network latency to dependencies
for service in postgres redis external-api; do
echo "=== $service ==="
kubectl run ping-test --image=busybox --rm -it --restart=Never -- \
ping -c 5 $service.production.svc.cluster.local
done

# DNS lookup times
kubectl run dns-test --image=busybox --rm -it --restart=Never -- \
nslookup api.company.com

# External API latency
time curl -X GET https://external-api.partner.com/v1/data

Check Distributed Traces:

# Identify slow spans in Jaeger
# Navigate to Jaeger UI: https://jaeger.company.com
# Filter by:
# - Service: api-service
# - Min Duration: 500ms
# - Lookback: 1 hour

# Programmatic trace query
curl "http://jaeger-query:16686/api/traces?service=api-service&limit=20&lookback=1h&minDuration=500ms"

Step 3: Apply Mitigation

Scenario A: Database Slow Queries

# Kill long-running queries (if safe)
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '60 seconds'
AND state = 'active'
AND pid <> pg_backend_pid();
"

# Add missing index (if identified)
kubectl exec -it postgres-0 -n production -- psql -U postgres -d appdb -c "
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);
"

# Analyze tables (update statistics)
kubectl exec -it postgres-0 -n production -- psql -U postgres -d appdb -c "
ANALYZE VERBOSE users;
"

# Scale read replicas
kubectl scale statefulset postgres-replica -n production --replicas=5

Scenario B: Cache Miss Storm

# Pre-warm cache with common queries
kubectl exec -it deployment/api-service -n production -- \
curl -X POST http://localhost:8080/admin/cache/warmup

# Increase cache size
kubectl exec -it redis-0 -n production -- redis-cli CONFIG SET maxmemory 4gb

# Enable cache fallback to stale data
kubectl set env deployment/api-service -n production CACHE_SERVE_STALE=true

Scenario C: CPU/Memory Constrained

# Increase resources
kubectl set resources deployment api-service -n production \
--limits=cpu=2000m,memory=4Gi \
--requests=cpu=1000m,memory=2Gi

# Scale horizontally
kubectl scale deployment api-service -n production --replicas=20

# Enable HPA
kubectl autoscale deployment api-service -n production \
--min=10 --max=30 --cpu-percent=60

Scenario D: External API Slow

# Increase timeout and enable caching
kubectl set env deployment/api-service -n production \
EXTERNAL_API_TIMEOUT=10000 \
EXTERNAL_API_CACHE_ENABLED=true \
EXTERNAL_API_CACHE_TTL=300

# Enable circuit breaker
kubectl set env deployment/api-service -n production \
CIRCUIT_BREAKER_ENABLED=true \
CIRCUIT_BREAKER_THRESHOLD=50

# Use fallback/cached data
kubectl patch configmap api-config -n production \
--type merge -p '{"data":{"USE_FALLBACK_DATA":"true"}}'

Scenario E: Thread Pool Exhaustion

# Increase thread pool size
kubectl set env deployment/api-service -n production \
THREAD_POOL_SIZE=200 \
THREAD_QUEUE_SIZE=1000

# Restart to apply
kubectl rollout restart deployment/api-service -n production

Verification Steps

# 1. Monitor latency improvement
watch -n 10 'curl -s https://api.company.com/metrics | grep latency_p99'

# 2. Check trace samples
# View Jaeger for recent requests - should show improved latency

# 3. Database query times
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
"

# 4. Resource utilization normalized
kubectl top pods -n production -l app=api-service

# 5. Error rate stable (ensure fix didn't introduce errors)
curl -s https://api.company.com/metrics | grep error_rate

Root Cause Investigation

Common Causes:

  1. N+1 Query Problem

    • Check ORM query patterns
    • Enable query logging
    • Add eager loading
  2. Missing Database Index (see the example after this list)

    • Analyze slow query log
    • Use EXPLAIN ANALYZE
    • Create appropriate indexes
  3. Memory Garbage Collection

    • Check GC logs (Java/JVM)
    • Tune GC parameters
    • Increase heap size
  4. Inefficient Algorithm

    • Profile code execution
    • Identify hot paths
    • Optimize algorithms
  5. External Service Degradation

    • Check dependency SLOs
    • Implement caching
    • Add circuit breakers
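
For cause 2, EXPLAIN ANALYZE usually confirms a missing index in seconds. A sketch against the users table used elsewhere in this runbook (table, column, and value are illustrative):

# A "Seq Scan on users" with high actual time on a large table usually means a missing index
kubectl exec -it postgres-0 -n production -- psql -U postgres -d appdb -c "
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'user@example.com';
"
# After CREATE INDEX, the same query should show an Index Scan instead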

Communication Templates

Initial Alert:

⚠️  INCIDENT: API Latency Degradation

Status: Investigating
Impact: P99 latency at 800ms (SLO: 200ms)
Affected: All API endpoints
User Impact: Slow response times
Team: Investigating root cause
Updates: Every 15 minutes in #incident-channel

Resolution:

✅ RESOLVED: API Latency Degradation

Duration: 45 minutes
Root Cause: Missing database index on users.email causing table scans
Resolution: Added index, latency returned to normal
Current P99: 180ms (within SLO)
Postmortem: Will be published within 48 hours

Post-Incident Actions

  • Add database query monitoring
  • Implement automated index recommendations
  • Load test with realistic data volumes
  • Add latency SLO alerts per endpoint
  • Review and optimize slow queries
  • Implement APM (Application Performance Monitoring)

3. Database Connection Pool Exhaustion

Metadata

  • Severity: SEV-1 (Critical)
  • MTTR Target: < 15 minutes
  • On-Call Team: SRE, Database Team
  • Last Updated: 2024-11-27

Symptoms

  • "Connection pool exhausted" errors in application logs
  • Requests timing out
  • Database showing many idle connections
  • Application unable to acquire new connections
  • Connection pool at 100% utilization

Detection

# Check connection pool metrics
curl -s https://api.company.com/metrics | grep db_pool

# Expected output:
# db_pool_active 50
# db_pool_idle 0
# db_pool_total 50
# db_pool_wait_count 1500 <-- High wait count indicates problem
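
To turn those numbers into a single figure during triage, a one-liner like the following works (a sketch, assuming the metric names shown in the expected output above):

# Print pool utilization as a percentage (hypothetical metric names, matching the output above)
curl -s https://api.company.com/metrics | \
awk '/^db_pool_active/ {a=$2} /^db_pool_total/ {t=$2} END {if (t>0) printf "pool utilization: %.0f%%\n", 100*a/t}'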

Triage Steps

Step 1: Confirm Pool Exhaustion (1 minute)

# Application side
kubectl logs deployment/api-service -n production --tail=100 | \
grep -i "connection pool\|unable to acquire\|timeout"

# Database side - check connection count
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT count(*) as total_connections,
state,
application_name
FROM pg_stat_activity
WHERE datname = 'appdb'
GROUP BY state, application_name
ORDER BY total_connections DESC;
"

# Check pool configuration
kubectl get configmap api-config -n production -o yaml | grep -i pool

Step 2: Immediate Mitigation (5 minutes)

Option A: Restart Application Pods (Fastest)

# Rolling restart to reset connections
kubectl rollout restart deployment/api-service -n production

# Monitor restart
kubectl rollout status deployment/api-service -n production

# Verify connections released
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT count(*) FROM pg_stat_activity WHERE datname = 'appdb';
"

Option B: Increase Pool Size (If Infrastructure Allows)

# Check database connection limit
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SHOW max_connections;
"

# Calculate safe pool size
# max_connections / number_of_app_instances = pool_size_per_instance
# Example: 200 max / 10 instances = 20 per instance (current might be 50)
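
# Quick calculation sketch from live values (illustrative; assumes the names used above)
max_conns=$(kubectl exec postgres-0 -n production -- psql -U postgres -tAc "SHOW max_connections;")
replicas=$(kubectl get deployment api-service -n production -o jsonpath='{.spec.replicas}')
echo "Safe DB_POOL_SIZE per instance: $((max_conns / replicas))"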

# Increase pool size temporarily
kubectl set env deployment/api-service -n production \
DB_POOL_SIZE=30 \
DB_POOL_MAX_IDLE=10 \
DB_CONNECTION_TIMEOUT=30000

# Monitor
watch -n 5 'kubectl logs deployment/api-service -n production --tail=50 | grep -i pool'

Option C: Kill Idle Connections (If Many Idle)

# Identify idle connections
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT pid, state, query_start, state_change, query
FROM pg_stat_activity
WHERE datname = 'appdb'
AND state = 'idle'
AND state_change < now() - interval '5 minutes';
"

# Kill long-idle connections
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'appdb'
AND state = 'idle'
AND state_change < now() - interval '10 minutes'
AND pid <> pg_backend_pid();
"

Step 3: Root Cause Analysis (10 minutes)

Check for Connection Leaks:

# Application logs - look for unclosed connections
kubectl logs deployment/api-service -n production --tail=5000 | \
grep -i "connection not closed\|resource leak"

# Check connection lifecycle
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT application_name,
state,
count(*) as conn_count,
max(now() - state_change) as max_idle_time
FROM pg_stat_activity
WHERE datname = 'appdb'
GROUP BY application_name, state
ORDER BY conn_count DESC;
"

# Long-running transactions
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT pid, now() - xact_start AS duration, state, query
FROM pg_stat_activity
WHERE datname = 'appdb'
AND xact_start IS NOT NULL
ORDER BY duration DESC
LIMIT 20;
"

Check Recent Changes:

# Recent deployments
kubectl rollout history deployment/api-service -n production | tail -5

# Config changes
kubectl get configmap api-config -n production -o yaml | \
grep -A 2 "last-applied-configuration"

# Recent code changes affecting database access
git log --since="24 hours ago" --grep="database\|pool\|connection" --oneline

Check for Traffic Spike:

# Request rate
curl "http://prometheus:9090/api/v1/query?query=rate(http_requests_total{service='api-service'}[5m])"

# Compare to baseline
curl "http://prometheus:9090/api/v1/query?query=rate(http_requests_total{service='api-service'}[5m] offset 1h)"

Resolution Actions

Permanent Fix Options:

A. Fix Connection Leak in Code

# Bad - connection leak
def get_user(user_id):
    conn = db_pool.getconn()
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
    result = cursor.fetchone()
    return result  # Connection never returned!

# Good - always return the connection to the pool
def get_user(user_id):
    conn = db_pool.getconn()
    try:
        with conn.cursor() as cursor:
            cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
            return cursor.fetchone()
    finally:
        db_pool.putconn(conn)  # Returned to the pool even if the query raises

B. Optimize Pool Configuration

# Configure based on actual usage patterns
kubectl set env deployment/api-service -n production \
DB_POOL_SIZE=20 \
DB_POOL_MIN_IDLE=5 \
DB_POOL_MAX_IDLE=10 \
DB_POOL_IDLE_TIMEOUT=300000 \
DB_POOL_CONNECTION_TIMEOUT=30000 \
DB_POOL_VALIDATION_TIMEOUT=5000

# Enable connection validation
kubectl set env deployment/api-service -n production \
DB_POOL_TEST_ON_BORROW=true \
DB_POOL_TEST_WHILE_IDLE=true

C. Implement Connection Pooler (PgBouncer)

# Deploy PgBouncer
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pgbouncer
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: pgbouncer
  template:
    metadata:
      labels:
        app: pgbouncer
    spec:
      containers:
      - name: pgbouncer
        image: pgbouncer/pgbouncer:latest
        env:
        - name: DATABASES_HOST
          value: postgres.production.svc.cluster.local
        - name: POOL_MODE
          value: transaction
        - name: MAX_CLIENT_CONN
          value: "1000"
        - name: DEFAULT_POOL_SIZE
          value: "25"
---
apiVersion: v1
kind: Service
metadata:
  name: pgbouncer
  namespace: production
spec:
  selector:
    app: pgbouncer
  ports:
  - port: 5432
    targetPort: 5432
EOF

# Update application to use PgBouncer
kubectl set env deployment/api-service -n production \
DB_HOST=pgbouncer.production.svc.cluster.local

D. Scale Database Connections

# Increase PostgreSQL max_connections (written to postgresql.auto.conf)
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
ALTER SYSTEM SET max_connections = 300;
"

# Note: max_connections only takes effect after a full server restart - pg_reload_conf() is not enough
# If your setup tolerates a brief primary restart:
kubectl delete pod postgres-0 -n production

Monitoring & Alerting

Add Proactive Monitoring:

# Prometheus alert rules
groups:
- name: database_pool
  rules:
  - alert: ConnectionPoolHighUtilization
    expr: db_pool_active / db_pool_total > 0.7
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Connection pool utilization >70%"
      description: "Pool at {{ $value | humanizePercentage }} capacity"

  - alert: ConnectionPoolExhausted
    expr: db_pool_active / db_pool_total > 0.9
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Connection pool nearly exhausted"

  - alert: ConnectionPoolWaitTime
    expr: rate(db_pool_wait_count[5m]) > 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High connection pool wait count"

Dashboard Metrics:

- db_pool_total (configured pool size)
- db_pool_active (connections in use)
- db_pool_idle (connections available)
- db_pool_wait_count (requests waiting for connection)
- db_pool_wait_time_ms (time waiting for connection)
- db_connection_lifetime_seconds (connection age histogram)

Verification Steps

# 1. Pool utilization back to normal
kubectl exec -it deployment/api-service -n production -- \
curl http://localhost:8080/metrics | grep db_pool_active

# Should see: db_pool_active < 70% of db_pool_total

# 2. No wait queue
kubectl exec -it deployment/api-service -n production -- \
curl http://localhost:8080/metrics | grep db_pool_wait_count

# Should see: db_pool_wait_count = 0 or minimal

# 3. Database connection count stable
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT count(*) FROM pg_stat_activity WHERE datname = 'appdb';
"

# 4. No errors in logs
kubectl logs deployment/api-service -n production --tail=100 | \
grep -i "connection" | grep -i error

# 5. Response times normal
curl -s https://api.company.com/metrics | grep latency_p99

Prevention

Code Review Checklist:

  • All database connections properly closed
  • Using connection pool best practices
  • Proper error handling to ensure connection release
  • No connection usage outside transactions
  • Connection timeout configured

Testing:

  • Load test with connection pool monitoring
  • Chaos engineering: Test with limited connections
  • Connection leak detection in CI/CD

Architecture:

  • Consider connection pooler (PgBouncer)
  • Implement read replicas to distribute load
  • Use caching to reduce database queries

4. Memory Leak / OOM Kills

Metadata

  • Severity: SEV-2 (High)
  • MTTR Target: < 1 hour
  • On-Call Team: SRE, Backend Engineering
  • Last Updated: 2024-11-27

Symptoms

  • Pods being OOMKilled (Out of Memory)
  • Memory usage continuously increasing
  • Slow performance and increased GC pressure
  • Pod restarts without clear error
  • "Cannot allocate memory" errors

Detection

# Check for OOMKilled pods
kubectl get pods -n production -l app=api-service | grep OOMKilled

# Check pod events
kubectl get events -n production --field-selector involvedObject.name=<pod-name> | \
grep -i oom

# Memory usage trend
kubectl top pods -n production -l app=api-service --sort-by=memory

Triage Steps

Step 1: Confirm OOM Issue (2 minutes)

# Check pod status and restart reason
kubectl describe pod <pod-name> -n production | grep -A 10 "Last State"

# Should see output like:
# Last State: Terminated
# Reason: OOMKilled
# Exit Code: 137

# Check memory limits
kubectl get pod <pod-name> -n production -o jsonpath='{.spec.containers[*].resources}'

# Monitor memory usage
watch -n 5 'kubectl top pods -n production -l app=api-service'

Step 2: Immediate Mitigation (10 minutes)

Option A: Increase Memory Limits (Quick Fix)

# Current limits
kubectl get deployment api-service -n production -o jsonpath='{.spec.template.spec.containers[0].resources}'

# Increase memory limit temporarily (2x current)
kubectl set resources deployment api-service -n production \
--limits=memory=4Gi \
--requests=memory=2Gi

# Monitor rollout
kubectl rollout status deployment/api-service -n production

# Watch memory usage
watch -n 10 'kubectl top pods -n production -l app=api-service'

Option B: Scale Out (If Memory Leak is Gradual)

# Add more pods to distribute load
kubectl scale deployment api-service -n production --replicas=15

# Enable HPA with lower memory target
kubectl autoscale deployment api-service -n production \
--min=10 --max=30 --cpu-percent=70

# Note: This is temporary - still need to fix leak

Option C: Implement Pod Lifecycle (Workaround)

# Restart pods proactively before they OOM
# Add to deployment spec:
kubectl patch deployment api-service -n production --type=json -p='[
  {
    "op": "add",
    "path": "/spec/template/spec/containers/0/lifecycle",
    "value": {
      "preStop": {
        "exec": {
          "command": ["/bin/sh", "-c", "sleep 15"]
        }
      }
    }
  }
]'

# Add TTL to pod lifecycle (requires external controller)
# Or implement rolling restart every 12 hours

Step 3: Capture Diagnostics (15 minutes)

Capture Heap Dump (Java/JVM):

# Before pod is killed, capture heap dump
kubectl exec -it <pod-name> -n production -- \
jcmd 1 GC.heap_dump /tmp/heapdump.hprof

# Copy heap dump locally
kubectl cp production/<pod-name>:/tmp/heapdump.hprof ./heapdump-$(date +%Y%m%d-%H%M%S).hprof

# Analyze with MAT or jhat
# Upload to analysis tools or analyze locally

Capture Memory Profile (Go Applications):

# If profiling endpoint enabled, write a heap profile inside the pod
kubectl exec -it <pod-name> -n production -- \
curl -s -o /tmp/heap-profile.prof http://localhost:6060/debug/pprof/heap

# Copy locally
kubectl cp production/<pod-name>:/tmp/heap-profile.prof ./heap-profile.prof

# Analyze
go tool pprof -http=:8080 heap-profile.prof

Capture Memory Metrics (Python):

# Install memory_profiler if not already
kubectl exec -it <pod-name> -n production -- pip install memory_profiler

# Profile specific function
kubectl exec -it <pod-name> -n production -- \
python -m memory_profiler app.py

# Or use tracemalloc - note this must run inside the application process itself
# (wire it into an admin/debug endpoint); the snippet below only demonstrates the API
kubectl exec -i <pod-name> -n production -- python3 <<'EOF'
import tracemalloc

tracemalloc.start()
# ... exercise the code path under suspicion here ...
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')
for stat in top_stats[:10]:
    print(stat)
EOF

Check for Common Memory Issues:

# Large objects in memory
kubectl exec -it <pod-name> -n production -- \
curl http://localhost:8080/debug/vars | jq '.memstats'

# Check goroutine leaks (Go)
kubectl exec -it <pod-name> -n production -- \
curl http://localhost:6060/debug/pprof/goroutine?debug=1

# Check thread count (Java)
kubectl exec -it <pod-name> -n production -- \
jcmd 1 Thread.print | grep "Thread" | wc -l

# File descriptor leaks
kubectl exec -it <pod-name> -n production -- \
ls -la /proc/1/fd | wc -l

Step 4: Identify Root Cause

Common Memory Leak Causes:

A. Caching Without Eviction:

# Bad - unbounded cache
cache = {}

def get_user(user_id):
    if user_id not in cache:
        cache[user_id] = fetch_from_db(user_id)  # Cache grows forever!
    return cache[user_id]

# Good - bounded cache with LRU
from functools import lru_cache

@lru_cache(maxsize=1000)
def get_user(user_id):
    return fetch_from_db(user_id)

B. Event Listeners Not Removed:

// Bad - event listener leak
class Component {
  constructor() {
    this.data = new Array(1000000);
    window.addEventListener("resize", () => this.handleResize());
  }
  // Missing cleanup!
}

// Good - cleanup listeners
class Component {
  constructor() {
    this.data = new Array(1000000);
    this.handleResize = this.handleResize.bind(this);
    window.addEventListener("resize", this.handleResize);
  }

  destroy() {
    window.removeEventListener("resize", this.handleResize);
    this.data = null;
  }
}

C. Goroutine Leaks (Go):

// Bad - goroutine leak
func processRequests() {
    for request := range requests {
        go handleRequest(request) // Goroutines never cleaned up
    }
}

// Good - bounded goroutines
func processRequests() {
    sem := make(chan struct{}, 100) // Max 100 concurrent
    for request := range requests {
        sem <- struct{}{}
        go func(req Request) {
            defer func() { <-sem }()
            handleRequest(req)
        }(request)
    }
}

D. Database Result Sets Not Closed:

// Bad - result set leak
public List<User> getUsers() {
    Statement stmt = conn.createStatement();
    ResultSet rs = stmt.executeQuery("SELECT * FROM users");
    List<User> users = new ArrayList<>();
    while (rs.next()) {
        users.add(new User(rs));
    }
    return users; // ResultSet and Statement never closed!
}

// Good - use try-with-resources
public List<User> getUsers() {
    List<User> users = new ArrayList<>();
    try (Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT * FROM users")) {
        while (rs.next()) {
            users.add(new User(rs));
        }
    }
    return users;
}

Resolution Actions

Short-term:

# 1. Increase memory limits (already done)

# 2. Enable memory monitoring
kubectl patch deployment api-service -n production --type=json -p='[
  {
    "op": "add",
    "path": "/spec/template/metadata/annotations/prometheus.io~1scrape",
    "value": "true"
  }
]'

# 3. Add liveness probe with memory check
kubectl patch deployment api-service -n production --type=json -p='[
  {
    "op": "add",
    "path": "/spec/template/spec/containers/0/livenessProbe",
    "value": {
      "httpGet": {
        "path": "/health",
        "port": 8080
      },
      "initialDelaySeconds": 30,
      "periodSeconds": 10
    }
  }
]'

# 4. Implement automatic restart before OOM
# Create CronJob to restart pods every 12 hours (temporary)
kubectl create -f - <<EOF
apiVersion: batch/v1
kind: CronJob
metadata:
  name: api-service-restart
  namespace: production
spec:
  schedule: "0 */12 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: pod-restarter
          containers:
          - name: kubectl
            image: bitnami/kubectl:latest
            command:
            - /bin/sh
            - -c
            - kubectl rollout restart deployment/api-service -n production
          restartPolicy: OnFailure
EOF

Long-term Fix:

# 1. Fix code leak (deploy patch)
git checkout -b fix/memory-leak
# ... implement fix ...
git commit -m "Fix: Remove unbounded cache causing memory leak"
git push origin fix/memory-leak

# 2. Add memory profiling in production
kubectl set env deployment/api-service -n production \
ENABLE_PROFILING=true \
PROFILING_PORT=6060

# 3. Implement memory limits in code
# For Java:
kubectl set env deployment/api-service -n production \
JAVA_OPTS="-Xmx2g -XX:MaxMetaspaceSize=256m -XX:+UseG1GC"

# 4. Add memory monitoring dashboard

# 5. Implement alerts for memory growth

Monitoring & Prevention

Add Alerts:

groups:
- name: memory_alerts
  rules:
  - alert: MemoryUsageHigh
    expr: container_memory_usage_bytes{pod=~"api-service.*"} / container_spec_memory_limit_bytes{pod=~"api-service.*"} > 0.8
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Pod memory usage >80%"

  - alert: MemoryUsageGrowing
    expr: predict_linear(container_memory_usage_bytes{pod=~"api-service.*"}[1h], 3600) > container_spec_memory_limit_bytes{pod=~"api-service.*"}
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Memory usage trending towards OOM"

  - alert: OOMKillsDetected
    expr: increase(kube_pod_container_status_restarts_total{pod=~"api-service.*"}[15m]) > 3
    labels:
      severity: critical
    annotations:
      summary: "Multiple pod restarts detected (possible OOM)"

Grafana Dashboard:

- Memory Usage (%)
- Memory Usage (bytes) over time
- Predicted time to OOM (see the query sketch after this list)
- GC frequency and duration
- Heap size vs used heap
- Number of objects in memory
- Pod restart count
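
The "Predicted time to OOM" panel can be approximated ad hoc from Prometheus. A sketch using the same container metrics as the alert rules above (only meaningful while memory is actually growing):

# Estimate seconds until the memory limit is hit, per pod (hypothetical query; negative or huge values mean no growth)
curl -sG "http://prometheus:9090/api/v1/query" \
--data-urlencode 'query=(container_spec_memory_limit_bytes{pod=~"api-service.*"} - container_memory_usage_bytes{pod=~"api-service.*"}) / deriv(container_memory_usage_bytes{pod=~"api-service.*"}[30m])'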

Verification Steps

# 1. Memory usage stable
watch -n 30 'kubectl top pods -n production -l app=api-service | tail -5'

# 2. No OOM kills in the last hour
kubectl get events -A --field-selector reason=OOMKilling --sort-by=.lastTimestamp | tail -20

# 3. Pod uptime increasing (not restarting)
kubectl get pods -n production -l app=api-service -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.startTime}{"\n"}{end}'

# 4. Memory growth linear/flat (not exponential)
# Check Grafana memory usage graph

# 5. Application metrics healthy
curl -s https://api.company.com/metrics | grep -E "heap|gc|memory"

Post-Incident Actions

  • Analyze heap dump to identify leak source
  • Review code for common leak patterns
  • Add memory profiling to CI/CD
  • Implement memory budgets in code
  • Add integration tests for memory leaks
  • Document memory configuration guidelines
  • Train team on memory leak prevention

5. Disk Space Exhaustion

Metadata

  • Severity: SEV-2 (High), escalates to SEV-1 if database affected
  • MTTR Target: < 30 minutes
  • On-Call Team: SRE, Infrastructure
  • Last Updated: 2024-11-27

Symptoms

  • "No space left on device" errors
  • Applications unable to write logs
  • Database unable to write data
  • Pod evictions due to disk pressure
  • Slow I/O performance

Detection

# Check disk usage on nodes
kubectl get nodes -o custom-columns=NAME:.metadata.name,DISK:.status.allocatable.ephemeral-storage

# Check specific node
kubectl describe node <node-name> | grep -A 5 "Allocated resources"

# SSH to node and check
ssh <node> df -h

# Check for disk pressure
kubectl get nodes -o json | jq '.items[] | select(.status.conditions[] | select(.type=="DiskPressure" and .status=="True")) | .metadata.name'

Triage Steps

Step 1: Identify Affected Systems (2 minutes)

# Which nodes are affected?
for node in $(kubectl get nodes -o name); do
echo "=== $node ==="
kubectl describe $node | grep -E "DiskPressure|ephemeral-storage"
done

# Which pods are on affected nodes?
kubectl get pods -n production -o wide | grep <affected-node>

# Critical services affected?
kubectl get pods -n production -l tier=critical -o wide

Step 2: Immediate Mitigation (10 minutes)

Option A: Clean Up Logs

# SSH to affected node
ssh <node-name>

# Find large log files
sudo find /var/log -type f -size +100M -exec ls -lh {} \; | awk '{ print $9 ": " $5 }'

# Rotate/truncate large logs
sudo truncate -s 0 /var/log/containers/*.log
sudo journalctl --vacuum-size=500M

# Clean Docker logs (if applicable)
sudo sh -c "truncate -s 0 /var/lib/docker/containers/*/*-json.log"

# Kubernetes log cleanup
sudo find /var/log/pods -name "*.log" -mtime +7 -delete

Option B: Remove Unused Docker Images

# On affected node
ssh <node-name>

# List images sorted by size
sudo docker images --format "{{.Repository}}:{{.Tag}}\t{{.Size}}" | sort -k2 -h

# Remove unused images
sudo docker image prune -a --filter "until=72h" -f

# Remove dangling volumes
sudo docker volume prune -f

Option C: Clean Up Pod Ephemeral Storage

# Find pods using most disk
kubectl get pods -n production -o json | \
jq -r '.items[] | "\(.metadata.name) \(.spec.nodeName)"' | \
while read pod node; do
if [ "$node" = "<affected-node>" ]; then
echo "=== $pod ==="
kubectl exec $pod -n production -- du -sh /tmp /var/tmp 2>/dev/null || true
fi
done

# Clean up specific pod
kubectl exec <pod-name> -n production -- sh -c "rm -rf /tmp/*"

Option D: Cordon and Drain Node

# Prevent new pods from scheduling
kubectl cordon <node-name>

# Drain pods to other nodes
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --grace-period=60

# Clean up on node
ssh <node-name>
sudo docker system prune -a -f --volumes
sudo rm -rf /var/log/pods/*
sudo rm -rf /var/lib/kubelet/pods/*

# Uncordon when ready
kubectl uncordon <node-name>

Option E: Emergency Database Cleanup (If DB Affected)

# Connect to database pod
kubectl exec -it postgres-0 -n production -- psql -U postgres

# Check database sizes
SELECT pg_database.datname,
pg_size_pretty(pg_database_size(pg_database.datname)) AS size
FROM pg_database
ORDER BY pg_database_size(pg_database.datname) DESC;

# Check table sizes
SELECT schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 20;

# Archive old data (if safe)
# Example: Archive logs older than 90 days
# Note: COPY ... TO PROGRAM requires superuser, and $(date) is not expanded inside psql - use a literal filename
BEGIN;
COPY (SELECT * FROM audit_logs WHERE created_at < NOW() - INTERVAL '90 days')
TO PROGRAM 'gzip > /tmp/audit_logs_archive.csv.gz' WITH CSV HEADER;
DELETE FROM audit_logs WHERE created_at < NOW() - INTERVAL '90 days';
COMMIT;

# Vacuum to reclaim space (takes an exclusive table lock; run during a quiet window)
VACUUM FULL audit_logs;

Step 3: Root Cause Analysis

Find What's Consuming Space:

# On affected node
ssh <node-name>

# Find largest directories
sudo du -h --max-depth=3 / 2>/dev/null | sort -hr | head -20

# Find largest files
sudo find / -type f -size +500M -exec ls -lh {} \; 2>/dev/null | awk '{ print $9 ": " $5 }'

# Check specific directories
sudo du -sh /var/lib/docker/*
sudo du -sh /var/lib/kubelet/*
sudo du -sh /var/log/*

# Find recently created large files
sudo find / -type f -size +100M -mtime -1 -exec ls -lh {} \; 2>/dev/null

Common Culprits:

  1. Excessive Logging
# Check log volume
kubectl exec <pod-name> -n production -- du -sh /var/log

# Check logging rate
kubectl logs <pod-name> -n production --tail=100 --timestamps | \
awk '{print $1}' | sort | uniq -c
  2. Temp File Accumulation
# Check temp directories
kubectl exec <pod-name> -n production -- du -sh /tmp /var/tmp

# Find old temp files
kubectl exec <pod-name> -n production -- find /tmp -type f -mtime +7 -ls
  3. Database Growth
# PostgreSQL WAL files
kubectl exec postgres-0 -n production -- \
du -sh /var/lib/postgresql/data/pg_wal/

# MySQL binary logs
kubectl exec mysql-0 -n production -- \
du -sh /var/lib/mysql/binlog/
  4. Image/Container Buildup
# Unused containers
sudo docker ps -a --filter "status=exited" --filter "status=dead"

# Image layer cache
sudo du -sh /var/lib/docker/overlay2/

Resolution Actions

Short-term:

# 1. Implement log rotation
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: filebeat-config
  namespace: production
data:
  filebeat.yml: |
    filebeat.inputs:
    - type: container
      paths:
        - /var/log/containers/*.log
      processors:
        - add_kubernetes_metadata:
            host: \${NODE_NAME}
            matchers:
            - logs_path:
                logs_path: "/var/log/containers/"
    output.logstash:
      hosts: ["logstash:5044"]
    # Local log cleanup
    queue.mem:
      events: 4096
      flush.min_events: 512
      flush.timeout: 5s
EOF

# 2. Set up log shipping
kubectl apply -f https://raw.githubusercontent.com/elastic/beats/master/deploy/kubernetes/filebeat-kubernetes.yaml

# 3. Configure log rotation on nodes
# Add to node configuration or DaemonSet
cat <<'EOF' | sudo tee /etc/logrotate.d/containers
/var/log/containers/*.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    create 0644 root root
    postrotate
        /usr/bin/docker ps -a --format '{{.Names}}' | xargs -I {} docker kill -s HUP {} 2>/dev/null || true
    endscript
}
EOF

Long-term:

# 1. Set ephemeral-storage limits on pods
kubectl patch deployment api-service -n production --type=json -p='[
  {
    "op": "add",
    "path": "/spec/template/spec/containers/0/resources/limits/ephemeral-storage",
    "value": "2Gi"
  },
  {
    "op": "add",
    "path": "/spec/template/spec/containers/0/resources/requests/ephemeral-storage",
    "value": "1Gi"
  }
]'

# 2. Enable disk usage monitoring
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: node-exporter-config
  namespace: monitoring
data:
  entrypoint.sh: |
    #!/bin/sh
    exec /bin/node_exporter \
      --collector.filesystem \
      --collector.filesystem.mount-points-exclude="^/(dev|proc|sys|var/lib/docker/.+)(\$|/)" \
      --web.listen-address=:9100
EOF

# 3. Set up automated cleanup CronJob
kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: CronJob
metadata:
  name: disk-cleanup
  namespace: kube-system
spec:
  schedule: "0 2 * * *" # 2 AM daily
  jobTemplate:
    spec:
      template:
        spec:
          hostPID: true
          hostNetwork: true
          containers:
          - name: cleanup
            image: alpine:latest
            command:
            - /bin/sh
            - -c
            - |
              # Clean old logs
              find /host/var/log/pods -name "*.log" -mtime +7 -delete
              # Clean old containers
              nsenter --mount=/proc/1/ns/mnt -- docker system prune -af --filter "until=72h"
            securityContext:
              privileged: true
            volumeMounts:
            - name: host
              mountPath: /host
          volumes:
          - name: host
            hostPath:
              path: /
          restartPolicy: OnFailure
EOF

# 4. Implement disk alerts
kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: disk-space-alerts
  namespace: monitoring
spec:
  groups:
  - name: disk
    rules:
    - alert: NodeDiskSpaceHigh
      expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.2
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Node disk space <20%"

    - alert: NodeDiskSpaceCritical
      expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.1
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "Node disk space <10%"
EOF

Monitoring & Prevention

Metrics to Track:

- node_filesystem_avail_bytes (available disk space)
- node_filesystem_size_bytes (total disk space)
- container_fs_usage_bytes (container filesystem usage)
- kubelet_volume_stats_used_bytes (PV usage)
- log_file_size (application log sizes)

Dashboards:

- Node disk usage per mount point (see the query sketch after this list)
- Pod ephemeral storage usage
- PV usage trends
- Log growth rate
- Image/container count over time
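
For a quick ad-hoc view of the same data, the node_exporter metrics listed above can be queried directly. A sketch, assuming the standard metric names:

# Percentage of disk space still free on the root filesystem, per node
curl -sG "http://prometheus:9090/api/v1/query" \
--data-urlencode 'query=100 * node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}'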

Verification Steps

# 1. Disk space recovered
ssh <node-name> df -h /

# 2. No disk pressure
kubectl describe node <node-name> | grep DiskPressure

# 3. Pods stable
kubectl get pods -n production -o wide | grep <node-name>

# 4. Services healthy
curl -i https://api.company.com/health

# 5. No pod evictions
kubectl get events --field-selector reason=Evicted -n production

Post-Incident Actions

  • Analyze what caused disk fill
  • Implement proper log management strategy
  • Set ephemeral-storage limits on all pods
  • Configure automated cleanup
  • Add capacity planning for storage
  • Review and optimize logging verbosity
  • Document disk space requirements

6. Certificate Expiration

Metadata

  • Severity: SEV-1 (Critical) if expired, SEV-3 (Low) if approaching
  • MTTR Target: < 15 minutes for renewal, 0 minutes for prevention
  • On-Call Team: SRE, Security
  • Last Updated: 2024-11-27

Symptoms

  • "SSL certificate has expired" errors
  • Browsers showing security warnings
  • API clients unable to connect
  • Services failing TLS handshake
  • Certificate validation errors in logs

Detection

# Check certificate expiration
echo | openssl s_client -servername api.company.com -connect api.company.com:443 2>/dev/null | \
openssl x509 -noout -dates

# Check all Kubernetes TLS secrets
kubectl get secrets -A -o json | \
jq -r '.items[] | select(.type=="kubernetes.io/tls") | "\(.metadata.namespace)/\(.metadata.name)"' | \
while read secret; do
namespace=$(echo $secret | cut -d/ -f1)
name=$(echo $secret | cut -d/ -f2)
expiry=$(kubectl get secret $name -n $namespace -o jsonpath='{.data.tls\.crt}' | \
base64 -d | openssl x509 -noout -enddate 2>/dev/null | cut -d= -f2)
echo "$secret: $expiry"
done

# Check certificate in