
SRE Runbook

34 min read
Femi Adigun
Senior Software Engineer & Coach

Table of Contents

  1. Service Outage / High Error Rate
  2. High Latency / Performance Degradation
  3. Database Connection Pool Exhaustion
  4. Memory Leak / OOM Kills
  5. Disk Space Exhaustion
  6. Certificate Expiration
  7. DDoS Attack / Traffic Surge
  8. Kubernetes Pod CrashLoopBackOff
  9. Message Queue Backup / Consumer Lag
  10. Database Replication Lag
  11. Cache Invalidation / Cache Storm
  12. Failed Deployment / Rollback
  13. Security Incident / Breach Detection
  14. Data Corruption
  15. DNS Resolution Failures

1. Service Outage / High Error Rate

Metadata

  • Severity: SEV-1 (Critical)
  • MTTR Target: < 30 minutes
  • On-Call Team: SRE, Platform Engineering
  • Escalation Path: SRE → Engineering Manager → VP Engineering
  • Last Updated: 2024-11-27
  • Owner: SRE Team

Symptoms

  • High 5xx error rate (>1% of requests)
  • Service returning errors instead of successful responses
  • Health check endpoints failing
  • Customer reports of service unavailability
  • Spike in error monitoring alerts

Detection

Automated Alerts:

Alert: ServiceHighErrorRate
Severity: Critical
Condition: error_rate > 1% for 2 minutes
Dashboard: https://grafana.company.com/service-health
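
If you want the number behind the alert rather than eyeballing the dashboard, the error ratio can be queried directly from Prometheus. A minimal sketch, assuming the usual http_requests_total counter with a status label (adjust metric and label names to your instrumentation):

# Hypothetical query - 5xx requests as a fraction of all requests over the last 5 minutes
curl -sG "http://prometheus:9090/api/v1/query" \
--data-urlencode 'query=sum(rate(http_requests_total{service="api-service",status=~"5.."}[5m])) / sum(rate(http_requests_total{service="api-service"}[5m]))'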

Manual Checks:

# Check service health
curl -i https://api.company.com/health

# Check error rate in last 5 minutes
kubectl logs -n production -l app=api-service --tail=1000 --since=5m | grep ERROR | wc -l

# Check pod status
kubectl get pods -n production -l app=api-service

Triage Steps

Step 1: Establish Incident Context (2 minutes)

# Check current time and impact window
date

# Check error rate trend
# View Grafana dashboard - is error rate increasing or stable?

# Identify scope
# All services or specific service?
# All regions or specific region?
# All users or subset of users?

# Recent changes
# Check the last few deployments
kubectl rollout history deployment/api-service -n production | tail -5

# Check recent config changes
kubectl get configmap api-config -n production -o yaml | grep -A 2 "last-applied"

Record in incident doc:

Start Time: [TIMESTAMP]
Error Rate: [X%]
Affected Service: [SERVICE_NAME]
Affected Users: [ALL/SUBSET]
Recent Changes: [YES/NO - DETAILS]
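
To save a minute under pressure, most of those fields can be pre-filled with a small helper. A sketch (the service name, labels, and output file are illustrative; error rate and user scope still need a human):

#!/usr/bin/env bash
# Illustrative sketch - capture incident context into a local notes file
svc=api-service; ns=production
{
echo "Start Time: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
echo "Affected Service: $svc"
echo "Recent Rollouts:"
kubectl rollout history deployment/$svc -n $ns | tail -5
echo "Pod Status:"
kubectl get pods -n $ns -l app=$svc --no-headers | awk '{print $1, $3}'
} > incident-context-$(date +%Y%m%d-%H%M).txt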

Step 2: Immediate Mitigation (5 minutes)

Option A: Recent Deployment - Rollback

# If deployment in last 30 minutes, rollback immediately
kubectl rollout undo deployment/api-service -n production

# Monitor rollback progress
kubectl rollout status deployment/api-service -n production

# Watch error rate
watch -n 5 'curl -s https://api.company.com/metrics | grep error_rate'

Option B: Scale Up (If Traffic Related)

# Check current replica count
kubectl get deployment api-service -n production

# Scale up by 50%
current_replicas=$(kubectl get deployment api-service -n production -o jsonpath='{.spec.replicas}')
new_replicas=$((current_replicas * 3 / 2))
kubectl scale deployment api-service -n production --replicas=$new_replicas

# Enable HPA if not already
kubectl autoscale deployment api-service -n production --min=10 --max=50 --cpu-percent=70

Option C: Circuit Breaker (If Dependency Down)

# If error logs show dependency timeouts
# Enable circuit breaker via feature flag
curl -X POST https://feature-flags.company.com/api/flags/circuit-breaker-enable \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-d '{"enabled": true, "service": "downstream-api"}'

# Or update config map
kubectl patch configmap api-config -n production \
--type merge -p '{"data":{"CIRCUIT_BREAKER_ENABLED":"true"}}'

# Restart pods to pick up config
kubectl rollout restart deployment/api-service -n production

Step 3: Root Cause Investigation (15 minutes)

Check Logs:

# Recent errors
kubectl logs deployment/api-service -n production --tail=500 --since=10m | grep -i error

# Stack traces
kubectl logs deployment/api-service -n production --tail=1000 | grep -A 10 "Exception"

# All logs from failing pods
failing_pods=$(kubectl get pods -n production -l app=api-service --field-selector=status.phase!=Running -o name)
for pod in $failing_pods; do
echo "=== Logs from $pod ==="
kubectl logs $pod -n production --tail=100
done

Check Metrics:

# CPU usage
kubectl top pods -n production -l app=api-service

# Memory usage
kubectl top pods -n production -l app=api-service --sort-by=memory

# Request rate
curl -s "http://prometheus:9090/api/v1/query?query=rate(http_requests_total{service='api-service'}[5m])"

# Database connections
curl -s "http://prometheus:9090/api/v1/query?query=db_connections_active{service='api-service'}"

Check Dependencies:

# Test database connectivity
kubectl run -i --tty --rm debug --image=postgres:latest --restart=Never -- \
psql -h postgres.production.svc.cluster.local -U appuser -d appdb -c "SELECT 1;"

# Test Redis
kubectl run -i --tty --rm debug --image=redis:latest --restart=Never -- \
redis-cli -h redis.production.svc.cluster.local ping

# Test external API
curl -i -m 5 https://external-api.partner.com/health

Check Network:

# DNS resolution
nslookup api-service.production.svc.cluster.local

# Network policies
kubectl get networkpolicies -n production

# Service endpoints
kubectl get endpoints api-service -n production

Step 4: Resolution Actions

Common Root Causes & Fixes:

A. Database Connection Pool Exhaustion

# Increase pool size (if safe)
kubectl set env deployment/api-service -n production DB_POOL_SIZE=50

# Or restart pods to reset connections
kubectl rollout restart deployment/api-service -n production

B. Memory Leak / OOM

# Increase memory limits temporarily
kubectl set resources deployment api-service -n production \
--limits=memory=4Gi --requests=memory=2Gi

# Enable heap dump on OOM (Java)
kubectl set env deployment/api-service -n production \
JAVA_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heapdump.hprof"

# Restart rolling
kubectl rollout restart deployment/api-service -n production

C. External Dependency Failure

# Enable graceful degradation
# Update feature flag to bypass failing service
curl -X POST https://feature-flags.company.com/api/flags/use-fallback-service \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-d '{"enabled": true}'

# Or enable cached responses
kubectl set env deployment/api-service -n production ENABLE_CACHE_FALLBACK=true

D. Configuration Error

# Revert config change (ConfigMaps have no rollout history - re-apply the last known-good manifest from version control)
kubectl apply -f api-config.yaml   # illustrative path to the previous known-good manifest

# Restart to pick up old config
kubectl rollout restart deployment/api-service -n production

Verification Steps

# 1. Check error rate returned to normal (<0.1%)
curl -s https://api.company.com/metrics | grep error_rate

# 2. Verify all pods healthy
kubectl get pods -n production -l app=api-service | grep -c Running
expected_count=$(kubectl get deployment api-service -n production -o jsonpath='{.spec.replicas}')
echo "Expected: $expected_count"

# 3. Test end-to-end
curl -i -X POST https://api.company.com/v1/test \
-H "Content-Type: application/json" \
-d '{"test": "data"}'

# 4. Check dependent services
curl -i https://api.company.com/health/dependencies

# 5. Monitor for 15 minutes
watch -n 30 'date && curl -s https://api.company.com/metrics | grep -E "error_rate|latency_p99"'

Communication Templates

Initial Announcement (Slack/Status Page):

🚨 INCIDENT: API Service Experiencing High Error Rate

Status: Investigating
Impact: ~40% of API requests failing
Affected: api.company.com
Started: [TIMESTAMP]
Team: Investigating root cause
ETA: 15 minutes for initial mitigation

Updates: Will provide update in 10 minutes
War Room: #incident-2024-1127-001

Update:

📊 UPDATE: API Service Incident

Status: Mitigation Applied
Action: Rolled back deployment v2.3.5
Result: Error rate decreased from 40% to 2%
Next: Monitoring for stability, investigating root cause
ETA: Full resolution in 10 minutes

Resolution:

✅ RESOLVED: API Service Incident

Status: Resolved
Duration: 27 minutes (10:15 AM - 10:42 AM ET)
Root Cause: Database connection pool exhaustion from v2.3.5 config change
Resolution: Rolled back to v2.3.4
Impact: ~2,400 failed requests during incident window
Postmortem: Will be published within 48 hours

Thank you for your patience.

Escalation Criteria

Escalate to Engineering Manager if:

  • MTTR exceeds 30 minutes
  • Impact >50% of users
  • Data loss suspected
  • Security implications identified

Escalate to VP Engineering if:

  • MTTR exceeds 1 hour
  • Major customer impact
  • Media/PR implications
  • Regulatory reporting required

Contact:

Primary On-Call SRE: [Use PagerDuty]
Engineering Manager: [Slack: @eng-manager] [Phone: XXX-XXX-XXXX]
VP Engineering: [Slack: @vp-eng] [Phone: XXX-XXX-XXXX]
Security Team: security@company.com [Slack: #security-incidents]

Post-Incident Actions

Immediate (Same Day):

  • Update incident timeline in documentation
  • Notify all stakeholders of resolution
  • Begin postmortem document
  • Capture all logs, metrics, traces for analysis
  • Take database/system snapshots if relevant

Within 48 Hours:

  • Complete blameless postmortem
  • Identify action items with owners
  • Schedule postmortem review meeting
  • Update runbook with lessons learned

Within 1 Week:

  • Implement quick wins from action items
  • Add monitoring/alerting to prevent recurrence
  • Share learnings with broader team

2. High Latency / Performance Degradation

Metadata

  • Severity: SEV-2 (High)
  • MTTR Target: < 1 hour
  • On-Call Team: SRE, Backend Engineering
  • Last Updated: 2024-11-27
  • Owner: SRE Team

Symptoms

  • P95/P99 latency exceeding SLO
  • User complaints about slow responses
  • Timeouts in dependent services
  • Increased request queue depth
  • Slow database queries

Detection

Automated Alerts:

Alert: HighLatencyP99
Severity: Warning
Condition: p99_latency > 500ms for 5 minutes
SLO: p99 < 200ms
Dashboard: https://grafana.company.com/latency
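
Besides the p99 value itself, it helps to know what fraction of requests are breaching the 200ms SLO. A sketch, assuming a standard http_request_duration_seconds histogram with a 0.2s bucket (adjust metric and label names to your instrumentation):

# Fraction of requests slower than 200ms over the last 5 minutes (hypothetical metric names)
curl -sG "http://prometheus:9090/api/v1/query" \
--data-urlencode 'query=1 - (sum(rate(http_request_duration_seconds_bucket{service="api",le="0.2"}[5m])) / sum(rate(http_request_duration_seconds_count{service="api"}[5m])))'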

Triage Steps

Step 1: Quantify Impact (2 minutes)

# Check current latency
curl -s https://api.company.com/metrics | grep -E "latency_p50|latency_p95|latency_p99"

# Get latency percentiles from Prometheus
curl -s "http://prometheus:9090/api/v1/query?query=histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{service='api'}[5m]))"

# Affected endpoints
kubectl logs deployment/api-service -n production --tail=1000 | \
awk '{print $7}' | sort | uniq -c | sort -rn | head -10

Document:

Current P99: [XXX ms] (SLO: 200ms)
Current P95: [XXX ms]
Affected Endpoints: [LIST]
User Reports: [NUMBER]

Step 2: Identify Bottleneck (10 minutes)

Check Application Performance:

# CPU usage
kubectl top pods -n production -l app=api-service

# Check for CPU throttling
kubectl describe pods -n production -l app=api-service | grep -A 5 "cpu"

# Memory pressure
kubectl top pods -n production -l app=api-service --sort-by=memory

# Thread dumps (Java applications)
kubectl exec -it deployment/api-service -n production -- jstack 1 > thread-dump.txt

# Profile CPU (if profiling enabled); drop -t so the binary profile is not mangled by the TTY
kubectl exec deployment/api-service -n production -- \
curl -s http://localhost:6060/debug/pprof/profile?seconds=30 > cpu-profile.out

Check Database:

# Active queries
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 seconds'
ORDER BY duration DESC;
"

# Slow query log
kubectl exec -it postgres-0 -n production -- tail -100 /var/log/postgresql/slow-query.log

# Database connections
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT count(*) as connection_count, state
FROM pg_stat_activity
GROUP BY state;
"

# Lock waits
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT blocked_locks.pid AS blocked_pid,
blocking_locks.pid AS blocking_pid,
blocked_activity.query AS blocked_query,
blocking_activity.query AS blocking_query
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
"

Check Cache:

# Redis latency
kubectl exec -it redis-0 -n production -- redis-cli --latency-history

# Cache stats
kubectl exec -it redis-0 -n production -- redis-cli INFO stats | grep -E "hit|miss"

# Memory usage
kubectl exec -it redis-0 -n production -- redis-cli INFO memory | grep used_memory_human

# Slow log
kubectl exec -it redis-0 -n production -- redis-cli SLOWLOG GET 10

Check Network:

# Network latency to dependencies
for service in postgres redis external-api; do
echo "=== $service ==="
kubectl run ping-test --image=busybox --rm -it --restart=Never -- \
ping -c 5 $service.production.svc.cluster.local
done

# DNS lookup times
kubectl run dns-test --image=busybox --rm -it --restart=Never -- \
nslookup api.company.com

# External API latency
time curl -X GET https://external-api.partner.com/v1/data

Check Distributed Traces:

# Identify slow spans in Jaeger
# Navigate to Jaeger UI: https://jaeger.company.com
# Filter by:
# - Service: api-service
# - Min Duration: 500ms
# - Lookback: 1 hour

# Programmatic trace query
curl "http://jaeger-query:16686/api/traces?service=api-service&limit=20&lookback=1h&minDuration=500ms"

Step 3: Apply Mitigation

Scenario A: Database Slow Queries

# Kill long-running queries (if safe)
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '60 seconds'
AND state = 'active'
AND pid <> pg_backend_pid();
"

# Add missing index (if identified)
kubectl exec -it postgres-0 -n production -- psql -U postgres -d appdb -c "
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);
"

# Analyze tables (update statistics)
kubectl exec -it postgres-0 -n production -- psql -U postgres -d appdb -c "
ANALYZE VERBOSE users;
"

# Scale read replicas
kubectl scale statefulset postgres-replica -n production --replicas=5

Scenario B: Cache Miss Storm

# Pre-warm cache with common queries
kubectl exec -it deployment/api-service -n production -- \
curl -X POST http://localhost:8080/admin/cache/warmup

# Increase cache size
kubectl exec -it redis-0 -n production -- redis-cli CONFIG SET maxmemory 4gb

# Enable cache fallback to stale data
kubectl set env deployment/api-service -n production CACHE_SERVE_STALE=true

Scenario C: CPU/Memory Constrained

# Increase resources
kubectl set resources deployment api-service -n production \
--limits=cpu=2000m,memory=4Gi \
--requests=cpu=1000m,memory=2Gi

# Scale horizontally
kubectl scale deployment api-service -n production --replicas=20

# Enable HPA
kubectl autoscale deployment api-service -n production \
--min=10 --max=30 --cpu-percent=60

Scenario D: External API Slow

# Increase timeout and enable caching
kubectl set env deployment/api-service -n production \
EXTERNAL_API_TIMEOUT=10000 \
EXTERNAL_API_CACHE_ENABLED=true \
EXTERNAL_API_CACHE_TTL=300

# Enable circuit breaker
kubectl set env deployment/api-service -n production \
CIRCUIT_BREAKER_ENABLED=true \
CIRCUIT_BREAKER_THRESHOLD=50

# Use fallback/cached data
kubectl patch configmap api-config -n production \
--type merge -p '{"data":{"USE_FALLBACK_DATA":"true"}}'

Scenario E: Thread Pool Exhaustion

# Increase thread pool size
kubectl set env deployment/api-service -n production \
THREAD_POOL_SIZE=200 \
THREAD_QUEUE_SIZE=1000

# Restart to apply
kubectl rollout restart deployment/api-service -n production

Verification Steps

# 1. Monitor latency improvement
watch -n 10 'curl -s https://api.company.com/metrics | grep latency_p99'

# 2. Check trace samples
# View Jaeger for recent requests - should show improved latency

# 3. Database query times
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
"

# 4. Resource utilization normalized
kubectl top pods -n production -l app=api-service

# 5. Error rate stable (ensure fix didn't introduce errors)
curl -s https://api.company.com/metrics | grep error_rate

Root Cause Investigation

Common Causes:

  1. N+1 Query Problem

    • Check ORM query patterns
    • Enable query logging
    • Add eager loading
  2. Missing Database Index (see the example after this list)

    • Analyze slow query log
    • Use EXPLAIN ANALYZE
    • Create appropriate indexes
  3. Memory Garbage Collection

    • Check GC logs (Java/JVM)
    • Tune GC parameters
    • Increase heap size
  4. Inefficient Algorithm

    • Profile code execution
    • Identify hot paths
    • Optimize algorithms
  5. External Service Degradation

    • Check dependency SLOs
    • Implement caching
    • Add circuit breakers
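
For cause 2, EXPLAIN ANALYZE usually confirms a missing index in seconds. A sketch against the users table used elsewhere in this runbook (table, column, and value are illustrative):

# A "Seq Scan on users" with high actual time on a large table usually means a missing index
kubectl exec -it postgres-0 -n production -- psql -U postgres -d appdb -c "
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'user@example.com';
"
# After CREATE INDEX, the same query should show an Index Scan instead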

Communication Templates

Initial Alert:

⚠️  INCIDENT: API Latency Degradation

Status: Investigating
Impact: P99 latency at 800ms (SLO: 200ms)
Affected: All API endpoints
User Impact: Slow response times
Team: Investigating root cause
Updates: Every 15 minutes in #incident-channel

Resolution:

✅ RESOLVED: API Latency Degradation

Duration: 45 minutes
Root Cause: Missing database index on users.email causing table scans
Resolution: Added index, latency returned to normal
Current P99: 180ms (within SLO)
Postmortem: Will be published within 48 hours

Post-Incident Actions

  • Add database query monitoring
  • Implement automated index recommendations
  • Load test with realistic data volumes
  • Add latency SLO alerts per endpoint
  • Review and optimize slow queries
  • Implement APM (Application Performance Monitoring)

3. Database Connection Pool Exhaustion

Metadata

  • Severity: SEV-1 (Critical)
  • MTTR Target: < 15 minutes
  • On-Call Team: SRE, Database Team
  • Last Updated: 2024-11-27

Symptoms

  • "Connection pool exhausted" errors in application logs
  • Requests timing out
  • Database showing many idle connections
  • Application unable to acquire new connections
  • Connection pool at 100% utilization

Detection

# Check connection pool metrics
curl -s https://api.company.com/metrics | grep db_pool

# Expected output:
# db_pool_active 50
# db_pool_idle 0
# db_pool_total 50
# db_pool_wait_count 1500 <-- High wait count indicates problem
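
To turn those numbers into a single figure during triage, a one-liner like the following works (a sketch, assuming the metric names shown in the expected output above):

# Print pool utilization as a percentage (hypothetical metric names, matching the output above)
curl -s https://api.company.com/metrics | \
awk '/^db_pool_active/ {a=$2} /^db_pool_total/ {t=$2} END {if (t>0) printf "pool utilization: %.0f%%\n", 100*a/t}'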

Triage Steps

Step 1: Confirm Pool Exhaustion (1 minute)

# Application side
kubectl logs deployment/api-service -n production --tail=100 | \
grep -i "connection pool\|unable to acquire\|timeout"

# Database side - check connection count
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT count(*) as total_connections,
state,
application_name
FROM pg_stat_activity
WHERE datname = 'appdb'
GROUP BY state, application_name
ORDER BY total_connections DESC;
"

# Check pool configuration
kubectl get configmap api-config -n production -o yaml | grep -i pool

Step 2: Immediate Mitigation (5 minutes)

Option A: Restart Application Pods (Fastest)

# Rolling restart to reset connections
kubectl rollout restart deployment/api-service -n production

# Monitor restart
kubectl rollout status deployment/api-service -n production

# Verify connections released
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT count(*) FROM pg_stat_activity WHERE datname = 'appdb';
"

Option B: Increase Pool Size (If Infrastructure Allows)

# Check database connection limit
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SHOW max_connections;
"

# Calculate safe pool size
# max_connections / number_of_app_instances = pool_size_per_instance
# Example: 200 max / 10 instances = 20 per instance (current might be 50)
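
# Quick calculation sketch from live values (illustrative; assumes the names used above)
max_conns=$(kubectl exec postgres-0 -n production -- psql -U postgres -tAc "SHOW max_connections;")
replicas=$(kubectl get deployment api-service -n production -o jsonpath='{.spec.replicas}')
echo "Safe DB_POOL_SIZE per instance: $((max_conns / replicas))"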

# Increase pool size temporarily
kubectl set env deployment/api-service -n production \
DB_POOL_SIZE=30 \
DB_POOL_MAX_IDLE=10 \
DB_CONNECTION_TIMEOUT=30000

# Monitor
watch -n 5 'kubectl logs deployment/api-service -n production --tail=50 | grep -i pool'

Option C: Kill Idle Connections (If Many Idle)

# Identify idle connections
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT pid, state, query_start, state_change, query
FROM pg_stat_activity
WHERE datname = 'appdb'
AND state = 'idle'
AND state_change < now() - interval '5 minutes';
"

# Kill long-idle connections
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'appdb'
AND state = 'idle'
AND state_change < now() - interval '10 minutes'
AND pid <> pg_backend_pid();
"

Step 3: Root Cause Analysis (10 minutes)

Check for Connection Leaks:

# Application logs - look for unclosed connections
kubectl logs deployment/api-service -n production --tail=5000 | \
grep -i "connection not closed\|resource leak"

# Check connection lifecycle
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT application_name,
state,
count(*) as conn_count,
max(now() - state_change) as max_idle_time
FROM pg_stat_activity
WHERE datname = 'appdb'
GROUP BY application_name, state
ORDER BY conn_count DESC;
"

# Long-running transactions
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT pid, now() - xact_start AS duration, state, query
FROM pg_stat_activity
WHERE datname = 'appdb'
AND xact_start IS NOT NULL
ORDER BY duration DESC
LIMIT 20;
"

Check Recent Changes:

# Recent deployments
kubectl rollout history deployment/api-service -n production | tail -5

# Config changes
kubectl get configmap api-config -n production -o yaml | \
grep -A 2 "last-applied-configuration"

# Recent code changes affecting database access
git log --since="24 hours ago" --grep="database\|pool\|connection" --oneline

Check for Traffic Spike:

# Request rate
curl "http://prometheus:9090/api/v1/query?query=rate(http_requests_total{service='api-service'}[5m])"

# Compare to baseline
curl "http://prometheus:9090/api/v1/query?query=rate(http_requests_total{service='api-service'}[5m] offset 1h)"

Resolution Actions

Permanent Fix Options:

A. Fix Connection Leak in Code

# Bad - connection leak
def get_user(user_id):
    conn = db_pool.getconn()
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
    result = cursor.fetchone()
    return result  # Connection never returned!

# Good - always return the connection to the pool
def get_user(user_id):
    conn = db_pool.getconn()
    try:
        with conn.cursor() as cursor:
            cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
            return cursor.fetchone()
    finally:
        db_pool.putconn(conn)  # Returned to the pool even if the query raises

B. Optimize Pool Configuration

# Configure based on actual usage patterns
kubectl set env deployment/api-service -n production \
DB_POOL_SIZE=20 \
DB_POOL_MIN_IDLE=5 \
DB_POOL_MAX_IDLE=10 \
DB_POOL_IDLE_TIMEOUT=300000 \
DB_POOL_CONNECTION_TIMEOUT=30000 \
DB_POOL_VALIDATION_TIMEOUT=5000

# Enable connection validation
kubectl set env deployment/api-service -n production \
DB_POOL_TEST_ON_BORROW=true \
DB_POOL_TEST_WHILE_IDLE=true

C. Implement Connection Pooler (PgBouncer)

# Deploy PgBouncer
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pgbouncer
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: pgbouncer
  template:
    metadata:
      labels:
        app: pgbouncer
    spec:
      containers:
      - name: pgbouncer
        image: pgbouncer/pgbouncer:latest
        env:
        - name: DATABASES_HOST
          value: postgres.production.svc.cluster.local
        - name: POOL_MODE
          value: transaction
        - name: MAX_CLIENT_CONN
          value: "1000"
        - name: DEFAULT_POOL_SIZE
          value: "25"
---
apiVersion: v1
kind: Service
metadata:
  name: pgbouncer
  namespace: production
spec:
  selector:
    app: pgbouncer
  ports:
  - port: 5432
    targetPort: 5432
EOF

# Update application to use PgBouncer
kubectl set env deployment/api-service -n production \
DB_HOST=pgbouncer.production.svc.cluster.local

D. Scale Database Connections

# Increase PostgreSQL max_connections (written to postgresql.auto.conf)
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
ALTER SYSTEM SET max_connections = 300;
"

# Note: max_connections only takes effect after a full server restart - pg_reload_conf() is not enough
# If your setup tolerates a brief primary restart:
kubectl delete pod postgres-0 -n production

Monitoring & Alerting

Add Proactive Monitoring:

# Prometheus alert rules
groups:
- name: database_pool
  rules:
  - alert: ConnectionPoolHighUtilization
    expr: db_pool_active / db_pool_total > 0.7
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Connection pool utilization >70%"
      description: "Pool at {{ $value | humanizePercentage }} capacity"

  - alert: ConnectionPoolExhausted
    expr: db_pool_active / db_pool_total > 0.9
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Connection pool nearly exhausted"

  - alert: ConnectionPoolWaitTime
    expr: rate(db_pool_wait_count[5m]) > 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High connection pool wait count"

Dashboard Metrics:

- db_pool_total (configured pool size)
- db_pool_active (connections in use)
- db_pool_idle (connections available)
- db_pool_wait_count (requests waiting for connection)
- db_pool_wait_time_ms (time waiting for connection)
- db_connection_lifetime_seconds (connection age histogram)

Verification Steps

# 1. Pool utilization back to normal
kubectl exec -it deployment/api-service -n production -- \
curl http://localhost:8080/metrics | grep db_pool_active

# Should see: db_pool_active < 70% of db_pool_total

# 2. No wait queue
kubectl exec -it deployment/api-service -n production -- \
curl http://localhost:8080/metrics | grep db_pool_wait_count

# Should see: db_pool_wait_count = 0 or minimal

# 3. Database connection count stable
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT count(*) FROM pg_stat_activity WHERE datname = 'appdb';
"

# 4. No errors in logs
kubectl logs deployment/api-service -n production --tail=100 | \
grep -i "connection" | grep -i error

# 5. Response times normal
curl -s https://api.company.com/metrics | grep latency_p99

Prevention

Code Review Checklist:

  • All database connections properly closed
  • Using connection pool best practices
  • Proper error handling to ensure connection release
  • No connection usage outside transactions
  • Connection timeout configured

Testing:

  • Load test with connection pool monitoring
  • Chaos engineering: Test with limited connections
  • Connection leak detection in CI/CD

Architecture:

  • Consider connection pooler (PgBouncer)
  • Implement read replicas to distribute load
  • Use caching to reduce database queries

4. Memory Leak / OOM Kills

Metadata

  • Severity: SEV-2 (High)
  • MTTR Target: < 1 hour
  • On-Call Team: SRE, Backend Engineering
  • Last Updated: 2024-11-27

Symptoms

  • Pods being OOMKilled (Out of Memory)
  • Memory usage continuously increasing
  • Slow performance and increased GC pressure
  • Pod restarts without clear error
  • "Cannot allocate memory" errors

Detection

# Check for OOMKilled pods
kubectl get pods -n production -l app=api-service | grep OOMKilled

# Check pod events
kubectl get events -n production --field-selector involvedObject.name=<pod-name> | \
grep -i oom

# Memory usage trend
kubectl top pods -n production -l app=api-service --sort-by=memory

Triage Steps

Step 1: Confirm OOM Issue (2 minutes)

# Check pod status and restart reason
kubectl describe pod <pod-name> -n production | grep -A 10 "Last State"

# Should see output like:
# Last State: Terminated
# Reason: OOMKilled
# Exit Code: 137

# Check memory limits
kubectl get pod <pod-name> -n production -o jsonpath='{.spec.containers[*].resources}'

# Monitor memory usage
watch -n 5 'kubectl top pods -n production -l app=api-service'

Step 2: Immediate Mitigation (10 minutes)

Option A: Increase Memory Limits (Quick Fix)

# Current limits
kubectl get deployment api-service -n production -o jsonpath='{.spec.template.spec.containers[0].resources}'

# Increase memory limit temporarily (2x current)
kubectl set resources deployment api-service -n production \
--limits=memory=4Gi \
--requests=memory=2Gi

# Monitor rollout
kubectl rollout status deployment/api-service -n production

# Watch memory usage
watch -n 10 'kubectl top pods -n production -l app=api-service'

Option B: Scale Out (If Memory Leak is Gradual)

# Add more pods to distribute load
kubectl scale deployment api-service -n production --replicas=15

# Enable HPA with lower memory target
kubectl autoscale deployment api-service -n production \
--min=10 --max=30 --cpu-percent=70

# Note: This is temporary - still need to fix leak

Option C: Implement Pod Lifecycle (Workaround)

# Restart pods proactively before they OOM
# Add to deployment spec:
kubectl patch deployment api-service -n production --type=json -p='[
  {
    "op": "add",
    "path": "/spec/template/spec/containers/0/lifecycle",
    "value": {
      "preStop": {
        "exec": {
          "command": ["/bin/sh", "-c", "sleep 15"]
        }
      }
    }
  }
]'

# Add TTL to pod lifecycle (requires external controller)
# Or implement rolling restart every 12 hours

Step 3: Capture Diagnostics (15 minutes)

Capture Heap Dump (Java/JVM):

# Before pod is killed, capture heap dump
kubectl exec -it <pod-name> -n production -- \
jcmd 1 GC.heap_dump /tmp/heapdump.hprof

# Copy heap dump locally
kubectl cp production/<pod-name>:/tmp/heapdump.hprof ./heapdump-$(date +%Y%m%d-%H%M%S).hprof

# Analyze with MAT or jhat
# Upload to analysis tools or analyze locally

Capture Memory Profile (Go Applications):

# If profiling endpoint enabled, write a heap profile inside the pod
kubectl exec -it <pod-name> -n production -- \
curl -s -o /tmp/heap-profile.prof http://localhost:6060/debug/pprof/heap

# Copy locally
kubectl cp production/<pod-name>:/tmp/heap-profile.prof ./heap-profile.prof

# Analyze
go tool pprof -http=:8080 heap-profile.prof

Capture Memory Metrics (Python):

# Install memory_profiler if not already
kubectl exec -it <pod-name> -n production -- pip install memory_profiler

# Profile specific function
kubectl exec -it <pod-name> -n production -- \
python -m memory_profiler app.py

# Or use tracemalloc - note this must run inside the application process itself
# (wire it into an admin/debug endpoint); the snippet below only demonstrates the API
kubectl exec -i <pod-name> -n production -- python3 <<'EOF'
import tracemalloc

tracemalloc.start()
# ... exercise the code path under suspicion here ...
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')
for stat in top_stats[:10]:
    print(stat)
EOF

Check for Common Memory Issues:

# Large objects in memory
kubectl exec -it <pod-name> -n production -- \
curl http://localhost:8080/debug/vars | jq '.memstats'

# Check goroutine leaks (Go)
kubectl exec -it <pod-name> -n production -- \
curl http://localhost:6060/debug/pprof/goroutine?debug=1

# Check thread count (Java)
kubectl exec -it <pod-name> -n production -- \
jcmd 1 Thread.print | grep "Thread" | wc -l

# File descriptor leaks
kubectl exec -it <pod-name> -n production -- \
ls -la /proc/1/fd | wc -l

Step 4: Identify Root Cause

Common Memory Leak Causes:

A. Caching Without Eviction:

# Bad - unbounded cache
cache = {}

def get_user(user_id):
    if user_id not in cache:
        cache[user_id] = fetch_from_db(user_id)  # Cache grows forever!
    return cache[user_id]

# Good - bounded cache with LRU
from functools import lru_cache

@lru_cache(maxsize=1000)
def get_user(user_id):
    return fetch_from_db(user_id)

B. Event Listeners Not Removed:

// Bad - event listener leak
class Component {
  constructor() {
    this.data = new Array(1000000);
    window.addEventListener("resize", () => this.handleResize());
  }
  // Missing cleanup!
}

// Good - cleanup listeners
class Component {
  constructor() {
    this.data = new Array(1000000);
    this.handleResize = this.handleResize.bind(this);
    window.addEventListener("resize", this.handleResize);
  }

  destroy() {
    window.removeEventListener("resize", this.handleResize);
    this.data = null;
  }
}

C. Goroutine Leaks (Go):

// Bad - goroutine leak
func processRequests() {
    for request := range requests {
        go handleRequest(request) // Goroutines never cleaned up
    }
}

// Good - bounded goroutines
func processRequests() {
    sem := make(chan struct{}, 100) // Max 100 concurrent
    for request := range requests {
        sem <- struct{}{}
        go func(req Request) {
            defer func() { <-sem }()
            handleRequest(req)
        }(request)
    }
}

D. Database Result Sets Not Closed:

// Bad - result set leak
public List<User> getUsers() {
    Statement stmt = conn.createStatement();
    ResultSet rs = stmt.executeQuery("SELECT * FROM users");
    List<User> users = new ArrayList<>();
    while (rs.next()) {
        users.add(new User(rs));
    }
    return users; // ResultSet and Statement never closed!
}

// Good - use try-with-resources
public List<User> getUsers() {
    List<User> users = new ArrayList<>();
    try (Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT * FROM users")) {
        while (rs.next()) {
            users.add(new User(rs));
        }
    }
    return users;
}

Resolution Actions

Short-term:

# 1. Increase memory limits (already done)

# 2. Enable memory monitoring
kubectl patch deployment api-service -n production --type=json -p='[
  {
    "op": "add",
    "path": "/spec/template/metadata/annotations/prometheus.io~1scrape",
    "value": "true"
  }
]'

# 3. Add liveness probe with memory check
kubectl patch deployment api-service -n production --type=json -p='[
  {
    "op": "add",
    "path": "/spec/template/spec/containers/0/livenessProbe",
    "value": {
      "httpGet": {
        "path": "/health",
        "port": 8080
      },
      "initialDelaySeconds": 30,
      "periodSeconds": 10
    }
  }
]'

# 4. Implement automatic restart before OOM
# Create CronJob to restart pods every 12 hours (temporary)
kubectl create -f - <<EOF
apiVersion: batch/v1
kind: CronJob
metadata:
  name: api-service-restart
  namespace: production
spec:
  schedule: "0 */12 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: pod-restarter
          containers:
          - name: kubectl
            image: bitnami/kubectl:latest
            command:
            - /bin/sh
            - -c
            - kubectl rollout restart deployment/api-service -n production
          restartPolicy: OnFailure
EOF

Long-term Fix:

# 1. Fix code leak (deploy patch)
git checkout -b fix/memory-leak
# ... implement fix ...
git commit -m "Fix: Remove unbounded cache causing memory leak"
git push origin fix/memory-leak

# 2. Add memory profiling in production
kubectl set env deployment/api-service -n production \
ENABLE_PROFILING=true \
PROFILING_PORT=6060

# 3. Implement memory limits in code
# For Java:
kubectl set env deployment/api-service -n production \
JAVA_OPTS="-Xmx2g -XX:MaxMetaspaceSize=256m -XX:+UseG1GC"

# 4. Add memory monitoring dashboard

# 5. Implement alerts for memory growth

Monitoring & Prevention

Add Alerts:

groups:
- name: memory_alerts
  rules:
  - alert: MemoryUsageHigh
    expr: container_memory_usage_bytes{pod=~"api-service.*"} / container_spec_memory_limit_bytes{pod=~"api-service.*"} > 0.8
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Pod memory usage >80%"

  - alert: MemoryUsageGrowing
    expr: predict_linear(container_memory_usage_bytes{pod=~"api-service.*"}[1h], 3600) > container_spec_memory_limit_bytes{pod=~"api-service.*"}
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Memory usage trending towards OOM"

  - alert: OOMKillsDetected
    expr: increase(kube_pod_container_status_restarts_total{pod=~"api-service.*"}[15m]) > 3
    labels:
      severity: critical
    annotations:
      summary: "Multiple pod restarts detected (possible OOM)"

Grafana Dashboard:

- Memory Usage (%)
- Memory Usage (bytes) over time
- Predicted time to OOM (see the query sketch after this list)
- GC frequency and duration
- Heap size vs used heap
- Number of objects in memory
- Pod restart count
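
The "Predicted time to OOM" panel can be approximated ad hoc from Prometheus. A sketch using the same container metrics as the alert rules above (only meaningful while memory is actually growing):

# Estimate seconds until the memory limit is hit, per pod (hypothetical query; negative or huge values mean no growth)
curl -sG "http://prometheus:9090/api/v1/query" \
--data-urlencode 'query=(container_spec_memory_limit_bytes{pod=~"api-service.*"} - container_memory_usage_bytes{pod=~"api-service.*"}) / deriv(container_memory_usage_bytes{pod=~"api-service.*"}[30m])'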

Verification Steps

# 1. Memory usage stable
watch -n 30 'kubectl top pods -n production -l app=api-service | tail -5'

# 2. No OOM kills in the last hour
kubectl get events -A --field-selector reason=OOMKilling --sort-by=.lastTimestamp | tail -20

# 3. Pod uptime increasing (not restarting)
kubectl get pods -n production -l app=api-service -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.startTime}{"\n"}{end}'

# 4. Memory growth linear/flat (not exponential)
# Check Grafana memory usage graph

# 5. Application metrics healthy
curl -s https://api.company.com/metrics | grep -E "heap|gc|memory"

Post-Incident Actions

  • Analyze heap dump to identify leak source
  • Review code for common leak patterns
  • Add memory profiling to CI/CD
  • Implement memory budgets in code
  • Add integration tests for memory leaks
  • Document memory configuration guidelines
  • Train team on memory leak prevention

5. Disk Space Exhaustion

Metadata

  • Severity: SEV-2 (High), escalates to SEV-1 if database affected
  • MTTR Target: < 30 minutes
  • On-Call Team: SRE, Infrastructure
  • Last Updated: 2024-11-27

Symptoms

  • "No space left on device" errors
  • Applications unable to write logs
  • Database unable to write data
  • Pod evictions due to disk pressure
  • Slow I/O performance

Detection

# Check disk usage on nodes
kubectl get nodes -o custom-columns=NAME:.metadata.name,DISK:.status.allocatable.ephemeral-storage

# Check specific node
kubectl describe node <node-name> | grep -A 5 "Allocated resources"

# SSH to node and check
ssh <node> df -h

# Check for disk pressure
kubectl get nodes -o json | jq '.items[] | select(.status.conditions[] | select(.type=="DiskPressure" and .status=="True")) | .metadata.name'

Triage Steps

Step 1: Identify Affected Systems (2 minutes)

# Which nodes are affected?
for node in $(kubectl get nodes -o name); do
echo "=== $node ==="
kubectl describe $node | grep -E "DiskPressure|ephemeral-storage"
done

# Which pods are on affected nodes?
kubectl get pods -n production -o wide | grep <affected-node>

# Critical services affected?
kubectl get pods -n production -l tier=critical -o wide

Step 2: Immediate Mitigation (10 minutes)

Option A: Clean Up Logs

# SSH to affected node
ssh <node-name>

# Find large log files
sudo find /var/log -type f -size +100M -exec ls -lh {} \; | awk '{ print $9 ": " $5 }'

# Rotate/truncate large logs
sudo truncate -s 0 /var/log/containers/*.log
sudo journalctl --vacuum-size=500M

# Clean Docker logs (if applicable)
sudo sh -c "truncate -s 0 /var/lib/docker/containers/*/*-json.log"

# Kubernetes log cleanup
sudo find /var/log/pods -name "*.log" -mtime +7 -delete

Option B: Remove Unused Docker Images

# On affected node
ssh <node-name>

# List images sorted by size
sudo docker images --format "{{.Repository}}:{{.Tag}}\t{{.Size}}" | sort -k2 -h

# Remove unused images
sudo docker image prune -a --filter "until=72h" -f

# Remove dangling volumes
sudo docker volume prune -f

Option C: Clean Up Pod Ephemeral Storage

# Find pods using most disk
kubectl get pods -n production -o json | \
jq -r '.items[] | "\(.metadata.name) \(.spec.nodeName)"' | \
while read pod node; do
if [ "$node" = "<affected-node>" ]; then
echo "=== $pod ==="
kubectl exec $pod -n production -- du -sh /tmp /var/tmp 2>/dev/null || true
fi
done

# Clean up specific pod
kubectl exec <pod-name> -n production -- sh -c "rm -rf /tmp/*"

Option D: Cordon and Drain Node

# Prevent new pods from scheduling
kubectl cordon <node-name>

# Drain pods to other nodes
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --grace-period=60

# Clean up on node
ssh <node-name>
sudo docker system prune -a -f --volumes
sudo rm -rf /var/log/pods/*
sudo rm -rf /var/lib/kubelet/pods/*

# Uncordon when ready
kubectl uncordon <node-name>

Option E: Emergency Database Cleanup (If DB Affected)

# Connect to database pod
kubectl exec -it postgres-0 -n production -- psql -U postgres

# Check database sizes
SELECT pg_database.datname,
pg_size_pretty(pg_database_size(pg_database.datname)) AS size
FROM pg_database
ORDER BY pg_database_size(pg_database.datname) DESC;

# Check table sizes
SELECT schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 20;

# Archive old data (if safe)
# Example: Archive logs older than 90 days
# Note: COPY ... TO PROGRAM requires superuser, and $(date) is not expanded inside psql - use a literal filename
BEGIN;
COPY (SELECT * FROM audit_logs WHERE created_at < NOW() - INTERVAL '90 days')
TO PROGRAM 'gzip > /tmp/audit_logs_archive.csv.gz' WITH CSV HEADER;
DELETE FROM audit_logs WHERE created_at < NOW() - INTERVAL '90 days';
COMMIT;

# Vacuum to reclaim space (takes an exclusive table lock; run during a quiet window)
VACUUM FULL audit_logs;

Step 3: Root Cause Analysis

Find What's Consuming Space:

# On affected node
ssh <node-name>

# Find largest directories
sudo du -h --max-depth=3 / 2>/dev/null | sort -hr | head -20

# Find largest files
sudo find / -type f -size +500M -exec ls -lh {} \; 2>/dev/null | awk '{ print $9 ": " $5 }'

# Check specific directories
sudo du -sh /var/lib/docker/*
sudo du -sh /var/lib/kubelet/*
sudo du -sh /var/log/*

# Find recently created large files
sudo find / -type f -size +100M -mtime -1 -exec ls -lh {} \; 2>/dev/null

Common Culprits:

  1. Excessive Logging
# Check log volume
kubectl exec <pod-name> -n production -- du -sh /var/log

# Check logging rate
kubectl logs <pod-name> -n production --tail=100 --timestamps | \
awk '{print $1}' | sort | uniq -c
  2. Temp File Accumulation
# Check temp directories
kubectl exec <pod-name> -n production -- du -sh /tmp /var/tmp

# Find old temp files
kubectl exec <pod-name> -n production -- find /tmp -type f -mtime +7 -ls
  3. Database Growth
# PostgreSQL WAL files
kubectl exec postgres-0 -n production -- \
du -sh /var/lib/postgresql/data/pg_wal/

# MySQL binary logs
kubectl exec mysql-0 -n production -- \
du -sh /var/lib/mysql/binlog/
  4. Image/Container Buildup
# Unused containers
sudo docker ps -a --filter "status=exited" --filter "status=dead"

# Image layer cache
sudo du -sh /var/lib/docker/overlay2/

Resolution Actions

Short-term:

# 1. Implement log rotation
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: filebeat-config
  namespace: production
data:
  filebeat.yml: |
    filebeat.inputs:
    - type: container
      paths:
        - /var/log/containers/*.log
      processors:
        - add_kubernetes_metadata:
            host: \${NODE_NAME}
            matchers:
            - logs_path:
                logs_path: "/var/log/containers/"
    output.logstash:
      hosts: ["logstash:5044"]
    # Local log cleanup
    queue.mem:
      events: 4096
      flush.min_events: 512
      flush.timeout: 5s
EOF

# 2. Set up log shipping
kubectl apply -f https://raw.githubusercontent.com/elastic/beats/master/deploy/kubernetes/filebeat-kubernetes.yaml

# 3. Configure log rotation on nodes
# Add to node configuration or DaemonSet
cat <<'EOF' | sudo tee /etc/logrotate.d/containers
/var/log/containers/*.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    create 0644 root root
    postrotate
        /usr/bin/docker ps -a --format '{{.Names}}' | xargs -I {} docker kill -s HUP {} 2>/dev/null || true
    endscript
}
EOF

Long-term:

# 1. Set ephemeral-storage limits on pods
kubectl patch deployment api-service -n production --type=json -p='[
  {
    "op": "add",
    "path": "/spec/template/spec/containers/0/resources/limits/ephemeral-storage",
    "value": "2Gi"
  },
  {
    "op": "add",
    "path": "/spec/template/spec/containers/0/resources/requests/ephemeral-storage",
    "value": "1Gi"
  }
]'

# 2. Enable disk usage monitoring
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: node-exporter-config
  namespace: monitoring
data:
  entrypoint.sh: |
    #!/bin/sh
    exec /bin/node_exporter \
      --collector.filesystem \
      --collector.filesystem.mount-points-exclude="^/(dev|proc|sys|var/lib/docker/.+)(\$|/)" \
      --web.listen-address=:9100
EOF

# 3. Set up automated cleanup CronJob
kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: CronJob
metadata:
  name: disk-cleanup
  namespace: kube-system
spec:
  schedule: "0 2 * * *" # 2 AM daily
  jobTemplate:
    spec:
      template:
        spec:
          hostPID: true
          hostNetwork: true
          containers:
          - name: cleanup
            image: alpine:latest
            command:
            - /bin/sh
            - -c
            - |
              # Clean old logs
              find /host/var/log/pods -name "*.log" -mtime +7 -delete
              # Clean old containers
              nsenter --mount=/proc/1/ns/mnt -- docker system prune -af --filter "until=72h"
            securityContext:
              privileged: true
            volumeMounts:
            - name: host
              mountPath: /host
          volumes:
          - name: host
            hostPath:
              path: /
          restartPolicy: OnFailure
EOF

# 4. Implement disk alerts
kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: disk-space-alerts
  namespace: monitoring
spec:
  groups:
  - name: disk
    rules:
    - alert: NodeDiskSpaceHigh
      expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.2
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Node disk space <20%"

    - alert: NodeDiskSpaceCritical
      expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.1
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "Node disk space <10%"
EOF

Monitoring & Prevention

Metrics to Track:

- node_filesystem_avail_bytes (available disk space)
- node_filesystem_size_bytes (total disk space)
- container_fs_usage_bytes (container filesystem usage)
- kubelet_volume_stats_used_bytes (PV usage)
- log_file_size (application log sizes)

Dashboards:

- Node disk usage per mount point (see the query sketch after this list)
- Pod ephemeral storage usage
- PV usage trends
- Log growth rate
- Image/container count over time
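
For a quick ad-hoc view of the same data, the node_exporter metrics listed above can be queried directly. A sketch, assuming the standard metric names:

# Percentage of disk space still free on the root filesystem, per node
curl -sG "http://prometheus:9090/api/v1/query" \
--data-urlencode 'query=100 * node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}'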

Verification Steps

# 1. Disk space recovered
ssh <node-name> df -h /

# 2. No disk pressure
kubectl describe node <node-name> | grep DiskPressure

# 3. Pods stable
kubectl get pods -n production -o wide | grep <node-name>

# 4. Services healthy
curl -i https://api.company.com/health

# 5. No pod evictions
kubectl get events --field-selector reason=Evicted -n production

Post-Incident Actions

  • Analyze what caused disk fill
  • Implement proper log management strategy
  • Set ephemeral-storage limits on all pods
  • Configure automated cleanup
  • Add capacity planning for storage
  • Review and optimize logging verbosity
  • Document disk space requirements

6. Certificate Expiration

Metadata

  • Severity: SEV-1 (Critical) if expired, SEV-3 (Low) if approaching
  • MTTR Target: < 15 minutes for renewal, 0 minutes for prevention
  • On-Call Team: SRE, Security
  • Last Updated: 2024-11-27

Symptoms

  • "SSL certificate has expired" errors
  • Browsers showing security warnings
  • API clients unable to connect
  • Services failing TLS handshake
  • Certificate validation errors in logs

Detection

# Check certificate expiration
echo | openssl s_client -servername api.company.com -connect api.company.com:443 2>/dev/null | \
openssl x509 -noout -dates

# Check all Kubernetes TLS secrets
kubectl get secrets -A -o json | \
jq -r '.items[] | select(.type=="kubernetes.io/tls") | "\(.metadata.namespace)/\(.metadata.name)"' | \
while read secret; do
namespace=$(echo $secret | cut -d/ -f1)
name=$(echo $secret | cut -d/ -f2)
expiry=$(kubectl get secret $name -n $namespace -o jsonpath='{.data.tls\.crt}' | \
base64 -d | openssl x509 -noout -enddate 2>/dev/null | cut -d= -f2)
echo "$secret: $expiry"
done

# Check certificate in