SRE Runbook
Table of Contents
- Service Outage / High Error Rate
- High Latency / Performance Degradation
- Database Connection Pool Exhaustion
- Memory Leak / OOM Kills
- Disk Space Exhaustion
- Certificate Expiration
- DDoS Attack / Traffic Surge
- Kubernetes Pod CrashLoopBackOff
- Message Queue Backup / Consumer Lag
- Database Replication Lag
- Cache Invalidation / Cache Storm
- Failed Deployment / Rollback
- Security Incident / Breach Detection
- Data Corruption
- DNS Resolution Failures
1. Service Outage / High Error Rate
Metadata
- Severity: SEV-1 (Critical)
- MTTR Target: < 30 minutes
- On-Call Team: SRE, Platform Engineering
- Escalation Path: SRE → Engineering Manager → VP Engineering
- Last Updated: 2024-11-27
- Owner: SRE Team
Symptoms
- High 5xx error rate (>1% of requests)
- Service returning errors instead of successful responses
- Health check endpoints failing
- Customer reports of service unavailability
- Spike in error monitoring alerts
Detection
Automated Alerts:
Alert: ServiceHighErrorRate
Severity: Critical
Condition: error_rate > 1% for 2 minutes
Dashboard: https://grafana.company.com/service-health
Manual Checks:
# Check service health
curl -i https://api.company.com/health
# Check error rate in last 5 minutes
kubectl logs -l app=api-service --tail=1000 --since=5m | grep ERROR | wc -l
# Check pod status
kubectl get pods -n production -l app=api-service
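The log grep above gives only a raw error count; for an actual error ratio, the Prometheus API used elsewhere in this runbook can be queried directly (a sketch - the http_requests_total metric name and its labels are assumptions):
curl -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{service='api-service',status=~'5..'}[5m])) / sum(rate(http_requests_total{service='api-service'}[5m]))"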
Triage Steps
Step 1: Establish Incident Context (2 minutes)
# Check current time and impact window
date
# Check error rate trend
# View Grafana dashboard - is error rate increasing or stable?
# Identify scope
# All services or specific service?
# All regions or specific region?
# All users or subset of users?
# Recent changes
# Check the last few deployments
kubectl rollout history deployment/api-service -n production | tail -5
# Check recent config changes
kubectl get configmap api-config -n production -o yaml | grep -A 2 "last-applied"
Record in incident doc:
Start Time: [TIMESTAMP]
Error Rate: [X%]
Affected Service: [SERVICE_NAME]
Affected Users: [ALL/SUBSET]
Recent Changes: [YES/NO - DETAILS]
Step 2: Immediate Mitigation (5 minutes)
Option A: Recent Deployment - Rollback
# If deployment in last 30 minutes, rollback immediately
kubectl rollout undo deployment/api-service -n production
# Monitor rollback progress
kubectl rollout status deployment/api-service -n production
# Watch error rate
watch -n 5 'curl -s https://api.company.com/metrics | grep error_rate'
Option B: Scale Up (If Traffic Related)
# Check current replica count
kubectl get deployment api-service -n production
# Scale up by 50%
current_replicas=$(kubectl get deployment api-service -n production -o jsonpath='{.spec.replicas}')
new_replicas=$((current_replicas * 3 / 2))
kubectl scale deployment api-service -n production --replicas=$new_replicas
# Enable HPA if not already
kubectl autoscale deployment api-service -n production --min=10 --max=50 --cpu-percent=70
Option C: Circuit Breaker (If Dependency Down)
# If error logs show dependency timeouts
# Enable circuit breaker via feature flag
curl -X POST https://feature-flags.company.com/api/flags/circuit-breaker-enable \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-d '{"enabled": true, "service": "downstream-api"}'
# Or update config map
kubectl patch configmap api-config -n production \
--type merge -p '{"data":{"CIRCUIT_BREAKER_ENABLED":"true"}}'
# Restart pods to pick up config
kubectl rollout restart deployment/api-service -n production
Step 3: Root Cause Investigation (15 minutes)
Check Logs:
# Recent errors
kubectl logs deployment/api-service -n production --tail=500 --since=10m | grep -i error
# Stack traces
kubectl logs deployment/api-service -n production --tail=1000 | grep -A 10 "Exception"
# All logs from failing pods
failing_pods=$(kubectl get pods -n production -l app=api-service --field-selector=status.phase!=Running -o name)
for pod in $failing_pods; do
echo "=== Logs from $pod ==="
kubectl logs $pod -n production --tail=100
done
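If a pod has already restarted, the crashed container's logs are usually more informative than the current ones:
# Logs from the previous (crashed) container instance
kubectl logs <pod-name> -n production --previous --tail=200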
Check Metrics:
# CPU usage
kubectl top pods -n production -l app=api-service
# Memory usage
kubectl top pods -n production -l app=api-service --sort-by=memory
# Request rate
curl -s "http://prometheus:9090/api/v1/query?query=rate(http_requests_total{service='api-service'}[5m])"
# Database connections
curl -s "http://prometheus:9090/api/v1/query?query=db_connections_active{service='api-service'}"
Check Dependencies:
# Test database connectivity
kubectl run -i --tty --rm debug --image=postgres:latest --restart=Never -- \
psql -h postgres.production.svc.cluster.local -U appuser -d appdb -c "SELECT 1;"
# Test Redis
kubectl run -i --tty --rm debug --image=redis:latest --restart=Never -- \
redis-cli -h redis.production.svc.cluster.local ping
# Test external API
curl -i -m 5 https://external-api.partner.com/health
Check Network:
# DNS resolution (cluster-local names only resolve inside the cluster, so run from a debug pod)
kubectl run dns-test --image=busybox --rm -it --restart=Never -- \
  nslookup api-service.production.svc.cluster.local
# Network policies
kubectl get networkpolicies -n production
# Service endpoints
kubectl get endpoints api-service -n production
Step 4: Resolution Actions
Common Root Causes & Fixes:
A. Database Connection Pool Exhaustion
# Increase pool size (if safe)
kubectl set env deployment/api-service -n production DB_POOL_SIZE=50
# Or restart pods to reset connections
kubectl rollout restart deployment/api-service -n production
B. Memory Leak / OOM
# Increase memory limits temporarily
kubectl set resources deployment api-service -n production \
--limits=memory=4Gi --requests=memory=2Gi
# Enable heap dump on OOM (Java)
kubectl set env deployment/api-service -n production \
JAVA_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heapdump.hprof"
# Restart rolling
kubectl rollout restart deployment/api-service -n production
C. External Dependency Failure
# Enable graceful degradation
# Update feature flag to bypass failing service
curl -X POST https://feature-flags.company.com/api/flags/use-fallback-service \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-d '{"enabled": true}'
# Or enable cached responses
kubectl set env deployment/api-service -n production ENABLE_CACHE_FALLBACK=true
D. Configuration Error
# ConfigMaps have no rollout history, so re-apply the last known-good manifest from version control
kubectl apply -f <previous-api-config.yaml>
# Restart to pick up the reverted config
kubectl rollout restart deployment/api-service -n production
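To see what drifted from the last applied manifest before reverting (useful when the change was made with kubectl patch or edit; a sketch that assumes the last-applied annotation exists):
kubectl get configmap api-config -n production \
  -o jsonpath='{.metadata.annotations.kubectl\.kubernetes\.io/last-applied-configuration}' | jq .data > /tmp/last-applied.json
kubectl get configmap api-config -n production -o jsonpath='{.data}' | jq . > /tmp/live.json
diff /tmp/last-applied.json /tmp/live.json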
Verification Steps
# 1. Check error rate returned to normal (<0.1%)
curl -s https://api.company.com/metrics | grep error_rate
# 2. Verify all pods healthy
kubectl get pods -n production -l app=api-service | grep -c Running
expected_count=$(kubectl get deployment api-service -n production -o jsonpath='{.spec.replicas}')
echo "Expected: $expected_count"
# 3. Test end-to-end
curl -i -X POST https://api.company.com/v1/test \
-H "Content-Type: application/json" \
-d '{"test": "data"}'
# 4. Check dependent services
curl -i https://api.company.com/health/dependencies
# 5. Monitor for 15 minutes
watch -n 30 'date && curl -s https://api.company.com/metrics | grep -E "error_rate|latency_p99"'
Communication Templates
Initial Announcement (Slack/Status Page):
🚨 INCIDENT: API Service Experiencing High Error Rate
Status: Investigating
Impact: ~40% of API requests failing
Affected: api.company.com
Started: [TIMESTAMP]
Team: Investigating root cause
ETA: 15 minutes for initial mitigation
Updates: Will provide update in 10 minutes
War Room: #incident-2024-1127-001
Update:
📊 UPDATE: API Service Incident
Status: Mitigation Applied
Action: Rolled back deployment v2.3.5
Result: Error rate decreased from 40% to 2%
Next: Monitoring for stability, investigating root cause
ETA: Full resolution in 10 minutes
Resolution:
✅ RESOLVED: API Service Incident
Status: Resolved
Duration: 27 minutes (10:15 AM - 10:42 AM ET)
Root Cause: Database connection pool exhaustion from v2.3.5 config change
Resolution: Rolled back to v2.3.4
Impact: ~2,400 failed requests during incident window
Postmortem: Will be published within 48 hours
Thank you for your patience.
Escalation Criteria
Escalate to Engineering Manager if:
- MTTR exceeds 30 minutes
- Impact >50% of users
- Data loss suspected
- Security implications identified
Escalate to VP Engineering if:
- MTTR exceeds 1 hour
- Major customer impact
- Media/PR implications
- Regulatory reporting required
Contact:
Primary On-Call SRE: [Use PagerDuty]
Engineering Manager: [Slack: @eng-manager] [Phone: XXX-XXX-XXXX]
VP Engineering: [Slack: @vp-eng] [Phone: XXX-XXX-XXXX]
Security Team: security@company.com [Slack: #security-incidents]
Post-Incident Actions
Immediate (Same Day):
- Update incident timeline in documentation
- Notify all stakeholders of resolution
- Begin postmortem document
- Capture all logs, metrics, traces for analysis
- Take database/system snapshots if relevant
Within 48 Hours:
- Complete blameless postmortem
- Identify action items with owners
- Schedule postmortem review meeting
- Update runbook with lessons learned
Within 1 Week:
- Implement quick wins from action items
- Add monitoring/alerting to prevent recurrence
- Share learnings with broader team
Related Runbooks
Reference Links
- Grafana Dashboard: https://grafana.company.com/d/service-health
- PagerDuty Escalation: https://company.pagerduty.com/escalation_policies
- Incident Response Process: https://wiki.company.com/sre/incident-response
- Postmortem Template: https://wiki.company.com/sre/postmortem-template
2. High Latency / Performance Degradation
Metadata
- Severity: SEV-2 (High)
- MTTR Target: < 1 hour
- On-Call Team: SRE, Backend Engineering
- Last Updated: 2024-11-27
- Owner: SRE Team
Symptoms
- P95/P99 latency exceeding SLO
- User complaints about slow responses
- Timeouts in dependent services
- Increased request queue depth
- Slow database queries
Detection
Automated Alerts:
Alert: HighLatencyP99
Severity: Warning
Condition: p99_latency > 500ms for 5 minutes
SLO: p99 < 200ms
Dashboard: https://grafana.company.com/latency
Triage Steps
Step 1: Quantify Impact (2 minutes)
# Check current latency
curl -s https://api.company.com/metrics | grep -E "latency_p50|latency_p95|latency_p99"
# Get latency percentiles from Prometheus
curl -s "http://prometheus:9090/api/v1/query?query=histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{service='api'}[5m]))"
# Affected endpoints
kubectl logs deployment/api-service -n production --tail=1000 | \
awk '{print $7}' | sort | uniq -c | sort -rn | head -10
Document:
Current P99: [XXX ms] (SLO: 200ms)
Current P95: [XXX ms]
Affected Endpoints: [LIST]
User Reports: [NUMBER]
Step 2: Identify Bottleneck (10 minutes)
Check Application Performance:
# CPU throttling
kubectl top pods -n production -l app=api-service
# Check for CPU throttling
kubectl describe pods -n production -l app=api-service | grep -A 5 "cpu"
# Memory pressure
kubectl top pods -n production -l app=api-service --sort-by=memory
# Thread dumps (Java applications)
kubectl exec -it deployment/api-service -n production -- jstack 1 > thread-dump.txt
# Profile CPU (if profiling enabled; omit -t so the binary profile is not mangled by a TTY)
kubectl exec deployment/api-service -n production -- \
  curl -s "http://localhost:6060/debug/pprof/profile?seconds=30" > cpu-profile.out
Check Database:
# Active queries
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 seconds'
ORDER BY duration DESC;
"
# Slow query log
kubectl exec -it postgres-0 -n production -- tail -100 /var/log/postgresql/slow-query.log
# Database connections
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT count(*) as connection_count, state
FROM pg_stat_activity
GROUP BY state;
"
# Lock waits (pg_blocking_pids requires PostgreSQL 9.6+)
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT pid AS blocked_pid,
       pg_blocking_pids(pid) AS blocking_pids,
       query AS blocked_query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
"
Check Cache:
# Redis latency
kubectl exec -it redis-0 -n production -- redis-cli --latency-history
# Cache stats
kubectl exec -it redis-0 -n production -- redis-cli INFO stats | grep -E "hit|miss"
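# The hit rate itself can be computed from those counters (a sketch):
kubectl exec -it redis-0 -n production -- redis-cli INFO stats | \
  awk -F: '/keyspace_(hits|misses)/ {gsub(/\r/,""); v[$1]=$2}
           END {h=v["keyspace_hits"]; m=v["keyspace_misses"];
                if (h+m > 0) printf "cache hit rate: %.1f%%\n", 100*h/(h+m)}'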
# Memory usage
kubectl exec -it redis-0 -n production -- redis-cli INFO memory | grep used_memory_human
# Slow log
kubectl exec -it redis-0 -n production -- redis-cli SLOWLOG GET 10
Check Network:
# Network latency to dependencies
for service in postgres redis external-api; do
echo "=== $service ==="
kubectl run ping-test --image=busybox --rm -it --restart=Never -- \
ping -c 5 $service.production.svc.cluster.local
done
# DNS lookup times
kubectl run dns-test --image=busybox --rm -it --restart=Never -- \
nslookup api.company.com
# External API latency
time curl -X GET https://external-api.partner.com/v1/data
Check Distributed Traces:
# Identify slow spans in Jaeger
# Navigate to Jaeger UI: https://jaeger.company.com
# Filter by:
# - Service: api-service
# - Min Duration: 500ms
# - Lookback: 1 hour
# Programmatic trace query
curl "http://jaeger-query:16686/api/traces?service=api-service&limit=20&lookback=1h&minDuration=500ms"
Step 3: Apply Mitigation
Scenario A: Database Slow Queries
# Kill long-running queries (if safe)
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '60 seconds'
AND state = 'active'
AND pid <> pg_backend_pid();
"
# Add missing index (if identified)
kubectl exec -it postgres-0 -n production -- psql -U postgres -d appdb -c "
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);
"
# Analyze tables (update statistics)
kubectl exec -it postgres-0 -n production -- psql -U postgres -d appdb -c "
ANALYZE VERBOSE users;
"
# Scale read replicas
kubectl scale statefulset postgres-replica -n production --replicas=5
Scenario B: Cache Miss Storm
# Pre-warm cache with common queries
kubectl exec -it deployment/api-service -n production -- \
curl -X POST http://localhost:8080/admin/cache/warmup
# Increase cache size
kubectl exec -it redis-0 -n production -- redis-cli CONFIG SET maxmemory 4gb
# Enable cache fallback to stale data
kubectl set env deployment/api-service -n production CACHE_SERVE_STALE=true
Scenario C: CPU/Memory Constrained
# Increase resources
kubectl set resources deployment api-service -n production \
--limits=cpu=2000m,memory=4Gi \
--requests=cpu=1000m,memory=2Gi
# Scale horizontally
kubectl scale deployment api-service -n production --replicas=20
# Enable HPA
kubectl autoscale deployment api-service -n production \
--min=10 --max=30 --cpu-percent=60
Scenario D: External API Slow
# Increase timeout and enable caching
kubectl set env deployment/api-service -n production \
EXTERNAL_API_TIMEOUT=10000 \
EXTERNAL_API_CACHE_ENABLED=true \
EXTERNAL_API_CACHE_TTL=300
# Enable circuit breaker
kubectl set env deployment/api-service -n production \
CIRCUIT_BREAKER_ENABLED=true \
CIRCUIT_BREAKER_THRESHOLD=50
# Use fallback/cached data
kubectl patch configmap api-config -n production \
--type merge -p '{"data":{"USE_FALLBACK_DATA":"true"}}'
Scenario E: Thread Pool Exhaustion
# Increase thread pool size
kubectl set env deployment/api-service -n production \
THREAD_POOL_SIZE=200 \
THREAD_QUEUE_SIZE=1000
# Restart to apply
kubectl rollout restart deployment/api-service -n production
Verification Steps
# 1. Monitor latency improvement
watch -n 10 'curl -s https://api.company.com/metrics | grep latency_p99'
# 2. Check trace samples
# View Jaeger for recent requests - should show improved latency
# 3. Database query times
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
"
# 4. Resource utilization normalized
kubectl top pods -n production -l app=api-service
# 5. Error rate stable (ensure fix didn't introduce errors)
curl -s https://api.company.com/metrics | grep error_rate
Root Cause Investigation
Common Causes:
- N+1 Query Problem: check ORM query patterns, enable query logging, add eager loading
- Missing Database Index: analyze the slow query log, use EXPLAIN ANALYZE (see the sketch after this list), create appropriate indexes
- Memory Garbage Collection: check GC logs (Java/JVM), tune GC parameters, increase heap size
- Inefficient Algorithm: profile code execution, identify hot paths, optimize algorithms
- External Service Degradation: check dependency SLOs, implement caching, add circuit breakers
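As a concrete check for the missing-index case above (a sketch; the table and column are illustrative):
kubectl exec -it postgres-0 -n production -- psql -U postgres -d appdb -c "
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'user@example.com';
"
# A "Seq Scan on users" node with a high actual time suggests a missing index on users(email)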
Communication Templates
Initial Alert:
⚠️ INCIDENT: API Latency Degradation
Status: Investigating
Impact: P99 latency at 800ms (SLO: 200ms)
Affected: All API endpoints
User Impact: Slow response times
Team: Investigating root cause
Updates: Every 15 minutes in #incident-channel
Resolution:
✅ RESOLVED: API Latency Degradation
Duration: 45 minutes
Root Cause: Missing database index on users.email causing table scans
Resolution: Added index, latency returned to normal
Current P99: 180ms (within SLO)
Postmortem: Will be published within 48 hours
Post-Incident Actions
- Add database query monitoring
- Implement automated index recommendations
- Load test with realistic data volumes
- Add latency SLO alerts per endpoint
- Review and optimize slow queries
- Implement APM (Application Performance Monitoring)
Related Runbooks
3. Database Connection Pool Exhaustion
Metadata
- Severity: SEV-1 (Critical)
- MTTR Target: < 15 minutes
- On-Call Team: SRE, Database Team
- Last Updated: 2024-11-27
Symptoms
- "Connection pool exhausted" errors in application logs
- Requests timing out
- Database showing many idle connections
- Application unable to acquire new connections
- Connection pool at 100% utilization
Detection
# Check connection pool metrics
curl -s https://api.company.com/metrics | grep db_pool
# Expected output:
# db_pool_active 50
# db_pool_idle 0
# db_pool_total 50
# db_pool_wait_count 1500 <-- High wait count indicates problem
Triage Steps
Step 1: Confirm Pool Exhaustion (1 minute)
# Application side
kubectl logs deployment/api-service -n production --tail=100 | \
grep -i "connection pool\|unable to acquire\|timeout"
# Database side - check connection count
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT count(*) as total_connections,
state,
application_name
FROM pg_stat_activity
WHERE datname = 'appdb'
GROUP BY state, application_name
ORDER BY total_connections DESC;
"
# Check pool configuration
kubectl get configmap api-config -n production -o yaml | grep -i pool
Step 2: Immediate Mitigation (5 minutes)
Option A: Restart Application Pods (Fastest)
# Rolling restart to reset connections
kubectl rollout restart deployment/api-service -n production
# Monitor restart
kubectl rollout status deployment/api-service -n production
# Verify connections released
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT count(*) FROM pg_stat_activity WHERE datname = 'appdb';
"
Option B: Increase Pool Size (If Infrastructure Allows)
# Check database connection limit
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SHOW max_connections;
"
# Calculate safe pool size
# max_connections / number_of_app_instances = pool_size_per_instance
# Example: 200 max / 10 instances = 20 per instance (current might be 50)
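# A sketch of the same calculation from live values (assumes the deployment/pod names used above):
max_conn=$(kubectl exec postgres-0 -n production -- psql -U postgres -tAc "SHOW max_connections;")
replicas=$(kubectl get deployment api-service -n production -o jsonpath='{.spec.replicas}')
# Keep ~20% headroom for superuser and maintenance connections
echo "Safe DB_POOL_SIZE per instance: $(( max_conn * 80 / 100 / replicas ))"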
# Increase pool size temporarily
kubectl set env deployment/api-service -n production \
DB_POOL_SIZE=30 \
DB_POOL_MAX_IDLE=10 \
DB_CONNECTION_TIMEOUT=30000
# Monitor
watch -n 5 'kubectl logs deployment/api-service -n production --tail=50 | grep -i pool'
Option C: Kill Idle Connections (If Many Idle)
# Identify idle connections
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT pid, state, query_start, state_change, query
FROM pg_stat_activity
WHERE datname = 'appdb'
AND state = 'idle'
AND state_change < now() - interval '5 minutes';
"
# Kill long-idle connections
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'appdb'
AND state = 'idle'
AND state_change < now() - interval '10 minutes'
AND pid <> pg_backend_pid();
"
Step 3: Root Cause Analysis (10 minutes)
Check for Connection Leaks:
# Application logs - look for unclosed connections
kubectl logs deployment/api-service -n production --tail=5000 | \
grep -i "connection not closed\|resource leak"
# Check connection lifecycle
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT application_name,
state,
count(*) as conn_count,
max(now() - state_change) as max_idle_time
FROM pg_stat_activity
WHERE datname = 'appdb'
GROUP BY application_name, state
ORDER BY conn_count DESC;
"
# Long-running transactions
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT pid, now() - xact_start AS duration, state, query
FROM pg_stat_activity
WHERE datname = 'appdb'
AND xact_start IS NOT NULL
ORDER BY duration DESC
LIMIT 20;
"
Check Recent Changes:
# Recent deployments
kubectl rollout history deployment/api-service -n production | tail -5
# Config changes
kubectl get configmap api-config -n production -o yaml | \
grep -A 2 "last-applied-configuration"
# Recent code changes affecting database access
git log --since="24 hours ago" --grep="database\|pool\|connection" --oneline
Check for Traffic Spike:
# Request rate
curl "http://prometheus:9090/api/v1/query?query=rate(http_requests_total{service='api-service'}[5m])"
# Compare to baseline
curl "http://prometheus:9090/api/v1/query?query=rate(http_requests_total{service='api-service'}[5m] offset 1h)"
Resolution Actions
Permanent Fix Options:
A. Fix Connection Leak in Code
# Bad - connection leak
def get_user(user_id):
conn = db_pool.getconn()
cursor = conn.cursor()
cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
result = cursor.fetchone()
return result # Connection never returned!
# Good - always return the connection in a finally block
def get_user(user_id):
    conn = db_pool.getconn()
    try:
        with conn.cursor() as cursor:
            cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
            return cursor.fetchone()
    finally:
        db_pool.putconn(conn)  # connection goes back to the pool even if the query raises
B. Optimize Pool Configuration
# Configure based on actual usage patterns
kubectl set env deployment/api-service -n production \
DB_POOL_SIZE=20 \
DB_POOL_MIN_IDLE=5 \
DB_POOL_MAX_IDLE=10 \
DB_POOL_IDLE_TIMEOUT=300000 \
DB_POOL_CONNECTION_TIMEOUT=30000 \
DB_POOL_VALIDATION_TIMEOUT=5000
# Enable connection validation
kubectl set env deployment/api-service -n production \
DB_POOL_TEST_ON_BORROW=true \
DB_POOL_TEST_WHILE_IDLE=true
C. Implement Connection Pooler (PgBouncer)
# Deploy PgBouncer
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
name: pgbouncer
namespace: production
spec:
replicas: 3
selector:
matchLabels:
app: pgbouncer
template:
metadata:
labels:
app: pgbouncer
spec:
containers:
- name: pgbouncer
image: pgbouncer/pgbouncer:latest
env:
- name: DATABASES_HOST
value: postgres.production.svc.cluster.local
- name: POOL_MODE
value: transaction
- name: MAX_CLIENT_CONN
value: "1000"
- name: DEFAULT_POOL_SIZE
value: "25"
---
apiVersion: v1
kind: Service
metadata:
name: pgbouncer
namespace: production
spec:
selector:
app: pgbouncer
ports:
- port: 5432
targetPort: 5432
EOF
# Update application to use PgBouncer
kubectl set env deployment/api-service -n production \
DB_HOST=pgbouncer.production.svc.cluster.local
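To confirm PgBouncer is actually pooling, its admin console can be queried (a sketch; assumes the application user is listed in PgBouncer's stats_users):
kubectl run -i --tty --rm pgb-check --image=postgres:latest --restart=Never -- \
  psql -h pgbouncer.production.svc.cluster.local -p 5432 -U appuser pgbouncer -c "SHOW POOLS;"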
D. Scale Database Connections
# Increase PostgreSQL max_connections
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
ALTER SYSTEM SET max_connections = 300;
"
# max_connections only takes effect after a restart; pg_reload_conf() is not enough.
# Restarting the primary is disruptive - coordinate with the database team first.
kubectl rollout restart statefulset/postgres -n production
kubectl rollout status statefulset/postgres -n production
Monitoring & Alerting
Add Proactive Monitoring:
# Prometheus alert rule
groups:
- name: database_pool
rules:
- alert: ConnectionPoolHighUtilization
expr: db_pool_active / db_pool_total > 0.7
for: 5m
labels:
severity: warning
annotations:
summary: "Connection pool utilization >70%"
description: "Pool at {{ $value | humanizePercentage }} capacity"
- alert: ConnectionPoolExhausted
expr: db_pool_active / db_pool_total > 0.9
for: 2m
labels:
severity: critical
annotations:
summary: "Connection pool nearly exhausted"
- alert: ConnectionPoolWaitTime
expr: rate(db_pool_wait_count[5m]) > 10
for: 5m
labels:
severity: warning
annotations:
summary: "High connection pool wait count"
Dashboard Metrics:
- db_pool_total (configured pool size)
- db_pool_active (connections in use)
- db_pool_idle (connections available)
- db_pool_wait_count (requests waiting for connection)
- db_pool_wait_time_ms (time waiting for connection)
- db_connection_lifetime_seconds (connection age histogram)
Verification Steps
# 1. Pool utilization back to normal
kubectl exec -it deployment/api-service -n production -- \
curl http://localhost:8080/metrics | grep db_pool_active
# Should see: db_pool_active < 70% of db_pool_total
# 2. No wait queue
kubectl exec -it deployment/api-service -n production -- \
curl http://localhost:8080/metrics | grep db_pool_wait_count
# Should see: db_pool_wait_count = 0 or minimal
# 3. Database connection count stable
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT count(*) FROM pg_stat_activity WHERE datname = 'appdb';
"
# 4. No errors in logs
kubectl logs deployment/api-service -n production --tail=100 | \
grep -i "connection" | grep -i error
# 5. Response times normal
curl -s https://api.company.com/metrics | grep latency_p99
Prevention
Code Review Checklist:
- All database connections properly closed
- Using connection pool best practices
- Proper error handling to ensure connection release
- No connection usage outside transactions
- Connection timeout configured
Testing:
- Load test with connection pool monitoring
- Chaos engineering: Test with limited connections
- Connection leak detection in CI/CD
Architecture:
- Consider connection pooler (PgBouncer)
- Implement read replicas to distribute load
- Use caching to reduce database queries
Related Runbooks
4. Memory Leak / OOM Kills
Metadata
- Severity: SEV-2 (High)
- MTTR Target: < 1 hour
- On-Call Team: SRE, Backend Engineering
- Last Updated: 2024-11-27
Symptoms
- Pods being OOMKilled (Out of Memory)
- Memory usage continuously increasing
- Slow performance and increased GC pressure
- Pod restarts without clear error
- "Cannot allocate memory" errors
Detection
# Check for OOMKilled pods
kubectl get pods -n production -l app=api-service | grep OOMKilled
# Check pod events
kubectl get events -n production --field-selector involvedObject.name=api-service-xxx | \
grep -i oom
# Memory usage trend
kubectl top pods -n production -l app=api-service --sort-by=memory
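To list every pod whose last container exit was an OOM kill (a sketch using jq):
kubectl get pods -n production -l app=api-service -o json | \
  jq -r '.items[] | select(.status.containerStatuses[]?.lastState.terminated.reason == "OOMKilled") | .metadata.name'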
Triage Steps
Step 1: Confirm OOM Issue (2 minutes)
# Check pod status and restart reason
kubectl describe pod <pod-name> -n production | grep -A 10 "Last State"
# Should see output like:
# Last State: Terminated
# Reason: OOMKilled
# Exit Code: 137
# Check memory limits
kubectl get pod <pod-name> -n production -o jsonpath='{.spec.containers[*].resources}'
# Monitor memory usage
watch -n 5 'kubectl top pods -n production -l app=api-service'
Step 2: Immediate Mitigation (10 minutes)
Option A: Increase Memory Limits (Quick Fix)
# Current limits
kubectl get deployment api-service -n production -o jsonpath='{.spec.template.spec.containers[0].resources}'
# Increase memory limit temporarily (2x current)
kubectl set resources deployment api-service -n production \
--limits=memory=4Gi \
--requests=memory=2Gi
# Monitor rollout
kubectl rollout status deployment/api-service -n production
# Watch memory usage
watch -n 10 'kubectl top pods -n production -l app=api-service'
Option B: Scale Out (If Memory Leak is Gradual)
# Add more pods to distribute load
kubectl scale deployment api-service -n production --replicas=15
# Enable HPA with lower memory target
kubectl autoscale deployment api-service -n production \
--min=10 --max=30 --cpu-percent=70
# Note: This is temporary - still need to fix leak
Option C: Implement Pod Lifecycle (Workaround)
# The preStop hook below only adds graceful-shutdown headroom; proactive restarts
# before OOM are handled by the restart CronJob under Resolution Actions
# Add to deployment spec:
kubectl patch deployment api-service -n production --type=json -p='[
{
"op": "add",
"path": "/spec/template/spec/containers/0/lifecycle",
"value": {
"preStop": {
"exec": {
"command": ["/bin/sh", "-c", "sleep 15"]
}
}
}
}
]'
# Add TTL to pod lifecycle (requires external controller)
# Or implement rolling restart every 12 hours
Step 3: Capture Diagnostics (15 minutes)
Capture Heap Dump (Java/JVM):
# Before pod is killed, capture heap dump
kubectl exec -it <pod-name> -n production -- \
jcmd 1 GC.heap_dump /tmp/heapdump.hprof
# Copy heap dump locally
kubectl cp production/<pod-name>:/tmp/heapdump.hprof ./heapdump-$(date +%Y%m%d-%H%M%S).hprof
# Analyze with MAT or jhat
# Upload to analysis tools or analyze locally
Capture Memory Profile (Go Applications):
# If profiling endpoint enabled (the redirect below writes the profile to a local file)
kubectl exec <pod-name> -n production -- \
  curl -s http://localhost:6060/debug/pprof/heap > heap-$(date +%Y%m%d-%H%M%S).prof
# Analyze
go tool pprof -http=:8080 heap-*.prof
Capture Memory Metrics (Python):
# Install memory_profiler if not already
kubectl exec -it <pod-name> -n production -- pip install memory_profiler
# Profile specific function
kubectl exec -it <pod-name> -n production -- \
python -m memory_profiler app.py
# tracemalloc must be started inside the application process itself (at startup or
# behind an admin endpoint); the snippet below only demonstrates the API
kubectl exec -it <pod-name> -n production -- python3 <<'EOF'
import tracemalloc
tracemalloc.start()
# ... exercise the suspect code path here, then take a snapshot ...
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')
for stat in top_stats[:10]:
    print(stat)
EOF
Check for Common Memory Issues:
# Large objects in memory
kubectl exec -it <pod-name> -n production -- \
curl http://localhost:8080/debug/vars | jq '.memstats'
# Check goroutine leaks (Go)
kubectl exec -it <pod-name> -n production -- \
curl http://localhost:6060/debug/pprof/goroutine?debug=1
# Check thread count (Java)
kubectl exec -it <pod-name> -n production -- \
jcmd 1 Thread.print | grep "Thread" | wc -l
# File descriptor leaks
kubectl exec -it <pod-name> -n production -- \
ls -la /proc/1/fd | wc -l
Step 4: Identify Root Cause
Common Memory Leak Causes:
A. Caching Without Eviction:
# Bad - unbounded cache
cache = {}
def get_user(user_id):
if user_id not in cache:
cache[user_id] = fetch_from_db(user_id) # Cache grows forever!
return cache[user_id]
# Good - bounded cache with LRU
from functools import lru_cache
@lru_cache(maxsize=1000)
def get_user(user_id):
return fetch_from_db(user_id)
B. Event Listeners Not Removed:
// Bad - event listener leak
class Component {
constructor() {
this.data = new Array(1000000);
window.addEventListener("resize", () => this.handleResize());
}
// Missing cleanup!
}
// Good - cleanup listeners
class Component {
constructor() {
this.data = new Array(1000000);
this.handleResize = this.handleResize.bind(this);
window.addEventListener("resize", this.handleResize);
}
destroy() {
window.removeEventListener("resize", this.handleResize);
this.data = null;
}
}
C. Goroutine Leaks (Go):
// Bad - goroutine leak
func processRequests() {
for request := range requests {
go handleRequest(request) // Goroutines never cleaned up
}
}
// Good - bounded goroutines
func processRequests() {
sem := make(chan struct{}, 100) // Max 100 concurrent
for request := range requests {
sem <- struct{}{}
go func(req Request) {
defer func() { <-sem }()
handleRequest(req)
}(request)
}
}
D. Database Result Sets Not Closed:
// Bad - result set leak
public List<User> getUsers() {
Statement stmt = conn.createStatement();
ResultSet rs = stmt.executeQuery("SELECT * FROM users");
List<User> users = new ArrayList<>();
while (rs.next()) {
users.add(new User(rs));
}
return users; // ResultSet and Statement never closed!
}
// Good - use try-with-resources
public List<User> getUsers() {
List<User> users = new ArrayList<>();
try (Statement stmt = conn.createStatement();
ResultSet rs = stmt.executeQuery("SELECT * FROM users")) {
while (rs.next()) {
users.add(new User(rs));
}
}
return users;
}
Resolution Actions
Short-term:
# 1. Increase memory limits (already done)
# 2. Enable memory monitoring
kubectl patch deployment api-service -n production --type=json -p='[
{
"op": "add",
"path": "/spec/template/metadata/annotations/prometheus.io~1scrape",
"value": "true"
}
]'
# 3. Add a liveness probe (the /health endpoint should report unhealthy when memory is critical)
kubectl patch deployment api-service -n production --type=json -p='[
{
"op": "add",
"path": "/spec/template/spec/containers/0/livenessProbe",
"value": {
"httpGet": {
"path": "/health",
"port": 8080
},
"initialDelaySeconds": 30,
"periodSeconds": 10
}
}
]'
# 4. Implement automatic restart before OOM
# Create CronJob to restart pods every 12 hours (temporary)
kubectl create -f - <<EOF
apiVersion: batch/v1
kind: CronJob
metadata:
name: api-service-restart
namespace: production
spec:
schedule: "0 */12 * * *"
jobTemplate:
spec:
template:
spec:
serviceAccountName: pod-restarter
containers:
- name: kubectl
image: bitnami/kubectl:latest
command:
- /bin/sh
- -c
- kubectl rollout restart deployment/api-service -n production
restartPolicy: OnFailure
EOF
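The pod-restarter service account referenced above is not created by this manifest; a minimal RBAC sketch that lets it restart the deployment (names are assumptions matching the CronJob):
kubectl apply -f - <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pod-restarter
  namespace: production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-restarter
  namespace: production
rules:
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-restarter
  namespace: production
subjects:
- kind: ServiceAccount
  name: pod-restarter
  namespace: production
roleRef:
  kind: Role
  name: pod-restarter
  apiGroup: rbac.authorization.k8s.io
EOF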
Long-term Fix:
# 1. Fix code leak (deploy patch)
git checkout -b fix/memory-leak
# ... implement fix ...
git commit -m "Fix: Remove unbounded cache causing memory leak"
git push origin fix/memory-leak
# 2. Add memory profiling in production
kubectl set env deployment/api-service -n production \
ENABLE_PROFILING=true \
PROFILING_PORT=6060
# 3. Implement memory limits in code
# For Java:
kubectl set env deployment/api-service -n production \
JAVA_OPTS="-Xmx2g -XX:MaxMetaspaceSize=256m -XX:+UseG1GC"
# 4. Add memory monitoring dashboard
# 5. Implement alerts for memory growth
Monitoring & Prevention
Add Alerts:
groups:
- name: memory_alerts
rules:
- alert: MemoryUsageHigh
expr: container_memory_usage_bytes{pod=~"api-service.*"} / container_spec_memory_limit_bytes{pod=~"api-service.*"} > 0.8
for: 10m
labels:
severity: warning
annotations:
summary: "Pod memory usage >80%"
- alert: MemoryUsageGrowing
expr: predict_linear(container_memory_usage_bytes{pod=~"api-service.*"}[1h], 3600) > container_spec_memory_limit_bytes{pod=~"api-service.*"}
for: 15m
labels:
severity: warning
annotations:
summary: "Memory usage trending towards OOM"
- alert: OOMKillsDetected
expr: increase(kube_pod_container_status_restarts_total{pod=~"api-service.*"}[15m]) > 3
labels:
severity: critical
annotations:
summary: "Multiple pod restarts detected (possible OOM)"
Grafana Dashboard:
- Memory Usage (%)
- Memory Usage (bytes) over time
- Predicted time to OOM
- GC frequency and duration
- Heap size vs used heap
- Number of objects in memory
- Pod restart count
Verification Steps
# 1. Memory usage stable
watch -n 30 'kubectl top pods -n production -l app=api-service | tail -5'
# 2. No OOM kills in the last hour (check the LAST SEEN column)
kubectl get events -n production --field-selector reason=OOMKilling --sort-by='.lastTimestamp' | tail -20
# 3. Pod uptime increasing (not restarting)
kubectl get pods -n production -l app=api-service -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.startTime}{"\n"}{end}'
# 4. Memory growth linear/flat (not exponential)
# Check Grafana memory usage graph
# 5. Application metrics healthy
curl -s https://api.company.com/metrics | grep -E "heap|gc|memory"
Post-Incident Actions
- Analyze heap dump to identify leak source
- Review code for common leak patterns
- Add memory profiling to CI/CD
- Implement memory budgets in code
- Add integration tests for memory leaks
- Document memory configuration guidelines
- Train team on memory leak prevention
Related Runbooks
5. Disk Space Exhaustion
Metadata
- Severity: SEV-2 (High), escalates to SEV-1 if database affected
- MTTR Target: < 30 minutes
- On-Call Team: SRE, Infrastructure
- Last Updated: 2024-11-27
Symptoms
- "No space left on device" errors
- Applications unable to write logs
- Database unable to write data
- Pod evictions due to disk pressure
- Slow I/O performance
Detection
# Check disk usage on nodes
kubectl get nodes -o custom-columns=NAME:.metadata.name,DISK:.status.allocatable.ephemeral-storage
# Check specific node
kubectl describe node <node-name> | grep -A 5 "Allocated resources"
# SSH to node and check
ssh <node> df -h
# Check for disk pressure
kubectl get nodes -o json | jq '.items[] | select(.status.conditions[] | select(.type=="DiskPressure" and .status=="True")) | .metadata.name'
Triage Steps
Step 1: Identify Affected Systems (2 minutes)
# Which nodes are affected?
for node in $(kubectl get nodes -o name); do
echo "=== $node ==="
kubectl describe $node | grep -E "DiskPressure|ephemeral-storage"
done
# Which pods are on affected nodes?
kubectl get pods -n production -o wide | grep <affected-node>
# Critical services affected?
kubectl get pods -n production -l tier=critical -o wide
Step 2: Immediate Mitigation (10 minutes)
Option A: Clean Up Logs
# SSH to affected node
ssh <node-name>
# Find large log files
sudo find /var/log -type f -size +100M -exec ls -lh {} \; | awk '{ print $9 ": " $5 }'
# Rotate/truncate large logs
sudo truncate -s 0 /var/log/containers/*.log
sudo journalctl --vacuum-size=500M
# Clean Docker logs (if applicable)
sudo sh -c "truncate -s 0 /var/lib/docker/containers/*/*-json.log"
# Kubernetes log cleanup
sudo find /var/log/pods -name "*.log" -mtime +7 -delete
Option B: Remove Unused Docker Images
# On affected node
ssh <node-name>
# List images sorted by size
sudo docker images --format "{{.Repository}}:{{.Tag}}\t{{.Size}}" | sort -k2 -h
# Remove unused images
sudo docker image prune -a --filter "until=72h" -f
# Remove dangling volumes
sudo docker volume prune -f
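On nodes running containerd rather than Docker (verify per node), the equivalent image cleanup uses crictl:
sudo crictl rmi --prune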
Option C: Clean Up Pod Ephemeral Storage
# Find pods using most disk
kubectl get pods -n production -o json | \
jq -r '.items[] | "\(.metadata.name) \(.spec.nodeName)"' | \
while read pod node; do
if [ "$node" = "<affected-node>" ]; then
echo "=== $pod ==="
kubectl exec $pod -n production -- du -sh /tmp /var/tmp 2>/dev/null || true
fi
done
# Clean up specific pod
kubectl exec <pod-name> -n production -- sh -c "rm -rf /tmp/*"
Option D: Cordon and Drain Node
# Prevent new pods from scheduling
kubectl cordon <node-name>
# Drain pods to other nodes
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --grace-period=60
# Clean up on node
ssh <node-name>
sudo docker system prune -a -f --volumes
sudo rm -rf /var/log/pods/*
sudo rm -rf /var/lib/kubelet/pods/*
# Uncordon when ready
kubectl uncordon <node-name>
Option E: Emergency Database Cleanup (If DB Affected)
# Connect to database pod
kubectl exec -it postgres-0 -n production -- psql -U postgres
# Check database sizes
SELECT pg_database.datname,
pg_size_pretty(pg_database_size(pg_database.datname)) AS size
FROM pg_database
ORDER BY pg_database_size(pg_database.datname) DESC;
# Check table sizes
SELECT schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 20;
# Archive old data (if safe)
# Example: Archive logs older than 90 days
BEGIN;
COPY (SELECT * FROM audit_logs WHERE created_at < NOW() - INTERVAL '90 days')
TO PROGRAM 'gzip > /tmp/audit_logs_archive_$(date +%Y%m%d).csv.gz' WITH CSV HEADER;
DELETE FROM audit_logs WHERE created_at < NOW() - INTERVAL '90 days';
COMMIT;
# Vacuum to reclaim space
# Caution: VACUUM FULL rewrites the table and needs roughly the table's size in free disk;
# if space is critically low, run plain VACUUM instead
VACUUM FULL audit_logs;
Step 3: Root Cause Analysis
Find What's Consuming Space:
# On affected node
ssh <node-name>
# Find largest directories
sudo du -h --max-depth=3 / 2>/dev/null | sort -hr | head -20
# Find largest files
sudo find / -type f -size +500M -exec ls -lh {} \; 2>/dev/null | awk '{ print $9 ": " $5 }'
# Check specific directories
sudo du -sh /var/lib/docker/*
sudo du -sh /var/lib/kubelet/*
sudo du -sh /var/log/*
# Find recently created large files
sudo find / -type f -size +100M -mtime -1 -exec ls -lh {} \; 2>/dev/null
Common Culprits:
- Excessive Logging
# Check log volume
kubectl exec <pod-name> -n production -- du -sh /var/log
# Check logging rate
kubectl logs <pod-name> -n production --tail=100 --timestamps | \
awk '{print $1}' | sort | uniq -c
- Temp File Accumulation
# Check temp directories
kubectl exec <pod-name> -n production -- du -sh /tmp /var/tmp
# Find old temp files
kubectl exec <pod-name> -n production -- find /tmp -type f -mtime +7 -ls
- Database Growth
# PostgreSQL WAL files
kubectl exec postgres-0 -n production -- \
du -sh /var/lib/postgresql/data/pg_wal/
# MySQL binary logs
kubectl exec mysql-0 -n production -- \
du -sh /var/lib/mysql/binlog/
- Image/Container Buildup
# Unused containers
sudo docker ps -a --filter "status=exited" --filter "status=dead"
# Image layer cache
sudo du -sh /var/lib/docker/overlay2/
Resolution Actions
Short-term:
# 1. Implement log rotation
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
name: filebeat-config
namespace: production
data:
filebeat.yml: |
filebeat.inputs:
- type: container
paths:
- /var/log/containers/*.log
processors:
- add_kubernetes_metadata:
host: \${NODE_NAME}
matchers:
- logs_path:
logs_path: "/var/log/containers/"
output.logstash:
hosts: ["logstash:5044"]
# Local log cleanup
queue.mem:
events: 4096
flush.min_events: 512
flush.timeout: 5s
EOF
# 2. Set up log shipping
kubectl apply -f https://raw.githubusercontent.com/elastic/beats/master/deploy/kubernetes/filebeat-kubernetes.yaml
# 3. Configure log rotation on nodes
# Add to node configuration or DaemonSet
cat <<'EOF' | sudo tee /etc/logrotate.d/containers
/var/log/containers/*.log {
daily
rotate 7
compress
missingok
notifempty
create 0644 root root
postrotate
/usr/bin/docker ps -a --format '{{.Names}}' | xargs -I {} docker kill -s HUP {} 2>/dev/null || true
endscript
}
EOF
Long-term:
# 1. Set ephemeral-storage limits on pods
kubectl patch deployment api-service -n production --type=json -p='[
{
"op": "add",
"path": "/spec/template/spec/containers/0/resources/limits/ephemeral-storage",
"value": "2Gi"
},
{
"op": "add",
"path": "/spec/template/spec/containers/0/resources/requests/ephemeral-storage",
"value": "1Gi"
}
]'
# 2. Enable disk usage monitoring
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
name: node-exporter-config
namespace: monitoring
data:
entrypoint.sh: |
#!/bin/sh
exec /bin/node_exporter \
--collector.filesystem \
--collector.filesystem.mount-points-exclude="^/(dev|proc|sys|var/lib/docker/.+)(\$|/)" \
--web.listen-address=:9100
EOF
# 3. Set up automated cleanup CronJob
kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: CronJob
metadata:
name: disk-cleanup
namespace: kube-system
spec:
schedule: "0 2 * * *" # 2 AM daily
jobTemplate:
spec:
template:
spec:
hostPID: true
hostNetwork: true
containers:
- name: cleanup
image: alpine:latest
command:
- /bin/sh
- -c
- |
# Clean old logs
find /host/var/log/pods -name "*.log" -mtime +7 -delete
# Clean old containers
nsenter --mount=/proc/1/ns/mnt -- docker system prune -af --filter "until=72h"
securityContext:
privileged: true
volumeMounts:
- name: host
mountPath: /host
volumes:
- name: host
hostPath:
path: /
restartPolicy: OnFailure
EOF
# 4. Implement disk alerts
kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: disk-space-alerts
namespace: monitoring
spec:
groups:
- name: disk
rules:
- alert: NodeDiskSpaceHigh
expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.2
for: 5m
labels:
severity: warning
annotations:
summary: "Node disk space <20%"
- alert: NodeDiskSpaceCritical
expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.1
for: 2m
labels:
severity: critical
annotations:
summary: "Node disk space <10%"
EOF
Monitoring & Prevention
Metrics to Track:
- node_filesystem_avail_bytes (available disk space)
- node_filesystem_size_bytes (total disk space)
- container_fs_usage_bytes (container filesystem usage)
- kubelet_volume_stats_used_bytes (PV usage)
- log_file_size (application log sizes)
Dashboards:
- Node disk usage per mount point
- Pod ephemeral storage usage
- PV usage trends
- Log growth rate
- Image/container count over time
Verification Steps
# 1. Disk space recovered
ssh <node-name> df -h /
# 2. No disk pressure
kubectl describe node <node-name> | grep DiskPressure
# 3. Pods stable
kubectl get pods -n production -o wide | grep <node-name>
# 4. Services healthy
curl -i https://api.company.com/health
# 5. No pod evictions
kubectl get events --field-selector reason=Evicted -n production
Post-Incident Actions
- Analyze what caused disk fill
- Implement proper log management strategy
- Set ephemeral-storage limits on all pods
- Configure automated cleanup
- Add capacity planning for storage
- Review and optimize logging verbosity
- Document disk space requirements
Related Runbooks
6. Certificate Expiration
Metadata
- Severity: SEV-1 (Critical) if expired, SEV-3 (Low) if approaching
- MTTR Target: < 15 minutes for emergency renewal (prevention should make expiry a non-event)
- On-Call Team: SRE, Security
- Last Updated: 2024-11-27
Symptoms
- "SSL certificate has expired" errors
- Browsers showing security warnings
- API clients unable to connect
- Services failing TLS handshake
- Certificate validation errors in logs
Detection
# Check certificate expiration
echo | openssl s_client -servername api.company.com -connect api.company.com:443 2>/dev/null | \
openssl x509 -noout -dates
# Check all Kubernetes TLS secrets
kubectl get secrets -A -o json | \
jq -r '.items[] | select(.type=="kubernetes.io/tls") | "\(.metadata.namespace)/\(.metadata.name)"' | \
while read secret; do
namespace=$(echo $secret | cut -d/ -f1)
name=$(echo $secret | cut -d/ -f2)
expiry=$(kubectl get secret $name -n $namespace -o jsonpath='{.data.tls\.crt}' | \
base64 -d | openssl x509 -noout -enddate 2>/dev/null | cut -d= -f2)
echo "$secret: $expiry"
done
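To test for upcoming rather than already-past expiry, openssl's -checkend flag takes a window in seconds (a sketch using a 30-day window):
echo | openssl s_client -servername api.company.com -connect api.company.com:443 2>/dev/null | \
  openssl x509 -noout -checkend 2592000 || echo "Certificate expires within 30 days"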
# Check certificate in