
SRE Runbook

· 34 min read
Femi Adigun
Senior Software Engineer & Coach

Table of Contents

  1. Service Outage / High Error Rate
  2. High Latency / Performance Degradation
  3. Database Connection Pool Exhaustion
  4. Memory Leak / OOM Kills
  5. Disk Space Exhaustion
  6. Certificate Expiration
  7. DDoS Attack / Traffic Surge
  8. Kubernetes Pod CrashLoopBackOff
  9. Message Queue Backup / Consumer Lag
  10. Database Replication Lag
  11. Cache Invalidation / Cache Storm
  12. Failed Deployment / Rollback
  13. Security Incident / Breach Detection
  14. Data Corruption
  15. DNS Resolution Failures

1. Service Outage / High Error Rate

Metadata

  • Severity: SEV-1 (Critical)
  • MTTR Target: < 30 minutes
  • On-Call Team: SRE, Platform Engineering
  • Escalation Path: SRE → Engineering Manager → VP Engineering
  • Last Updated: 2024-11-27
  • Owner: SRE Team

Symptoms

  • High 5xx error rate (>1% of requests)
  • Service returning errors instead of successful responses
  • Health check endpoints failing
  • Customer reports of service unavailability
  • Spike in error monitoring alerts

Detection

Automated Alerts:

Alert: ServiceHighErrorRate
Severity: Critical
Condition: error_rate > 1% for 2 minutes
Dashboard: https://grafana.company.com/service-health

Manual Checks:

# Check service health
curl -i https://api.company.com/health

# Check error rate in last 5 minutes
kubectl logs -l app=api-service --tail=1000 --since=5m | grep ERROR | wc -l

# Check pod status
kubectl get pods -n production -l app=api-service

Triage Steps

Step 1: Establish Incident Context (2 minutes)

# Check current time and impact window
date

# Check error rate trend
# View Grafana dashboard - is error rate increasing or stable?

# Identify scope
# All services or specific service?
# All regions or specific region?
# All users or subset of users?

# Recent changes
# Check recent deployments
kubectl rollout history deployment/api-service -n production | tail -5

# Check recent config changes
kubectl get configmap api-config -n production -o yaml | grep -A 2 "last-applied"

Record in incident doc:

Start Time: [TIMESTAMP]
Error Rate: [X%]
Affected Service: [SERVICE_NAME]
Affected Users: [ALL/SUBSET]
Recent Changes: [YES/NO - DETAILS]

Step 2: Immediate Mitigation (5 minutes)

Option A: Recent Deployment - Rollback

# If deployment in last 30 minutes, rollback immediately
kubectl rollout undo deployment/api-service -n production

# Monitor rollback progress
kubectl rollout status deployment/api-service -n production

# Watch error rate
watch -n 5 'curl -s https://api.company.com/metrics | grep error_rate'

Option B: Scale Up (If Traffic Related)

# Check current replica count
kubectl get deployment api-service -n production

# Scale up by 50%
current_replicas=$(kubectl get deployment api-service -n production -o jsonpath='{.spec.replicas}')
new_replicas=$((current_replicas * 3 / 2))
kubectl scale deployment api-service -n production --replicas=$new_replicas

# Enable HPA if not already
kubectl autoscale deployment api-service -n production --min=10 --max=50 --cpu-percent=70

Option C: Circuit Breaker (If Dependency Down)

# If error logs show dependency timeouts
# Enable circuit breaker via feature flag
curl -X POST https://feature-flags.company.com/api/flags/circuit-breaker-enable \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-d '{"enabled": true, "service": "downstream-api"}'

# Or update config map
kubectl patch configmap api-config -n production \
--type merge -p '{"data":{"CIRCUIT_BREAKER_ENABLED":"true"}}'

# Restart pods to pick up config
kubectl rollout restart deployment/api-service -n production

Step 3: Root Cause Investigation (15 minutes)

Check Logs:

# Recent errors
kubectl logs deployment/api-service -n production --tail=500 --since=10m | grep -i error

# Stack traces
kubectl logs deployment/api-service -n production --tail=1000 | grep -A 10 "Exception"

# All logs from failing pods
failing_pods=$(kubectl get pods -n production -l app=api-service --field-selector=status.phase!=Running -o name)
for pod in $failing_pods; do
echo "=== Logs from $pod ==="
kubectl logs $pod -n production --tail=100
done

Check Metrics:

# CPU usage
kubectl top pods -n production -l app=api-service

# Memory usage
kubectl top pods -n production -l app=api-service --sort-by=memory

# Request rate
curl -s "http://prometheus:9090/api/v1/query?query=rate(http_requests_total{service='api-service'}[5m])"

# Database connections
curl -s "http://prometheus:9090/api/v1/query?query=db_connections_active{service='api-service'}"

Check Dependencies:

# Test database connectivity
kubectl run -i --tty --rm debug --image=postgres:latest --restart=Never -- \
psql -h postgres.production.svc.cluster.local -U appuser -d appdb -c "SELECT 1;"

# Test Redis
kubectl run -i --tty --rm debug --image=redis:latest --restart=Never -- \
redis-cli -h redis.production.svc.cluster.local ping

# Test external API
curl -i -m 5 https://external-api.partner.com/health

Check Network:

# DNS resolution
nslookup api-service.production.svc.cluster.local

# Network policies
kubectl get networkpolicies -n production

# Service endpoints
kubectl get endpoints api-service -n production

Step 4: Resolution Actions

Common Root Causes & Fixes:

A. Database Connection Pool Exhaustion

# Increase pool size (if safe)
kubectl set env deployment/api-service -n production DB_POOL_SIZE=50

# Or restart pods to reset connections
kubectl rollout restart deployment/api-service -n production

B. Memory Leak / OOM

# Increase memory limits temporarily
kubectl set resources deployment api-service -n production \
--limits=memory=4Gi --requests=memory=2Gi

# Enable heap dump on OOM (Java)
kubectl set env deployment/api-service -n production \
JAVA_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heapdump.hprof"

# Restart rolling
kubectl rollout restart deployment/api-service -n production

C. External Dependency Failure

# Enable graceful degradation
# Update feature flag to bypass failing service
curl -X POST https://feature-flags.company.com/api/flags/use-fallback-service \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-d '{"enabled": true}'

# Or enable cached responses
kubectl set env deployment/api-service -n production ENABLE_CACHE_FALLBACK=true

D. Configuration Error

# Revert config change (ConfigMaps do not support "kubectl rollout undo";
# re-apply the last known-good manifest from version control)
kubectl apply -f <last-known-good-api-config.yaml>

# Restart to pick up old config
kubectl rollout restart deployment/api-service -n production

Verification Steps

# 1. Check error rate returned to normal (<0.1%)
curl -s https://api.company.com/metrics | grep error_rate

# 2. Verify all pods healthy
kubectl get pods -n production -l app=api-service | grep -c Running
expected_count=$(kubectl get deployment api-service -n production -o jsonpath='{.spec.replicas}')
echo "Expected: $expected_count"

# 3. Test end-to-end
curl -i -X POST https://api.company.com/v1/test \
-H "Content-Type: application/json" \
-d '{"test": "data"}'

# 4. Check dependent services
curl -i https://api.company.com/health/dependencies

# 5. Monitor for 15 minutes
watch -n 30 'date && curl -s https://api.company.com/metrics | grep -E "error_rate|latency_p99"'

Communication Templates

Initial Announcement (Slack/Status Page):

🚨 INCIDENT: API Service Experiencing High Error Rate

Status: Investigating
Impact: ~40% of API requests failing
Affected: api.company.com
Started: [TIMESTAMP]
Team: Investigating root cause
ETA: 15 minutes for initial mitigation

Updates: Will provide update in 10 minutes
War Room: #incident-2024-1127-001

Update:

📊 UPDATE: API Service Incident

Status: Mitigation Applied
Action: Rolled back deployment v2.3.5
Result: Error rate decreased from 40% to 2%
Next: Monitoring for stability, investigating root cause
ETA: Full resolution in 10 minutes

Resolution:

✅ RESOLVED: API Service Incident

Status: Resolved
Duration: 27 minutes (10:15 AM - 10:42 AM ET)
Root Cause: Database connection pool exhaustion from v2.3.5 config change
Resolution: Rolled back to v2.3.4
Impact: ~2,400 failed requests during incident window
Postmortem: Will be published within 48 hours

Thank you for your patience.

Escalation Criteria

Escalate to Engineering Manager if:

  • MTTR exceeds 30 minutes
  • Impact >50% of users
  • Data loss suspected
  • Security implications identified

Escalate to VP Engineering if:

  • MTTR exceeds 1 hour
  • Major customer impact
  • Media/PR implications
  • Regulatory reporting required

Contact:

Primary On-Call SRE: [Use PagerDuty]
Engineering Manager: [Slack: @eng-manager] [Phone: XXX-XXX-XXXX]
VP Engineering: [Slack: @vp-eng] [Phone: XXX-XXX-XXXX]
Security Team: security@company.com [Slack: #security-incidents]

Post-Incident Actions

Immediate (Same Day):

  • Update incident timeline in documentation
  • Notify all stakeholders of resolution
  • Begin postmortem document
  • Capture all logs, metrics, traces for analysis
  • Take database/system snapshots if relevant

Within 48 Hours:

  • Complete blameless postmortem
  • Identify action items with owners
  • Schedule postmortem review meeting
  • Update runbook with lessons learned

Within 1 Week:

  • Implement quick wins from action items
  • Add monitoring/alerting to prevent recurrence
  • Share learnings with broader team

2. High Latency / Performance Degradation

Metadata

  • Severity: SEV-2 (High)
  • MTTR Target: < 1 hour
  • On-Call Team: SRE, Backend Engineering
  • Last Updated: 2024-11-27
  • Owner: SRE Team

Symptoms

  • P95/P99 latency exceeding SLO
  • User complaints about slow responses
  • Timeouts in dependent services
  • Increased request queue depth
  • Slow database queries

Detection

Automated Alerts:

Alert: HighLatencyP99
Severity: Warning
Condition: p99_latency > 500ms for 5 minutes
SLO: p99 < 200ms
Dashboard: https://grafana.company.com/latency

Triage Steps

Step 1: Quantify Impact (2 minutes)

# Check current latency
curl -s https://api.company.com/metrics | grep -E "latency_p50|latency_p95|latency_p99"

# Get latency percentiles from Prometheus
curl -s "http://prometheus:9090/api/v1/query?query=histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{service='api'}[5m]))"

# Affected endpoints
kubectl logs deployment/api-service -n production --tail=1000 | \
awk '{print $7}' | sort | uniq -c | sort -rn | head -10

Document:

Current P99: [XXX ms] (SLO: 200ms)
Current P95: [XXX ms]
Affected Endpoints: [LIST]
User Reports: [NUMBER]

Step 2: Identify Bottleneck (10 minutes)

Check Application Performance:

# CPU throttling
kubectl top pods -n production -l app=api-service

# Check for CPU throttling
kubectl describe pods -n production -l app=api-service | grep -A 5 "cpu"

# Memory pressure
kubectl top pods -n production -l app=api-service --sort-by=memory

# Thread dumps (Java applications)
kubectl exec -it deployment/api-service -n production -- jstack 1 > thread-dump.txt

# Profile CPU (if profiling enabled)
kubectl exec -it deployment/api-service -n production -- \
curl http://localhost:6060/debug/pprof/profile?seconds=30 > cpu-profile.out

Check Database:

# Active queries
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 seconds'
ORDER BY duration DESC;
"

# Slow query log
kubectl exec -it postgres-0 -n production -- tail -100 /var/log/postgresql/slow-query.log

# Database connections
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT count(*) as connection_count, state
FROM pg_stat_activity
GROUP BY state;
"

# Lock waits
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT blocked_locks.pid AS blocked_pid,
blocking_locks.pid AS blocking_pid,
blocked_activity.query AS blocked_query,
blocking_activity.query AS blocking_query
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
"

Check Cache:

# Redis latency
kubectl exec -it redis-0 -n production -- redis-cli --latency-history

# Cache stats
kubectl exec -it redis-0 -n production -- redis-cli INFO stats | grep -E "hit|miss"

# Memory usage
kubectl exec -it redis-0 -n production -- redis-cli INFO memory | grep used_memory_human

# Slow log
kubectl exec -it redis-0 -n production -- redis-cli SLOWLOG GET 10

Check Network:

# Network latency to dependencies
for service in postgres redis external-api; do
echo "=== $service ==="
kubectl run ping-test --image=busybox --rm -it --restart=Never -- \
ping -c 5 $service.production.svc.cluster.local
done

# DNS lookup times
kubectl run dns-test --image=busybox --rm -it --restart=Never -- \
nslookup api.company.com

# External API latency
time curl -X GET https://external-api.partner.com/v1/data

Check Distributed Traces:

# Identify slow spans in Jaeger
# Navigate to Jaeger UI: https://jaeger.company.com
# Filter by:
# - Service: api-service
# - Min Duration: 500ms
# - Lookback: 1 hour

# Programmatic trace query
curl "http://jaeger-query:16686/api/traces?service=api-service&limit=20&lookback=1h&minDuration=500ms"

Step 3: Apply Mitigation

Scenario A: Database Slow Queries

# Kill long-running queries (if safe)
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '60 seconds'
AND state = 'active'
AND pid <> pg_backend_pid();
"

# Add missing index (if identified)
kubectl exec -it postgres-0 -n production -- psql -U postgres -d appdb -c "
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);
"

# Analyze tables (update statistics)
kubectl exec -it postgres-0 -n production -- psql -U postgres -d appdb -c "
ANALYZE VERBOSE users;
"

# Scale read replicas
kubectl scale statefulset postgres-replica -n production --replicas=5

Scenario B: Cache Miss Storm

# Pre-warm cache with common queries
kubectl exec -it deployment/api-service -n production -- \
curl -X POST http://localhost:8080/admin/cache/warmup

# Increase cache size
kubectl exec -it redis-0 -n production -- redis-cli CONFIG SET maxmemory 4gb

# Enable cache fallback to stale data
kubectl set env deployment/api-service -n production CACHE_SERVE_STALE=true

Scenario C: CPU/Memory Constrained

# Increase resources
kubectl set resources deployment api-service -n production \
--limits=cpu=2000m,memory=4Gi \
--requests=cpu=1000m,memory=2Gi

# Scale horizontally
kubectl scale deployment api-service -n production --replicas=20

# Enable HPA
kubectl autoscale deployment api-service -n production \
--min=10 --max=30 --cpu-percent=60

Scenario D: External API Slow

# Increase timeout and enable caching
kubectl set env deployment/api-service -n production \
EXTERNAL_API_TIMEOUT=10000 \
EXTERNAL_API_CACHE_ENABLED=true \
EXTERNAL_API_CACHE_TTL=300

# Enable circuit breaker
kubectl set env deployment/api-service -n production \
CIRCUIT_BREAKER_ENABLED=true \
CIRCUIT_BREAKER_THRESHOLD=50

# Use fallback/cached data
kubectl patch configmap api-config -n production \
--type merge -p '{"data":{"USE_FALLBACK_DATA":"true"}}'

Scenario E: Thread Pool Exhaustion

# Increase thread pool size
kubectl set env deployment/api-service -n production \
THREAD_POOL_SIZE=200 \
THREAD_QUEUE_SIZE=1000

# Restart to apply
kubectl rollout restart deployment/api-service -n production

Verification Steps

# 1. Monitor latency improvement
watch -n 10 'curl -s https://api.company.com/metrics | grep latency_p99'

# 2. Check trace samples
# View Jaeger for recent requests - should show improved latency

# 3. Database query times
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
"

# 4. Resource utilization normalized
kubectl top pods -n production -l app=api-service

# 5. Error rate stable (ensure fix didn't introduce errors)
curl -s https://api.company.com/metrics | grep error_rate

Root Cause Investigation

Common Causes:

  1. N+1 Query Problem

    • Check ORM query patterns
    • Enable query logging
    • Add eager loading (see the sketch after this list)
  2. Missing Database Index

    • Analyze slow query log
    • Use EXPLAIN ANALYZE
    • Create appropriate indexes
  3. Memory Garbage Collection

    • Check GC logs (Java/JVM)
    • Tune GC parameters
    • Increase heap size
  4. Inefficient Algorithm

    • Profile code execution
    • Identify hot paths
    • Optimize algorithms
  5. External Service Degradation

    • Check dependency SLOs
    • Implement caching
    • Add circuit breakers
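
For the N+1 query problem above, a minimal sketch of the anti-pattern and the batched ("eager") fix, assuming a hypothetical db helper that runs parameterized SQL and returns rows as dicts; table and column names are illustrative:

# Hypothetical db helper: db.query(sql, params) returns a list of dict rows.

def get_orders_with_users_n_plus_one(db):
    # Anti-pattern: 1 query for the orders, then 1 extra query per order.
    orders = db.query("SELECT id, user_id, total FROM orders LIMIT 100", ())
    for order in orders:
        order["user"] = db.query(
            "SELECT id, email FROM users WHERE id = %s", (order["user_id"],)
        )[0]
    return orders

def get_orders_with_users_batched(db):
    # Fix: fetch all referenced users in a single query (the "eager load").
    orders = db.query("SELECT id, user_id, total FROM orders LIMIT 100", ())
    if not orders:
        return orders
    user_ids = tuple({o["user_id"] for o in orders})
    users = db.query("SELECT id, email FROM users WHERE id IN %s", (user_ids,))
    users_by_id = {u["id"]: u for u in users}
    for order in orders:
        order["user"] = users_by_id.get(order["user_id"])
    return orders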

Communication Templates

Initial Alert:

⚠️  INCIDENT: API Latency Degradation

Status: Investigating
Impact: P99 latency at 800ms (SLO: 200ms)
Affected: All API endpoints
User Impact: Slow response times
Team: Investigating root cause
Updates: Every 15 minutes in #incident-channel

Resolution:

✅ RESOLVED: API Latency Degradation

Duration: 45 minutes
Root Cause: Missing database index on users.email causing table scans
Resolution: Added index, latency returned to normal
Current P99: 180ms (within SLO)
Postmortem: Will be published within 48 hours

Post-Incident Actions

  • Add database query monitoring
  • Implement automated index recommendations
  • Load test with realistic data volumes
  • Add latency SLO alerts per endpoint
  • Review and optimize slow queries
  • Implement APM (Application Performance Monitoring)

3. Database Connection Pool Exhaustion

Metadata

  • Severity: SEV-1 (Critical)
  • MTTR Target: < 15 minutes
  • On-Call Team: SRE, Database Team
  • Last Updated: 2024-11-27

Symptoms

  • "Connection pool exhausted" errors in application logs
  • Requests timing out
  • Database showing many idle connections
  • Application unable to acquire new connections
  • Connection pool at 100% utilization

Detection

# Check connection pool metrics
curl -s https://api.company.com/metrics | grep db_pool

# Expected output:
# db_pool_active 50
# db_pool_idle 0
# db_pool_total 50
# db_pool_wait_count 1500 <-- High wait count indicates problem

Triage Steps

Step 1: Confirm Pool Exhaustion (1 minute)

# Application side
kubectl logs deployment/api-service -n production --tail=100 | \
grep -i "connection pool\|unable to acquire\|timeout"

# Database side - check connection count
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT count(*) as total_connections,
state,
application_name
FROM pg_stat_activity
WHERE datname = 'appdb'
GROUP BY state, application_name
ORDER BY total_connections DESC;
"

# Check pool configuration
kubectl get configmap api-config -n production -o yaml | grep -i pool

Step 2: Immediate Mitigation (5 minutes)

Option A: Restart Application Pods (Fastest)

# Rolling restart to reset connections
kubectl rollout restart deployment/api-service -n production

# Monitor restart
kubectl rollout status deployment/api-service -n production

# Verify connections released
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT count(*) FROM pg_stat_activity WHERE datname = 'appdb';
"

Option B: Increase Pool Size (If Infrastructure Allows)

# Check database connection limit
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SHOW max_connections;
"

# Calculate safe pool size
# max_connections / number_of_app_instances = pool_size_per_instance
# Example: 200 max / 10 instances = 20 per instance (current might be 50)

# Increase pool size temporarily
kubectl set env deployment/api-service -n production \
DB_POOL_SIZE=30 \
DB_POOL_MAX_IDLE=10 \
DB_CONNECTION_TIMEOUT=30000

# Monitor
watch -n 5 'kubectl logs deployment/api-service -n production --tail=50 | grep -i pool'

Option C: Kill Idle Connections (If Many Idle)

# Identify idle connections
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT pid, state, query_start, state_change, query
FROM pg_stat_activity
WHERE datname = 'appdb'
AND state = 'idle'
AND state_change < now() - interval '5 minutes';
"

# Kill long-idle connections
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'appdb'
AND state = 'idle'
AND state_change < now() - interval '10 minutes'
AND pid <> pg_backend_pid();
"

Step 3: Root Cause Analysis (10 minutes)

Check for Connection Leaks:

# Application logs - look for unclosed connections
kubectl logs deployment/api-service -n production --tail=5000 | \
grep -i "connection not closed\|resource leak"

# Check connection lifecycle
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT application_name,
state,
count(*) as conn_count,
max(now() - state_change) as max_idle_time
FROM pg_stat_activity
WHERE datname = 'appdb'
GROUP BY application_name, state
ORDER BY conn_count DESC;
"

# Long-running transactions
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT pid, now() - xact_start AS duration, state, query
FROM pg_stat_activity
WHERE datname = 'appdb'
AND xact_start IS NOT NULL
ORDER BY duration DESC
LIMIT 20;
"

Check Recent Changes:

# Recent deployments
kubectl rollout history deployment/api-service -n production | tail -5

# Config changes
kubectl get configmap api-config -n production -o yaml | \
grep -A 2 "last-applied-configuration"

# Recent code changes affecting database access
git log --since="24 hours ago" --grep="database\|pool\|connection" --oneline

Check for Traffic Spike:

# Request rate
curl "http://prometheus:9090/api/v1/query?query=rate(http_requests_total{service='api-service'}[5m])"

# Compare to baseline
curl "http://prometheus:9090/api/v1/query?query=rate(http_requests_total{service='api-service'}[5m] offset 1h)"

Resolution Actions

Permanent Fix Options:

A. Fix Connection Leak in Code

# Bad - connection leak
def get_user(user_id):
    conn = db_pool.getconn()
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
    result = cursor.fetchone()
    return result  # Connection never returned!

# Good - using context manager
def get_user(user_id):
    with db_pool.getconn() as conn:
        with conn.cursor() as cursor:
            cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
            return cursor.fetchone()
    # Connection automatically returned to pool

B. Optimize Pool Configuration

# Configure based on actual usage patterns
kubectl set env deployment/api-service -n production \
DB_POOL_SIZE=20 \
DB_POOL_MIN_IDLE=5 \
DB_POOL_MAX_IDLE=10 \
DB_POOL_IDLE_TIMEOUT=300000 \
DB_POOL_CONNECTION_TIMEOUT=30000 \
DB_POOL_VALIDATION_TIMEOUT=5000

# Enable connection validation
kubectl set env deployment/api-service -n production \
DB_POOL_TEST_ON_BORROW=true \
DB_POOL_TEST_WHILE_IDLE=true

C. Implement Connection Pooler (PgBouncer)

# Deploy PgBouncer
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pgbouncer
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: pgbouncer
  template:
    metadata:
      labels:
        app: pgbouncer
    spec:
      containers:
      - name: pgbouncer
        image: pgbouncer/pgbouncer:latest
        env:
        - name: DATABASES_HOST
          value: postgres.production.svc.cluster.local
        - name: POOL_MODE
          value: transaction
        - name: MAX_CLIENT_CONN
          value: "1000"
        - name: DEFAULT_POOL_SIZE
          value: "25"
---
apiVersion: v1
kind: Service
metadata:
  name: pgbouncer
  namespace: production
spec:
  selector:
    app: pgbouncer
  ports:
  - port: 5432
    targetPort: 5432
EOF

# Update application to use PgBouncer
kubectl set env deployment/api-service -n production \
DB_HOST=pgbouncer.production.svc.cluster.local

D. Scale Database Connections

# Increase PostgreSQL max_connections
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
ALTER SYSTEM SET max_connections = 300;
SELECT pg_reload_conf();
"

# Note: max_connections only takes effect after a full PostgreSQL restart;
# pg_reload_conf() alone is not enough for this parameter
kubectl rollout restart statefulset/postgres -n production

Monitoring & Alerting

Add Proactive Monitoring:

# Prometheus alert rule
groups:
- name: database_pool
  rules:
  - alert: ConnectionPoolHighUtilization
    expr: db_pool_active / db_pool_total > 0.7
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Connection pool utilization >70%"
      description: "Pool at {{ $value }}% capacity"

  - alert: ConnectionPoolExhausted
    expr: db_pool_active / db_pool_total > 0.9
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Connection pool nearly exhausted"

  - alert: ConnectionPoolWaitTime
    expr: rate(db_pool_wait_count[5m]) > 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High connection pool wait count"

Dashboard Metrics:

- db_pool_total (configured pool size)
- db_pool_active (connections in use)
- db_pool_idle (connections available)
- db_pool_wait_count (requests waiting for connection)
- db_pool_wait_time_ms (time waiting for connection)
- db_connection_lifetime_seconds (connection age histogram)

Verification Steps

# 1. Pool utilization back to normal
kubectl exec -it deployment/api-service -n production -- \
curl http://localhost:8080/metrics | grep db_pool_active

# Should see: db_pool_active < 70% of db_pool_total

# 2. No wait queue
kubectl exec -it deployment/api-service -n production -- \
curl http://localhost:8080/metrics | grep db_pool_wait_count

# Should see: db_pool_wait_count = 0 or minimal

# 3. Database connection count stable
kubectl exec -it postgres-0 -n production -- psql -U postgres -c "
SELECT count(*) FROM pg_stat_activity WHERE datname = 'appdb';
"

# 4. No errors in logs
kubectl logs deployment/api-service -n production --tail=100 | \
grep -i "connection" | grep -i error

# 5. Response times normal
curl -s https://api.company.com/metrics | grep latency_p99

Prevention

Code Review Checklist:

  • All database connections properly closed
  • Using connection pool best practices
  • Proper error handling to ensure connection release
  • No connection usage outside transactions
  • Connection timeout configured

Testing:

  • Load test with connection pool monitoring
  • Chaos engineering: Test with limited connections
  • Connection leak detection in CI/CD

Architecture:

  • Consider connection pooler (PgBouncer)
  • Implement read replicas to distribute load
  • Use caching to reduce database queries

4. Memory Leak / OOM Kills

Metadata

  • Severity: SEV-2 (High)
  • MTTR Target: < 1 hour
  • On-Call Team: SRE, Backend Engineering
  • Last Updated: 2024-11-27

Symptoms

  • Pods being OOMKilled (Out of Memory)
  • Memory usage continuously increasing
  • Slow performance and increased GC pressure
  • Pod restarts without clear error
  • "Cannot allocate memory" errors

Detection

# Check for OOMKilled pods
kubectl get pods -n production -l app=api-service | grep OOMKilled

# Check pod events
kubectl get events -n production --field-selector involvedObject.name=api-service-xxx | \
grep -i oom

# Memory usage trend
kubectl top pods -n production -l app=api-service --sort-by=memory

Triage Steps

Step 1: Confirm OOM Issue (2 minutes)

# Check pod status and restart reason
kubectl describe pod <pod-name> -n production | grep -A 10 "Last State"

# Should see output like:
# Last State: Terminated
# Reason: OOMKilled
# Exit Code: 137

# Check memory limits
kubectl get pod <pod-name> -n production -o jsonpath='{.spec.containers[*].resources}'

# Monitor memory usage
watch -n 5 'kubectl top pods -n production -l app=api-service'

Step 2: Immediate Mitigation (10 minutes)

Option A: Increase Memory Limits (Quick Fix)

# Current limits
kubectl get deployment api-service -n production -o jsonpath='{.spec.template.spec.containers[0].resources}'

# Increase memory limit temporarily (2x current)
kubectl set resources deployment api-service -n production \
--limits=memory=4Gi \
--requests=memory=2Gi

# Monitor rollout
kubectl rollout status deployment/api-service -n production

# Watch memory usage
watch -n 10 'kubectl top pods -n production -l app=api-service'

Option B: Scale Out (If Memory Leak is Gradual)

# Add more pods to distribute load
kubectl scale deployment api-service -n production --replicas=15

# Enable HPA with lower memory target
kubectl autoscale deployment api-service -n production \
--min=10 --max=30 --cpu-percent=70

# Note: This is temporary - still need to fix leak

Option C: Implement Pod Lifecycle (Workaround)

# Restart pods proactively before they OOM
# Add to deployment spec:
kubectl patch deployment api-service -n production --type=json -p='[
{
"op": "add",
"path": "/spec/template/spec/containers/0/lifecycle",
"value": {
"preStop": {
"exec": {
"command": ["/bin/sh", "-c", "sleep 15"]
}
}
}
}
]'

# Add TTL to pod lifecycle (requires external controller)
# Or implement rolling restart every 12 hours

Step 3: Capture Diagnostics (15 minutes)

Capture Heap Dump (Java/JVM):

# Before pod is killed, capture heap dump
kubectl exec -it <pod-name> -n production -- \
jcmd 1 GC.heap_dump /tmp/heapdump.hprof

# Copy heap dump locally
kubectl cp production/<pod-name>:/tmp/heapdump.hprof ./heapdump-$(date +%Y%m%d-%H%M%S).hprof

# Analyze with MAT or jhat
# Upload to analysis tools or analyze locally

Capture Memory Profile (Go Applications):

# If profiling endpoint enabled, write the heap profile inside the pod
kubectl exec -it <pod-name> -n production -- \
curl -s http://localhost:6060/debug/pprof/heap -o /tmp/heap-profile.prof

# Copy locally
kubectl cp production/<pod-name>:/tmp/heap-profile.prof ./heap-profile.prof

# Analyze
go tool pprof -http=:8080 heap-profile.prof

Capture Memory Metrics (Python):

# Install memory_profiler if not already
kubectl exec -it <pod-name> -n production -- pip install memory_profiler

# Profile specific function
kubectl exec -it <pod-name> -n production -- \
python -m memory_profiler app.py

# Or run a quick tracemalloc session inside the container
# (note: this starts a fresh interpreter rather than attaching to the running
# server process; for the live process, expose tracemalloc via an admin endpoint)
kubectl exec -it <pod-name> -n production -- python3 <<'EOF'
import tracemalloc

tracemalloc.start()
# ... exercise the code paths you want to measure here ...
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')
for stat in top_stats[:10]:
    print(stat)
EOF

Check for Common Memory Issues:

# Large objects in memory
kubectl exec -it <pod-name> -n production -- \
curl http://localhost:8080/debug/vars | jq '.memstats'

# Check goroutine leaks (Go)
kubectl exec -it <pod-name> -n production -- \
curl http://localhost:6060/debug/pprof/goroutine?debug=1

# Check thread count (Java)
kubectl exec -it <pod-name> -n production -- \
jcmd 1 Thread.print | grep "Thread" | wc -l

# File descriptor leaks
kubectl exec -it <pod-name> -n production -- \
ls -la /proc/1/fd | wc -l

Step 4: Identify Root Cause

Common Memory Leak Causes:

A. Caching Without Eviction:

# Bad - unbounded cache
cache = {}

def get_user(user_id):
    if user_id not in cache:
        cache[user_id] = fetch_from_db(user_id)  # Cache grows forever!
    return cache[user_id]

# Good - bounded cache with LRU
from functools import lru_cache

@lru_cache(maxsize=1000)
def get_user(user_id):
    return fetch_from_db(user_id)

B. Event Listeners Not Removed:

// Bad - event listener leak
class Component {
  constructor() {
    this.data = new Array(1000000);
    window.addEventListener("resize", () => this.handleResize());
  }
  // Missing cleanup!
}

// Good - cleanup listeners
class Component {
  constructor() {
    this.data = new Array(1000000);
    this.handleResize = this.handleResize.bind(this);
    window.addEventListener("resize", this.handleResize);
  }

  destroy() {
    window.removeEventListener("resize", this.handleResize);
    this.data = null;
  }
}

C. Goroutine Leaks (Go):

// Bad - goroutine leak
func processRequests() {
    for request := range requests {
        go handleRequest(request) // Goroutines never cleaned up
    }
}

// Good - bounded goroutines
func processRequests() {
    sem := make(chan struct{}, 100) // Max 100 concurrent
    for request := range requests {
        sem <- struct{}{}
        go func(req Request) {
            defer func() { <-sem }()
            handleRequest(req)
        }(request)
    }
}

D. Database Result Sets Not Closed:

// Bad - result set leak
public List<User> getUsers() {
    Statement stmt = conn.createStatement();
    ResultSet rs = stmt.executeQuery("SELECT * FROM users");
    List<User> users = new ArrayList<>();
    while (rs.next()) {
        users.add(new User(rs));
    }
    return users; // ResultSet and Statement never closed!
}

// Good - use try-with-resources
public List<User> getUsers() {
    List<User> users = new ArrayList<>();
    try (Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT * FROM users")) {
        while (rs.next()) {
            users.add(new User(rs));
        }
    }
    return users;
}

Resolution Actions

Short-term:

# 1. Increase memory limits (already done)

# 2. Enable memory monitoring
kubectl patch deployment api-service -n production --type=json -p='[
{
"op": "add",
"path": "/spec/template/metadata/annotations/prometheus.io~1scrape",
"value": "true"
}
]'

# 3. Add liveness probe with memory check
kubectl patch deployment api-service -n production --type=json -p='[
{
"op": "add",
"path": "/spec/template/spec/containers/0/livenessProbe",
"value": {
"httpGet": {
"path": "/health",
"port": 8080
},
"initialDelaySeconds": 30,
"periodSeconds": 10
}
}
]'

# 4. Implement automatic restart before OOM
# Create CronJob to restart pods every 12 hours (temporary)
kubectl create -f - <<EOF
apiVersion: batch/v1
kind: CronJob
metadata:
  name: api-service-restart
  namespace: production
spec:
  schedule: "0 */12 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: pod-restarter
          containers:
          - name: kubectl
            image: bitnami/kubectl:latest
            command:
            - /bin/sh
            - -c
            - kubectl rollout restart deployment/api-service -n production
          restartPolicy: OnFailure
EOF

Long-term Fix:

# 1. Fix code leak (deploy patch)
git checkout -b fix/memory-leak
# ... implement fix ...
git commit -m "Fix: Remove unbounded cache causing memory leak"
git push origin fix/memory-leak

# 2. Add memory profiling in production
kubectl set env deployment/api-service -n production \
ENABLE_PROFILING=true \
PROFILING_PORT=6060

# 3. Implement memory limits in code
# For Java:
kubectl set env deployment/api-service -n production \
JAVA_OPTS="-Xmx2g -XX:MaxMetaspaceSize=256m -XX:+UseG1GC"

# 4. Add memory monitoring dashboard

# 5. Implement alerts for memory growth

Monitoring & Prevention

Add Alerts:

groups:
- name: memory_alerts
  rules:
  - alert: MemoryUsageHigh
    expr: container_memory_usage_bytes{pod=~"api-service.*"} / container_spec_memory_limit_bytes{pod=~"api-service.*"} > 0.8
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Pod memory usage >80%"

  - alert: MemoryUsageGrowing
    expr: predict_linear(container_memory_usage_bytes{pod=~"api-service.*"}[1h], 3600) > container_spec_memory_limit_bytes{pod=~"api-service.*"}
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Memory usage trending towards OOM"

  - alert: OOMKillsDetected
    expr: increase(kube_pod_container_status_restarts_total{pod=~"api-service.*"}[15m]) > 3
    labels:
      severity: critical
    annotations:
      summary: "Multiple pod restarts detected (possible OOM)"

Grafana Dashboard:

- Memory Usage (%)
- Memory Usage (bytes) over time
- Predicted time to OOM
- GC frequency and duration
- Heap size vs used heap
- Number of objects in memory
- Pod restart count

Verification Steps

# 1. Memory usage stable
watch -n 30 'kubectl top pods -n production -l app=api-service | tail -5'

# 2. No OOM kills in last hour
kubectl get events -n production --field-selector reason=OOMKilling --sort-by=.lastTimestamp
# (confirm the AGE column shows no events newer than one hour)

# 3. Pod uptime increasing (not restarting)
kubectl get pods -n production -l app=api-service -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.startTime}{"\n"}{end}'

# 4. Memory growth linear/flat (not exponential)
# Check Grafana memory usage graph

# 5. Application metrics healthy
curl -s https://api.company.com/metrics | grep -E "heap|gc|memory"

Post-Incident Actions

  • Analyze heap dump to identify leak source
  • Review code for common leak patterns
  • Add memory profiling to CI/CD
  • Implement memory budgets in code
  • Add integration tests for memory leaks
  • Document memory configuration guidelines
  • Train team on memory leak prevention

5. Disk Space Exhaustion

Metadata

  • Severity: SEV-2 (High), escalates to SEV-1 if database affected
  • MTTR Target: < 30 minutes
  • On-Call Team: SRE, Infrastructure
  • Last Updated: 2024-11-27

Symptoms

  • "No space left on device" errors
  • Applications unable to write logs
  • Database unable to write data
  • Pod evictions due to disk pressure
  • Slow I/O performance

Detection

# Check disk usage on nodes
kubectl get nodes -o custom-columns=NAME:.metadata.name,DISK:.status.allocatable.ephemeral-storage

# Check specific node
kubectl describe node <node-name> | grep -A 5 "Allocated resources"

# SSH to node and check
ssh <node> df -h

# Check for disk pressure
kubectl get nodes -o json | jq '.items[] | select(.status.conditions[] | select(.type=="DiskPressure" and .status=="True")) | .metadata.name'

Triage Steps

Step 1: Identify Affected Systems (2 minutes)

# Which nodes are affected?
for node in $(kubectl get nodes -o name); do
echo "=== $node ==="
kubectl describe $node | grep -E "DiskPressure|ephemeral-storage"
done

# Which pods are on affected nodes?
kubectl get pods -n production -o wide | grep <affected-node>

# Critical services affected?
kubectl get pods -n production -l tier=critical -o wide

Step 2: Immediate Mitigation (10 minutes)

Option A: Clean Up Logs

# SSH to affected node
ssh <node-name>

# Find large log files
sudo find /var/log -type f -size +100M -exec ls -lh {} \; | awk '{ print $9 ": " $5 }'

# Rotate/truncate large logs
sudo truncate -s 0 /var/log/containers/*.log
sudo journalctl --vacuum-size=500M

# Clean Docker logs (if applicable)
sudo sh -c "truncate -s 0 /var/lib/docker/containers/*/*-json.log"

# Kubernetes log cleanup
sudo find /var/log/pods -name "*.log" -mtime +7 -delete

Option B: Remove Unused Docker Images

# On affected node
ssh <node-name>

# List images sorted by size
sudo docker images --format "{{.Repository}}:{{.Tag}}\t{{.Size}}" | sort -k2 -h

# Remove unused images
sudo docker image prune -a --filter "until=72h" -f

# Remove dangling volumes
sudo docker volume prune -f

Option C: Clean Up Pod Ephemeral Storage

# Find pods using most disk
kubectl get pods -n production -o json | \
jq -r '.items[] | "\(.metadata.name) \(.spec.nodeName)"' | \
while read pod node; do
if [ "$node" = "<affected-node>" ]; then
echo "=== $pod ==="
kubectl exec $pod -n production -- du -sh /tmp /var/tmp 2>/dev/null || true
fi
done

# Clean up specific pod
kubectl exec <pod-name> -n production -- sh -c "rm -rf /tmp/*"

Option D: Cordon and Drain Node

# Prevent new pods from scheduling
kubectl cordon <node-name>

# Drain pods to other nodes
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --grace-period=60

# Clean up on node
ssh <node-name>
sudo docker system prune -a -f --volumes
sudo rm -rf /var/log/pods/*
sudo rm -rf /var/lib/kubelet/pods/*

# Uncordon when ready
kubectl uncordon <node-name>

Option E: Emergency Database Cleanup (If DB Affected)

# Connect to database pod
kubectl exec -it postgres-0 -n production -- psql -U postgres

# Check database sizes
SELECT pg_database.datname,
pg_size_pretty(pg_database_size(pg_database.datname)) AS size
FROM pg_database
ORDER BY pg_database_size(pg_database.datname) DESC;

# Check table sizes
SELECT schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 20;

# Archive old data (if safe)
# Example: Archive logs older than 90 days
BEGIN;
COPY (SELECT * FROM audit_logs WHERE created_at < NOW() - INTERVAL '90 days')
TO PROGRAM 'gzip > /tmp/audit_logs_archive_$(date +%Y%m%d).csv.gz' WITH CSV HEADER;
DELETE FROM audit_logs WHERE created_at < NOW() - INTERVAL '90 days';
COMMIT;

# Vacuum to reclaim space
VACUUM FULL audit_logs;

Step 3: Root Cause Analysis

Find What's Consuming Space:

# On affected node
ssh <node-name>

# Find largest directories
sudo du -h --max-depth=3 / 2>/dev/null | sort -hr | head -20

# Find largest files
sudo find / -type f -size +500M -exec ls -lh {} \; 2>/dev/null | awk '{ print $9 ": " $5 }'

# Check specific directories
sudo du -sh /var/lib/docker/*
sudo du -sh /var/lib/kubelet/*
sudo du -sh /var/log/*

# Find recently created large files
sudo find / -type f -size +100M -mtime -1 -exec ls -lh {} \; 2>/dev/null

Common Culprits:

  1. Excessive Logging
# Check log volume
kubectl exec <pod-name> -n production -- du -sh /var/log

# Check logging rate
kubectl logs <pod-name> -n production --tail=100 --timestamps | \
awk '{print $1}' | sort | uniq -c
  2. Temp File Accumulation
# Check temp directories
kubectl exec <pod-name> -n production -- du -sh /tmp /var/tmp

# Find old temp files
kubectl exec <pod-name> -n production -- find /tmp -type f -mtime +7 -ls
  3. Database Growth
# PostgreSQL WAL files
kubectl exec postgres-0 -n production -- \
du -sh /var/lib/postgresql/data/pg_wal/

# MySQL binary logs
kubectl exec mysql-0 -n production -- \
du -sh /var/lib/mysql/binlog/
  4. Image/Container Buildup
# Unused containers
sudo docker ps -a --filter "status=exited" --filter "status=dead"

# Image layer cache
sudo du -sh /var/lib/docker/overlay2/

Resolution Actions

Short-term:

# 1. Implement log rotation
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: filebeat-config
  namespace: production
data:
  filebeat.yml: |
    filebeat.inputs:
    - type: container
      paths:
        - /var/log/containers/*.log
      processors:
        - add_kubernetes_metadata:
            host: \${NODE_NAME}
            matchers:
              - logs_path:
                  logs_path: "/var/log/containers/"
    output.logstash:
      hosts: ["logstash:5044"]
    # Local log cleanup
    queue.mem:
      events: 4096
      flush.min_events: 512
      flush.timeout: 5s
EOF

# 2. Set up log shipping
kubectl apply -f https://raw.githubusercontent.com/elastic/beats/master/deploy/kubernetes/filebeat-kubernetes.yaml

# 3. Configure log rotation on nodes
# Add to node configuration or DaemonSet
cat <<'EOF' | sudo tee /etc/logrotate.d/containers
/var/log/containers/*.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    create 0644 root root
    postrotate
        /usr/bin/docker ps -a --format '{{.Names}}' | xargs -I {} docker kill -s HUP {} 2>/dev/null || true
    endscript
}
EOF

Long-term:

# 1. Set ephemeral-storage limits on pods
kubectl patch deployment api-service -n production --type=json -p='[
{
"op": "add",
"path": "/spec/template/spec/containers/0/resources/limits/ephemeral-storage",
"value": "2Gi"
},
{
"op": "add",
"path": "/spec/template/spec/containers/0/resources/requests/ephemeral-storage",
"value": "1Gi"
}
]'

# 2. Enable disk usage monitoring
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: node-exporter-config
  namespace: monitoring
data:
  entrypoint.sh: |
    #!/bin/sh
    exec /bin/node_exporter \
      --collector.filesystem \
      --collector.filesystem.mount-points-exclude="^/(dev|proc|sys|var/lib/docker/.+)(\$|/)" \
      --web.listen-address=:9100
EOF

# 3. Set up automated cleanup CronJob
kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: CronJob
metadata:
  name: disk-cleanup
  namespace: kube-system
spec:
  schedule: "0 2 * * *" # 2 AM daily
  jobTemplate:
    spec:
      template:
        spec:
          hostPID: true
          hostNetwork: true
          containers:
          - name: cleanup
            image: alpine:latest
            command:
            - /bin/sh
            - -c
            - |
              # Clean old logs
              find /host/var/log/pods -name "*.log" -mtime +7 -delete
              # Clean old containers
              nsenter --mount=/proc/1/ns/mnt -- docker system prune -af --filter "until=72h"
            securityContext:
              privileged: true
            volumeMounts:
            - name: host
              mountPath: /host
          volumes:
          - name: host
            hostPath:
              path: /
          restartPolicy: OnFailure
EOF

# 4. Implement disk alerts
kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: disk-space-alerts
  namespace: monitoring
spec:
  groups:
  - name: disk
    rules:
    - alert: NodeDiskSpaceHigh
      expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.2
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Node disk space <20%"

    - alert: NodeDiskSpaceCritical
      expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.1
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "Node disk space <10%"
EOF

Monitoring & Prevention

Metrics to Track:

- node_filesystem_avail_bytes (available disk space)
- node_filesystem_size_bytes (total disk space)
- container_fs_usage_bytes (container filesystem usage)
- kubelet_volume_stats_used_bytes (PV usage)
- log_file_size (application log sizes)

Dashboards:

- Node disk usage per mount point
- Pod ephemeral storage usage
- PV usage trends
- Log growth rate
- Image/container count over time

Verification Steps

# 1. Disk space recovered
ssh <node-name> df -h /

# 2. No disk pressure
kubectl describe node <node-name> | grep DiskPressure

# 3. Pods stable
kubectl get pods -n production -o wide | grep <node-name>

# 4. Services healthy
curl -i https://api.company.com/health

# 5. No pod evictions
kubectl get events --field-selector reason=Evicted -n production

Post-Incident Actions

  • Analyze what caused disk fill
  • Implement proper log management strategy
  • Set ephemeral-storage limits on all pods
  • Configure automated cleanup
  • Add capacity planning for storage
  • Review and optimize logging verbosity
  • Document disk space requirements

6. Certificate Expiration

Metadata

  • Severity: SEV-1 (Critical) if expired, SEV-3 (Low) if approaching
  • MTTR Target: < 15 minutes for renewal, 0 minutes for prevention
  • On-Call Team: SRE, Security
  • Last Updated: 2024-11-27

Symptoms

  • "SSL certificate has expired" errors
  • Browsers showing security warnings
  • API clients unable to connect
  • Services failing TLS handshake
  • Certificate validation errors in logs

Detection

# Check certificate expiration
echo | openssl s_client -servername api.company.com -connect api.company.com:443 2>/dev/null | \
openssl x509 -noout -dates

# Check all Kubernetes TLS secrets
kubectl get secrets -A -o json | \
jq -r '.items[] | select(.type=="kubernetes.io/tls") | "\(.metadata.namespace)/\(.metadata.name)"' | \
while read secret; do
namespace=$(echo $secret | cut -d/ -f1)
name=$(echo $secret | cut -d/ -f2)
expiry=$(kubectl get secret $name -n $namespace -o jsonpath='{.data.tls\.crt}' | \
base64 -d | openssl x509 -noout -enddate 2>/dev/null | cut -d= -f2)
echo "$secret: $expiry"
done

# Check certificate in

Highly Available System Design

· 16 min read
Femi Adigun
Senior Software Engineer & Coach

Executive Summary

This document outlines the architecture for a globally distributed, highly available ordering platform designed to serve millions of users with 99.99% uptime, fault tolerance, and resilience.

Key Metrics:

  • Target Availability: 99.99% (52 minutes downtime/year)
  • Global Users: Millions
  • Order Processing: Real-time with eventual consistency
  • Recovery Time Objective (RTO): < 1 minute
  • Recovery Point Objective (RPO): < 5 minutes

1. High-Level Architecture

┌─────────────────────────────────────────────────────────────────┐
│ Global Users │
└─────────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────────┐
│ Global CDN + DDoS Protection │
│ (CloudFlare / Akamai) │
└─────────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────────┐
│ Global Load Balancer (DNS-based) │
│ Route by: Geography, Health, Latency │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────┼─────────────────────┐
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Region 1 │ │ Region 2 │ │ Region 3 │
│ (US-EAST) │ │ (EU-WEST) │ │ (ASIA-PAC) │
└──────────────┘ └──────────────┘ └──────────────┘

2. Regional Architecture (Per Region)

Each region is fully self-contained and can operate independently:

┌────────────────────────────────────────────────────────────┐
│ Regional Load Balancer │
│ (AWS ALB / Azure App Gateway) │
└────────────────────────────────────────────────────────────┘

┌───────────────────┼───────────────────┐
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ AZ-1 │ │ AZ-2 │ │ AZ-3 │
│ API Gateway │ │ API Gateway │ │ API Gateway │
└──────────────┘ └──────────────┘ └──────────────┘
│ │ │
└───────────────────┼───────────────────┘

┌────────────────────────────────────────────────────────────┐
│ Service Mesh (Istio) │
│ + Circuit Breakers │
└────────────────────────────────────────────────────────────┘

3. Microservices Layer

3.1 Core Services Architecture

┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│ User Service │ │ Auth Service │ │ Catalog Service │
│ (3+ instances) │ │ (3+ instances) │ │ (3+ instances) │
└─────────────────┘ └─────────────────┘ └─────────────────┘

┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Order Service │ │ Payment Service │ │Inventory Service│
│ (5+ instances) │ │ (5+ instances) │ │ (5+ instances) │
└─────────────────┘ └─────────────────┘ └─────────────────┘

┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│Notification Svc │ │Fulfillment Svc │ │ Analytics Svc │
│ (3+ instances) │ │ (3+ instances) │ │ (3+ instances) │
└─────────────────┘ └─────────────────┘ └─────────────────┘

3.2 Service Characteristics

Order Service (Critical Path):

  • Horizontally scalable with auto-scaling (5-100 instances)
  • Stateless design
  • Circuit breaker pattern for downstream dependencies
  • Retry logic with exponential backoff
  • Request timeout: 3 seconds
  • Bulkhead pattern to isolate critical operations

Payment Service (Critical Path):

  • Idempotent operations (prevent double charging)
  • Transaction log for audit trail
  • Saga pattern for distributed transactions
  • PCI-DSS compliant
  • Rate limiting per user/IP
  • Fallback to queued processing if gateway unavailable

Inventory Service:

  • Optimistic locking for inventory updates
  • Real-time inventory with eventual consistency
  • Cache-aside pattern with Redis
  • Event sourcing for inventory changes
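
A minimal sketch of the optimistic-locking update described above, assuming an inventory table with a version column and a hypothetical db client whose execute() returns the number of affected rows (all names are illustrative):

class ConcurrentUpdateError(Exception):
    pass

def reserve_stock(db, product_id, quantity):
    # Read the current count together with its version.
    row = db.query_one(
        "SELECT available, version FROM inventory WHERE product_id = %s",
        (product_id,),
    )
    if row["available"] < quantity:
        raise ValueError("insufficient stock")

    # Compare-and-swap: the UPDATE only succeeds if nobody else
    # bumped the version since we read it.
    updated = db.execute(
        "UPDATE inventory "
        "SET available = available - %s, version = version + 1 "
        "WHERE product_id = %s AND version = %s",
        (quantity, product_id, row["version"]),
    )
    if updated == 0:
        # Lost the race; the caller retries, typically with backoff.
        raise ConcurrentUpdateError(product_id)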

4. Data Layer Architecture

4.1 Multi-Region Database Strategy

Primary Databases (Per Region):
┌─────────────────────────────────────────────────────┐
│ PostgreSQL / Aurora Global Database │
│ │
│ Region 1 (PRIMARY) → Region 2 (READ REPLICA) │
│ ↓ ↓ │
│ Multi-AZ Setup Multi-AZ Setup │
│ - Master (AZ-1) - Replica (AZ-1) │
│ - Standby (AZ-2) - Replica (AZ-2) │
│ - Replica (AZ-3) - Replica (AZ-3) │
└─────────────────────────────────────────────────────┘

Replication Lag Target: < 1 second

4.2 Database Sharding Strategy

Sharding Key: User ID (consistent hashing)

Shard Distribution:

  • 16 logical shards per region
  • Each shard has 3 physical replicas (across AZs)
  • Allows horizontal scaling to 64, 128, 256 shards

Data Partitioning:

  • Users: Sharded by user_id
  • Orders: Sharded by user_id (co-located with user data)
  • Products: Replicated across all shards (read-heavy)
  • Inventory: Sharded by product_id with cache layer
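
A minimal sketch of routing a user_id to one of the 16 logical shards via consistent hashing, so shards can later be split without remapping most keys; the shard names and virtual-node count are illustrative assumptions:

import bisect
import hashlib

class HashRing:
    """Consistent hashing: each logical shard owns many virtual nodes on a ring."""

    def __init__(self, shards, vnodes=64):
        self._ring = sorted(
            (self._hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value):
        # MD5 is used only for even distribution, not for security.
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def shard_for(self, user_id):
        # Walk clockwise to the first virtual node at or after the key's hash.
        idx = bisect.bisect(self._keys, self._hash(str(user_id))) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing([f"shard-{n:02d}" for n in range(16)])
print(ring.shard_for(123456))   # e.g. 'shard-07'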

4.3 Caching Strategy

┌─────────────────────────────────────────────────────┐
│ Redis Cluster (Per Region) │
│ │
│ Cache Tier 1: User Sessions (TTL: 30 min) │
│ Cache Tier 2: Product Catalog (TTL: 5 min) │
│ Cache Tier 3: Inventory Counts (TTL: 30 sec) │
│ Cache Tier 4: Hot Order Data (TTL: 10 min) │
│ │
│ Configuration: │
│ - 6 nodes per region (2 per AZ) │
│ - Clustering mode enabled │
│ - Automatic failover │
│ - Backup to S3 every 6 hours │
└─────────────────────────────────────────────────────┘

Cache Invalidation Strategy:

  • Write-through for critical data (orders, payments)
  • Cache-aside for read-heavy data (products, users)
  • Event-driven invalidation via message queue
  • Lazy expiration with active monitoring
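
A minimal sketch of the cache-aside read path, the write-through order path, and event-driven invalidation, assuming a Redis-like client with get/setex/delete and a hypothetical db layer (names and TTLs mirror the tiers above but are illustrative):

import json

PRODUCT_TTL_SECONDS = 300   # Cache Tier 2: product catalog, 5 minutes

def get_product(cache, db, product_id):
    # Cache-aside: try the cache, fall back to the database, then populate.
    key = f"product:{product_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    product = db.fetch_product(product_id)        # hypothetical DB call
    cache.setex(key, PRODUCT_TTL_SECONDS, json.dumps(product))
    return product

def save_order(cache, db, order):
    # Write-through for critical data: persist first, then update the cache
    # in the same code path so readers never see a stale order.
    db.insert_order(order)                        # hypothetical DB call
    cache.setex(f"order:{order['id']}", 600, json.dumps(order))

def on_product_updated(cache, event):
    # Event-driven invalidation: a consumer of inventory/catalog update events
    # drops the stale key; the next read repopulates it.
    cache.delete(f"product:{event['product_id']}")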

5. Event-Driven Architecture

5.1 Message Queue Infrastructure

┌─────────────────────────────────────────────────────┐
│ Apache Kafka / Amazon MSK (Per Region) │
│ │
│ Topics: │
│ - order.created (partitions: 50) │
│ - order.confirmed (partitions: 50) │
│ - payment.processed (partitions: 30) │
│ - inventory.updated (partitions: 40) │
│ - notification.email (partitions: 20) │
│ - notification.sms (partitions: 20) │
│ - analytics.events (partitions: 100) │
│ │
│ Configuration: │
│ - Replication Factor: 3 │
│ - Min In-Sync Replicas: 2 │
│ - Retention: 7 days │
│ - Cross-region replication for critical topics │
└─────────────────────────────────────────────────────┘

5.2 Order Processing Flow

1. User Places Order

2. Order Service validates → Publishes "order.created"

3. Multiple Consumers:
- Inventory Service (reserves items)
- Payment Service (processes payment)
- Notification Service (confirms to user)
- Analytics Service (tracks metrics)

4. Saga Coordinator monitors completion

5. If all succeed → Publish "order.confirmed"
If any fail → Publish compensating events

6. Fulfillment Service picks up confirmed orders

Benefits:

  • Decoupling of services
  • Async processing reduces latency
  • Natural retry mechanism
  • Event log for debugging
  • Scalable consumer groups
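
A minimal sketch of the compensation step in the flow above, assuming a hypothetical publish(topic, event) helper in front of the Kafka topics listed earlier; the inventory.release and order.failed topics are illustrative additions, not part of the topic list above:

def handle_payment_result(publish, order, payment_ok, inventory_reserved):
    """Called by the saga coordinator once payment processing reports back."""
    if payment_ok:
        publish("order.confirmed", {"order_id": order["id"]})
        return

    # Compensating events: undo whatever already happened, then mark the
    # order failed. Consumers must treat these events as idempotent.
    if inventory_reserved:
        publish("inventory.release", {          # hypothetical compensation topic
            "order_id": order["id"],
            "items": order["items"],
        })
    publish("order.failed", {                   # hypothetical topic
        "order_id": order["id"],
        "reason": "payment_declined",
    })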

6. Resilience Patterns

6.1 Circuit Breaker Implementation

Circuit States:
- CLOSED: Normal operation, requests flow through
- OPEN: Failure threshold exceeded, fail fast
- HALF_OPEN: Testing if service recovered

Configuration (per service):
- Failure Threshold: 50% of requests in 10 seconds
- Timeout: 3 seconds
- Half-Open Retry: After 30 seconds
- Success Threshold: 3 consecutive successes to close
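
A minimal in-process sketch of that state machine with the same thresholds (50% failures over a 10-second window, 30-second half-open retry, 3 successes to close); in practice this lives in the service mesh or a resilience library rather than hand-rolled code:

import time
from collections import deque

class CircuitBreaker:
    """In-process sketch of the CLOSED / OPEN / HALF_OPEN state machine."""

    def __init__(self, failure_rate=0.5, window_s=10.0,
                 retry_after_s=30.0, close_after=3, min_calls=10):
        self.failure_rate = failure_rate
        self.window_s = window_s
        self.retry_after_s = retry_after_s
        self.close_after = close_after
        self.min_calls = min_calls          # avoid tripping on tiny samples
        self.state = "CLOSED"
        self.calls = deque()                # (timestamp, succeeded) in the window
        self.opened_at = 0.0
        self.half_open_successes = 0

    def call(self, fn, *args, **kwargs):
        now = time.monotonic()
        if self.state == "OPEN":
            if now - self.opened_at < self.retry_after_s:
                raise RuntimeError("circuit open: failing fast")
            self.state, self.half_open_successes = "HALF_OPEN", 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._record(now, ok=False)
            raise
        self._record(now, ok=True)
        return result

    def _record(self, now, ok):
        self.calls.append((now, ok))
        while self.calls and now - self.calls[0][0] > self.window_s:
            self.calls.popleft()
        if self.state == "HALF_OPEN":
            if not ok:
                self.state, self.opened_at = "OPEN", now
            else:
                self.half_open_successes += 1
                if self.half_open_successes >= self.close_after:
                    self.state = "CLOSED"
                    self.calls.clear()
        elif self.state == "CLOSED" and len(self.calls) >= self.min_calls:
            failures = sum(1 for _, succeeded in self.calls if not succeeded)
            if failures / len(self.calls) >= self.failure_rate:
                self.state, self.opened_at = "OPEN", now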

6.2 Retry Strategy

Exponential Backoff with Jitter:

Attempt 1: Immediate
Attempt 2: 100ms + random(0-50ms)
Attempt 3: 200ms + random(0-100ms)
Attempt 4: 400ms + random(0-200ms)
Attempt 5: 800ms + random(0-400ms)
Max Attempts: 5
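
A minimal sketch of that schedule (100 ms base delay doubling per attempt, jitter up to half the delay, five attempts), assuming the wrapped operation raises on failure:

import random
import time

def call_with_retry(operation, max_attempts=5, base_delay_s=0.1):
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise
            # Attempt 2 waits ~100ms, attempt 3 ~200ms, and so on, plus random
            # jitter of up to 50% so retries from many clients do not synchronize.
            delay = base_delay_s * (2 ** (attempt - 1))
            time.sleep(delay + random.uniform(0, delay / 2))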

Idempotency Keys:

  • All write operations require idempotency key
  • Stored for 24 hours to detect duplicates
  • Ensures safe retries without side effects
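
A minimal sketch of enforcing an idempotency key around a payment call, assuming a Redis-like store that supports set(key, value, nx=True, ex=ttl) and get; the key format and charge_fn are illustrative:

import json

IDEMPOTENCY_TTL_S = 24 * 3600   # keys retained for 24 hours, as above

def charge_once(store, idempotency_key, charge_fn):
    """Run charge_fn at most once per idempotency key; replay the stored result."""
    reserved = store.set(f"idem:{idempotency_key}", "PENDING",
                         nx=True, ex=IDEMPOTENCY_TTL_S)
    if not reserved:
        # A previous attempt already ran (or is still running): return its
        # stored result instead of charging again.
        previous = store.get(f"idem:{idempotency_key}")
        if previous in (b"PENDING", "PENDING"):
            raise RuntimeError("duplicate request still in flight")
        return json.loads(previous)

    result = charge_fn()   # the actual payment call, assumed to return a dict
    store.set(f"idem:{idempotency_key}", json.dumps(result), ex=IDEMPOTENCY_TTL_S)
    return result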

6.3 Bulkhead Pattern

Resource Isolation:

  • Separate thread pools for different operation types
  • Critical operations: 60% of resources
  • Non-critical operations: 30% of resources
  • Admin operations: 10% of resources

Rate Limiting:

  • Per-user: 100 requests/minute
  • Per-IP: 1000 requests/minute
  • Global: 1M requests/second per region
  • Token bucket algorithm with Redis
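
A minimal in-process sketch of the token bucket for the per-user limit (100 requests/minute); the production variant described above keeps the bucket state in Redis so every instance in the region shares it:

import time

class TokenBucket:
    def __init__(self, capacity=100, refill_per_second=100 / 60):
        self.capacity = capacity                  # burst size
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Add tokens for the elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets = {}   # user_id -> TokenBucket (per-user: 100 requests/minute)

def is_allowed(user_id):
    return buckets.setdefault(user_id, TokenBucket()).allow()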

6.4 Timeout Strategy

Service Timeouts (Cascading):
- Gateway → Service: 5 seconds
- Service → Service: 3 seconds
- Service → Database: 2 seconds
- Service → Cache: 500ms
- Service → External API: 10 seconds
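
A minimal sketch of enforcing those cascading budgets by passing one deadline down the call chain, so an inner hop never waits longer than the gateway budget has left; the service URL and helper are illustrative:

import time
import requests   # assumed HTTP client; any client with per-call timeouts works

GATEWAY_BUDGET_S = 5.0      # Gateway → Service
DOWNSTREAM_TIMEOUT_S = 3.0  # Service → Service ceiling

def remaining(deadline):
    return deadline - time.monotonic()

def handle_request(user_id):
    deadline = time.monotonic() + GATEWAY_BUDGET_S
    # Each hop takes the smaller of its own ceiling and whatever the overall
    # budget has left, so a slow early hop cannot push the total past 5 seconds.
    budget = min(DOWNSTREAM_TIMEOUT_S, remaining(deadline))
    if budget <= 0:
        raise TimeoutError("request budget exhausted")
    response = requests.get(
        f"http://user-service.production.svc.cluster.local/users/{user_id}",  # illustrative URL
        timeout=budget,
    )
    return response.json()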

7. Disaster Recovery & High Availability

7.1 Multi-Region Failover Strategy

Active-Active Configuration:

  • All regions actively serve traffic
  • DNS-based routing with health checks
  • Automatic failover in < 30 seconds

Failover Procedure:

1. Health Check Failure Detected
- 3 consecutive failures in 15 seconds

2. DNS Update Triggered
- Remove failed region from DNS
- TTL: 60 seconds

3. Traffic Rerouted
- Users automatically routed to healthy regions
- No manual intervention required

4. Alert Engineering Team
- PagerDuty/OpsGenie notification
- Automated runbook execution

5. Failed Region Investigation
- Automated diagnostics
- Log aggregation and analysis

6. Recovery and Validation
- Gradual traffic restoration (10% → 50% → 100%)
- Synthetic transaction testing

7.2 Data Backup Strategy

Automated Backups:

  • Database: Continuous backup with PITR (Point-in-Time Recovery)
  • Snapshots every 6 hours to S3/Glacier
  • Cross-region replication of backups
  • 30-day retention for operational backups
  • 7-year retention for compliance backups

Testing:

  • Monthly disaster recovery drills
  • Quarterly regional failover tests
  • Backup restoration tests every week

7.3 Chaos Engineering

Automated Fault Injection:

  • Random pod termination (5% daily)
  • Network latency injection (200ms-2s)
  • Service dependency failure simulation
  • Database connection pool exhaustion
  • Cache cluster node failures

GameDays (Quarterly):

  • Simulated regional outage
  • Database failover scenarios
  • Multi-service cascading failures
  • Payment gateway unavailability

8. Monitoring & Observability

8.1 Metrics Collection

Infrastructure Metrics:

  • CPU, Memory, Disk, Network per instance
  • Request rate, error rate, latency (RED metrics)
  • Database connection pool utilization
  • Cache hit/miss ratios
  • Queue depth and lag

Business Metrics:

  • Orders per second
  • Order success rate
  • Payment success rate
  • Average order value
  • Cart abandonment rate
  • Time to checkout

Tools:

  • Prometheus for metrics collection
  • Grafana for visualization
  • Custom dashboards per service and region

8.2 Distributed Tracing

Implementation:

  • OpenTelemetry for instrumentation
  • Jaeger/Tempo for trace storage
  • Trace every order through the system
  • Correlation IDs in all logs
  • Service mesh automatic tracing

Key Traces:

  • Order placement (end-to-end)
  • Payment processing
  • Inventory reservation
  • Cross-service calls

8.3 Logging Strategy

Centralized Logging:

  • ELK Stack (Elasticsearch, Logstash, Kibana) or DataDog
  • Structured JSON logging (minimal sketch after this list)
  • Log levels: ERROR, WARN, INFO, DEBUG
  • Retention: 30 days hot, 180 days warm, 365 days cold
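
As referenced above, a minimal structured-JSON log line using only the Python standard library; the field names (service, correlation_id) are illustrative, not a fixed schema:

import json
import logging
import time

class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "service": "api-service",
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order placed", extra={"correlation_id": "abc-123"})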

Log Aggregation:

  • Application logs
  • Access logs
  • Audit logs (immutable)
  • Security logs
  • Database query logs (slow queries)

8.4 Alerting Strategy

Alert Levels:

Critical (Page immediately):

  • Service availability < 99.9%
  • Error rate > 5%
  • Payment failure rate > 1%
  • Database replication lag > 10 seconds
  • Regional outage detected

High (Page during business hours):

  • Error rate > 1%
  • Response time p99 > 2 seconds
  • Cache hit rate < 80%
  • Queue lag > 5 minutes

Medium (Slack/Email):

  • Error rate > 0.5%
  • Disk usage > 75%
  • Memory usage > 80%
  • API rate limit approaching

Alert Routing:

  • PagerDuty for critical alerts
  • Slack for high/medium alerts
  • Weekly summary emails for trends

9. Security Architecture

9.1 Defense in Depth

Layer 1: Network Security

  • VPC with private subnets
  • Security groups (whitelist approach)
  • NACLs for additional filtering
  • WAF (Web Application Firewall) at edge
  • DDoS protection (CloudFlare/AWS Shield)

Layer 2: Application Security

  • OAuth 2.0 + JWT for authentication
  • RBAC (Role-Based Access Control)
  • API rate limiting per user/IP
  • Input validation and sanitization
  • SQL injection prevention (parameterized queries)
  • XSS protection headers

Layer 3: Data Security

  • Encryption at rest (AES-256)
  • Encryption in transit (TLS 1.3)
  • Database column-level encryption for PII
  • Key rotation every 90 days
  • HSM for payment data

Layer 4: Compliance

  • PCI-DSS Level 1 compliance
  • GDPR compliance (data residency, right to deletion)
  • SOC 2 Type II certification
  • Regular penetration testing
  • Vulnerability scanning (weekly)

9.2 Secrets Management

  • HashiCorp Vault or AWS Secrets Manager
  • Secrets rotation every 30 days
  • No secrets in code or environment variables
  • Service accounts with minimal permissions
  • Audit log of all secret access

10. Scalability Strategy

10.1 Horizontal Scaling

Auto-Scaling Policies:

Scale-Out Triggers:

  • CPU > 70% for 3 minutes
  • Memory > 80% for 3 minutes
  • Request queue depth > 100
  • Response time p95 > 1 second

Scale-In Triggers:

  • CPU < 30% for 10 minutes
  • Memory < 50% for 10 minutes
  • Connection draining (2-minute grace period)

Limits:

  • Min instances: 3 per service per AZ
  • Max instances: 100 per service per AZ
  • Scale-out: +50% of current capacity
  • Scale-in: -25% of current capacity (gradual)

10.2 Database Scaling

Read Scaling:

  • Read replicas (5-10 per region)
  • Connection pooling (PgBouncer)
  • Read/write splitting at application layer
  • Cache-first strategy

Write Scaling:

  • Sharding by user_id (see the sketch after this list)
  • Batch writes where possible
  • Async writes for non-critical data
  • Queue-based write buffering
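
An illustrative routing function for the "shard by user_id" approach above; the shard count and names are assumptions, not the actual topology:

import hashlib

SHARDS = ["orders-shard-0", "orders-shard-1", "orders-shard-2", "orders-shard-3"]

def shard_for_user(user_id: str) -> str:
    # stable hash so the same user always routes to the same shard
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

shard_for_user("user-12345")   # e.g. "orders-shard-2"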

10.3 Global Capacity Planning

Current Capacity (per region):

  • 100,000 orders per second
  • 5 million concurrent users
  • 500 TB storage
  • 10 Gbps network egress

Scaling Roadmap:

  • Add region when hitting 70% capacity
  • Shard databases when write load > 50K TPS
  • Add cache nodes proactively (before hit rate drops)

11. Performance Optimization

11.1 Latency Targets

Operation              Target     P50      P95      P99
─────────────────────────────────────────────────────────
Order Placement        < 500ms    300ms    450ms    500ms
Product Search         < 200ms    100ms    180ms    200ms
Cart Update            < 100ms    50ms     90ms     100ms
Payment Processing     < 2s       1.2s     1.8s     2s
Order History          < 300ms    150ms    250ms    300ms

11.2 Optimization Techniques

Frontend:

  • CDN for static assets (99% cache hit rate)
  • HTTP/2 and HTTP/3
  • Lazy loading images
  • Code splitting
  • Service workers for offline capability

Backend:

  • Database query optimization (indexed queries)
  • Connection pooling
  • Response compression (gzip/brotli)
  • API response pagination
  • GraphQL for flexible queries

Network:

  • Keep-alive connections
  • Connection multiplexing
  • Regional edge locations
  • Anycast IP routing

12. Order Consistency Guarantees

12.1 ACID vs BASE Trade-offs

ACID Operations (Strong Consistency):

  • Payment transactions
  • Inventory deduction
  • Order status updates
  • User account balance

BASE Operations (Eventual Consistency):

  • Product catalog updates
  • Analytics and reporting
  • Notification delivery
  • Search index updates

12.2 Distributed Transaction Pattern (SAGA)

Order Saga Flow:

1. Create Order (Compensate: Cancel Order)

2. Reserve Inventory (Compensate: Release Inventory)

3. Process Payment (Compensate: Refund Payment)

4. Update Order Status (Compensate: Revert Status)

5. Send Confirmation (No Compensation)

Saga Coordinator:
- Tracks saga state in database
- Executes compensating transactions on failure
- Ensures eventual consistency
- Idempotent operations for safe retries
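
A minimal coordinator sketch for the flow above, pairing each step with its compensation. The step functions are placeholders, and a real coordinator would also persist saga state to the database between steps as described:

def run_order_saga(steps):
    """steps: list of (action, compensation_or_None), executed in order.
    On failure, completed steps are compensated in reverse order."""
    completed = []
    for action, compensate in steps:
        try:
            action()
            completed.append(compensate)
        except Exception:
            for undo in reversed(completed):
                if undo is not None:
                    undo()          # compensating transaction
            raise

# Usage with hypothetical step functions:
# run_order_saga([
#     (create_order,        cancel_order),
#     (reserve_inventory,   release_inventory),
#     (process_payment,     refund_payment),
#     (update_order_status, revert_status),
#     (send_confirmation,   None),            # no compensation needed
# ])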

12.3 Idempotency Implementation

Idempotency Key Table:
- idempotency_key (PK)
- user_id
- operation_type
- request_hash
- response_data
- created_at
- expires_at (24 hours)

On Duplicate Key:
- Return cached response
- No side effects executed
- Log duplicate attempt for monitoring
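
A sketch of this duplicate-key check, using an in-memory dict in place of the real idempotency_key table; the field names follow the table above:

import time

IDEMPOTENCY_TTL_SECONDS = 24 * 3600
_store = {}    # idempotency_key -> {"response_data": ..., "expires_at": ...}

def execute_once(idempotency_key, operation):
    now = time.time()
    record = _store.get(idempotency_key)
    if record and record["expires_at"] > now:
        # Duplicate: return the cached response, run no side effects,
        # and log the attempt for monitoring.
        return record["response_data"]
    response = operation()                     # first time: execute the write
    _store[idempotency_key] = {
        "response_data": response,
        "expires_at": now + IDEMPOTENCY_TTL_SECONDS,
    }
    return response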

13. Cost Optimization

13.1 Resource Optimization

Compute:

  • Spot instances for batch jobs (70% savings)
  • Reserved instances for baseline (40% savings)
  • Right-sizing (monitoring actual usage)
  • Scheduled scaling (reduce capacity during off-peak)

Storage:

  • S3 lifecycle policies (archive old data)
  • Database storage optimization (partition pruning)
  • Compression for logs and backups
  • CDN reduces origin bandwidth costs

Network:

  • VPC endpoints (avoid NAT gateway charges)
  • Direct Connect for inter-region traffic
  • CloudFront reduces origin requests

Estimated Monthly Cost (per region):

  • Compute: $50,000
  • Databases: $30,000
  • Storage: $15,000
  • Network: $20,000
  • Caching: $10,000
  • Message Queue: $8,000
  • Monitoring: $5,000

Total: ~$138,000/region (3 regions = $414,000/month)

14. Deployment Strategy

14.1 CI/CD Pipeline

1. Code Commit (GitHub)

2. Automated Tests
- Unit tests
- Integration tests
- Security scanning

3. Build Container Image
- Docker build
- Tag with version
- Push to registry

4. Deploy to Staging
- Terraform/CloudFormation
- Kubernetes rollout

5. Automated Testing (Staging)
- Smoke tests
- Load tests
- E2E tests

6. Manual Approval

7. Blue-Green Deployment (Production)
- Deploy to green environment
- Run health checks
- Shift traffic gradually (10% → 50% → 100%)
- Monitor for errors

8. Rollback Capability
- Instant rollback if errors detected
- Automated rollback on critical alerts

14.2 Zero-Downtime Deployment

Rolling Update Strategy:

  • Update one AZ at a time
  • Wait 10 minutes between AZs
  • Health check validation after each update
  • Automatic rollback on failure

Database Migrations:

  • Backward-compatible changes only
  • Shadow writes to new schema
  • Gradual read cutover
  • Old schema support for 2 releases

15. Technical Stack Summary

15.1 Core Technologies

Frontend:

  • React/Next.js
  • TypeScript
  • Mobile: React Native or Native (iOS/Android)

Backend:

  • Language: Java/Go/Node.js (polyglot)
  • API: GraphQL + REST
  • Framework: Spring Boot / Express

Infrastructure:

  • Cloud: AWS/GCP/Azure (multi-cloud)
  • Orchestration: Kubernetes (EKS/GKE/AKS)
  • Service Mesh: Istio
  • IaC: Terraform

Data:

  • Primary DB: PostgreSQL / Aurora
  • Cache: Redis Cluster
  • Search: Elasticsearch
  • Message Queue: Kafka / MSK
  • Object Storage: S3 / GCS

Monitoring:

  • Metrics: Prometheus + Grafana
  • Logging: ELK Stack / DataDog
  • Tracing: Jaeger / Tempo
  • APM: New Relic / DataDog

16. Risk Mitigation

16.1 Identified Risks and Mitigations

Risk                       Impact     Probability   Mitigation
──────────────────────────────────────────────────────────────────────────────
Regional AWS outage        High       Low           Multi-region active-active
DDoS attack                High       Medium        CloudFlare + rate limiting
Database corruption        High       Low           Continuous backups + PITR
Payment gateway down       High       Medium        Multiple payment providers
Data breach                Critical   Low           Encryption + security monitoring
Code bug causing outages   Medium     Medium        Automated testing + gradual rollout
Cache failure              Medium     Low           Cache-aside pattern + DB fallback
Human error                Medium     Medium        IaC + peer reviews + access controls

17. Future Enhancements

17.1 Roadmap

Q1: Enhanced Observability

  • AI-powered anomaly detection
  • Predictive scaling
  • Automated root cause analysis

Q2: Global Expansion

  • Add 2 more regions (South America, Middle East)
  • Edge computing for ultra-low latency
  • Regional data residency compliance

Q3: Advanced Features

  • Machine learning for fraud detection
  • Personalized recommendations engine
  • Real-time inventory optimization

Q4: Cost Optimization

  • FinOps implementation
  • Multi-cloud arbitrage
  • Serverless migration for burst workloads

18. Conclusion

This architecture provides:

High Availability: 99.99% uptime through multi-region, multi-AZ deployment
Fault Tolerance: Circuit breakers, retries, and graceful degradation
Resilience: Self-healing systems and automated recovery
Scalability: Horizontal scaling to millions of users
Performance: Sub-second response times globally
Security: Defense-in-depth with encryption and compliance
Observability: Comprehensive monitoring and alerting
Cost-Effective: Optimized resource utilization

The system is production-ready and battle-tested against common failure scenarios, ensuring reliable order processing for millions of users globally.


Appendix A: Health Check Specifications

api_health_check:
  path: /health
  interval: 10s
  timeout: 3s
  healthy_threshold: 2
  unhealthy_threshold: 3
  checks:
    - database_connection
    - cache_connection
    - message_queue_connection
    - disk_space
    - memory_usage

deep_health_check:
  path: /health/deep
  interval: 60s
  timeout: 10s
  checks:
    - end_to_end_order_flow
    - payment_gateway_reachable
    - inventory_system_reachable

Appendix B: SLA Definitions

availability_sla:
  target: 99.99%
  measurement: per month
  exclusions:
    - scheduled_maintenance (with 7 days notice)
    - force_majeure

performance_sla:
  api_latency_p99: 500ms
  api_latency_p50: 200ms
  order_processing: 2s

support_sla:
  critical_response: 15 minutes
  high_response: 1 hour
  medium_response: 4 hours

Taints and Affinity

· 4 min read
Femi Adigun
Senior Software Engineer & Coach

These are core Kubernetes scheduling concepts, and they often trip people up because they sound similar but serve different purposes.

Let’s make them visual, memorable, and practical — so you’ll never forget them again.


🎯 TL;DR Memory Hook

🧲 Affinity = attraction (where Pods want to go)
☠️ Taint = repulsion (where Pods cannot go — unless they tolerate it)


TAINTS and TOLERATIONS (Think: “KEEP OUT” signs)

Analogy:

Imagine Kubernetes nodes as rooms in a hotel.

  • Some rooms have “Do Not Disturb” signs (🚫) — those are taints.
  • Only guests (Pods) with matching “permission slips” (🪪 tolerations) can enter those rooms.

💡 Real-world Example:

Let’s say you have a GPU node for machine learning jobs. You don’t want random web servers to run there — only GPU workloads.

You’d taint the node like this:

kubectl taint nodes gpu-node gpu=true:NoSchedule

This means:

“Don’t schedule any Pod here unless it tolerates this taint.”

Now, a Pod that can run on GPUs adds a toleration:

tolerations:
- key: "gpu"
  operator: "Equal"
  value: "true"
  effect: "NoSchedule"

That Pod is now “immune” to the taint — it can run on that node.


Quick Rule of Thumb

Concept       Purpose                            Memory Trick
Taint         Marks a node as restricted         “KEEP OUT” sign on node
Toleration    Lets a pod enter despite taint     “Permission slip” on pod

Types of Taint Effects

Effect              Meaning
NoSchedule          Pod won’t be scheduled unless it tolerates the taint
PreferNoSchedule    Try to avoid scheduling, but not strict
NoExecute           Existing Pods are evicted if they don’t tolerate the taint

NODE and POD AFFINITY (Think: “PREFERENCE”)

Now, affinity and anti-affinity are about where Pods prefer to be scheduled — not forced, but guided.


Analogy:

  • Affinity = “I like to be near X.”
  • Anti-affinity = “I don’t want to be near X.”

Example: You may want all backend Pods to run on the same node zone as your database for low latency.

That’s node affinity.

Or maybe you want replicas of a web service spread across different nodes for high availability — that’s pod anti-affinity.


💡 Node Affinity Example:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: zone
          operator: In
          values:
          - us-east-1a

Meaning:

“Only schedule this Pod on nodes in us-east-1a.”


💡 Pod Affinity Example:

affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - frontend
        topologyKey: "kubernetes.io/hostname"

Meaning:

“Try to schedule this Pod near another Pod with label app=frontend on the same node.”


Pod Anti-Affinity Example:

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - web
      topologyKey: "kubernetes.io/hostname"

Meaning:

“Don’t put two Pods with label app=web on the same node.” Used for high availability — prevents all replicas landing on one machine.


Simple Way to Remember

Concept             What it Controls                           Easy Way to Remember
Taint               Node says “I don’t want certain Pods”      Node → “Stay away!”
Toleration          Pod says “I’m okay with that taint”        Pod → “I can handle it.”
Node Affinity       Pod prefers certain Nodes                  Pod → “I like those nodes.”
Pod Affinity        Pod prefers to be near other Pods          Pod → “I like being near X.”
Pod Anti-Affinity   Pod avoids certain Pods                    Pod → “I don’t like being near X.”

Bonus Tip:

You can mix them:

  • Taints/tolerations = hard rules (enforcement)
  • Affinity/anti-affinity = soft preferences (placement)

Think of it like:

“Taints control who can enter the building; affinities control where they sit inside.”


Kubernetes YML Breakdown

· 6 min read
Femi Adigun
Senior Software Engineer & Coach

Here is a universal structure that applies to ALL Kubernetes YAML files, plus memory tricks to never forget it!

The Universal Kubernetes YAML Structure

Every Kubernetes resource follows this 4-part structure:

apiVersion: <api-group>/<version>
kind: <ResourceType>
metadata:
  name: <resource-name>
  namespace: <namespace>
  labels:
    key: value
  annotations:
    key: value
spec:
  # Resource-specific configuration goes here

Memory Trick: "AKM-S"

Api version Kind Metadata Spec

Or remember: "A Kind Metadata Spec" - like you're specifying a kind of metadata!


The 4 Core Fields Explained

1. apiVersion - Which API to use

Format: <api-group>/<version> or just <version>

Common values:

apiVersion: v1                            # Core resources
apiVersion: apps/v1                       # Deployments, StatefulSets
apiVersion: batch/v1                      # Jobs
apiVersion: networking.k8s.io/v1          # Ingress
apiVersion: autoscaling/v2                # HPA
apiVersion: rbac.authorization.k8s.io/v1  # RBAC

2. kind - Type of resource

kind: Pod
kind: Deployment
kind: Service
kind: ConfigMap
kind: Secret
# ... etc

3. metadata - Identifying information

Always has:

  • name (required)
  • namespace (optional, defaults to "default")
  • labels (optional but recommended)
  • annotations (optional)

metadata:
  name: my-app
  namespace: production
  labels:
    app: my-app
    version: v1
    env: prod
  annotations:
    description: "My application"

4. spec - Desired state (varies by resource)

This is where the resource-specific configuration goes. Changes based on kind.


Common Kubernetes Resources Cheat Sheet

Pod (Basic building block)

apiVersion: v1
kind: Pod
metadata:
  name: nginx-pod
  labels:
    app: nginx
spec:
  containers:
  - name: nginx
    image: nginx:latest
    ports:
    - containerPort: 80

Spec structure for Pod:

  • containers[] - List of containers
    • name, image, ports[], env[], resources, volumeMounts[]
  • volumes[] - Storage volumes
  • restartPolicy - Always/OnFailure/Never

Deployment (Manages Pods)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.19
        ports:
        - containerPort: 80

Spec structure for Deployment:

  • replicas - Number of pods
  • selector - How to find pods to manage
  • template - Pod template (has its own metadata + spec)

Service (Networking)

apiVersion: v1
kind: Service
metadata:
  name: nginx-service
spec:
  type: ClusterIP        # ClusterIP, NodePort, LoadBalancer
  selector:
    app: nginx
  ports:
  - protocol: TCP
    port: 80             # Service port
    targetPort: 80       # Container port

Spec structure for Service:

  • type - ClusterIP/NodePort/LoadBalancer
  • selector - Which pods to send traffic to
  • ports[] - Port mappings

ConfigMap (Configuration data)

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  database_url: "postgres://db:5432"
  log_level: "info"

Spec structure for ConfigMap:

  • Uses data instead of spec
  • Key-value pairs

Secret (Sensitive data)

apiVersion: v1
kind: Secret
metadata:
  name: app-secret
type: Opaque
data:
  username: YWRtaW4=       # base64 encoded
  password: cGFzc3dvcmQ=

Spec structure for Secret:

  • Uses data or stringData instead of spec
  • Values must be base64 encoded in data

Interview Strategy: The Template Method

Step 1: Start with the skeleton (10 seconds)

apiVersion:
kind:
metadata:
  name:
spec:

Step 2: Fill in the basics (20 seconds)

apiVersion: apps/v1   # Know common ones
kind: Deployment      # What they asked for
metadata:
  name: my-app        # Descriptive name
spec:
  # Now focus here

Step 3: Add spec details (30 seconds)

For each resource type, remember the key spec fields:

Resource                Key Spec Fields
Pod                     containers[]
Deployment              replicas, selector, template
Service                 type, selector, ports[]
ConfigMap               data (not spec!)
Secret                  data or stringData (not spec!)
Ingress                 rules[]
PersistentVolumeClaim   accessModes, resources

Most Common Interview Questions

Q: "Write a Deployment with 3 replicas"

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: app
        image: nginx:latest

Key points:

  • selector.matchLabels MUST match template.metadata.labels
  • Template is a Pod spec inside Deployment spec

Q: "Expose this Deployment as a Service"

apiVersion: v1
kind: Service
metadata:
  name: my-app-service
spec:
  selector:
    app: my-app          # Match the Deployment's pod labels
  ports:
  - port: 80
    targetPort: 8080
  type: ClusterIP

Q: "Create a ConfigMap and mount it in a Pod"

ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  config.yaml: |
    database:
      host: localhost
      port: 5432

Pod using ConfigMap:

apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: myapp
    volumeMounts:
    - name: config
      mountPath: /etc/config
  volumes:
  - name: config
    configMap:
      name: app-config

Quick Reference Card (Memorize This!)

┌───────────────────────────────┐
│ KUBERNETES YAML STRUCTURE     │
├───────────────────────────────┤
│ apiVersion: <group>/<version> │
│ kind: <ResourceType>          │
│ metadata:                     │
│   name: <name>                │
│   namespace: <namespace>      │
│   labels:                     │
│     <key>: <value>            │
│ spec:                         │
│   <resource-specific-fields>  │
└───────────────────────────────┘

COMMON apiVersion VALUES:
• v1 → Pod, Service, ConfigMap, Secret
• apps/v1 → Deployment, StatefulSet, DaemonSet
• batch/v1 → Job, CronJob
• networking.k8s.io/v1 → Ingress

DEPLOYMENT PATTERN:
spec:
  replicas: <number>
  selector:
    matchLabels:
      app: <app-name>
  template:
    metadata:
      labels:
        app: <app-name>        # ← Must match selector!
    spec:
      containers:
      - name: <container-name>
        image: <image>

🎓 Interview Tips

1. Always start with AKM-S structure

Write the skeleton first, then fill in details.

2. Know these by heart:

  • Pod spec: containers[]
  • Deployment spec: replicas, selector, template
  • Service spec: selector, ports[], type

3. Remember the "selector-label" rule

The Service selector must match the Pod labels. The Deployment selector.matchLabels must match template.metadata.labels.

4. Container essentials:

containers:
- name: <name>
  image: <image>
  ports:
  - containerPort: <port>

5. If you forget, explain your thinking

"I know the structure is apiVersion, kind, metadata, and spec. For a Deployment, I need replicas, a selector to match pods, and a template that defines the pod specification..."

6. Common mistakes to avoid:

  • ❌ Forgetting selector.matchLabels in Deployment
  • ❌ Mismatching labels between selector and pod
  • ❌ Wrong apiVersion (use apps/v1 for Deployments)
  • ❌ Using spec in ConfigMap/Secret (it's data)

🚀 Practice Exercise

Try writing these from memory:

  1. A Deployment with 2 replicas running nginx:1.19
  2. A Service exposing the above deployment on port 80
  3. A ConfigMap with database connection string
  4. A Pod that uses the ConfigMap as environment variables

📚 Bonus: Multi-Resource YAML

You can combine multiple resources in one file with ---:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: app
        image: nginx
---
apiVersion: v1
kind: Service
metadata:
  name: my-app-service
spec:
  selector:
    app: my-app
  ports:
  - port: 80
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-app-config
data:
  env: production

Final Memory Aid

Before the interview, write this on your paper:

A-K-M-S
v1, apps/v1, batch/v1
Pod: containers
Deploy: replicas, selector, template
Service: selector, ports, type

With this structure in mind, you can write any Kubernetes YAML!

Trading Error Analysis

· 6 min read
Femi Adigun
Senior Software Engineer & Coach

THE CRITICAL MISTAKE YOU MADE:

You looked at one red candle in isolation instead of the overall structure and context.


WHAT YOU SHOULD HAVE CHECKED FIRST:

1. TREND STRUCTURE (Most Important)

Before entering any trade, ask:

  • Where is price relative to recent highs/lows?
  • Are we making higher highs and higher lows (bullish)?
  • Or lower highs and lower lows (bearish)?

In your case:

  • Price was making higher lows: 663.20 → 664.00 → 664.60
  • Price was making higher highs: 664.80 → 665.20 → 665.40
  • This is BULLISH structure - not bearish

One red candle doesn't change the trend. You needed to see a series of lower highs forming to confirm bearish reversal.


2. SUPPORT/RESISTANCE LEVELS

Where was price when you entered?

  • If it was sitting ON strong support (665.00-665.20), that red candle was likely just a pullback, not a reversal
  • Red candles at support often bounce - that's where buyers step in

Key question before shorting:

  • Has price broken below any major support level?
  • Or is it just touching support and rejecting?

In your case, price was holding above 665.00 support - shorting there was fighting the level.


3. MOVING AVERAGES POSITION

Look at where price is relative to the green MAs:

  • Price above multiple green MAs = bullish bias
  • Price below multiple red MAs = bearish bias

When you entered:

  • Price was clearly above the green moving averages
  • Green MAs were sloping upward
  • This tells you: trend is up, buyers in control

Don't short when price is above rising MAs unless you see clear breakdown.


4. VOLUME CONTEXT

One red candle means nothing without volume context:

  • Was it on high volume (real selling) or low volume (pause)?
  • Were the prior green candles on higher volume than your red candle?

If the rally had strong volume and your red candle was light volume, it's just a breath - not a reversal.


5. PATTERN RECOGNITION

What pattern was forming?

  • Was it a bull flag (consolidation in uptrend)?
  • Was it a head and shoulders top (reversal)?
  • Was it just a pullback in uptrend?

In your case, it looked like a bull flag or consolidation - not a reversal pattern. Red candles in consolidation are normal and often lead to continuation UP.


THE TRAP PREVENTION CHECKLIST:

Before entering ANY trade, check these in order:

1. What is the TREND? (30 seconds)

  • Higher highs + higher lows = Uptrend (favor CALLS, be careful with PUTS)
  • Lower highs + lower lows = Downtrend (favor PUTS, be careful with CALLS)
  • Sideways = Range (trade the bounces)

2. Where is price relative to KEY LEVELS? (30 seconds)

  • At resistance? Could reverse down
  • At support? Could bounce up
  • In the middle of nowhere? Wait for a level

3. Where is price relative to MOVING AVERAGES? (10 seconds)

  • Above rising MAs = bullish bias
  • Below falling MAs = bearish bias

4. What is the VOLUME telling me? (20 seconds)

  • Strong volume on up moves = buyers strong
  • Strong volume on down moves = sellers strong
  • Light volume moves = not meaningful

5. What is the TIME OF DAY? (5 seconds)

  • 9:30-10:30 AM: Volatile, traps common
  • 10:30-11:30 AM: Trend becomes clearer
  • 11:30-2:00 PM: Chop/lunch doldrums
  • 2:00-4:00 PM: Afternoon trend

6. What is the RISK/REWARD? (10 seconds)

  • Where's my stop?
  • Where's my target?
  • Is it at least 2:1 reward:risk?

Total time: ~2 minutes to check everything


WHAT YOU ACTUALLY DID:

You saw: "Red candle = exhaustion = time for PUTS"

This is emotional/reactive trading, not systematic trading.

You skipped:

  • Trend check (bullish)
  • Level check (at support)
  • MA check (above MAs)
  • Volume check (likely light on that red candle)
  • Pattern check (bull flag, not reversal)

Result: You entered a PUT into a bullish trend at support with price above MAs. That's a low-probability trade.


THE ONE RED CANDLE TRAP:

This is one of the most common traps for newer traders:

"I see a red candle after green ones, so it must be reversing!"

Reality:

  • One red candle is often just a pause or pullback
  • In an uptrend, pullbacks are buying opportunities, not short opportunities
  • You need multiple red candles making lower highs to confirm reversal

In an uptrend:

  • Red candles at support = buy signal
  • Red candles at resistance = maybe watch for reversal

In a downtrend:

  • Green candles at resistance = short signal
  • Green candles at support = maybe watch for reversal

THE GOLDEN RULE:

"Trade WITH the trend, not against it, unless you have STRONG evidence of reversal"

Strong evidence means:

  • Breaking major support/resistance
  • Multiple candles confirming
  • Volume supporting the move
  • Pattern completion (head & shoulders, double top, etc.)

One red candle is NOT strong evidence.


HOW TO FIX YOUR PROCESS:

Before EVERY trade, write down:

  1. Trend: Up/Down/Sideways
  2. Key Level: What level is price at?
  3. Position: Above or below MAs?
  4. Signal: What pattern/setup am I seeing?
  5. Entry: Specific price
  6. Stop: Specific price
  7. Target: Specific price
  8. R:R Ratio: Calculate it

If you can't answer all of these in 2 minutes, DON'T TAKE THE TRADE.


SPECIFIC TO YOUR SITUATION:

When you saw that red candle, you should have asked:

"Is the TREND still up?"

  • Yes → Don't short unless at major resistance

"Are we at a KEY LEVEL?"

  • Yes, at 665.00 support → Support means buyers likely step in

"Is price still above MAs?"

  • Yes → Still bullish structure

"Is this ONE red candle enough evidence?"

  • No → Need multiple candles to confirm reversal

"What's my R:R if I short here?"

  • Stop would need to be above 665.40 (recent high)
  • Target would be... where? 664.50?
  • That's 0.60 reward vs 0.40 risk = 1.5:1 at best
  • Not compelling given you're fighting the trend

CONCLUSION: Don't take the trade.


THE HARSH TRUTH:

Trading based on "I see a red candle" is gambling, not trading.

You need a system that considers:

  • Trend
  • Levels
  • Structure
  • Volume
  • Risk/reward

One candle tells you almost nothing without context.


ACTION PLAN MOVING FORWARD:

  1. Print out the checklist above
  2. Before EVERY trade, go through it
  3. If trend is UP, be very suspicious of PUT entries (only at major resistance)
  4. If trend is DOWN, be very suspicious of CALL entries (only at major support)
  5. Wait for the pullback in the direction of the trend instead of fighting it
  6. One candle is never enough - look for patterns and confirmation

The market will always be there. Missing a trade is better than taking a bad one. Your job is to wait for HIGH PROBABILITY setups that check all the boxes, not react to every red or green candle.

Trading Change

· 5 min read
Femi Adigun
Senior Software Engineer & Coach

ANALYZING THIS CHART - TREND CHANGE CONFIRMATION

Looking at your current chart, this is NOT forming higher highs yet. Let me explain what I'm seeing:


CURRENT STRUCTURE:

What I see:

  • Major high: 670.74 (top left, early morning)
  • Price collapsed from 670.74 → 667.00 area
  • Now attempting bounces but each bounce is LOWER than 670.74
  • Recent highs: ~669.70, ~669.40, ~668.60 (declining)
  • This is LOWER HIGHS = still bearish structure

Red moving averages are pointing DOWN and acting as resistance. Price is below them. This confirms downtrend is still intact.


WHAT WOULD CONFIRM TREND CHANGE TO BULLISH:

You need THREE things to happen in order:

1. BREAK the downtrend structure (First sign)

Price must break ABOVE the most recent significant high.

Looking at your chart:

  • Recent resistance: 669.70-670.00 zone
  • Price needs to break ABOVE 670.00 and HOLD for at least 15-30 minutes
  • Just touching it isn't enough - needs to close multiple candles above it

2. MAKE a higher low (Confirmation)

After breaking above 670.00, price will pull back. This pullback must find support HIGHER than the previous low.

Example sequence:

  • Current low: 667.00
  • Breaks above 670.00
  • Pulls back to 668.00 (this is HIGHER than 667.00) ✓
  • This creates a higher low

3. BREAK to a NEW higher high (Trend change complete)

From that higher low (668.00), price must rally and break the previous high.

Example:

  • Previous high: 670.00
  • Must break above 670.20+ on the next rally
  • This confirms: Higher low + Higher high = UPTREND

VISUAL CHECKLIST FOR TREND CHANGE:

DOWNTREND → UPTREND requires this sequence:

1. Break resistance ✓ (break above 670.00)
2. Pull back ✓ (must hold above last low)
3. Make new high ✓ (break above 670.20+)

Only when ALL THREE happen = Valid trend change

CURRENT STATUS:

Right now you're seeing:

  • Bounces within the downtrend (lower highs)
  • Red MAs still pointing down
  • Price still below major resistance
  • This is NOT a trend change yet
  • These are bear market bounces / dead cat bounces

This is still a downtrend until proven otherwise.


WHEN TO STOP CALLING IT A TRAP:

A bounce becomes a valid reversal (not a trap) when:

Minimum requirements:

  1. Breaks above the downtrend line (red curved resistance)
  2. Closes above previous significant high (670.00+)
  3. Holds above it for 15-30 minutes (multiple green candles)
  4. Makes a higher low on pullback
  5. Volume confirms (strong volume on the breakout)

Strong confirmation:

  1. Price reclaims moving averages (gets back above red MAs)
  2. Moving averages start curling up (red turns to green)
  3. Multiple higher highs and higher lows form
  4. Time factor - sustains for 30+ minutes

SPECIFIC LEVELS FOR YOUR CHART:

For bulls to take control, watch these in order:

  1. First resistance: 669.70-670.00

    • Must break AND hold above this
    • Just touching and rejecting = still bearish
  2. Second resistance: 670.30-670.50

    • Breaking this shows real strength
    • This is where you'd see higher high forming
  3. Key moving averages

    • Must reclaim the red curved lines
    • They're around 669-670 area
    • Getting above them = major shift

For bears to maintain control:

  • Keep price below 669.50
  • Reject at 670.00 resistance
  • Break below 667.50 again

WHAT YOU'RE SEEING NOW:

These bounces from 667.00 → 668.80 → 669.40 are:

  • Relief rallies in a downtrend
  • Lower highs (not reaching 670.74)
  • Dead cat bounces
  • Not trend changes (yet)

This is why they keep failing - the overall structure is still bearish.


HOW TO TRADE THIS:

If you believe downtrend continues:

Short the bounces to resistance (669.50-670.00 area)

If you want to catch the reversal:

DON'T try to guess the bottom. Wait for:

  1. Break above 670.20
  2. Pullback to 668.50-669.00
  3. THEN buy calls
  4. Stop below the higher low

Trying to buy calls now at 667-668 = catching falling knives


THE TRAP vs. REAL REVERSAL:

TRAP characteristics (what you're seeing now):

  • Bounces that fail at resistance
  • Lower highs still forming
  • Red MAs providing resistance
  • Each bounce weaker than the last
  • Can't sustain gains

REAL REVERSAL characteristics (what to wait for):

  • Breaks through resistance with volume
  • Holds above breakout level
  • Makes higher lows on pullbacks
  • Breaks to new higher highs
  • MAs turn from red to green
  • Sustains for 30+ minutes

HONEST ASSESSMENT:

Based on this chart, you're still in a downtrend. Those green candles you're seeing are bounces, not reversals. The trend hasn't changed yet.

Don't confuse a bounce with a trend change.

Wait for the structure I described above before calling it bullish. Until then, these are trading opportunities to SHORT at resistance, not buy.

Current bias: Still bearish until 670.20+ breaks and holds.

Trading Analysis

· 24 min read
Femi Adigun
Senior Software Engineer & Coach

Looking at this SPY 5-minute chart from 9/29/25, analyzing the first two 15-minute candles (9:30-10:00):

Analysis:

The first 30 minutes shows:

  1. Initial spike - Strong upward momentum at open, pushing above 665.6
  2. Immediate rejection - Failed to hold highs, sharp selloff begins
  3. Breakdown - Price crashes through multiple support levels
  4. Lower highs, lower lows - Classic bearish structure forming

Key observations:

  • Red curved lines showing downward momentum
  • Price unable to reclaim early highs
  • Broken below opening range
  • Support levels (horizontal lines) being tested to the downside
  • Volume spike on the selloff (visible in the massive red candles)

Verdict: This is a PUT day 🐻

Trading setup:

  • The failed breakout at open is a classic "bull trap"
  • Once price broke below the opening range low (~664.5), that confirmed bearish intent
  • You'd be looking for puts on bounces/retracements to the downside
  • Target those support zones (664.2, 664, 663.8 area)

Risk note: That sharp V-bottom around 9:45 shows some buying interest, so you'd want to be nimble with scalps and watch for any reversal patterns. But the overall structure of the first 30 minutes is definitively bearish - lower highs being printed after the failed open.

Full Day Analysis - SPY 9/29/25 📊

Opening Structure (9:30-10:00)

BEARISH - Classic bull trap

  • Gap up at open to ~665.60
  • Immediate rejection and violent selloff
  • Broke opening range low = bearish signal confirmed
  • Waterfall decline to 663.89 low (~1.70 point drop in 45 mins)

Why it happened: Likely gap fill from previous day, sellers positioned for the move, weak hands trapped long at the highs.


First Reversal (10:00-10:45)

BULLISH ATTEMPT - Failed short squeeze

  • Massive V-bottom from 663.89
  • Explosive rally back to 665.05 area
  • Key Opportunity #1: Calls on the bounce from 663.89-664.00 (high risk/high reward)
  • However, failed to reclaim 665.20-665.30 = warning sign
  • Created lower high vs opening high (665.60)

Why it failed: Resistance from morning sellers still in control, not enough buying conviction to break through.


Mid-Morning Chop (10:45-12:00)

RANGE-BOUND/SLIGHT BEAR

  • Consolidation between 664.50-665.05
  • Multiple tests of support at 664.50-664.60
  • Each bounce getting weaker (lower highs forming)
  • Red curved resistance pressing down
  • Opportunity #2: Range scalps - buy 664.50, sell 664.90-665.00 (tight stops required)

Lunch Breakdown (12:00-12:30)

BEARISH CONTINUATION

  • Support at 664.50 finally broke
  • This confirmed the bear trend from morning was still intact
  • Dropped to new lows around 663.40
  • Opportunity #3: Puts on the break of 664.40 targeting 663.50-663.00

Why it happened: Lunch time = low liquidity, stops got run, no buyers stepping in.


Afternoon Recovery Attempts (12:30-2:30)

CHOPPY BULLISH - Multiple failed rallies

  • Several bounces: 663.40 → 664.00 → failed
  • Another: 663.00 → 664.20 → failed
  • Creating a series of higher lows but still under heavy resistance
  • Red curved lines acting as dynamic resistance throughout
  • Opportunity #4: Quick scalp calls on bounces from 663.00-663.50, but had to take profit fast at 664.00-664.20

Pattern: Bear flag / descending channel - each rally met with selling pressure


Late Day Action (2:30-Close)

FINAL PUSH - Small bullish momentum

  • Last attempt to reclaim 663.50-664.00 zone
  • Some buying came in but still capped
  • Closed near 663.70 area (down ~1.40 from HOD)
  • Volume dying off = indecision

KEY LEVELS THAT MATTERED:

  1. 665.30-665.60 = Major resistance (morning highs)
  2. 664.50-664.60 = Critical support that broke midday
  3. 663.80-664.00 = Battle zone all afternoon
  4. 663.00-663.40 = Low of day support

OVERALL DAY ASSESSMENT:

Primary Bias: PUT DAY 🐻

Reasons:

  1. ✅ Failed breakout at open (bull trap)
  2. ✅ Lower highs all day (665.60 → 665.05 → 664.80 → 664.20)
  3. ✅ Lower lows (663.89 → 663.40 → 663.00)
  4. ✅ Every rally attempt failed at resistance
  5. ✅ Red curved moving averages providing constant downward pressure
  6. ✅ Closed near lows of day

Best Opportunities:

  • #1 BEST: Puts after break of opening range low (~9:45 AM) - that confirmed the trend
  • #2: Puts on failed rally at 665.00 (10:15-10:30) - lower high rejection
  • #3: Puts on breakdown of 664.50 support (~12:15 PM) - clear break of key level
  • Scalp plays: Quick calls on oversold bounces at 663.00-663.50, but required discipline to take profit

What NOT to do:

  • ❌ Hold calls hoping for reversal - every bounce failed
  • ❌ Short the absolute lows (663.00) - risky, better to short rallies
  • ❌ Fight the trend - red curved lines were your guide all day

Pattern Recognition: This was a classic trending-down day with weak counter-trend bounces. The tell was the first 30 minutes - when that opening range broke to the downside with conviction, the script was written.

LESSONS LEARNED FROM 9/29/25 SPY TRADING DAY 📚


🎯 CRITICAL LESSONS

1. The First 30 Minutes Sets The Tone

LESSON: When the opening range breaks with conviction (especially on high volume), that's often your directional bias for the session.

  • The break below opening range at 9:45 AM was THE signal
  • Don't fight it, trade with it
  • Rule: Wait for opening range to establish (9:30-10:00), then trade the break

2. Failed Breakouts Are High-Probability Setups

LESSON: That gap-up to 665.60 followed by immediate rejection = bull trap

  • When price makes a new high and can't hold it = reversal signal
  • The "trap" creates a pool of trapped longs who become sellers
  • Rule: Look for quick rejections at key levels - they often lead to strong moves in opposite direction

3. Lower Highs + Lower Lows = Stay Bearish

LESSON: The trend was crystal clear all day:

  • 665.60 → 665.05 → 664.80 → 664.20 (lower highs)
  • 663.89 → 663.40 → 663.00 (lower lows)
  • Rule: Don't fight established trend structure. Trade WITH the trend, not against it

4. Counter-Trend Bounces Require Quick Profits

LESSON: Those big green candles looked tempting, but EVERY rally failed

  • The 663.89 → 665.05 bounce? Faded
  • The 663.40 → 664.00 bounce? Faded
  • The 663.00 → 664.20 bounce? Faded
  • Rule: In a strong trend, counter-trend trades = scalps only. Take profit FAST (30-50 cents, not hoping for full reversal)

⚠️ TRAPS & GOTCHAS

TRAP #1: The Morning Bull Trap 🪤

  • What happened: Open at 665.60 looked bullish, but was a trap
  • The gotcha: FOMO buying the high, then getting crushed
  • How to avoid: Wait 15-30 mins before trading. Let the real trend reveal itself
  • Better play: Sold puts when it broke 665.20 support

TRAP #2: "It's Oversold, Time to Buy!" 🪤

  • What happened: Multiple times hit oversold (663.89, 663.40, 663.00)
  • The gotcha: Each bounce looked like "the bottom" but wasn't
  • How to avoid: Oversold can stay oversold in trending markets
  • Better play: Short the bounces, don't try to catch falling knives

TRAP #3: The 10:00-10:30 Fake Reversal 🪤

  • What happened: Huge green candles from 663.89 → 665.05
  • The gotcha: "Bears are done! It's a V-bottom recovery!"
  • How to avoid: Check if it reclaims ABOVE key resistance (665.30). It didn't = still bearish
  • Better play: Sell calls into that strength at 664.90-665.05

TRAP #4: Holding Through Lunch 🪤

  • What happened: 664.50 support looked solid for 1.5 hours, then broke at lunch
  • The gotcha: Low liquidity lunch = stops get run, violent moves
  • How to avoid: Tighten stops or close positions before 12:00-1:00 PM
  • Better play: Exit before lunch or trade the lunch breakdown fresh

TRAP #5: "Every Dip Is A Buy" 🪤

  • What happened: Multiple dips (664.50, 664.00, 663.50) that continued lower
  • The gotcha: Averaging down on calls in a downtrend = death by 1000 cuts
  • How to avoid: In downtrends, every dip leads to lower dips
  • Better play: Buy dips only when trend changes (higher highs formed)

✅ DO's

  1. DO wait for confirmation

    • Opening range break = confirmation
    • Don't assume direction pre-market
  2. DO respect key levels

    • 665.30, 664.50, 663.89 were critical
    • Price action at these levels tells the story
  3. DO use the moving averages

    • Those red curved lines were resistance all day
    • When price under MAs + MAs pointing down = stay bearish
  4. DO take profits on counter-trend trades

    • That 663.89 → 665.05 bounce? 1+ points!
    • Should've taken profit at 664.80-665.00, not hoped for full reversal
  5. DO trade the pattern

    • Lower highs + lower lows = short rallies
    • Don't overcomplicate it
  6. DO scale in/out

    • Don't go all-in on one entry
    • Build positions as confirmation comes
  7. DO use proper stops

    • Puts: stop if it reclaims 665.30+ strongly
    • Calls: stop if breaks below key support levels

❌ DON'Ts

  1. DON'T fight the trend

    • Biggest mistake = buying calls after 10:00 AM hoping for reversal
    • The trend was down, period
  2. DON'T hold losers hoping

    • If your call is down 30-50% and trend hasn't changed, cut it
    • Hope is not a strategy
  3. DON'T ignore volume

    • Big volume on the opening dump = conviction
    • Low volume on bounces = weak, likely to fail
  4. DON'T revenge trade

    • Missed the morning breakdown? Don't chase
    • Wait for next setup (failed rally to short)
  5. DON'T overtrade the chop

    • That 10:45-12:00 range was tight
    • Sometimes sitting out is the best trade
  6. DON'T buy the "V-bottom"

    • Just because it bounced hard doesn't mean trend changed
    • Need higher highs confirmed, not just bounce
  7. DON'T ignore resistance

    • Every rally hit 664.80-665.05 and failed
    • That's telling you something - listen!
  8. DON'T hold through major time frames

    • Lunch (12-1 PM) and Power Hour (3-4 PM) can reverse positions
    • Take profit or tighten stops before these times

🎓 STRATEGIC LESSONS

Position Sizing

  • Early morning (9:30-10:30): Smaller size, volatility is WILD
  • Confirmed trend (10:30-12:00): Can size up on high-probability setups
  • Lunch (12:00-1:00): Reduce size or sit out
  • Afternoon (1:00-3:30): Moderate size, watch for reversal patterns

Risk Management

  • Max loss per trade: 20-30% of position
  • If wrong 2-3 times in a row: Step away, reassess
  • Daily loss limit: Hit it? Done for the day. Live to trade tomorrow

Entry Timing

  • Best entries today:
    1. Short at 665.00-665.20 (failed opening range)
    2. Short at 665.00-665.05 (failed V-bottom recovery)
    3. Short at 664.50 break (lunch breakdown)
    4. Quick calls at 663.00-663.40 (oversold bounces) with TIGHT profit targets

Exit Timing

  • In trend: Trail stops, let winners run (but not through lunch!)
  • Counter-trend: FAST profits (30-50 cents for scalps)
  • If thesis breaks: Exit immediately (if shorting and breaks above 665.30)

📊 PATTERN RECOGNITION RULES

When you see this pattern again (Trending Down Day):

  1. ✅ Short failed rallies (not blindly short support)
  2. ✅ Use resistance levels as entry (665.00, 664.80, 664.50)
  3. ✅ Quick scalp calls only at major support with tight stops
  4. ✅ Trail stops down as trend continues
  5. ✅ Tighten stops before lunch and end of day
  6. ✅ Don't hope for reversals - trade what IS, not what you want

Confirmation checklist for trend:

  • Lower highs being made?
  • Lower lows being made?
  • Price below moving averages?
  • Failed bounces at resistance?
  • High volume on trend moves, low volume on corrections?

If all checked = TREND IS YOUR FRIEND


💰 PROFIT OPTIMIZATION

If you traded this day perfectly:

  1. Short at 665.20 → cover at 663.89 = +1.30
  2. Quick call at 664.00 → sell at 665.00 = +1.00
  3. Short at 665.00 → cover at 664.40 = +0.60
  4. Short at 664.40 → cover at 663.40 = +1.00
  5. Quick call at 663.40 → sell at 663.90 = +0.50

Total potential: ~4.40 points in SPY = massive day in options

Reality check: You won't catch them all. 2-3 good trades = winning day.


🧠 PSYCHOLOGICAL LESSONS

  1. FOMO is expensive - Don't chase after missing initial move
  2. Patience pays - Wait for your setup, don't force trades
  3. Accept losses quickly - Small losses are tuition, not failure
  4. Don't marry your bias - Be flexible when price action says you're wrong
  5. Profit is profit - Don't regret taking +50% gain just because it went +200%

FINAL WISDOM: This was a textbook trending down day. The market showed its hand in the first 30 minutes. The trap was thinking each bounce was THE reversal. The opportunity was recognizing the pattern and trading with it, not against it.

Best traders don't predict - they react and adapt. 📈📉

🎯 REAL-TIME INFLECTION POINTS & DECISION SIGNALS

Let me break down the exact moments where the market TOLD you what was coming next, and how to read them in real-time.


⏰ TIME-STAMPED INFLECTION POINTS

INFLECTION #1: 9:30-9:35 AM - The Opening Print

What happened: Gapped up to 665.60

Signals to watch:

  • Volume on first 5-min candle - Was it climactic? (Yes = exhaustion)
  • Wick rejection - Did it make a high and immediately reject? (Yes = sellers in control)
  • Failed to hold gap - Is it already trading back below open? (Yes = weak)

Decision framework:

IF (big wick at top + immediate selloff + high volume)
THEN → Likely bull trap, prepare for reversal
WAIT for confirmation: break of 5-min low

IF (grinds higher slowly with increasing volume)
THEN → Genuine breakout, can buy calls

What chart showed: Huge upper wick at 665.60, immediate red candle The signal: This is a trap, not a breakout. Don't chase. Wait for short entry.


INFLECTION #2: 9:42-9:45 AM - Opening Range Low Break

What happened: Broke below the low of the first 15-min range (~665.00)

THIS WAS THE #1 SIGNAL OF THE DAY 🚨

Signals confirming the move:

  • Clean break with volume - Not just a wick, but body close below
  • No immediate reclaim - Didn't bounce right back = real break
  • Acceleration - Selling picked up speed after break
  • Prior resistance becomes support - 665.00 level flipped

Decision point:

Opening Range Low Break (9:42-9:45 AM):
→ Enter PUTS here
→ Stop: Back above opening range high (665.60)
→ Target: Next support levels (664.50, 664.00, 663.50)

Risk/Reward: ~0.60 risk for 1.50+ reward

Why this matters: 70% of the time, when opening range breaks with conviction, that's the direction for the session.

How to confirm in real-time:

  • Price broke 665.00
  • Next candle continued lower (didn't bounce back immediately)
  • Volume increased on the break
  • ENTRY SIGNAL CONFIRMED

INFLECTION #3: 9:50-10:00 AM - Waterfall Breakdown

What happened: Accelerated selling from 664.50 → 663.89

Signals:

  • Consecutive red candles with no bounces (5+ in a row)
  • Increasing range - Each candle bigger than last = panic
  • Breaking support levels rapidly - 664.50, 664.20, 664.00 all broke fast
  • Moving averages turning down - Red curves curling over

Decision point:

IF you missed the opening range break:
→ DON'T CHASE HERE (too extended)
→ WAIT for bounce to SHORT into
→ Mark this area (663.80-664.00) as key support

IF you're already in PUTS from 665.00:
→ Trail stops down
→ Take partial profits at 664.00
→ Let runners go for 663.50

Inflection signal: When you see 5+ consecutive strong candles in one direction, you're in a momentum phase. Don't fade it, ride it or wait for exhaustion.


INFLECTION #4: 10:00-10:05 AM - The V-Bottom

What happened: Violent reversal from 663.89 → 664.80

CRITICAL ANALYSIS NEEDED HERE 🔍

Signals to determine if it's REAL or FAKE:

For a REAL reversal, you need:

  • Higher high than previous swing (needs to break 665.60+)
  • Reclaim above key moving averages
  • Volume increasing on up moves
  • Multiple green candles with higher lows
  • Break above downtrend resistance line

What actually happened:

  • ❌ Failed at 665.05 (couldn't even reach 665.30)
  • ❌ Stayed below moving averages
  • ❌ Volume was lower on bounce vs. the selloff
  • ❌ Created LOWER HIGH (665.05 vs 665.60)
  • ❌ Red resistance line held

Decision point:

At 10:05 AM when price hit 665.00-665.05:

Check:
1. Did it break ABOVE 665.30? NO
2. Is it above moving averages? NO
3. Did downtrend line break? NO

Result: FAILED REVERSAL = SHORT OPPORTUNITY

→ Enter PUTS at 665.00-665.05
→ Stop: Above 665.35
→ Target: Retest of lows (664.00, 663.50)

The key lesson: A V-bottom only counts if it makes a higher high. This failed = bear flag = continuation lower coming.


INFLECTION #5: 10:15-10:30 AM - Lower High Confirmation

What happened: Failed to reclaim 665.05, making lower highs at 664.90, 664.80

This confirmed the downtrend continuation

Signals:

  • Series of lower highs - Each push weaker than last
  • Red candles at resistance - Getting rejected
  • Shrinking volume on bounces - No conviction
  • Red moving averages acting as ceiling - Dynamic resistance

Decision point:

Pattern recognition: BEAR FLAG
→ Big drop (flagpole) ✓
→ Consolidation/bounce (flag) ✓
→ Lower highs in consolidation ✓
→ Awaiting: Break lower (continuation)

Action: Wait for 664.40 break, then enter PUTS
Target: New lows below 663.89

How to spot in real-time:

  • Draw a line connecting the highs: 665.05 → 664.90 → 664.80
  • It's sloping down = bear flag
  • When price breaks the flag's low = entry signal

INFLECTION #6: 12:10-12:15 PM - Support Break at Lunch

What happened: 664.50 support (held for 1.5 hours) finally broke

Pre-signals that break was coming:

  • Multiple tests of support - Each test weakens it (3rd-4th test)
  • Lower highs still forming - Coiling tighter
  • Low lunch volume - Easy to break stops
  • No strong bounces - Each bounce getting weaker

Decision point:

12:00 PM - Before the break:
→ Notice: 664.50 tested 3 times already
→ Each bounce weaker: 664.90 → 664.80 → 664.70
→ Conclusion: 4th test likely breaks

Action:
→ Wait for 664.40 break
→ Enter PUTS on break
→ Target: 663.50, 663.00 (measured move from range)

The rule: Support isn't support forever. 3rd-4th test often breaks. Position for the break, not the hold.


INFLECTION #7: 12:20-12:35 PM - New Low Made

What happened: Dropped to 663.40, then 663.00 (below morning's 663.89 low)

Signal: Lower low = downtrend still intact

Decision point:

At 663.00 (new low):
→ Check: Is this climactic? (Volume spike? Yes)
→ Check: Oversold on short-term? (Yes)
→ Conclusion: Likely to get a bounce, BUT trend still down

Action for scalpers:
→ Quick CALLS at 663.00-663.20
→ Target: 663.80-664.00 (resistance)
→ TIGHT STOP: Below 662.80
→ Plan to flip back to PUTS at 664.00

Action for trend followers:
→ Stay in PUTS
→ Trail stop to 664.50
→ Let it run

Key insight: New lows in established downtrend = trend continuation. Bounces are for selling, not reversing.


INFLECTION #8: 1:00-2:00 PM - Multiple Failed Rallies

What happened: 663.00 → 664.20, then failed. Then 663.20 → 664.00, failed again.

Signals this is still bearish:

  • Can't break above 664.20-664.50 zone (previous support now resistance)
  • Each rally smaller than previous (momentum dying)
  • Red moving averages still providing resistance
  • Higher lows BUT still lower highs = coiling, likely breaks down

Decision point:

Pattern: Descending triangle forming
→ Lower highs: 665.05 → 664.20 → 664.00
→ Flat support: 663.00 area

Descending triangle = bearish continuation pattern

Action:
→ Wait for 663.00 break again
→ Enter PUTS on break
→ OR short rallies to 664.00-664.20 resistance

🎯 SIGNAL HIERARCHY (Most Important First)

TIER 1 - HIGHEST PROBABILITY SIGNALS:

  1. Opening Range Break (9:42 AM) - 70%+ win rate

    • Clear, decisive, early signal
    • Sets tone for entire session
  2. Lower High After Failed Reversal (10:05 AM) - 65%+ win rate

    • Failed V-bottom + lower high = continuation
    • High probability short setup
  3. Support Break After Multiple Tests (12:15 PM) - 65%+ win rate

    • 3rd-4th test breaks more often than holds
    • Especially true during low-volume lunch

TIER 2 - STRONG CONFIRMATION SIGNALS:

  1. Momentum Clusters (5+ consecutive candles same color)

    • Shows dominant force
    • Don't fade, ride or wait
  2. Moving Average Rejection (Throughout day)

    • Price stayed below red curves all day
    • Each touch = short opportunity
  3. Volume Divergence

    • High volume on down moves
    • Low volume on up moves
    • = Bearish dominance

TIER 3 - SUPPORTING SIGNALS:

  1. Lower highs progression (Visual trend structure)
  2. Support/Resistance flips (665.00 support became resistance)
  3. Time of day patterns (Lunch breakdown, afternoon fade)

📋 REAL-TIME DECISION CHECKLIST

At ANY point in the day, ask yourself:

TREND ASSESSMENT:

[ ] Are we making higher highs & higher lows? (Bullish)
[ ] Are we making lower highs & lower lows? (Bearish)
[ ] Are we in a range? (Choppy - reduce size)

TODAY'S ANSWER: Lower highs + lower lows all day = BEARISH


POSITION EVALUATION:

[ ] Am I trading WITH the trend or AGAINST it?
[ ] Is my stop placement logical? (Above resistance for puts, below support for calls)
[ ] Do I have a profit target, or am I hoping?
[ ] What will I do if price reaches X level?

TODAY'S ANSWER: Trade WITH downtrend (puts), take profits at support levels, cut if reclaims 665.30+


ENTRY TIMING:

[ ] Is this a breakout? (Wait for confirmation)
[ ] Is this a support/resistance touch? (Wait for reaction)
[ ] Is this a pullback in a trend? (Entry signal)
[ ] Am I chasing? (If yes, WAIT)

TODAY'S BEST ENTRIES:

  • 9:42 AM: Opening range break ✅
  • 10:05 AM: Failed bounce short ✅
  • 12:15 PM: Support break ✅
  • 1:00-2:00 PM: Resistance rejection shorts ✅

EXIT TIMING:

[ ] Did my profit target hit? (Take it)
[ ] Did my stop hit? (Honor it)
[ ] Did the trend change? (Higher high made = exit puts)
[ ] Is major time window approaching? (Lunch, close = take profit or tighten)

🧠 PATTERN RECOGNITION IN REAL-TIME

How to know what's coming next:

PATTERN: Bull Trap → Reversal

Signal sequence:
1. Gap up / new high made ✓
2. Immediate rejection with volume ✓
3. Break below gap/open level ✓
→ Prediction: Trend lower

Action: Enter puts on confirmation

Today: Happened at 9:30-9:45 AM


PATTERN: Failed V-Bottom → Continuation

Signal sequence:
1. Sharp selloff to support ✓
2. Strong bounce ✓
3. Fails to make higher high ✓
4. Makes lower high instead ✓
→ Prediction: Downtrend continues

Action: Short the lower high

Today: Happened at 10:00-10:15 AM


PATTERN: Bear Flag → Breakdown

Signal sequence:
1. Strong down move (flagpole) ✓
2. Consolidation with lower highs (flag) ✓
3. Multiple tests of support ✓
4. Support breaks ✓
→ Prediction: Next leg down

Action: Enter puts on break, target measured move

Today: Happened at 10:30-12:15 PM


PATTERN: Support Becomes Resistance

Signal sequence:
1. Price breaks support level ✓
2. Rallies back to test broken level ✓
3. Gets rejected at old support (now resistance) ✓
→ Prediction: Another leg down

Action: Short the retest of broken support

Today: 665.00 and 664.50 both flipped


⚡ RAPID-FIRE DECISION RULES

For Entry:

  • With trend + at key level + confirmation = ENTER
  • With trend + chasing + no setup = WAIT
  • Against trend + hope/prediction = NO TRADE

For Adding to Position:

  • In profit + trend confirming + next level in sight = ADD
  • In loss + hoping for reversal = CUT, don't add

For Exits:

  • Hit target = Take profit
  • Trend change signal = Exit
  • Hit stop = Honor it
  • "Let me see what happens" = Emotional trading, have a plan

🎓 THE ULTIMATE INFLECTION POINT FRAMEWORK

Every inflection point has 3 phases:

PHASE 1: SETUP (Recognition)

  • What pattern is forming?
  • What level is it testing?
  • What's the current trend?

PHASE 2: TRIGGER (Confirmation)

  • Did it break/hold the level?
  • Was there volume?
  • Did the next candle confirm?

PHASE 3: FOLLOW-THROUGH (Management)

  • Is it acting as expected?
  • Where's my stop?
  • Where's my target?

Applying to today's KEY moment (9:42 AM opening range break):

SETUP:

  • Pattern: Opening range forming
  • Level: 665.00 (opening range low)
  • Trend: Testing lower after failed high

TRIGGER:

  • Break: YES, closed below 665.00 at 9:42
  • Volume: YES, increasing
  • Confirmation: Next candle continued lower (didn't immediately reclaim)

FOLLOW-THROUGH:

  • Acting as expected? YES, accelerating lower
  • Stop: Above 665.60 (opening range high)
  • Target: 664.50, 664.00, 663.50 (previous support levels)

RESULT: HIGH PROBABILITY TRADE


💡 MENTAL MODEL FOR LIVE TRADING

Think of the market like a pinball machine:

Support/Resistance = Bumpers

  • Price bounces between them
  • Each hit weakens the bumper
  • Eventually breaks through

Trend = Tilt of the machine

  • Downtrend = tilted left (easier to go down)
  • Uptrend = tilted right (easier to go up)
  • Range = flat (bounces equally)

Your job:

  • Identify the tilt (trend)
  • Play the bumpers (levels)
  • Know when the tilt changes (inflection points)

Today's machine: Tilted heavily left (down) all day. Every bounce hit red bumpers (moving averages) and went lower.


🚦 TRAFFIC LIGHT SYSTEM

At any given moment, classify the situation:

🟢 GREEN LIGHT (High confidence entry):

  • Opening range break (9:42 AM)
  • Failed reversal lower high (10:05 AM)
  • Support break after 3+ tests (12:15 PM)
  • Trade WITH trend at key levels

🟡 YELLOW LIGHT (Proceed with caution):

  • Oversold bounces in downtrend (quick scalps only)
  • First test of major support (might hold)
  • Lunch time trades (low liquidity)

🔴 RED LIGHT (Don't trade):

  • Chasing after big move (9:55 AM after waterfall)
  • Fighting the trend without reversal signal
  • Unclear structure / choppy range
  • On tilt / emotional

📊 THE 3-QUESTION FRAMEWORK (Before every trade)

1. WHERE IS THE TREND?

  • Today: DOWN (lower highs + lower lows)

2. WHERE IS THE PRICE RELATIVE TO KEY LEVELS?

  • At resistance? → Short opportunity
  • At support? → Might bounce (quick scalp) or break (continuation)
  • In middle of nowhere? → WAIT

3. WHAT'S THE CONFIRMATION?

  • Clean break with volume? → GO
  • Rejection with volume? → GO
  • Unclear? → WAIT

If all 3 align = HIGH PROBABILITY TRADE
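
To keep yourself honest in the moment, the framework can be written down as a mechanical check. A minimal sketch (hypothetical helper; the three inputs are still your own judgment calls):

# Hypothetical sketch: the 3-question framework as a single go/no-go check.
# The three inputs are still judgment calls; the code only enforces alignment.

def three_question_check(trend: str, location: str, confirmation: str) -> bool:
    trend_ok = trend in {"up", "down"}                 # Q1: is there a trend?
    at_level = location in {"support", "resistance"}   # Q2: at a key level?
    confirmed = confirmation in {"break_with_volume",  # Q3: clean confirmation?
                                 "rejection_with_volume"}
    return trend_ok and at_level and confirmed

# Downtrend, price at resistance, rejection on volume -> high-probability short
print(three_question_check("down", "resistance", "rejection_with_volume"))  # True
print(three_question_check("down", "middle", "unclear"))                    # False -> WAIT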


FINAL WISDOM:

The market doesn't hide its intentions - it broadcasts them through price action. Your job isn't to predict; it's to recognize the signals and react accordingly.

Today's signals were LOUD:

  1. Bull trap at open
  2. Opening range break
  3. Failed bounce making lower high
  4. Support breaks
  5. Lower highs all day

The traders who made money saw these signals and acted. The traders who lost money ignored them and hoped.

Be the former. 📈

Trading Strategy

· 12 min read
Femi Adigun
Senior Software Engineer & Coach

Table of Contents

  1. Core Strategy Overview
  2. 30-Minute Bias Determination
  3. Fibonacci Retracement Levels
  4. Entry and Exit Rules
  5. Risk Management
  6. Do's and Don'ts
  7. Industry Best Practices
  8. Common Mistakes and How to Avoid Them
  9. Volume Analysis
  10. Alternative Trend Confirmation Methods
  11. Psychology and Discipline
  12. Advanced Techniques

Core Strategy Overview

The System

This strategy combines intraday trend analysis with Fibonacci retracement levels to identify high-probability options trading opportunities on SPY using 5-minute charts.

Key Components

  • Daily Bias Determination: First 30 minutes (two 15-minute candles)
  • Fibonacci Levels: 23.6%, 38.2%, 50%, 61.8%, 78.6% retracements
  • Primary Signals: 78.6% (CALL entries) and 23.6% (PUT entries/CALL exits)
  • Risk Management: Trend confirmation and position sizing

30-Minute Bias Determination

Morning Analysis (9:30-10:00 AM)

Examine the first two 15-minute candles:

CALL Day Criteria

  • Both candles are green (bullish)
  • Strong opening momentum upward
  • Price trading above opening level
  • Bullish engulfing patterns

PUT Day Criteria

  • Both candles are red (bearish)
  • Strong opening momentum downward
  • Price trading below opening level
  • Bearish engulfing patterns

Neutral/Choppy Day

  • Mixed candle colors
  • Small body candles with long wicks
  • Price oscillating around opening level
  • Action: Reduce position sizes or avoid trading

Bias Confirmation

  • Volume: Higher volume validates the bias
  • Gap Behavior: Gaps filling or extending support the bias
  • Pre-market Action: Overnight movement alignment
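
As a rough illustration, the bias rules above can be expressed as a tiny classifier. A minimal sketch (hypothetical candle data, reduced to candle color only; wicks, volume, gaps, and pre-market action still need a human read):

# Sketch: classify the daily bias from the first two 15-minute candles.
# A candle is reduced to (open, close); real use also weighs wicks, volume,
# gaps, and pre-market action as listed above.

def candle_color(open_: float, close: float) -> str:
    return "green" if close > open_ else "red" if close < open_ else "doji"

def thirty_minute_bias(c1: tuple[float, float], c2: tuple[float, float]) -> str:
    colors = {candle_color(*c1), candle_color(*c2)}
    if colors == {"green"}:
        return "CALL day"
    if colors == {"red"}:
        return "PUT day"
    return "neutral/choppy: reduce size or stand aside"

# Example: two green 15-minute candles -> CALL day
print(thirty_minute_bias((664.20, 664.80), (664.80, 665.40)))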

Fibonacci Retracement Levels

Daily Range Calculation

  • High: Highest point of the current trading day
  • Low: Lowest point of the current trading day
  • Range: Daily High - Daily Low
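
The level prices themselves are simple arithmetic on the daily high and low. A minimal sketch (levels measured down from the high, so the 78.6% level sits near the low and the 23.6% level near the high; if your charting tool anchors the retracement the other way, the labels flip):

# Sketch: compute the intraday Fibonacci retracement prices from the
# day's high and low.

FIB_RATIOS = (0.236, 0.382, 0.5, 0.618, 0.786)

def fib_levels(day_high: float, day_low: float) -> dict[str, float]:
    day_range = day_high - day_low
    return {f"{r:.1%}": round(day_high - day_range * r, 2) for r in FIB_RATIOS}

# Example with a hypothetical SPY day: high 667.00, low 663.00 (range 4.00)
print(fib_levels(667.00, 663.00))
# {'23.6%': 666.06, '38.2%': 665.47, '50.0%': 665.0, '61.8%': 664.53, '78.6%': 663.86}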

Key Levels and Their Significance

78.6% Retracement (Blue Line)

  • Primary CALL Entry Zone
  • Strong statistical support level
  • Institutional buying often occurs here
  • Signal: Price crosses below = BUY CALLS

61.8% Retracement (Green Line)

  • Secondary support/resistance
  • Golden ratio level
  • Often acts as a pause zone

50% Retracement (Yellow Line)

  • Psychological level
  • Common retracement depth
  • Decision point for trend continuation

38.2% Retracement (Light Red Line)

  • Early resistance level
  • First Fibonacci resistance
  • Watch for rejections

23.6% Retracement (Red Line)

  • Primary PUT Entry Zone
  • Strong resistance level
  • Signal: Price crosses above = SELL CALLS/BUY PUTS

Entry and Exit Rules

CALL Entries

Primary Signal: Price touches or crosses below 78.6% level

Entry Criteria

  • Price reaches 78.6% retracement
  • Preferably on a CALL day (bullish bias)
  • Volume confirmation (higher than average)
  • RSI showing oversold conditions (<30)

Position Sizing

  • WITH trend bias: Full position (1-3% of account)
  • AGAINST trend bias: Half position (0.5-1.5% of account)

PUT Entries

Primary Signal: Price touches or crosses above 23.6% level

Entry Criteria

  • Price reaches 23.6% retracement
  • Preferably on a PUT day (bearish bias)
  • Volume confirmation (higher than average)
  • RSI showing overbought conditions (>70)
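
Stated as code, each entry boils down to three conditions. A minimal sketch (hypothetical inputs from your own data feed; the 30-minute bias does not gate the signal, it only scales position size as described above):

# Sketch: check the mechanical entry conditions. Inputs are assumed to come
# from your own data feed; the 30-minute bias is handled separately and only
# affects position size (full vs. half), not the signal itself.

def call_entry_ok(price: float, fib_786: float, volume: float,
                  avg_volume: float, rsi: float) -> bool:
    # Touched/crossed below the 78.6% level, above-average volume, oversold RSI
    return price <= fib_786 and volume > avg_volume and rsi < 30

def put_entry_ok(price: float, fib_236: float, volume: float,
                 avg_volume: float, rsi: float) -> bool:
    # Touched/crossed above the 23.6% level, above-average volume, overbought RSI
    return price >= fib_236 and volume > avg_volume and rsi > 70

# Example: price tags the 78.6% level on heavy volume with RSI at 27
print(call_entry_ok(663.85, 663.86, 1_500_000, 900_000, 27))  # True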

Exit Strategies

Profit Targets

  • Conservative: 25-50% profit
  • Aggressive: 100-200% profit
  • Swing: Hold until opposite Fibonacci level

Stop Losses

  • Time-based: Exit by 3:30 PM if no movement
  • Price-based: Exit if price moves 0.5% against position
  • Fibonacci-based: Exit if price breaks next Fibonacci level

Emergency Exits

  • Exit immediately if daily bias changes dramatically
  • Exit if volume spikes against your position
  • Exit if major news breaks

Risk Management

Position Sizing Rules

  • Maximum risk per trade: 1-3% of total account
  • Daily maximum risk: 5% of total account
  • Weekly maximum risk: 10% of total account
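
As a worked example of the per-trade limit, here is a minimal sizing sketch (hypothetical helper; assumes the standard 100-share option multiplier and that the worst case is losing the entire premium):

# Sketch: translate the 1-3% risk rule into a number of option contracts,
# assuming the worst case is losing the entire premium paid.

def contracts_for_risk(account_size: float, risk_pct: float, option_price: float) -> int:
    """risk_pct is a fraction, e.g. 0.02 for 2% of the account."""
    risk_dollars = account_size * risk_pct
    cost_per_contract = option_price * 100  # standard equity option multiplier
    return int(risk_dollars // cost_per_contract)

# Example: $25,000 account, 2% risk, $1.50 premium -> $500 budget -> 3 contracts
print(contracts_for_risk(25_000, 0.02, 1.50))  # 3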

Portfolio Protection

  • Never risk more than you can afford to lose
  • Diversify expiration dates (don't put all trades in same week)
  • Limit number of concurrent positions (maximum 3-5 positions)

Options-Specific Risk Management

  • Avoid trading options with less than 7 days to expiration
  • Choose options with adequate liquidity (bid-ask spread <$0.10)
  • Monitor theta decay, especially on longer holds
  • Be aware of upcoming earnings or FOMC meetings

Do's and Don'ts

✅ DO's

Strategy Execution

  • DO wait for clear Fibonacci level touches
  • DO confirm with volume when possible
  • DO respect the 30-minute bias for position sizing
  • DO use the alert system to avoid emotional decisions
  • DO keep detailed trade logs with entry/exit reasons
  • DO practice proper position sizing
  • DO have predetermined exit strategies before entering

Risk Management

  • DO cut losses quickly when wrong
  • DO take profits when targets are met
  • DO respect daily and weekly loss limits
  • DO trade smaller when bias conflicts with Fibonacci signals
  • DO monitor economic calendar for high-impact events

Discipline

  • DO stick to your predetermined rules
  • DO review trades weekly for improvement opportunities
  • DO maintain consistent trading hours
  • DO take breaks after significant losses

❌ DON'Ts

Strategy Violations

  • DON'T chase price away from Fibonacci levels
  • DON'T ignore the 30-minute bias completely
  • DON'T trade without volume confirmation on major moves
  • DON'T overtrade when signals are unclear
  • DON'T modify stop losses to avoid taking losses

Options-Specific Don'ts

  • DON'T buy options with less than 7 days to expiration
  • DON'T hold options through earnings without planning
  • DON'T ignore bid-ask spreads (avoid wide spreads)
  • DON'T buy far out-of-the-money options hoping for lottery tickets

Risk Management Violations

  • DON'T risk more than 3% per trade
  • DON'T add to losing positions (no averaging down)
  • DON'T trade when emotionally compromised
  • DON'T ignore predetermined stop losses
  • DON'T trade during major news events without experience

Psychological Pitfalls

  • DON'T revenge trade after losses
  • DON'T get overconfident after winning streaks
  • DON'T change strategy mid-trade
  • DON'T trade to recover losses quickly

Industry Best Practices

Professional Trading Standards

  1. Always have a plan before entering any trade
  2. Risk management is more important than being right
  3. Consistency beats home runs
  4. Keep detailed records of all trades
  5. Continuously educate yourself on market conditions

Options Trading Best Practices

  1. Understand Greeks (Delta, Gamma, Theta, Vega)
  2. Trade liquid options with tight bid-ask spreads
  3. Be aware of upcoming events that affect volatility
  4. Don't fight time decay unnecessarily
  5. Consider implied volatility when entering positions

Technical Analysis Standards

  1. Multiple timeframe analysis (use higher timeframes for context)
  2. Volume confirmation on breakouts and reversals
  3. Respect market structure defined by support and resistance levels
  4. Trend-following trades generally offer better odds than counter-trend trades
  5. Wait for confirmation rather than predicting

Risk Management Industry Standards

  1. 2% rule: Never risk more than 2% of account on single trade
  2. 6% rule: Stop trading if you lose 6% in a day
  3. Position sizing: Base on account size, not emotions
  4. Diversification: Don't put all capital in one strategy
  5. Regular review: Analyze performance monthly

Common Mistakes and How to Avoid Them

Mistake 1: Ignoring Volume

Problem: Taking signals without volume confirmation
Solution: Always check volume on major moves; high volume validates signals

Mistake 2: Fighting the Fibonacci Levels

Problem: Holding positions that go against key levels
Solution: Respect 78.6% and 23.6% levels as decision points

Mistake 3: Overleveraging

Problem: Risking too much per trade or trading too many contracts
Solution: Stick to 1-3% risk per trade regardless of confidence level

Mistake 4: Emotional Trading

Problem: Making decisions based on fear or greed
Solution: Use alerts and predetermined rules; step away when emotional

Mistake 5: Ignoring Time Decay

Problem: Holding options too long without price movement
Solution: Set time-based exits; don't hold stagnant positions

Mistake 6: Chasing Entries

Problem: Entering trades after price has moved away from Fibonacci levels
Solution: Wait for the next setup; missing a trade is better than a bad entry

Mistake 7: No Exit Strategy

Problem: Entering trades without knowing when to exit
Solution: Define profit targets and stop losses before entering


Volume Analysis

Setting Up Volume in ThinkOrSwim

  1. Right-click chart → Studies → Add Study
  2. Search for "Volume" and add it
  3. Add "Volume SMA" (20-period) for context

Volume Interpretation

High Volume Signals:

  • Volume 2x above 20-period average
  • Validates price moves and breakouts
  • Institutional participation

Low Volume Signals:

  • Volume below 20-period average
  • Weak moves likely to reverse
  • Lack of institutional interest

Volume at Key Levels:

  • High volume at 78.6% = Strong support, bounce likely
  • Low volume at 78.6% = Weak support, could break
  • High volume at 23.6% = Strong resistance, reversal likely
  • Low volume at 23.6% = Weak resistance, could break through
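
Outside ThinkOrSwim, the same read can be scripted. A rough sketch using pandas (assumes you already have 5-minute bars with a volume column; the 2x threshold matches the rule above):

# Sketch: classify each bar's volume against its 20-period simple moving average.
import pandas as pd

def classify_volume(volume: pd.Series, window: int = 20) -> pd.Series:
    vol_sma = volume.rolling(window).mean()
    ratio = volume / vol_sma
    return pd.cut(ratio, bins=[0, 1.0, 2.0, float("inf")],
                  labels=["low", "normal", "high"])

# Usage (df is a DataFrame of 5-minute bars with a 'volume' column):
# df["vol_class"] = classify_volume(df["volume"])
# "high" (>2x the 20-bar average) validates a break of the 78.6% or 23.6% level.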

Alternative Trend Confirmation Methods

Opening Range Breakout (ORB)

Setup: Mark first 30-60 minutes high/low range

  • Bullish: Break above opening range high
  • Bearish: Break below opening range low
  • Use with: Combine with 30-minute bias for confirmation

Pre-Market Analysis

Factors to Consider:

  • Overnight futures movement
  • Gap up/down at market open
  • Pre-market volume and direction
  • Key level breaks overnight

First Hour Momentum

Method: Count green vs red candles in first hour

  • 3-4 green candles = Strong bullish bias
  • 3-4 red candles = Strong bearish bias
  • Mixed candles = Choppy, reduce position sizes
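
A tiny sketch of this count (assuming the first hour is read as four 15-minute candles, consistent with the bias candles above; adjust if you count 5-minute bars instead):

# Sketch: first-hour momentum from the four 15-minute candles (9:30-10:30).
def first_hour_bias(candles: list[tuple[float, float]]) -> str:
    """candles: (open, close) tuples for the first four 15-minute bars."""
    greens = sum(close > open_ for open_, close in candles)
    reds = sum(close < open_ for open_, close in candles)
    if greens >= 3:
        return "strong bullish bias"
    if reds >= 3:
        return "strong bearish bias"
    return "mixed/choppy: reduce position sizes"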

Market Internals

Key Indicators:

  • TICK (upticks vs downticks)
  • TRIN (Arms Index)
  • VIX (volatility index)
  • Sector rotation patterns

Psychology and Discipline

Mental Framework

  1. Accept that losses are part of trading
  2. Focus on process, not outcomes
  3. Consistency is more important than being right
  4. Emotional decisions are usually wrong decisions

Daily Routine

Pre-Market (8:30-9:30 AM):

  • Review overnight news and futures
  • Mark key levels and Fibonacci retracements
  • Set alerts for entry points
  • Review economic calendar

Trading Hours (9:30 AM-4:00 PM):

  • Execute predetermined plan
  • Monitor positions without overanalyzing
  • Take notes on market behavior
  • Stick to risk management rules

Post-Market (4:00-5:00 PM):

  • Review all trades taken
  • Update trade journal
  • Analyze what worked and what didn't
  • Plan for next trading day

Dealing with Losses

  1. Accept the loss quickly
  2. Analyze what went wrong objectively
  3. Don't try to "get even" immediately
  4. Take a break if needed
  5. Return to systematic trading

Managing Winning Streaks

  1. Don't increase position sizes dramatically
  2. Continue following the same rules that created success
  3. Take some profits off the table
  4. Stay humble and focused
  5. Remember that losing streaks will come

Advanced Techniques

Multi-Timeframe Analysis

Higher Timeframe Context:

  • Use 15-minute charts for broader trend
  • Use 1-hour charts for major support/resistance
  • Use daily charts for overall market direction

Fibonacci Extensions

Beyond Retracements:

  • 127.2% extension for profit targets
  • 161.8% extension for major moves
  • Use when price breaks key retracement levels

Options Greeks Management

Delta: Measure of price sensitivity

  • Higher delta = More responsive to price moves
  • Lower delta = Less responsive, cheaper options

Theta: Time decay

  • Avoid high theta options close to expiration
  • Factor in weekends and holidays

Vega: Volatility sensitivity

  • High vega = More affected by volatility changes
  • Low vega = Less affected by volatility

Advanced Entry Techniques

Scaling In:

  • Enter 50% position at first Fibonacci touch
  • Add remaining 50% on confirmation

Layered Entries:

  • Multiple small entries around Fibonacci levels
  • Average better entry price

Market Regime Awareness

Trending Markets:

  • Fibonacci retracements work better
  • Trend-following strategies excel
  • Breakouts more reliable

Range-Bound Markets:

  • Mean reversion strategies work better
  • Support and resistance more reliable
  • Breakouts often fail

High Volatility Markets:

  • Wider stops needed
  • Smaller position sizes
  • More frequent whipsaws

Economic Calendar Integration

High-Impact Events:

  • FOMC meetings and announcements
  • Non-farm payrolls
  • CPI/inflation data
  • GDP releases

Event Trading Strategy:

  • Avoid trading 30 minutes before/after major announcements
  • Reduce position sizes on event days
  • Have predetermined exit plan for unexpected news

Final Checklist

Before Every Trade

  • 30-minute bias determined
  • Fibonacci levels clearly marked
  • Volume analysis completed
  • Risk amount predetermined (1-3% of account)
  • Exit strategy defined
  • Economic calendar checked
  • Alerts set for key levels

During the Trade

  • Monitor volume for confirmation
  • Stick to predetermined exit rules
  • Avoid emotional decisions
  • Don't modify stops to avoid losses
  • Take notes on market behavior

After Every Trade

  • Record entry/exit in trade journal
  • Analyze what worked/didn't work
  • Calculate actual risk/reward
  • Update running P&L
  • Plan improvements for next trade

Emergency Procedures

If Technology Fails

  1. Have backup platform ready
  2. Know how to exit positions by phone
  3. Keep broker phone number accessible
  4. Have mobile trading app installed

If Account Hits Daily Loss Limit

  1. Stop trading immediately
  2. Close all open positions
  3. Analyze what went wrong
  4. Don't trade again until next day
  5. Review and adjust strategy if needed

If Market Conditions Change Dramatically

  1. Exit all positions quickly
  2. Wait for conditions to stabilize
  3. Reassess market regime
  4. Adjust strategy accordingly
  5. Start with smaller positions when resuming

Success Metrics

Daily Metrics

  • Win rate (aim for >50%)
  • Average win vs average loss (aim for 1.5:1 or better)
  • Maximum drawdown (keep under 5% daily)
  • Number of trades (quality over quantity)

Weekly/Monthly Metrics

  • Overall profitability
  • Sharpe ratio (risk-adjusted returns)
  • Maximum consecutive losses
  • Strategy adherence rate
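
If the trade journal is just a list of per-trade P&L values, most of these numbers fall out of a short script. A minimal sketch (hypothetical journal data):

# Sketch: compute win rate, average win vs. average loss, and max consecutive
# losses from a list of per-trade P&L values (one trading day or week).

def trade_metrics(pnl: list[float]) -> dict[str, float]:
    wins = [p for p in pnl if p > 0]
    losses = [p for p in pnl if p < 0]
    streak = max_streak = 0
    for p in pnl:
        streak = streak + 1 if p < 0 else 0
        max_streak = max(max_streak, streak)
    avg_win = sum(wins) / len(wins) if wins else 0.0
    avg_loss = abs(sum(losses) / len(losses)) if losses else 0.0
    return {
        "win_rate": len(wins) / len(pnl) if pnl else 0.0,
        "avg_win_vs_avg_loss": avg_win / avg_loss if avg_loss else float("nan"),
        "max_consecutive_losses": max_streak,
    }

# Example journal: +120, -80, +150, -60, -70, +200
print(trade_metrics([120, -80, 150, -60, -70, 200]))
# win_rate 0.5, avg win ~156.7 vs avg loss 70 -> ratio ~2.24, max losing streak 2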

Continuous Improvement

  • Monthly strategy review
  • Identify patterns in losses
  • Refine entry/exit criteria
  • Adapt to changing market conditions
  • Stay educated on market developments

Remember: This strategy is a framework for disciplined trading. Market conditions change, and adaptation is key to long-term success. Always prioritize risk management over profit maximization.

Add SSL To Kubernetes Deployment

· 2 min read
Femi Adigun
Senior Software Engineer & Coach

A step-by-step guide to setting up HTTPS with cert-manager and Let's Encrypt on your Kubernetes cluster using the NGINX Ingress Controller.


📘 Full Guide: Enabling HTTPS with cert-manager + Let’s Encrypt

🧱 Prerequisites

  • Kubernetes cluster (e.g., Linode LKE)
  • NGINX Ingress Controller installed via Helm
  • A DNS A record pointing your domain to your LoadBalancer IP (mine is demo.zemo.app)
  • Helm installed and configured

1️⃣ Install cert-manager

kubectl create namespace cert-manager
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/latest/download/cert-manager.yaml

Verify:

kubectl get pods -n cert-manager

2️⃣ Create a ClusterIssuer

Create k8s/clusterissuer.yaml:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: your-email@example.com
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
      - http01:
          ingress:
            class: nginx

Apply:

kubectl apply -f k8s/clusterissuer.yaml

3️⃣ Patch NGINX Config to Allow ACME Challenge

A. Confirm NGINX is installed via Helm

helm list -n default

B. Patch the ConfigMap

kubectl patch configmap ingress-nginx-controller -n default \
--type merge \
-p '{"data":{"strict-validate-path-type":"false"}}'

Restart the controller:

kubectl rollout restart deployment ingress-nginx-controller -n default

4️⃣ Disable NGINX Admission Webhook (if blocking cert-manager)

helm upgrade ingress-nginx ingress-nginx/ingress-nginx \
--namespace default \
--set controller.admissionWebhooks.enabled=false

Confirm webhook is removed:

kubectl get validatingwebhookconfiguration

5️⃣ Create Your Ingress Resource

Create k8s/ingress.yaml:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: zemo-app-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/ssl-redirect: "false"
    cert-manager.io/http01-edit-in-place: "true"
spec:
  ingressClassName: nginx
  rules:
    - host: demo.zemo.app
      http:
        paths:
          - path: /health
            pathType: Prefix
            backend:
              service:
                name: zemo-app-service
                port:
                  number: 80
          - path: /
            pathType: Prefix
            backend:
              service:
                name: zemo-app-service
                port:
                  number: 80
  tls:
    - hosts:
        - demo.zemo.app
      secretName: zemo-app-tls

Apply:

kubectl apply -f k8s/ingress.yaml

6️⃣ Monitor Certificate Issuance

Check certificate status:

kubectl get certificate
kubectl describe certificate zemo-app-tls

Check challenge:

kubectl get challenge
kubectl describe challenge <name>

7️⃣ Final Steps

Once the certificate is issued (READY: True):

  • Remove ssl-redirect: "false" from annotations
  • Optionally add:
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
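
For reference, a sketch of how the annotations block might look after this change (fragment only; keep the rest of the manifest as-is):

metadata:
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
    cert-manager.io/cluster-issuer: letsencrypt-prod
    cert-manager.io/http01-edit-in-place: "true"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"   # redirect HTTP -> HTTPS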

Reapply Ingress:

kubectl apply -f k8s/ingress.yaml

Visit:

https://demo.zemo.app

🧠 Optional Enhancements

  • Use staging issuer for testing:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
  • Add cert-manager.io/issue-temporary-certificate: "true" to serve HTTPS while waiting
  • Automate cert renewal and Ingress updates via GitOps or CI/CD
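
A staging ClusterIssuer is identical to the production one apart from its name and the ACME server URL; a sketch (reference it from the Ingress with cert-manager.io/cluster-issuer: letsencrypt-staging):

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: your-email@example.com
    privateKeySecretRef:
      name: letsencrypt-staging
    solvers:
      - http01:
          ingress:
            class: nginx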

Dockerize A Fullstack App on VPS

· 3 min read
Femi Adigun
Senior Software Engineer & Coach

Deploy A Fullstack App to VPS with Docker

Update your droplet

sudo apt update && sudo apt upgrade -y

Pull the frontend and backend code into the /opt directory, i.e. /opt/backend and /opt/frontend

Install Docker and certbot


sudo apt-get update

# Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sh get-docker.sh

# Docker Compose
apt-get update
apt-get install docker-compose-plugin
# or install the standalone package instead:
sudo apt install -y docker-compose

# Certbot
apt-get install certbot

Verify Docker

docker --version
docker-compose --version

Map an A record in your domain's DNS to the droplet's IP address

Enable Docker to run on boot

sudo systemctl enable --now docker

Get SSL certificates before running the Docker containers.

# Stop any running web servers first
systemctl stop nginx # if nginx is running

# Get certs for both domains
certbot certonly --standalone -d yourdomain.com
certbot certonly --standalone -d api.yourdomain.com

Set up backend


cd /opt/backend

git pull origin main

# Back on the droplet:
# Create/edit .env file
nano .env
# Add your environment variables:
POSTGRES_USER=youruser
POSTGRES_PASSWORD=yourpassword
POSTGRES_DB=yourdb
DOMAIN=yourdomain.com
CORS_ORIGINS=https://yourdomain.com

# Start the backend services
docker-compose up -d

Here is the Dockerfile for the backend application


FROM python:3.11.1-slim

# Set environment variables
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

WORKDIR /app

# Install system dependencies
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
        gcc \
        libpq-dev \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the FastAPI application code
COPY . .

# Expose port
EXPOSE 8000

# Set the entry point
CMD ["uvicorn", "api.app.main:app", "--host", "0.0.0.0", "--port", "8000"]

Here is the Docker Compose file for NGINX, the frontend, the backend, and the database. Make sure the environment variables are correctly captured in the .env file.


version: "3"

services:
nginx:
image: nginx:alpine
ports:
- "80:80"
- "443:443"
volumes:
- ./nginx.conf:/etc/nginx/conf.d/default.conf
- /etc/letsencrypt:/etc/letsencrypt
depends_on:
- frontend
- app

frontend:
build: ../frontend # Adjust this path to your frontend directory
environment:
- NEXT_PUBLIC_API_URL=https://api.${DOMAIN}
- NEXT_PUBLIC_DOMAIN=${DOMAIN}
depends_on:
- app

db:
env_file: .env
image: postgres
environment:
POSTGRES_USER: ${POSTGRES_USER}
POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
POSTGRES_DB: ${POSTGRES_DB}
volumes:
- db-data:/var/lib/postgresql/data

app:
build: .
environment:
DATABASE_URL: postgresql://${POSTGRES_USER}:${POSTGRES_PASSWORD}@db:5432/${POSTGRES_DB}
DOMAIN: ${DOMAIN}
depends_on:
- db

volumes:
db-data:
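
The frontend service above builds from ../frontend, whose Dockerfile isn't shown here. Since the compose file passes NEXT_PUBLIC_* variables, a minimal sketch for a Next.js frontend could look like this (an assumption; adjust it to your actual framework and build output):

# Sketch: minimal Next.js frontend image (an assumption about the framework)
FROM node:20-alpine

WORKDIR /app

# Install dependencies first so Docker can cache this layer
COPY package*.json ./
RUN npm ci

# Copy the source and build; NEXT_PUBLIC_* values are baked in at build time,
# so pass them as build args if they differ per environment
COPY . .
RUN npm run build

# Next.js listens on 3000, which matches the nginx proxy_pass target
EXPOSE 3000
CMD ["npm", "run", "start"]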


Here is the NGINX configuration (nginx.conf) for SSL termination and request routing.


# Note: plain NGINX does not substitute environment variables in config files
# mounted into conf.d, so replace ${DOMAIN} below with your actual domain
# (or render this file with envsubst before mounting it).

server {
    listen 80;
    server_name ${DOMAIN};
    return 301 https://$server_name$request_uri;
}

server {
    listen 443 ssl;
    server_name ${DOMAIN};

    ssl_certificate /etc/letsencrypt/live/${DOMAIN}/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/${DOMAIN}/privkey.pem;

    # Frontend
    location / {
        proxy_pass http://frontend:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }

    # Backend API
    location /api {
        proxy_pass http://app:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

Create deployment script

#!/bin/bash
# deploy.sh

# Pull latest changes
git pull

# Build and restart containers
docker-compose down
docker-compose build
docker-compose up -d

# Check logs
docker-compose logs -f

Create SSL renewal script

# Create renewal script (the compose file lives in /opt/backend)
echo "#!/bin/bash
certbot renew
cd /opt/backend && docker-compose restart nginx" > /root/renew-certs.sh

chmod +x /root/renew-certs.sh

# Add to crontab (midnight on the 1st of each month; certbot recommends attempting renewal more often, e.g. twice daily)
(crontab -l 2>/dev/null; echo "0 0 1 * * /root/renew-certs.sh") | crontab -