
Highly Available System Design

· 16 min read
Femi Adigun
Senior Software Engineer & Coach

Executive Summary

This document outlines the architecture for a globally distributed, highly available ordering platform designed to serve millions of users with 99.99% uptime, fault tolerance, and resilience.

Key Metrics:

  • Target Availability: 99.99% (~52.6 minutes of downtime/year)
  • Global Users: Millions
  • Order Processing: Real-time with eventual consistency
  • Recovery Time Objective (RTO): < 1 minute
  • Recovery Point Objective (RPO): < 5 minutes

1. High-Level Architecture

┌─────────────────────────────────────────────────────────────────┐
│ Global Users │
└─────────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────────┐
│ Global CDN + DDoS Protection │
│ (CloudFlare / Akamai) │
└─────────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────────┐
│ Global Load Balancer (DNS-based) │
│ Route by: Geography, Health, Latency │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────┼─────────────────────┐
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Region 1 │ │ Region 2 │ │ Region 3 │
│ (US-EAST) │ │ (EU-WEST) │ │ (ASIA-PAC) │
└──────────────┘ └──────────────┘ └──────────────┘

2. Regional Architecture (Per Region)

Each region is fully self-contained and can operate independently:

┌────────────────────────────────────────────────────────────┐
│ Regional Load Balancer │
│ (AWS ALB / Azure App Gateway) │
└────────────────────────────────────────────────────────────┘

┌───────────────────┼───────────────────┐
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ AZ-1 │ │ AZ-2 │ │ AZ-3 │
│ API Gateway │ │ API Gateway │ │ API Gateway │
└──────────────┘ └──────────────┘ └──────────────┘
│ │ │
└───────────────────┼───────────────────┘

┌────────────────────────────────────────────────────────────┐
│ Service Mesh (Istio) │
│ + Circuit Breakers │
└────────────────────────────────────────────────────────────┘

3. Microservices Layer

3.1 Core Services Architecture

┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│ User Service │ │ Auth Service │ │ Catalog Service │
│ (3+ instances) │ │ (3+ instances) │ │ (3+ instances) │
└─────────────────┘ └─────────────────┘ └─────────────────┘

┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Order Service │ │ Payment Service │ │Inventory Service│
│ (5+ instances) │ │ (5+ instances) │ │ (5+ instances) │
└─────────────────┘ └─────────────────┘ └─────────────────┘

┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│Notification Svc │ │Fulfillment Svc │ │ Analytics Svc │
│ (3+ instances) │ │ (3+ instances) │ │ (3+ instances) │
└─────────────────┘ └─────────────────┘ └─────────────────┘

3.2 Service Characteristics

Order Service (Critical Path):

  • Horizontally scalable with auto-scaling (5-100 instances)
  • Stateless design
  • Circuit breaker pattern for downstream dependencies
  • Retry logic with exponential backoff
  • Request timeout: 3 seconds
  • Bulkhead pattern to isolate critical operations

Payment Service (Critical Path):

  • Idempotent operations (prevent double charging)
  • Transaction log for audit trail
  • Saga pattern for distributed transactions
  • PCI-DSS compliant
  • Rate limiting per user/IP
  • Fallback to queued processing if gateway unavailable

Inventory Service:

  • Optimistic locking for inventory updates
  • Real-time inventory with eventual consistency
  • Cache-aside pattern with Redis
  • Event sourcing for inventory changes
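The optimistic-locking bullet can be sketched as a compare-and-swap on a version number. A minimal in-memory Python sketch, where the dict stands in for the inventory table and `try_update` plays the role of an `UPDATE ... SET version = version + 1 WHERE sku = ? AND version = ?` statement (all names here are illustrative):

```python
import threading

class InventoryStore:
    """In-memory stand-in for the inventory table: each SKU carries a version,
    and an update commits only if the version is unchanged since the read."""

    def __init__(self):
        self._lock = threading.Lock()  # simulates the atomicity of one UPDATE
        self._rows = {}                # sku -> (quantity, version)

    def put(self, sku, quantity):
        self._rows[sku] = (quantity, 0)

    def get(self, sku):
        return self._rows[sku]

    def try_update(self, sku, new_quantity, expected_version):
        with self._lock:
            quantity, version = self._rows[sku]
            if version != expected_version:
                return False  # a concurrent writer got there first
            self._rows[sku] = (new_quantity, version + 1)
            return True

def reserve(store, sku, amount, max_retries=3):
    """Read-modify-write with optimistic retry instead of holding a lock."""
    for _ in range(max_retries):
        quantity, version = store.get(sku)
        if quantity < amount:
            return False  # insufficient stock
        if store.try_update(sku, quantity - amount, version):
            return True
    raise RuntimeError("too much write contention")
```

On a version mismatch the caller simply re-reads and retries, which is why this suits read-mostly inventory updates better than pessimistic locking.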

4. Data Layer Architecture

4.1 Multi-Region Database Strategy

Primary Databases (Per Region):
┌─────────────────────────────────────────────────────┐
│ PostgreSQL / Aurora Global Database │
│ │
│ Region 1 (PRIMARY) → Region 2 (READ REPLICA) │
│ ↓ ↓ │
│ Multi-AZ Setup Multi-AZ Setup │
│ - Master (AZ-1) - Replica (AZ-1) │
│ - Standby (AZ-2) - Replica (AZ-2) │
│ - Replica (AZ-3) - Replica (AZ-3) │
└─────────────────────────────────────────────────────┘

Replication Lag Target: < 1 second

4.2 Database Sharding Strategy

Sharding Key: User ID (consistent hashing)

Shard Distribution:

  • 16 logical shards per region
  • Each shard has 3 physical replicas (across AZs)
  • Allows horizontal scaling to 64, 128, 256 shards

Data Partitioning:

  • Users: Sharded by user_id
  • Orders: Sharded by user_id (co-located with user data)
  • Products: Replicated across all shards (read-heavy)
  • Inventory: Sharded by product_id with cache layer
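The user_id routing above can be illustrated with a stable hash. A sketch using the 16-shard count from this section (SHA-256 and the helper name are illustrative choices, not prescribed by the design):

```python
import hashlib

NUM_SHARDS = 16  # logical shards per region, as specified above

def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Stable shard assignment. Uses SHA-256 rather than Python's built-in
    hash(), which is salted per process and therefore not stable."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Orders are routed by their owner's user_id, so they co-locate with the user.
user_shard = shard_for("user-42")
order_shard = shard_for("user-42")
```

A convenient property of the power-of-two shard counts (16 → 32 → 64 → 128 → 256): since `shard_for(k, 32) % 16 == shard_for(k, 16)`, doubling splits each shard exactly in two, so each key either stays put or moves to one known new shard. A full consistent-hash ring with virtual nodes generalizes this beyond powers of two.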

4.3 Caching Strategy

┌─────────────────────────────────────────────────────┐
│ Redis Cluster (Per Region) │
│ │
│ Cache Tier 1: User Sessions (TTL: 30 min) │
│ Cache Tier 2: Product Catalog (TTL: 5 min) │
│ Cache Tier 3: Inventory Counts (TTL: 30 sec) │
│ Cache Tier 4: Hot Order Data (TTL: 10 min) │
│ │
│ Configuration: │
│ - 6 nodes per region (2 per AZ) │
│ - Clustering mode enabled │
│ - Automatic failover │
│ - Backup to S3 every 6 hours │
└─────────────────────────────────────────────────────┘

Cache Invalidation Strategy:

  • Write-through for critical data (orders, payments)
  • Cache-aside for read-heavy data (products, users)
  • Event-driven invalidation via message queue
  • Lazy expiration with active monitoring
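The cache-aside pattern above can be sketched in a few lines. In this illustrative Python sketch a dict stands in for Redis and another for PostgreSQL; writes invalidate rather than update the cached copy, and event-driven invalidation would publish the delete instead of calling it inline:

```python
import time

class CacheAside:
    """Cache-aside sketch: reads try the cache first and fall back to the DB,
    populating the cache with a TTL; writes go to the DB and invalidate."""

    def __init__(self, db, ttl_seconds):
        self.db = db          # stand-in for the primary database
        self.ttl = ttl_seconds
        self.cache = {}       # stand-in for Redis: key -> (value, expires_at)

    def get(self, key):
        entry = self.cache.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]   # cache hit
        value = self.db[key]  # cache miss: read through to the DB
        self.cache[key] = (value, time.monotonic() + self.ttl)
        return value

    def put(self, key, value):
        self.db[key] = value
        self.cache.pop(key, None)  # invalidate, don't update in place
```

Invalidating instead of updating on write avoids a race where a concurrent reader repopulates the cache with a stale value after the writer's update.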

5. Event-Driven Architecture

5.1 Message Queue Infrastructure

┌─────────────────────────────────────────────────────┐
│ Apache Kafka / Amazon MSK (Per Region) │
│ │
│ Topics: │
│ - order.created (partitions: 50) │
│ - order.confirmed (partitions: 50) │
│ - payment.processed (partitions: 30) │
│ - inventory.updated (partitions: 40) │
│ - notification.email (partitions: 20) │
│ - notification.sms (partitions: 20) │
│ - analytics.events (partitions: 100) │
│ │
│ Configuration: │
│ - Replication Factor: 3 │
│ - Min In-Sync Replicas: 2 │
│ - Retention: 7 days │
│ - Cross-region replication for critical topics │
└─────────────────────────────────────────────────────┘

5.2 Order Processing Flow

1. User Places Order

2. Order Service validates → Publishes "order.created"

3. Multiple Consumers:
- Inventory Service (reserves items)
- Payment Service (processes payment)
- Notification Service (confirms to user)
- Analytics Service (tracks metrics)

4. Saga Coordinator monitors completion

5. If all succeed → Publish "order.confirmed"
If any fail → Publish compensating events

6. Fulfillment Service picks up confirmed orders

Benefits:

  • Decoupling of services
  • Async processing reduces latency
  • Natural retry mechanism
  • Event log for debugging
  • Scalable consumer groups

6. Resilience Patterns

6.1 Circuit Breaker Implementation

Circuit States:
- CLOSED: Normal operation, requests flow through
- OPEN: Failure threshold exceeded, fail fast
- HALF_OPEN: Testing if service recovered

Configuration (per service):
- Failure Threshold: 50% of requests in 10 seconds
- Timeout: 3 seconds
- Half-Open Retry: After 30 seconds
- Success Threshold: 3 consecutive successes to close
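The state machine and thresholds above can be sketched as follows. This is an illustrative Python sketch (in production, libraries such as resilience4j or the service mesh's outlier detection provide this); the sliding window is simplified to a list of recent outcomes, and the clock is injectable for testing:

```python
import time

class CircuitBreaker:
    """States: CLOSED (normal), OPEN (fail fast), HALF_OPEN (probing)."""

    def __init__(self, failure_ratio=0.5, window=10.0, cooldown=30.0,
                 close_after=3, min_samples=4, clock=time.monotonic):
        self.failure_ratio = failure_ratio  # 50% failures trips the breaker
        self.window = window                # seconds of history considered
        self.cooldown = cooldown            # OPEN -> HALF_OPEN after this long
        self.close_after = close_after      # consecutive successes to close
        self.min_samples = min_samples      # don't trip on tiny samples
        self.clock = clock
        self.state = "CLOSED"
        self.events = []                    # (timestamp, ok) within the window
        self.opened_at = 0.0
        self.half_open_successes = 0

    def call(self, fn):
        now = self.clock()
        if self.state == "OPEN":
            if now - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            self.state = "HALF_OPEN"        # cooldown elapsed: probe again
            self.half_open_successes = 0
        try:
            result = fn()
        except Exception:
            self._record(now, ok=False)
            raise
        self._record(now, ok=True)
        return result

    def _record(self, now, ok):
        if self.state == "HALF_OPEN":
            if not ok:
                self.state, self.opened_at = "OPEN", now
                return
            self.half_open_successes += 1
            if self.half_open_successes >= self.close_after:
                self.state = "CLOSED"
                self.events.clear()
            return
        self.events = [(t, o) for t, o in self.events if now - t < self.window]
        self.events.append((now, ok))
        failures = sum(1 for _, o in self.events if not o)
        if (len(self.events) >= self.min_samples
                and failures / len(self.events) >= self.failure_ratio):
            self.state, self.opened_at = "OPEN", now
```

While OPEN, callers fail fast instead of queueing behind a dying dependency, which is what prevents cascading failures.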

6.2 Retry Strategy

Exponential Backoff with Jitter:

Attempt 1: Immediate
Attempt 2: 100ms + random(0-50ms)
Attempt 3: 200ms + random(0-100ms)
Attempt 4: 400ms + random(0-200ms)
Attempt 5: 800ms + random(0-400ms)
Max Attempts: 5
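The schedule above can be sketched as a small wrapper; this illustrative Python version sleeps `base * 2**(n-1)` plus up to 50% jitter after the n-th failure, matching the table (the `sleep` parameter is injectable so tests don't actually wait):

```python
import random
import time

def call_with_retry(fn, base=0.1, max_attempts=5, sleep=time.sleep):
    """Attempt 1 runs immediately; after the n-th failure, back off
    base * 2**(n-1) seconds plus random jitter up to half the backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            backoff = base * 2 ** (attempt - 1)
            sleep(backoff + random.uniform(0, backoff / 2))
```

The jitter matters: without it, clients that failed together retry together, hammering the recovering service in synchronized waves.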

Idempotency Keys:

  • All write operations require idempotency key
  • Stored for 24 hours to detect duplicates
  • Ensures safe retries without side effects

6.3 Bulkhead Pattern

Resource Isolation:

  • Separate thread pools for different operation types
  • Critical operations: 60% of resources
  • Non-critical operations: 30% of resources
  • Admin operations: 10% of resources

Rate Limiting:

  • Per-user: 100 requests/minute
  • Per-IP: 1000 requests/minute
  • Global: 1M requests/second per region
  • Token bucket algorithm with Redis
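The token bucket algorithm can be sketched as follows. In this illustrative Python sketch the bucket state is process-local; in the architecture above it would live in Redis (typically updated atomically via a Lua script) so all instances share one per-user budget. The clock is injectable for testing:

```python
import time

class TokenBucket:
    """Token bucket: `capacity` tokens, refilled at `rate` tokens/second.
    Bursts up to `capacity` are allowed; sustained load is capped at `rate`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start full
        self.clock = clock
        self.last = clock()

    def allow(self, cost=1):
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A per-user limit of 100 requests/minute maps to `TokenBucket(rate=100/60, capacity=100)`.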

6.4 Timeout Strategy

Service Timeouts (Cascading):
- Gateway → Service: 5 seconds
- Service → Service: 3 seconds
- Service → Database: 2 seconds
- Service → Cache: 500ms
- Service → External API: 10 seconds
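One way to keep these cascading timeouts consistent is to propagate a deadline: the gateway sets the overall budget, and each hop takes the remaining time capped by its own per-hop limit. A minimal sketch (the `Deadline` class is illustrative; gRPC deadlines and context cancellation in Go provide the same mechanism natively):

```python
import time

class Deadline:
    """Overall request budget; each downstream hop asks for the remaining
    time, optionally capped by its own per-hop timeout."""

    def __init__(self, budget_seconds, clock=time.monotonic):
        self.clock = clock
        self.expires = clock() + budget_seconds

    def remaining(self, cap=None):
        left = self.expires - self.clock()
        if left <= 0:
            raise TimeoutError("deadline exceeded")
        return min(left, cap) if cap is not None else left

# Gateway receives a request with a 5 s budget; the DB hop gets at most 2 s,
# or less if the upstream hops have already consumed part of the budget.
```

This prevents the pathological case where a downstream call is granted its full 2 s even though the caller's own 5 s budget has nearly expired.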

7. Disaster Recovery & High Availability

7.1 Multi-Region Failover Strategy

Active-Active Configuration:

  • All regions actively serve traffic
  • DNS-based routing with health checks
  • Automatic failover in < 30 seconds

Failover Procedure:

1. Health Check Failure Detected
- 3 consecutive failures in 15 seconds

2. DNS Update Triggered
- Remove failed region from DNS
- TTL: 60 seconds

3. Traffic Rerouted
- Users automatically routed to healthy regions
- No manual intervention required

4. Alert Engineering Team
- PagerDuty/OpsGenie notification
- Automated runbook execution

5. Failed Region Investigation
- Automated diagnostics
- Log aggregation and analysis

6. Recovery and Validation
- Gradual traffic restoration (10% → 50% → 100%)
- Synthetic transaction testing

7.2 Data Backup Strategy

Automated Backups:

  • Database: Continuous backup with PITR (Point-in-Time Recovery)
  • Snapshots every 6 hours to S3/Glacier
  • Cross-region replication of backups
  • 30-day retention for operational backups
  • 7-year retention for compliance backups

Testing:

  • Monthly disaster recovery drills
  • Quarterly regional failover tests
  • Backup restoration tests every week

7.3 Chaos Engineering

Automated Fault Injection:

  • Random pod termination (5% daily)
  • Network latency injection (200ms-2s)
  • Service dependency failure simulation
  • Database connection pool exhaustion
  • Cache cluster node failures

GameDays (Quarterly):

  • Simulated regional outage
  • Database failover scenarios
  • Multi-service cascading failures
  • Payment gateway unavailability

8. Monitoring & Observability

8.1 Metrics Collection

Infrastructure Metrics:

  • CPU, Memory, Disk, Network per instance
  • Request rate, error rate, latency (RED metrics)
  • Database connection pool utilization
  • Cache hit/miss ratios
  • Queue depth and lag

Business Metrics:

  • Orders per second
  • Order success rate
  • Payment success rate
  • Average order value
  • Cart abandonment rate
  • Time to checkout

Tools:

  • Prometheus for metrics collection
  • Grafana for visualization
  • Custom dashboards per service and region

8.2 Distributed Tracing

Implementation:

  • OpenTelemetry for instrumentation
  • Jaeger/Tempo for trace storage
  • Trace every order through the system
  • Correlation IDs in all logs
  • Service mesh automatic tracing

Key Traces:

  • Order placement (end-to-end)
  • Payment processing
  • Inventory reservation
  • Cross-service calls

8.3 Logging Strategy

Centralized Logging:

  • ELK Stack (Elasticsearch, Logstash, Kibana) or DataDog
  • Structured JSON logging
  • Log levels: ERROR, WARN, INFO, DEBUG
  • Retention: 30 days hot, 180 days warm, 365 days cold

Log Aggregation:

  • Application logs
  • Access logs
  • Audit logs (immutable)
  • Security logs
  • Database query logs (slow queries)

8.4 Alerting Strategy

Alert Levels:

Critical (Page immediately):

  • Service availability < 99.9%
  • Error rate > 5%
  • Payment failure rate > 1%
  • Database replication lag > 10 seconds
  • Regional outage detected

High (Page during business hours):

  • Error rate > 1%
  • Response time p99 > 2 seconds
  • Cache hit rate < 80%
  • Queue lag > 5 minutes

Medium (Slack/Email):

  • Error rate > 0.5%
  • Disk usage > 75%
  • Memory usage > 80%
  • API rate limit approaching

Alert Routing:

  • PagerDuty for critical alerts
  • Slack for high/medium alerts
  • Weekly summary emails for trends

9. Security Architecture

9.1 Defense in Depth

Layer 1: Network Security

  • VPC with private subnets
  • Security groups (whitelist approach)
  • NACLs for additional filtering
  • WAF (Web Application Firewall) at edge
  • DDoS protection (Cloudflare/AWS Shield)

Layer 2: Application Security

  • OAuth 2.0 + JWT for authentication
  • RBAC (Role-Based Access Control)
  • API rate limiting per user/IP
  • Input validation and sanitization
  • SQL injection prevention (parameterized queries)
  • XSS protection headers
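The SQL-injection bullet comes down to one rule: bind user input as a parameter, never interpolate it into the SQL string. A sketch using the stdlib `sqlite3` driver as a stand-in for the PostgreSQL client (table and values are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'a@example.com')")

def find_user(email):
    # The driver sends the SQL and the value separately; the input can
    # never change the query's structure.
    return conn.execute(
        "SELECT id FROM users WHERE email = ?", (email,)
    ).fetchone()

assert find_user("a@example.com") == (1,)
assert find_user("' OR '1'='1") is None  # injection attempt matches nothing
```

The equivalent string-formatted query (`f"... WHERE email = '{email}'"`) would have returned every row for the second input.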

Layer 3: Data Security

  • Encryption at rest (AES-256)
  • Encryption in transit (TLS 1.3)
  • Database column-level encryption for PII
  • Key rotation every 90 days
  • HSM for payment data

Layer 4: Compliance

  • PCI-DSS Level 1 compliance
  • GDPR compliance (data residency, right to deletion)
  • SOC 2 Type II certification
  • Regular penetration testing
  • Vulnerability scanning (weekly)

9.2 Secrets Management

  • HashiCorp Vault or AWS Secrets Manager
  • Secrets rotation every 30 days
  • No secrets in code or environment variables
  • Service accounts with minimal permissions
  • Audit log of all secret access

10. Scalability Strategy

10.1 Horizontal Scaling

Auto-Scaling Policies:

Scale-Out Triggers:

  • CPU > 70% for 3 minutes
  • Memory > 80% for 3 minutes
  • Request queue depth > 100
  • Response time p95 > 1 second

Scale-In Triggers:

  • CPU < 30% for 10 minutes
  • Memory < 50% for 10 minutes
  • Connection draining (2-minute grace period)

Limits:

  • Min instances: 3 per service per AZ
  • Max instances: 100 per service per AZ
  • Scale-out: +50% of current capacity
  • Scale-in: -25% of current capacity (gradual)

10.2 Database Scaling

Read Scaling:

  • Read replicas (5-10 per region)
  • Connection pooling (PgBouncer)
  • Read/write splitting at application layer
  • Cache-first strategy

Write Scaling:

  • Sharding by user_id
  • Batch writes where possible
  • Async writes for non-critical data
  • Queue-based write buffering

10.3 Global Capacity Planning

Current Capacity (per region):

  • 100,000 orders per second
  • 5 million concurrent users
  • 500 TB storage
  • 10 Gbps network egress

Scaling Roadmap:

  • Add region when hitting 70% capacity
  • Shard databases when write load > 50K TPS
  • Add cache nodes proactively (before hit rate drops)

11. Performance Optimization

11.1 Latency Targets

Operation             Target    P50      P95      P99
─────────────────────────────────────────────────────────
Order Placement       < 500ms   300ms    450ms    500ms
Product Search        < 200ms   100ms    180ms    200ms
Cart Update           < 100ms   50ms     90ms     100ms
Payment Processing    < 2s      1.2s     1.8s     2s
Order History         < 300ms   150ms    250ms    300ms

11.2 Optimization Techniques

Frontend:

  • CDN for static assets (99% cache hit rate)
  • HTTP/2 and HTTP/3
  • Lazy loading images
  • Code splitting
  • Service workers for offline capability

Backend:

  • Database query optimization (indexed queries)
  • Connection pooling
  • Response compression (gzip/brotli)
  • API response pagination
  • GraphQL for flexible queries

Network:

  • Keep-alive connections
  • Connection multiplexing
  • Regional edge locations
  • Anycast IP routing

12. Order Consistency Guarantees

12.1 ACID vs BASE Trade-offs

ACID Operations (Strong Consistency):

  • Payment transactions
  • Inventory deduction
  • Order status updates
  • User account balance

BASE Operations (Eventual Consistency):

  • Product catalog updates
  • Analytics and reporting
  • Notification delivery
  • Search index updates

12.2 Distributed Transaction Pattern (SAGA)

Order Saga Flow:

1. Create Order (Compensate: Cancel Order)

2. Reserve Inventory (Compensate: Release Inventory)

3. Process Payment (Compensate: Refund Payment)

4. Update Order Status (Compensate: Revert Status)

5. Send Confirmation (No Compensation)

Saga Coordinator:
- Tracks saga state in database
- Executes compensating transactions on failure
- Ensures eventual consistency
- Idempotent operations for safe retries
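The orchestration described above can be sketched as a coordinator that runs steps in order and, on failure, executes the compensations of the completed steps in reverse. An illustrative Python sketch (a production coordinator would also persist saga state between steps, as noted above):

```python
class Saga:
    """Run steps in order; on failure, compensate completed steps in reverse.
    Each step is (name, action, compensation); compensation may be None."""

    def __init__(self, steps):
        self.steps = steps

    def run(self, ctx):
        done = []
        for name, action, compensate in self.steps:
            try:
                action(ctx)
                done.append((name, compensate))
            except Exception:
                # Unwind: compensations must themselves be idempotent,
                # since the coordinator may retry them after a crash.
                for _, comp in reversed(done):
                    if comp:
                        comp(ctx)
                return False
        return True
```

Note that the failing step's own compensation is never run: it only applies to steps that completed, which is why "Send Confirmation" needs no compensation at all.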

12.3 Idempotency Implementation

Idempotency Key Table:
- idempotency_key (PK)
- user_id
- operation_type
- request_hash
- response_data
- created_at
- expires_at (24 hours)

On Duplicate Key:
- Return cached response
- No side effects executed
- Log duplicate attempt for monitoring
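The table and duplicate-handling rules above can be sketched as a lookup wrapped around the real operation. In this illustrative Python sketch a dict stands in for the idempotency-key table (production would use a unique-keyed table or Redis with a 24-hour TTL); the request-hash check catches a key reused with a different payload:

```python
import hashlib
import json
import time

class IdempotencyStore:
    """Dict-backed sketch of the idempotency-key table above."""

    TTL = 24 * 3600  # expires_at: 24 hours, per the table definition

    def __init__(self, clock=time.time):
        self.rows = {}  # key -> (request_hash, response_data, expires_at)
        self.clock = clock

    def execute(self, key, request, handler):
        now = self.clock()
        request_hash = hashlib.sha256(
            json.dumps(request, sort_keys=True).encode()).hexdigest()
        row = self.rows.get(key)
        if row and row[2] > now:
            if row[0] != request_hash:
                raise ValueError("idempotency key reused with different payload")
            return row[1]            # duplicate: cached response, no side effects
        response = handler(request)  # first time (or expired): run the operation
        self.rows[key] = (request_hash, response, now + self.TTL)
        return response
```

A retried payment with the same key therefore returns the original response without charging again, which is exactly what makes the retry strategy in section 6.2 safe for writes.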

13. Cost Optimization

13.1 Resource Optimization

Compute:

  • Spot instances for batch jobs (70% savings)
  • Reserved instances for baseline (40% savings)
  • Right-sizing (monitoring actual usage)
  • Scheduled scaling (reduce capacity during off-peak)

Storage:

  • S3 lifecycle policies (archive old data)
  • Database storage optimization (partition pruning)
  • Compression for logs and backups
  • CDN reduces origin bandwidth costs

Network:

  • VPC endpoints (avoid NAT gateway charges)
  • Inter-region traffic over the provider's private backbone (VPC peering / Transit Gateway) rather than the public internet
  • CloudFront reduces origin requests

Estimated Monthly Cost (per region):

  • Compute: $50,000
  • Databases: $30,000
  • Storage: $15,000
  • Network: $20,000
  • Caching: $10,000
  • Message Queue: $8,000
  • Monitoring: $5,000

Total: ~$138,000/region (3 regions ≈ $414,000/month)

14. Deployment Strategy

14.1 CI/CD Pipeline

1. Code Commit (GitHub)

2. Automated Tests
- Unit tests
- Integration tests
- Security scanning

3. Build Container Image
- Docker build
- Tag with version
- Push to registry

4. Deploy to Staging
- Terraform/CloudFormation
- Kubernetes rollout

5. Automated Testing (Staging)
- Smoke tests
- Load tests
- E2E tests

6. Manual Approval

7. Blue-Green Deployment (Production)
- Deploy to green environment
- Run health checks
- Shift traffic gradually (10% → 50% → 100%)
- Monitor for errors

8. Rollback Capability
- Instant rollback if errors detected
- Automated rollback on critical alerts

14.2 Zero-Downtime Deployment

Rolling Update Strategy:

  • Update one AZ at a time
  • Wait 10 minutes between AZs
  • Health check validation after each update
  • Automatic rollback on failure

Database Migrations:

  • Backward-compatible changes only
  • Shadow writes to new schema
  • Gradual read cutover
  • Old schema support for 2 releases

15. Technical Stack Summary

15.1 Core Technologies

Frontend:

  • React/Next.js
  • TypeScript
  • Mobile: React Native or Native (iOS/Android)

Backend:

  • Language: Java/Go/Node.js (polyglot)
  • API: GraphQL + REST
  • Framework: Spring Boot / Express

Infrastructure:

  • Cloud: AWS/GCP/Azure (multi-cloud)
  • Orchestration: Kubernetes (EKS/GKE/AKS)
  • Service Mesh: Istio
  • IaC: Terraform

Data:

  • Primary DB: PostgreSQL / Aurora
  • Cache: Redis Cluster
  • Search: Elasticsearch
  • Message Queue: Kafka / MSK
  • Object Storage: S3 / GCS

Monitoring:

  • Metrics: Prometheus + Grafana
  • Logging: ELK Stack / DataDog
  • Tracing: Jaeger / Tempo
  • APM: New Relic / DataDog

16. Risk Mitigation

16.1 Identified Risks and Mitigations

Risk                       Impact     Probability   Mitigation
──────────────────────────────────────────────────────────────────────────────
Regional AWS outage        High       Low           Multi-region active-active
DDoS attack                High       Medium        Cloudflare + rate limiting
Database corruption        High       Low           Continuous backups + PITR
Payment gateway down       High       Medium        Multiple payment providers
Data breach                Critical   Low           Encryption + security monitoring
Code bug causing outages   Medium     Medium        Automated testing + gradual rollout
Cache failure              Medium     Low           Cache-aside pattern + DB fallback
Human error                Medium     Medium        IaC + peer reviews + access controls

17. Future Enhancements

17.1 Roadmap

Q1: Enhanced Observability

  • AI-powered anomaly detection
  • Predictive scaling
  • Automated root cause analysis

Q2: Global Expansion

  • Add 2 more regions (South America, Middle East)
  • Edge computing for ultra-low latency
  • Regional data residency compliance

Q3: Advanced Features

  • Machine learning for fraud detection
  • Personalized recommendations engine
  • Real-time inventory optimization

Q4: Cost Optimization

  • FinOps implementation
  • Multi-cloud arbitrage
  • Serverless migration for burst workloads

18. Conclusion

This architecture provides:

  • High Availability: 99.99% uptime through multi-region, multi-AZ deployment
  • Fault Tolerance: Circuit breakers, retries, and graceful degradation
  • Resilience: Self-healing systems and automated recovery
  • Scalability: Horizontal scaling to millions of users
  • Performance: Sub-second response times globally
  • Security: Defense-in-depth with encryption and compliance
  • Observability: Comprehensive monitoring and alerting
  • Cost Efficiency: Optimized resource utilization

Hardened through the chaos-engineering drills and disaster-recovery testing described above, the design is built to withstand common failure scenarios and to process orders reliably for millions of users globally.


Appendix A: Health Check Specifications

api_health_check:
  path: /health
  interval: 10s
  timeout: 3s
  healthy_threshold: 2
  unhealthy_threshold: 3
  checks:
    - database_connection
    - cache_connection
    - message_queue_connection
    - disk_space
    - memory_usage

deep_health_check:
  path: /health/deep
  interval: 60s
  timeout: 10s
  checks:
    - end_to_end_order_flow
    - payment_gateway_reachable
    - inventory_system_reachable
Appendix B: SLA Definitions

availability_sla:
  target: 99.99%
  measurement: per month
  exclusions:
    - scheduled_maintenance (with 7 days notice)
    - force_majeure

performance_sla:
  api_latency_p99: 500ms
  api_latency_p50: 200ms
  order_processing: 2s

support_sla:
  critical_response: 15 minutes
  high_response: 1 hour
  medium_response: 4 hours