Zero-downtime deploys with ArgoCD + Helm

A step-by-step playbook for blue/green deploys on Kubernetes with ArgoCD, Helm, and Argo Rollouts, without taking the service offline.

When to run this play

Teams moving to Kubernetes often inherit outages from handcrafted deploy scripts or big-bang releases. If deploy windows still rely on maintenance pages, or if rollback means manually re-running Helm with old values, it is time to adopt blue/green backed by ArgoCD and Argo Rollouts.

Readiness check

  • Production traffic flows through an ingress controller or service mesh that supports weighted traffic shifting (Istio, Linkerd, AWS ALB, or NGINX with canary support).
  • Helm charts define readiness and liveness probes, pin container images to explicit versions, and use values files that can be re-applied without side effects.
  • Observability stack (Prometheus, Datadog, New Relic) can surface latency, error rate, and saturation per release candidate.
  • Teams understand the incident response model and can page responders during rollout.

Core plays

  1. Model environments explicitly. Define blue and green as separate ArgoCD applications referencing the same repo. Capture infrastructure dependencies (databases, queues) in documentation so surprises do not appear mid-cutover.
  2. Codify release gates. Automate smoke tests, load checks, and error budget comparisons. Store gate definitions alongside Helm charts and make failures block promotion automatically.
  3. Adopt progressive traffic shifting. Use Argo Rollouts or service mesh routing weights (10/30/60/100). After each increment, wait for metrics to stabilize and surface dashboards directly in the deployment runbook.
  4. Plan rollback from the start. Keep blue warm until green is proven, version database migrations so they can roll forward or back safely, and script traffic flips so they are reversible without manual kubectl commands.
  5. Close with learning. Capture deployment duration, incidents avoided or triggered, and improvement ideas in a lightweight retro. Feed insights into the checklist and runbooks.
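
Plays 2 through 4 can be sketched together as a small control loop. This is a minimal illustration, not how Argo Rollouts is implemented: the names GateThresholds, Metrics, gate_passes, and progressive_cutover are hypothetical, the threshold values are placeholders, and the two callables stand in for whatever sets routing weights and reads metrics in your stack. The metrics reader is assumed to block until readings have stabilized after each increment.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class GateThresholds:
    # Placeholder limits; real values should come from your SLOs.
    max_error_rate: float = 0.01
    max_p95_latency_ms: float = 300.0


@dataclass
class Metrics:
    error_rate: float
    p95_latency_ms: float


def gate_passes(metrics: Metrics, thresholds: GateThresholds) -> bool:
    """Release gate: both error rate and latency must be within limits."""
    return (
        metrics.error_rate <= thresholds.max_error_rate
        and metrics.p95_latency_ms <= thresholds.max_p95_latency_ms
    )


def progressive_cutover(
    set_green_weight: Callable[[int], None],
    read_green_metrics: Callable[[], Metrics],
    thresholds: GateThresholds,
    steps: Sequence[int] = (10, 30, 60, 100),
) -> Tuple[bool, List[int]]:
    """Shift traffic to green step by step; flip back to blue on a failed gate.

    Returns (promoted, weights_applied). Blue is assumed to stay warm, so
    setting the green weight to 0 is the whole rollback.
    """
    applied: List[int] = []
    for weight in steps:
        set_green_weight(weight)
        applied.append(weight)
        if not gate_passes(read_green_metrics(), thresholds):
            set_green_weight(0)  # scripted, reversible traffic flip
            applied.append(0)
            return False, applied
    return True, applied


if __name__ == "__main__":
    # Dry run with fake callables: the second reading breaches the error gate.
    readings = iter([Metrics(0.001, 120.0), Metrics(0.05, 130.0)])
    promoted, weights = progressive_cutover(
        set_green_weight=lambda w: print(f"green weight -> {w}%"),
        read_green_metrics=lambda: next(readings),
        thresholds=GateThresholds(),
    )
    print(promoted, weights)
```

Keeping the gate as a pure function makes it easy to store its thresholds next to the Helm chart and to test it without a cluster. The rollback path is the same call as the forward path with a different weight, which is what keeps it reversible without manual intervention.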

Operating cadence

  • Weekly rehearsal in lower environments to keep the workflow sharp.
  • Change review highlighting upcoming blue/green cutovers and risk mitigation.
  • Quarterly chaos or gameday exercises targeting failover, rollback, and automation gaps.

Signals you are succeeding

  • Deployments finish without user-visible downtime or broken sessions.
  • Rollbacks are exercised quarterly and complete in minutes without paging senior engineers.
  • Release metrics (error rate, latency) stay within SLOs during traffic shifts.
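
The signals above can be checked mechanically if each deployment leaves a small record behind. The sketch below assumes a hypothetical DeployRecord shape and a 10-minute rollback budget as a stand-in for "minutes"; neither comes from a specific tool, so adapt the fields to whatever your pipeline already captures.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DeployRecord:
    # Hypothetical per-deployment record for one review period.
    user_visible_downtime_s: float
    slo_breached_during_shift: bool
    was_rollback: bool
    rollback_duration_s: float = 0.0


def success_signals(
    records: List[DeployRecord],
    max_rollback_s: float = 600.0,  # assumed budget for "completes in minutes"
) -> Dict[str, bool]:
    """Evaluate the success signals over one period's deployment records."""
    rollbacks = [r for r in records if r.was_rollback]
    return {
        "no_user_visible_downtime": all(
            r.user_visible_downtime_s == 0 for r in records
        ),
        "rollback_exercised": len(rollbacks) > 0,
        "rollbacks_within_budget": all(
            r.rollback_duration_s <= max_rollback_s for r in rollbacks
        ),
        "metrics_within_slo": not any(
            r.slo_breached_during_shift for r in records
        ),
    }


if __name__ == "__main__":
    quarter = [
        DeployRecord(0.0, False, False),
        DeployRecord(0.0, False, True, rollback_duration_s=240.0),
    ]
    for name, ok in success_signals(quarter).items():
        print(f"{name}: {'ok' if ok else 'needs attention'}")
```

A period with no rollbacks reports rollback_exercised as false, which is the point: an unexercised rollback path counts as a gap, not a success.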

Supporting material

  • Zero-downtime checklist for hands-on execution steps.
  • FAQ to align executives and finance on cost, risk, and tooling expectations.