Zero-downtime deploys with ArgoCD + Helm
A step-by-step guide to blue/green, zero-downtime deploys with ArgoCD, Helm, and Argo Rollouts.
When to run this play
Teams moving to Kubernetes often inherit outages from handcrafted deploy scripts or big-bang releases. If deploy windows still rely on maintenance pages, or if rollback means manually re-running Helm with old values, it is time to adopt blue/green deployments backed by ArgoCD and Argo Rollouts.
Readiness check
- Production traffic flows through an ingress controller or service mesh that supports weighted traffic shifting (Istio, Linkerd, AWS ALB, or NGINX with canary support).
- Helm charts define readiness and liveness probes, pin container images to explicit version tags, and use values files that can be re-applied without side effects.
- The observability stack (for example Prometheus, Datadog, or New Relic) can break down latency, error rate, and saturation by release version, so blue and green can be compared directly.
- Teams understand the incident response model and can page responders during rollout.
Core plays
- Model environments explicitly. Define blue and green as separate ArgoCD applications referencing the same repo. Capture infrastructure dependencies (databases, queues) in documentation so surprises do not appear mid-cutover.
- Codify release gates. Automate smoke tests, load checks, and error budget comparisons. Store gate definitions alongside Helm charts and make failures block promotion automatically.
- Adopt progressive traffic shifting. Use Argo Rollouts or service mesh routing weights to move traffic to green in steps (for example 10%, 30%, 60%, then 100%). After each increment, wait for metrics to stabilize before continuing, and link the relevant dashboards directly in the deployment runbook.
- Plan rollback from the start. Keep blue warm until green is proven, version database migrations so they are safe to roll forward and roll back, and script traffic flips so they can be reversed without anyone running kubectl by hand.
- Close with learning. Capture deployment duration, incidents avoided or triggered, and improvement ideas in a lightweight retro. Feed insights into the checklist and runbooks.
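The release-gate, traffic-shifting, and rollback plays above can be sketched together as one small control loop. This is a minimal Python sketch under stated assumptions, not the Argo Rollouts API: `set_green_weight`, `fetch_metrics`, and `wait` are hypothetical callbacks standing in for whatever adjusts routing weights, queries the observability stack, and pauses until metrics settle, and the gate thresholds are placeholder values rather than recommendations.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Metrics:
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float  # 95th percentile latency in milliseconds


@dataclass(frozen=True)
class Gate:
    # Placeholder thresholds (assumptions); real values would come from your SLOs.
    max_error_rate: float = 0.01
    max_p95_latency_ms: float = 500.0

    def passes(self, m: Metrics) -> bool:
        return (m.error_rate <= self.max_error_rate
                and m.p95_latency_ms <= self.max_p95_latency_ms)


def progressive_rollout(
    set_green_weight: Callable[[int], None],  # hypothetical: % of traffic sent to green
    fetch_metrics: Callable[[], Metrics],     # hypothetical: green's current metrics
    gate: Gate,
    steps: Sequence[int] = (10, 30, 60, 100),
    wait: Callable[[], None] = lambda: None,  # hypothetical: pause until metrics settle
) -> bool:
    """Shift traffic to green step by step; return True if fully promoted.

    If the gate fails at any step, all traffic is sent back to blue
    (green weight 0) and the function returns False.
    """
    for weight in steps:
        set_green_weight(weight)
        wait()
        if not gate.passes(fetch_metrics()):
            set_green_weight(0)  # scripted rollback: blue is still warm
            return False
    return True


if __name__ == "__main__":
    applied = []  # records every weight that was applied, in order
    promoted = progressive_rollout(
        set_green_weight=applied.append,
        fetch_metrics=lambda: Metrics(error_rate=0.002, p95_latency_ms=180.0),
        gate=Gate(),
    )
    print(promoted, applied)
```

Passing the routing and metrics calls in as functions keeps the loop testable in lower environments: a rehearsal can feed it recorded metrics and confirm that a failing gate returns traffic to blue without touching a real cluster.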
Operating cadence
- Weekly rehearsal in lower environments to keep the workflow sharp.
- Change review highlighting upcoming blue/green cutovers and risk mitigation.
- Quarterly chaos or gameday exercises targeting failover, rollback, and automation gaps.
Signals you are succeeding
- Deployments finish without user-visible downtime or broken sessions.
- Rollbacks are exercised quarterly and complete in minutes without paging senior engineers.
- Release metrics (error rate, latency) stay within SLOs during traffic shifts.
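These signals are easier to track when they are computed from recorded data rather than judged by feel. As a rough sketch, the check below reads a hypothetical log of rollback drills and reports whether the latest one was recent, fast, and handled without escalation. The `RollbackDrill` record, the 92-day window (one reading of "quarterly"), and the 10-minute limit (one reading of "in minutes") are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Sequence


@dataclass(frozen=True)
class RollbackDrill:
    # Hypothetical record of one rollback exercise.
    performed_on: date
    duration: timedelta
    paged_senior_engineer: bool


def rollback_signal_ok(
    drills: Sequence[RollbackDrill],
    today: date,
    max_age: timedelta = timedelta(days=92),          # assumed meaning of "quarterly"
    max_duration: timedelta = timedelta(minutes=10),  # assumed meaning of "in minutes"
) -> bool:
    """Return True if the most recent drill is recent, fast, and needed no escalation."""
    if not drills:
        return False  # no recorded drill means the signal is not met
    latest = max(drills, key=lambda d: d.performed_on)
    return (
        today - latest.performed_on <= max_age
        and latest.duration <= max_duration
        and not latest.paged_senior_engineer
    )
```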
Supporting material
- Zero-downtime checklist for hands-on execution steps.
- FAQ to align executives and finance on cost, risk, and tooling expectations.
