RESOURCE · CI/CD TEMPLATE
A CI/CD template for zero-downtime Kubernetes deploys
OIDC-based AWS auth, ECR push, ArgoCD sync, smoke test, and automatic rollback — all in one workflow file.
What you get
- A main deploy.yml that handles staging and production from a single file, resolved by branch name or manual trigger
- A ci.yml for pull requests: lint, type-check, integration tests with a real Postgres service container, and a dry-run Docker build
- Automatic rollback via helm rollback if the post-deploy smoke test fails, with a Slack notification on both success and failure
What it assumes
- GitHub Actions enabled on the target repository
- An ECR registry and an EKS cluster already provisioned (use the Terraform module above)
- ArgoCD installed in the cluster and reachable from the GitHub Actions runner, with a service account token stored as a GitHub secret
The code
name: Deploy

on:
  push:
    branches:
      - main     # Triggers production deploy
      - staging  # Triggers staging deploy
  workflow_dispatch:
    inputs:
      environment:
        description: 'Target environment'
        required: true
        type: choice
        options: [staging, production]

concurrency:
  group: deploy-${{ github.ref }}
  cancel-in-progress: false  # Never cancel a running deploy; queue behind it

permissions:
  contents: read
  id-token: write  # Required for OIDC AWS auth

jobs:
  # ─────────────────────────────────────────────
  # Resolve which environment to deploy to
  # ─────────────────────────────────────────────
  resolve-env:
    runs-on: ubuntu-latest
    outputs:
      environment: ${{ steps.resolve.outputs.environment }}
      aws_role: ${{ steps.config.outputs.aws_role }}
      ecr_registry: ${{ steps.config.outputs.ecr_registry }}
      cluster_name: ${{ steps.config.outputs.cluster_name }}
    steps:
      - id: resolve
        env:
          INPUT_ENV: ${{ github.event.inputs.environment }}
        run: |
          if [[ "$GITHUB_EVENT_NAME" == "workflow_dispatch" ]]; then
            ENV="$INPUT_ENV"
          elif [[ "$GITHUB_REF" == "refs/heads/main" ]]; then
            ENV="production"
          else
            ENV="staging"
          fi
          echo "environment=$ENV" >> "$GITHUB_OUTPUT"
          # Expressions have no upper() and cannot read shell variables,
          # so emit the upper-cased prefix for the next step to use.
          echo "prefix=${ENV^^}" >> "$GITHUB_OUTPUT"
      - id: config
        run: |
          echo "aws_role=${{ vars[format('{0}_AWS_ROLE_ARN', steps.resolve.outputs.prefix)] }}" >> "$GITHUB_OUTPUT"
          echo "ecr_registry=${{ vars.ECR_REGISTRY }}" >> "$GITHUB_OUTPUT"
          echo "cluster_name=${{ vars[format('{0}_CLUSTER_NAME', steps.resolve.outputs.prefix)] }}" >> "$GITHUB_OUTPUT"

  # ─────────────────────────────────────────────
  # Build and push to ECR
  # ─────────────────────────────────────────────
  build:
    runs-on: ubuntu-latest
    needs: resolve-env
    outputs:
      image_tag: ${{ steps.meta.outputs.version }}
    steps:
      - uses: actions/checkout@v4
      - name: Configure AWS credentials (OIDC)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ needs.resolve-env.outputs.aws_role }}
          aws-region: us-east-1
      - name: Log in to ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v2
      - name: Extract Docker metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ needs.resolve-env.outputs.ecr_registry }}/my-app
          # The highest-priority tag becomes the `version` output. By default the
          # branch tag outranks the SHA tag, so raise the SHA's priority to keep
          # image_tag unique per commit.
          tags: |
            type=sha,prefix=,format=short,priority=1000
            type=ref,event=branch
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
          provenance: false  # Avoids multi-arch manifest issues with ECR

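  # ─────────────────────────────────────────────
  # Run tests against the built image
  # ─────────────────────────────────────────────
  test:
    runs-on: ubuntu-latest
    needs: [resolve-env, build]  # resolve-env is required to read ecr_registry
    steps:
      - uses: actions/checkout@v4
      - name: Configure AWS credentials (OIDC)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ needs.resolve-env.outputs.aws_role }}
          aws-region: us-east-1
      - name: Log in to ECR  # Needed to pull the private image
        uses: aws-actions/amazon-ecr-login@v2
      - name: Run integration tests
        env:
          DATABASE_URL: ${{ secrets.TEST_DATABASE_URL }}
          IMAGE: ${{ needs.resolve-env.outputs.ecr_registry }}/my-app:${{ needs.build.outputs.image_tag }}
        run: |
          # -e DATABASE_URL forwards the variable from the step environment
          docker run --rm \
            -e DATABASE_URL \
            "$IMAGE" \
            npm run test:integration
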
  # ─────────────────────────────────────────────
  # Deploy via Helm + ArgoCD sync
  # ─────────────────────────────────────────────
  deploy:
    runs-on: ubuntu-latest
    needs: [resolve-env, build, test]
    environment: ${{ needs.resolve-env.outputs.environment }}
    steps:
      - uses: actions/checkout@v4
      - name: Configure AWS credentials (OIDC)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ needs.resolve-env.outputs.aws_role }}
          aws-region: us-east-1
      - name: Update kubeconfig
        run: |
          aws eks update-kubeconfig \
            --name ${{ needs.resolve-env.outputs.cluster_name }} \
            --region us-east-1
      - name: Update image tag in Helm values
        run: |
          IMAGE_TAG=${{ needs.build.outputs.image_tag }}
          ENV=${{ needs.resolve-env.outputs.environment }}
          # Update image.tag in the environment-specific values file.
          # This edits only the runner's checkout; ArgoCD sees the new tag
          # once the change is committed and pushed to the GitOps repo.
          yq -i ".image.tag = \"$IMAGE_TAG\"" "deploy/helm/envs/$ENV/values.yaml"
      - name: Trigger ArgoCD sync
        # Assumes the argocd CLI is already installed on the runner.
        run: |
          argocd app sync my-app-${{ needs.resolve-env.outputs.environment }} \
            --server ${{ secrets.ARGOCD_SERVER }} \
            --auth-token ${{ secrets.ARGOCD_TOKEN }} \
            --timeout 300 \
            --retry-limit 3
      - name: Wait for rollout
        run: |
          kubectl rollout status deployment/my-app \
            --namespace ${{ needs.resolve-env.outputs.environment }} \
            --timeout=300s

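  # ─────────────────────────────────────────────
  # Smoke test the deployed endpoint
  # ─────────────────────────────────────────────
  smoke-test:
    runs-on: ubuntu-latest
    needs: [resolve-env, deploy]
    steps:
      - name: Health check
        env:
          ENV: ${{ needs.resolve-env.outputs.environment }}
          STAGING_BASE_URL: ${{ secrets.STAGING_BASE_URL }}
          PRODUCTION_BASE_URL: ${{ secrets.PRODUCTION_BASE_URL }}
        run: |
          # Pick STAGING_BASE_URL or PRODUCTION_BASE_URL via bash indirection;
          # expressions have no upper() and cannot read shell variables.
          URL_VAR="${ENV^^}_BASE_URL"
          BASE_URL="${!URL_VAR}"
          for i in {1..10}; do
            # "|| true" stops the step's "bash -e" from aborting on a refused
            # connection; curl still prints 000 as the status in that case.
            STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$BASE_URL/healthz" || true)
            if [[ "$STATUS" == "200" ]]; then
              echo "Smoke test passed (attempt $i)"
              exit 0
            fi
            echo "Attempt $i: got $STATUS, retrying in 10s..."
            sleep 10
          done
          echo "Smoke test failed after 10 attempts"
          exit 1
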
  # ─────────────────────────────────────────────
  # Rollback on smoke test failure
  # ─────────────────────────────────────────────
  rollback:
    runs-on: ubuntu-latest
    needs: [resolve-env, deploy, smoke-test]
    if: failure() && needs.deploy.result == 'success'
    steps:
      - name: Configure AWS credentials (OIDC)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ needs.resolve-env.outputs.aws_role }}
          aws-region: us-east-1
      - name: Rollback Helm release
        # Assumes a Helm release named my-app exists in this namespace.
        # ArgoCD renders charts with `helm template`, so an app managed only by
        # ArgoCD usually has no Helm release; `argocd app rollback` is the
        # equivalent in that setup.
        run: |
          aws eks update-kubeconfig \
            --name ${{ needs.resolve-env.outputs.cluster_name }} \
            --region us-east-1
          helm rollback my-app --namespace ${{ needs.resolve-env.outputs.environment }}
      - name: Notify Slack (rollback)
        uses: slackapi/slack-github-action@v1.26.0
        with:
          payload: |
            {
              "text": ":warning: *Deploy rolled back*",
              "attachments": [{
                "color": "danger",
                "fields": [
                  {"title": "Env", "value": "${{ needs.resolve-env.outputs.environment }}", "short": true},
                  {"title": "Commit", "value": "${{ github.sha }}", "short": true},
                  {"title": "Actor", "value": "${{ github.actor }}", "short": true}
                ]
              }]
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
          # Assumes the URL is a Slack incoming webhook rather than a Workflow Builder webhook
          SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK

  # ─────────────────────────────────────────────
  # Notify Slack on success
  # ─────────────────────────────────────────────
  notify:
    runs-on: ubuntu-latest
    needs: [resolve-env, build, smoke-test]  # build is required to read image_tag
    if: success()
    steps:
      - name: Notify Slack (success)
        uses: slackapi/slack-github-action@v1.26.0
        with:
          payload: |
            {
              "text": ":white_check_mark: *Deploy succeeded*",
              "attachments": [{
                "color": "good",
                "fields": [
                  {"title": "Env", "value": "${{ needs.resolve-env.outputs.environment }}", "short": true},
                  {"title": "Image", "value": "${{ needs.build.outputs.image_tag }}", "short": true},
                  {"title": "Actor", "value": "${{ github.actor }}", "short": true}
                ]
              }]
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
          # Assumes the URL is a Slack incoming webhook rather than a Workflow Builder webhook
          SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK

The explanation
OIDC instead of long-lived AWS credentials
The workflow uses GitHub Actions OIDC to assume an AWS IAM role, which means no AWS_ACCESS_KEY_ID or AWS_SECRET_ACCESS_KEY stored in GitHub secrets. OIDC tokens are short-lived, issued by GitHub for each job, and validated by AWS against a trust policy you configure once. The temporary AWS credentials they are exchanged for expire on their own, so a leaked credential is useful only briefly. With long-lived keys, by contrast, every rotation means finding and updating the key in every repo that uses it.
Setting up OIDC requires two one-time steps: creating an OIDC provider in your AWS account pointing at token.actions.githubusercontent.com, and configuring a trust policy on the IAM role that restricts which GitHub org, repo, and branch can assume it. The aws-actions/configure-aws-credentials action handles the token exchange automatically.
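As a rough illustration of the trust policy described above, here is a CloudFormation sketch for the production role. The choice of CloudFormation, the parameter names, and the resource name are assumptions made for this example; the Terraform module mentioned earlier may already create an equivalent role. One detail to be aware of: a job that declares a GitHub environment (as the deploy job does) presents a subject claim of the form repo:ORG/REPO:environment:NAME rather than a branch ref, so the sketch allows both forms.

```yaml
# Sketch only: parameter and resource names are illustrative assumptions.
AWSTemplateFormatVersion: "2010-09-09"
Description: IAM role assumable from GitHub Actions via OIDC (production)

Parameters:
  GitHubOrg:
    Type: String
  GitHubRepo:
    Type: String
  OidcProviderArn:
    Type: String
    Description: ARN of the existing IAM OIDC provider for token.actions.githubusercontent.com

Resources:
  ProductionDeployRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Federated: !Ref OidcProviderArn
            Action: sts:AssumeRoleWithWebIdentity
            Condition:
              StringEquals:
                # Audience used by aws-actions/configure-aws-credentials by default
                "token.actions.githubusercontent.com:aud": sts.amazonaws.com
                # Multiple values under one key are OR-ed together
                "token.actions.githubusercontent.com:sub":
                  - !Sub "repo:${GitHubOrg}/${GitHubRepo}:ref:refs/heads/main"
                  - !Sub "repo:${GitHubOrg}/${GitHubRepo}:environment:production"

Outputs:
  RoleArn:
    Value: !GetAtt ProductionDeployRole.Arn
```

The RoleArn output would then be stored as the PRODUCTION_AWS_ROLE_ARN repository variable. Permissions for ECR and EKS still need to be attached to the role separately.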
Concurrency: never cancel a running deploy
The concurrency group is set per github.ref with cancel-in-progress: false. This is deliberate. Cancelling a running deploy mid-flight can leave your cluster in a partially updated state, with some pods on the new image and some on the old. If you need to cancel a deploy that's gone wrong, use the rollback job instead.
For CI (the ci.yml workflow), a concurrency group keyed the same way uses cancel-in-progress: true, because cancelling a lint or test run on a superseded PR commit is always safe.
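A minimal sketch of what such a ci.yml could look like follows. It assumes a Node project (the deploy workflow runs npm run test:integration); the lint and typecheck script names, the Postgres version, and the connection string are placeholders rather than details taken from the actual template.

```yaml
# ci.yml (sketch): script names other than test:integration are assumptions.
name: CI

on:
  pull_request:

concurrency:
  group: ci-${{ github.ref }}
  cancel-in-progress: true  # A newer commit on the same PR supersedes this run

permissions:
  contents: read

jobs:
  test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_PASSWORD: postgres
        ports:
          - 5432:5432
        # Wait until Postgres accepts connections before the steps start
        options: >-
          --health-cmd "pg_isready -U postgres"
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    env:
      DATABASE_URL: postgres://postgres:postgres@localhost:5432/postgres
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm
      - run: npm ci
      - run: npm run lint
      - run: npm run typecheck
      - run: npm run test:integration

  docker-build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/build-push-action@v5
        with:
          context: .
          push: false  # Dry run: build only, nothing is pushed
          cache-from: type=gha
```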
Environment resolution without duplication
A common pattern is to have separate workflow files for staging and production, duplicating identical steps with slightly different environment variables. The resolve-env job eliminates this by mapping branch name (or manual input) to environment-specific variables using GitHub Actions variable interpolation:
vars[format('{0}_AWS_ROLE_ARN', steps.resolve.outputs.prefix)]
This reads STAGING_AWS_ROLE_ARN or PRODUCTION_AWS_ROLE_ARN from your GitHub repository or environment variables, with no duplication. The prefix is the upper-cased environment name, and it has to come from an earlier step's output because GitHub expressions have no upper() function and cannot see shell variables. The outputs from resolve-env are then used by all downstream jobs.
Why the smoke test retries 10 times
A rolling deployment doesn't complete atomically. After kubectl rollout status returns, traffic may still be shifting from old pods to new ones through the ingress controller. The smoke test polls at a fixed 10-second interval, up to 10 times, to ride out this transition period. If the health check fails after all attempts, the rollback job triggers automatically, but only if the deploy job itself succeeded. A deploy that failed (e.g. image pull error) does not trigger rollback, because there's nothing to roll back to.
The ArgoCD sync step
The workflow updates the image tag in the Helm values file for the target environment, then triggers an ArgoCD sync. This keeps your GitOps repository as the source of truth — the deployed image tag is always visible in Git, every deploy is an auditable commit, and ArgoCD's self-heal means any manual kubectl changes get reverted automatically.
The alternative (using helm upgrade directly from CI) works but loses the GitOps audit trail and means ArgoCD doesn't know the actual desired state. If you want the operational benefits of ArgoCD (diff view, rollback UI, sync windows for change control), the CI pipeline should update Git and let ArgoCD handle the actual apply.
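For orientation, an ArgoCD Application matching this setup might look roughly like the sketch below. The repository URL, project, and ArgoCD namespace are placeholders; the chart path and values file follow the deploy/helm/envs/ENV/values.yaml layout used in the workflow, and the destination namespace mirrors the workflow's convention of naming the namespace after the environment.

```yaml
# Sketch only: repoURL, project and the argocd namespace are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app-production
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/my-gitops-repo.git  # hypothetical
    targetRevision: main
    path: deploy/helm
    helm:
      valueFiles:
        - envs/production/values.yaml  # Relative to the chart path above
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true  # Reverts manual kubectl changes back to the Git state
```

With automated sync enabled, the explicit argocd app sync call in the workflow mainly shortens the wait for ArgoCD's next poll of the repository.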
Slack notifications are not optional
Both success and failure notifications go to Slack. The failure notification is obvious. The success notification is equally important — it gives the deploying engineer confirmation that the deploy completed and tells the rest of the team that a new version is live. Without it, you're relying on engineers remembering to check the Actions UI, which they don't.
Customisation notes
To add a new environment (e.g. a QA environment), add a branch pattern to the on.push.branches list, add a new case to the resolve-env step, and add the corresponding GitHub repository variables (QA_AWS_ROLE_ARN, QA_CLUSTER_NAME, etc.).
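The changes for a QA environment could be sketched as follows. The qa branch name is hypothetical, and only the environment output is shown; the other outputs of resolve-env are omitted for brevity. The sketch swaps the if/elif chain for a case statement on GITHUB_REF_NAME, which tends to read better once there are more than two branches.

```yaml
# Excerpt of deploy.yml with a hypothetical qa branch added.
on:
  push:
    branches:
      - main
      - staging
      - qa  # New: triggers QA deploy
  workflow_dispatch:
    inputs:
      environment:
        description: 'Target environment'
        required: true
        type: choice
        options: [staging, qa, production]

jobs:
  resolve-env:
    runs-on: ubuntu-latest
    outputs:
      environment: ${{ steps.resolve.outputs.environment }}
    steps:
      - id: resolve
        env:
          INPUT_ENV: ${{ github.event.inputs.environment }}
        run: |
          if [[ "$GITHUB_EVENT_NAME" == "workflow_dispatch" ]]; then
            ENV="$INPUT_ENV"
          else
            # GITHUB_REF_NAME is the short branch name, e.g. "qa"
            case "$GITHUB_REF_NAME" in
              main) ENV="production" ;;
              qa)   ENV="qa" ;;
              *)    ENV="staging" ;;
            esac
          fi
          echo "environment=$ENV" >> "$GITHUB_OUTPUT"
```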
The smoke test currently checks a single health endpoint. Extend it with your most critical user-facing flow: for a web service, a curl to an API endpoint that exercises the database connection is usually enough. For a payment service, check that the checkout initiation endpoint returns 200 (without completing a payment). The goal is to catch the most common failure modes (app can't connect to DB, config is wrong) within the retry window of roughly 100 seconds after deploy.
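One way to add such a check is an extra step in the smoke-test job, sketched below. The /readyz path is a hypothetical endpoint assumed to touch the database, and the step reads the production URL directly for simplicity; adapt both to your service.

```yaml
# Extra step for the smoke-test job (sketch). /readyz is a hypothetical
# endpoint that is assumed to run a query against the database.
- name: Check database-backed endpoint
  env:
    BASE_URL: ${{ secrets.PRODUCTION_BASE_URL }}
  run: |
    # --fail makes curl exit non-zero on HTTP 4xx/5xx responses;
    # --retry-all-errors makes --retry cover those responses as well.
    curl --fail --silent --show-error \
      --retry 5 --retry-delay 10 --retry-all-errors \
      --max-time 10 \
      "$BASE_URL/readyz" > /dev/null
```

Because the step fails the job on a non-200 response, it feeds into the same rollback condition as the basic health check.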
The yq command in the "Update image tag" step only edits the runner's checkout. Follow it with a Git commit and push if you want the change to appear in your GitOps repository's history and be picked up by ArgoCD. This requires a deploy key or a GitHub App token with write access to the GitOps repository.
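A possible shape for that commit, as a pair of steps in the deploy job, is sketched below. The repository name my-org/my-gitops-repo and the secret GITOPS_DEPLOY_KEY are hypothetical, and the sketch assumes the values files live in a repository separate from the application code.

```yaml
# Sketch: my-org/my-gitops-repo and GITOPS_DEPLOY_KEY are hypothetical names.
- name: Check out GitOps repository
  uses: actions/checkout@v4
  with:
    repository: my-org/my-gitops-repo
    ssh-key: ${{ secrets.GITOPS_DEPLOY_KEY }}  # Deploy key with write access
    path: gitops
- name: Commit new image tag
  working-directory: gitops
  env:
    IMAGE_TAG: ${{ needs.build.outputs.image_tag }}
    ENV: ${{ needs.resolve-env.outputs.environment }}
  run: |
    yq -i ".image.tag = \"$IMAGE_TAG\"" "deploy/helm/envs/$ENV/values.yaml"
    git config user.name "github-actions[bot]"
    git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
    git add "deploy/helm/envs/$ENV/values.yaml"
    # Skip the commit when the tag is unchanged (e.g. on a re-run)
    if git diff --cached --quiet; then
      echo "Image tag already $IMAGE_TAG; nothing to commit"
    else
      git commit -m "deploy($ENV): my-app $IMAGE_TAG"
      git push
    fi
```

If the values files live in the application repository instead, a push made with a deploy key would re-trigger the deploy workflow, so a path filter or a separate repository is worth considering.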
Need this customised for your stack?
The resource above covers the general case. If you need it adapted to your specific cloud account, VPC layout, security requirements, or team structure, that's what engagements are for.