RESOURCE · CI/CD TEMPLATE

A CI/CD template for zero-downtime Kubernetes deploys

OIDC-based AWS auth, ECR push, ArgoCD sync, smoke test, and automatic rollback — all in one workflow file.

GitHub Actions · AWS OIDC · Amazon ECR · ArgoCD · Helm 3 · Slack

What you get

  • A main deploy.yml that handles staging and production from a single file, resolved by branch name or manual trigger
  • A ci.yml for pull requests: lint, type-check, integration tests with a real Postgres service container, and a dry-run Docker build
  • Automatic rollback via helm rollback if the post-deploy smoke test fails, with a Slack notification on both success and failure
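
The ci.yml described in the second bullet is not reproduced on this page. A minimal sketch of what it could look like follows. The Node version, the lint and typecheck script names, and the Postgres credentials and database name are assumptions made for illustration; only npm run test:integration appears in the deploy workflow.

```yaml
name: CI

on:
  pull_request:

permissions:
  contents: read

jobs:
  checks:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16            # assumed version
        env:
          POSTGRES_PASSWORD: postgres # throwaway credentials for the test container
          POSTGRES_DB: app_test       # hypothetical database name
        ports:
          - 5432:5432
        # Wait until Postgres accepts connections before the steps start
        options: >-
          --health-cmd "pg_isready -U postgres"
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    env:
      DATABASE_URL: postgres://postgres:postgres@localhost:5432/app_test
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20            # assumed version
          cache: npm
      - run: npm ci
      - run: npm run lint             # hypothetical script name
      - run: npm run typecheck        # hypothetical script name
      - run: npm run test:integration

  docker-dry-run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - name: Build without pushing
        uses: docker/build-push-action@v5
        with:
          context: .
          push: false                 # dry run: prove the image builds, publish nothing
          cache-from: type=gha
```

Because nothing is pushed, this workflow needs no AWS credentials and no id-token permission.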

What it assumes

  • GitHub Actions enabled on the target repository
  • An ECR registry and an EKS cluster already provisioned (use the Terraform module above)
  • ArgoCD installed in the cluster and reachable from the GitHub Actions runner, with a service account token stored as a GitHub secret

The code

name: Deploy

on:
  push:
    branches:
      - main          # Triggers production deploy
      - staging       # Triggers staging deploy
  workflow_dispatch:
    inputs:
      environment:
        description: 'Target environment'
        required: true
        type: choice
        options: [staging, production]

concurrency:
  group: deploy-${{ github.ref }}
  cancel-in-progress: false  # Never cancel a running deploy — wait for it

permissions:
  contents: read
  id-token: write   # Required for OIDC AWS auth

jobs:
  # ─────────────────────────────────────────────
  # Resolve which environment to deploy to
  # ─────────────────────────────────────────────
  resolve-env:
    runs-on: ubuntu-latest
    outputs:
      environment: ${{ steps.resolve.outputs.environment }}
      aws_role: ${{ steps.resolve.outputs.aws_role }}
      ecr_registry: ${{ steps.resolve.outputs.ecr_registry }}
      cluster_name: ${{ steps.resolve.outputs.cluster_name }}
    steps:
      - id: resolve
        run: |
          if [[ "${{ github.event_name }}" == "workflow_dispatch" ]]; then
            ENV="${{ github.event.inputs.environment }}"
          elif [[ "${{ github.ref }}" == "refs/heads/main" ]]; then
            ENV="production"
          else
            ENV="staging"
          fi
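          # GitHub expressions are expanded before this script runs, so they cannot
          # read the shell variable ENV, and the expression language has no upper().
          # Branch in the shell and name each repository variable in full.
          case "$ENV" in
            production)
              AWS_ROLE="${{ vars.PRODUCTION_AWS_ROLE_ARN }}"
              CLUSTER_NAME="${{ vars.PRODUCTION_CLUSTER_NAME }}"
              ;;
            *)
              AWS_ROLE="${{ vars.STAGING_AWS_ROLE_ARN }}"
              CLUSTER_NAME="${{ vars.STAGING_CLUSTER_NAME }}"
              ;;
          esac
          {
            echo "environment=$ENV"
            echo "aws_role=$AWS_ROLE"
            echo "ecr_registry=${{ vars.ECR_REGISTRY }}"
            echo "cluster_name=$CLUSTER_NAME"
          } >> "$GITHUB_OUTPUT"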

  # ─────────────────────────────────────────────
  # Build and push to ECR
  # ─────────────────────────────────────────────
  build:
    runs-on: ubuntu-latest
    needs: resolve-env
    outputs:
      image_tag: ${{ steps.meta.outputs.version }}
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials (OIDC)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ needs.resolve-env.outputs.aws_role }}
          aws-region: us-east-1

      - name: Log in to ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v2

      - name: Extract Docker metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ needs.resolve-env.outputs.ecr_registry }}/my-app
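          # type=ref outranks type=sha by default, which would make the branch name
          # the `version` output. Raising the SHA's priority makes the immutable
          # short SHA the value that downstream jobs receive as image_tag.
          tags: |
            type=sha,prefix=,format=short,priority=1000
            type=ref,event=branch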

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
          provenance: false   # Avoids multi-arch manifest issues with ECR

  # ─────────────────────────────────────────────
  # Run tests against the built image
  # ─────────────────────────────────────────────
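  test:
    runs-on: ubuntu-latest
    needs: [resolve-env, build]   # resolve-env must be listed to read its outputs
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials (OIDC)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ needs.resolve-env.outputs.aws_role }}
          aws-region: us-east-1

      # The image lives in a private registry, so this job has to log in before pulling
      - name: Log in to ECR
        uses: aws-actions/amazon-ecr-login@v2

      - name: Run integration tests
        env:
          DATABASE_URL: ${{ secrets.TEST_DATABASE_URL }}
        run: |
          docker run --rm \
            -e DATABASE_URL \
            ${{ needs.resolve-env.outputs.ecr_registry }}/my-app:${{ needs.build.outputs.image_tag }} \
            npm run test:integration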

  # ─────────────────────────────────────────────
  # Deploy via Helm + ArgoCD sync
  # ─────────────────────────────────────────────
  deploy:
    runs-on: ubuntu-latest
    needs: [resolve-env, build, test]
    environment: ${{ needs.resolve-env.outputs.environment }}
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials (OIDC)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ needs.resolve-env.outputs.aws_role }}
          aws-region: us-east-1

      - name: Update kubeconfig
        run: |
          aws eks update-kubeconfig \
            --name ${{ needs.resolve-env.outputs.cluster_name }} \
            --region us-east-1

      - name: Update image tag in Helm values
        run: |
          IMAGE_TAG=${{ needs.build.outputs.image_tag }}
          ENV=${{ needs.resolve-env.outputs.environment }}
          # Update image.tag in the environment-specific values file.
          # This only edits the runner's checkout. ArgoCD picks up the new tag
          # once the change is committed and pushed to the repo it watches.
          yq -i ".image.tag = \"$IMAGE_TAG\"" "deploy/helm/envs/$ENV/values.yaml"

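      # Assumes the argocd CLI is on the runner's PATH. GitHub-hosted ubuntu-latest
      # images do not ship it, so add an install step first if you use them.
      - name: Trigger ArgoCD sync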
        run: |
          argocd app sync my-app-${{ needs.resolve-env.outputs.environment }} \
            --server ${{ secrets.ARGOCD_SERVER }} \
            --auth-token ${{ secrets.ARGOCD_TOKEN }} \
            --timeout 300 \
            --retry-limit 3

      - name: Wait for rollout
        run: |
          kubectl rollout status deployment/my-app \
            --namespace ${{ needs.resolve-env.outputs.environment }} \
            --timeout=300s

  # ─────────────────────────────────────────────
  # Smoke test the deployed endpoint
  # ─────────────────────────────────────────────
  smoke-test:
    runs-on: ubuntu-latest
    needs: [resolve-env, deploy]
    steps:
      - name: Health check
        env:
          # Expressions cannot read shell variables, so select the secret by its full name
          BASE_URL: ${{ needs.resolve-env.outputs.environment == 'production' && secrets.PRODUCTION_BASE_URL || secrets.STAGING_BASE_URL }}
        run: |
          for i in {1..10}; do
            # "|| true" keeps the loop going when curl cannot connect (the script runs under bash -e)
            STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$BASE_URL/healthz" || true)
            if [[ "$STATUS" == "200" ]]; then
              echo "Smoke test passed (attempt $i)"
              exit 0
            fi
            echo "Attempt $i: got $STATUS, retrying in 10s..."
            sleep 10
          done
          echo "Smoke test failed after 10 attempts"
          exit 1

  # ─────────────────────────────────────────────
  # Rollback on smoke test failure
  # ─────────────────────────────────────────────
  rollback:
    runs-on: ubuntu-latest
    needs: [resolve-env, deploy, smoke-test]
    if: failure() && needs.deploy.result == 'success'
    steps:
      - name: Configure AWS credentials (OIDC)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ needs.resolve-env.outputs.aws_role }}
          aws-region: us-east-1

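      # Assumes a Helm release named my-app exists in the namespace. If ArgoCD renders
      # the chart itself, there is no Helm release history to roll back; in that case
      # revert the image tag in Git or use `argocd app rollback` instead.
      - name: Rollback Helm release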
        run: |
          aws eks update-kubeconfig \
            --name ${{ needs.resolve-env.outputs.cluster_name }} \
            --region us-east-1
          helm rollback my-app --namespace ${{ needs.resolve-env.outputs.environment }}

      - name: Notify Slack — rollback
        uses: slackapi/slack-github-action@v1.26.0
        with:
          payload: |
            {
              "text": ":warning: *Deploy rolled back*",
              "attachments": [{
                "color": "danger",
                "fields": [
                  {"title": "Env", "value": "${{ needs.resolve-env.outputs.environment }}", "short": true},
                  {"title": "Commit", "value": "${{ github.sha }}", "short": true},
                  {"title": "Actor", "value": "${{ github.actor }}", "short": true}
                ]
              }]
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
          SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK   # needed in v1.x for nested payloads such as attachments

  # ─────────────────────────────────────────────
  # Notify Slack on success
  # ─────────────────────────────────────────────
  notify:
    runs-on: ubuntu-latest
    needs: [resolve-env, build, smoke-test]   # build must be listed to read image_tag
    if: success()
    steps:
      - name: Notify Slack — success
        uses: slackapi/slack-github-action@v1.26.0
        with:
          payload: |
            {
              "text": ":white_check_mark: *Deploy succeeded*",
              "attachments": [{
                "color": "good",
                "fields": [
                  {"title": "Env", "value": "${{ needs.resolve-env.outputs.environment }}", "short": true},
                  {"title": "Image", "value": "${{ needs.build.outputs.image_tag }}", "short": true},
                  {"title": "Actor", "value": "${{ github.actor }}", "short": true}
                ]
              }]
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
          SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK   # needed in v1.x for nested payloads such as attachments

The explanation

OIDC instead of long-lived AWS credentials

The workflow uses GitHub Actions OIDC to assume an AWS IAM role, so no AWS_ACCESS_KEY_ID or AWS_SECRET_ACCESS_KEY is stored in GitHub secrets. OIDC tokens are issued by GitHub for each job, expire within minutes, and are validated by AWS against a trust policy you configure once. The AWS credentials they are exchanged for are temporary session credentials as well, so a leaked credential is useful only briefly, if at all. With long-lived keys, by contrast, every rotation means finding and updating the key in every repo that uses it.

Setting up OIDC requires two one-time steps: creating an OIDC provider in your AWS account pointing at token.actions.githubusercontent.com, and configuring a trust policy on the IAM role that restricts which GitHub org, repo, and branch can assume it. The aws-actions/configure-aws-credentials action handles the token exchange automatically.
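
The two one-time steps could be captured in a CloudFormation template along these lines. This is a sketch under assumptions: the use of CloudFormation, the role name, and the my-org/my-app repository are placeholders that do not come from the workflow above. One detail worth checking in your own setup: a job that declares environment: (the deploy job here) presents a subject claim of the form repo:ORG/REPO:environment:NAME rather than a branch ref, so a branch-only condition would reject it.

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: GitHub Actions OIDC provider and a deploy role (illustrative sketch)

Resources:
  GitHubOidcProvider:
    Type: AWS::IAM::OIDCProvider
    Properties:
      Url: https://token.actions.githubusercontent.com
      ClientIdList:
        - sts.amazonaws.com   # audience requested by aws-actions/configure-aws-credentials
      # ThumbprintList is omitted here; add one if your tooling still requires it.

  ProductionDeployRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: github-deploy-production   # hypothetical name
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Federated: !Ref GitHubOidcProvider   # Ref returns the provider ARN
            Action: sts:AssumeRoleWithWebIdentity
            Condition:
              StringEquals:
                "token.actions.githubusercontent.com:aud": sts.amazonaws.com
                # Either subject is accepted: branch-triggered jobs (build, test,
                # rollback) and jobs bound to the production environment (deploy).
                "token.actions.githubusercontent.com:sub":
                  - repo:my-org/my-app:ref:refs/heads/main
                  - repo:my-org/my-app:environment:production
      # Permission policies (ECR push, eks:DescribeCluster, ...) are attached separately.
```

The role's ARN is what you would store as PRODUCTION_AWS_ROLE_ARN. A staging role would follow the same shape with the staging branch and environment in the sub condition.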

Concurrency: never cancel a running deploy

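The concurrency group is set per github.ref with cancel-in-progress: false. This is deliberate. Cancelling a running deploy mid-flight can leave your cluster in a partially updated state, with some pods on the new image and some on the old. A new push therefore queues behind the running deploy instead of interrupting it (GitHub keeps only the most recent pending run per group, so intermediate pushes are skipped). If a deploy has gone wrong, let it finish and rely on the rollback job rather than cancelling it.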

For CI (the ci.yml workflow), the concurrency group is also keyed on github.ref but uses cancel-in-progress: true, because cancelling a lint or test run on a superseded PR commit is always safe.
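
As a sketch, the CI counterpart of the deploy workflow's concurrency block might look like this. The ci- prefix is an assumption, chosen so that CI runs and deploy runs never share a group.

```yaml
# Top-level block in ci.yml
concurrency:
  group: ci-${{ github.ref }}   # for pull requests, github.ref is refs/pull/<number>/merge
  cancel-in-progress: true      # a newer commit on the same PR supersedes the running checks
```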

Environment resolution without duplication

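A common pattern is to have separate workflow files for staging and production, duplicating identical steps with slightly different environment variables. The resolve-env job eliminates this by mapping the branch name (or manual input) to environment-specific GitHub Actions variables that share a naming convention:

vars.STAGING_AWS_ROLE_ARN / vars.PRODUCTION_AWS_ROLE_ARN

These are read from your GitHub repository variables, with no duplicated workflow files. Note that the lookup cannot be written as vars[format('{0}_AWS_ROLE_ARN', upper(ENV))]: expressions are expanded before the script runs, so they cannot see the shell variable ENV, and the expression language has no upper() function. The step has to branch in the shell and name each variable in full. The outputs from resolve-env are then used by all downstream jobs.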
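
An alternative worth considering is to let GitHub do the lookup through environment-scoped variables. The sketch below assumes you define a variable named AWS_ROLE_ARN separately in the staging and production GitHub environments; that unprefixed name is hypothetical and not part of the template.

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    needs: resolve-env
    # Binding the job to an environment makes vars.* resolve to that
    # environment's values, so no STAGING_/PRODUCTION_ prefix is needed.
    environment: ${{ needs.resolve-env.outputs.environment }}
    steps:
      - name: Configure AWS credentials (OIDC)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ vars.AWS_ROLE_ARN }}   # hypothetical environment-scoped variable
          aws-region: us-east-1
```

The trade-off is that every job bound to an environment is subject to that environment's protection rules, and its OIDC subject claim switches to the environment form, so the IAM trust policy has to allow it.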

Why the smoke test retries 10 times

A rolling deployment doesn't complete atomically. After kubectl rollout status returns, the ingress controller may still be shifting traffic from old pods to new ones. The smoke test polls at a fixed 10-second interval, up to 10 times, to ride out this transition. If the health check still fails after the last attempt, the rollback job triggers automatically, but only if the deploy job itself succeeded. A deploy that failed (e.g. image pull error) does not trigger rollback, because there's nothing to roll back to.

The ArgoCD sync step

The workflow updates the image tag in the Helm values file for the target environment, then triggers an ArgoCD sync. Provided that change is committed to the repository ArgoCD watches (see the customisation notes), Git remains the source of truth: the deployed image tag is always visible in Git, every deploy is an auditable commit, and ArgoCD's self-heal, when enabled, reverts manual kubectl changes automatically.

The alternative (using helm upgrade directly from CI) works but loses the GitOps audit trail and means ArgoCD doesn't know the actual desired state. If you want the operational benefits of ArgoCD (diff view, rollback UI, sync windows for change control), the CI pipeline should update Git and let ArgoCD handle the actual apply.
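
For orientation, the ArgoCD side of this arrangement could be an Application manifest like the sketch below. The repository URL, the chart location (deploy/helm), and the argocd namespace are assumptions; the application name, values path, and target namespace follow the naming used in the workflow.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app-staging          # matches `argocd app sync my-app-<environment>`
  namespace: argocd             # assumed ArgoCD install namespace
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/my-app.git   # hypothetical GitOps repository
    targetRevision: staging                         # assumed branch tracked for this environment
    path: deploy/helm                               # assumed chart location
    helm:
      valueFiles:
        - envs/staging/values.yaml                  # the file the deploy job edits, relative to path
  destination:
    server: https://kubernetes.default.svc
    namespace: staging
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from Git
      selfHeal: true   # revert manual changes made directly in the cluster
```

With automated sync enabled, the explicit argocd app sync in the workflow mainly shortens the wait for ArgoCD's next poll and surfaces sync errors in the job log.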

Slack notifications are not optional

The workflow posts to Slack both when a deploy succeeds and when one is rolled back. The failure notification is obvious. The success notification is equally important: it gives the deploying engineer confirmation that the deploy completed and tells the rest of the team that a new version is live. Without it, you're relying on engineers remembering to check the Actions UI, which they don't.

Customisation notes

To add a new environment (e.g. a QA environment), add a branch pattern to the on.push.branches list, add a new case to the resolve-env step, and add the corresponding GitHub repository variables (QA_AWS_ROLE_ARN, QA_CLUSTER_NAME, etc.).
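
Those three changes might look like the following excerpt. It is a sketch: the qa branch name is an assumption, and the ecr_registry output is left out for brevity.

```yaml
on:
  push:
    branches:
      - main
      - staging
      - qa               # assumed branch name for the QA environment
  workflow_dispatch:
    inputs:
      environment:
        description: 'Target environment'
        required: true
        type: choice
        options: [staging, qa, production]

jobs:
  resolve-env:
    runs-on: ubuntu-latest
    outputs:
      environment: ${{ steps.resolve.outputs.environment }}
      aws_role: ${{ steps.resolve.outputs.aws_role }}
      cluster_name: ${{ steps.resolve.outputs.cluster_name }}
    steps:
      - id: resolve
        run: |
          if [[ "${{ github.event_name }}" == "workflow_dispatch" ]]; then
            ENV="${{ github.event.inputs.environment }}"
          else
            case "${{ github.ref_name }}" in
              main) ENV="production" ;;
              qa)   ENV="qa" ;;
              *)    ENV="staging" ;;
            esac
          fi
          # One case per environment, each naming its own repository variables
          case "$ENV" in
            production) AWS_ROLE="${{ vars.PRODUCTION_AWS_ROLE_ARN }}"; CLUSTER_NAME="${{ vars.PRODUCTION_CLUSTER_NAME }}" ;;
            qa)         AWS_ROLE="${{ vars.QA_AWS_ROLE_ARN }}";         CLUSTER_NAME="${{ vars.QA_CLUSTER_NAME }}" ;;
            *)          AWS_ROLE="${{ vars.STAGING_AWS_ROLE_ARN }}";    CLUSTER_NAME="${{ vars.STAGING_CLUSTER_NAME }}" ;;
          esac
          {
            echo "environment=$ENV"
            echo "aws_role=$AWS_ROLE"
            echo "cluster_name=$CLUSTER_NAME"
          } >> "$GITHUB_OUTPUT"
```

The rest of the pipeline follows the same naming, so a QA environment also needs a deploy/helm/envs/qa/values.yaml file, an ArgoCD application named my-app-qa, and a base URL for the smoke test.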

The smoke test currently checks a single health endpoint. Extend it with your most critical user-facing flow: for a web service, a curl to an API endpoint that exercises the database connection is usually enough. For a payment service, check that the checkout initiation endpoint returns 200 (without completing a payment). The goal is to catch the most common failure modes (app can't connect to DB, config is wrong) within a couple of minutes of deploy; the loop as written gives up after roughly 100 seconds.
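
One way to extend the check is to wrap the retry loop in a shell function and call it once per endpoint, as in this sketch of a replacement step. The /api/status/db path is hypothetical; substitute whichever endpoint exercises your database.

```yaml
      - name: Smoke test critical paths
        env:
          BASE_URL: ${{ needs.resolve-env.outputs.environment == 'production' && secrets.PRODUCTION_BASE_URL || secrets.STAGING_BASE_URL }}
        run: |
          # Poll one path until it returns 200, up to 10 attempts, 10 seconds apart
          check() {
            local path="$1" status
            for i in {1..10}; do
              # "|| true" keeps the loop going when curl cannot connect at all
              status=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 "$BASE_URL$path" || true)
              if [[ "$status" == "200" ]]; then
                echo "OK $path (attempt $i)"
                return 0
              fi
              echo "Attempt $i for $path: got $status, retrying in 10s..."
              sleep 10
            done
            echo "FAILED $path after 10 attempts"
            return 1
          }

          check /healthz
          check /api/status/db   # hypothetical endpoint that touches the database
```

Because the script runs with bash -e, the first check that returns non-zero fails the step, which in turn triggers the rollback job.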

The "Update image tag" step only edits the values file in the runner's checkout. To make the change appear in your GitOps repository's history, and to let ArgoCD see it, follow the yq command with a Git commit and push. This requires a deploy key or a GitHub App token with write access to the GitOps repository.
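
A sketch of what the committed variant could look like, assuming a separate GitOps repository. The repository name my-org/my-app-gitops and the GITOPS_TOKEN secret are hypothetical, and the bot identity in git config is only a convention.

```yaml
      - name: Check out GitOps repo
        uses: actions/checkout@v4
        with:
          repository: my-org/my-app-gitops    # hypothetical repository
          token: ${{ secrets.GITOPS_TOKEN }}  # hypothetical token with write access
          path: gitops

      - name: Commit image tag
        working-directory: gitops
        env:
          IMAGE_TAG: ${{ needs.build.outputs.image_tag }}
          ENV: ${{ needs.resolve-env.outputs.environment }}
        run: |
          # strenv() reads the tag from the environment, avoiding shell quoting inside yq
          yq -i '.image.tag = strenv(IMAGE_TAG)' "deploy/helm/envs/$ENV/values.yaml"

          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"

          # Re-running a deploy for the same commit leaves nothing to commit
          if git diff --quiet; then
            echo "Image tag already at $IMAGE_TAG, nothing to commit"
            exit 0
          fi

          git commit -am "deploy($ENV): my-app $IMAGE_TAG"
          git push
```

If the values file lives in the application repository instead, the same commands work after the existing checkout, provided the job has contents: write permission and the branch allows pushes from CI.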
