RESOURCE · TERRAFORM MODULE

A minimal-but-real Terraform module for production EKS

VPC, EKS cluster, managed node groups, IRSA, and three essential add-ons — with sane defaults and explicit reasoning for every choice.

Terraform 1.5+ · AWS EKS 1.30 · IRSA · Helm 3

What you get

Four root files: versions.tf, variables.tf, main.tf, outputs.tf — ready to use as a module or as a root configuration
Three add-ons pre-wired: cluster-autoscaler with IRSA, metrics-server (required for HPA), and AWS Load Balancer Controller
Output values for every downstream dependency: OIDC provider ARN/URL, cluster endpoint, private/public subnet IDs
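As a rough illustration of what those outputs could look like, here is a sketch of an outputs.tf. It assumes main.tf declares module "eks" and module "vpc" from the community terraform-aws-modules registry modules; only oidc_provider_arn and private_subnet_ids are output names taken from this page, and the rest are hypothetical.

```hcl
# outputs.tf (sketch, not the module's actual file).
# Assumes main.tf contains `module "eks"` (terraform-aws-modules/eks/aws)
# and `module "vpc"` (terraform-aws-modules/vpc/aws).

output "cluster_endpoint" {
  description = "Private API server endpoint for kubectl and providers"
  value       = module.eks.cluster_endpoint
}

output "oidc_provider_arn" {
  description = "OIDC provider ARN, used as the federated principal in IRSA trust policies"
  value       = module.eks.oidc_provider_arn
}

output "oidc_provider_url" {
  description = "OIDC issuer URL (hypothetical output name)"
  value       = module.eks.cluster_oidc_issuer_url
}

output "private_subnet_ids" {
  description = "Private subnets for nodes and in-cluster agents"
  value       = module.vpc.private_subnets
}

output "public_subnet_ids" {
  description = "Public subnets for internet-facing load balancers (hypothetical output name)"
  value       = module.vpc.public_subnets
}
```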

What it assumes

Terraform 1.5+ and AWS CLI configured with credentials that can create EKS and VPC resources
An S3 bucket and DynamoDB table for remote state (recommended — not included in the module)
Familiarity with AWS IAM; IRSA requires understanding of OIDC trust relationships

The code

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.27"
    }
    helm = {
      source  = "hashicorp/helm"
      version = "~> 2.13"
    }
    tls = {
      source  = "hashicorp/tls"
      version = "~> 4.0"
    }
  }
}

View full repo on GitHub

The explanation

Why managed node groups instead of Karpenter

Karpenter is excellent, but it introduces operational complexity that isn't worth it for most teams at the start of their EKS journey. Managed node groups are boring in the right way: AWS handles the node lifecycle, AMI updates ship as a simple node group update, and the operational surface area is small.

Switch to Karpenter when you have workloads with highly variable compute shapes (some jobs need 4xlarge, others need spot 2xlarge), when your cluster-autoscaler scale-up is too slow for your traffic patterns, or when you're spending more than a few hours per month maintaining node group configurations. For a standard web application cluster, managed node groups and cluster-autoscaler are the right choice for the first 12-18 months.

Why the VPC uses multiple NAT gateways

The module sets single_nat_gateway = false, which provisions one NAT gateway per availability zone. Each gateway costs roughly $32/month in hourly charges at us-east-1 pricing, so a three-AZ layout runs about $96/month in total, or about $64/month more than a single NAT gateway, before data processing fees. The tradeoff: if the AZ containing your single NAT gateway fails, outbound traffic from every private subnet is blocked. Unless VPC endpoints cover the services involved, your nodes can't pull images, your app can't reach AWS APIs, and cluster-autoscaler can't call the Auto Scaling API. For production workloads, a single NAT gateway is a hidden single point of failure. The comment in the module flags this explicitly so you can make a conscious choice rather than discovering it during an AZ outage.
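A minimal sketch of the NAT layout described above, assuming the VPC is built with the community terraform-aws-modules/vpc/aws module (this page does not confirm which VPC module is used). The name, AZs, and CIDR ranges are placeholders.

```hcl
# Sketch only: VPC with one NAT gateway per availability zone.
# Name, AZs, and CIDRs are illustrative placeholders.
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "eks-production"
  cidr = "10.0.0.0/16"

  azs             = ["us-east-1a", "us-east-1b", "us-east-1c"]
  private_subnets = ["10.0.0.0/19", "10.0.32.0/19", "10.0.64.0/19"]
  public_subnets  = ["10.0.96.0/24", "10.0.97.0/24", "10.0.98.0/24"]

  enable_nat_gateway = true

  # false = do not funnel all private subnets through one gateway.
  # Setting this to true saves money but makes one AZ a single point
  # of failure for all outbound traffic.
  single_nat_gateway = false

  # Pin the count to one gateway per AZ rather than one per private subnet.
  one_nat_gateway_per_az = true
}
```

In a non-production environment, flipping single_nat_gateway to true is a reasonable way to cut cost, as long as the choice is deliberate.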

Why private endpoint only

The module disables the public API server endpoint (cluster_endpoint_public_access = false). Your API server is then unreachable from the public internet: to run kubectl you need a network path into the VPC, such as a site-to-site VPN, AWS Client VPN, or a bastion host. For CI/CD, your GitHub Actions runner must either be self-hosted inside the VPC or run in a network that reaches the cluster through VPC peering or AWS PrivateLink.

The alternative (public endpoint with IP allowlisting) is acceptable for teams without the infrastructure to run private access, but the allowlist is frequently misconfigured — people add 0.0.0.0/0 to unblock a developer and forget to remove it. Defaulting to private removes this risk entirely.
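The endpoint settings discussed in this section might look like the following sketch, assuming the cluster is created with the community terraform-aws-modules/eks/aws module (v20 input names). The cluster name, variable names, and the commented allowlist CIDR are placeholders.

```hcl
# Sketch only: private-only API server endpoint.
variable "vpc_id" {
  type = string
}

variable "private_subnet_ids" {
  type = list(string)
}

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = "production" # placeholder
  cluster_version = "1.30"

  vpc_id     = var.vpc_id
  subnet_ids = var.private_subnet_ids

  # API server reachable only from inside the VPC (or connected networks).
  cluster_endpoint_private_access = true
  cluster_endpoint_public_access  = false

  # The allowlisting alternative, shown for comparison. Keep the list
  # narrow; 0.0.0.0/0 here defeats the purpose.
  # cluster_endpoint_public_access       = true
  # cluster_endpoint_public_access_cidrs = ["203.0.113.10/32"]
}
```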

IRSA instead of node IAM roles

IAM Roles for Service Accounts (IRSA) allows individual Kubernetes workloads to assume an IAM role — without storing credentials anywhere and without granting the entire node's IAM role broad permissions. The module provisions IRSA roles for the EBS CSI driver, cluster-autoscaler, and Load Balancer Controller. Each role has exactly the permissions needed for that add-on, scoped to the specific service account in the specific namespace.

The alternative — attaching a broad IAM policy to the node IAM role — means any pod on that node can access AWS resources as if it were the node. A compromised pod can exfiltrate data from S3, create EC2 instances, or assume other roles. IRSA limits the blast radius to the permissions of the specific role the pod is allowed to assume.
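To make the scoping concrete, here is a sketch of an IRSA trust policy for cluster-autoscaler. It is not the module's actual code: the variable names, role name, and the kube-system/cluster-autoscaler service account are assumptions for illustration.

```hcl
# Sketch only: an IAM role that a single service account may assume.
variable "oidc_provider_arn" {
  type = string
}

variable "oidc_provider_url" {
  type        = string
  description = "OIDC issuer URL, e.g. https://oidc.eks.<region>.amazonaws.com/id/<id>"
}

locals {
  # Condition keys use the issuer host and path without the scheme.
  oidc_issuer = replace(var.oidc_provider_url, "https://", "")
}

data "aws_iam_policy_document" "autoscaler_assume" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [var.oidc_provider_arn]
    }

    # Only this service account in this namespace can assume the role.
    condition {
      test     = "StringEquals"
      variable = "${local.oidc_issuer}:sub"
      values   = ["system:serviceaccount:kube-system:cluster-autoscaler"]
    }

    # Only tokens issued for STS are accepted.
    condition {
      test     = "StringEquals"
      variable = "${local.oidc_issuer}:aud"
      values   = ["sts.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "cluster_autoscaler" {
  name               = "cluster-autoscaler-irsa" # hypothetical name
  assume_role_policy = data.aws_iam_policy_document.autoscaler_assume.json
}
```

The service account then carries an eks.amazonaws.com/role-arn annotation pointing at this role. A permissions policy still needs to be attached separately; the trust policy only controls who can assume the role.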

Why observability is not bundled

This module deliberately excludes Prometheus, Grafana, Loki, and alerting. Observability stacks have strong opinions and significant configuration surface area — the right setup depends on whether you want managed (Amazon Managed Prometheus, Grafana Cloud) or self-hosted, how much retention you need, your alert routing preferences, and your budget. Bundling a default observability stack means you either accept defaults that might not fit your needs, or you spend more time configuring the module than if you'd set it up separately.

After provisioning with this module, deploy observability as a second step using its output values (oidc_provider_arn for IRSA, private_subnet_ids for placing agents). The outputs are designed to make this handoff clean.

Remote state

This module doesn't include a backend configuration — that belongs in your root module or workspace, not in a reusable module. Before running terraform init, create an S3 bucket and DynamoDB table for state locking, then configure your backend:

terraform {
  backend "s3" {
    bucket         = "your-org-tfstate"
    key            = "eks/production/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}

Use workspaces or separate state files per environment — never share state between staging and production.
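The bucket and lock table referenced in the backend block have to exist before terraform init. One way to bootstrap them, as a sketch in a separate small configuration with its own local state, reusing the placeholder names from the backend example:

```hcl
# Sketch only: state bucket and lock table, applied once from a
# separate bootstrap configuration.
resource "aws_s3_bucket" "tfstate" {
  bucket = "your-org-tfstate"
}

# Versioning lets you recover an earlier state file after a bad apply.
resource "aws_s3_bucket_versioning" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# The S3 backend expects a string hash key named exactly "LockID".
resource "aws_dynamodb_table" "tfstate_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

For separate state per environment, a different key per environment (for example eks/staging/terraform.tfstate alongside eks/production/terraform.tfstate) is usually enough.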

Customisation notes

The most common production customisation is adding a second node group for specialised workloads. A typical pattern is a general-purpose group using on-demand instances (for stateful or latency-sensitive services) and a batch group using spot instances (for background workers). Add taints to the batch group and tolerations to the batch workloads so they don't mix.
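A sketch of the spot batch group described above, written against the plain aws_eks_node_group resource so it does not depend on how the module structures its node group inputs. The variable names, instance types, sizes, and the workload=batch taint are illustrative assumptions.

```hcl
# Sketch only: a tainted spot node group for batch workloads.
variable "cluster_name" {
  type = string
}

variable "node_role_arn" {
  type = string
}

variable "private_subnet_ids" {
  type = list(string)
}

resource "aws_eks_node_group" "batch_spot" {
  cluster_name    = var.cluster_name
  node_group_name = "batch-spot"
  node_role_arn   = var.node_role_arn
  subnet_ids      = var.private_subnet_ids

  capacity_type = "SPOT"

  # Several similar sizes improve the odds of getting spot capacity.
  instance_types = ["m5.2xlarge", "m5a.2xlarge", "m6i.2xlarge"]

  scaling_config {
    min_size     = 0
    desired_size = 0
    max_size     = 10
  }

  labels = {
    workload = "batch"
  }

  # Keeps general workloads off these nodes unless they tolerate the taint.
  taint {
    key    = "workload"
    value  = "batch"
    effect = "NO_SCHEDULE"
  }

  # cluster-autoscaler changes desired_size at runtime; don't fight it.
  lifecycle {
    ignore_changes = [scaling_config[0].desired_size]
  }
}
```

Batch pods would then need a matching toleration (key workload, value batch, effect NoSchedule) and, ideally, a node selector on the workload=batch label so they land only on this group.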

For EKS version upgrades: update the cluster_version variable, run terraform plan to verify only the expected resources change, then apply. EKS upgrades the control plane in place, one minor version at a time; managed node groups then go through a rolling replacement. The force_update_version flag pushes that replacement through when pods can't be drained because of a pod disruption budget; leave it off and drain nodes manually if you want more control.

If you need private cluster access from GitHub Actions, use an AWS CodeBuild project as a self-hosted runner inside the VPC, or configure AWS Client VPN. For interactive access, AWS Systems Manager Session Manager lets you reach an instance inside the VPC without a public bastion host or inbound SSH.
