© 2026 TheSimplifiedTech. All rights reserved.

Infrastructure as Code: Terraform & CloudFormation

Why declarative IaC exists, how Terraform's plan/apply/state workflow operates, what the state file is and why losing it is catastrophic, Terraform vs CloudFormation vs Pulumi trade-offs, and the module and remote state patterns used in production.

🎯Key Takeaways
IaC enables PR review for infrastructure changes, reproducible environments, and drift detection — console clicks have none of these
The Terraform state file maps HCL resources to actual cloud resources; losing it does not destroy infrastructure but makes it unmanageable
Always use remote state (S3 + DynamoDB) with versioning and locking for any team-based Terraform usage
Use terraform plan -out=tfplan to save the plan, then terraform apply tfplan to prevent plan drift between CI stages
Split large state files by domain (networking, compute, database) to limit blast radius and serialisation of IaC operations

Why IaC Exists and What It Actually Solves

Infrastructure as Code is the practice of defining and managing cloud resources through version-controlled configuration files rather than manual console clicks or one-off CLI commands. The core problem it solves is not laziness — it is drift, repeatability, and review.

Problems IaC solves that manual operations cannot

  • Drift: production diverges from documentation — A developer adds an S3 bucket in the console during an incident and forgets to document it. Three months later, a security audit finds an unmanaged bucket with sensitive data. IaC narrows this gap: out-of-band changes to resources Terraform already manages show up as drift on the next plan, and a rule that all infrastructure lives in code makes an untracked resource an obvious violation. Drift detection cannot see a resource that was never in state, so a bucket created entirely in the console still has to be caught by an account inventory or audit.
  • Repeatability: identical environments are impossible manually — Creating a dev environment that perfectly mirrors production requires either perfect documentation (which never exists) or a script (which is IaC without the state management). Applying the same configuration, with only environment-specific variables changed, produces structurally identical environments every time.
  • Review: infrastructure changes go through pull requests — A Terraform PR, together with its plan output, shows exactly what will be created, modified, or destroyed. Security engineers, architects, and teammates review the change before it touches production. Console clicks have no review trail.
  • Disaster recovery: recreate everything from code — If your production account is compromised or accidentally deleted, IaC lets you rebuild the environment's infrastructure in hours; the data itself still has to be restored from backups. Without IaC, you are looking at days of console work from memory and screenshots.

Declarative vs imperative IaC

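Terraform and CloudFormation are declarative: you describe what you want (the desired state), and the tool computes the diff against the current state and executes only the necessary changes. Pulumi programs are written in general-purpose languages, but they also produce a desired-state resource graph that the Pulumi engine diffs in the same way. Shell scripts and Ansible playbooks describe how to get there as an ordered sequence of steps (imperative). Declarative IaC is idempotent: if nothing has changed, running it a second time makes no changes. Ansible modules are usually written to be idempotent, but a plain script is not: running "aws ec2 run-instances" twice launches two instances. Declarative is strongly preferred for infrastructure definitions.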
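A minimal sketch of the declarative style, assuming the AWS provider is already configured and a VPC ID is passed in; the variable, resource, and group names are hypothetical. The configuration states only the end result, so applying it repeatedly converges on one security group instead of creating a new one each time.

```hcl
# Hypothetical input: the VPC to place the security group in.
variable "vpc_id" {
  type = string
}

# Desired state: one security group allowing inbound HTTPS.
# The first apply creates it. A second apply with no config changes reports
# "No changes" because state already maps this address to the real group.
resource "aws_security_group" "web" {
  name        = "web-sg"
  description = "Allow inbound HTTPS"
  vpc_id      = var.vpc_id

  ingress {
    description = "HTTPS from anywhere"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # Terraform removes AWS's default allow-all egress rule on new VPC security
  # groups, so it is re-declared here explicitly.
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```

Changing the ingress port and re-applying updates the existing group in place rather than adding a second one; that reconciliation against recorded state is what the imperative "run the command again" approach lacks.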

Terraform: Plan, Apply, State

Terraform's workflow has three phases: write configuration (HCL), run terraform plan to see what will change, run terraform apply to make it so. The state file is what enables the plan to know the difference between "this resource needs to be created" and "this resource already exists and needs to be updated."

The Terraform workflow in a team environment

  1. Write HCL configuration defining resources. Commit to a feature branch.
  2. Open a pull request. CI runs terraform fmt -check (formatting) and terraform validate (syntax and internal consistency check).
  3. CI runs terraform plan against the remote state backend, saving the plan with -out. The plan output is posted as a PR comment showing the exact changes.
  4. The team reviews the plan. The security team checks for overly permissive IAM policies or public S3 buckets.
  5. After approval, CI runs terraform apply using the saved, reviewed plan. The state file is updated automatically.
  6. Monitor the apply output and confirm resources are created correctly. Tag the release.

The state file is the source of truth — treat it like a production database

Terraform state (terraform.tfstate) records the mapping between your HCL resource definitions and the actual cloud resources they manage. If the state file is deleted, Terraform no longer knows which resources it manages, and the next terraform apply tries to create everything again. Resources without unique names (EC2 instances, for example) get duplicated, while resources with unique names (S3 buckets, IAM roles) fail with already-exists errors partway through, leaving the environment in an inconsistent state. The state file can also contain plaintext secrets (database passwords, API keys) as values of sensitive resource attributes. Never store state in a local file in a team environment. Never commit state to git.
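If state is lost, existing resources can be brought back under management without recreating them. A minimal sketch using the import block available since Terraform 1.5; the resource address and bucket name are hypothetical placeholders.

```hcl
# Declarative import (Terraform >= 1.5): on the next plan/apply, Terraform
# adopts the existing bucket into state at this address instead of trying
# to create a new one.
import {
  to = aws_s3_bucket.my_bucket
  id = "my-existing-bucket-name"
}

# The configuration must describe the adopted resource. If no resource block
# exists yet for the import target, running
#   terraform plan -generate-config-out=generated.tf
# drafts one for you to review.
resource "aws_s3_bucket" "my_bucket" {
  bucket = "my-existing-bucket-name"
}
```

The resulting plan should list the bucket as imported, with no create or destroy actions. Re-importing every resource this way is slow for a large environment, which is why versioned remote state (so loss never happens) is the better defence.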

Remote state backend requirements for teams

  • Centralised storage: S3, Terraform Cloud, or GCS — State must be accessible to all CI pipelines and team members. S3 with versioning enabled is the standard choice on AWS. Versioning lets you recover from a corrupted state by rolling back to a previous version.
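  • State locking: DynamoDB (for S3 backend) or native locking (Terraform Cloud) — Prevents two concurrent applies from modifying state at the same time. Without locking, two CI pipelines running apply simultaneously on the same workspace can overwrite each other's writes and corrupt the state file. On Terraform 1.10 and later, the S3 backend can also lock natively with use_lockfile = true, and from 1.11 the DynamoDB-based option is deprecated.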
  • Encryption: S3 SSE or Terraform Cloud encryption — State files can contain plaintext resource outputs including database passwords, private keys, and connection strings. The S3 bucket must have server-side encryption and strict IAM access policies.
  • Separate state per environment — Never share state between dev, staging, and production. A plan/apply mistake in dev that corrupts state should not affect production. Use separate S3 keys (dev/terraform.tfstate, prod/terraform.tfstate) or separate Terraform Cloud workspaces.
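The requirements above map to a handful of AWS resources. A minimal sketch of a small bootstrap root module, assuming AWS provider v5 (which splits S3 bucket settings into separate resources); the bucket and table names are hypothetical and match the backend example that follows. On Terraform 1.10+ the lock table can be replaced by use_lockfile = true on the backend.

```hcl
# Bootstrap resources for the remote state backend. Usually kept in their own
# small root module and applied once before any other configuration.
resource "aws_s3_bucket" "tf_state" {
  bucket = "my-company-terraform-state"

  lifecycle {
    prevent_destroy = true # losing this bucket loses every state file in it
  }
}

# Versioning: lets you roll back to an earlier state file after corruption.
resource "aws_s3_bucket_versioning" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id
  versioning_configuration {
    status = "Enabled"
  }
}

# Encryption at rest with KMS (AWS-managed key): state can hold plaintext secrets.
resource "aws_s3_bucket_server_side_encryption_configuration" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

# State must never be publicly readable.
resource "aws_s3_bucket_public_access_block" "tf_state" {
  bucket                  = aws_s3_bucket.tf_state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Lock table for the S3 backend's dynamodb_table setting.
# The backend expects a string partition key named exactly "LockID".
resource "aws_dynamodb_table" "tf_locks" {
  name         = "terraform-state-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```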
main.tf
```hcl
# terraform/main.tf: example remote state backend configuration
terraform {
  required_version = ">= 1.5.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  # Always use a remote backend in team environments: local state causes
  # corruption and leaks secrets.
  backend "s3" {
    bucket         = "my-company-terraform-state"
    key            = "production/vpc/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true                    # server-side encryption of the state object
    dynamodb_table = "terraform-state-locks" # lock table: prevents concurrent applies from corrupting state
    # On Terraform >= 1.10, use_lockfile = true enables S3-native locking instead.
    # Never hardcode credentials here: use an IAM role or environment variables.
  }
}

# Provider configuration; credentials come from the environment or an IAM role.
provider "aws" {
  region = "us-east-1"
}

# Example VPC built with a public registry module
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.1.0"

  name = "production-vpc"
  cidr = "10.0.0.0/16"
  azs  = ["us-east-1a", "us-east-1b", "us-east-1c"]

  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]

  enable_nat_gateway = true
  # One NAT gateway per AZ: more expensive, but required for AZ-resilient
  # private-subnet egress.
  single_nat_gateway   = false
  enable_dns_hostnames = true

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}

# View current state without applying:
#   terraform state list
#   terraform state show 'module.vpc.aws_vpc.this[0]'
#
# Import a resource created outside Terraform (e.g. an emergency console fix).
# aws_s3_bucket.my_bucket must already be declared in the configuration:
#   terraform import aws_s3_bucket.my_bucket my-existing-bucket-name
#
# Detect drift between state and actual infrastructure:
#   terraform plan -refresh-only
```
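For other root modules to read this VPC through a terraform_remote_state data source, the values have to be exported explicitly, because only root-module outputs are stored in a form remote-state consumers can read. A sketch of an outputs.tf placed next to main.tf, assuming the vpc_id and private_subnets outputs exposed by the terraform-aws-modules/vpc module; the output names chosen here are illustrative.

```hcl
# outputs.tf (alongside main.tf): root-module outputs are written to state and
# are the only values a terraform_remote_state consumer can read.
output "vpc_id" {
  description = "ID of the production VPC"
  value       = module.vpc.vpc_id
}

output "private_subnet_ids" {
  description = "IDs of the private subnets, one per AZ"
  value       = module.vpc.private_subnets
}
```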

Terraform vs CloudFormation vs Pulumi

The IaC ecosystem has three primary tools, each with different design philosophies, ecosystem maturity, and operational trade-offs. Knowing when to use each — and what pain points each brings — is essential knowledge for cloud engineers.

| Dimension | Terraform (HCL) | CloudFormation (YAML/JSON) | Pulumi (Python/TypeScript/Go) |
|---|---|---|---|
| Language | HCL (domain-specific) | YAML / JSON | General-purpose (Python, TS, Go, C#) |
| Multi-cloud | Yes (900+ providers) | AWS only | Yes (major providers) |
| State management | External (S3 + DynamoDB or Terraform Cloud) | Managed by the CloudFormation service | Pulumi Cloud or a self-managed backend |
| Drift detection | terraform plan -refresh-only | CloudFormation drift detection | pulumi refresh |
| Destroy protection | prevent_destroy lifecycle rule | DeletionPolicy: Retain | protect resource option |
| Nested / modular | Modules (local and registry) | Nested stacks | Stacks + component resources |
| Testing | Terratest, terraform test (1.6+) | cfn-lint, taskcat | Unit tests with mocks, integration tests |
| Secrets handling | sensitive flag redacts output, but values are stored unencrypted in state | SSM / Secrets Manager dynamic references | Secrets encrypted in state |
| Learning curve | Low (HCL is simple) | Medium (YAML is verbose) | Low for existing programmers, higher for ops engineers |
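Two of the Terraform rows above, destroy protection and secrets handling, are easiest to see in HCL. A minimal sketch, assuming an RDS PostgreSQL instance; the identifier, instance size, storage, and username are hypothetical.

```hcl
# Marked sensitive: the value is redacted in plan/apply output,
# but it is still written to the state file in plaintext.
variable "db_password" {
  type      = string
  sensitive = true
}

resource "aws_db_instance" "main" {
  identifier        = "prod-db"
  engine            = "postgres"
  instance_class    = "db.t3.medium"
  allocated_storage = 20
  username          = "app_admin"
  password          = var.db_password # ends up in terraform.tfstate

  # AWS-side guard: the RDS API refuses to delete the instance while this is set.
  deletion_protection = true

  lifecycle {
    # Terraform-side guard: any plan that would destroy this resource fails.
    prevent_destroy = true
  }
}
```

prevent_destroy only protects the resource while its block is in the configuration; deleting the block removes the guard with it, which is why an AWS-side setting such as deletion_protection is a useful second layer.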

Choose Terraform for multi-cloud or when the team knows cloud infrastructure; CloudFormation when you need native AWS integrations

Terraform wins on multi-cloud and community module ecosystem. CloudFormation wins when you need native AWS integrations (StackSets for multi-account, Service Catalog, CloudFormation Hooks for policy enforcement) and AWS-managed state. Pulumi wins when your team is composed of developers who want to write real code with conditionals, loops, and unit tests rather than declarative configuration. For a team new to IaC, Terraform is the safest default due to the richest ecosystem and community.

Always use terraform plan -out to prevent plan drift between CI and apply

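Running terraform plan and terraform apply as separate CI steps without a saved plan means apply re-plans from scratch, so it can execute changes nobody reviewed if the infrastructure, state, or code changed in between. Use terraform plan -out=tfplan to save the exact plan, then terraform apply tfplan to execute exactly that plan. If the state has changed since the plan was created, Terraform rejects the saved plan as stale instead of applying it. The plan file can contain sensitive values, so store it as a protected CI artifact. This is the correct production workflow.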

Module Patterns and Organisation at Scale

As an IaC codebase grows, unstructured HCL becomes just as hard to maintain as unstructured code. Module patterns, workspace strategies, and the monorepo vs repo-per-service question significantly affect how productive your team is with Terraform over time.

IaC organisation patterns in production

  • Root module structure: separate by environment and service — terraform/environments/prod/vpc/, terraform/environments/prod/ecs/, terraform/environments/staging/vpc/. Each directory is an independent Terraform root module with its own state. Prevents a single large state file from becoming a blast-radius problem.
  • Shared modules: reuse with versioning via registry or git tags — module "vpc" { source = "git::https://github.com/company/terraform-modules.git//vpc?ref=v2.1.0" }. Pin to a specific version tag. Updating a shared module should trigger a plan review in all consuming root modules before merging.
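  • Data sources for cross-state references (not outputs shared directly) — The networking root module exports its VPC ID as an output. The application root module reads it through a terraform_remote_state data source: data.terraform_remote_state.vpc.outputs.vpc_id. This creates a dependency between state files without coupling their apply lifecycles. Note that terraform_remote_state needs read access to the whole state snapshot, which may contain secrets; publishing shared values to a dedicated store (such as SSM Parameter Store) avoids granting that access.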
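A sketch of the consuming side of these patterns, assuming the networking root module stores its state at the key used in the main.tf example and exports vpc_id and private_subnet_ids outputs. The ecs-service module path and its input names are hypothetical.

```hcl
# Consumer root module, e.g. terraform/environments/prod/ecs/.
# Reads the networking root module's outputs from its remote state.
data "terraform_remote_state" "vpc" {
  backend = "s3"
  config = {
    bucket = "my-company-terraform-state"
    key    = "production/vpc/terraform.tfstate"
    region = "us-east-1"
  }
}

# Shared module pinned to a git tag (hypothetical path and inputs).
module "ecs_service" {
  source = "git::https://github.com/company/terraform-modules.git//ecs-service?ref=v2.1.0"

  # Assumes the networking root module exports these two outputs.
  vpc_id     = data.terraform_remote_state.vpc.outputs.vpc_id
  subnet_ids = data.terraform_remote_state.vpc.outputs.private_subnet_ids
}
```

Each root module still has its own state and its own plan/apply cycle; the ECS stack only re-reads the VPC outputs at plan time, so the networking team can apply independently.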

Large state files are a reliability risk — split them early

A single Terraform state file managing 500 resources has three problems: (1) every plan or apply must lock the entire state, serialising all IaC operations across the team; (2) every plan has to refresh all 500 resources, which makes plans slow; (3) a corrupted state file leaves all 500 resources unmanaged at once. A commonly cited rule of thumb is to keep each state file to roughly 100–150 resources. Split by logical domain: networking state, ECS cluster state, database state, IAM state.

How this might come up in interviews

Cloud engineer, DevOps engineer, and platform engineer interviews almost always cover IaC. Both conceptual questions ("what is state?") and operational scenarios ("apply failed halfway, now what?") are common. Senior roles expect module design and team workflow patterns.

Common questions:

  • What is the Terraform state file and what happens if you lose it?
  • Explain the difference between terraform plan and terraform apply. Why does the -out flag matter?
  • How would you manage Terraform state in a team of 20 engineers working on the same infrastructure?
  • What is the difference between Terraform and CloudFormation? When would you choose one over the other?
  • A terraform apply failed halfway through. How do you determine the state of the infrastructure and recover?

Clarifying questions worth asking before answering an IaC design question: Is there an existing IaC codebase, or is this greenfield? How many engineers will be applying Terraform? Is the workload multi-cloud or AWS-only? Are there compliance requirements for audit logging of infrastructure changes?

Strong answer: Mentions S3 versioning and DynamoDB locking unprompted when describing remote state. Explains the -out flag for plan/apply separation. Talks about splitting state files by domain for blast radius reduction. Mentions terraform state list and terraform import for state recovery scenarios.

Red flags: Does not know what the state file is. Suggests storing state in git. Cannot explain what happens when apply fails partway through. Treats Terraform and Ansible as equivalent tools (they solve different problems).

🧠Mental Model

💡 Analogy

Infrastructure as Code is the architectural blueprint for your building. Terraform is the construction crew that reads the blueprint and builds what it specifies. The state file is the "as-built drawing" — a record of what was actually constructed, including which bolts were used, where every pipe runs, and what deviates from the original plan. Without the as-built drawing, even the original architect cannot safely renovate the building, because they do not know if what was built matches what was designed. Lose the as-built drawing and the crew treats the building as unmeasured — they cannot safely add or remove anything without risking structural damage.

⚡ Core Idea

Terraform computes the difference between desired state (HCL) and current state (state file + cloud API), then executes only the changes needed to reconcile them. The state file is the bridge between code and reality. Without it, Terraform cannot know which resources it manages.

🎯 Why It Matters

IaC is non-negotiable at any serious scale: it enables PR review for infrastructure, reproducible environments, disaster recovery, and drift detection. But the state file creates a new category of operational risk — it is a security-sensitive, business-critical file that must be stored securely, backed up, locked against concurrent access, and never deleted. Understanding the plan/apply/state workflow and its failure modes separates engineers who use Terraform as a productivity tool from those who create new categories of outage with it.

Related concepts

  • Terraform deep dive
  • GitOps Principles
  • Configuration Management