Everything you need to understand about cloud computing -- what it actually is, why it exists, how the major providers compare, and the core services that power every modern application. No marketing fluff, just the concepts and practical knowledge you need to deploy, scale, and operate real software.
Strip away all the marketing, and "the cloud" is just other people's computers that you rent. That's it. When you deploy your app to AWS, your code runs on a physical server sitting in a warehouse (called a data center) owned by Amazon. You don't buy the server. You don't maintain the server. You pay for how much you use.
Running your own servers is like owning a restaurant. You buy the building, install the kitchen, hire staff, handle plumbing when it breaks, pay rent even when nobody is eating. It's total control but total responsibility.
Using the cloud is like renting a commercial kitchen. You show up, cook your food, serve your customers, and leave. Someone else handles the building, the electricity, the fire suppression system, and the plumbing. You just pay for the time and space you use.
The tradeoff: you give up some control (you can't knock out a wall to expand), but you gain speed (start cooking tomorrow, not in 6 months after construction) and flexibility (rent a bigger kitchen next month if business grows).
A data center is a massive building full of racks of servers, connected by fiber optic cables, cooled by industrial HVAC systems, and backed by redundant power supplies (including diesel generators for when the grid fails). AWS alone operates 100+ data centers worldwide. Each one looks like this:
                   DATA CENTER
+----------------------------------------------+
|                                              |
|  +-------+  +-------+  +-------+  +-------+  |
|  | Rack  |  | Rack  |  | Rack  |  | Rack  |  |
|  |Server1|  |Server1|  |Server1|  |Server1|  |
|  |Server2|  |Server2|  |Server2|  |Server2|  |
|  |Server3|  |Server3|  |Server3|  |Server3|  |
|  |  ...  |  |  ...  |  |  ...  |  |  ...  |  |
|  +---+---+  +---+---+  +---+---+  +---+---+  |
|      |          |          |          |      |
|  +---+----------+----------+----------+---+  |
|  |          Network Switch Layer          |  |
|  +---+-------------------------------------+  |
|      |                                       |
|  +---+---+  +----------+  +-----------+      |
|  |Router |  | Cooling  |  |  Backup   |      |
|  |to ISP |  | System   |  |  Power    |      |
|  +-------+  +----------+  +-----------+      |
|                                              |
+----------------------------------------------+
                       |
          To the internet (your users)
Before the cloud, if you wanted to launch a web app, you had to buy physical servers, rent rack space in a data center (or run your own server room), wait weeks for the hardware to arrive, install and patch the operating system yourself, and guess your capacity years in advance -- paying for peak load even when traffic was low.
The cloud eliminates all of this. You click a button (or run a command), and a virtual server appears in seconds. Need 10 more? Click 10 more times. Traffic drops? Turn them off and stop paying. This is elastic computing, and it changed how software is built.
Your code still runs on physical hardware. Hard drives still fail. Networks still have latency. Servers still crash. The cloud doesn't eliminate infrastructure problems -- it makes them someone else's problem and gives you tools to handle failures gracefully (redundancy, auto-scaling, health checks). But if you design your app assuming nothing ever fails, you'll have a bad time, cloud or not.
Three companies dominate cloud computing: AWS (Amazon), Azure (Microsoft), and GCP (Google). Together they hold about 65% of the market. There are also strong alternatives: DigitalOcean, Linode (now Akamai), Hetzner, Vultr, and Cloudflare.
| Aspect | AWS | GCP | Azure |
|---|---|---|---|
| Market share | ~31% (largest) | ~12% | ~25% |
| Best for | Everything. Most services, most mature. | Data/ML, Kubernetes, developer UX | Enterprise, Windows/.NET, Office 365 integration |
| Compute | EC2 | Compute Engine | Virtual Machines |
| Object Storage | S3 | Cloud Storage | Blob Storage |
| Serverless | Lambda | Cloud Functions | Azure Functions |
| Managed K8s | EKS | GKE (best K8s experience) | AKS |
| Managed DB | RDS, DynamoDB, Aurora | Cloud SQL, Spanner, Firestore | Azure SQL, Cosmos DB |
| CDN | CloudFront | Cloud CDN | Azure CDN |
| Free tier | 12-month free tier + always-free tier | $300 credit for 90 days + always-free tier | $200 credit for 30 days + always-free tier |
| CLI | aws | gcloud | az |
| Pricing | Complex, many hidden costs | Simpler, per-second billing | Complex, enterprise-oriented |
| Learning curve | Steepest -- 200+ services | Moderate -- cleaner console | Moderate -- if you know Microsoft ecosystem |
For jobs: Learn AWS. It has the most market share, the most job listings, and the most services. Knowing AWS makes you employable almost anywhere.
For startups: GCP often has the best developer experience and generous free tiers. Their Kubernetes offering (GKE) is best-in-class since Google invented Kubernetes.
For enterprise: If the company already uses Microsoft (Active Directory, Office 365, .NET), Azure is the natural choice because of deep integration.
For personal projects: DigitalOcean, Hetzner, or Vultr. Simpler, cheaper, and you'll learn more about infrastructure because there's less hand-holding.
Every provider has the same core services -- they just give them different names. Here's how to translate:
| What You Need | AWS | GCP | Azure |
|---|---|---|---|
| Virtual server | EC2 | Compute Engine | Virtual Machines |
| Object storage | S3 | Cloud Storage | Blob Storage |
| SQL database | RDS | Cloud SQL | Azure SQL |
| NoSQL database | DynamoDB | Firestore | Cosmos DB |
| Serverless functions | Lambda | Cloud Functions | Azure Functions |
| Container orchestration | ECS / EKS | Cloud Run / GKE | ACI / AKS |
| Message queue | SQS | Pub/Sub | Service Bus |
| DNS | Route 53 | Cloud DNS | Azure DNS |
| Secret management | Secrets Manager | Secret Manager | Key Vault |
| IAM (permissions) | IAM | IAM | Entra ID (was Azure AD) |
Every cloud provider offers hundreds of services, but you only need to understand about 5-6 core ones to be productive. Everything else is built on top of these fundamentals.
A virtual machine in the cloud. You choose the CPU, RAM, and disk size. You get root access. It's like having a remote computer you can SSH into. This is the most fundamental cloud service -- everything else can theoretically run on a VM.
# Launch an EC2 instance (AWS CLI)
aws ec2 run-instances \
--image-id ami-0c55b159cbfafe1f0 \
--instance-type t3.micro \
--key-name my-key-pair \
--security-group-ids sg-0123456789abcdef0
# SSH into it
ssh -i my-key-pair.pem ec2-user@54.123.45.67
# Common instance types (AWS):
# t3.micro -- 2 vCPU, 1 GB RAM -- free tier, dev/test
# t3.medium -- 2 vCPU, 4 GB RAM -- small apps
# m6i.large -- 2 vCPU, 8 GB RAM -- general purpose
# c6i.large -- 2 vCPU, 4 GB RAM -- CPU-intensive (compute-optimized)
# r6i.large -- 2 vCPU, 16 GB RAM -- memory-intensive (databases, caching)
The naming follows a pattern: t3.micro = t (family: burstable) + 3 (generation) + micro (size). Families: t = burstable (cheap, good for variable workloads), m = general purpose, c = compute-optimized, r = memory-optimized, g/p = GPU instances.
Object storage for files of any size -- images, videos, backups, logs, static websites. You upload objects into "buckets" (containers). Each object gets a unique key (like a file path). S3 has 99.999999999% (eleven 9s) durability -- meaning if you store 10 million objects, you might lose 1 every 10,000 years.
# Upload a file to S3
aws s3 cp ./backup.tar.gz s3://my-bucket/backups/backup-2024-01-15.tar.gz
# List bucket contents
aws s3 ls s3://my-bucket/backups/
# Download a file
aws s3 cp s3://my-bucket/backups/backup-2024-01-15.tar.gz ./
# Sync a directory (like rsync)
aws s3 sync ./dist s3://my-website-bucket/ --delete
Managed database services. You pick the engine (PostgreSQL, MySQL, etc.) and the instance size. The cloud provider handles backups, patching, replication, and failover. You just connect and run queries.
# What "managed" means -- the provider handles:
# - Automated daily backups (point-in-time recovery)
# - OS and database engine patching
# - Multi-AZ failover (automatic switch to standby if primary dies)
# - Read replicas (for scaling reads)
# - Monitoring and alerts
# - Storage auto-scaling
# You handle:
# - Schema design
# - Query optimization
# - Application-level connection pooling
# - Access control (who can connect)
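To make "connect and run queries" concrete, here is a minimal Node.js sketch of talking to a managed Postgres instance. It assumes the pg package, a DATABASE_URL environment variable holding the connection string your provider gives you, and a hypothetical users table -- none of these names come from the original text:
// db.js -- connecting to a managed Postgres (RDS, Neon, Supabase, ...)
const { Pool } = require("pg");

const pool = new Pool({
  connectionString: process.env.DATABASE_URL, // e.g. postgresql://user:pass@host:5432/myapp
  ssl: { rejectUnauthorized: true },          // managed databases expect TLS
  max: 10,                                    // application-level connection pooling
});

// Example query against a hypothetical "users" table
async function getUser(id) {
  const { rows } = await pool.query("SELECT id, email FROM users WHERE id = $1", [id]);
  return rows[0];
}

module.exports = { pool, getUser };
Everything the provider handles (backups, patching, failover) happens behind that one connection string; your application code never sees it.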
A Virtual Private Cloud is your own private network in the cloud. It's like having your own isolated section of the data center. You control which resources can talk to each other, which can access the internet, and which ports are open.
YOUR VPC (10.0.0.0/16)
+---------------------------------------------------+
|                                                   |
|  Public Subnet (10.0.1.0/24)                      |
|  +---------------------------------------------+  |
|  |  Load Balancer ---> Web Server (EC2)        |  |
|  +---------------------------------------------+  |
|                       |                           |
|  Private Subnet (10.0.2.0/24)                     |
|  +---------------------------------------------+  |
|  |  App Server (EC2) ---> Database (RDS)       |  |
|  +---------------------------------------------+  |
|                                                   |
+---------------------------------------------------+
          |
  Internet Gateway (only public subnet has access)
Without a VPC, your database would be directly accessible from the internet. Anyone could try to connect. With a VPC, your database sits in a private subnet with no internet access. Only your app servers in the same VPC can reach it. The load balancer sits in a public subnet and forwards traffic to the app servers. This is defense in depth -- even if someone compromises your web server, they can't directly access the database from outside.
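As a rough illustration of that rule, here is a sketch using the AWS SDK for JavaScript to add a security group rule that lets only the app tier's security group reach Postgres on the database. The group IDs are placeholders, and in practice you would more likely express this in Terraform or the console:
// Allow inbound Postgres (5432) to the DB's security group ONLY from the
// app servers' security group -- no CIDR ranges, no public access.
const { EC2Client, AuthorizeSecurityGroupIngressCommand } = require("@aws-sdk/client-ec2");

const ec2 = new EC2Client({ region: "us-east-1" });

async function lockDatabaseToAppTier() {
  await ec2.send(new AuthorizeSecurityGroupIngressCommand({
    GroupId: "sg-0db0000000000000",                           // placeholder: database security group
    IpPermissions: [{
      IpProtocol: "tcp",
      FromPort: 5432,
      ToPort: 5432,
      UserIdGroupPairs: [{ GroupId: "sg-0app000000000000" }], // placeholder: app tier security group
    }],
  }));
}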
A Content Delivery Network caches your content at edge locations around the world. When a user in Tokyo requests your image, they get it from a server in Tokyo instead of your origin server in Virginia. This reduces latency from ~200ms to ~20ms. More on this in Section 9.
These acronyms describe how much the provider manages for you. Think of it as a spectrum from "you manage everything" to "you manage nothing".
What YOU manage vs what the PROVIDER manages:

 On-Premise       IaaS            PaaS            FaaS            SaaS
 (you own         (rent           (rent           (rent           (rent
  everything)      hardware)       platform)       functions)      software)

+----------+   +----------+   +----------+   +----------+   +----------+
|   App    |   |   App    |   |   App    |   | Function |   |          |
+----------+   +----------+   +----------+   +----------+   |          |
| Runtime  |   | Runtime  |   |##########|   |##########|   |          |
+----------+   +----------+   +----------+   +----------+   |  Gmail   |
|    OS    |   |    OS    |   |##########|   |##########|   |  Slack   |
+----------+   +----------+   +----------+   +----------+   | Shopify  |
|   Disk   |   |##########|   |##########|   |##########|   |          |
+----------+   +----------+   +----------+   +----------+   |          |
| Network  |   |##########|   |##########|   |##########|   |          |
+----------+   +----------+   +----------+   +----------+   +----------+

[    = You manage ]      [ ## = Provider manages ]
You rent the hardware (virtual machines, networking, storage). You install and manage everything on top: the OS, runtime, app, security patches.
Examples: AWS EC2, GCP Compute Engine, DigitalOcean Droplets, Hetzner Cloud
Use when: You need full control. Custom OS configurations. Running software that doesn't fit into platform constraints. You have the ops expertise to manage it.
# You spin up a VM, then you're responsible for everything:
ssh root@your-server
apt update && apt upgrade # You patch the OS
apt install nodejs nginx postgresql # You install dependencies
git clone your-repo && npm install # You deploy your code
systemctl enable nginx # You configure the web server
certbot --nginx # You manage SSL certificates
# ...and you set up monitoring, backups, firewalls, log rotation
You give the provider your code, and they handle the rest -- servers, OS, runtime, scaling, SSL, deployments. You just git push.
Examples: Heroku, Vercel, Netlify, Railway, Render, Fly.io, Google App Engine
Use when: You want to ship fast without managing infrastructure. Small teams. Prototypes. Apps that fit standard patterns.
# Heroku: deploy a Node.js app
git push heroku main
# That's it. Heroku detects Node.js, installs dependencies,
# starts your server, gives you a URL, handles SSL, and scales.
# Vercel: deploy a Next.js app
vercel --prod
# Builds, deploys to edge network, SSL, custom domain, done.
# Railway: deploy anything with a Dockerfile
railway up
# Detects Dockerfile, builds, deploys, gives you a URL.
You don't manage anything. You just use the software through a browser or API. The provider handles everything -- infrastructure, updates, security, availability.
Examples: Gmail, Slack, Shopify, GitHub, Notion, Stripe, Twilio
Use when: The software does what you need out of the box. You're a user, not a builder (for that specific function).
You write individual functions. The provider runs them when triggered (HTTP request, file upload, timer, queue message). No servers to manage. You pay per execution, not per hour. More on this in Section 5.
Examples: AWS Lambda, Google Cloud Functions, Cloudflare Workers, Azure Functions
| Model | You Manage | Provider Manages | Cost Model | Best For |
|---|---|---|---|---|
| IaaS | OS, runtime, app, data | Hardware, network, storage | Per hour/second (VM runs) | Full control, custom setups |
| PaaS | App, data | Everything else | Per dyno/instance + usage | Fast deployment, small teams |
| FaaS | Individual functions | Everything else | Per invocation + duration | Event-driven, sporadic workloads |
| SaaS | Configuration, data | Everything | Per user/month or per usage | Standard business tools |
PaaS is amazing for getting started but can become expensive at scale. Heroku's free tier is gone, and a basic production setup (2 dynos + managed Postgres + Redis) can run $100+/month for a workload that would fit on a $20/month VPS. Many startups start on PaaS for speed, then migrate to IaaS or containers as they grow. Plan for that migration path from day one.
Serverless is the most misnamed concept in computing. There are servers -- you just don't think about them. You write a function, upload it, and the cloud provider runs it whenever it's triggered. You don't provision VMs, you don't manage scaling, you don't patch anything. You pay only when your code actually runs.
A traditional server is like a restaurant kitchen: it's running all the time, waiting for orders, costing money even when nobody is eating. A serverless function is like a vending machine: it does nothing until someone presses a button. It serves the request, then goes idle. You don't pay for idle time.
// handler.js -- an AWS Lambda function
// This function runs when triggered (HTTP request, S3 upload, etc.)
exports.handler = async (event) => {
// event contains the trigger data (request body, S3 event, etc.)
const name = event.queryStringParameters?.name || "World";
return {
statusCode: 200,
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
message: `Hello, ${name}!`,
timestamp: new Date().toISOString(),
}),
};
};
// Deploy with AWS CLI:
// aws lambda create-function \
// --function-name hello \
// --runtime nodejs20.x \
// --handler handler.handler \
// --zip-file fileb://function.zip \
// --role arn:aws:iam::123456789:role/lambda-role
// worker.js -- runs at the edge (200+ locations worldwide)
// Response time: ~10ms because it runs near the user
export default {
async fetch(request, env) {
const url = new URL(request.url);
if (url.pathname === "/api/hello") {
return new Response(JSON.stringify({ message: "Hello from the edge!" }), {
headers: { "Content-Type": "application/json" },
});
}
return new Response("Not Found", { status: 404 });
},
};
// Deploy: wrangler deploy
When a serverless function hasn't been called recently, the provider needs to spin up a new execution environment (load your code, initialize the runtime, run your initialization code). This is called a cold start, and it adds latency to the first request.
| Runtime | Cold Start Time | Notes |
|---|---|---|
| Node.js | 100-500ms | Fast, good default choice |
| Python | 200-600ms | Depends on imports (numpy is heavy) |
| Go | 50-200ms | Compiled binary, fastest cold starts |
| Java | 1-5 seconds | JVM startup is heavy; use GraalVM to improve |
| Cloudflare Workers | <5ms | V8 isolates, not containers -- nearly instant |
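One common way to soften those cold starts is to do expensive setup once, at module scope, so every warm invocation reuses it. A minimal sketch, assuming a hypothetical config file in S3 (the bucket and key are made up for illustration):
// handler.js -- module scope runs once per cold start, then is reused
const { S3Client, GetObjectCommand } = require("@aws-sdk/client-s3");

const s3 = new S3Client({ region: "us-east-1" }); // created once per execution environment
let cachedConfig = null;                          // survives across warm invocations

exports.handler = async (event) => {
  if (!cachedConfig) {
    // Only the first (cold) invocation pays for this round trip
    const res = await s3.send(new GetObjectCommand({ Bucket: "my-config-bucket", Key: "config.json" }));
    cachedConfig = JSON.parse(await res.Body.transformToString());
  }
  return { statusCode: 200, body: JSON.stringify({ config: cachedConfig }) };
};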
| Good Fit | Bad Fit |
|---|---|
| API endpoints with variable traffic | Long-running processes (>15 min) |
| Event processing (file uploads, webhooks) | WebSocket connections |
| Scheduled tasks (cron jobs) | Stateful applications |
| Image/video processing triggers | High-throughput, consistent traffic |
| Low-traffic APIs (pay $0 when idle) | Applications needing <10ms latency on every request |
| MVP / prototypes (ship fast, worry later) | Complex apps with many interconnected services |
Serverless is cheap for low traffic (you pay nothing when idle). But at high, consistent traffic, it gets expensive. A Lambda function handling 10 million requests/month with 500ms average duration costs roughly $30-50/month. That same workload on a $5/month VPS would handle it easily. Serverless is cheaper below a threshold and more expensive above it. That crossover point varies, but for many apps it's around 1-5 million requests/month.
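You can sanity-check that crossover for your own traffic with back-of-the-envelope math. The sketch below treats Lambda's x86 prices (roughly $0.20 per million requests plus roughly $0.0000167 per GB-second) as assumptions -- plug in current prices and your own numbers:
// Rough Lambda vs. flat-rate VPS comparison (assumed prices -- verify current ones)
function lambdaMonthlyCost({ requests, avgDurationMs, memoryGb }) {
  const requestCost = (requests / 1_000_000) * 0.20;             // ~$0.20 per 1M requests
  const gbSeconds = requests * (avgDurationMs / 1000) * memoryGb;
  const computeCost = gbSeconds * 0.0000166667;                   // ~$ per GB-second
  return requestCost + computeCost;
}

const workload = { requests: 10_000_000, avgDurationMs: 500, memoryGb: 0.5 };
console.log(lambdaMonthlyCost(workload).toFixed(2)); // ~43.67 for this workload
// A small VPS at a flat ~$5/month handles the same ~4 requests/second comfortably.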
CI/CD automates the process of testing, building, and deploying your code. Without it, deployments are manual, error-prone, and scary. With it, you push code to Git and it automatically gets tested and deployed. This is how professional teams ship software.
Continuous Integration (CI): Every time you push code, automated tests run. If tests fail, the build fails and you know immediately. This catches bugs before they reach production. The "continuous" part means it happens on every push, not once a week.
Continuous Delivery (CD): After CI passes, the code is automatically packaged and ready to deploy. A human clicks "deploy" to push to production.
Continuous Deployment (also CD): After CI passes, the code automatically deploys to production with no human intervention. This is what most modern teams aim for.
Developer pushes code to GitHub
              |
              v
   +--------------------+
   |  CI Pipeline Runs  |   (automated)
   |  - Install deps    |
   |  - Lint code       |
   |  - Run tests       |
   |  - Build project   |
   +---------+----------+
             |
        Tests pass?
         /       \
       YES        NO
        |          |
        v          v
   +--------+  +----------+
   | Deploy |  |  Notify  |
   |   to   |  | developer|
   |  prod  |  | (fix it) |
   +--------+  +----------+
GitHub Actions runs workflows defined in YAML files inside your repo (.github/workflows/). Workflows are triggered by events (push, pull request, schedule, manual). Each workflow has jobs, and each job has steps.
# .github/workflows/ci.yml
name: CI/CD Pipeline

# When to run
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest  # GitHub provides the VM
    steps:
      # 1. Check out your code
      - uses: actions/checkout@v4
      # 2. Set up Node.js
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'  # Cache node_modules between runs
      # 3. Install dependencies
      - run: npm ci
      # 4. Lint
      - run: npm run lint
      # 5. Run tests
      - run: npm test
      # 6. Build
      - run: npm run build

  deploy:
    needs: test  # Only runs if "test" job passes
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'  # Only deploy from main branch
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'
      - run: npm ci
      - run: npm run build
      # Deploy to your hosting provider
      - name: Deploy to production
        run: |
          npx vercel --prod --token=${{ secrets.VERCEL_TOKEN }}
        env:
          VERCEL_ORG_ID: ${{ secrets.VERCEL_ORG_ID }}
          VERCEL_PROJECT_ID: ${{ secrets.VERCEL_PROJECT_ID }}
on: push/pull_request -- Trigger this workflow when someone pushes to main or opens a PR against main.
runs-on: ubuntu-latest -- GitHub spins up a fresh Ubuntu VM for this job. It's destroyed after the job finishes. You get 2,000 free minutes/month on the free plan.
uses: actions/checkout@v4 -- A pre-built action that clones your repo into the VM. Without this, the VM has no code.
cache: 'npm' -- Caches node_modules between workflow runs. Instead of downloading all dependencies every time (30-60 seconds), it restores from cache (2-3 seconds).
needs: test -- The deploy job won't start until the test job succeeds. If tests fail, deploy is skipped.
${{ secrets.VERCEL_TOKEN }} -- GitHub Secrets store sensitive values (API keys, tokens). You set them in your repo settings. They're never exposed in logs.
# .github/workflows/deploy.yml
name: Build and Deploy

on:
  push:
    branches: [main]

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Log into Docker Hub (or ECR, GHCR)
      - name: Login to Docker Hub
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKER_USERNAME }}
          password: ${{ secrets.DOCKER_TOKEN }}
      # Build and push Docker image
      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          push: true
          tags: myuser/myapp:latest,myuser/myapp:${{ github.sha }}
      # Deploy to server via SSH
      - name: Deploy
        uses: appleboy/ssh-action@v1
        with:
          host: ${{ secrets.SERVER_HOST }}
          username: deploy
          key: ${{ secrets.SSH_PRIVATE_KEY }}
          script: |
            docker pull myuser/myapp:latest
            docker stop myapp || true
            docker rm myapp || true
            docker run -d --name myapp \
              -p 80:3000 \
              --env-file /home/deploy/.env \
              myuser/myapp:latest
GitHub Actions is the most popular for open source and startups. But you'll also encounter: GitLab CI (built into GitLab), Jenkins (self-hosted, oldest, most configurable), CircleCI, Travis CI (declining), Buildkite (hybrid -- you provide the runners), and Drone CI (container-native). The concepts are identical across all of them -- only the YAML syntax differs.
Infrastructure as Code (IaC) means defining your cloud resources in code files instead of clicking through a web console. Your servers, databases, networks, DNS records -- all described in text files that live in Git. This makes infrastructure reproducible, reviewable, and version-controlled.
Imagine you set up your production environment by clicking through the AWS console for 3 hours. Everything works. Six months later, you need to create an identical staging environment. Can you remember every click? Every security group rule? Every IAM policy? Of course not.
With IaC, you run terraform apply and your entire infrastructure is created from a file. Need a staging copy? Change a variable and apply again. Need to see who changed the database size? Check the git log. Need to roll back a networking change? git revert and re-apply.
# IMPERATIVE (scripts): "Here are the steps to get there"
# Like giving someone turn-by-turn directions
aws ec2 create-security-group --group-name web-sg --description "Web SG"
aws ec2 authorize-security-group-ingress --group-name web-sg --port 80 --cidr 0.0.0.0/0
aws ec2 run-instances --instance-type t3.micro --security-groups web-sg
# Problem: run it twice and you get TWO servers. Not idempotent.
# DECLARATIVE (Terraform/CloudFormation): "Here's what I want to exist"
# Like showing someone a photo of where you want to go
resource "aws_instance" "web" {
instance_type = "t3.micro"
ami = "ami-0c55b159cbfafe1f0"
}
# Run it twice and Terraform says "already exists, nothing to do."
# Change instance_type and Terraform updates only what changed.
Terraform by HashiCorp works with every cloud provider (and many other services like GitHub, Cloudflare, Datadog). You write .tf files using HCL (HashiCorp Configuration Language), then Terraform figures out what needs to be created, updated, or destroyed.
# main.tf -- Define your infrastructure
# Tell Terraform which providers to use
terraform {
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
}
# Configure the AWS provider
provider "aws" {
region = "us-east-1"
}
# Create a VPC
resource "aws_vpc" "main" {
cidr_block = "10.0.0.0/16"
tags = {
Name = "my-app-vpc"
}
}
# Create a public subnet
resource "aws_subnet" "public" {
vpc_id = aws_vpc.main.id
cidr_block = "10.0.1.0/24"
availability_zone = "us-east-1a"
tags = {
Name = "public-subnet"
}
}
# Create an EC2 instance
resource "aws_instance" "web" {
ami = "ami-0c55b159cbfafe1f0"
instance_type = var.instance_type # Variable -- configurable
subnet_id = aws_subnet.public.id # Reference another resource
tags = {
Name = "web-server"
Environment = var.environment
}
}
# Create an RDS database
resource "aws_db_instance" "postgres" {
engine = "postgres"
engine_version = "16"
instance_class = "db.t3.micro"
allocated_storage = 20
db_name = "myapp"
username = "appuser"
password = var.db_password # From variable (never hardcode)
skip_final_snapshot = true # For dev environments
}
# variables.tf -- Define configurable values
variable "instance_type" {
description = "EC2 instance type"
default = "t3.micro"
}
variable "environment" {
description = "Environment name"
default = "development"
}
variable "db_password" {
description = "Database password"
sensitive = true # Won't show in logs/output
}
# 1. Initialize (download provider plugins)
terraform init
# 2. Plan (preview what will change -- ALWAYS do this)
terraform plan
# Output:
# + aws_instance.web will be created
# + aws_db_instance.postgres will be created
# Plan: 2 to add, 0 to change, 0 to destroy.
# 3. Apply (create/update resources)
terraform apply
# Type "yes" to confirm (or use -auto-approve in CI)
# 4. Destroy (tear down everything -- careful!)
terraform destroy
Terraform stores the current state of your infrastructure in a terraform.tfstate file. This file maps your .tf definitions to real cloud resources. If you lose this file, Terraform doesn't know what exists and might try to create duplicates. Never commit this file to git. Store it in a remote backend (S3 bucket, Terraform Cloud) so your team shares the same state.
A load balancer sits in front of multiple servers and distributes incoming requests across them. Without one, a single server handles all traffic and becomes a bottleneck (and a single point of failure). With one, you can scale to any number of servers and survive individual server failures.
Without Load Balancer:              With Load Balancer:

Users ---> [Server]                 Users ---> [Load Balancer]
           (single point                         /      |      \
            of failure)                    [Srv 1]  [Srv 2]  [Srv 3]
                                           (any server can die,
                                            users don't notice)
A load balancer is like the host at a busy restaurant. When customers arrive, the host doesn't send everyone to the same table. They look at which servers (waiters) are available and distribute customers evenly. If one waiter calls in sick, the host just routes to the remaining waiters. Customers still get served without knowing anything changed.
Load balancers operate at different layers of the networking stack, which determines what information they can see and use for routing decisions.
| Feature | Layer 4 (Transport) | Layer 7 (Application) |
|---|---|---|
| Sees | IP addresses, TCP/UDP ports | HTTP headers, URLs, cookies, body |
| Routing based on | IP + port only | URL path, hostname, headers, cookies |
| Performance | Faster (less processing) | Slower (must parse HTTP) |
| SSL termination | No (passes encrypted traffic through) | Yes (decrypts, inspects, re-encrypts) |
| Use case | TCP/UDP services, databases, gaming | HTTP APIs, web apps, microservices |
| AWS service | NLB (Network Load Balancer) | ALB (Application Load Balancer) |
# With a Layer 7 load balancer, you can route based on URL path:
#
# /api/* --> API servers (backend cluster)
# /images/* --> Image processing servers
# /admin/* --> Admin dashboard servers
# /* --> Frontend servers
#
# Or route based on hostname:
#
# api.example.com --> API cluster
# www.example.com --> Web frontend cluster
# admin.example.com --> Admin cluster
#
# This is impossible with Layer 4 -- it can't see the URL or hostname.
# How the load balancer decides which server gets the next request:
# 1. ROUND ROBIN (simplest, most common)
# Request 1 -> Server A
# Request 2 -> Server B
# Request 3 -> Server C
# Request 4 -> Server A (back to start)
# Good for: servers with equal capacity
# 2. WEIGHTED ROUND ROBIN
# Server A (weight 3): gets 3 out of every 6 requests
# Server B (weight 2): gets 2 out of every 6 requests
# Server C (weight 1): gets 1 out of every 6 requests
# Good for: servers with different capacities
# 3. LEAST CONNECTIONS
# Send to the server with the fewest active connections
# Good for: requests that take variable time (some fast, some slow)
# 4. IP HASH
# Hash the client's IP to consistently route to the same server
# Good for: session stickiness without cookies
# Problem: if a server dies, all its clients get reassigned
# 5. LEAST RESPONSE TIME
# Send to the server that's responding fastest right now
# Good for: maximizing performance when servers have varying load
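Two of these strategies are simple enough to sketch in a few lines of JavaScript; the server list is hypothetical, but it makes the difference concrete:
// Round robin: rotate through servers regardless of their current load
function makeRoundRobin(servers) {
  let i = 0;
  return () => servers[i++ % servers.length];
}

// Least connections: pick whichever server has the fewest active connections
function pickLeastConnections(servers) {
  return servers.reduce((best, s) => (s.activeConnections < best.activeConnections ? s : best));
}

const servers = [
  { name: "srv-a", activeConnections: 12 },
  { name: "srv-b", activeConnections: 3 },
  { name: "srv-c", activeConnections: 7 },
];

const next = makeRoundRobin(servers);
console.log(next().name, next().name, next().name, next().name); // srv-a srv-b srv-c srv-a
console.log(pickLeastConnections(servers).name);                 // srv-b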
The load balancer regularly pings your servers to check if they're alive. If a server stops responding, the load balancer stops sending traffic to it. When it recovers, traffic resumes.
# Health check configuration (AWS ALB example):
#
# Protocol: HTTP
# Path: /health
# Port: 3000
# Interval: 30 seconds (check every 30s)
# Timeout: 5 seconds (wait 5s for response)
# Healthy: 3 consecutive successes = server is healthy
# Unhealthy: 2 consecutive failures = server is unhealthy
# Your app needs a health endpoint:
app.get("/health", async (req, res) => {
// Check dependencies
const dbHealthy = await checkDatabase();
const cacheHealthy = await checkRedis();
if (dbHealthy && cacheHealthy) {
res.status(200).json({ status: "ok" });
} else {
res.status(503).json({ status: "unhealthy", db: dbHealthy, cache: cacheHealthy });
}
});
A CDN caches copies of your content at servers spread across the globe (called edge nodes or PoPs -- Points of Presence). When a user requests your image, JavaScript file, or API response, they get it from the nearest edge node instead of your origin server thousands of miles away.
Without CDN:                      With CDN:

User in Tokyo                     User in Tokyo
      |                                 |
      |  200ms round trip               |  10ms round trip
      |                                 |
      v                                 v
Server in Virginia                Edge node in Tokyo
                                        |
                                  (cache miss? fetch from origin)
                                        |
                                        v
                                  Server in Virginia
First request: User in Tokyo requests /images/logo.png. No edge node in Tokyo has it cached. The CDN fetches it from your origin server in Virginia (slow), serves it to the user, and caches a copy at the Tokyo edge node.
Second request: Another user in Tokyo requests the same image. The edge node already has it cached. Served instantly. No round trip to Virginia.
Cache expiration: After the TTL (Time To Live) expires (e.g., 24 hours), the edge node considers the cached copy stale and fetches a fresh copy from origin on the next request.
| Content Type | CDN Benefit | TTL Recommendation |
|---|---|---|
| Static assets (JS, CSS, images) | Massive -- rarely changes | 1 year (use content hash in filename) |
| Fonts | Massive -- never changes | 1 year |
| Video / large media | Huge -- saves bandwidth | 1 week to 1 year |
| HTML pages (static sites) | High -- same content for everyone | 5 min to 1 hour |
| API responses (public data) | Moderate -- depends on freshness needs | 30 sec to 5 min |
| API responses (user-specific) | Low -- different for every user | Don't cache (or use Vary header) |
# You control CDN caching through HTTP headers:
# Cache for 1 year (static assets with hashed filenames)
Cache-Control: public, max-age=31536000, immutable
# "public" = CDN can cache it
# "immutable" = don't even check for updates (filename changes instead)
# Cache for 5 minutes, revalidate after
Cache-Control: public, max-age=300, stale-while-revalidate=60
# Serve cached copy for 300s. After that, serve stale for 60s
# while fetching fresh copy in background.
# Don't cache (user-specific data)
Cache-Control: private, no-store
# "private" = only browser can cache, not CDN
# "no-store" = don't cache at all
# Cache but always revalidate (HTML pages)
Cache-Control: public, no-cache
# "no-cache" doesn't mean "don't cache" -- it means
# "cache it, but check with origin before using it" (via ETag/If-Modified-Since)
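In application code these are just response headers. A minimal Express sketch applying the same policies (the routes and directory are made up for illustration; express.static translates the maxAge/immutable options into the Cache-Control header shown above):
const express = require("express");
const app = express();

// Hashed static assets: cache for a year, never revalidate
app.use("/assets", express.static("dist/assets", {
  maxAge: "1y",     // -> Cache-Control: public, max-age=31536000
  immutable: true,  // -> adds "immutable"
}));

// Public API data that tolerates 5 minutes of staleness
app.get("/api/stats", (req, res) => {
  res.set("Cache-Control", "public, max-age=300, stale-while-revalidate=60");
  res.json({ users: 1234 });
});

// User-specific data: never cache at the CDN
app.get("/api/me", (req, res) => {
  res.set("Cache-Control", "private, no-store");
  res.json({ id: "user-123" });
});

app.listen(3000);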
You deploy a new version of your app. Your CSS is cached at 200+ edge nodes worldwide with a 1-year TTL. Users see the old CSS. What do you do? You can't easily purge every edge node instantly.
The solution: content-hashed filenames. Instead of style.css, your build tool generates style.a3f8b2c1.css. The hash changes when the content changes. New deployment = new filename = CDN fetches the new file. Old cached files naturally expire. This is why every modern build tool (Vite, webpack, Parcel) includes content hashing.
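Build tools handle this for you, but the mechanism is tiny: hash the file's contents and put the hash in the filename. A hypothetical Node sketch of the idea (the dist/style.css path is made up):
// Derive the filename from the file's contents -- what Vite/webpack/Parcel do at build time
const crypto = require("crypto");
const fs = require("fs");
const path = require("path");

function hashedName(filePath) {
  const contents = fs.readFileSync(filePath);
  const hash = crypto.createHash("sha256").update(contents).digest("hex").slice(0, 8);
  const { name, ext } = path.parse(filePath);
  return `${name}.${hash}${ext}`; // style.css -> style.a3f8b2c1.css
}

console.log(hashedName("dist/style.css"));
// Same contents -> same name (still cached). Changed contents -> new name -> CDN fetches fresh.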
| Provider | Edge Nodes | Notable Features | Cost |
|---|---|---|---|
| Cloudflare | 300+ cities | DDoS protection, WAF, Workers (edge compute), free tier | Generous free tier, paid from $20/mo |
| AWS CloudFront | 450+ PoPs | Deep AWS integration, Lambda@Edge | Pay per GB transferred + requests |
| Fastly | 80+ PoPs | Instant purge, VCL configs, Compute@Edge | Pay per usage, more expensive |
| Bunny CDN | 120+ PoPs | Simple, cheap, good for small projects | From $0.01/GB |
Start with Cloudflare. Their free tier gives you CDN, DDoS protection, SSL, and DNS -- all for $0. Just point your domain's nameservers to Cloudflare, and they proxy all traffic through their edge network. It's the easiest way to add a CDN to any project.
Object storage is a flat storage system where every item (called an "object") gets a unique key. Unlike a filesystem with directories and nested folders, object storage is a giant key-value store. The key looks like a path (images/avatars/user-123.jpg), but there's no actual directory structure -- it's just a string.
Filesystem (your hard drive): Hierarchical. Directories contain files. You navigate a tree structure. Great for small, frequently-accessed data. Limited to one machine.
Object storage (S3): Flat. Every object has a key and metadata. Scales to exabytes. Accessible from anywhere via HTTP. Built for durability (data replicated across multiple facilities). No concept of "editing a file" -- you replace the whole object.
Rule of thumb: If your app creates/serves files that users upload (images, videos, PDFs, backups), use object storage. If your app reads/writes local config or temp files, use the filesystem.
# AWS S3 CLI basics
# Create a bucket
aws s3 mb s3://my-app-uploads-prod
# Upload a file
aws s3 cp ./photo.jpg s3://my-app-uploads-prod/avatars/user-123.jpg
# Download a file
aws s3 cp s3://my-app-uploads-prod/avatars/user-123.jpg ./downloaded.jpg
# List objects
aws s3 ls s3://my-app-uploads-prod/avatars/
# Delete an object
aws s3 rm s3://my-app-uploads-prod/avatars/user-123.jpg
# Sync a directory
aws s3 sync ./uploads s3://my-app-uploads-prod/ --delete
// Node.js: Upload to S3 using AWS SDK v3
const { S3Client, PutObjectCommand, GetObjectCommand } = require("@aws-sdk/client-s3");
const { getSignedUrl } = require("@aws-sdk/s3-request-presigner");
const multer = require("multer");                            // parses multipart/form-data uploads
const upload = multer({ storage: multer.memoryStorage() });  // keeps req.file.buffer in memory
const s3 = new S3Client({ region: "us-east-1" });
// Direct upload from your server
async function uploadFile(buffer, key, contentType) {
await s3.send(new PutObjectCommand({
Bucket: "my-app-uploads-prod",
Key: key, // e.g., "avatars/user-123.jpg"
Body: buffer, // File contents as Buffer
ContentType: contentType, // e.g., "image/jpeg"
}));
return `https://my-app-uploads-prod.s3.amazonaws.com/${key}`;
}
// Express route: handle file upload
app.post("/api/upload", upload.single("file"), async (req, res) => {
const key = `uploads/${Date.now()}-${req.file.originalname}`;
const url = await uploadFile(req.file.buffer, key, req.file.mimetype);
res.json({ url });
});
Instead of uploading files through your server (which uses your bandwidth and CPU), generate a presigned URL that lets the client upload directly to S3. Your server never touches the file.
// Step 1: Client requests an upload URL from your API
// Step 2: Your API generates a presigned S3 URL
// Step 3: Client uploads directly to S3 using that URL
// Step 4: Client tells your API the upload is done
// Server: generate presigned upload URL
app.post("/api/upload-url", async (req, res) => {
const { filename, contentType } = req.body;
const key = `uploads/${req.user.id}/${Date.now()}-${filename}`;
const command = new PutObjectCommand({
Bucket: "my-app-uploads-prod",
Key: key,
ContentType: contentType,
});
// URL is valid for 15 minutes
const uploadUrl = await getSignedUrl(s3, command, { expiresIn: 900 });
res.json({ uploadUrl, key });
});
// Client: upload directly to S3
const { uploadUrl } = await fetch("/api/upload-url", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ filename: "photo.jpg", contentType: "image/jpeg" }),
}).then(r => r.json());
// Upload file directly to S3 (bypasses your server entirely)
await fetch(uploadUrl, {
method: "PUT",
headers: { "Content-Type": "image/jpeg" },
body: fileBlob, // The actual file
});
Without presigned URLs, a 100MB file upload goes: User -> Your Server -> S3. Your server is a bottleneck, uses bandwidth, and ties up a connection for the entire upload.
With presigned URLs, the 100MB file goes: User -> S3 directly. Your server only generates a URL (tiny JSON response). This scales massively better and costs less.
| Service | Provider | Key Benefit | Pricing |
|---|---|---|---|
| S3 | AWS | The original, most features | $0.023/GB/month + egress fees |
| R2 | Cloudflare | Zero egress fees | $0.015/GB/month, no egress cost |
| Backblaze B2 | Backblaze | Cheapest storage | $0.006/GB/month |
| MinIO | Self-hosted | S3-compatible, runs on your servers | Free (you pay for hardware) |
| GCS | Google Cloud | Integrated with GCP ecosystem | $0.020/GB/month |
Storing data in S3 is cheap ($0.023/GB/month). Downloading data from S3 is expensive ($0.09/GB). If you serve 1TB of images directly from S3 per month, that's $92 just in egress fees. This is why you put a CDN in front of S3 -- the CDN caches the files and serves them from edge nodes, dramatically reducing how much data leaves S3. Alternatively, use Cloudflare R2 which has zero egress fees.
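The saving is easy to estimate: only cache misses have to leave S3. A quick sketch, assuming the ~$0.09/GB egress price above and an illustrative cache-hit ratio:
// Egress cost for a month of traffic, with and without a CDN in front (assumed prices)
function s3EgressCost(totalGbServed, cdnHitRatio) {
  const gbLeavingS3 = totalGbServed * (1 - cdnHitRatio); // only misses go back to origin
  return gbLeavingS3 * 0.09;                             // ~$0.09/GB out of S3
}

console.log(s3EgressCost(1024, 0).toFixed(2));    // no CDN:         ~92.16 for 1 TB served
console.log(s3EgressCost(1024, 0.95).toFixed(2)); // 95% cache hits: ~4.61
// The CDN's own bandwidth may cost something, unless you use a free tier like Cloudflare's.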
A managed database is a database where the cloud provider handles all the operational work -- backups, patching, replication, failover, monitoring, scaling. You just connect and run queries. The alternative is running your own database on a VM (or bare metal), which means you're responsible for everything.
| Aspect | Managed (RDS, PlanetScale, etc.) | Self-Hosted (Postgres on a VM) |
|---|---|---|
| Setup time | 5 minutes | Hours to days (properly hardened) |
| Backups | Automatic, point-in-time recovery | You set up pg_dump / WAL archiving |
| Patching | Provider handles it (with maintenance window) | You track CVEs and apply patches |
| Failover | Automatic (Multi-AZ) | You configure streaming replication + a failover manager (e.g., Patroni) |
| Scaling | Click a button (or API call) | Migrate data to bigger machine manually |
| Cost | 2-5x more expensive | Cheaper, but your time has value |
| Control | Limited (can't tune everything) | Full control over every setting |
| Best for | Most teams, production workloads | Cost-sensitive, high-volume, specific tuning needs |
Use managed databases unless you have a specific reason not to. The cost premium is worth it for most teams because the alternative is waking up at 3 AM to fix a crashed database. Your engineering time is more expensive than the managed DB surcharge.
| Service | Engine | Starting Price | Notes |
|---|---|---|---|
| AWS RDS | Postgres, MySQL, MariaDB, Oracle, SQL Server | ~$15/month (db.t3.micro) | The standard. Multi-AZ, read replicas, automated backups. |
| AWS Aurora | MySQL/Postgres compatible | ~$30/month | 5x faster than standard MySQL. Storage auto-scales. More expensive but more performant. |
| GCP Cloud SQL | Postgres, MySQL, SQL Server | ~$10/month | Similar to RDS. Good GCP integration. |
| Service | Engine | Standout Feature | Free Tier |
|---|---|---|---|
| Neon | PostgreSQL | Branching (like git branches for your database), scales to zero | 500MB storage, generous compute |
| PlanetScale | MySQL (Vitess) | Non-blocking schema changes, horizontal scaling | 5GB storage, 1 billion row reads/month |
| Supabase | PostgreSQL | Built-in auth, real-time subscriptions, REST API auto-generated | 500MB storage, 50K monthly active users |
| Turso | SQLite (libSQL) | Edge replicas (database at every edge location), embedded SQLite | 9GB storage, 500 databases |
| Upstash | Redis, Kafka | Serverless Redis with per-request pricing | 10K commands/day |
Side project / MVP: Supabase or Neon free tier. You get a real PostgreSQL database with modern tooling for $0. Ship fast.
Startup (growing): Neon or PlanetScale paid tier. Database branching for safe migrations (Neon), or horizontal scaling with non-blocking schema changes (PlanetScale).
Enterprise / production: AWS RDS or Aurora. Battle-tested, every compliance certification, multi-AZ failover, and your ops team already knows AWS.
Edge-first app: Turso. Put read replicas at the edge so database reads are <10ms worldwide. Write to a primary that replicates automatically.
Caching layer: Upstash Redis. Serverless pricing means you pay nothing when idle, pennies when active.
# Every managed database gives you a connection string. Format:
# protocol://username:password@host:port/database?options
# PostgreSQL (RDS, Neon, Supabase)
postgresql://appuser:secret@db.abc123.us-east-1.rds.amazonaws.com:5432/myapp?sslmode=require
# MySQL (PlanetScale)
mysql://username:password@aws.connect.psdb.cloud/mydb?ssl={"rejectUnauthorized":true}
# Redis (Upstash)
rediss://default:token@usw1-capable-toucan-12345.upstash.io:6379
# ALWAYS:
# - Use SSL/TLS (sslmode=require or ?ssl=true)
# - Store connection strings in environment variables, NEVER in code
# - Use connection pooling (PgBouncer, Prisma pool, etc.)
# - Restrict access to your VPC / allowed IP addresses
Serverless functions (Lambda, Vercel) create a new database connection on every cold start. At scale, this means hundreds of connections opening and closing rapidly, overwhelming your database. The fix: use an external connection pooler. Neon has a built-in pooler. Supabase uses PgBouncer. For RDS, use RDS Proxy. Without pooling, your serverless app will crash your database at moderate traffic.
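On the application side, the usual pattern is one small pool per execution environment, created at module scope and pointed at the pooler endpoint. A sketch with the pg package; DATABASE_POOLER_URL is a made-up name for whatever connection string your pooler (PgBouncer, Neon's pooler, RDS Proxy) gives you:
// Lambda + Postgres: create the pool once per execution environment, keep it tiny
const { Pool } = require("pg");

const pool = new Pool({
  connectionString: process.env.DATABASE_POOLER_URL, // the pooler endpoint, not the raw DB host
  max: 1,                                            // one connection per concurrent environment
  idleTimeoutMillis: 30000,
  ssl: { rejectUnauthorized: true },
});

exports.handler = async (event) => {
  const { rows } = await pool.query("SELECT now() AS server_time");
  return { statusCode: 200, body: JSON.stringify(rows[0]) };
};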
Cloud billing is designed to be complex. Every service has its own pricing model with multiple dimensions (compute, storage, network, requests, data transfer). People routinely wake up to unexpected $500 bills from leaving a test cluster running or misconfiguring auto-scaling. Understanding cloud costs is a survival skill.
A developer left 5 GPU instances running overnight: $2,000 bill. A startup enabled S3 Transfer Acceleration on a high-traffic bucket: $8,000 surprise. A student accidentally created 100 RDS instances in a loop: $14,000 in 48 hours. AWS and GCP will charge you, send the bill, and expect payment. Set up billing alerts BEFORE you create any resources.
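You can click a billing alert together in the console, or create one programmatically. A sketch using CloudWatch's EstimatedCharges metric -- note that billing metrics only exist in us-east-1, must first be enabled in the billing preferences, and the SNS topic ARN below is a placeholder:
// Alarm when the month-to-date AWS bill crosses $50
const { CloudWatchClient, PutMetricAlarmCommand } = require("@aws-sdk/client-cloudwatch");

const cw = new CloudWatchClient({ region: "us-east-1" }); // billing metrics live only in us-east-1

async function createBillingAlarm() {
  await cw.send(new PutMetricAlarmCommand({
    AlarmName: "monthly-bill-over-50-usd",
    Namespace: "AWS/Billing",
    MetricName: "EstimatedCharges",
    Dimensions: [{ Name: "Currency", Value: "USD" }],
    Statistic: "Maximum",
    Period: 21600,                 // billing data only updates a few times a day
    EvaluationPeriods: 1,
    Threshold: 50,
    ComparisonOperator: "GreaterThanThreshold",
    AlarmActions: ["arn:aws:sns:us-east-1:123456789012:billing-alerts"], // placeholder SNS topic
  }));
}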
# Cloud costs have multiple dimensions. Here's a typical web app:
# 1. COMPUTE (EC2 / VMs)
# Billed per second (or hour) the instance is running
# t3.micro: ~$0.0104/hour = ~$7.50/month (running 24/7)
# t3.medium: ~$0.0416/hour = ~$30/month
# GOTCHA: Instances bill even when idle. Stop/terminate when not needed.
# 2. STORAGE (EBS / disks attached to VMs)
# Billed per GB per month
# gp3 (SSD): $0.08/GB/month. 100GB = $8/month
# GOTCHA: Volumes persist after instance termination. Delete them manually.
# 3. DATA TRANSFER (bandwidth)
# Inbound: FREE (data going INTO AWS)
# Outbound: $0.09/GB (data leaving AWS to the internet)
# Between AZs: $0.01/GB (data between availability zones)
# GOTCHA: This is the hidden killer. Serving 10TB/month = $900.
# 4. OBJECT STORAGE (S3)
# Storage: $0.023/GB/month
# Requests: $0.0004 per 1,000 GET requests
# Egress: $0.09/GB (same as data transfer)
# GOTCHA: Millions of tiny objects = expensive from requests, not storage.
# 5. DATABASE (RDS)
# Instance: $15-$500+/month depending on size
# Storage: $0.115/GB/month (gp3)
# Backups: Free up to 100% of DB storage, then $0.095/GB
# GOTCHA: Multi-AZ doubles the instance cost.
| Service | Free Tier | The Trap |
|---|---|---|
| EC2 | 750 hours/month of t2.micro (12 months) | That's ONE instance running 24/7. Two instances = you're billed for the second one. Also, t2.micro not t3.micro -- make sure you select the right one. |
| RDS | 750 hours/month of db.t2.micro (12 months) | Multi-AZ is NOT free tier. Enabling it doubles your bill immediately. |
| S3 | 5GB storage, 20K GET, 2K PUT (12 months) | 20K GETs is nothing. A moderately popular image gallery can burn through this in a day. |
| Lambda | 1M requests + 400K GB-seconds/month (always free) | This is genuinely generous. Most small apps never exceed it. |
| Data Transfer | 100GB/month outbound (12 months) | After free tier expires, 100GB/month = $9. Doesn't sound bad until your traffic grows. |
| NAT Gateway | NOT included in free tier | $32/month minimum just to exist + $0.045/GB. The sneakiest cost in AWS. Needed if private subnets need internet access. |
Run aws ec2 describe-instances --query 'Reservations[*].Instances[*].[InstanceId,State.Name,InstanceType]' regularly and kill anything you're not using.
# 1. RIGHT-SIZE YOUR INSTANCES
# Most instances are over-provisioned. Monitor CPU/RAM usage.
# If your t3.medium (4GB RAM) never uses more than 1.5GB,
# switch to t3.small (2GB RAM) and save 50%.
# 2. USE RESERVED INSTANCES OR SAVINGS PLANS
# Commit to 1 or 3 years of usage for 30-60% discount.
# Only for stable, predictable workloads.
# AWS: Reserved Instances or Savings Plans
# GCP: Committed Use Discounts
# 3. USE SPOT INSTANCES FOR FAULT-TOLERANT WORK
# Spot/Preemptible VMs cost 60-90% less than on-demand.
# AWS can reclaim them with 2 minutes notice.
# Good for: batch processing, CI/CD runners, data pipelines
# Bad for: web servers, databases, anything stateful
# 4. PUT A CDN IN FRONT OF EVERYTHING
# Reduce origin data transfer (the expensive part)
# Cloudflare free tier = $0 for unlimited bandwidth
# That $900/month S3 egress bill becomes $50/month with a CDN.
# 5. USE S3 STORAGE CLASSES
# S3 Standard: $0.023/GB (frequently accessed)
# S3 Infrequent Access: $0.0125/GB (accessed once a month)
# S3 Glacier Instant: $0.004/GB (archival, rare access)
# S3 Glacier Deep Archive: $0.00099/GB (accessed once a year)
# Set up lifecycle policies to automatically move old data to cheaper tiers.
# 6. USE SERVERLESS FOR VARIABLE WORKLOADS
# A t3.small running 24/7 costs ~$15/month even when idle.
# Lambda costs $0 when not running.
# For APIs with <100K requests/month, serverless is nearly free.
# 7. AVOID NAT GATEWAYS IF POSSIBLE
# $32/month + data transfer for a NAT Gateway.
# Alternatives: VPC endpoints for AWS services ($0),
# or put your resources in public subnets with security groups.
# 8. SCHEDULE NON-PRODUCTION RESOURCES
# Dev/staging environments don't need to run 24/7.
# Run them 10 hours/day, 5 days/week = 70% cost reduction.
# Use AWS Instance Scheduler or a simple cron + Lambda.
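Tip #8 above is easy to automate: a scheduled Lambda (EventBridge cron) stops anything tagged as a dev resource in the evening, and a twin function starts it in the morning. A sketch -- the env=dev tag is an assumed convention, not an AWS default:
// Scheduled Lambda: stop every running instance tagged env=dev
const { EC2Client, DescribeInstancesCommand, StopInstancesCommand } = require("@aws-sdk/client-ec2");

const ec2 = new EC2Client({ region: "us-east-1" });

exports.handler = async () => {
  const { Reservations = [] } = await ec2.send(new DescribeInstancesCommand({
    Filters: [
      { Name: "tag:env", Values: ["dev"] },                 // assumed tagging convention
      { Name: "instance-state-name", Values: ["running"] },
    ],
  }));

  const ids = Reservations.flatMap(r => r.Instances.map(i => i.InstanceId));
  if (ids.length > 0) {
    await ec2.send(new StopInstancesCommand({ InstanceIds: ids }));
  }
  return { stopped: ids };
};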
| Setup | Estimated Monthly Cost | Notes |
|---|---|---|
| Static site on Vercel/Netlify/Cloudflare Pages | $0 | Free tier covers most personal sites |
| Small API on Lambda + Neon DB | $0-5 | Free tiers cover low traffic |
| VPS (Hetzner) + self-hosted Postgres | $5-10 | Best bang for buck, but you're the ops team |
| EC2 t3.small + RDS db.t3.micro | $30-50 | Basic AWS setup, watch for data transfer |
| ECS/EKS + RDS + ElastiCache + ALB | $150-400 | Production stack, costs add up quick |
| Multi-AZ RDS + CloudFront + S3 + Lambda | $100-300 | Serverless production stack |
Always set a billing alert before you start. AWS will happily charge you $10,000 and send you the bill later. GCP is slightly better (they'll pause projects at budget limits if configured). But never assume you're "on the free tier" -- verify it. Check your billing dashboard weekly when you're learning. It takes 30 seconds and can save you hundreds of dollars.
When you're done experimenting, tear it all down: run terraform destroy, terminate instances, delete databases, and check your bill.