Everything you need to understand about cloud computing -- what it actually is, why it exists, how the major providers compare, and the core services that power every modern application. No marketing fluff, just the concepts and practical knowledge you need to deploy, scale, and operate real software.
Strip away all the marketing, and "the cloud" is just other people's computers that you rent. That's it. When you deploy your app to AWS, your code runs on a physical server sitting in a warehouse (called a data center) owned by Amazon. You don't buy the server. You don't maintain the server. You pay for how much you use.
Running your own servers is like owning a restaurant. You buy the building, install the kitchen, hire staff, handle plumbing when it breaks, pay rent even when nobody is eating. It's total control but total responsibility.
Using the cloud is like renting a commercial kitchen. You show up, cook your food, serve your customers, and leave. Someone else handles the building, the electricity, the fire suppression system, and the plumbing. You just pay for the time and space you use.
The tradeoff: you give up some control (you can't knock out a wall to expand), but you gain speed (start cooking tomorrow, not in 6 months after construction) and flexibility (rent a bigger kitchen next month if business grows).
A data center is a massive building full of racks of servers, connected by fiber optic cables, cooled by industrial HVAC systems, and backed by redundant power supplies (including diesel generators for when the grid fails). AWS alone operates 100+ data centers worldwide. Each one looks like this:
                   DATA CENTER
+----------------------------------------------+
|                                              |
|  +-------+  +-------+  +-------+  +-------+  |
|  | Rack  |  | Rack  |  | Rack  |  | Rack  |  |
|  |Server1|  |Server1|  |Server1|  |Server1|  |
|  |Server2|  |Server2|  |Server2|  |Server2|  |
|  |Server3|  |Server3|  |Server3|  |Server3|  |
|  |  ...  |  |  ...  |  |  ...  |  |  ...  |  |
|  +---+---+  +---+---+  +---+---+  +---+---+  |
|      |          |          |          |      |
|  +---+----------+----------+----------+---+  |
|  |          Network Switch Layer          |  |
|  +---+-------------------------------------+  |
|      |                                       |
|  +---+---+  +----------+  +-----------+      |
|  |Router |  | Cooling  |  |  Backup   |      |
|  |to ISP |  | System   |  |  Power    |      |
|  +-------+  +----------+  +-----------+      |
|                                              |
+----------------------------------------------+
                       |
          To the internet (your users)
Before the cloud, if you wanted to launch a web app, you had to buy physical servers, rent rack space in a data center (or run your own server room), wait weeks for the hardware to arrive, install and patch the operating system yourself, and guess your capacity years in advance -- paying for peak load even when traffic was low.
The cloud eliminates all of this. You click a button (or run a command), and a virtual server appears in seconds. Need 10 more? Click 10 more times. Traffic drops? Turn them off and stop paying. This is elastic computing, and it changed how software is built.
Your code still runs on physical hardware. Hard drives still fail. Networks still have latency. Servers still crash. The cloud doesn't eliminate infrastructure problems -- it makes them someone else's problem and gives you tools to handle failures gracefully (redundancy, auto-scaling, health checks). But if you design your app assuming nothing ever fails, you'll have a bad time, cloud or not.
Three companies dominate cloud computing: AWS (Amazon), Azure (Microsoft), and GCP (Google). Together they hold about 65% of the market. There are also strong alternatives: DigitalOcean, Linode (now Akamai), Hetzner, Vultr, and Cloudflare.
| Aspect | AWS | GCP | Azure |
|---|---|---|---|
| Market share | ~31% (largest) | ~12% | ~25% |
| Best for | Everything. Most services, most mature. | Data/ML, Kubernetes, developer UX | Enterprise, Windows/.NET, Office 365 integration |
| Compute | EC2 | Compute Engine | Virtual Machines |
| Object Storage | S3 | Cloud Storage | Blob Storage |
| Serverless | Lambda | Cloud Functions | Azure Functions |
| Managed K8s | EKS | GKE (best K8s experience) | AKS |
| Managed DB | RDS, DynamoDB, Aurora | Cloud SQL, Spanner, Firestore | Azure SQL, Cosmos DB |
| CDN | CloudFront | Cloud CDN | Azure CDN |
| Free tier | 12-month free tier + always-free tier | $300 credit for 90 days + always-free tier | $200 credit for 30 days + always-free tier |
| CLI | aws | gcloud | az |
| Pricing | Complex, many hidden costs | Simpler, per-second billing | Complex, enterprise-oriented |
| Learning curve | Steepest -- 200+ services | Moderate -- cleaner console | Moderate -- if you know Microsoft ecosystem |
For jobs: Learn AWS. It has the most market share, the most job listings, and the most services. Knowing AWS makes you employable almost anywhere.
For startups: GCP often has the best developer experience and generous free tiers. Their Kubernetes offering (GKE) is best-in-class since Google invented Kubernetes.
For enterprise: If the company already uses Microsoft (Active Directory, Office 365, .NET), Azure is the natural choice because of deep integration.
For personal projects: DigitalOcean, Hetzner, or Vultr. Simpler, cheaper, and you'll learn more about infrastructure because there's less hand-holding.
Every provider has the same core services -- they just give them different names. Here's how to translate:
| What You Need | AWS | GCP | Azure |
|---|---|---|---|
| Virtual server | EC2 | Compute Engine | Virtual Machines |
| Object storage | S3 | Cloud Storage | Blob Storage |
| SQL database | RDS | Cloud SQL | Azure SQL |
| NoSQL database | DynamoDB | Firestore | Cosmos DB |
| Serverless functions | Lambda | Cloud Functions | Azure Functions |
| Container orchestration | ECS / EKS | Cloud Run / GKE | ACI / AKS |
| Message queue | SQS | Pub/Sub | Service Bus |
| DNS | Route 53 | Cloud DNS | Azure DNS |
| Secret management | Secrets Manager | Secret Manager | Key Vault |
| IAM (permissions) | IAM | IAM | Entra ID (was Azure AD) |
Every cloud provider offers hundreds of services, but you only need to understand about 5-6 core ones to be productive. Everything else is built on top of these fundamentals.
A virtual machine in the cloud. You choose the CPU, RAM, and disk size. You get root access. It's like having a remote computer you can SSH into. This is the most fundamental cloud service -- everything else can theoretically run on a VM.
# Launch an EC2 instance (AWS CLI)
aws ec2 run-instances \
--image-id ami-0c55b159cbfafe1f0 \
--instance-type t3.micro \
--key-name my-key-pair \
--security-group-ids sg-0123456789abcdef0
# SSH into it
ssh -i my-key-pair.pem ec2-user@54.123.45.67
# Common instance types (AWS):
# t3.micro -- 2 vCPU, 1 GB RAM -- free tier, dev/test
# t3.medium -- 2 vCPU, 4 GB RAM -- small apps
# m6i.large -- 2 vCPU, 8 GB RAM -- general purpose
# c6i.large -- 2 vCPU, 4 GB RAM -- CPU-intensive (compute-optimized)
# r6i.large -- 2 vCPU, 16 GB RAM -- memory-intensive (databases, caching)
The naming follows a pattern: t3.micro = t (family: burstable) + 3 (generation) + micro (size). Families: t = burstable (cheap, good for variable workloads), m = general purpose, c = compute-optimized, r = memory-optimized, g/p = GPU instances.
Object storage for files of any size -- images, videos, backups, logs, static websites. You upload objects into "buckets" (containers). Each object gets a unique key (like a file path). S3 has 99.999999999% (eleven 9s) durability -- meaning if you store 10 million objects, you might lose 1 every 10,000 years.
# Upload a file to S3
aws s3 cp ./backup.tar.gz s3://my-bucket/backups/backup-2024-01-15.tar.gz
# List bucket contents
aws s3 ls s3://my-bucket/backups/
# Download a file
aws s3 cp s3://my-bucket/backups/backup-2024-01-15.tar.gz ./
# Sync a directory (like rsync)
aws s3 sync ./dist s3://my-website-bucket/ --delete
Managed database services. You pick the engine (PostgreSQL, MySQL, etc.) and the instance size. The cloud provider handles backups, patching, replication, and failover. You just connect and run queries.
# What "managed" means -- the provider handles:
# - Automated daily backups (point-in-time recovery)
# - OS and database engine patching
# - Multi-AZ failover (automatic switch to standby if primary dies)
# - Read replicas (for scaling reads)
# - Monitoring and alerts
# - Storage auto-scaling
# You handle:
# - Schema design
# - Query optimization
# - Application-level connection pooling
# - Access control (who can connect)
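To make "connect and run queries" concrete, here is a minimal Node.js sketch of talking to a managed Postgres instance. It assumes the pg package, a DATABASE_URL environment variable holding the connection string your provider gives you, and a hypothetical users table -- none of these names come from the original text:
// db.js -- connecting to a managed Postgres (RDS, Neon, Supabase, ...)
const { Pool } = require("pg");

const pool = new Pool({
  connectionString: process.env.DATABASE_URL, // e.g. postgresql://user:pass@host:5432/myapp
  ssl: { rejectUnauthorized: true },          // managed databases expect TLS
  max: 10,                                    // application-level connection pooling
});

// Example query against a hypothetical "users" table
async function getUser(id) {
  const { rows } = await pool.query("SELECT id, email FROM users WHERE id = $1", [id]);
  return rows[0];
}

module.exports = { pool, getUser };
Everything the provider handles (backups, patching, failover) happens behind that one connection string; your application code never sees it.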
A Virtual Private Cloud is your own private network in the cloud. It's like having your own isolated section of the data center. You control which resources can talk to each other, which can access the internet, and which ports are open.
YOUR VPC (10.0.0.0/16)
+---------------------------------------------------+
|                                                   |
|  Public Subnet (10.0.1.0/24)                      |
|  +---------------------------------------------+  |
|  |  Load Balancer ---> Web Server (EC2)        |  |
|  +---------------------------------------------+  |
|                       |                           |
|  Private Subnet (10.0.2.0/24)                     |
|  +---------------------------------------------+  |
|  |  App Server (EC2) ---> Database (RDS)       |  |
|  +---------------------------------------------+  |
|                                                   |
+---------------------------------------------------+
          |
  Internet Gateway (only public subnet has access)
Without a VPC, your database would be directly accessible from the internet. Anyone could try to connect. With a VPC, your database sits in a private subnet with no internet access. Only your app servers in the same VPC can reach it. The load balancer sits in a public subnet and forwards traffic to the app servers. This is defense in depth -- even if someone compromises your web server, they can't directly access the database from outside.
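As a rough illustration of that rule, here is a sketch using the AWS SDK for JavaScript to add a security group rule that lets only the app tier's security group reach Postgres on the database. The group IDs are placeholders, and in practice you would more likely express this in Terraform or the console:
// Allow inbound Postgres (5432) to the DB's security group ONLY from the
// app servers' security group -- no CIDR ranges, no public access.
const { EC2Client, AuthorizeSecurityGroupIngressCommand } = require("@aws-sdk/client-ec2");

const ec2 = new EC2Client({ region: "us-east-1" });

async function lockDatabaseToAppTier() {
  await ec2.send(new AuthorizeSecurityGroupIngressCommand({
    GroupId: "sg-0db0000000000000",                           // placeholder: database security group
    IpPermissions: [{
      IpProtocol: "tcp",
      FromPort: 5432,
      ToPort: 5432,
      UserIdGroupPairs: [{ GroupId: "sg-0app000000000000" }], // placeholder: app tier security group
    }],
  }));
}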
A Content Delivery Network caches your content at edge locations around the world. When a user in Tokyo requests your image, they get it from a server in Tokyo instead of your origin server in Virginia. This reduces latency from ~200ms to ~20ms. More on this in Section 9.
These acronyms describe how much the provider manages for you. Think of it as a spectrum from "you manage everything" to "you manage nothing".
What YOU manage vs what the PROVIDER manages:

 On-Premise       IaaS            PaaS            FaaS            SaaS
 (you own         (rent           (rent           (rent           (rent
  everything)      hardware)       platform)       functions)      software)

+----------+   +----------+   +----------+   +----------+   +----------+
|   App    |   |   App    |   |   App    |   | Function |   |          |
+----------+   +----------+   +----------+   +----------+   |          |
| Runtime  |   | Runtime  |   |##########|   |##########|   |          |
+----------+   +----------+   +----------+   +----------+   |  Gmail   |
|    OS    |   |    OS    |   |##########|   |##########|   |  Slack   |
+----------+   +----------+   +----------+   +----------+   | Shopify  |
|   Disk   |   |##########|   |##########|   |##########|   |          |
+----------+   +----------+   +----------+   +----------+   |          |
| Network  |   |##########|   |##########|   |##########|   |          |
+----------+   +----------+   +----------+   +----------+   +----------+

[    = You manage ]      [ ## = Provider manages ]
You rent the hardware (virtual machines, networking, storage). You install and manage everything on top: the OS, runtime, app, security patches.
Examples: AWS EC2, GCP Compute Engine, DigitalOcean Droplets, Hetzner Cloud
Use when: You need full control. Custom OS configurations. Running software that doesn't fit into platform constraints. You have the ops expertise to manage it.
# You spin up a VM, then you're responsible for everything:
ssh root@your-server
apt update && apt upgrade # You patch the OS
apt install nodejs nginx postgresql # You install dependencies
git clone your-repo && npm install # You deploy your code
systemctl enable nginx # You configure the web server
certbot --nginx # You manage SSL certificates
# ...and you set up monitoring, backups, firewalls, log rotation
You give the provider your code, and they handle the rest -- servers, OS, runtime, scaling, SSL, deployments. You just git push.
Examples: Heroku, Vercel, Netlify, Railway, Render, Fly.io, Google App Engine
Use when: You want to ship fast without managing infrastructure. Small teams. Prototypes. Apps that fit standard patterns.
# Heroku: deploy a Node.js app
git push heroku main
# That's it. Heroku detects Node.js, installs dependencies,
# starts your server, gives you a URL, handles SSL, and scales.
# Vercel: deploy a Next.js app
vercel --prod
# Builds, deploys to edge network, SSL, custom domain, done.
# Railway: deploy anything with a Dockerfile
railway up
# Detects Dockerfile, builds, deploys, gives you a URL.
You don't manage anything. You just use the software through a browser or API. The provider handles everything -- infrastructure, updates, security, availability.
Examples: Gmail, Slack, Shopify, GitHub, Notion, Stripe, Twilio
Use when: The software does what you need out of the box. You're a user, not a builder (for that specific function).
You write individual functions. The provider runs them when triggered (HTTP request, file upload, timer, queue message). No servers to manage. You pay per execution, not per hour. More on this in Section 5.
Examples: AWS Lambda, Google Cloud Functions, Cloudflare Workers, Azure Functions
| Model | You Manage | Provider Manages | Cost Model | Best For |
|---|---|---|---|---|
| IaaS | OS, runtime, app, data | Hardware, network, storage | Per hour/second (VM runs) | Full control, custom setups |
| PaaS | App, data | Everything else | Per dyno/instance + usage | Fast deployment, small teams |
| FaaS | Individual functions | Everything else | Per invocation + duration | Event-driven, sporadic workloads |
| SaaS | Configuration, data | Everything | Per user/month or per usage | Standard business tools |
PaaS is amazing for getting started but can become expensive at scale. Heroku's free tier is gone, and a basic production setup (2 dynos + managed Postgres + Redis) can run $100+/month for a workload that would fit on a $20/month VPS. Many startups start on PaaS for speed, then migrate to IaaS or containers as they grow. Plan for that migration path from day one.
Serverless is the most misnamed concept in computing. There are servers -- you just don't think about them. You write a function, upload it, and the cloud provider runs it whenever it's triggered. You don't provision VMs, you don't manage scaling, you don't patch anything. You pay only when your code actually runs.
A traditional server is like a restaurant kitchen: it's running all the time, waiting for orders, costing money even when nobody is eating. A serverless function is like a vending machine: it does nothing until someone presses a button. It serves the request, then goes idle. You don't pay for idle time.
// handler.js -- an AWS Lambda function
// This function runs when triggered (HTTP request, S3 upload, etc.)
exports.handler = async (event) => {
// event contains the trigger data (request body, S3 event, etc.)
const name = event.queryStringParameters?.name || "World";
return {
statusCode: 200,
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
message: `Hello, ${name}!`,
timestamp: new Date().toISOString(),
}),
};
};
// Deploy with AWS CLI:
// aws lambda create-function \
// --function-name hello \
// --runtime nodejs20.x \
// --handler handler.handler \
// --zip-file fileb://function.zip \
// --role arn:aws:iam::123456789:role/lambda-role
// worker.js -- runs at the edge (200+ locations worldwide)
// Response time: ~10ms because it runs near the user
export default {
async fetch(request, env) {
const url = new URL(request.url);
if (url.pathname === "/api/hello") {
return new Response(JSON.stringify({ message: "Hello from the edge!" }), {
headers: { "Content-Type": "application/json" },
});
}
return new Response("Not Found", { status: 404 });
},
};
// Deploy: wrangler deploy
When a serverless function hasn't been called recently, the provider needs to spin up a new execution environment (load your code, initialize the runtime, run your initialization code). This is called a cold start, and it adds latency to the first request.
| Runtime | Cold Start Time | Notes |
|---|---|---|
| Node.js | 100-500ms | Fast, good default choice |
| Python | 200-600ms | Depends on imports (numpy is heavy) |
| Go | 50-200ms | Compiled binary, fastest cold starts |
| Java | 1-5 seconds | JVM startup is heavy; use GraalVM to improve |
| Cloudflare Workers | <5ms | V8 isolates, not containers -- nearly instant |
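One common way to soften those cold starts is to do expensive setup once, at module scope, so every warm invocation reuses it. A minimal sketch, assuming a hypothetical config file in S3 (the bucket and key are made up for illustration):
// handler.js -- module scope runs once per cold start, then is reused
const { S3Client, GetObjectCommand } = require("@aws-sdk/client-s3");

const s3 = new S3Client({ region: "us-east-1" }); // created once per execution environment
let cachedConfig = null;                          // survives across warm invocations

exports.handler = async (event) => {
  if (!cachedConfig) {
    // Only the first (cold) invocation pays for this round trip
    const res = await s3.send(new GetObjectCommand({ Bucket: "my-config-bucket", Key: "config.json" }));
    cachedConfig = JSON.parse(await res.Body.transformToString());
  }
  return { statusCode: 200, body: JSON.stringify({ config: cachedConfig }) };
};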
| Good Fit | Bad Fit |
|---|---|
| API endpoints with variable traffic | Long-running processes (>15 min) |
| Event processing (file uploads, webhooks) | WebSocket connections |
| Scheduled tasks (cron jobs) | Stateful applications |
| Image/video processing triggers | High-throughput, consistent traffic |
| Low-traffic APIs (pay $0 when idle) | Applications needing <10ms latency on every request |
| MVP / prototypes (ship fast, worry later) | Complex apps with many interconnected services |
Serverless is cheap for low traffic (you pay nothing when idle). But at high, consistent traffic, it gets expensive. A Lambda function handling 10 million requests/month with 500ms average duration costs roughly $30-50/month. That same workload on a $5/month VPS would handle it easily. Serverless is cheaper below a threshold and more expensive above it. That crossover point varies, but for many apps it's around 1-5 million requests/month.
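You can sanity-check that crossover for your own traffic with back-of-the-envelope math. The sketch below treats Lambda's x86 prices (roughly $0.20 per million requests plus roughly $0.0000167 per GB-second) as assumptions -- plug in current prices and your own numbers:
// Rough Lambda vs. flat-rate VPS comparison (assumed prices -- verify current ones)
function lambdaMonthlyCost({ requests, avgDurationMs, memoryGb }) {
  const requestCost = (requests / 1_000_000) * 0.20;             // ~$0.20 per 1M requests
  const gbSeconds = requests * (avgDurationMs / 1000) * memoryGb;
  const computeCost = gbSeconds * 0.0000166667;                   // ~$ per GB-second
  return requestCost + computeCost;
}

const workload = { requests: 10_000_000, avgDurationMs: 500, memoryGb: 0.5 };
console.log(lambdaMonthlyCost(workload).toFixed(2)); // ~43.67 for this workload
// A small VPS at a flat ~$5/month handles the same ~4 requests/second comfortably.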
CI/CD automates the process of testing, building, and deploying your code. Without it, deployments are manual, error-prone, and scary. With it, you push code to Git and it automatically gets tested and deployed. This is how professional teams ship software.
Continuous Integration (CI): Every time you push code, automated tests run. If tests fail, the build fails and you know immediately. This catches bugs before they reach production. The "continuous" part means it happens on every push, not once a week.
Continuous Delivery (CD): After CI passes, the code is automatically packaged and ready to deploy. A human clicks "deploy" to push to production.
Continuous Deployment (also CD): After CI passes, the code automatically deploys to production with no human intervention. This is what most modern teams aim for.
Developer pushes code to GitHub
              |
              v
   +--------------------+
   |  CI Pipeline Runs  |   (automated)
   |  - Install deps    |
   |  - Lint code       |
   |  - Run tests       |
   |  - Build project   |
   +---------+----------+
             |
        Tests pass?
         /       \
       YES        NO
        |          |
        v          v
   +--------+  +----------+
   | Deploy |  |  Notify  |
   |   to   |  | developer|
   |  prod  |  | (fix it) |
   +--------+  +----------+
GitHub Actions runs workflows defined in YAML files inside your repo (.github/workflows/). Workflows are triggered by events (push, pull request, schedule, manual). Each workflow has jobs, and each job has steps.
# .github/workflows/ci.yml
name: CI/CD Pipeline

# When to run
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest  # GitHub provides the VM
    steps:
      # 1. Check out your code
      - uses: actions/checkout@v4
      # 2. Set up Node.js
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'  # Cache node_modules between runs
      # 3. Install dependencies
      - run: npm ci
      # 4. Lint
      - run: npm run lint
      # 5. Run tests
      - run: npm test
      # 6. Build
      - run: npm run build

  deploy:
    needs: test  # Only runs if "test" job passes
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'  # Only deploy from main branch
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'
      - run: npm ci
      - run: npm run build
      # Deploy to your hosting provider
      - name: Deploy to production
        run: |
          npx vercel --prod --token=${{ secrets.VERCEL_TOKEN }}
        env:
          VERCEL_ORG_ID: ${{ secrets.VERCEL_ORG_ID }}
          VERCEL_PROJECT_ID: ${{ secrets.VERCEL_PROJECT_ID }}
on: push/pull_request -- Trigger this workflow when someone pushes to main or opens a PR against main.
runs-on: ubuntu-latest -- GitHub spins up a fresh Ubuntu VM for this job. It's destroyed after the job finishes. You get 2,000 free minutes/month on the free plan.
uses: actions/checkout@v4 -- A pre-built action that clones your repo into the VM. Without this, the VM has no code.
cache: 'npm' -- Caches node_modules between workflow runs. Instead of downloading all dependencies every time (30-60 seconds), it restores from cache (2-3 seconds).
needs: test -- The deploy job won't start until the test job succeeds. If tests fail, deploy is skipped.
${{ secrets.VERCEL_TOKEN }} -- GitHub Secrets store sensitive values (API keys, tokens). You set them in your repo settings. They're never exposed in logs.
# .github/workflows/deploy.yml
name: Build and Deploy

on:
  push:
    branches: [main]

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Log into Docker Hub (or ECR, GHCR)
      - name: Login to Docker Hub
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKER_USERNAME }}
          password: ${{ secrets.DOCKER_TOKEN }}
      # Build and push Docker image
      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          push: true
          tags: myuser/myapp:latest,myuser/myapp:${{ github.sha }}
      # Deploy to server via SSH
      - name: Deploy
        uses: appleboy/ssh-action@v1
        with:
          host: ${{ secrets.SERVER_HOST }}
          username: deploy
          key: ${{ secrets.SSH_PRIVATE_KEY }}
          script: |
            docker pull myuser/myapp:latest
            docker stop myapp || true
            docker rm myapp || true
            docker run -d --name myapp \
              -p 80:3000 \
              --env-file /home/deploy/.env \
              myuser/myapp:latest
GitHub Actions is the most popular for open source and startups. But you'll also encounter: GitLab CI (built into GitLab), Jenkins (self-hosted, oldest, most configurable), CircleCI, Travis CI (declining), Buildkite (hybrid -- you provide the runners), and Drone CI (container-native). The concepts are identical across all of them -- only the YAML syntax differs.
Infrastructure as Code (IaC) means defining your cloud resources in code files instead of clicking through a web console. Your servers, databases, networks, DNS records -- all described in text files that live in Git. This makes infrastructure reproducible, reviewable, and version-controlled.
Imagine you set up your production environment by clicking through the AWS console for 3 hours. Everything works. Six months later, you need to create an identical staging environment. Can you remember every click? Every security group rule? Every IAM policy? Of course not.
With IaC, you run terraform apply and your entire infrastructure is created from a file. Need a staging copy? Change a variable and apply again. Need to see who changed the database size? Check the git log. Need to roll back a networking change? git revert and re-apply.
# IMPERATIVE (scripts): "Here are the steps to get there"
# Like giving someone turn-by-turn directions
aws ec2 create-security-group --group-name web-sg --description "Web SG"
aws ec2 authorize-security-group-ingress --group-name web-sg --port 80 --cidr 0.0.0.0/0
aws ec2 run-instances --instance-type t3.micro --security-groups web-sg
# Problem: run it twice and you get TWO servers. Not idempotent.
# DECLARATIVE (Terraform/CloudFormation): "Here's what I want to exist"
# Like showing someone a photo of where you want to go
resource "aws_instance" "web" {
instance_type = "t3.micro"
ami = "ami-0c55b159cbfafe1f0"
}
# Run it twice and Terraform says "already exists, nothing to do."
# Change instance_type and Terraform updates only what changed.
Terraform by HashiCorp works with every cloud provider (and many other services like GitHub, Cloudflare, Datadog). You write .tf files using HCL (HashiCorp Configuration Language), then Terraform figures out what needs to be created, updated, or destroyed.
# main.tf -- Define your infrastructure
# Tell Terraform which providers to use
terraform {
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
}
# Configure the AWS provider
provider "aws" {
region = "us-east-1"
}
# Create a VPC
resource "aws_vpc" "main" {
cidr_block = "10.0.0.0/16"
tags = {
Name = "my-app-vpc"
}
}
# Create a public subnet
resource "aws_subnet" "public" {
vpc_id = aws_vpc.main.id
cidr_block = "10.0.1.0/24"
availability_zone = "us-east-1a"
tags = {
Name = "public-subnet"
}
}
# Create an EC2 instance
resource "aws_instance" "web" {
ami = "ami-0c55b159cbfafe1f0"
instance_type = var.instance_type # Variable -- configurable
subnet_id = aws_subnet.public.id # Reference another resource
tags = {
Name = "web-server"
Environment = var.environment
}
}
# Create an RDS database
resource "aws_db_instance" "postgres" {
engine = "postgres"
engine_version = "16"
instance_class = "db.t3.micro"
allocated_storage = 20
db_name = "myapp"
username = "appuser"
password = var.db_password # From variable (never hardcode)
skip_final_snapshot = true # For dev environments
}
# variables.tf -- Define configurable values
variable "instance_type" {
description = "EC2 instance type"
default = "t3.micro"
}
variable "environment" {
description = "Environment name"
default = "development"
}
variable "db_password" {
description = "Database password"
sensitive = true # Won't show in logs/output
}
# 1. Initialize (download provider plugins)
terraform init
# 2. Plan (preview what will change -- ALWAYS do this)
terraform plan
# Output:
# + aws_instance.web will be created
# + aws_db_instance.postgres will be created
# Plan: 2 to add, 0 to change, 0 to destroy.
# 3. Apply (create/update resources)
terraform apply
# Type "yes" to confirm (or use -auto-approve in CI)
# 4. Destroy (tear down everything -- careful!)
terraform destroy
Terraform stores the current state of your infrastructure in a terraform.tfstate file. This file maps your .tf definitions to real cloud resources. If you lose this file, Terraform doesn't know what exists and might try to create duplicates. Never commit this file to git. Store it in a remote backend (S3 bucket, Terraform Cloud) so your team shares the same state.
A load balancer sits in front of multiple servers and distributes incoming requests across them. Without one, a single server handles all traffic and becomes a bottleneck (and a single point of failure). With one, you can scale to any number of servers and survive individual server failures.
Without Load Balancer:              With Load Balancer:

Users ---> [Server]                 Users ---> [Load Balancer]
           (single point                         /      |      \
            of failure)                    [Srv 1]  [Srv 2]  [Srv 3]
                                           (any server can die,
                                            users don't notice)
A load balancer is like the host at a busy restaurant. When customers arrive, the host doesn't send everyone to the same table. They look at which servers (waiters) are available and distribute customers evenly. If one waiter calls in sick, the host just routes to the remaining waiters. Customers still get served without knowing anything changed.
Load balancers operate at different layers of the networking stack, which determines what information they can see and use for routing decisions.
| Feature | Layer 4 (Transport) | Layer 7 (Application) |
|---|---|---|
| Sees | IP addresses, TCP/UDP ports | HTTP headers, URLs, cookies, body |
| Routing based on | IP + port only | URL path, hostname, headers, cookies |
| Performance | Faster (less processing) | Slower (must parse HTTP) |
| SSL termination | No (passes encrypted traffic through) | Yes (decrypts, inspects, re-encrypts) |
| Use case | TCP/UDP services, databases, gaming | HTTP APIs, web apps, microservices |
| AWS service | NLB (Network Load Balancer) | ALB (Application Load Balancer) |
# With a Layer 7 load balancer, you can route based on URL path:
#
# /api/* --> API servers (backend cluster)
# /images/* --> Image processing servers
# /admin/* --> Admin dashboard servers
# /* --> Frontend servers
#
# Or route based on hostname:
#
# api.example.com --> API cluster
# www.example.com --> Web frontend cluster
# admin.example.com --> Admin cluster
#
# This is impossible with Layer 4 -- it can't see the URL or hostname.
# How the load balancer decides which server gets the next request:
# 1. ROUND ROBIN (simplest, most common)
# Request 1 -> Server A
# Request 2 -> Server B
# Request 3 -> Server C
# Request 4 -> Server A (back to start)
# Good for: servers with equal capacity
# 2. WEIGHTED ROUND ROBIN
# Server A (weight 3): gets 3 out of every 6 requests
# Server B (weight 2): gets 2 out of every 6 requests
# Server C (weight 1): gets 1 out of every 6 requests
# Good for: servers with different capacities
# 3. LEAST CONNECTIONS
# Send to the server with the fewest active connections
# Good for: requests that take variable time (some fast, some slow)
# 4. IP HASH
# Hash the client's IP to consistently route to the same server
# Good for: session stickiness without cookies
# Problem: if a server dies, all its clients get reassigned
# 5. LEAST RESPONSE TIME
# Send to the server that's responding fastest right now
# Good for: maximizing performance when servers have varying load
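Two of these strategies are simple enough to sketch in a few lines of JavaScript; the server list is hypothetical, but it makes the difference concrete:
// Round robin: rotate through servers regardless of their current load
function makeRoundRobin(servers) {
  let i = 0;
  return () => servers[i++ % servers.length];
}

// Least connections: pick whichever server has the fewest active connections
function pickLeastConnections(servers) {
  return servers.reduce((best, s) => (s.activeConnections < best.activeConnections ? s : best));
}

const servers = [
  { name: "srv-a", activeConnections: 12 },
  { name: "srv-b", activeConnections: 3 },
  { name: "srv-c", activeConnections: 7 },
];

const next = makeRoundRobin(servers);
console.log(next().name, next().name, next().name, next().name); // srv-a srv-b srv-c srv-a
console.log(pickLeastConnections(servers).name);                 // srv-b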
The load balancer regularly pings your servers to check if they're alive. If a server stops responding, the load balancer stops sending traffic to it. When it recovers, traffic resumes.
# Health check configuration (AWS ALB example):
#
# Protocol: HTTP
# Path: /health
# Port: 3000
# Interval: 30 seconds (check every 30s)
# Timeout: 5 seconds (wait 5s for response)
# Healthy: 3 consecutive successes = server is healthy
# Unhealthy: 2 consecutive failures = server is unhealthy
# Your app needs a health endpoint:
app.get("/health", async (req, res) => {
// Check dependencies
const dbHealthy = await checkDatabase();
const cacheHealthy = await checkRedis();
if (dbHealthy && cacheHealthy) {
res.status(200).json({ status: "ok" });
} else {
res.status(503).json({ status: "unhealthy", db: dbHealthy, cache: cacheHealthy });
}
});
A CDN caches copies of your content at servers spread across the globe (called edge nodes or PoPs -- Points of Presence). When a user requests your image, JavaScript file, or API response, they get it from the nearest edge node instead of your origin server thousands of miles away.
Without CDN:                      With CDN:

User in Tokyo                     User in Tokyo
      |                                 |
      |  200ms round trip               |  10ms round trip
      |                                 |
      v                                 v
Server in Virginia                Edge node in Tokyo
                                        |
                                  (cache miss? fetch from origin)
                                        |
                                        v
                                  Server in Virginia
First request: User in Tokyo requests /images/logo.png. No edge node in Tokyo has it cached. The CDN fetches it from your origin server in Virginia (slow), serves it to the user, and caches a copy at the Tokyo edge node.
Second request: Another user in Tokyo requests the same image. The edge node already has it cached. Served instantly. No round trip to Virginia.
Cache expiration: After the TTL (Time To Live) expires (e.g., 24 hours), the edge node considers the cached copy stale and fetches a fresh copy from origin on the next request.
| Content Type | CDN Benefit | TTL Recommendation |
|---|---|---|
| Static assets (JS, CSS, images) | Massive -- rarely changes | 1 year (use content hash in filename) |
| Fonts | Massive -- never changes | 1 year |
| Video / large media | Huge -- saves bandwidth | 1 week to 1 year |
| HTML pages (static sites) | High -- same content for everyone | 5 min to 1 hour |
| API responses (public data) | Moderate -- depends on freshness needs | 30 sec to 5 min |
| API responses (user-specific) | Low -- different for every user | Don't cache (or use Vary header) |
# You control CDN caching through HTTP headers:
# Cache for 1 year (static assets with hashed filenames)
Cache-Control: public, max-age=31536000, immutable
# "public" = CDN can cache it
# "immutable" = don't even check for updates (filename changes instead)
# Cache for 5 minutes, revalidate after
Cache-Control: public, max-age=300, stale-while-revalidate=60
# Serve cached copy for 300s. After that, serve stale for 60s
# while fetching fresh copy in background.
# Don't cache (user-specific data)
Cache-Control: private, no-store
# "private" = only browser can cache, not CDN
# "no-store" = don't cache at all
# Cache but always revalidate (HTML pages)
Cache-Control: public, no-cache
# "no-cache" doesn't mean "don't cache" -- it means
# "cache it, but check with origin before using it" (via ETag/If-Modified-Since)
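In application code these are just response headers. A minimal Express sketch applying the same policies (the routes and directory are made up for illustration; express.static translates the maxAge/immutable options into the Cache-Control header shown above):
const express = require("express");
const app = express();

// Hashed static assets: cache for a year, never revalidate
app.use("/assets", express.static("dist/assets", {
  maxAge: "1y",     // -> Cache-Control: public, max-age=31536000
  immutable: true,  // -> adds "immutable"
}));

// Public API data that tolerates 5 minutes of staleness
app.get("/api/stats", (req, res) => {
  res.set("Cache-Control", "public, max-age=300, stale-while-revalidate=60");
  res.json({ users: 1234 });
});

// User-specific data: never cache at the CDN
app.get("/api/me", (req, res) => {
  res.set("Cache-Control", "private, no-store");
  res.json({ id: "user-123" });
});

app.listen(3000);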
You deploy a new version of your app. Your CSS is cached at 200+ edge nodes worldwide with a 1-year TTL. Users see the old CSS. What do you do? You can't easily purge every edge node instantly.
The solution: content-hashed filenames. Instead of style.css, your build tool generates style.a3f8b2c1.css. The hash changes when the content changes. New deployment = new filename = CDN fetches the new file. Old cached files naturally expire. This is why every modern build tool (Vite, webpack, Parcel) includes content hashing.
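Build tools handle this for you, but the mechanism is tiny: hash the file's contents and put the hash in the filename. A hypothetical Node sketch of the idea (the dist/style.css path is made up):
// Derive the filename from the file's contents -- what Vite/webpack/Parcel do at build time
const crypto = require("crypto");
const fs = require("fs");
const path = require("path");

function hashedName(filePath) {
  const contents = fs.readFileSync(filePath);
  const hash = crypto.createHash("sha256").update(contents).digest("hex").slice(0, 8);
  const { name, ext } = path.parse(filePath);
  return `${name}.${hash}${ext}`; // style.css -> style.a3f8b2c1.css
}

console.log(hashedName("dist/style.css"));
// Same contents -> same name (still cached). Changed contents -> new name -> CDN fetches fresh.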
| Provider | Edge Nodes | Notable Features | Cost |
|---|---|---|---|
| Cloudflare | 300+ cities | DDoS protection, WAF, Workers (edge compute), free tier | Generous free tier, paid from $20/mo |
| AWS CloudFront | 450+ PoPs | Deep AWS integration, Lambda@Edge | Pay per GB transferred + requests |
| Fastly | 80+ PoPs | Instant purge, VCL configs, Compute@Edge | Pay per usage, more expensive |
| Bunny CDN | 120+ PoPs | Simple, cheap, good for small projects | From $0.01/GB |
Start with Cloudflare. Their free tier gives you CDN, DDoS protection, SSL, and DNS -- all for $0. Just point your domain's nameservers to Cloudflare, and they proxy all traffic through their edge network. It's the easiest way to add a CDN to any project.
Object storage is a flat storage system where every item (called an "object") gets a unique key. Unlike a filesystem with directories and nested folders, object storage is a giant key-value store. The key looks like a path (images/avatars/user-123.jpg), but there's no actual directory structure -- it's just a string.
Filesystem (your hard drive): Hierarchical. Directories contain files. You navigate a tree structure. Great for small, frequently-accessed data. Limited to one machine.
Object storage (S3): Flat. Every object has a key and metadata. Scales to exabytes. Accessible from anywhere via HTTP. Built for durability (data replicated across multiple facilities). No concept of "editing a file" -- you replace the whole object.
Rule of thumb: If your app creates/serves files that users upload (images, videos, PDFs, backups), use object storage. If your app reads/writes local config or temp files, use the filesystem.
# AWS S3 CLI basics
# Create a bucket
aws s3 mb s3://my-app-uploads-prod
# Upload a file
aws s3 cp ./photo.jpg s3://my-app-uploads-prod/avatars/user-123.jpg
# Download a file
aws s3 cp s3://my-app-uploads-prod/avatars/user-123.jpg ./downloaded.jpg
# List objects
aws s3 ls s3://my-app-uploads-prod/avatars/
# Delete an object
aws s3 rm s3://my-app-uploads-prod/avatars/user-123.jpg
# Sync a directory
aws s3 sync ./uploads s3://my-app-uploads-prod/ --delete
// Node.js: Upload to S3 using AWS SDK v3
const { S3Client, PutObjectCommand, GetObjectCommand } = require("@aws-sdk/client-s3");
const { getSignedUrl } = require("@aws-sdk/s3-request-presigner");
const multer = require("multer");                            // parses multipart/form-data uploads
const upload = multer({ storage: multer.memoryStorage() });  // keeps req.file.buffer in memory
const s3 = new S3Client({ region: "us-east-1" });
// Direct upload from your server
async function uploadFile(buffer, key, contentType) {
await s3.send(new PutObjectCommand({
Bucket: "my-app-uploads-prod",
Key: key, // e.g., "avatars/user-123.jpg"
Body: buffer, // File contents as Buffer
ContentType: contentType, // e.g., "image/jpeg"
}));
return `https://my-app-uploads-prod.s3.amazonaws.com/${key}`;
}
// Express route: handle file upload
app.post("/api/upload", upload.single("file"), async (req, res) => {
const key = `uploads/${Date.now()}-${req.file.originalname}`;
const url = await uploadFile(req.file.buffer, key, req.file.mimetype);
res.json({ url });
});
Instead of uploading files through your server (which uses your bandwidth and CPU), generate a presigned URL that lets the client upload directly to S3. Your server never touches the file.
// Step 1: Client requests an upload URL from your API
// Step 2: Your API generates a presigned S3 URL
// Step 3: Client uploads directly to S3 using that URL
// Step 4: Client tells your API the upload is done
// Server: generate presigned upload URL
app.post("/api/upload-url", async (req, res) => {
const { filename, contentType } = req.body;
const key = `uploads/${req.user.id}/${Date.now()}-${filename}`;
const command = new PutObjectCommand({
Bucket: "my-app-uploads-prod",
Key: key,
ContentType: contentType,
});
// URL is valid for 15 minutes
const uploadUrl = await getSignedUrl(s3, command, { expiresIn: 900 });
res.json({ uploadUrl, key });
});
// Client: upload directly to S3
const { uploadUrl } = await fetch("/api/upload-url", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ filename: "photo.jpg", contentType: "image/jpeg" }),
}).then(r => r.json());
// Upload file directly to S3 (bypasses your server entirely)
await fetch(uploadUrl, {
method: "PUT",
headers: { "Content-Type": "image/jpeg" },
body: fileBlob, // The actual file
});
Without presigned URLs, a 100MB file upload goes: User -> Your Server -> S3. Your server is a bottleneck, uses bandwidth, and ties up a connection for the entire upload.
With presigned URLs, the 100MB file goes: User -> S3 directly. Your server only generates a URL (tiny JSON response). This scales massively better and costs less.
| Service | Provider | Key Benefit | Pricing |
|---|---|---|---|
| S3 | AWS | The original, most features | $0.023/GB/month + egress fees |
| R2 | Cloudflare | Zero egress fees | $0.015/GB/month, no egress cost |
| Backblaze B2 | Backblaze | Cheapest storage | $0.006/GB/month |
| MinIO | Self-hosted | S3-compatible, runs on your servers | Free (you pay for hardware) |
| GCS | Google Cloud | Integrated with GCP ecosystem | $0.020/GB/month |
Storing data in S3 is cheap ($0.023/GB/month). Downloading data from S3 is expensive ($0.09/GB). If you serve 1TB of images directly from S3 per month, that's $92 just in egress fees. This is why you put a CDN in front of S3 -- the CDN caches the files and serves them from edge nodes, dramatically reducing how much data leaves S3. Alternatively, use Cloudflare R2 which has zero egress fees.
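The saving is easy to estimate: only cache misses have to leave S3. A quick sketch, assuming the ~$0.09/GB egress price above and an illustrative cache-hit ratio:
// Egress cost for a month of traffic, with and without a CDN in front (assumed prices)
function s3EgressCost(totalGbServed, cdnHitRatio) {
  const gbLeavingS3 = totalGbServed * (1 - cdnHitRatio); // only misses go back to origin
  return gbLeavingS3 * 0.09;                             // ~$0.09/GB out of S3
}

console.log(s3EgressCost(1024, 0).toFixed(2));    // no CDN:         ~92.16 for 1 TB served
console.log(s3EgressCost(1024, 0.95).toFixed(2)); // 95% cache hits: ~4.61
// The CDN's own bandwidth may cost something, unless you use a free tier like Cloudflare's.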
A managed database is a database where the cloud provider handles all the operational work -- backups, patching, replication, failover, monitoring, scaling. You just connect and run queries. The alternative is running your own database on a VM (or bare metal), which means you're responsible for everything.
| Aspect | Managed (RDS, PlanetScale, etc.) | Self-Hosted (Postgres on a VM) |
|---|---|---|
| Setup time | 5 minutes | Hours to days (properly hardened) |
| Backups | Automatic, point-in-time recovery | You set up pg_dump / WAL archiving |
| Patching | Provider handles it (with maintenance window) | You track CVEs and apply patches |
| Failover | Automatic (Multi-AZ) | You configure streaming replication + a failover manager (e.g., Patroni) |
| Scaling | Click a button (or API call) | Migrate data to bigger machine manually |
| Cost | 2-5x more expensive | Cheaper, but your time has value |
| Control | Limited (can't tune everything) | Full control over every setting |
| Best for | Most teams, production workloads | Cost-sensitive, high-volume, specific tuning needs |
Use managed databases unless you have a specific reason not to. The cost premium is worth it for most teams because the alternative is waking up at 3 AM to fix a crashed database. Your engineering time is more expensive than the managed DB surcharge.
| Service | Engine | Starting Price | Notes |
|---|---|---|---|
| AWS RDS | Postgres, MySQL, MariaDB, Oracle, SQL Server | ~$15/month (db.t3.micro) | The standard. Multi-AZ, read replicas, automated backups. |
| AWS Aurora | MySQL/Postgres compatible | ~$30/month | 5x faster than standard MySQL. Storage auto-scales. More expensive but more performant. |
| GCP Cloud SQL | Postgres, MySQL, SQL Server | ~$10/month | Similar to RDS. Good GCP integration. |
| Service | Engine | Standout Feature | Free Tier |
|---|---|---|---|
| Neon | PostgreSQL | Branching (like git branches for your database), scales to zero | 500MB storage, generous compute |
| PlanetScale | MySQL (Vitess) | Non-blocking schema changes, horizontal scaling | 5GB storage, 1 billion row reads/month |
| Supabase | PostgreSQL | Built-in auth, real-time subscriptions, REST API auto-generated | 500MB storage, 50K monthly active users |
| Turso | SQLite (libSQL) | Edge replicas (database at every edge location), embedded SQLite | 9GB storage, 500 databases |
| Upstash | Redis, Kafka | Serverless Redis with per-request pricing | 10K commands/day |
Side project / MVP: Supabase or Neon free tier. You get a real PostgreSQL database with modern tooling for $0. Ship fast.
Startup (growing): Neon or PlanetScale paid tier. Database branching for safe migrations (Neon), or horizontal scaling with non-blocking schema changes (PlanetScale).
Enterprise / production: AWS RDS or Aurora. Battle-tested, every compliance certification, multi-AZ failover, and your ops team already knows AWS.
Edge-first app: Turso. Put read replicas at the edge so database reads are <10ms worldwide. Write to a primary that replicates automatically.
Caching layer: Upstash Redis. Serverless pricing means you pay nothing when idle, pennies when active.
# Every managed database gives you a connection string. Format:
# protocol://username:password@host:port/database?options
# PostgreSQL (RDS, Neon, Supabase)
postgresql://appuser:secret@db.abc123.us-east-1.rds.amazonaws.com:5432/myapp?sslmode=require
# MySQL (PlanetScale)
mysql://username:password@aws.connect.psdb.cloud/mydb?ssl={"rejectUnauthorized":true}
# Redis (Upstash)
rediss://default:token@usw1-capable-toucan-12345.upstash.io:6379
# ALWAYS:
# - Use SSL/TLS (sslmode=require or ?ssl=true)
# - Store connection strings in environment variables, NEVER in code
# - Use connection pooling (PgBouncer, Prisma pool, etc.)
# - Restrict access to your VPC / allowed IP addresses
Serverless functions (Lambda, Vercel) create a new database connection on every cold start. At scale, this means hundreds of connections opening and closing rapidly, overwhelming your database. The fix: use an external connection pooler. Neon has a built-in pooler. Supabase uses PgBouncer. For RDS, use RDS Proxy. Without pooling, your serverless app will crash your database at moderate traffic.
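On the application side, the usual pattern is one small pool per execution environment, created at module scope and pointed at the pooler endpoint. A sketch with the pg package; DATABASE_POOLER_URL is a made-up name for whatever connection string your pooler (PgBouncer, Neon's pooler, RDS Proxy) gives you:
// Lambda + Postgres: create the pool once per execution environment, keep it tiny
const { Pool } = require("pg");

const pool = new Pool({
  connectionString: process.env.DATABASE_POOLER_URL, // the pooler endpoint, not the raw DB host
  max: 1,                                            // one connection per concurrent environment
  idleTimeoutMillis: 30000,
  ssl: { rejectUnauthorized: true },
});

exports.handler = async (event) => {
  const { rows } = await pool.query("SELECT now() AS server_time");
  return { statusCode: 200, body: JSON.stringify(rows[0]) };
};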
Cloud billing is designed to be complex. Every service has its own pricing model with multiple dimensions (compute, storage, network, requests, data transfer). People routinely wake up to unexpected $500 bills from leaving a test cluster running or misconfiguring auto-scaling. Understanding cloud costs is a survival skill.
A developer left 5 GPU instances running overnight: $2,000 bill. A startup enabled S3 Transfer Acceleration on a high-traffic bucket: $8,000 surprise. A student accidentally created 100 RDS instances in a loop: $14,000 in 48 hours. AWS and GCP will charge you, send the bill, and expect payment. Set up billing alerts BEFORE you create any resources.
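You can click a billing alert together in the console, or create one programmatically. A sketch using CloudWatch's EstimatedCharges metric -- note that billing metrics only exist in us-east-1, must first be enabled in the billing preferences, and the SNS topic ARN below is a placeholder:
// Alarm when the month-to-date AWS bill crosses $50
const { CloudWatchClient, PutMetricAlarmCommand } = require("@aws-sdk/client-cloudwatch");

const cw = new CloudWatchClient({ region: "us-east-1" }); // billing metrics live only in us-east-1

async function createBillingAlarm() {
  await cw.send(new PutMetricAlarmCommand({
    AlarmName: "monthly-bill-over-50-usd",
    Namespace: "AWS/Billing",
    MetricName: "EstimatedCharges",
    Dimensions: [{ Name: "Currency", Value: "USD" }],
    Statistic: "Maximum",
    Period: 21600,                 // billing data only updates a few times a day
    EvaluationPeriods: 1,
    Threshold: 50,
    ComparisonOperator: "GreaterThanThreshold",
    AlarmActions: ["arn:aws:sns:us-east-1:123456789012:billing-alerts"], // placeholder SNS topic
  }));
}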
# Cloud costs have multiple dimensions. Here's a typical web app:
# 1. COMPUTE (EC2 / VMs)
# Billed per second (or hour) the instance is running
# t3.micro: ~$0.0104/hour = ~$7.50/month (running 24/7)
# t3.medium: ~$0.0416/hour = ~$30/month
# GOTCHA: Instances bill even when idle. Stop/terminate when not needed.
# 2. STORAGE (EBS / disks attached to VMs)
# Billed per GB per month
# gp3 (SSD): $0.08/GB/month. 100GB = $8/month
# GOTCHA: Volumes persist after instance termination. Delete them manually.
# 3. DATA TRANSFER (bandwidth)
# Inbound: FREE (data going INTO AWS)
# Outbound: $0.09/GB (data leaving AWS to the internet)
# Between AZs: $0.01/GB (data between availability zones)
# GOTCHA: This is the hidden killer. Serving 10TB/month = $900.
# 4. OBJECT STORAGE (S3)
# Storage: $0.023/GB/month
# Requests: $0.0004 per 1,000 GET requests
# Egress: $0.09/GB (same as data transfer)
# GOTCHA: Millions of tiny objects = expensive from requests, not storage.
# 5. DATABASE (RDS)
# Instance: $15-$500+/month depending on size
# Storage: $0.115/GB/month (gp3)
# Backups: Free up to 100% of DB storage, then $0.095/GB
# GOTCHA: Multi-AZ doubles the instance cost.
| Service | Free Tier | The Trap |
|---|---|---|
| EC2 | 750 hours/month of t2.micro (12 months) | That's ONE instance running 24/7. Two instances = you're billed for the second one. Also, t2.micro not t3.micro -- make sure you select the right one. |
| RDS | 750 hours/month of db.t2.micro (12 months) | Multi-AZ is NOT free tier. Enabling it doubles your bill immediately. |
| S3 | 5GB storage, 20K GET, 2K PUT (12 months) | 20K GETs is nothing. A moderately popular image gallery can burn through this in a day. |
| Lambda | 1M requests + 400K GB-seconds/month (always free) | This is genuinely generous. Most small apps never exceed it. |
| Data Transfer | 100GB/month outbound (12 months) | After free tier expires, 100GB/month = $9. Doesn't sound bad until your traffic grows. |
| NAT Gateway | NOT included in free tier | $32/month minimum just to exist + $0.045/GB. The sneakiest cost in AWS. Needed if private subnets need internet access. |
Run aws ec2 describe-instances --query 'Reservations[*].Instances[*].[InstanceId,State.Name,InstanceType]' regularly and kill anything you're not using.
# 1. RIGHT-SIZE YOUR INSTANCES
# Most instances are over-provisioned. Monitor CPU/RAM usage.
# If your t3.medium (4GB RAM) never uses more than 1.5GB,
# switch to t3.small (2GB RAM) and save 50%.
# 2. USE RESERVED INSTANCES OR SAVINGS PLANS
# Commit to 1 or 3 years of usage for 30-60% discount.
# Only for stable, predictable workloads.
# AWS: Reserved Instances or Savings Plans
# GCP: Committed Use Discounts
# 3. USE SPOT INSTANCES FOR FAULT-TOLERANT WORK
# Spot/Preemptible VMs cost 60-90% less than on-demand.
# AWS can reclaim them with 2 minutes notice.
# Good for: batch processing, CI/CD runners, data pipelines
# Bad for: web servers, databases, anything stateful
# 4. PUT A CDN IN FRONT OF EVERYTHING
# Reduce origin data transfer (the expensive part)
# Cloudflare free tier = $0 for unlimited bandwidth
# That $900/month S3 egress bill becomes $50/month with a CDN.
# 5. USE S3 STORAGE CLASSES
# S3 Standard: $0.023/GB (frequently accessed)
# S3 Infrequent Access: $0.0125/GB (accessed once a month)
# S3 Glacier Instant: $0.004/GB (archival, rare access)
# S3 Glacier Deep Archive: $0.00099/GB (accessed once a year)
# Set up lifecycle policies to automatically move old data to cheaper tiers.
# 6. USE SERVERLESS FOR VARIABLE WORKLOADS
# A t3.small running 24/7 costs ~$15/month even when idle.
# Lambda costs $0 when not running.
# For APIs with <100K requests/month, serverless is nearly free.
# 7. AVOID NAT GATEWAYS IF POSSIBLE
# $32/month + data transfer for a NAT Gateway.
# Alternatives: VPC endpoints for AWS services ($0),
# or put your resources in public subnets with security groups.
# 8. SCHEDULE NON-PRODUCTION RESOURCES
# Dev/staging environments don't need to run 24/7.
# Run them 10 hours/day, 5 days/week = 70% cost reduction.
# Use AWS Instance Scheduler or a simple cron + Lambda.
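Tip #8 above is easy to automate: a scheduled Lambda (EventBridge cron) stops anything tagged as a dev resource in the evening, and a twin function starts it in the morning. A sketch -- the env=dev tag is an assumed convention, not an AWS default:
// Scheduled Lambda: stop every running instance tagged env=dev
const { EC2Client, DescribeInstancesCommand, StopInstancesCommand } = require("@aws-sdk/client-ec2");

const ec2 = new EC2Client({ region: "us-east-1" });

exports.handler = async () => {
  const { Reservations = [] } = await ec2.send(new DescribeInstancesCommand({
    Filters: [
      { Name: "tag:env", Values: ["dev"] },                 // assumed tagging convention
      { Name: "instance-state-name", Values: ["running"] },
    ],
  }));

  const ids = Reservations.flatMap(r => r.Instances.map(i => i.InstanceId));
  if (ids.length > 0) {
    await ec2.send(new StopInstancesCommand({ InstanceIds: ids }));
  }
  return { stopped: ids };
};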
| Setup | Estimated Monthly Cost | Notes |
|---|---|---|
| Static site on Vercel/Netlify/Cloudflare Pages | $0 | Free tier covers most personal sites |
| Small API on Lambda + Neon DB | $0-5 | Free tiers cover low traffic |
| VPS (Hetzner) + self-hosted Postgres | $5-10 | Best bang for buck, but you're the ops team |
| EC2 t3.small + RDS db.t3.micro | $30-50 | Basic AWS setup, watch for data transfer |
| ECS/EKS + RDS + ElastiCache + ALB | $150-400 | Production stack, costs add up quick |
| Multi-AZ RDS + CloudFront + S3 + Lambda | $100-300 | Serverless production stack |
Always set a billing alert before you start. AWS will happily charge you $10,000 and send you the bill later. GCP is slightly better (they'll pause projects at budget limits if configured). But never assume you're "on the free tier" -- verify it. Check your billing dashboard weekly when you're learning. It takes 30 seconds and can save you hundreds of dollars.
When you're done experimenting, tear it all down: run terraform destroy, terminate instances, delete databases, and check your bill.