Table of Contents
1. What Is an OS?
2. Processes
3. System Calls
4. Process API (fork, exec, wait)
5. CPU Scheduling
6. Threads & Concurrency
7. Locks & Synchronization
8. Deadlocks
9. Memory & Address Spaces
10. Paging & Virtual Memory
11. I/O & Devices
12. File Systems
13. OS Security Basics
14. Context Switching
15. Inter-Process Communication (IPC)
16. How Programs Actually Run
17. Containers & Virtualization
18. Essential Linux Commands
19. Practice Quiz
1. What Is an Operating System?
An operating system (OS) is a piece of software that sits between your programs and the hardware. When you open a browser, type in a terminal, or save a file, your program doesn't talk to the CPU or disk directly -- it asks the OS to do it.
The OS has three main jobs:
- Virtualisation -- make it look like each program has its own CPU and its own memory, even though they're all sharing the same physical hardware.
- Concurrency -- let multiple programs (and multiple threads within a program) run at the same time without stepping on each other.
- Persistence -- store data on disk so it survives after your program exits or the machine reboots.
These are the "three easy pieces" from the OSTEP textbook. Every concept in this page falls into one of these three buckets.
User Mode vs Kernel Mode
The CPU has (at minimum) two privilege levels:
- User mode -- your normal programs run here. They cannot directly access hardware, other processes' memory, or privileged instructions. If they try, the CPU raises a fault.
- Kernel mode -- the OS kernel runs here. It has full access to everything: hardware registers, all memory, I/O ports.
The only way for a user-mode program to do something privileged (open a file, allocate memory, send a network packet) is to make a system call, which we'll cover in section 3.
This separation is what keeps your computer stable. A buggy program can't crash the whole machine because it literally cannot touch hardware or other processes' memory. The OS acts as a gatekeeper.
2. Processes
A process is a running program. When you double-click an app or type ./my_program in a terminal, the OS creates a process for it. Each process gets:
- Its own address space -- its own view of memory (code, stack, heap, data)
- One or more threads of execution
- Open file descriptors (files, sockets, pipes it has open)
- A process ID (PID) -- a unique number identifying it
- A state -- running, ready, or blocked
Process States
At any moment, a process is in one of these states:
- Ready -- the process could run, but the CPU is busy with another process. It's waiting in the ready queue.
- Running -- the process is actually executing instructions on the CPU right now.
- Blocked -- the process is waiting for something (disk read, network data, user input). It can't run even if the CPU is free.
What's Inside a Process (the PCB)
The OS keeps a Process Control Block (PCB) for every process. Think of it as a struct:
Conceptual
struct PCB {
    int pid;                 // process ID
    int state;               // READY, RUNNING, BLOCKED
    int priority;            // scheduling priority
    regs_t saved_registers;  // CPU registers when not running
    void* page_table;        // pointer to address space info
    int open_files[256];     // file descriptor table
    pid_t parent_pid;        // who created this process
};
When the OS switches from process A to process B (context switch), it saves A's registers into A's PCB, then loads B's registers from B's PCB. This is how the illusion of "every process has its own CPU" works.
Process A is running. A timer interrupt fires (every ~1-10ms). The OS:
- Saves A's registers (program counter, stack pointer, etc.) into A's PCB
- Changes A's state from RUNNING to READY
- Picks process B from the ready queue
- Loads B's saved registers from B's PCB
- Changes B's state from READY to RUNNING
- Jumps to where B left off
This happens thousands of times per second. Each switch takes ~1-10 microseconds.
3. System Calls (Syscalls)
A system call is how your program asks the OS to do something it can't do itself. Your code runs in user mode. It can't touch hardware. So when it needs to open a file, create a process, or send data over the network, it must cross the boundary into kernel mode.
How a Syscall Works (Step by Step)
- Your program calls a libc wrapper like read()
- The wrapper places the syscall number and arguments into registers
- It executes a special trap instruction (syscall on x86-64)
- The CPU switches to kernel mode and jumps to the kernel's syscall handler
- The kernel validates the arguments, does the work, and puts the result in a register
- A return-from-trap instruction switches back to user mode, and the wrapper returns the result
The key insight: the syscall instruction is a hardware mechanism. The CPU itself switches privilege levels. Your program can't fake being in kernel mode -- the CPU enforces the boundary.
Major Syscall Categories
| Category | Syscalls | What They Do |
|---|---|---|
| Process | fork, exec, wait, exit, kill | Create, replace, wait for, terminate processes |
| File I/O | open, read, write, close, lseek | Open, read, write, close files |
| Memory | mmap, munmap, brk | Map memory, grow/shrink heap |
| Network | socket, bind, listen, accept, connect | Create sockets, accept connections |
| Info | getpid, getuid, uname | Get process/user/system info |
| Signals | signal, sigaction, kill | Handle async notifications |
| Directory | mkdir, rmdir, chdir, getcwd | Manage directories |
File Descriptors -- the Universal Handle
When you open a file, the OS doesn't give you a pointer to the file. It gives you a small integer called a file descriptor (fd). This is an index into a per-process table of open files.
Every process starts with three file descriptors already open:
- 0 -- stdin (standard input, usually the keyboard)
- 1 -- stdout (standard output, usually the terminal)
- 2 -- stderr (standard error, also the terminal)
When you call open("myfile.txt", O_RDONLY), the kernel finds the next available fd (usually 3) and returns it. All future read() and write() calls use this fd number.
C
// Open a file -- returns a file descriptor (small int)
int fd = open("data.txt", O_RDONLY);
if (fd == -1) {
    perror("open failed");  // prints human-readable error
    return 1;
}

// Read up to 1024 bytes from the file
char buf[1024];
ssize_t bytes_read = read(fd, buf, sizeof(buf));

// Write to stdout (fd 1)
write(1, buf, bytes_read);

// Always close when done
close(fd);
Unix's big insight: files, pipes, sockets, terminals, and even devices all use the same read()/write()/close() interface with file descriptors. That's why the same code can read from a file, a network connection, or a pipe -- the fd abstracts the details away.
Syscalls vs Library Functions
Don't confuse syscalls with C library (libc) functions:
- printf() is a library function -- it formats your string in user space, then eventually calls the write() syscall to actually output it.
- malloc() is a library function -- it manages a pool of memory in user space, and only calls brk() or mmap() syscalls when it needs more from the OS.
- fopen() is a library wrapper around the open() syscall that adds buffering.
Library functions are faster because they avoid the user-to-kernel mode switch. Syscalls are expensive (hundreds of nanoseconds each), so libc batches operations to minimise them.
You can see every syscall a program makes using strace on Linux:
Shell
# See all syscalls made by ls
$ strace ls
# Count syscalls by type
$ strace -c ls
# Trace only file-related syscalls
$ strace -e trace=file ls
# Trace a running process by PID
$ strace -p 1234
Try strace -c echo "hello" -- you'll see it makes about 30 syscalls just to print one word. Most are setup (loading libraries, setting up memory).
4. Process API (fork, exec, wait)
In Unix/Linux, you create processes using three syscalls that work together: fork(), exec(), and wait(). This design seems odd at first, but it's incredibly powerful.
fork() -- Clone Yourself
fork() creates an exact copy of the current process. The new process (child) gets a copy of the parent's memory, file descriptors, and code. Both processes continue from the same line -- the line after fork().
C
#include <stdio.h>
#include <unistd.h>

int main() {
    printf("Before fork (PID %d)\n", getpid());
    pid_t pid = fork();

    if (pid == 0) {
        // This runs in the CHILD process
        printf("I am the child! PID = %d, parent = %d\n",
               getpid(), getppid());
    } else if (pid > 0) {
        // This runs in the PARENT process
        printf("I am the parent! PID = %d, child = %d\n",
               getpid(), pid);
    } else {
        // fork() failed
        perror("fork");
    }
    return 0;
}
The trick: fork() returns twice -- once in the parent (returns child's PID) and once in the child (returns 0). That's how each process knows who it is.
exec() -- Replace Yourself
exec() replaces the current process's code and memory with a new program. It loads a different executable and starts running it from its main(). The PID stays the same.
C
#include <stdio.h>
#include <unistd.h>

int main() {
    pid_t pid = fork();
    if (pid == 0) {
        // Child: replace myself with "ls -la"
        execlp("ls", "ls", "-la", NULL);
        // If we get here, exec failed
        perror("exec failed");
        return 1;  // exit with an error, don't fall through
    }
    return 0;
}
exec() never returns on success -- the old program is completely gone, replaced by the new one. Any code after exec() only runs if exec failed.
wait() -- Wait for Your Child
wait() blocks the parent until a child process finishes. This lets you run something and get its exit code.
C
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main() {
    pid_t pid = fork();
    if (pid == 0) {
        // Child
        printf("Child doing work...\n");
        sleep(2);
        printf("Child done!\n");
        exit(42);  // exit with code 42
    } else {
        // Parent waits for child
        int status;
        waitpid(pid, &status, 0);
        if (WIFEXITED(status)) {
            printf("Child exited with code %d\n",
                   WEXITSTATUS(status));  // prints 42
        }
    }
    return 0;
}
How a Shell Works (Putting It Together)
Now you can understand how bash/zsh actually works. When you type ls -la in a shell:
- The shell reads and parses your command
- It calls fork() to create a child process
- The child calls exec() to replace itself with the ls program
- The parent (the shell) calls wait() and blocks until ls finishes
- ls exits; the shell wakes up and prints the next prompt
This fork+exec pattern is also why redirection (>, <, |) works. Between the fork() and exec(), the shell can manipulate the child's file descriptors (close stdout, open a file on fd 1) before the new program starts. The new program inherits the modified file descriptors and doesn't even know its output is going to a file instead of the terminal.
If a child exits but its parent never calls wait(), the child becomes a zombie -- it's dead but its PCB entry stays in the process table (so the parent can check its exit code later). Zombies waste a slot in the process table. If the parent dies too, init (PID 1) adopts the orphan and reaps it.
5. CPU Scheduling
The CPU can only run one process at a time (per core). The scheduler decides which process runs next and for how long. The goal: make it feel like everything runs simultaneously.
Key Metrics
- Turnaround time = completion_time - arrival_time (how long from submission to finish)
- Response time = first_run_time - arrival_time (how long until first interaction)
- Waiting time = turnaround_time - burst_time (time spent sitting in the ready queue)
- Throughput = number_of_processes / total_time (jobs completed per unit time)
- Fairness -- every process gets a reasonable share of CPU
Scheduling Algorithms
| Algorithm | How It Works | Pros / Cons |
|---|---|---|
| FIFO | First come, first served. Run each job to completion. | Simple. But a long job blocks everything behind it (convoy effect). |
| SJF | Shortest Job First. Run the shortest job next. | Optimal turnaround time. But you can't always predict job length. |
| Round Robin | Give each process a fixed time slice (quantum, e.g. 10ms). Rotate through the ready queue. | Good response time. Fair. But poor turnaround for long jobs. |
| MLFQ | Multi-Level Feedback Queue. Multiple priority queues. New jobs start at highest priority. If a job uses its full time slice, it moves down. I/O-bound jobs stay high. | Best of both worlds -- responsive to interactive tasks AND handles long jobs. Used by real OSes. |
Worked example (Round Robin, quantum = 2 time units): three processes arrive at time 0: A (needs 5 units), B (needs 3 units), C (needs 1 unit).
Timeline
Time: 0 1 2 3 4 5 6 7 8
CPU: A A B B C A A B A
└──┘ └──┘ │ └──┘ │ │
A:2 B:2 C done B done A done
t=5 t=8 t=9
C finishes at time 5, B at time 8, A at time 9. Compare with FIFO (A, B, C): C wouldn't start until time 8!
Preemption
Modern schedulers are preemptive -- they can forcibly stop a running process and switch to another. This uses a hardware timer interrupt that fires periodically (every 1-10ms). When it fires, the OS gets control and can decide to switch processes. Without preemption, a process could hog the CPU forever.
Linux uses the Completely Fair Scheduler (CFS). It tracks how much CPU time each process has gotten ("virtual runtime") and always runs the process with the least virtual runtime. This ensures fairness over time. It uses a red-black tree to pick the next process in O(log n).
6. Threads & Concurrency
A thread is a lightweight unit of execution within a process. One process can have multiple threads, and they all share the same address space (code, heap, global data). Each thread has its own stack and registers.
Processes vs Threads
| Aspect | Process | Thread |
|---|---|---|
| Memory | Separate address space | Shared address space |
| Creation cost | Expensive (copy page table) | Cheap (just a new stack) |
| Communication | IPC needed (pipes, sockets) | Read/write shared memory directly |
| Crash impact | One crash doesn't affect others | One crash kills all threads in process |
| Context switch | Slow (switch page table) | Fast (same page table) |
Why Threads Are Useful
- Parallelism -- on a multi-core CPU, threads can run on different cores simultaneously. A 4-core machine can truly run 4 threads at once.
- Avoiding blocking -- if one thread is waiting for I/O, other threads keep running. A web server uses one thread per connection so slow clients don't block fast ones.
- Sharing state -- threads can share data structures without the complexity of inter-process communication.
Creating Threads (pthreads)
C
#include <stdio.h>
#include <pthread.h>

void* worker(void* arg) {
    int id = *(int*)arg;
    printf("Thread %d running\n", id);
    return NULL;
}

int main() {
    pthread_t threads[4];
    int ids[4] = {0, 1, 2, 3};

    for (int i = 0; i < 4; i++) {
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    }

    // Wait for all threads to finish
    for (int i = 0; i < 4; i++) {
        pthread_join(threads[i], NULL);
    }
    printf("All threads done\n");
    return 0;
}
Compile with: gcc -pthread thread_example.c -o thread_example
The Concurrency Problem: Race Conditions
Shared memory is both the best and worst thing about threads. If two threads modify the same variable without coordination, you get a race condition.
C
int counter = 0;  // shared between threads

void* increment(void* arg) {
    for (int i = 0; i < 1000000; i++) {
        counter++;  // NOT atomic! This is actually 3 steps:
                    // 1. Load counter from memory into register
                    // 2. Add 1 to register
                    // 3. Store register back to memory
    }
    return NULL;
}

// If 2 threads run increment(), you'd expect counter = 2,000,000
// But you'll get something less -- maybe 1,500,000 -- because
// the threads interleave those 3 steps unpredictably
This is why we need locks and synchronisation, covered in the next section.
7. Locks & Synchronization
A lock (mutex) ensures that only one thread can access a shared resource at a time. The pattern is always the same:
C
#include <pthread.h>

int counter = 0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void* increment(void* arg) {
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);
        counter++;  // safe now -- only one thread at a time
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}
// Now 2 threads will correctly produce counter = 2,000,000
Condition Variables
Sometimes a thread needs to wait for a condition, not just for a lock. A condition variable lets a thread sleep until another thread signals it.
Classic pattern: producer/consumer queue.
C
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
int ready = 0;

// Thread A: producer
pthread_mutex_lock(&lock);
ready = 1;
pthread_cond_signal(&cond);  // wake up the consumer
pthread_mutex_unlock(&lock);

// Thread B: consumer
pthread_mutex_lock(&lock);
while (!ready) {  // always use while, not if
    pthread_cond_wait(&cond, &lock);  // releases lock, sleeps, re-acquires
}
// ready == 1 here, process the data
pthread_mutex_unlock(&lock);
Use while (!condition), never if (!condition) before cond_wait. The thread can be woken spuriously (without the condition actually being true). The while loop rechecks and goes back to sleep if it was a false alarm.
Semaphores
A semaphore is a generalised lock with a counter. A mutex is a semaphore with value 1. A semaphore with value N lets N threads in simultaneously.
Use case: limit concurrent database connections to 10.
C
#include <semaphore.h>
sem_t pool;
sem_init(&pool, 0, 10); // allow 10 concurrent connections
// Each worker thread:
sem_wait(&pool); // decrement (blocks if 0)
// ... use the connection ...
sem_post(&pool); // increment (wake up a waiter)
8. Deadlocks
A deadlock occurs when two or more threads are each waiting for a resource held by another, and none can make progress. Everyone's stuck forever.
Four Conditions for Deadlock
ALL four must be true simultaneously for deadlock to occur (Coffman conditions):
- Mutual exclusion -- resources can't be shared (only one thread holds a lock at a time)
- Hold and wait -- a thread holds one resource while waiting for another
- No preemption -- you can't forcibly take a lock from a thread
- Circular wait -- A waits for B, B waits for C, C waits for A
Break any one condition and deadlock becomes impossible.
Prevention Strategies
| Strategy | How | Breaks Which Condition |
|---|---|---|
| Lock ordering | Always acquire locks in the same global order | Circular wait |
| Lock all at once | Grab all needed locks atomically before doing work | Hold and wait |
| Try-lock | Use trylock() -- if you can't get the lock, release everything and retry | Hold and wait |
| Single lock | Use one big lock instead of many fine-grained ones | Circular wait (but hurts performance) |
Lock ordering is the most common real-world solution. If every thread always locks mutex1 before mutex2, circular wait is impossible. Assign a global order to all locks and enforce it everywhere.
9. Memory & Address Spaces
Every process thinks it has its own private, contiguous block of memory starting at address 0. This is a virtual address space -- an illusion created by the OS and hardware.
Layout of a Process's Address Space
From low addresses to high:
- Code (text) -- the program's machine instructions, read-only
- Data -- global and static variables
- Heap -- dynamic allocations (malloc), grows upward
- (a large unused gap)
- Stack -- function frames and local variables, grows downward from the top
Virtual vs Physical Addresses
When your program accesses address 0x00401000, that's a virtual address. The CPU's Memory Management Unit (MMU) translates it to a physical address in actual RAM. Different processes can use the same virtual address but they map to different physical locations.
This is why processes can't read each other's memory -- their virtual address 0x1000 maps to completely different physical locations.
Stack vs Heap
| Stack | Heap |
|---|---|
| Automatic allocation/deallocation | Manual (malloc/free, new/delete) |
| Very fast (just move stack pointer) | Slower (search for free block) |
| Fixed size (~1-8 MB typically) | Can grow to fill available memory |
| Local variables, function args | Dynamic data, objects, arrays of unknown size |
| LIFO order | Any order |
The malloc/free Dance
malloc() doesn't always make a syscall. It manages a free list of previously-freed blocks in user space. Only when it runs out does it call sbrk() or mmap() to ask the kernel for more memory.
C
// Allocate 100 ints on the heap
int* arr = malloc(100 * sizeof(int));
if (!arr) { /* out of memory */ }
arr[0] = 42;
arr[99] = 99;
// MUST free when done -- OS doesn't free it for you (until process exits)
free(arr);
arr = NULL; // good practice: avoid dangling pointer
Common heap bugs:
- Memory leak -- malloc() without free(). Memory usage grows forever.
- Use after free -- access memory after free(). Undefined behavior.
- Double free -- call free() twice on the same pointer. Corrupts the allocator.
- Buffer overflow -- write past the end of an array. Can overwrite other data or code.
Use valgrind to detect these: valgrind ./my_program
10. Paging & Virtual Memory
The OS divides both virtual and physical memory into fixed-size chunks called pages (typically 4 KB). The mapping from virtual pages to physical frames is stored in a page table.
Address Translation
A virtual address splits into two parts: the high bits are the virtual page number, the low bits are the offset within the page. The MMU looks up the page number in the page table to get a physical frame number, then appends the same offset. With 4 KB pages, the offset is the low 12 bits (2^12 = 4096).
Page Table Entry
Each entry in the page table contains:
- Physical frame number -- where this page lives in RAM
- Valid bit -- is this page actually in RAM? (0 = not mapped or swapped out)
- Protection bits -- read/write/execute permissions
- Dirty bit -- has this page been modified? (needs writing to disk before eviction)
- Referenced bit -- has this page been accessed recently?
TLB -- Making It Fast
Looking up the page table on every memory access would be unbearably slow. The Translation Lookaside Buffer (TLB) is a tiny, ultra-fast cache inside the CPU that stores recent page-to-frame translations.
TLB hit rates are typically 99%+. A TLB miss costs ~10-100 ns (page table walk). A page fault (page not in RAM at all) costs ~1-10 ms (load from disk).
Page Faults & Swapping
If a page's valid bit is 0, accessing it causes a page fault. The OS handles it:
- Program accesses a virtual address whose page isn't in RAM
- CPU raises a page fault exception
- OS checks: is this a valid address? If not → segfault (kill the process)
- If valid but swapped to disk → OS finds a free frame (or evicts another page)
- OS reads the page from disk into the free frame
- OS updates the page table entry with the new frame number
- Process resumes the instruction that faulted
This is how you can use more memory than you physically have -- the OS transparently swaps pages to/from disk. But disk is ~100,000x slower than RAM, so heavy swapping (thrashing) makes everything crawl.
When you launch a program, the OS doesn't load all its pages into RAM immediately. It uses demand paging -- pages are only loaded when first accessed. That's why the first run of a function is slower (page fault) but subsequent runs are fast (page already in RAM).
11. I/O & Devices
The OS manages all hardware devices: disks, keyboards, network cards, GPUs, USB devices. Programs never talk to devices directly -- they go through the OS using syscalls and device drivers.
How I/O Works
There are two main approaches:
- Polling (programmed I/O) -- the CPU repeatedly checks "is the device ready yet?" Wastes CPU cycles spinning in a loop.
- Interrupts -- the device notifies the CPU when it's done by sending an interrupt. The CPU can do other work while waiting. This is what modern systems use.
DMA (Direct Memory Access)
For large data transfers (reading a file from disk), the CPU doesn't copy each byte itself. Instead, it sets up a DMA controller that transfers data directly from the device to RAM. The CPU is free to run other processes during the transfer. When DMA finishes, it sends an interrupt.
Device Drivers
Each hardware device has a driver -- kernel code that knows how to talk to that specific device. The OS provides a standard interface (read, write, ioctl) and the driver translates to device-specific commands. About 70% of OS code is device drivers.
12. File Systems
A file system organises data on disk into files and directories. It answers: where on the physical disk is byte 50,000 of /home/you/report.pdf?
Key Abstractions
- File -- a named sequence of bytes. Nothing more. The OS doesn't care what's in it (text, binary, image). That's the application's job.
- Directory -- a special file that contains a list of (name, inode number) pairs.
- Inode -- a data structure that stores a file's metadata (size, permissions, timestamps) and the locations of its data blocks on disk. Every file has exactly one inode.
Inode Structure
Small files (a few KB) use direct pointers. Large files use indirect pointers (a block that contains more pointers). This is how Unix handles files from 1 byte to terabytes with the same inode structure.
Creating a File (What Actually Happens)
When you run touch newfile.txt:
- OS finds a free inode in the inode bitmap
- Initialises the inode (size=0, permissions, timestamps)
- Adds entry ("newfile.txt", inode_number) to the parent directory
- Updates the parent directory's modification time
When you write data to it, the OS also allocates data blocks from the data bitmap.
Hard Links vs Symbolic Links
| Hard Link | Symbolic (Soft) Link |
|---|---|
| Another directory entry pointing to the same inode | A separate file whose content is a path to the target |
| Same file, different name. Deleting one name doesn't delete the data (until link count = 0) | A shortcut. If the target is deleted, the symlink is broken |
| Can't cross filesystem boundaries | Can point anywhere, even across filesystems |
ln original.txt hardlink.txt | ln -s original.txt symlink.txt |
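You can watch the table above play out on a real Linux system (stat -c is GNU coreutils; run this in an empty scratch directory):

```shell
echo "data" > original.txt

ln original.txt hardlink.txt     # same inode, second name
ln -s original.txt symlink.txt   # new inode whose content is a path

ls -i original.txt hardlink.txt  # prints the same inode number twice
stat -c '%h' original.txt        # link count is now 2

rm original.txt
cat hardlink.txt                 # still prints "data" -- the inode lives on
cat symlink.txt                  # fails: the symlink now dangles
```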
Crash Consistency: Journaling
What if the machine crashes mid-write? A file system write might need to update 3 things: the inode, a data block, and the bitmap. If only some of these complete before the crash, the disk is inconsistent.
Journaling (used by ext4, NTFS, HFS+) solves this:
- Write a journal entry describing all planned changes
- Write the journal entry to a special area on disk
- Commit the journal entry (mark it complete)
- Apply the changes to the actual file system locations
- Delete the journal entry
If the machine crashes at any point, on reboot the OS replays any committed journal entries. This guarantees the file system is always consistent.
Common Linux file systems:
- ext4 -- the default. Journaling, mature, reliable. Good for most uses.
- XFS -- good for large files and parallel I/O. Used by many servers.
- Btrfs -- copy-on-write, snapshots, checksums. Modern but less battle-tested.
- tmpfs -- in-memory filesystem. Files in /tmp often live here (super fast, lost on reboot).
13. OS Security Basics
The OS is the ultimate gatekeeper. Every security boundary on your machine is enforced by the kernel.
Users & Permissions
Every file has an owner, a group, and permission bits: three rwx triplets (read, write, execute), one each for the owner, the group, and everyone else. For example, rwxr-xr-- means the owner can read/write/execute, the group can read and execute, and others can only read.
Numeric form: r=4, w=2, x=1. So rwxr-xr-- = 754.
Shell
# Change permissions
chmod 755 script.sh # rwxr-xr-x
chmod u+x script.sh # add execute for owner
chmod go-w secret.txt # remove write for group and others
# Change ownership
chown alice:devs file.txt
# The root user (UID 0) bypasses all permission checks
Privilege Escalation
- setuid bit -- when set on an executable, it runs as the file's owner, not the user who launched it. This is how passwd (owned by root) can modify /etc/shadow even when run by a normal user. Set it with chmod u+s file.
- sudo -- runs a single command as root, after checking /etc/sudoers.
- capabilities -- fine-grained alternative to root. Instead of all-or-nothing, a process can be granted just the ability to bind to low ports, or just the ability to read raw packets.
Process Isolation
The OS enforces isolation between processes via:
- Virtual memory -- each process has its own page table. Process A literally cannot address Process B's memory.
- User/kernel mode -- user-mode code cannot execute privileged instructions.
- File permissions -- processes inherit the UID of the user who launched them and can only access files that user has permission for.
- Namespaces (Linux) -- isolate what a process can see: its own PID space, network stack, mount points. This is the foundation of containers (Docker).
- cgroups (Linux) -- limit how much CPU, memory, and I/O a process can use.
A buffer overflow in C/C++ can let an attacker overwrite the return address on the stack, redirecting execution to malicious code. Modern defenses:
- ASLR -- randomise memory layout so attackers can't predict addresses.
- Stack canaries -- place a known value before the return address; detect overwrites.
- NX bit -- mark the stack as non-executable so injected code can't run.
14. Context Switching
We mentioned context switches briefly in section 2. Now let's go deeper. A context switch is when the OS stops running one process (or thread) on a CPU core and starts running a different one. This is the fundamental mechanism that makes multitasking work.
What Actually Happens During a Context Switch
When the OS decides to switch from Process A to Process B, here's exactly what happens:
- A timer interrupt (or a blocking syscall) transfers control to the kernel
- The kernel saves A's registers -- program counter, stack pointer, general-purpose registers -- into A's PCB
- The scheduler picks B from the ready queue
- The kernel switches the address space by loading B's page table base register (invalidating or re-tagging TLB entries)
- B's saved registers are restored from B's PCB
- A return-from-interrupt jumps back to user mode, and B resumes exactly where it left off
The Cost of Context Switching
A context switch typically takes 1-10 microseconds of direct CPU time. That sounds tiny, but there are hidden costs:
- TLB flush -- this is the expensive part. When you load a new page table, the Translation Lookaside Buffer (TLB) entries from the old process are invalid. The new process starts with a cold TLB, meaning every memory access causes a page table walk until the TLB warms up again. This can cost tens of microseconds of indirect slowdown.
- Cache pollution -- Process A's data was in L1/L2/L3 cache. Process B's data isn't. B will suffer cache misses until its working set is loaded.
- Pipeline flush -- the CPU's instruction pipeline and branch predictor state are useless for the new process.
If the OS is switching thousands of times per second, processes spend more time warming up caches and TLBs than doing actual work -- a scheduling analogue of the thrashing we saw with memory. It's why a system with 500 runnable processes feels sluggish even if the CPU isn't "100% busy" -- the useful work per time slice drops dramatically.
Voluntary vs Involuntary Context Switches
- Voluntary -- the process gives up the CPU willingly. It made a blocking syscall (read from disk, sleep, wait for a lock). The process can't continue, so it tells the OS "I'm done for now."
- Involuntary -- the OS forces the process off the CPU. Usually because the timer interrupt fired and the process used up its time slice (quantum). This is preemption.
High involuntary context switches = too many runnable processes fighting for CPU. High voluntary context switches = process is doing lots of I/O (which might be fine, or might indicate excessive blocking).
How to Measure Context Switches
Terminal
# System-wide context switches per second
vmstat 1
# Look at the 'cs' column (context switches)
# Per-process context switches
grep ctxt /proc/PID/status
# voluntary_ctxt_switches: 150
# nonvoluntary_ctxt_switches: 30
# Watch context switches live with pidstat
pidstat -w 1
# Shows cswch/s (voluntary) and nvcswch/s (involuntary) per process
# Trace context switches with perf
sudo perf stat -e context-switches ./my_program
Terminal
$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 0 3201024 89456 1123456 0 0 4 12 156 320 15 5 78 2 0
1 0 0 3200896 89456 1123460 0 0 0 0 203 450 22 8 68 2 0
The cs column shows 320 and 450 context switches per second. The r column shows runnable processes (2 and 1). If r is consistently much higher than your CPU count and cs is in the thousands, you likely have too many processes competing.
15. Inter-Process Communication (IPC)
Every process has its own isolated address space -- process A cannot read or write process B's memory. This is great for security and stability, but programs often need to talk to each other. That's what IPC is for.
Pipes (Unnamed)
The simplest IPC mechanism. When you write ls | grep foo in your shell, the shell creates a pipe between ls and grep.
- One-directional -- data flows one way (writer → reader)
- Parent-child only -- the pipe exists as a file descriptor inherited by fork()
- Buffered -- the kernel provides a small buffer (typically 64KB on Linux)
- If the buffer is full, the writer blocks. If the buffer is empty, the reader blocks.
C
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main() {
    int fd[2];
    pipe(fd);  // fd[0] = read end, fd[1] = write end

    if (fork() == 0) {
        // Child: read from pipe
        close(fd[1]);
        char buf[128] = {0};  // zeroed, so the string stays NUL-terminated
        read(fd[0], buf, sizeof(buf) - 1);
        printf("Child got: %s\n", buf);
    } else {
        // Parent: write to pipe
        close(fd[0]);
        write(fd[1], "hello from parent", 17);
        close(fd[1]);
        wait(NULL);
    }
    return 0;
}
Named Pipes (FIFOs)
Like unnamed pipes but they exist as files on the filesystem, so any two processes can use them (not just parent-child).
Terminal
# Create a named pipe
mkfifo /tmp/my_pipe
# Terminal 1: read from it (blocks until someone writes)
cat /tmp/my_pipe
# Terminal 2: write to it
echo "hello" > /tmp/my_pipe
Shared Memory
The fastest IPC method. Two processes map the same physical memory into their address spaces. No copying, no kernel involvement after setup -- just read and write memory directly.
- POSIX API -- shmget()/shmat() (older) or mmap() with MAP_SHARED (modern)
- Needs synchronization -- since both processes access the same memory, you need semaphores or mutexes to avoid races
- Used when performance matters: databases, game engines, scientific computing
C
// Two separate programs sharing one region. Compile each with:
//   gcc prog.c -lrt   (the -lrt flag is only needed on older glibc)
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

// Process A: create shared memory
int fd = shm_open("/my_shm", O_CREAT | O_RDWR, 0666);
ftruncate(fd, 4096);   // size the new object (it starts at 0 bytes)
char* ptr = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
sprintf(ptr, "shared data here");

// Process B: open the same shared memory (same includes as above)
int fd = shm_open("/my_shm", O_RDONLY, 0);
char* ptr = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
printf("%s\n", ptr); // prints "shared data here"
Message Queues
Structured message passing. A process sends a message to a queue, another process reads from it. Messages have types, so you can selectively receive.
- Decoupled -- sender and receiver don't need to be running at the same time
- POSIX API -- mq_open(), mq_send(), mq_receive()
- Used for task queues, job dispatching
Signals
Asynchronous notifications sent to a process. Like software interrupts.
- SIGTERM (15) -- "please terminate gracefully" (the default kill signal)
- SIGKILL (9) -- "die immediately, no cleanup" (cannot be caught or ignored)
- SIGINT (2) -- sent when you press Ctrl+C
- SIGUSR1/SIGUSR2 -- user-defined signals for custom behaviour
- SIGSEGV (11) -- segmentation fault (invalid memory access)
- SIGCHLD -- sent to parent when a child process exits
Terminal
# Send SIGTERM to process 1234
kill 1234
# Send SIGKILL (force kill)
kill -9 1234
# Send SIGUSR1
kill -USR1 1234
# List all signals
kill -l
Unix Domain Sockets
Like TCP/IP sockets but for local communication only. No network overhead, no TCP handshake -- just fast, bidirectional, byte-stream or datagram communication between processes on the same machine.
- Used by Docker (communicates via /var/run/docker.sock)
- Used by PostgreSQL, MySQL for local connections
- Used by X11/Wayland for display communication
- Often cited as roughly 2x faster than TCP over loopback (localhost), though the exact gain depends on the workload
Terminal
# See Unix domain sockets in use
ss -xl
# Or check what Docker uses
ls -la /var/run/docker.sock
Comparing IPC Methods
| Method | Direction | Speed | Use Case |
|---|---|---|---|
| Pipe | One-way | Fast | Shell pipelines, parent-child |
| Named pipe (FIFO) | One-way | Fast | Unrelated processes, simple streaming |
| Shared memory | Both | Fastest | High-throughput data sharing, databases |
| Message queue | Both | Medium | Task queues, structured messages |
| Signal | One-way | Fast | Async notifications (kill, Ctrl+C) |
| Unix domain socket | Both | Fast | Docker, databases, local services |
16. How Programs Actually Run
You type ./myprogram and hit Enter. What actually happens? Most CS students can't explain this clearly. Let's fix that.
The Full Sequence
1. The shell fork()s a child process.
2. The child calls execve("./myprogram", ...), replacing its memory image with the new program.
3. The kernel parses the ELF headers and maps the program's segments (code, data) into the new address space.
4. For dynamically linked binaries, the dynamic linker (ld-linux.so) maps in the required shared libraries and resolves symbols.
5. Control jumps to the program's entry point, which sets up the C runtime and finally calls main().
The ELF Format
ELF (Executable and Linkable Format) is the binary format used on Linux. Every compiled C/C++/Rust/Go program on Linux is an ELF file. Here are the key sections:
- .text -- your compiled machine code (read-only, executable)
- .data -- initialized global/static variables (e.g., int x = 42;)
- .bss -- uninitialized global/static variables (e.g., int y;) -- zeroed at startup, doesn't take space in the file
- .rodata -- read-only data (string literals like "hello")
- .plt/.got -- used for dynamic linking (jumping to shared library functions)
Terminal
# Inspect ELF headers
readelf -h ./myprogram
# See all sections
readelf -S ./myprogram
# See program headers (segments loaded into memory)
readelf -l ./myprogram
# Quick check if something is an ELF file
file ./myprogram
# Output: ELF 64-bit LSB pie executable, x86-64, ...
Static vs Dynamic Linking
- Static linking -- all library code is copied into your executable at compile time. Bigger binary, but no external dependencies. The binary runs anywhere (same architecture).
- Dynamic linking -- your binary just records which shared libraries it needs. At runtime, the dynamic linker (ld-linux.so) loads them. Smaller binary, shared memory for libraries, but the .so files must exist on the system.
Terminal
# Compile statically (everything baked in)
gcc -static -o myprogram myprogram.c
ls -la myprogram # ~1-2 MB (includes all of libc)
# Compile dynamically (default)
gcc -o myprogram myprogram.c
ls -la myprogram # ~16 KB (just your code + references)
# See what shared libraries a dynamic binary needs
ldd ./myprogram
# linux-vdso.so.1 (0x00007ffd3a1f2000)
# libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2a1c000000)
# /lib64/ld-linux-x86-64.so.2 (0x00007f2a1c400000)
Seeing the Memory Layout: /proc/PID/maps
Every running process has a file at /proc/PID/maps that shows its complete virtual memory layout. This is the actual address space we talked about in section 9, but for a real process.
Terminal
# See the memory map of the current shell
cat /proc/self/maps
Here's what real output looks like (simplified):
Terminal
# Address range Perms Offset Dev Inode Pathname
55a1d7a00000-55a1d7a02000 r--p 00000000 08:01 12345 /usr/bin/myprogram # ELF header + read-only data
55a1d7a02000-55a1d7a04000 r-xp 00002000 08:01 12345 /usr/bin/myprogram # .text (executable code)
55a1d7a04000-55a1d7a05000 r--p 00004000 08:01 12345 /usr/bin/myprogram # .rodata (string literals)
55a1d7a05000-55a1d7a06000 rw-p 00005000 08:01 12345 /usr/bin/myprogram # .data + .bss (globals)
55a1d8200000-55a1d8221000 rw-p 00000000 00:00 0 [heap] # heap (malloc'd memory)
7f2a1c000000-7f2a1c1b0000 r-xp 00000000 08:01 67890 /lib/x86_64-linux-gnu/libc.so.6 # libc code
7f2a1c1b0000-7f2a1c1b4000 rw-p 001b0000 08:01 67890 /lib/x86_64-linux-gnu/libc.so.6 # libc data
7ffd3a1d0000-7ffd3a1f1000 rw-p 00000000 00:00 0 [stack] # the stack
7ffd3a1f2000-7ffd3a1f4000 r--p 00000000 00:00 0 [vdso] # virtual dynamic shared object
r = readable, w = writable, x = executable, p = private (copy-on-write), s = shared. Notice that .text is r-x (read + execute, but NOT writable) -- this is the NX bit in action. The stack is rw- (read + write, but NOT executable) -- preventing code injection.
Run cat /proc/self/maps on your Linux machine (or in WSL). You'll see the memory layout of the cat process itself. Find the [heap], [stack], and libc entries. Compare the addresses to the memory layout diagrams from section 9.
17. Containers & Virtualization (OS Perspective)
You've probably used Docker or heard of VMs. But how do they actually work at the OS level? This section explains the kernel mechanisms that make containers and VMs possible.
How Virtual Machines Work
A VM runs a complete guest operating system on top of emulated hardware. A piece of software called a hypervisor creates the illusion of real hardware for each guest OS.
- Type 1 (bare-metal) -- the hypervisor runs directly on hardware, no host OS. Examples: VMware ESXi, KVM (Linux's built-in hypervisor), Microsoft Hyper-V, Xen. Used in datacentres.
- Type 2 (hosted) -- the hypervisor runs as a program on top of a regular OS. Examples: VirtualBox, VMware Workstation. Used on developer laptops.
Each VM has its own kernel, its own init system, its own everything. This provides strong isolation but has significant overhead: each VM might use 512MB-2GB just for the guest OS.
How Containers Work: NOT VMs
Containers are fundamentally different from VMs. A container does not run its own kernel. It shares the host's kernel but gets its own isolated view of the system using two Linux kernel features: namespaces and cgroups.
Linux Namespaces
Namespaces give each container its own isolated view of system resources. The kernel supports these namespace types:
- PID namespace -- the container sees its own set of process IDs. PID 1 inside the container is just a regular process on the host with a different PID.
- Network namespace -- the container gets its own network stack: its own IP address, routing table, ports. Port 80 inside the container doesn't conflict with port 80 on the host.
- Mount namespace -- the container sees its own filesystem tree. It can mount/unmount without affecting the host.
- UTS namespace -- the container can have its own hostname.
- User namespace -- UID 0 (root) inside the container maps to an unprivileged user on the host. This is how rootless containers work.
- Cgroup namespace -- the container sees its own cgroup hierarchy.
Terminal
# See what namespaces a process belongs to
ls -la /proc/self/ns/
# lrwxrwxrwx 1 user user 0 ... cgroup -> 'cgroup:[4026531835]'
# lrwxrwxrwx 1 user user 0 ... mnt -> 'mnt:[4026531840]'
# lrwxrwxrwx 1 user user 0 ... net -> 'net:[4026531992]'
# lrwxrwxrwx 1 user user 0 ... pid -> 'pid:[4026531836]'
# Create a new PID namespace (you become PID 1 inside it!)
sudo unshare --pid --fork --mount-proc bash
ps aux # Only shows processes in this namespace
Control Groups (cgroups)
While namespaces control what a container can see, cgroups control how much it can use. They limit and account for resource usage:
- CPU -- limit a container to e.g. 0.5 CPU cores
- Memory -- limit to e.g. 512MB. If the container exceeds this, the OOM killer terminates it.
- I/O -- throttle disk read/write bandwidth
- PIDs -- limit the number of processes (prevent fork bombs)
Terminal
# Run a Docker container with resource limits
docker run --memory=256m --cpus=0.5 --pids-limit=100 ubuntu
# See cgroup limits for a running container (cgroup v1 path)
cat /sys/fs/cgroup/memory/docker/CONTAINER_ID/memory.limit_in_bytes
# On cgroup v2 systems, look for memory.max under /sys/fs/cgroup/system.slice/
There's no guest OS to boot, no second kernel consuming memory, no hardware emulation overhead. A container starts in milliseconds (it's just a process with namespaces). A VM takes seconds to minutes (it's booting an entire OS). Containers also share the base image layers, so 10 containers based on the same image use barely more disk/memory than one.
How Docker Puts It All Together
Docker isn't magic. It's a user-friendly wrapper around three Linux kernel features:
- Namespaces -- process isolation (PID, network, mount, etc.)
- Cgroups -- resource limits (CPU, memory, I/O)
- Overlay filesystem (OverlayFS) -- layered filesystem. The base image is read-only; the container's changes are written to a thin writable layer on top. This is why images are built in layers and why docker commit works.
If you want to build a simple container from scratch and really understand these mechanisms, check out the Docker page -- there's a section on building your own Docker in Go using these exact syscalls (clone with CLONE_NEWPID, CLONE_NEWNS, etc.).
Run a Docker container, then from the host, inspect it:
Terminal
# Start a container
docker run -d --name test ubuntu sleep 3600
# Find its PID on the host
docker inspect --format '{{.State.Pid}}' test
# e.g., 12345
# Look at its namespaces
ls -la /proc/12345/ns/
# Look at its cgroup limits
cat /proc/12345/cgroup
# See its memory map (same as any process!)
cat /proc/12345/maps
A container is just a process. A heavily namespaced and cgroup-limited process, but a process nonetheless.
18. Essential Linux Commands
These commands let you observe and control everything we've discussed. Every developer should know these.
Process Commands
Shell
# List all processes
ps aux
# Interactive process viewer (top on steroids)
htop
# Show process tree (parent-child relationships)
pstree -p
# Send signals to processes
kill -SIGTERM 1234 # politely ask PID 1234 to exit
kill -SIGKILL 1234 # forcibly kill (can't be caught)
kill -SIGSTOP 1234 # pause the process
kill -SIGCONT 1234 # resume the process
# Run a process in the background
./long_task &
# See what syscalls a process is making
strace -p 1234
# See open files and sockets for a process
lsof -p 1234
Memory Commands
Shell
# Show system memory usage
free -h
# Show memory map of a process
cat /proc/1234/maps
# Show memory usage per process
ps aux --sort=-%mem | head
# Check for memory leaks
valgrind --leak-check=full ./my_program
Filesystem Commands
Shell
# Show disk usage
df -h # filesystem-level
du -sh /path/to/dir # directory-level
# Show inode info
stat filename # shows inode number, size, permissions, timestamps
ls -i # show inode numbers in listing
# Show mounted filesystems
mount | column -t
lsblk # show block devices
# File type detection
file mystery_file # "ELF 64-bit executable" or "ASCII text" etc.
Network Commands
Shell
# Show open ports and connections
ss -tulnp # (or netstat -tulnp on older systems)
# Show network interfaces
ip addr
# DNS lookup
dig example.com
# Trace network path
traceroute example.com
The /proc Filesystem
/proc is a virtual filesystem -- it doesn't exist on disk. It's the kernel exposing its internal data structures as files. Everything about your running system is in here.
Shell
# Info about process 1234
/proc/1234/status # state, memory, threads
/proc/1234/cmdline # command that started it
/proc/1234/maps # memory map (every region of the address space)
/proc/1234/fd/ # directory of open file descriptors (symlinks to files)
# System-wide info
/proc/cpuinfo # CPU details
/proc/meminfo # memory details
/proc/uptime # how long the system has been running
The OSTEP textbook has excellent hands-on labs where you write parts of an OS. If you want to really understand this stuff, do the projects at pages.cs.wisc.edu/~remzi/OSTEP/. The xv6 labs (a teaching OS written in C for RISC-V) are especially valuable.