Table of Contents
1. What Is an OS?
2. Processes
3. System Calls
4. Process API (fork, exec, wait)
5. CPU Scheduling
6. Threads & Concurrency
7. Locks & Synchronization
8. Deadlocks
9. Memory & Address Spaces
10. Paging & Virtual Memory
11. I/O & Devices
12. File Systems
13. OS Security Basics
14. Context Switching
15. Inter-Process Communication (IPC)
16. How Programs Actually Run
17. Containers & Virtualization
18. Essential Linux Commands
19. Practice Quiz
1. What Is an Operating System?
An operating system (OS) is a piece of software that sits between your programs and the hardware. When you open a browser, type in a terminal, or save a file, your program doesn't talk to the CPU or disk directly -- it asks the OS to do it.
The OS has three main jobs:
- Virtualisation -- make it look like each program has its own CPU and its own memory, even though they're all sharing the same physical hardware.
- Concurrency -- let multiple programs (and multiple threads within a program) run at the same time without stepping on each other.
- Persistence -- store data on disk so it survives after your program exits or the machine reboots.
These are the "three easy pieces" from the OSTEP textbook. Every concept in this page falls into one of these three buckets.
User Mode vs Kernel Mode
The CPU has (at minimum) two privilege levels:
- User mode -- your normal programs run here. They cannot directly access hardware, other processes' memory, or privileged instructions. If they try, the CPU raises a fault.
- Kernel mode -- the OS kernel runs here. It has full access to everything: hardware registers, all memory, I/O ports.
The only way for a user-mode program to do something privileged (open a file, allocate memory, send a network packet) is to make a system call, which we'll cover in section 3.
This separation is what keeps your computer stable. A buggy program can't crash the whole machine because it literally cannot touch hardware or other processes' memory. The OS acts as a gatekeeper.
2. Processes
A process is a running program. When you double-click an app or type ./my_program in a terminal, the OS creates a process for it. Each process gets:
- Its own address space -- its own view of memory (code, stack, heap, data)
- One or more threads of execution
- Open file descriptors (files, sockets, pipes it has open)
- A process ID (PID) -- a unique number identifying it
- A state -- running, ready, or blocked
Process States
At any moment, a process is in one of these states:
- Ready -- the process could run, but the CPU is busy with another process. It's waiting in the ready queue.
- Running -- the process is actually executing instructions on the CPU right now.
- Blocked -- the process is waiting for something (disk read, network data, user input). It can't run even if the CPU is free.
What's Inside a Process (the PCB)
The OS keeps a Process Control Block (PCB) for every process. Think of it as a struct:
Conceptual
struct PCB {
    int pid;                 // process ID
    int state;               // READY, RUNNING, BLOCKED
    int priority;            // scheduling priority
    regs_t saved_registers;  // CPU registers when not running
    void* page_table;        // pointer to address space info
    int open_files[256];     // file descriptor table
    pid_t parent_pid;        // who created this process
};
When the OS switches from process A to process B (context switch), it saves A's registers into A's PCB, then loads B's registers from B's PCB. This is how the illusion of "every process has its own CPU" works.
Process A is running. A timer interrupt fires (every ~1-10ms). The OS:
- Saves A's registers (program counter, stack pointer, etc.) into A's PCB
- Changes A's state from RUNNING to READY
- Picks process B from the ready queue
- Loads B's saved registers from B's PCB
- Changes B's state from READY to RUNNING
- Jumps to where B left off
This happens thousands of times per second. Each switch takes ~1-10 microseconds.
3. System Calls (Syscalls)
A system call is how your program asks the OS to do something it can't do itself. Your code runs in user mode. It can't touch hardware. So when it needs to open a file, create a process, or send data over the network, it must cross the boundary into kernel mode.
How a Syscall Works (Step by Step)
- Your program calls a libc wrapper like read()
- The wrapper places the syscall number and arguments into registers
- It executes a special trap instruction (syscall on x86-64)
- The CPU switches to kernel mode and jumps to the kernel's syscall handler
- The kernel validates the arguments, does the work, and puts the result in a register
- A return-from-trap instruction switches back to user mode, and the wrapper returns the result
The key insight: the syscall instruction is a hardware mechanism. The CPU itself switches privilege levels. Your program can't fake being in kernel mode -- the CPU enforces the boundary.
Major Syscall Categories
| Category | Syscalls | What They Do |
|---|---|---|
| Process | fork, exec, wait, exit, kill | Create, replace, wait for, terminate processes |
| File I/O | open, read, write, close, lseek | Open, read, write, close files |
| Memory | mmap, munmap, brk | Map memory, grow/shrink heap |
| Network | socket, bind, listen, accept, connect | Create sockets, accept connections |
| Info | getpid, getuid, uname | Get process/user/system info |
| Signals | signal, sigaction, kill | Handle async notifications |
| Directory | mkdir, rmdir, chdir, getcwd | Manage directories |
File Descriptors -- the Universal Handle
When you open a file, the OS doesn't give you a pointer to the file. It gives you a small integer called a file descriptor (fd). This is an index into a per-process table of open files.
Every process starts with three file descriptors already open:
- 0 -- stdin (standard input, usually the keyboard)
- 1 -- stdout (standard output, usually the terminal)
- 2 -- stderr (standard error, also the terminal)
When you call open("myfile.txt", O_RDONLY), the kernel finds the next available fd (usually 3) and returns it. All future read() and write() calls use this fd number.
C
// Open a file -- returns a file descriptor (small int)
int fd = open("data.txt", O_RDONLY);
if (fd == -1) {
    perror("open failed");  // prints human-readable error
    return 1;
}

// Read up to 1024 bytes from the file
char buf[1024];
ssize_t bytes_read = read(fd, buf, sizeof(buf));

// Write to stdout (fd 1)
write(1, buf, bytes_read);

// Always close when done
close(fd);
Unix's big insight: files, pipes, sockets, terminals, and even devices all use the same read()/write()/close() interface with file descriptors. That's why the same code can read from a file, a network connection, or a pipe -- the fd abstracts the details away.
Syscalls vs Library Functions
Don't confuse syscalls with C library (libc) functions:
- printf() is a library function -- it formats your string in user space, then eventually calls the write() syscall to actually output it.
- malloc() is a library function -- it manages a pool of memory in user space, and only calls brk() or mmap() syscalls when it needs more from the OS.
- fopen() is a library wrapper around the open() syscall that adds buffering.
Library functions are faster because they avoid the user-to-kernel mode switch. Syscalls are expensive (hundreds of nanoseconds each), so libc batches operations to minimise them.
You can see every syscall a program makes using strace on Linux:
Shell
# See all syscalls made by ls
$ strace ls
# Count syscalls by type
$ strace -c ls
# Trace only file-related syscalls
$ strace -e trace=file ls
# Trace a running process by PID
$ strace -p 1234
Try strace -c echo "hello" -- you'll see it makes about 30 syscalls just to print one word. Most are setup (loading libraries, setting up memory).
4. Process API (fork, exec, wait)
In Unix/Linux, you create processes using three syscalls that work together: fork(), exec(), and wait(). This design seems odd at first, but it's incredibly powerful.
fork() -- Clone Yourself
fork() creates an exact copy of the current process. The new process (child) gets a copy of the parent's memory, file descriptors, and code. Both processes continue from the same line -- the line after fork().
C
#include <stdio.h>
#include <unistd.h>

int main() {
    printf("Before fork (PID %d)\n", getpid());
    pid_t pid = fork();

    if (pid == 0) {
        // This runs in the CHILD process
        printf("I am the child! PID = %d, parent = %d\n",
               getpid(), getppid());
    } else if (pid > 0) {
        // This runs in the PARENT process
        printf("I am the parent! PID = %d, child = %d\n",
               getpid(), pid);
    } else {
        // fork() failed
        perror("fork");
    }
    return 0;
}
The trick: fork() returns twice -- once in the parent (returns child's PID) and once in the child (returns 0). That's how each process knows who it is.
exec() -- Replace Yourself
exec() replaces the current process's code and memory with a new program. It loads a different executable and starts running it from its main(). The PID stays the same.
C
#include <stdio.h>
#include <unistd.h>

int main() {
    pid_t pid = fork();
    if (pid == 0) {
        // Child: replace myself with "ls -la"
        execlp("ls", "ls", "-la", NULL);
        // If we get here, exec failed
        perror("exec failed");
        return 1;  // exit with an error, don't fall through
    }
    return 0;
}
exec() never returns on success -- the old program is completely gone, replaced by the new one. Any code after exec() only runs if exec failed.
wait() -- Wait for Your Child
wait() blocks the parent until a child process finishes. This lets you run something and get its exit code.
C
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main() {
    pid_t pid = fork();
    if (pid == 0) {
        // Child
        printf("Child doing work...\n");
        sleep(2);
        printf("Child done!\n");
        exit(42);  // exit with code 42
    } else {
        // Parent waits for child
        int status;
        waitpid(pid, &status, 0);
        if (WIFEXITED(status)) {
            printf("Child exited with code %d\n",
                   WEXITSTATUS(status));  // prints 42
        }
    }
    return 0;
}
How a Shell Works (Putting It Together)
Now you can understand how bash/zsh actually works. When you type ls -la in a shell:
- The shell reads and parses your command
- It calls fork() to create a child process
- The child calls exec() to replace itself with the ls program
- The parent (the shell) calls wait() and blocks until ls finishes
- ls exits; the shell wakes up and prints the next prompt
This fork+exec pattern is also why redirection (>, <, |) works. Between the fork() and exec(), the shell can manipulate the child's file descriptors (close stdout, open a file on fd 1) before the new program starts. The new program inherits the modified file descriptors and doesn't even know its output is going to a file instead of the terminal.
If a child exits but its parent never calls wait(), the child becomes a zombie -- it's dead but its PCB entry stays in the process table (so the parent can check its exit code later). Zombies waste a slot in the process table. If the parent dies too, init (PID 1) adopts the orphan and reaps it.
5. CPU Scheduling
The CPU can only run one process at a time (per core). The scheduler decides which process runs next and for how long. The goal: make it feel like everything runs simultaneously.
Key Metrics
- Turnaround time = completion_time - arrival_time (how long from submission to finish)
- Response time = first_run_time - arrival_time (how long until first interaction)
- Waiting time = turnaround_time - burst_time (time spent sitting in the ready queue)
- Throughput = number_of_processes / total_time (jobs completed per unit time)
- Fairness -- every process gets a reasonable share of CPU
Scheduling Algorithms
| Algorithm | How It Works | Pros / Cons |
|---|---|---|
| FIFO | First come, first served. Run each job to completion. | Simple. But a long job blocks everything behind it (convoy effect). |
| SJF | Shortest Job First. Run the shortest job next. | Optimal turnaround time. But you can't always predict job length. |
| Round Robin | Give each process a fixed time slice (quantum, e.g. 10ms). Rotate through the ready queue. | Good response time. Fair. But poor turnaround for long jobs. |
| MLFQ | Multi-Level Feedback Queue. Multiple priority queues. New jobs start at highest priority. If a job uses its full time slice, it moves down. I/O-bound jobs stay high. | Best of both worlds -- responsive to interactive tasks AND handles long jobs. Used by real OSes. |
Worked example (Round Robin, quantum = 2 time units): three processes arrive at time 0: A (needs 5 units), B (needs 3 units), C (needs 1 unit).
Timeline
Time: 0 1 2 3 4 5 6 7 8
CPU: A A B B C A A B A
└──┘ └──┘ │ └──┘ │ │
A:2 B:2 C done B done A done
t=5 t=8 t=9
C finishes at time 5, B at time 8, A at time 9. Compare with FIFO (A, B, C): C wouldn't start until time 8!
Preemption
Modern schedulers are preemptive -- they can forcibly stop a running process and switch to another. This uses a hardware timer interrupt that fires periodically (every 1-10ms). When it fires, the OS gets control and can decide to switch processes. Without preemption, a process could hog the CPU forever.
Linux uses the Completely Fair Scheduler (CFS). It tracks how much CPU time each process has gotten ("virtual runtime") and always runs the process with the least virtual runtime. This ensures fairness over time. It uses a red-black tree to pick the next process in O(log n).
6. Threads & Concurrency
A thread is a lightweight unit of execution within a process. One process can have multiple threads, and they all share the same address space (code, heap, global data). Each thread has its own stack and registers.
Processes vs Threads
| Aspect | Process | Thread |
|---|---|---|
| Memory | Separate address space | Shared address space |
| Creation cost | Expensive (copy page table) | Cheap (just a new stack) |
| Communication | IPC needed (pipes, sockets) | Read/write shared memory directly |
| Crash impact | One crash doesn't affect others | One crash kills all threads in process |
| Context switch | Slow (switch page table) | Fast (same page table) |
Why Threads Are Useful
- Parallelism -- on a multi-core CPU, threads can run on different cores simultaneously. A 4-core machine can truly run 4 threads at once.
- Avoiding blocking -- if one thread is waiting for I/O, other threads keep running. A web server uses one thread per connection so slow clients don't block fast ones.
- Sharing state -- threads can share data structures without the complexity of inter-process communication.
Creating Threads (pthreads)
C
#include <stdio.h>
#include <pthread.h>

void* worker(void* arg) {
    int id = *(int*)arg;
    printf("Thread %d running\n", id);
    return NULL;
}

int main() {
    pthread_t threads[4];
    int ids[4] = {0, 1, 2, 3};

    for (int i = 0; i < 4; i++) {
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    }

    // Wait for all threads to finish
    for (int i = 0; i < 4; i++) {
        pthread_join(threads[i], NULL);
    }
    printf("All threads done\n");
    return 0;
}
Compile with: gcc -pthread thread_example.c -o thread_example
The Concurrency Problem: Race Conditions
Shared memory is both the best and worst thing about threads. If two threads modify the same variable without coordination, you get a race condition.
C
int counter = 0;  // shared between threads

void* increment(void* arg) {
    for (int i = 0; i < 1000000; i++) {
        counter++;  // NOT atomic! This is actually 3 steps:
                    // 1. Load counter from memory into register
                    // 2. Add 1 to register
                    // 3. Store register back to memory
    }
    return NULL;
}

// If 2 threads run increment(), you'd expect counter = 2,000,000
// But you'll get something less -- maybe 1,500,000 -- because
// the threads interleave those 3 steps unpredictably
This is why we need locks and synchronisation, covered in the next section.
7. Locks & Synchronization
A lock (mutex) ensures that only one thread can access a shared resource at a time. The pattern is always the same:
C
#include <pthread.h>

int counter = 0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void* increment(void* arg) {
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);
        counter++;  // safe now -- only one thread at a time
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}
// Now 2 threads will correctly produce counter = 2,000,000
Condition Variables
Sometimes a thread needs to wait for a condition, not just for a lock. A condition variable lets a thread sleep until another thread signals it.
Classic pattern: producer/consumer queue.
C
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
int ready = 0;

// Thread A: producer
pthread_mutex_lock(&lock);
ready = 1;
pthread_cond_signal(&cond);  // wake up the consumer
pthread_mutex_unlock(&lock);

// Thread B: consumer
pthread_mutex_lock(&lock);
while (!ready) {  // always use while, not if
    pthread_cond_wait(&cond, &lock);  // releases lock, sleeps, re-acquires
}
// ready == 1 here, process the data
pthread_mutex_unlock(&lock);
Use while (!condition), never if (!condition) before cond_wait. The thread can be woken spuriously (without the condition actually being true). The while loop rechecks and goes back to sleep if it was a false alarm.
Semaphores
A semaphore is a generalised lock with a counter. A mutex is a semaphore with value 1. A semaphore with value N lets N threads in simultaneously.
Use case: limit concurrent database connections to 10.
C
#include <semaphore.h>
sem_t pool;
sem_init(&pool, 0, 10); // allow 10 concurrent connections
// Each worker thread:
sem_wait(&pool); // decrement (blocks if 0)
// ... use the connection ...
sem_post(&pool); // increment (wake up a waiter)
8. Deadlocks
A deadlock occurs when two or more threads are each waiting for a resource held by another, and none can make progress. Everyone's stuck forever.
Four Conditions for Deadlock
ALL four must be true simultaneously for deadlock to occur (Coffman conditions):
- Mutual exclusion -- resources can't be shared (only one thread holds a lock at a time)
- Hold and wait -- a thread holds one resource while waiting for another
- No preemption -- you can't forcibly take a lock from a thread
- Circular wait -- A waits for B, B waits for C, C waits for A
Break any one condition and deadlock becomes impossible.
Prevention Strategies
| Strategy | How | Breaks Which Condition |
|---|---|---|
| Lock ordering | Always acquire locks in the same global order | Circular wait |
| Lock all at once | Grab all needed locks atomically before doing work | Hold and wait |
| Try-lock | Use trylock() -- if you can't get the lock, release everything and retry | Hold and wait |
| Single lock | Use one big lock instead of many fine-grained ones | Circular wait (but hurts performance) |
Lock ordering is the most common real-world solution. If every thread always locks mutex1 before mutex2, circular wait is impossible. Assign a global order to all locks and enforce it everywhere.
9. Memory & Address Spaces
Every process thinks it has its own private, contiguous block of memory starting at address 0. This is a virtual address space -- an illusion created by the OS and hardware.
Layout of a Process's Address Space
From low addresses to high:
- Code (text) -- the program's machine instructions, read-only
- Data -- global and static variables
- Heap -- dynamic allocations (malloc), grows upward
- (a large unused gap)
- Stack -- function frames and local variables, grows downward from the top
Virtual vs Physical Addresses
When your program accesses address 0x00401000, that's a virtual address. The CPU's Memory Management Unit (MMU) translates it to a physical address in actual RAM. Different processes can use the same virtual address but they map to different physical locations.
This is why processes can't read each other's memory -- their virtual address 0x1000 maps to completely different physical locations.
Stack vs Heap
| Stack | Heap |
|---|---|
| Automatic allocation/deallocation | Manual (malloc/free, new/delete) |
| Very fast (just move stack pointer) | Slower (search for free block) |
| Fixed size (~1-8 MB typically) | Can grow to fill available memory |
| Local variables, function args | Dynamic data, objects, arrays of unknown size |
| LIFO order | Any order |
The malloc/free Dance
malloc() doesn't always make a syscall. It manages a free list of previously-freed blocks in user space. Only when it runs out does it call sbrk() or mmap() to ask the kernel for more memory.
C
// Allocate 100 ints on the heap
int* arr = malloc(100 * sizeof(int));
if (!arr) { /* out of memory */ }
arr[0] = 42;
arr[99] = 99;
// MUST free when done -- OS doesn't free it for you (until process exits)
free(arr);
arr = NULL; // good practice: avoid dangling pointer
Common heap bugs:
- Memory leak -- malloc() without free(). Memory usage grows forever.
- Use after free -- access memory after free(). Undefined behavior.
- Double free -- call free() twice on the same pointer. Corrupts the allocator.
- Buffer overflow -- write past the end of an array. Can overwrite other data or code.
Use valgrind to detect these: valgrind ./my_program
10. Paging & Virtual Memory
The OS divides both virtual and physical memory into fixed-size chunks called pages (typically 4 KB). The mapping from virtual pages to physical frames is stored in a page table.
Address Translation
A virtual address splits into two parts: the high bits are the virtual page number, the low bits are the offset within the page. The MMU looks up the page number in the page table to get a physical frame number, then appends the same offset. With 4 KB pages, the offset is the low 12 bits (2^12 = 4096).
Page Table Entry
Each entry in the page table contains:
- Physical frame number -- where this page lives in RAM
- Valid bit -- is this page actually in RAM? (0 = not mapped or swapped out)
- Protection bits -- read/write/execute permissions
- Dirty bit -- has this page been modified? (needs writing to disk before eviction)
- Referenced bit -- has this page been accessed recently?
TLB -- Making It Fast
Looking up the page table on every memory access would be unbearably slow. The Translation Lookaside Buffer (TLB) is a tiny, ultra-fast cache inside the CPU that stores recent page-to-frame translations.
TLB hit rates are typically 99%+. A TLB miss costs ~10-100 ns (page table walk). A page fault (page not in RAM at all) costs ~1-10 ms (load from disk).
Page Faults & Swapping
If a page's valid bit is 0, accessing it causes a page fault. The OS handles it:
- Program accesses a virtual address whose page isn't in RAM
- CPU raises a page fault exception
- OS checks: is this a valid address? If not → segfault (kill the process)
- If valid but swapped to disk → OS finds a free frame (or evicts another page)
- OS reads the page from disk into the free frame
- OS updates the page table entry with the new frame number
- Process resumes the instruction that faulted
This is how you can use more memory than you physically have -- the OS transparently swaps pages to/from disk. But disk is ~100,000x slower than RAM, so heavy swapping (thrashing) makes everything crawl.
When you launch a program, the OS doesn't load all its pages into RAM immediately. It uses demand paging -- pages are only loaded when first accessed. That's why the first run of a function is slower (page fault) but subsequent runs are fast (page already in RAM).
11. I/O & Devices
The OS manages all hardware devices: disks, keyboards, network cards, GPUs, USB devices. Programs never talk to devices directly -- they go through the OS using syscalls and device drivers.
How I/O Works
There are two main approaches:
- Polling (programmed I/O) -- the CPU repeatedly checks "is the device ready yet?" Wastes CPU cycles spinning in a loop.
- Interrupts -- the device notifies the CPU when it's done by sending an interrupt. The CPU can do other work while waiting. This is what modern systems use.
DMA (Direct Memory Access)
For large data transfers (reading a file from disk), the CPU doesn't copy each byte itself. Instead, it sets up a DMA controller that transfers data directly from the device to RAM. The CPU is free to run other processes during the transfer. When DMA finishes, it sends an interrupt.
Device Drivers
Each hardware device has a driver -- kernel code that knows how to talk to that specific device. The OS provides a standard interface (read, write, ioctl) and the driver translates to device-specific commands. About 70% of OS code is device drivers.
12. File Systems
A file system organises data on disk into files and directories. It answers: where on the physical disk is byte 50,000 of /home/you/report.pdf?
Key Abstractions
- File -- a named sequence of bytes. Nothing more. The OS doesn't care what's in it (text, binary, image). That's the application's job.
- Directory -- a special file that contains a list of (name, inode number) pairs.
- Inode -- a data structure that stores a file's metadata (size, permissions, timestamps) and the locations of its data blocks on disk. Every file has exactly one inode.
Inode Structure
Small files (a few KB) use direct pointers. Large files use indirect pointers (a block that contains more pointers). This is how Unix handles files from 1 byte to terabytes with the same inode structure.
Creating a File (What Actually Happens)
When you run touch newfile.txt:
- OS finds a free inode in the inode bitmap
- Initialises the inode (size=0, permissions, timestamps)
- Adds entry ("newfile.txt", inode_number) to the parent directory
- Updates the parent directory's modification time
When you write data to it, the OS also allocates data blocks from the data bitmap.
Hard Links vs Symbolic Links
| Hard Link | Symbolic (Soft) Link |
|---|---|
| Another directory entry pointing to the same inode | A separate file whose content is a path to the target |
| Same file, different name. Deleting one name doesn't delete the data (until link count = 0) | A shortcut. If the target is deleted, the symlink is broken |
| Can't cross filesystem boundaries | Can point anywhere, even across filesystems |
ln original.txt hardlink.txt | ln -s original.txt symlink.txt |
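You can watch the table above play out on a real Linux system (stat -c is GNU coreutils; run this in an empty scratch directory):

```shell
echo "data" > original.txt

ln original.txt hardlink.txt     # same inode, second name
ln -s original.txt symlink.txt   # new inode whose content is a path

ls -i original.txt hardlink.txt  # prints the same inode number twice
stat -c '%h' original.txt        # link count is now 2

rm original.txt
cat hardlink.txt                 # still prints "data" -- the inode lives on
cat symlink.txt                  # fails: the symlink now dangles
```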
Crash Consistency: Journaling
What if the machine crashes mid-write? A file system write might need to update 3 things: the inode, a data block, and the bitmap. If only some of these complete before the crash, the disk is inconsistent.
Journaling (used by ext4, NTFS, HFS+) solves this:
- Write a journal entry describing all planned changes
- Write the journal entry to a special area on disk
- Commit the journal entry (mark it complete)
- Apply the changes to the actual file system locations
- Delete the journal entry
If the machine crashes at any point, on reboot the OS replays any committed journal entries. This guarantees the file system is always consistent.
Common Linux file systems:
- ext4 -- the default. Journaling, mature, reliable. Good for most uses.
- XFS -- good for large files and parallel I/O. Used by many servers.
- Btrfs -- copy-on-write, snapshots, checksums. Modern but less battle-tested.
- tmpfs -- in-memory filesystem. Files in /tmp often live here (super fast, lost on reboot).
13. OS Security Basics
The OS is the ultimate gatekeeper. Every security boundary on your machine is enforced by the kernel.
Users & Permissions
Every file has an owner, a group, and permission bits: three rwx triplets (read, write, execute), one each for the owner, the group, and everyone else. For example, rwxr-xr-- means the owner can read/write/execute, the group can read and execute, and others can only read.
Numeric form: r=4, w=2, x=1. So rwxr-xr-- = 754.
Shell
# Change permissions
chmod 755 script.sh # rwxr-xr-x
chmod u+x script.sh # add execute for owner
chmod go-w secret.txt # remove write for group and others
# Change ownership
chown alice:devs file.txt
# The root user (UID 0) bypasses all permission checks
Privilege Escalation
- setuid bit -- when set on an executable, it runs as the file's owner, not the user who launched it. This is how passwd (owned by root) can modify /etc/shadow even when run by a normal user. Set it with chmod u+s file.
- sudo -- runs a single command as root, after checking /etc/sudoers.
- capabilities -- fine-grained alternative to root. Instead of all-or-nothing, a process can be granted just the ability to bind to low ports, or just the ability to read raw packets.
Process Isolation
The OS enforces isolation between processes via:
- Virtual memory -- each process has its own page table. Process A literally cannot address Process B's memory.
- User/kernel mode -- user-mode code cannot execute privileged instructions.
- File permissions -- processes inherit the UID of the user who launched them and can only access files that user has permission for.
- Namespaces (Linux) -- isolate what a process can see: its own PID space, network stack, mount points. This is the foundation of containers (Docker).
- cgroups (Linux) -- limit how much CPU, memory, and I/O a process can use.
A buffer overflow in C/C++ can let an attacker overwrite the return address on the stack, redirecting execution to malicious code. Modern defenses:
- ASLR -- randomise memory layout so attackers can't predict addresses.
- Stack canaries -- place a known value before the return address; detect overwrites.
- NX bit -- mark the stack as non-executable so injected code can't run.
14. Context Switching
We mentioned context switches briefly in section 2. Now let's go deeper. A context switch is when the OS stops running one process (or thread) on a CPU core and starts running a different one. This is the fundamental mechanism that makes multitasking work.
What Actually Happens During a Context Switch
When the OS decides to switch from Process A to Process B, here's exactly what happens:
- A timer interrupt (or a blocking syscall) transfers control to the kernel
- The kernel saves A's registers -- program counter, stack pointer, general-purpose registers -- into A's PCB
- The scheduler picks B from the ready queue
- The kernel switches the address space by loading B's page table base register (invalidating or re-tagging TLB entries)
- B's saved registers are restored from B's PCB
- A return-from-interrupt jumps back to user mode, and B resumes exactly where it left off
The Cost of Context Switching
A context switch typically takes 1-10 microseconds of direct CPU time. That sounds tiny, but there are hidden costs:
- TLB flush -- this is the expensive part. When you load a new page table, the Translation Lookaside Buffer (TLB) entries from the old process are invalid. The new process starts with a cold TLB, meaning every memory access causes a page table walk until the TLB warms up again. This can cost tens of microseconds of indirect slowdown.
- Cache pollution -- Process A's data was in L1/L2/L3 cache. Process B's data isn't. B will suffer cache misses until its working set is loaded.
- Pipeline flush -- the CPU's instruction pipeline and branch predictor state are useless for the new process.
If the OS is switching thousands of times per second, processes spend more time warming up caches and TLBs than doing actual work -- a scheduling analogue of the thrashing we saw with memory. It's why a system with 500 runnable processes feels sluggish even if the CPU isn't "100% busy" -- the useful work per time slice drops dramatically.
Voluntary vs Involuntary Context Switches
- Voluntary -- the process gives up the CPU willingly. It made a blocking syscall (read from disk, sleep, wait for a lock). The process can't continue, so it tells the OS "I'm done for now."
- Involuntary -- the OS forces the process off the CPU. Usually because the timer interrupt fired and the process used up its time slice (quantum). This is preemption.
High involuntary context switches = too many runnable processes fighting for CPU. High voluntary context switches = process is doing lots of I/O (which might be fine, or might indicate excessive blocking).
How to Measure Context Switches
Terminal
# System-wide context switches per second
vmstat 1
# Look at the 'cs' column (context switches)
# Per-process context switches
grep ctxt /proc/PID/status
# voluntary_ctxt_switches: 150
# nonvoluntary_ctxt_switches: 30
# Watch context switches live with pidstat
pidstat -w 1
# Shows cswch/s (voluntary) and nvcswch/s (involuntary) per process
# Trace context switches with perf
sudo perf stat -e context-switches ./my_program
Terminal
$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 0 3201024 89456 1123456 0 0 4 12 156 320 15 5 78 2 0
1 0 0 3200896 89456 1123460 0 0 0 0 203 450 22 8 68 2 0
The cs column shows 320 and 450 context switches per second. The r column shows runnable processes (2 and 1). If r is consistently much higher than your CPU count and cs is in the thousands, you likely have too many processes competing.
15. Inter-Process Communication (IPC)
Every process has its own isolated address space -- process A cannot read or write process B's memory. This is great for security and stability, but programs often need to talk to each other. That's what IPC is for.
Pipes (Unnamed)
The simplest IPC mechanism. When you write ls | grep foo in your shell, the shell creates a pipe between ls and grep.
- One-directional -- data flows one way (writer → reader)
- Parent-child only -- the pipe exists as a file descriptor inherited by fork()
- Buffered -- the kernel provides a small buffer (typically 64KB on Linux)
- If the buffer is full, the writer blocks. If the buffer is empty, the reader blocks.
C
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main() {
    int fd[2];
    pipe(fd);  // fd[0] = read end, fd[1] = write end

    if (fork() == 0) {
        // Child: read from pipe
        close(fd[1]);
        char buf[128] = {0};  // zeroed, so the string stays NUL-terminated
        read(fd[0], buf, sizeof(buf) - 1);
        printf("Child got: %s\n", buf);
    } else {
        // Parent: write to pipe
        close(fd[0]);
        write(fd[1], "hello from parent", 17);
        close(fd[1]);
        wait(NULL);
    }
    return 0;
}
Named Pipes (FIFOs)
Like unnamed pipes but they exist as files on the filesystem, so any two processes can use them (not just parent-child).
Terminal
# Create a named pipe
mkfifo /tmp/my_pipe
# Terminal 1: read from it (blocks until someone writes)
cat /tmp/my_pipe
# Terminal 2: write to it
echo "hello" > /tmp/my_pipe
Shared Memory
The fastest IPC method. Two processes map the same physical memory into their address spaces. No copying, no kernel involvement after setup -- just read and write memory directly.
- POSIX API -- shmget()/shmat() (older) or mmap() with MAP_SHARED (modern)
- Needs synchronization -- since both processes access the same memory, you need semaphores or mutexes to avoid races
- Used when performance matters: databases, game engines, scientific computing
C
// Two separate programs sharing one region. Compile each with:
//   gcc prog.c -lrt   (the -lrt flag is only needed on older glibc)
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

// Process A: create shared memory
int fd = shm_open("/my_shm", O_CREAT | O_RDWR, 0666);
ftruncate(fd, 4096);   // size the new object (it starts at 0 bytes)
char* ptr = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
sprintf(ptr, "shared data here");

// Process B: open the same shared memory (same includes as above)
int fd = shm_open("/my_shm", O_RDONLY, 0);
char* ptr = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
printf("%s\n", ptr); // prints "shared data here"
Message Queues
Structured message passing. A process sends a message to a queue, another process reads from it. Messages have types, so you can selectively receive.
- Decoupled -- sender and receiver don't need to be running at the same time
- POSIX API -- mq_open(), mq_send(), mq_receive()
- Used for task queues, job dispatching
Signals
Asynchronous notifications sent to a process. Like software interrupts.
- SIGTERM (15) -- "please terminate gracefully" (the default kill signal)
- SIGKILL (9) -- "die immediately, no cleanup" (cannot be caught or ignored)
- SIGINT (2) -- sent when you press Ctrl+C
- SIGUSR1/SIGUSR2 -- user-defined signals for custom behaviour
- SIGSEGV (11) -- segmentation fault (invalid memory access)
- SIGCHLD -- sent to parent when a child process exits
Terminal
# Send SIGTERM to process 1234
kill 1234
# Send SIGKILL (force kill)
kill -9 1234
# Send SIGUSR1
kill -USR1 1234
# List all signals
kill -l
Unix Domain Sockets
Like TCP/IP sockets but for local communication only. No network overhead, no TCP handshake -- just fast, bidirectional, byte-stream or datagram communication between processes on the same machine.
- Used by Docker (communicates via /var/run/docker.sock)
- Used by PostgreSQL, MySQL for local connections
- Used by X11/Wayland for display communication
- Often cited as roughly 2x faster than TCP over loopback (localhost), though the exact gain depends on the workload
Terminal
# See Unix domain sockets in use
ss -xl
# Or check what Docker uses
ls -la /var/run/docker.sock
Comparing IPC Methods
| Method | Direction | Speed | Use Case |
|---|---|---|---|
| Pipe | One-way | Fast | Shell pipelines, parent-child |
| Named pipe (FIFO) | One-way | Fast | Unrelated processes, simple streaming |
| Shared memory | Both | Fastest | High-throughput data sharing, databases |
| Message queue | Both | Medium | Task queues, structured messages |
| Signal | One-way | Fast | Async notifications (kill, Ctrl+C) |
| Unix domain socket | Both | Fast | Docker, databases, local services |
16. How Programs Actually Run
You type ./myprogram and hit Enter. What actually happens? Most CS students can't explain this clearly. Let's fix that.
The Full Sequence
1. The shell fork()s a child process.
2. The child calls execve("./myprogram", ...), replacing its memory image with the new program.
3. The kernel parses the ELF headers and maps the program's segments (code, data) into the new address space.
4. For dynamically linked binaries, the dynamic linker (ld-linux.so) maps in the required shared libraries and resolves symbols.
5. Control jumps to the program's entry point, which sets up the C runtime and finally calls main().
The ELF Format
ELF (Executable and Linkable Format) is the binary format used on Linux. Every compiled C/C++/Rust/Go program on Linux is an ELF file. Here are the key sections:
- .text -- your compiled machine code (read-only, executable)
- .data -- initialized global/static variables (e.g., int x = 42;)
- .bss -- uninitialized global/static variables (e.g., int y;) -- zeroed at startup, doesn't take space in the file
- .rodata -- read-only data (string literals like "hello")
- .plt/.got -- used for dynamic linking (jumping to shared library functions)
Terminal
# Inspect ELF headers
readelf -h ./myprogram
# See all sections
readelf -S ./myprogram
# See program headers (segments loaded into memory)
readelf -l ./myprogram
# Quick check if something is an ELF file
file ./myprogram
# Output: ELF 64-bit LSB pie executable, x86-64, ...
Static vs Dynamic Linking
- Static linking -- all library code is copied into your executable at compile time. Bigger binary, but no external dependencies. The binary runs anywhere (same architecture).
- Dynamic linking -- your binary just records which shared libraries it needs. At runtime, the dynamic linker (ld-linux.so) loads them. Smaller binary, shared memory for libraries, but the .so files must exist on the system.
Terminal
# Compile statically (everything baked in)
gcc -static -o myprogram myprogram.c
ls -la myprogram # ~1-2 MB (includes all of libc)
# Compile dynamically (default)
gcc -o myprogram myprogram.c
ls -la myprogram # ~16 KB (just your code + references)
# See what shared libraries a dynamic binary needs
ldd ./myprogram
# linux-vdso.so.1 (0x00007ffd3a1f2000)
# libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2a1c000000)
# /lib64/ld-linux-x86-64.so.2 (0x00007f2a1c400000)
Seeing the Memory Layout: /proc/PID/maps
Every running process has a file at /proc/PID/maps that shows its complete virtual memory layout. This is the actual address space we talked about in section 9, but for a real process.
Terminal
# See the memory map of the current shell
cat /proc/self/maps
Here's what real output looks like (simplified):
Terminal
# Address range Perms Offset Dev Inode Pathname
55a1d7a00000-55a1d7a02000 r--p 00000000 08:01 12345 /usr/bin/myprogram # ELF header + read-only data
55a1d7a02000-55a1d7a04000 r-xp 00002000 08:01 12345 /usr/bin/myprogram # .text (executable code)
55a1d7a04000-55a1d7a05000 r--p 00004000 08:01 12345 /usr/bin/myprogram # .rodata (string literals)
55a1d7a05000-55a1d7a06000 rw-p 00005000 08:01 12345 /usr/bin/myprogram # .data + .bss (globals)
55a1d8200000-55a1d8221000 rw-p 00000000 00:00 0 [heap] # heap (malloc'd memory)
7f2a1c000000-7f2a1c1b0000 r-xp 00000000 08:01 67890 /lib/x86_64-linux-gnu/libc.so.6 # libc code
7f2a1c1b0000-7f2a1c1b4000 rw-p 001b0000 08:01 67890 /lib/x86_64-linux-gnu/libc.so.6 # libc data
7ffd3a1d0000-7ffd3a1f1000 rw-p 00000000 00:00 0 [stack] # the stack
7ffd3a1f2000-7ffd3a1f4000 r--p 00000000 00:00 0 [vdso] # virtual dynamic shared object
r = readable, w = writable, x = executable, p = private (copy-on-write), s = shared. Notice that .text is r-x (read + execute, but NOT writable) -- this is the NX bit in action. The stack is rw- (read + write, but NOT executable) -- preventing code injection.
Run cat /proc/self/maps on your Linux machine (or in WSL). You'll see the memory layout of the cat process itself. Find the [heap], [stack], and libc entries. Compare the addresses to the memory layout diagrams from section 9.
17. Containers & Virtualization (OS Perspective)
You've probably used Docker or heard of VMs. But how do they actually work at the OS level? This section explains the kernel mechanisms that make containers and VMs possible.
How Virtual Machines Work
A VM runs a complete guest operating system on top of emulated hardware. A piece of software called a hypervisor creates the illusion of real hardware for each guest OS.
- Type 1 (bare-metal) -- the hypervisor runs directly on hardware, no host OS. Examples: VMware ESXi, KVM (Linux's built-in hypervisor), Microsoft Hyper-V, Xen. Used in datacentres.
- Type 2 (hosted) -- the hypervisor runs as a program on top of a regular OS. Examples: VirtualBox, VMware Workstation. Used on developer laptops.
Each VM has its own kernel, its own init system, its own everything. This provides strong isolation but has significant overhead: each VM might use 512MB-2GB just for the guest OS.
How Containers Work: NOT VMs
Containers are fundamentally different from VMs. A container does not run its own kernel. It shares the host's kernel but gets its own isolated view of the system using two Linux kernel features: namespaces and cgroups.
Linux Namespaces
Namespaces give each container its own isolated view of system resources. The kernel supports these namespace types:
- PID namespace -- the container sees its own set of process IDs. PID 1 inside the container is just a regular process on the host with a different PID.
- Network namespace -- the container gets its own network stack: its own IP address, routing table, ports. Port 80 inside the container doesn't conflict with port 80 on the host.
- Mount namespace -- the container sees its own filesystem tree. It can mount/unmount without affecting the host.
- UTS namespace -- the container can have its own hostname.
- User namespace -- UID 0 (root) inside the container maps to an unprivileged user on the host. This is how rootless containers work.
- Cgroup namespace -- the container sees its own cgroup hierarchy.
Terminal
# See what namespaces a process belongs to
ls -la /proc/self/ns/
# lrwxrwxrwx 1 user user 0 ... cgroup -> 'cgroup:[4026531835]'
# lrwxrwxrwx 1 user user 0 ... mnt -> 'mnt:[4026531840]'
# lrwxrwxrwx 1 user user 0 ... net -> 'net:[4026531992]'
# lrwxrwxrwx 1 user user 0 ... pid -> 'pid:[4026531836]'
# Create a new PID namespace (you become PID 1 inside it!)
sudo unshare --pid --fork --mount-proc bash
ps aux # Only shows processes in this namespace
Control Groups (cgroups)
While namespaces control what a container can see, cgroups control how much it can use. They limit and account for resource usage:
- CPU -- limit a container to e.g. 0.5 CPU cores
- Memory -- limit to e.g. 512MB. If the container exceeds this, the OOM killer terminates it.
- I/O -- throttle disk read/write bandwidth
- PIDs -- limit the number of processes (prevent fork bombs)
Terminal
# Run a Docker container with resource limits
docker run --memory=256m --cpus=0.5 --pids-limit=100 ubuntu
# See cgroup limits for a running container (cgroup v1 path)
cat /sys/fs/cgroup/memory/docker/CONTAINER_ID/memory.limit_in_bytes
# On cgroup v2 systems, look for memory.max under /sys/fs/cgroup/system.slice/
There's no guest OS to boot, no second kernel consuming memory, no hardware emulation overhead. A container starts in milliseconds (it's just a process with namespaces). A VM takes seconds to minutes (it's booting an entire OS). Containers also share the base image layers, so 10 containers based on the same image use barely more disk/memory than one.
How Docker Puts It All Together
Docker isn't magic. It's a user-friendly wrapper around three Linux kernel features:
- Namespaces -- process isolation (PID, network, mount, etc.)
- Cgroups -- resource limits (CPU, memory, I/O)
- Overlay filesystem (OverlayFS) -- layered filesystem. The base image is read-only; the container's changes are written to a thin writable layer on top. This is why images are built in layers and why docker commit works.
If you want to build a simple container from scratch and really understand these mechanisms, check out the Docker page -- there's a section on building your own Docker in Go using these exact syscalls (clone with CLONE_NEWPID, CLONE_NEWNS, etc.).
Run a Docker container, then from the host, inspect it:
Terminal
# Start a container
docker run -d --name test ubuntu sleep 3600
# Find its PID on the host
docker inspect --format '{{.State.Pid}}' test
# e.g., 12345
# Look at its namespaces
ls -la /proc/12345/ns/
# Look at its cgroup limits
cat /proc/12345/cgroup
# See its memory map (same as any process!)
cat /proc/12345/maps
A container is just a process. A heavily namespaced and cgroup-limited process, but a process nonetheless.
18. Essential Linux Commands
These commands let you observe and control everything we've discussed. Every developer should know these.
Process Commands
Shell
# List all processes
ps aux
# Interactive process viewer (top on steroids)
htop
# Show process tree (parent-child relationships)
pstree -p
# Send signals to processes
kill -SIGTERM 1234 # politely ask PID 1234 to exit
kill -SIGKILL 1234 # forcibly kill (can't be caught)
kill -SIGSTOP 1234 # pause the process
kill -SIGCONT 1234 # resume the process
# Run a process in the background
./long_task &
# See what syscalls a process is making
strace -p 1234
# See open files and sockets for a process
lsof -p 1234
Memory Commands
Shell
# Show system memory usage
free -h
# Show memory map of a process
cat /proc/1234/maps
# Show memory usage per process
ps aux --sort=-%mem | head
# Check for memory leaks
valgrind --leak-check=full ./my_program
Filesystem Commands
Shell
# Show disk usage
df -h # filesystem-level
du -sh /path/to/dir # directory-level
# Show inode info
stat filename # shows inode number, size, permissions, timestamps
ls -i # show inode numbers in listing
# Show mounted filesystems
mount | column -t
lsblk # show block devices
# File type detection
file mystery_file # "ELF 64-bit executable" or "ASCII text" etc.
Network Commands
Shell
# Show open ports and connections
ss -tulnp # (or netstat -tulnp on older systems)
# Show network interfaces
ip addr
# DNS lookup
dig example.com
# Trace network path
traceroute example.com
The /proc Filesystem
/proc is a virtual filesystem -- it doesn't exist on disk. It's the kernel exposing its internal data structures as files. Everything about your running system is in here.
Shell
# Info about process 1234
/proc/1234/status # state, memory, threads
/proc/1234/cmdline # command that started it
/proc/1234/maps # memory map (every region of the address space)
/proc/1234/fd/ # directory of open file descriptors (symlinks to files)
# System-wide info
/proc/cpuinfo # CPU details
/proc/meminfo # memory details
/proc/uptime # how long the system has been running
The OSTEP textbook has excellent hands-on labs where you write parts of an OS. If you want to really understand this stuff, do the projects at pages.cs.wisc.edu/~remzi/OSTEP/. The xv6 labs (a teaching OS written in C for RISC-V) are especially valuable.