Understand how services talk to each other without waiting -- message queues, publish/subscribe patterns, RabbitMQ, Kafka, and when to use each. This is how real-world backends handle scale, reliability, and async processing.
Say you're building an e-commerce app. A user places an order. Your Order Service needs to charge the payment, send a confirmation email, update the inventory, arrange shipping, and log the event for analytics.
The naive approach: the Order Service makes HTTP calls to each service, one after another, and waits for each to respond.
It's slow. The user waits while 5 different services respond. If each takes 200ms, that's a full second of waiting.
It's fragile. If the Email Service is down, the entire order fails -- even though the email isn't critical for order processing.
It's tightly coupled. The Order Service needs to know about every downstream service, their URLs, their APIs, and their availability. Adding a new service (like analytics) means changing the Order Service code.
Imagine a restaurant where the waiter takes your order and then personally walks to the kitchen, waits for the chef to cook it, walks to the bar, waits for the drink, then comes back. That's synchronous. The waiter is blocked the entire time.
Now imagine the waiter puts your order on a ticket and clips it to a rail. The kitchen grabs it when ready, the bar grabs the drink order when ready. The waiter is immediately free to take the next table's order. That ticket rail is a message queue.
The solution: instead of directly calling each service, the Order Service publishes a message (like "Order #1234 was placed") to a message queue or topic. The other services pick up the message and process it on their own time. The Order Service doesn't wait, doesn't care who's listening, and doesn't break if one service is temporarily down.
A message queue is a middleman that sits between services. One service puts a message in, another service takes it out. The queue holds messages until they're processed.
Your boss (producer) writes tasks on sticky notes and puts them on a board (queue). You (consumer) take a sticky note, do the task, and throw it away. Your boss doesn't stand over your shoulder waiting -- they just keep adding notes. If you go on lunch, the notes pile up but nothing is lost. When you come back, you process them.
There are two fundamental ways messages can flow. Understanding the difference is critical.
One message goes to exactly one consumer. If there are multiple consumers, they compete -- each message is delivered to only one of them.
```
Producer --> [Queue] --> Consumer A (gets message 1)
                     --> Consumer B (gets message 2)
                     --> Consumer A (gets message 3)
```
Each message is processed by exactly ONE consumer.
Task processing -- You have 1000 images to resize. Put them all in a queue. 10 worker consumers each grab one image at a time and process it. No image gets processed twice, and you can scale up by adding more workers.
Job queues -- Send emails, generate reports, process payments. Each job should happen exactly once.
Think of it like a bakery counter with a ticket system. You pull a number, and ONE employee serves you. Even if there are 5 employees, your order goes to just one.
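A minimal sketch of the competing-consumers idea, with round-robin standing in for the broker's dispatch logic (real brokers dispatch to whichever consumer is free, but the guarantee is the same: one message, one consumer):

```javascript
// Point-to-point sketch: each message goes to exactly ONE of the
// competing consumers. Round-robin stands in for the broker here.
const messages = ['img-1', 'img-2', 'img-3', 'img-4', 'img-5', 'img-6'];
const workers = [[], [], []]; // three consumers, each collects its work

messages.forEach((msg, i) => {
  // The broker hands each message to one consumer, never to all.
  workers[i % workers.length].push(msg);
});

console.log(workers);
// Every message appears in exactly one worker's list.
const totalProcessed = workers.flat().length; // 6 -- no duplicates
```

To process faster, you add a fourth entry to `workers`; the producer side is untouched.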
One message goes to all subscribers. Every subscriber gets a copy.
```
Publisher --> [Topic: "order-placed"]
              --> Subscriber: Email Service     (gets it)
              --> Subscriber: Inventory Service (gets it)
              --> Subscriber: Analytics Service (gets it)
```
Every subscriber gets every message.
Event broadcasting -- "An order was placed" is interesting to many services. Each service needs to know about it, but they each do something different with the information.
Notifications -- A user signs up, and you want to send a welcome email AND track it in analytics AND create a CRM entry.
Think of it like a radio broadcast. The radio station transmits once, and everyone tuned in hears it. Adding a new listener doesn't affect the broadcaster.
| Feature | Point-to-Point | Pub/Sub |
|---|---|---|
| Message delivery | One consumer only | All subscribers |
| Use case | Task distribution | Event broadcasting |
| Scaling | Add more consumers to process faster | Each subscriber processes independently |
| Example | Process payment, resize image | Order placed, user signed up |
| Analogy | Bakery ticket system | Radio broadcast |
Pub/Sub is the more interesting pattern, and the one you'll see most in real-world architectures. Let's break it down properly.
A topic is a named channel for a specific type of event. Think of it as a TV channel. Channel 5 is always sports, Channel 7 is always news. Services publish to a topic based on what happened, and subscribers choose which topics they care about.
Topics in an e-commerce app:
"order.placed" -- when a new order comes in
"order.shipped" -- when an order ships
"order.cancelled" -- when an order is cancelled
"user.registered" -- when a new user signs up
"payment.completed" -- when a payment goes through
"inventory.low" -- when stock is running low
The Order Service publishes to order.placed. It doesn't know or care who's listening. Three services subscribe to that topic:
- Email Service: subscribes to order.placed → sends confirmation email
- Inventory Service: subscribes to order.placed → decrements stock
- Analytics Service: subscribes to order.placed → logs for reporting

Next week, you add a Shipping Service. You just subscribe it to order.placed. The Order Service code doesn't change at all. That's the power of Pub/Sub.
Fan-out is when one message goes to multiple consumers. In Pub/Sub, this is the default behavior -- every subscriber gets every message. The "fan" metaphor comes from the shape: one input fans out to many outputs.
```
                          +--> Email Service
                          |
order.placed message ---->+--> Inventory Service   (fan-out)
                          |
                          +--> Analytics Service
```
Fan-in is the opposite: multiple sources feed into one consumer. Example: an Analytics Service that subscribes to order.placed, user.registered, and payment.completed -- it collects events from everywhere into one place.
```
order.placed ------+
                   |
user.registered ---+--> Analytics Service   (fan-in)
                   |
payment.completed -+
```
Here's where it gets interesting. What if your Email Service is slow and you want to run 3 instances of it? You don't want each instance to send the same email 3 times. You want the 3 instances to split the work.
This is what consumer groups solve. Instances in the same consumer group act as competing consumers -- each message goes to one instance in the group. But different groups each get their own copy.
```
Topic: "order.placed"

Consumer Group: "email-service"
  Instance 1 --> processes messages 1, 3, 5
  Instance 2 --> processes messages 2, 4, 6
  Instance 3 --> processes messages 7, 8, 9
  (Each message handled by ONE instance. No duplicates.)

Consumer Group: "inventory-service"
  Instance 1 --> processes ALL messages (only 1 instance)

Consumer Group: "analytics-service"
  Instance 1 --> processes messages 1, 2
  Instance 2 --> processes messages 3, 4
  (Split the load within the group.)
```
Consumer groups give you both patterns at once. Between groups: Pub/Sub (every group gets every message). Within a group: Point-to-Point (each message goes to one instance). This is how you scale Pub/Sub in production.
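The two layers can be sketched in plain JavaScript. Round-robin here is just a stand-in for the broker's real assignment logic (Kafka assigns partitions to instances), but the two guarantees come through: every group sees every message, and within a group each message is handled once.

```javascript
// Consumer-group semantics in miniature: every GROUP gets a copy of
// every message; within a group, each message goes to ONE instance.
const groups = {
  'email-service': 3,     // 3 instances share the work
  'inventory-service': 1, // 1 instance gets everything
};

function deliver(messages) {
  const received = {};
  for (const [group, instances] of Object.entries(groups)) {
    received[group] = Array.from({ length: instances }, () => []);
    messages.forEach((msg, i) => {
      // round-robin stands in for the broker's partition assignment
      received[group][i % instances].push(msg);
    });
  }
  return received;
}

const out = deliver(['m1', 'm2', 'm3', 'm4']);
// Each group saw all 4 messages; within email-service they were
// split across the 3 instances with no duplicates.
```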
Messages are usually JSON objects with enough context for the consumer to act:
```json
// Published to "order.placed" topic
{
  "eventType": "order.placed",
  "timestamp": "2024-01-15T10:30:00Z",
  "data": {
    "orderId": "ORD-1234",
    "userId": "USR-5678",
    "items": [
      { "productId": "PROD-001", "quantity": 2, "price": 29.99 },
      { "productId": "PROD-042", "quantity": 1, "price": 49.99 }
    ],
    "total": 109.97,
    "shippingAddress": {
      "street": "123 Main St",
      "city": "London",
      "postcode": "EC1A 1BB"
    }
  }
}
```
A common mistake is publishing a message like { "orderId": "1234" } and making every consumer call the Order Service API to get the details. This defeats the purpose of decoupling! Include all the data the consumers need in the message itself, so they don't need to call back.
RabbitMQ is a message broker -- it accepts, stores, and delivers messages. It's like a post office: you drop off a letter, and it makes sure it gets to the right mailbox.
| Concept | What It Is | Analogy |
|---|---|---|
| Producer | Sends messages | Person mailing a letter |
| Exchange | Receives messages and routes them to queues | Post office sorting room |
| Queue | Buffer that stores messages | Mailbox |
| Consumer | Reads messages from a queue | Person checking their mailbox |
| Binding | Rule connecting exchange to queue | Mail forwarding rule |
The exchange is what makes RabbitMQ flexible. It decides which queue(s) get each message:
- Direct -- routes by exact routing key. A message with routing key payment goes to the payment queue.
- Fanout -- broadcasts to every queue bound to the exchange, ignoring routing keys.
- Topic -- routes by pattern. A binding of order.* matches order.placed and order.shipped. Most flexible.
```javascript
// Node.js example with amqplib
const amqp = require('amqplib');

async function main() {
  // Connect to RabbitMQ
  const connection = await amqp.connect('amqp://localhost');
  const channel = await connection.createChannel();

  // Create a queue (idempotent -- safe to call multiple times)
  await channel.assertQueue('tasks', { durable: true });

  // ===== PRODUCER: Send a message =====
  const message = JSON.stringify({ orderId: 'ORD-1234', action: 'process' });
  channel.sendToQueue('tasks', Buffer.from(message), { persistent: true });
  console.log('Sent:', message);

  // ===== CONSUMER: Receive messages =====
  channel.consume('tasks', (msg) => {
    const data = JSON.parse(msg.content.toString());
    console.log('Received:', data);
    // Process the message...

    // Acknowledge -- tells RabbitMQ we're done, safe to delete
    channel.ack(msg);
  });
}

main().catch(console.error);
```
When a consumer gets a message, RabbitMQ doesn't delete it immediately. The consumer must acknowledge (ACK) the message after processing. If the consumer crashes before ACKing, RabbitMQ re-delivers the message to another consumer. This prevents message loss.
This is like signing for a package -- the delivery isn't complete until you confirm you received it.
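The ack-or-redeliver loop can be simulated in a few lines. An in-memory array stands in for the broker: a message leaves the queue only after the consumer finishes, and a crash before the ack puts it back.

```javascript
// Sketch of at-least-once redelivery: a message is only removed after
// the consumer ACKs. A crash before the ACK means it gets redelivered.
function processQueue(queue, consumer) {
  const log = [];
  while (queue.length > 0) {
    const msg = queue.shift();
    try {
      consumer(msg);
      log.push(`acked ${msg.id}`); // consumer finished -> ACK -> delete
    } catch (err) {
      queue.push(msg); // no ACK -> broker redelivers later
      log.push(`redelivered ${msg.id}`);
    }
  }
  return log;
}

let firstAttempt = true;
const log = processQueue([{ id: 'A' }, { id: 'B' }], (msg) => {
  // Simulate a crash the first time message A is handled.
  if (msg.id === 'A' && firstAttempt) {
    firstAttempt = false;
    throw new Error('consumer crashed mid-processing');
  }
});
// Message A fails once, is redelivered, then succeeds -- nothing lost.
```

Note the flip side: message A was *attempted* twice, which is exactly why consumers need to be idempotent (covered below).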
```javascript
// Publisher -- uses a fanout exchange
// (continues using the connection/channel from the previous example)
const exchange = 'order-events';
await channel.assertExchange(exchange, 'fanout', { durable: true });

const event = JSON.stringify({
  type: 'order.placed',
  orderId: 'ORD-1234',
  total: 109.97
});
channel.publish(exchange, '', Buffer.from(event));

// ===== Subscriber 1: Email Service =====
const q1 = await channel.assertQueue('email-queue', { durable: true });
await channel.bindQueue(q1.queue, exchange, '');
channel.consume(q1.queue, (msg) => {
  const event = JSON.parse(msg.content.toString());
  console.log('Sending email for order:', event.orderId);
  channel.ack(msg);
});

// ===== Subscriber 2: Inventory Service =====
const q2 = await channel.assertQueue('inventory-queue', { durable: true });
await channel.bindQueue(q2.queue, exchange, '');
channel.consume(q2.queue, (msg) => {
  const event = JSON.parse(msg.content.toString());
  console.log('Updating inventory for order:', event.orderId);
  channel.ack(msg);
});
```
```shell
# Run RabbitMQ with management UI
docker run -d --name rabbitmq \
  -p 5672:5672 \
  -p 15672:15672 \
  rabbitmq:3-management

# Management UI at http://localhost:15672
# Default login: guest / guest
```
Kafka is fundamentally different from RabbitMQ. RabbitMQ is a message broker (delivers and deletes messages). Kafka is an event streaming platform (stores events permanently as a log).
RabbitMQ is like a post office. You send a letter, it's delivered, and it's gone from the system. The post office doesn't keep a copy.
Kafka is like a newspaper. Events are published to the paper, and the paper keeps every issue forever (or for a configured retention period). Anyone can read any issue at any time. New subscribers can read from the beginning. The paper doesn't care when you read it.
| Concept | What It Is | Analogy |
|---|---|---|
| Topic | A category/feed of events | A newspaper section (Sports, Business) |
| Partition | A topic split into ordered segments for parallelism | Multiple printing presses for one section |
| Producer | Writes events to a topic | Journalist writing articles |
| Consumer | Reads events from a topic | Person reading the paper |
| Consumer Group | Group of consumers that split partitions | A team reading different sections |
| Offset | Position of a consumer in the log | Bookmark in the newspaper |
| Broker | A Kafka server | A printing press |
```
Topic: "orders" (3 partitions)

Partition 0: [msg0] [msg3] [msg6] [msg9]  ...
Partition 1: [msg1] [msg4] [msg7] [msg10] ...
Partition 2: [msg2] [msg5] [msg8] [msg11] ...
```
Each partition is an ordered, append-only log.
Messages are NOT deleted after reading -- they're kept for a retention period.
Each message has an offset (position number) in its partition.
Partitions are how Kafka achieves parallelism. If you have 3 partitions, you can have 3 consumers reading simultaneously (one per partition). More partitions = more throughput.
Kafka guarantees ordering within a partition, not across partitions. If message order matters for a specific entity (like all events for order #1234), use the entity ID as the partition key -- Kafka will always put the same key in the same partition.
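The idea can be sketched with a toy hash function. (Kafka's default partitioner actually uses murmur2 hashing, but the principle is identical: hash the key, take it modulo the partition count.)

```javascript
// Sketch of key-based partition assignment: hash the key, mod by the
// partition count. The same key always lands in the same partition,
// so all events for one order stay in order relative to each other.
function partitionFor(key, numPartitions) {
  let hash = 0;
  for (const ch of key) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple string hash
  }
  return hash % numPartitions;
}

const p1 = partitionFor('ORD-1234', 3);
const p2 = partitionFor('ORD-1234', 3);
// Same key, same partition -- every time, on every producer.
```

This is why the producer example below sets `key: 'ORD-1234'`: without a key, messages are spread across partitions and cross-message ordering for that order is lost.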
```javascript
// Using kafkajs
const { Kafka } = require('kafkajs');

const kafka = new Kafka({
  clientId: 'my-app',
  brokers: ['localhost:9092']
});

// ===== PRODUCER =====
const producer = kafka.producer();
await producer.connect();
await producer.send({
  topic: 'order-events',
  messages: [
    {
      key: 'ORD-1234', // Partition key (same order always same partition)
      value: JSON.stringify({
        type: 'order.placed',
        orderId: 'ORD-1234',
        total: 109.97,
        timestamp: new Date().toISOString()
      })
    }
  ]
});

// ===== CONSUMER =====
const consumer = kafka.consumer({ groupId: 'email-service' });
await consumer.connect();
await consumer.subscribe({ topic: 'order-events', fromBeginning: false });

await consumer.run({
  eachMessage: async ({ topic, partition, message }) => {
    const event = JSON.parse(message.value.toString());
    console.log(`Partition ${partition} | Offset ${message.offset}`);
    console.log('Event:', event);
    // Process the event...
    // Offset is auto-committed (Kafka tracks where each consumer group is)
  }
});
```
```yaml
# docker-compose.yml
services:
  kafka:
    image: confluentinc/cp-kafka:7.6.0
    ports:
      - "9092:9092"
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:9092,CONTROLLER://0.0.0.0:9093
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@localhost:9093
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
      CLUSTER_ID: 'MkU3OEVBNTcwNTJENDM2Qk'
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
```
| Feature | RabbitMQ | Kafka |
|---|---|---|
| Model | Message broker (deliver & delete) | Event log (append & retain) |
| After reading | Message is deleted from queue | Message stays (retention period) |
| Throughput | Thousands/sec | Millions/sec |
| Ordering | Per queue (FIFO) | Per partition |
| Replay | Not possible (messages gone) | Can replay from any offset |
| Routing | Flexible (exchanges, bindings, patterns) | Simple (topics + partitions) |
| Complexity | Easier to set up and operate | More complex, more infrastructure |
| Best for | Task queues, RPC, simple Pub/Sub | Event sourcing, streaming, analytics, high throughput |
Use RabbitMQ when: You need a traditional job/task queue. You want messages to be delivered and then gone. You need flexible routing. Your throughput is moderate (thousands/sec). You want something simpler to operate.
Use Kafka when: You need to store events long-term. You need to replay events. You need extremely high throughput. Multiple consumers need the same data. You're doing event sourcing or stream processing. You need strong ordering guarantees.
If you're on AWS, you don't have to run RabbitMQ or Kafka yourself. AWS provides managed services:
SQS is a fully managed message queue. It's conceptually similar to a RabbitMQ queue but you don't manage any servers. AWS handles scaling, durability, and availability.
```javascript
// AWS SDK v3 (Node.js)
const { SQSClient, SendMessageCommand, ReceiveMessageCommand } = require('@aws-sdk/client-sqs');

const sqs = new SQSClient({ region: 'us-east-1' });
const queueUrl = 'https://sqs.us-east-1.amazonaws.com/123456789/my-queue';

// Send a message
await sqs.send(new SendMessageCommand({
  QueueUrl: queueUrl,
  MessageBody: JSON.stringify({ orderId: 'ORD-1234', action: 'process' })
}));

// Receive messages (long polling)
const response = await sqs.send(new ReceiveMessageCommand({
  QueueUrl: queueUrl,
  MaxNumberOfMessages: 10,
  WaitTimeSeconds: 20 // Long polling -- waits up to 20s for messages
}));
```
SNS is AWS's Pub/Sub service. You publish to a topic, and all subscribers get the message. Subscribers can be SQS queues, Lambda functions, HTTP endpoints, or email addresses.
The most common AWS pattern: use SNS for fan-out, SQS for each consumer:
```
                              +--> SQS Queue --> Email Service
                              |
SNS Topic "order-placed" ---->+--> SQS Queue --> Inventory Service
                              |
                              +--> SQS Queue --> Analytics Service
```
Each service has its own SQS queue. SNS delivers to all queues.
Each service processes at its own pace.
SNS pushes messages to subscribers as soon as they arrive. If a subscriber endpoint is down when SNS pushes, SNS retries for a while, but once its retries are exhausted the message is gone. By putting an SQS queue in front of each service, messages are buffered -- if the service is down, messages queue up and wait. When the service comes back, it processes the backlog. This is why SNS + SQS is the standard pattern on AWS.
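The buffering benefit is easy to simulate -- plain arrays stand in for the per-subscriber SQS queues:

```javascript
// Sketch of SNS + SQS fan-out: the topic delivers a copy into each
// subscriber's queue, so a subscriber that is down just accumulates
// a backlog instead of losing messages.
const queues = { email: [], inventory: [] };

function publish(event) {
  // fan-out: every subscriber queue gets its own copy
  for (const q of Object.values(queues)) q.push(event);
}

// The inventory service is "down" while three orders arrive.
publish({ orderId: 'ORD-1' });
publish({ orderId: 'ORD-2' });
publish({ orderId: 'ORD-3' });

// When it comes back, it drains the backlog at its own pace.
const processed = [];
while (queues.inventory.length > 0) {
  processed.push(queues.inventory.shift().orderId);
}
// Nothing was lost: all three orders are processed, in order.
```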
If you want Kafka without managing Kafka infrastructure, AWS offers Amazon MSK (Managed Streaming for Apache Kafka). Same Kafka APIs, but AWS handles the servers, patching, and scaling. It's more expensive than running your own but saves operational burden.
Pattern: Point-to-Point queue
Tool: RabbitMQ or SQS
```
User uploads profile photo
  --> API puts "resize-image" job in queue
  --> API responds to user immediately: "Processing..."
  --> Worker picks up job, resizes image, saves to S3
  --> Updates database with new image URL
```
Pattern: Pub/Sub with consumer groups
Tool: Kafka or SNS+SQS
```
Order Service publishes "order.placed"
  --> Payment Service charges card
  --> Email Service sends confirmation
  --> Inventory Service decrements stock
  --> Analytics Service logs event
```
Pattern: Kafka event streaming
Tool: Kafka
```
User likes a post
  --> Event published to "user-activity" topic
  --> Feed Service updates followers' feeds
  --> Notification Service pushes alerts
  --> ML Service updates recommendation model
```
Pattern: Kafka fan-in
Tool: Kafka
```
Service A logs -->
Service B logs --> Kafka "logs" topic --> Elasticsearch --> Kibana dashboard
Service C logs -->
```
Pattern: Queue as buffer
Tool: RabbitMQ or SQS
```
Flash sale: 10,000 orders/sec hit your API
  --> API puts each order in queue instantly (fast)
  --> Backend processes orders at 500/sec (sustainable rate)
  --> Queue absorbs the burst, nothing crashes
```
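The burst-absorbing behavior can be simulated with a plain array as the queue (the numbers here are illustrative, scaled down from the flash-sale scenario):

```javascript
// Sketch of load leveling: a burst of incoming orders is enqueued
// instantly, while the backend drains at a steady, sustainable rate.
const queue = [];
const BURST = 100;     // orders arriving in one spike
const DRAIN_RATE = 10; // orders the backend can handle per tick

// The API accepts the whole burst immediately -- enqueueing is cheap.
for (let i = 0; i < BURST; i++) queue.push({ orderId: i });

// The backend works through the backlog tick by tick.
let ticks = 0;
let processed = 0;
while (queue.length > 0) {
  const batch = queue.splice(0, DRAIN_RATE);
  processed += batch.length;
  ticks++;
}
// 100 orders absorbed instantly, drained over 10 ticks -- no crash.
```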
| Guarantee | What It Means | Risk |
|---|---|---|
| At-most-once | Message delivered 0 or 1 times. Fire and forget. | Messages can be lost |
| At-least-once | Message delivered 1 or more times. Retries on failure. | Messages can be duplicated |
| Exactly-once | Message delivered exactly 1 time. The holy grail. | Very hard/expensive to implement |
In distributed systems, truly exactly-once delivery is extremely difficult. Most systems achieve at-least-once delivery and make their consumers idempotent -- meaning processing the same message twice has no extra effect. For example, if you get two "charge customer $50" messages with the same order ID, check if you already charged for that order before charging again.
An operation is idempotent if doing it twice produces the same result as doing it once. This is critical when using message queues because duplicates are always possible.
```javascript
// NOT idempotent -- running twice adds $50 twice
balance += 50;

// Idempotent -- running twice has same effect as once
if (!processedOrders.has(orderId)) {
  balance += 50;
  processedOrders.add(orderId);
}
```
A DLQ is where messages go when they can't be processed after multiple retries. Instead of retrying forever or losing the message, it gets moved to a special queue where you can inspect it, fix the issue, and reprocess it.
```
Main Queue --> Consumer tries processing
                        |
                  Fails 3 times
                        |
                        v
            Dead Letter Queue (DLQ)
           (inspect, fix, reprocess)
```
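The retry-then-park flow can be sketched in a few lines (in-memory arrays stand in for the broker's queues; real brokers like RabbitMQ and SQS do this via dead-letter configuration on the queue itself):

```javascript
// Sketch of a dead letter queue: after MAX_RETRIES failed attempts,
// a message is parked in the DLQ for inspection instead of looping.
const MAX_RETRIES = 3;
const dlq = [];

function consume(queue, handler) {
  while (queue.length > 0) {
    const msg = queue.shift();
    try {
      handler(msg);
    } catch (err) {
      msg.attempts = (msg.attempts || 0) + 1;
      if (msg.attempts >= MAX_RETRIES) {
        dlq.push(msg); // give up -- park it for a human to inspect
      } else {
        queue.push(msg); // retry later
      }
    }
  }
}

const queue = [{ id: 'good' }, { id: 'poison' }];
consume(queue, (msg) => {
  if (msg.id === 'poison') throw new Error('cannot parse payload');
});
// The "poison" message failed 3 times and ended up in the DLQ;
// the good message was processed normally and didn't get stuck behind it.
```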
When consumers can't keep up with producers, the queue fills up. Backpressure is the mechanism for handling this -- either slow down producers, scale up consumers, or shed load.
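A bounded in-memory queue shows the load-shedding variant of backpressure (brokers expose the same idea as queue length limits, and RabbitMQ's prefetch setting is the slow-the-consumer-pull variant):

```javascript
// Sketch of backpressure: a bounded queue rejects new work when full,
// forcing the producer to slow down or shed load instead of letting
// the backlog grow without limit.
const CAPACITY = 5;
const queue = [];
let accepted = 0;
let rejected = 0;

function tryEnqueue(msg) {
  if (queue.length >= CAPACITY) {
    rejected++; // shed load -- signal the producer to back off
    return false;
  }
  queue.push(msg);
  accepted++;
  return true;
}

// A producer pushes 8 messages while the consumer is stalled.
for (let i = 0; i < 8; i++) tryEnqueue({ id: i });
// Only 5 fit; the other 3 are rejected instead of exhausting memory.
```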
Most queues guarantee FIFO within a partition/queue but NOT globally. If strict ordering matters (e.g., bank transactions for one account), make sure related messages go to the same partition using a partition key.