Stop wiring services together with brittle synchronous calls. Learn how queues, pub/sub, and streams decouple your system, and which managed building block to reach for when.
You ship an order service. It takes a payment, then calls the email service, then the analytics service, then the warehouse service, one after another, in the same request. It works great in the demo. Then the email provider has a bad afternoon, its API hangs for 30 seconds, and suddenly customers can't place orders at all. Nobody is buying email. They're buying products. But because checkout calls email *synchronously*, a hiccup in a non-critical service became an outage in your most critical one.
The fix isn't a faster email service. It's changing the *shape* of the conversation. Instead of checkout calling everyone and waiting, checkout announces "an order was placed" and walks away. Whoever cares (email, analytics, the warehouse) reacts on their own time. That announcement is an event, and designing around it is event-driven architecture.
Who this is for
Developers and junior cloud engineers who can already build a service that calls another service over HTTP, and are starting to feel the pain: cascading failures, slow requests, and services that have to know too much about each other. No prior messaging experience needed. We use AWS, GCP, and Azure names, but the ideas are the same on every cloud.
The mental model: a newsroom, not a phone tree
Event-driven architecture means services communicate by emitting and reacting to events (facts about what happened) instead of directly commanding each other.
Synchronous calls are a phone tree. To tell five teams something, you phone each one, wait for them to pick up, and you're stuck on the line until the last call ends. If one person doesn't answer, you're frozen mid-tree.
Event-driven is a newsroom. A reporter publishes a story. They don't know, or care, who reads it. Subscribers (the sports desk, the weather team, an archive bot) each pick it up and do their own thing, at their own pace. The reporter's job is done the moment the story is filed. Add a tenth subscriber tomorrow and the reporter's code never changes.
| Newsroom | Event-driven system |
|---|---|
| A reporter files a story | A producer publishes an event |
| The story itself (a fact that happened) | The event / message payload |
| The newswire everyone reads from | The topic / queue / stream |
| The sports & weather desks | Independent consumers / subscribers |
| Anyone can start reading the wire tomorrow | Add a consumer without touching the producer |

The newsroom maps cleanly onto the building blocks.
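The contrast between the phone tree and the newsroom can be sketched in a few lines of plain Python. This is a minimal in-process illustration, not a real broker: `phone_tree_checkout`, `newsroom_checkout`, `flaky_email`, `record_sale`, and the `wire` list are all hypothetical names, and a plain list stands in for the topic.

```python
from typing import Callable, Dict, List

Event = Dict[str, str]


def phone_tree_checkout(order_id: str, callees: List[Callable[[str], None]]) -> str:
    """Synchronous style: call each downstream service and wait for it."""
    for call in callees:
        call(order_id)  # any exception (or hang) here stops checkout itself
    return "order confirmed"


def newsroom_checkout(order_id: str, wire: List[Event]) -> str:
    """Event style: file one story on the wire, then return immediately."""
    wire.append({"type": "OrderPlaced", "order_id": order_id})
    return "order confirmed"


def flaky_email(order_id: str) -> None:
    # Hypothetical stand-in for an email provider having a bad afternoon.
    raise TimeoutError("email provider timed out")


def record_sale(order_id: str) -> None:
    pass  # analytics would record the sale here


# Phone tree: the email failure surfaces inside checkout.
try:
    phone_result = phone_tree_checkout("order-1", [flaky_email, record_sale])
except TimeoutError as exc:
    phone_result = f"checkout failed: {exc}"

# Newsroom: checkout only writes to the wire; readers pick the event up later.
wire: List[Event] = []
news_result = newsroom_checkout("order-1", wire)

print(phone_result)
print(news_result, wire)
```

In the first version the list of callees is part of checkout's own call, so every new consumer changes the producer. In the second, checkout's only dependency is the wire.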
The picture: one event, many reactions
Here's the order flow rebuilt around an event. Checkout publishes once to a topic; the topic fans the event out to every interested consumer. Checkout finishes in milliseconds and is completely unaffected if a downstream consumer is slow or down.
A producer publishes once; the topic fans the event out to independent consumers (fan-out).
1. Customer clicks Buy
Checkout charges the card and persists the order. That part is still synchronous: it must succeed before we promise anything.
2. Checkout publishes one event
It sends a single OrderPlaced message to the topic with the order id and details, then returns 200 to the customer. Total added latency: a few milliseconds.
3. The topic fans out
The messaging service delivers a copy to every subscriber: email, warehouse, analytics. Checkout has no idea who they are.
4. Consumers react independently
Email sends a receipt. The warehouse reserves stock. Analytics records the sale. If email is down, the warehouse and analytics are unaffected, and email retries later.
Queues vs pub/sub vs streams
"Messaging" is three different patterns wearing the same coat. Picking the wrong one is the most common early mistake, so anchor on these distinctions before you reach for a service.
Queue: one message, one worker. The message is consumed and gone. Use it to *distribute work*: ten workers pull from one queue and each grabs different items.
Pub/Sub: one message, every subscriber gets a copy. Use it to *broadcast a fact* so multiple systems can react (this is fan-out).
Stream: an append-only log you can replay. Messages aren't deleted on read; consumers track their own position. Use it for *ordered history, replay, and analytics*.
| | Queue | Pub/Sub | Stream |
|---|---|---|---|
| Delivery | 1 message → 1 consumer | 1 message → all subscribers | 1 log → many readers, each at own offset |
| Ordering | Best-effort (FIFO variants exist) | Usually unordered | Strong, per-partition order |
| Fan-out | No (work is split, not copied) | Yes, the whole point | Yes (each consumer reads the full log) |
| Replay | No: read = gone | No: miss it, miss it | Yes: rewind to any offset |
| Best for | Background jobs, task distribution | Broadcasting events, decoupling | Event sourcing, metrics, audit trails |
| Managed examples | SQS, Pub/Sub (pull), Service Bus queues | SNS, EventBridge, Pub/Sub, Service Bus topics | Kinesis, Kafka / MSK, Event Hubs |

The same event needs different plumbing depending on what you want from it.
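To make the three delivery behaviours concrete, here is a toy in-memory version of each. It is a sketch only: `WorkQueue`, `PubSubTopic`, and `Stream` are hypothetical classes that mimic the semantics in the table, not the API of any managed service.

```python
from collections import deque
from typing import Dict, List, Optional


class WorkQueue:
    """Each message goes to exactly one receiver, then it is gone."""

    def __init__(self) -> None:
        self._items = deque()

    def send(self, msg: str) -> None:
        self._items.append(msg)

    def receive(self) -> Optional[str]:
        return self._items.popleft() if self._items else None


class PubSubTopic:
    """Every subscriber that exists at publish time gets its own copy."""

    def __init__(self) -> None:
        self.inboxes: Dict[str, List[str]] = {}

    def subscribe(self, name: str) -> None:
        self.inboxes[name] = []

    def publish(self, msg: str) -> None:
        for inbox in self.inboxes.values():
            inbox.append(msg)


class Stream:
    """Append-only log: reading never deletes; readers pick their own offset."""

    def __init__(self) -> None:
        self._log: List[str] = []

    def append(self, msg: str) -> None:
        self._log.append(msg)

    def read_from(self, offset: int) -> List[str]:
        return self._log[offset:]


queue = WorkQueue()
queue.send("job-1")
queue.send("job-2")
worker_a, worker_b = queue.receive(), queue.receive()  # two workers, two different jobs

topic = PubSubTopic()
topic.subscribe("email")
topic.subscribe("analytics")
topic.publish("OrderPlaced:1")
topic.subscribe("late-joiner")  # subscribed after the publish, so it missed the event

stream = Stream()
for event in ("OrderPlaced:1", "OrderPlaced:2", "OrderShipped:1"):
    stream.append(event)
first_read = stream.read_from(0)
replay = stream.read_from(0)  # same result: the first read deleted nothing
tail = stream.read_from(2)    # a reader that is already at offset 2
```

The only structural difference between the three is what a read does to the data: remove it, copy it, or leave it in place.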
Queue + pub/sub is the classic combo
On AWS the textbook fan-out is SNS → SQS: SNS broadcasts the event, and each consumer has its own SQS queue subscribed to the topic. You get broadcast AND a durable buffer per consumer, so a slow consumer never blocks the others. EventBridge plays the same role with richer routing rules.
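The topic-plus-queue-per-consumer shape can be simulated in memory to see why a struggling consumer does not hold up the others. This is a sketch under stated assumptions: `FanOutTopic` and `drain` are hypothetical stand-ins for the topic and a consumer's polling loop, and the in-memory deques are not durable the way real queues are.

```python
from collections import deque
from typing import Callable, Deque, Dict

Event = Dict[str, str]


class FanOutTopic:
    """Stands in for the topic: copies each event into every subscribed queue."""

    def __init__(self) -> None:
        self.queues: Dict[str, Deque[Event]] = {}

    def subscribe(self, consumer: str) -> Deque[Event]:
        self.queues[consumer] = deque()
        return self.queues[consumer]

    def publish(self, event: Event) -> None:
        for queue in self.queues.values():
            queue.append(dict(event))  # each consumer gets its own copy


def drain(queue: Deque[Event], handler: Callable[[Event], None]) -> int:
    """Handle queued events in order; on failure, leave the rest buffered."""
    handled = 0
    while queue:
        try:
            handler(queue[0])
        except Exception:
            break  # the event stays at the head of this consumer's queue
        queue.popleft()  # remove only after the handler succeeded
        handled += 1
    return handled


def email_down(event: Event) -> None:
    raise ConnectionError("email provider unavailable")


def ok(event: Event) -> None:
    pass


topic = FanOutTopic()
email_q = topic.subscribe("email")
warehouse_q = topic.subscribe("warehouse")

topic.publish({"type": "OrderPlaced", "order_id": "order-1"})
topic.publish({"type": "OrderPlaced", "order_id": "order-2"})

warehouse_handled = drain(warehouse_q, ok)   # unaffected by the email outage
email_handled = drain(email_q, email_down)   # fails; events stay buffered
email_backlog = len(email_q)
email_recovered = drain(email_q, ok)         # a later retry clears the backlog
```

Because each consumer drains its own buffer, the email backlog grows in isolation and is worked off once the provider recovers.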
Publishing and consuming an event
Concretely, publishing is a one-liner and consuming is a small loop. Here's the SNS → SQS fan-out from the diagram, in Python with boto3. First, checkout publishes the event:
publish_order.py
python
import json

import boto3

sns = boto3.client("sns")
TOPIC_ARN = "arn:aws:sns:eu-west-1:123456789012:OrderPlaced"


def publish_order_placed(order_id: str, total: float, email: str) -> None:
    event = {
        "type": "OrderPlaced",
        "order_id": order_id,
        "total": total,
        "customer_email": email,
    }
    sns.publish(
        TopicArn=TOPIC_ARN,
        Message=json.dumps(event),
        # idempotency: a stable id lets consumers dedupe
        MessageAttributes={
            "event_id": {"DataType": "String", "StringValue": order_id},
        },
    )
    # checkout returns to the customer right here, no waiting on consumers
Each consumer owns an SQS queue subscribed to that topic. The email worker just polls its queue, does its job, and deletes the message to acknowledge it:
email_consumer.py
python
import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/email-queue"

# already_processed, mark_processed, and send_receipt are application
# helpers that are not shown here.

while True:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,  # long polling, cheaper, less spin
    )
    for msg in resp.get("Messages", []):
        envelope = json.loads(msg["Body"])       # SNS wraps the payload
        event = json.loads(envelope["Message"])  # our actual event

        if not already_processed(event["order_id"]):
            send_receipt(event["customer_email"], event["order_id"])
            mark_processed(event["order_id"])

        # delete = acknowledge. Only do this AFTER the work succeeded.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
Notice the already_processed / mark_processed guard, and that we delete the message only *after* the work succeeds. That's not optional decoration; it's the heart of running this safely. Here's why.
Delivery guarantees: why idempotency is non-negotiable
Almost every managed messaging service gives you at-least-once delivery. Read that carefully: *at least* once. Not exactly once. The same event can, and eventually will, be delivered to your consumer more than once. This isn't a bug; it's the honest trade-off that makes the system reliable.
It happens for boring, unavoidable reasons. A consumer processes a message, but its acknowledgement gets lost on the network. The broker never hears "done," so after a visibility timeout it redelivers, and now you've sent two receipt emails for one order. The three delivery guarantees, in plain terms:
At-most-once: fire and forget. Fast, but you can silently lose messages. Rarely what you want.
At-least-once: never lost, sometimes duplicated. The realistic default for SQS, SNS, Pub/Sub, and Kinesis.
Exactly-once: the dream. A few services offer it in narrow conditions (Kafka transactions, SQS FIFO dedup windows), but it's limited and costs throughput. Don't architect around assuming it.
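The lost-acknowledgement scenario is easy to reproduce with a toy queue. The sketch below assumes a simplified model: `ToyQueue` is a hypothetical class driven by a logical clock, and it only imitates the visibility-timeout idea, not how any real broker is implemented.

```python
from typing import Dict, List, Optional, Tuple


class ToyQueue:
    """Toy at-least-once queue driven by a logical clock instead of real time."""

    def __init__(self, visibility_timeout: int) -> None:
        self.visibility_timeout = visibility_timeout
        self.now = 0
        self._messages: Dict[str, List] = {}  # id -> [body, invisible_until]

    def send(self, msg_id: str, body: str) -> None:
        self._messages[msg_id] = [body, 0]

    def receive(self) -> Optional[Tuple[str, str]]:
        for msg_id, entry in self._messages.items():
            if entry[1] <= self.now:
                entry[1] = self.now + self.visibility_timeout  # hide while in flight
                return msg_id, entry[0]
        return None

    def delete(self, msg_id: str) -> None:
        self._messages.pop(msg_id, None)  # this is the acknowledgement

    def tick(self, seconds: int) -> None:
        self.now += seconds


queue = ToyQueue(visibility_timeout=30)
queue.send("m-1", "OrderPlaced:order-1")

receipts_sent = []

first = queue.receive()
receipts_sent.append(first[1])  # the work succeeds...
# ...but the delete call is lost on the network, so the queue never hears "done".

hidden = queue.receive()        # still inside the visibility timeout: nothing to read
queue.tick(30)
second = queue.receive()        # timeout expired: the same message comes back
receipts_sent.append(second[1])
queue.delete(second[0])         # this time the acknowledgement arrives
after_ack = queue.receive()
```

Nothing in this run is a bug in the consumer or the queue, yet the customer ends up with two receipts for one order.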
Design for duplicates, not against them
Since you'll get duplicates, make your consumers idempotent: processing the same event twice has the same effect as processing it once. Use the event's stable id as a dedup key, record "I've handled order_id X," and skip it if it shows up again. Idempotency turns at-least-once from a liability into a non-issue.
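A dedup guard of this kind might look like the following. It is a minimal sketch: `ReceiptConsumer` is a hypothetical class, and the in-memory set is an assumption made for illustration. A real consumer would need a durable store shared by all of its workers.

```python
from typing import Dict, List, Set


class ReceiptConsumer:
    def __init__(self) -> None:
        # In-memory only, for illustration; swap in a durable shared store.
        self._handled: Set[str] = set()
        self.receipts: List[str] = []

    def handle(self, event: Dict[str, str]) -> bool:
        """Return True if work was done, False if the event was a duplicate."""
        key = event["order_id"]  # the stable id doubles as the dedup key
        if key in self._handled:
            return False
        self.receipts.append(f"receipt for {key} to {event['customer_email']}")
        self._handled.add(key)
        return True


consumer = ReceiptConsumer()
event = {
    "type": "OrderPlaced",
    "order_id": "order-1",
    "customer_email": "a@example.com",
}
results = [consumer.handle(event), consumer.handle(event)]  # delivered twice
```

The second delivery is recognised and skipped, so the observable effect (one receipt) is the same as if the event had arrived once.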
The same instinct covers the *other* failure: a message your consumer can never process (bad data, a permanent bug). Without a backstop it gets redelivered forever, a "poison pill." Configure a dead-letter queue (DLQ) so a message that fails N times moves aside for inspection instead of blocking the line.
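The "fails N times, then moves aside" rule can be sketched with an in-memory queue. This is an illustration under assumptions, not a managed service's behaviour in detail: `consume_with_dlq`, `send_receipt`, and `max_receives` are hypothetical names, and a list stands in for the dead-letter queue.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

Message = Dict[str, str]


def consume_with_dlq(
    queue: Deque[Message],
    handler: Callable[[Message], None],
    max_receives: int,
) -> List[Message]:
    """Drain the queue; park any message that fails max_receives times."""
    dead_letters: List[Message] = []
    receive_counts: Dict[str, int] = {}
    while queue:
        msg = queue.popleft()
        receive_counts[msg["id"]] = receive_counts.get(msg["id"], 0) + 1
        try:
            handler(msg)
        except Exception:
            if receive_counts[msg["id"]] >= max_receives:
                dead_letters.append(msg)  # move aside for inspection
            else:
                queue.append(msg)         # redeliver later
    return dead_letters


handled: List[str] = []
attempts: List[str] = []


def send_receipt(msg: Message) -> None:
    attempts.append(msg["id"])
    if "customer_email" not in msg:
        raise ValueError("malformed event: no customer_email")
    handled.append(msg["id"])


queue = deque([
    {"id": "m-1", "customer_email": "a@example.com"},
    {"id": "m-2"},  # poison pill: can never be processed
    {"id": "m-3", "customer_email": "c@example.com"},
])
dlq = consume_with_dlq(queue, send_receipt, max_receives=3)
```

The malformed message is attempted three times and then parked, while the healthy messages on either side of it are processed normally.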
Common mistakes that cost hours
Treating at-least-once as exactly-once. No dedup guard means double-charged cards and duplicate emails the first time a network blip causes a redelivery. Make consumers idempotent from day one.
Acknowledging before the work is done. Delete or ack the message only *after* processing succeeds. Ack first and crash, and the event is gone forever.
No dead-letter queue. A single malformed message retries endlessly, drowns your logs, and can stall the whole queue. Always wire a DLQ with a sane retry count.
Putting commands in events. An event states a fact (OrderPlaced), not an instruction (SendEmail). If the producer is telling a specific consumer what to do, you've just rebuilt a synchronous call with extra steps and lost the decoupling.
Reaching for a stream when you needed a queue. Kafka/Kinesis are powerful and operationally heavy. If you just need background jobs, a plain queue (SQS) is simpler, cheaper, and enough.
Forgetting ordering isn't free. Standard queues and pub/sub don't guarantee order. If OrderShipped can arrive before OrderPlaced, either use a FIFO/partitioned option or make consumers tolerant of out-of-order events.
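For the last point, one way a consumer could tolerate out-of-order events is to remember a shipment that arrives early and apply it once the matching order shows up. This is a sketch of one possible approach, with a hypothetical `OrderTracker` class and in-memory state assumed for illustration.

```python
from typing import Dict, Set


class OrderTracker:
    def __init__(self) -> None:
        self.status: Dict[str, str] = {}
        self._early_shipments: Set[str] = set()

    def handle(self, event: Dict[str, str]) -> None:
        order_id = event["order_id"]
        if event["type"] == "OrderPlaced":
            if order_id in self._early_shipments:
                # The shipment overtook the order: apply it now.
                self._early_shipments.discard(order_id)
                self.status[order_id] = "shipped"
            else:
                # setdefault keeps a duplicate OrderPlaced from undoing "shipped".
                self.status.setdefault(order_id, "placed")
        elif event["type"] == "OrderShipped":
            if order_id in self.status:
                self.status[order_id] = "shipped"
            else:
                self._early_shipments.add(order_id)  # hold until the order arrives


tracker = OrderTracker()
tracker.handle({"type": "OrderShipped", "order_id": "order-1"})  # arrives first
status_before_placed = tracker.status.get("order-1")
tracker.handle({"type": "OrderPlaced", "order_id": "order-1"})
tracker.handle({"type": "OrderPlaced", "order_id": "order-1"})   # duplicate delivery
```

The tracker reaches the same final state whichever event arrives first, and a redelivered OrderPlaced does not roll the order back.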
When event-driven beats a synchronous call
Event-driven isn't free: you trade the simplicity of a function call for eventual consistency, harder debugging, and new infrastructure. So don't make *everything* an event. Reach for it when the trade pays off:
Multiple consumers care about the same thing: fan-out beats calling each one yourself.
The reaction can happen later: receipts, analytics, and indexing don't need to finish before you answer the user.
You want failure isolation: a down consumer shouldn't take down the producer.
Load is spiky: a queue absorbs bursts and lets workers drain at their own pace.
Keep it synchronous when the caller genuinely needs the answer *now* to continue, such as reading a user's profile to render a page, or checking inventory before confirming a price. A request that can't proceed without the response shouldn't be fire-and-forget.
The whole article in seven lines
Event-driven = services emit and react to facts, instead of commanding each other directly.
Think newsroom (publish, anyone subscribes), not phone tree (call everyone, wait).
Queue = work split across workers. Pub/Sub = broadcast a copy to all. Stream = replayable ordered log.
Fan-out (one event, many consumers) is the superpower: add consumers without touching the producer.
Delivery is at-least-once, so duplicates will happen: make every consumer idempotent.
Always add a dead-letter queue so a poison message can't block the line.
Use events for fan-out, deferrable work, and failure isolation; stay synchronous when the caller needs the answer to continue.
Where to go next
You now have the vocabulary and the mental model. The fastest way to make it stick is to wire a real producer and consumer, then design a system where events change how it scales.
See how decoupling shapes growth: Scalability Principles shows why queues and fan-out are the backbone of systems that scale.
Ready to build the rest of the stack? The Cloud Engineer path walks you from networking to compute to event-driven systems, level by level.