Testing Backend Services: A Practical Strategy for APIs
Stop arguing about coverage numbers. Here is a concrete testing strategy for backend APIs: the pyramid vs the trophy, what to mock and what not to, real-DB integration tests with Testcontainers, contract tests for microservices, and how to kill flaky tests.
You ship a REST or gRPC service, you have *some* tests, and you are not sure they are the right ones. Maybe coverage is 85% but bugs still reach production. Maybe CI is red half the time for reasons no one understands. This article gives you a strategy: which tests to write, where to put the effort, what to mock, and how to make the suite trustworthy. Junior-friendly, but the trade-offs are real.
A test suite has exactly one job: tell you the truth about whether your code works. A green suite that misses real bugs is worse than no suite, because it gives false confidence. The most common failure mode in backend testing is a thousand unit tests that mock everything, so they pass even when the database, the serializer, and the HTTP layer are all broken.
The fix is not *more* tests. It is the *right shape* of tests, placed where they catch the bugs you actually ship. Let us build that shape from the ground up.
The mental model: bench-test the parts, road-test the car
A good test suite is a set of bets about where bugs live, and you should put your money where the failures actually happen.
| In the workshop | In your test suite |
| --- | --- |
| Bench-testing a single piston on a rig | Unit test: one pure function, no I/O, microseconds to run |
| Bolting the engine to a test stand and running it | Integration test: real DB, real repository, wired together |
| Two factories agreeing on the bolt-pattern spec | Contract test: producer and consumer agree on the API shape |
| Driving the finished car on a real road | End-to-end test: full system, real HTTP, real dependencies |
Different test levels answer different questions; none of them replaces the others.
A piston that passes on the bench can still fail in the assembled engine: bolted to the wrong torque, fed the wrong fuel. That gap between *the part works* and *the whole works* is exactly where integration tests live, and it is the level most teams under-invest in.
The pyramid and the trophy
The classic test pyramid says: lots of fast unit tests at the bottom, fewer integration tests in the middle, a handful of slow end-to-end tests at the top. Cheap and fast where you can, expensive and slow only where you must.
The pyramid (bottom-heavy on units) vs the trophy (weighted toward integration): same levels, different emphasis.
The testing trophy (popularized by Kent C. Dodds) reweights this. Types and static analysis form a free base; then it argues integration tests give the best confidence-per-dollar, so the *middle* should be the fattest layer, not the bottom. For backend services this is usually right: most real bugs live at the seams between your code and the database, not inside a single pure function.
| Level | Speed | Confidence | Typical count |
| --- | --- | --- | --- |
| Static / types | Instant | Low (shape only) | Whole codebase |
| Unit | Microseconds | Low–medium | Hundreds |
| Integration | 100 ms–2 s | High | Dozens–hundreds |
| Contract | Fast | High at boundaries | Per consumer |
| E2E | Seconds–minutes | Highest | A handful |
Each level trades speed for confidence. Count and effort should follow the trophy: weight the middle.
1. Let types catch the dumb stuff. A TypeScript compile or a Python type-check rejects whole classes of bugs before a single test runs. This is your free first gate.
2. Unit-test the logic-dense, I/O-free code. Pricing rules, validators, parsers, state machines. Pure in, pure out, fast and deterministic.
3. Integration-test against a real database. Spin up real Postgres, run real migrations, hit the real repository. This is where the confidence is.
4. Contract-test every service boundary. If another team calls you, lock the shape with a contract so you cannot break them silently.
5. E2E-test only the critical user journeys. Sign up, check out, the one flow that must never break. A few, run in CI, kept ruthlessly stable.
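The first step above (letting types catch mistakes) can be sketched with an exhaustive `switch`. The `PaymentState` type and `describePayment` function are hypothetical names invented for this illustration; the point is only that adding a new state without handling it stops the code from type-checking, before any test runs.

```typescript
// Hypothetical domain type: every state a payment can be in.
type PaymentState =
  | { kind: "pending" }
  | { kind: "settled"; settledAt: string }
  | { kind: "failed"; reason: string };

// Accepts only `never`, so calling it with a real value is a compile error.
function assertNever(value: never): never {
  throw new Error(`Unhandled case: ${JSON.stringify(value)}`);
}

export function describePayment(state: PaymentState): string {
  switch (state.kind) {
    case "pending":
      return "Payment is pending";
    case "settled":
      return `Payment settled at ${state.settledAt}`;
    case "failed":
      return `Payment failed: ${state.reason}`;
    default:
      // If a new `kind` is added to PaymentState and not handled above,
      // `state` is no longer `never` here and this line fails to type-check.
      return assertNever(state);
  }
}
```

No test is needed to catch a forgotten case here; the compiler reports it at the `default` branch.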
Four levels, side by side
Before writing code, get crisp on what each level *is for*. The biggest waste in backend testing is writing an expensive integration test for something a unit test covers, or mocking the database in a test whose whole point is to exercise the database.
| Type | Scope | Speed | What it catches | What you mock |
| --- | --- | --- | --- | --- |
| Unit | One function/class | Microseconds | Logic errors, edge cases | Everything external |
| Integration | Code + real DB/queue | 100 ms–2 s | SQL, mapping, migrations, transactions | 3rd-party APIs only |
| Contract | API shape between services | Fast | Breaking changes at boundaries | The other service |
| E2E | The whole running system | Seconds+ | Wiring, config, deploy issues | Almost nothing |
Pick the cheapest level that actually catches the bug you are worried about.
Rule of thumb
If the bug you fear is in *your logic*, write a unit test. If it is in *how your code talks to the database*, write an integration test. If it is in *what you promise another service*, write a contract test. Match the test to the failure.
A unit test: pure logic, no I/O
Unit tests shine on code that is all decision and no I/O. Here is a discount calculator: no database, no network, just rules. Notice that each test names one specific behavior, and the last one pins an edge case.
pricing.ts
typescript
export interface Cart {
  subtotalCents: number;
  couponCode?: string;
  isFirstOrder: boolean;
}

// Pure function: same input -> same output, no side effects.
export function discountCents(cart: Cart): number {
  let discount = 0;
  if (cart.couponCode === "SAVE10") {
    discount += Math.round(cart.subtotalCents * 0.1);
  }
  if (cart.isFirstOrder) {
    discount += 500; // flat $5 welcome credit
  }
  // Never discount more than the cart is worth.
  return Math.min(discount, cart.subtotalCents);
}
pricing.test.ts
typescript
import { describe, it, expect } from "vitest";
import { discountCents } from "./pricing";

describe("discountCents", () => {
  it("applies a 10% coupon", () => {
    expect(discountCents({ subtotalCents: 10_000, isFirstOrder: false, couponCode: "SAVE10" }))
      .toBe(1_000);
  });

  it("stacks the first-order credit on top of the coupon", () => {
    expect(discountCents({ subtotalCents: 10_000, isFirstOrder: true, couponCode: "SAVE10" }))
      .toBe(1_500);
  });

  it("never discounts more than the cart total", () => {
    // $3 cart, $5 welcome credit -> capped at $3.
    expect(discountCents({ subtotalCents: 300, isFirstOrder: true }))
      .toBe(300);
  });
});
These run in microseconds and never flake, because there is nothing to flake: no clock, no network, no shared state. That is the gold standard. The trick is keeping your logic *in functions like this* so it is unit-testable, instead of tangled inside a route handler.
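One way to keep logic out of the route handler is to make the handler a thin shell around pure functions. The sketch below is framework-free and uses hypothetical names (`parseCart`, `handleQuote`, `HttpResult`); the pricing function is passed in as a parameter, so in a real service you would hand it `discountCents`.

```typescript
// Same shape as the Cart in pricing.ts, repeated so this sketch stands alone.
interface Cart {
  subtotalCents: number;
  couponCode?: string;
  isFirstOrder: boolean;
}

// Pure validator: unknown JSON in, a typed Cart or null out. No I/O.
export function parseCart(body: unknown): Cart | null {
  if (typeof body !== "object" || body === null) return null;
  const { subtotalCents, isFirstOrder, couponCode } = body as Record<string, unknown>;
  if (typeof subtotalCents !== "number" || !Number.isInteger(subtotalCents) || subtotalCents < 0) {
    return null;
  }
  if (typeof isFirstOrder !== "boolean") return null;
  if (couponCode !== undefined && typeof couponCode !== "string") return null;
  return { subtotalCents, isFirstOrder, couponCode };
}

// Hypothetical framework-neutral response shape.
interface HttpResult {
  status: number;
  body: unknown;
}

// Thin shell: parse, delegate to pure logic, shape the response.
// Everything worth unit-testing lives in parseCart and in `price`.
export function handleQuote(body: unknown, price: (cart: Cart) => number): HttpResult {
  const cart = parseCart(body);
  if (cart === null) return { status: 400, body: { error: "invalid cart" } };
  return { status: 200, body: { discountCents: price(cart) } };
}
```

With this split, the validator and the pricing rules get fast unit tests, and the handler itself has almost nothing left to break.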
An integration test: real Postgres via Testcontainers
Here is the test that earns its keep. Instead of mocking the database, which would test your mock, not your SQL, we spin up a real Postgres in a throwaway Docker container, run real migrations, and exercise the real repository. Testcontainers makes this a few lines: it starts a container, hands you a connection string, and tears it down when the suite ends.
userRepo.integration.test.ts
typescript
import { describe, it, expect, beforeAll, afterAll } from "vitest";
import { PostgreSqlContainer, StartedPostgreSqlContainer } from "@testcontainers/postgresql";
import { Pool } from "pg";
import { UserRepository } from "./userRepo";
import { runMigrations } from "./migrate";

let container: StartedPostgreSqlContainer;
let pool: Pool;
let repo: UserRepository;

beforeAll(async () => {
  // Real Postgres in Docker: disposable, isolated, same engine as prod.
  container = await new PostgreSqlContainer("postgres:16-alpine").start();
  pool = new Pool({ connectionString: container.getConnectionUri() });
  await runMigrations(pool); // exercise the SAME migrations prod runs
  repo = new UserRepository(pool);
}, 60_000); // pulling the image the first time is slow

afterAll(async () => {
  await pool?.end();
  await container?.stop();
});

describe("UserRepository (real DB)", () => {
  it("persists and reads back a user", async () => {
    const created = await repo.create({ email: "ada@example.com", name: "Ada" });
    const found = await repo.findById(created.id);
    expect(found?.email).toBe("ada@example.com");
  });

  it("enforces the unique email constraint", async () => {
    await repo.create({ email: "dup@example.com", name: "One" });
    // This only passes if the real UNIQUE index exists; a mock would never catch its absence.
    await expect(repo.create({ email: "dup@example.com", name: "Two" }))
      .rejects.toThrow(/unique|duplicate/i);
  });
});
Why this beats a mocked DB
The unique-constraint test passes only if the real index exists, the migration ran, and your code maps the driver error correctly. A mocked repository would happily return success and ship the bug. This is the difference between testing your code and testing your assumptions about your code. For the transactional behavior these tests rely on, see [database transactions and consistency](/blog/database-transactions-and-consistency).
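The "maps the driver error correctly" part can look roughly like the sketch below. PostgreSQL reports a unique violation with SQLSTATE `23505`, and node-postgres exposes that value as a `code` property on the thrown error. `DuplicateEmailError` and `toDomainError` are hypothetical names; the article does not show how `UserRepository` handles errors, so treat this as one possible shape.

```typescript
// SQLSTATE for unique_violation in PostgreSQL.
const UNIQUE_VIOLATION = "23505";

// Hypothetical domain error the rest of the service can catch by type.
export class DuplicateEmailError extends Error {
  constructor(email: string) {
    super(`duplicate email: ${email}`);
    this.name = "DuplicateEmailError";
  }
}

// Narrow an unknown thrown value to "an object with a string `code`".
function hasCode(err: unknown): err is { code: string } {
  return (
    typeof err === "object" &&
    err !== null &&
    typeof (err as { code?: unknown }).code === "string"
  );
}

// Translate a driver error into a domain error; anything else passes through unchanged.
export function toDomainError(err: unknown, email: string): unknown {
  if (hasCode(err) && err.code === UNIQUE_VIOLATION) {
    return new DuplicateEmailError(email);
  }
  return err;
}
```

Inside a repository method this would be used as `catch (e) { throw toDomainError(e, email); }`. The real-database test is what proves the driver actually produces that code for your schema.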
Keep tests isolated from each other: wrap each test in a transaction you roll back, or truncate tables in a beforeEach. Sharing a container across the file is fine and fast; sharing *state* between tests is how you get order-dependent flakes. To get comfortable running Postgres in a container locally, try the docker lab.
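A minimal sketch of the transaction-per-test idea, assuming the code under test can be handed a single connection. `withRollback` and the `Queryable` interface are hypothetical; a node-postgres `PoolClient` obtained from `pool.connect()` has a compatible `query(text)` method.

```typescript
// The minimal surface this helper needs, kept as an interface so the
// sketch stays self-contained and does not import a driver.
interface Queryable {
  query(sql: string): Promise<unknown>;
}

// Run `body` inside a transaction that is always rolled back,
// so nothing the test writes is visible to the next test.
export async function withRollback<T>(
  client: Queryable,
  body: (tx: Queryable) => Promise<T>,
): Promise<T> {
  await client.query("BEGIN");
  try {
    return await body(client);
  } finally {
    // Runs whether the test body passed or threw.
    await client.query("ROLLBACK");
  }
}
```

A transaction belongs to one connection, so this only works if the repository under test runs its queries on that same client rather than on the pool. If your repository takes a `Pool`, truncating tables in a `beforeEach` is the simpler option.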
A contract test: don't break your consumers
In a microservice world, the scary bugs are at the *boundaries*: you rename a JSON field, your tests stay green, and another team's service breaks in production. Consumer-driven contract testing (the Pact approach) flips the control: the *consumer* declares exactly what it needs, that expectation becomes a contract, and the *provider* is tested against it.
orderClient.pact.test.ts
typescript
import { describe, it, expect } from "vitest";
import { PactV3, MatchersV3 } from "@pact-foundation/pact";
import { getUser } from "./userClient";

const { like, integer } = MatchersV3;

const provider = new PactV3({
  consumer: "orders-service",
  provider: "users-service",
});

describe("users-service contract", () => {
  it("returns the fields orders-service depends on", async () => {
    provider
      .given("user 42 exists")
      .uponReceiving("a request for user 42")
      .withRequest({ method: "GET", path: "/users/42" })
      .willRespondWith({
        status: 200,
        // Match on TYPE and SHAPE, not exact values; we care about the contract.
        body: like({ id: integer(42), email: like("ada@example.com") }),
      });

    await provider.executeTest(async (mock) => {
      const user = await getUser(mock.url, 42);
      expect(user.email).toBeDefined(); // consumer only needs id + email
    });
  });
});
Running this generates a pact file describing the consumer's expectations. The provider's CI then verifies it can satisfy every published pact before it deploys. If the users-service tries to remove email, *its* build goes red, not the orders-service's production traffic. The contract becomes a tripwire on the boundary.
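To see why removing `email` turns the provider's build red, it can help to picture the check as a shape comparison. The toy below is not how Pact is implemented and covers only flat objects with primitive fields; `Shape`, `userContract`, and `contractViolations` are hypothetical names for this illustration.

```typescript
// A contract reduced to its simplest form: field name -> expected primitive type.
type Shape = { [field: string]: "string" | "number" | "boolean" };

// What orders-service needs from GET /users/:id.
const userContract: Shape = { id: "number", email: "string" };

// Returns the list of violations; an empty list means the response satisfies the contract.
export function contractViolations(
  shape: Shape,
  response: Record<string, unknown>,
): string[] {
  const problems: string[] = [];
  for (const [field, expected] of Object.entries(shape)) {
    if (!(field in response)) {
      problems.push(`missing field "${field}"`);
    } else if (typeof response[field] !== expected) {
      problems.push(`field "${field}" should be ${expected}`);
    }
  }
  // Extra fields in the response are allowed: the consumer never reads them.
  return problems;
}
```

A provider response of `{ id: 7, name: "Ada" }` yields one violation, `missing field "email"`, while adding new fields yields none. That asymmetry is what lets the provider evolve freely as long as it keeps what consumers declared.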
Contract tests are not E2E
A contract test never runs both services together; it tests each side against the agreed shape, in isolation. That is why it is fast and stable while still guarding the integration. Use a broker (PactFlow or a self-hosted Pact Broker) to share pacts between the two CI pipelines.
Testing async workers and queues
Queue workers are notoriously under-tested because the result is not in the HTTP response; it is a side effect that happens *later*. Two layers cover them. First, unit-test the handler directly with a fake message; it is just a function. Second, integration-test the round trip against a real broker (Testcontainers runs RabbitMQ, Kafka, or Redis the same way it ran Postgres).
emailWorker.test.ts
typescript
import { describe, it, expect, vi } from "vitest";
import { handleOrderPlaced } from "./emailWorker";

it("sends one confirmation email per order event", async () => {
  // The email provider is a 3rd-party boundary -> fake it.
  const mailer = { send: vi.fn().mockResolvedValue({ id: "msg_1" }) };

  await handleOrderPlaced({ orderId: "o_99", email: "ada@example.com" }, { mailer });

  expect(mailer.send).toHaveBeenCalledOnce();
  expect(mailer.send).toHaveBeenCalledWith(
    expect.objectContaining({ to: "ada@example.com" }),
  );
});

it("is idempotent: replaying the same event sends nothing twice", async () => {
  const mailer = { send: vi.fn().mockResolvedValue({ id: "msg_1" }) };
  const event = { orderId: "o_99", email: "ada@example.com" };

  await handleOrderPlaced(event, { mailer });
  await handleOrderPlaced(event, { mailer }); // at-least-once delivery happens!

  expect(mailer.send).toHaveBeenCalledOnce(); // dedupe must hold
});
Test idempotency explicitly
Most real queues deliver at-least-once, so duplicate messages will happen sooner or later. The second test above is the one that saves you at 2am: it proves that replaying an event does not send a second email, and the same pattern protects you from double-charging. If your worker is not idempotent, this test is how you find out before production does.
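The article does not show how `handleOrderPlaced` dedupes, so here is one possible shape under stated assumptions: a `ProcessedStore` dependency (hypothetical, not in the tests above) that lets the handler claim an event key exactly once. The in-memory version is only for illustration; a real worker would need a durable store, such as a table with a unique key.

```typescript
interface OrderPlaced {
  orderId: string;
  email: string;
}

interface Mailer {
  send(message: { to: string; subject: string }): Promise<unknown>;
}

// Hypothetical dedupe store.
interface ProcessedStore {
  // Resolves to true only the first time a key is claimed.
  claim(key: string): Promise<boolean>;
}

// Illustration only: process memory is lost on restart.
export class InMemoryProcessedStore implements ProcessedStore {
  private readonly seen = new Set<string>();

  async claim(key: string): Promise<boolean> {
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}

export async function handleOrderPlaced(
  event: OrderPlaced,
  deps: { mailer: Mailer; processed: ProcessedStore },
): Promise<void> {
  // Claim first: a replayed event finds the key taken and returns early.
  const isFirstDelivery = await deps.processed.claim(`order-placed:${event.orderId}`);
  if (!isFirstDelivery) return;
  await deps.mailer.send({ to: event.email, subject: "Order confirmed" });
}
```

Claiming before sending trades one failure for another: a crash between the claim and the send loses the email instead of duplicating it. Which side to err on depends on the side effect, and it is worth a test of its own.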
What to mock, and what never to
This is the debate that derails the most code reviews. The honest answer: mock at boundaries you do not own; use the real thing for boundaries you do. A *fake* (a working in-memory stand-in) is usually better than a *mock* (a scripted expectation), because a fake tests behavior while a mock tests that you called it a certain way, which couples your test to your implementation.
| Dependency | Mock it? | Why |
| --- | --- | --- |
| Your own database | No, use a real one | Testcontainers makes it cheap; mocks hide SQL bugs |
| 3rd-party payment / email API | Yes, fake or mock | Slow, costs money, you don't control it |
| Another internal service | Use a contract | Pact guards the shape without running both |
| The system clock | Yes, inject it | Real time makes tests flaky and non-deterministic |
| Randomness / UUIDs | Yes, seed or inject | Determinism beats realism in tests |
The default. Deviate only with a reason you can say out loud.
Do not mock your own database
A mocked DB tests your understanding of the database, not the database. Constraints, cascades, transaction isolation, JSON column quirks, and migration drift all slip through. If a test's whole purpose is the persistence layer, mocking the persistence layer makes the test meaningless.
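For the boundaries you do fake, the fake-versus-mock distinction from the start of this section looks like this in practice. `InMemoryMailer` and `welcomeUser` are hypothetical names; the sketch assumes the email provider is wrapped behind a small `Mailer` interface that you own.

```typescript
// The interface your code depends on, owned by you, not by the email vendor.
interface Mailer {
  send(message: { to: string; subject: string }): Promise<void>;
}

// A fake: a small working implementation that records what it "sent"
// and enforces one rule a real provider would also enforce.
export class InMemoryMailer implements Mailer {
  readonly outbox: Array<{ to: string; subject: string }> = [];

  async send(message: { to: string; subject: string }): Promise<void> {
    if (!message.to.includes("@")) {
      throw new Error(`invalid address: ${message.to}`);
    }
    this.outbox.push(message);
  }
}

// Hypothetical code under test.
export async function welcomeUser(email: string, mailer: Mailer): Promise<void> {
  await mailer.send({ to: email, subject: "Welcome!" });
}
```

A test then asserts on `mailer.outbox`, which is the observable outcome, instead of on how many times `send` was called. If `welcomeUser` is later refactored to batch or retry, the test keeps passing as long as the right email ends up in the outbox.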
Killing flaky tests
A flaky test, one that passes and fails on identical code, is more dangerous than a missing test, because it trains your team to ignore red CI. Once people start re-running until green, the suite has stopped telling the truth. Almost every flake comes from one of three sources.
Time. Tests that depend on Date.now(), sleep, timeouts, or ordering-by-timestamp. Fix: inject the clock and control it. Never sleep to wait; poll for the condition instead.
Ordering. Tests that pass alone but fail in a suite because they assume the order they run in. Fix: make each test set up and tear down its own state; never rely on a previous test's leftovers.
Shared state. A reused database row, a global singleton, a cached connection, a leaked port. Fix: isolate with transaction-per-test rollback, fresh fixtures, and unique identifiers per test run.
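Injecting the clock, the fix for the first source above, can be as small as the sketch below. `Clock`, `FixedClock`, and `isExpired` are hypothetical names; the idea is that production code defaults to the system clock while tests pass one they control.

```typescript
// A clock is just "something that can tell the time".
interface Clock {
  now(): Date;
}

// Production default: the real system time.
const systemClock: Clock = { now: () => new Date() };

// Test double whose time only moves when the test says so.
export class FixedClock implements Clock {
  private current: Date;

  constructor(start: Date) {
    this.current = start;
  }

  now(): Date {
    // Return a copy so callers cannot mutate the clock's internal state.
    return new Date(this.current.getTime());
  }

  advance(ms: number): void {
    this.current = new Date(this.current.getTime() + ms);
  }
}

// Hypothetical time-dependent logic: a token expires at or after `expiresAt`.
export function isExpired(expiresAt: Date, clock: Clock = systemClock): boolean {
  return clock.now().getTime() >= expiresAt.getTime();
}
```

A test creates a `FixedClock`, checks `isExpired` just before the deadline, calls `advance`, and checks again. No real time passes, so the test cannot be slow and cannot race.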
Quarantine, then fix
When a flake appears, tag it and move it out of the blocking path immediately so it stops eroding trust, then fix it on a deadline. A flaky test left in the gate teaches everyone to ignore failures, which defeats the entire point of having tests.
What coverage actually buys you
Coverage measures which lines *ran* during tests, not whether anything was *asserted*. You can hit 100% coverage with zero expect calls. So treat coverage as a smoke detector, not a goal: a sudden drop tells you a new code path is untested; a high number tells you almost nothing about quality.
Coverage tells you what you did NOT test. It cannot tell you whether what you did test was worth testing.
Chasing a coverage *target* produces low-value tests written to color in lines: getters, trivial mappers, generated code. Better signals: does the suite catch a deliberately introduced bug (mutation testing), and do the critical paths have integration coverage? A 70% suite weighted toward integration on the money-making flows beats a 95% suite of mocked unit trivia every time.
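The mutation-testing idea can be shown by hand. In the sketch below, `shipsFree` is a hypothetical rule and `shipsFreeMutant` is the same rule with one operator changed, which is the kind of variant a mutation-testing tool generates automatically. The two "tests" are written as plain functions so the example runs without a test framework.

```typescript
// Original rule: orders of $50.00 or more ship free.
export function shipsFree(subtotalCents: number): boolean {
  return subtotalCents >= 5_000;
}

// A mutant: identical except that >= became >.
export function shipsFreeMutant(subtotalCents: number): boolean {
  return subtotalCents > 5_000;
}

type Rule = (subtotalCents: number) => boolean;

// Weak test: executes every line of the rule (full line coverage)
// but never probes the boundary value.
export function weakTestPasses(rule: Rule): boolean {
  return rule(10_000) === true && rule(100) === false;
}

// Strong test: also pins the boundary, which is where the mutant differs.
export function strongTestPasses(rule: Rule): boolean {
  return weakTestPasses(rule) && rule(5_000) === true;
}
```

Both tests give the original rule 100% line coverage. Only the strong one fails when run against the mutant, so only the strong one would have caught that off-by-one in a real change.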
Common mistakes that cost hours
Mocking the database in tests whose whole point is persistence: you end up testing the mock.
No integration layer at all: hundreds of mocked unit tests, green CI, and bugs at every seam.
Asserting on implementation, not behavior: checking *that a method was called* instead of *what the system did*. Refactors break these tests even when behavior is correct.
`sleep` instead of polling: fixed waits are slow when long and flaky when short. Poll for the condition.
Shared mutable fixtures: one giant seed file every test depends on; change it and 50 tests break for unrelated reasons.
E2E for everything: pushing logic checks up to the slowest, flakiest level because the unit seam was never built.
Treating coverage % as the goal: writing tests to color in lines instead of to catch bugs.
Ignoring idempotency on workers: at-least-once delivery means duplicates; if you never test a replay, production finds the bug.
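For the `sleep` versus polling item, a polling helper is only a few lines. `waitFor` here is a hypothetical helper written for this sketch, with assumed defaults of a 2-second timeout and a 20 ms interval; tune both to your suite.

```typescript
const delay = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Poll `condition` until it returns true, or fail once the timeout elapses.
export async function waitFor(
  condition: () => boolean | Promise<boolean>,
  opts: { timeoutMs?: number; intervalMs?: number } = {},
): Promise<void> {
  const { timeoutMs = 2_000, intervalMs = 20 } = opts;
  const deadline = Date.now() + timeoutMs;
  while (true) {
    // Return as soon as the condition holds: fast tests stay fast.
    if (await condition()) return;
    if (Date.now() >= deadline) {
      throw new Error(`condition not met within ${timeoutMs}ms`);
    }
    await delay(intervalMs);
  }
}
```

Compared with a fixed `sleep(500)`, this returns within one interval of the condition becoming true, and it fails with a clear message instead of letting a later assertion fail for a confusing reason.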
Takeaways
The whole article in nine lines
A test suite has one job: tell the truth about whether your code works.
Follow the trophy, not just the pyramid, and weight the middle: integration tests give the best confidence per dollar.
Unit-test pure logic; integration-test how your code talks to the DB; contract-test what you promise other services; E2E only critical journeys.
Use a real Postgres via Testcontainers; never mock your own database.
Mock boundaries you do not own (payments, email, the clock); prefer fakes over mocks.
Consumer-driven contracts (Pact) catch breaking changes without running both services together.
Test worker idempotency explicitly: with at-least-once delivery, duplicates will arrive.
Kill flakes at the source: time, ordering, shared state. Quarantine, then fix.
Coverage is a smoke detector, not a goal: it shows what you did NOT test, not whether your tests are worth running.
Where to go next
Testing is one half of a feedback loop; the other half is running it automatically on every change. Wire your unit, integration, and contract tests into a pipeline so a broken seam goes red before it merges, and add security scanning while you are there.
Run your suite on every push in the CI/CD lab, and gate merges on green.
Get comfortable spinning up real dependencies in the docker lab, the foundation Testcontainers builds on.
Start small: pick your most important endpoint, write one integration test against a real database, and put it in CI today. One honest test that exercises the real seam is worth more than a hundred mocks that confirm what you already believed.