Working with the LLM API

On this page

You have an API key. Now what?
The one mental model: it is just a function over a list
The request/response lifecycle
The four message roles
A basic chat request
Streaming the response
Getting structured JSON output
A first look at tool / function calling
Errors, retries, and rate limits
Common mistakes that cost hours
Takeaways
Where to go next

TL;DR

Treat the LLM API as a function over a message list. You will walk away able to send chat requests, stream responses, force structured JSON, wire up tool calling, and handle the errors and rate limits that bite in production.

You have an API key. Now what?

Every demo makes it look easy: one line of code, the model replies, the crowd cheers. Then you open the docs and drown in roles, streaming chunks, tool schemas, token counts, and a 429 error you have never seen before. The chatbot you typed into hid all of this. The API does not.

Here is the good news: under the surface, almost every modern LLM provider speaks the same dialect. Learn the shape once and you can often move between providers by swapping a base URL and a model name. This article teaches that shape: the handful of moves you will use in almost every AI feature you build.

Who this is for

Engineers who can write a bit of Python and have made an HTTP request before, but have never called an LLM from code. By the end you will be able to send a chat request, stream the reply, force JSON output, let the model call a function, and handle the errors that show up in production. No ML background is required; we treat the model as a service you call.

The one mental model: it is just a function over a list

An LLM API call is a pure-ish function: you hand it the entire conversation so far, and it returns the next message. It remembers nothing between calls, so you resend the history every time.

That last sentence trips up everyone. There is no session, no server-side memory of your last question. You are the database. The "conversation" is just an array you keep appending to and resending. Once that clicks, the rest is detail.

Analogy	What it maps to
A waiter who forgets you the moment they leave the table	The model is stateless: it only knows what is in the messages you sent in this call
Reading the whole email thread before replying	You resend the full message history on every request
A house style guide pinned above the desk	The system role: standing instructions that shape every reply
A specialist you can phone mid-task	Tool / function calling: the model asks you to run code, and you hand back the result

Mapping the API onto things you already understand.
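To make the "you are the database" idea concrete, here is a minimal sketch of one conversation turn. The names chat_turn and fake_model are hypothetical, and fake_model is only a stand-in for a real API call so the snippet runs without a key or a network connection.

```python
from typing import Callable

Message = dict[str, str]


def chat_turn(
    history: list[Message],
    user_text: str,
    call_model: Callable[[list[Message]], str],
) -> str:
    """Append the user turn, send the WHOLE history, append the reply."""
    history.append({"role": "user", "content": user_text})
    reply = call_model(history)  # the model sees only what is in `history`
    history.append({"role": "assistant", "content": reply})
    return reply


def fake_model(messages: list[Message]) -> str:
    """Hypothetical stand-in for a real API call: reports what it was sent."""
    user_turns = [m for m in messages if m["role"] == "user"]
    return f"I can see {len(user_turns)} user message(s)."


history: list[Message] = [{"role": "system", "content": "You are concise."}]
first = chat_turn(history, "Hello", fake_model)
second = chat_turn(history, "Still there?", fake_model)
print(first)
print(second)
```

The second reply only "knows" about the first question because the caller kept it in the list and sent it again. Drop the list and the model starts from nothing.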

The request/response lifecycle

Before any code, hold the picture in your head. Your app builds a messages array, sends it to the API, and gets tokens back, usually streamed one chunk at a time. Sometimes the reply is plain text. Sometimes the model branches: instead of answering, it asks you to run a tool, you run it, and you loop the result back in.

One request, two possible endings: a streamed text answer, or a tool-call detour that loops back.

1. Assemble the messages. Start with a system message (the rules), add the running history, then append the new user message. This array is the entire input.
2. Send the request. POST the messages, the model name, and parameters like temperature and max_tokens to the chat endpoint with your API key in the header.
3. Receive the reply. Either wait for the full response, or stream it: the server pushes partial tokens so you can render text as it is generated.
4. Branch if needed. If you provided tools and the model decides to use one, it returns a tool-call request instead of text. You run the tool and append its output to the messages.
5. Loop or finish. After a tool result you call the API again with the extended history. When the model returns plain text, you append it and you are done, ready for the next user turn.
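Steps 1 and 2 can be sketched without any SDK at all, which shows how little is actually on the wire. This is a rough illustration, assuming an OpenAI-compatible endpoint: build_chat_request is a hypothetical helper, the default base_url is OpenAI's, and other providers will use a different URL and may use a different auth header.

```python
import json


def build_chat_request(
    api_key: str,
    system_prompt: str,
    history: list[dict],
    user_text: str,
    base_url: str = "https://api.openai.com/v1",  # assumption: OpenAI-style API
    model: str = "gpt-4o-mini",
    temperature: float = 0.2,
    max_tokens: int = 200,
) -> dict:
    """Return the URL, headers, and JSON body for one chat request."""
    # Step 1: system message first, then the history, then the new user turn.
    messages = (
        [{"role": "system", "content": system_prompt}]
        + history
        + [{"role": "user", "content": user_text}]
    )
    # Step 2: everything the server needs travels in this one POST.
    body = {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return {
        "url": f"{base_url}/chat/completions",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "body": json.dumps(body),
    }


request = build_chat_request("sk-test", "Be brief.", [], "Hi")
print(request["url"])
```

Any HTTP client can send the result as a POST. The SDK used in the rest of this article builds the same kind of request for you.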

The four message roles

Every entry in the messages array has a role. There are only four you need, and each has a distinct job. Get the roles right and the model behaves; mix them up and you get confused, off-topic replies.

Role	Purpose
system	Standing instructions and persona, set once at the top: tone, format, rules, constraints. The model treats this as highest-priority context for the whole conversation.
user	What the human said. Each turn the user takes adds one of these. This is the question or instruction the model should respond to.
assistant	What the model said previously. You append the model's replies back into the array so it can see its own prior turns. This is how multi-turn memory works.
tool	The output of a function the model asked to call. You add this after running a tool so the model can use the result to finish its answer.

The message roles and what each one is for.

Pro tip

Keep exactly one system message and keep it first. Need to change the rules mid-conversation? Edit the system message and resend; do not stack five of them.
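One way to enforce that rule is a small helper that rewrites the list instead of appending to it. This is only a sketch, and set_system_message is a hypothetical name rather than part of any SDK.

```python
def set_system_message(messages: list[dict], text: str) -> list[dict]:
    """Return a copy of the history with exactly one system message, first."""
    # Drop every existing system message, wherever it sits in the list.
    rest = [m for m in messages if m["role"] != "system"]
    return [{"role": "system", "content": text}] + rest


messages = [
    {"role": "system", "content": "Answer in one sentence."},
    {"role": "user", "content": "What is a token?"},
    {"role": "system", "content": "Actually, answer in French."},  # stacked by mistake
]
messages = set_system_message(messages, "Answer in one sentence, in French.")
print([m["role"] for m in messages])
```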

A basic chat request

Here is the whole thing in Python, using the OpenAI-style SDK that most providers are compatible with. The same code points at a different provider by changing base_url and model.

chat.py

python

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

messages = [
    {"role": "system", "content": "You are a concise assistant. Answer in one sentence."},
    {"role": "user", "content": "What is a vector database, in plain English?"},
]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    temperature=0.2,   # lower = more deterministic
    max_tokens=200,    # cap the reply length (and your bill)
)

reply = response.choices[0].message.content
print(reply)

# Crucial: to continue the conversation, append the reply
# and the next user turn, then call create() again.
messages.append({"role": "assistant", "content": reply})

# Token accounting comes back on every response:
usage = response.usage
print(f"prompt={usage.prompt_tokens} "
      f"completion={usage.completion_tokens} "
      f"total={usage.total_tokens}")

Two things to internalize. First, the reply lives at choices[0].message.content, a nested path because the API can return multiple choices. Second, you pay per token, both for what you send (prompt_tokens) and what you get back (completion_tokens). A token is roughly 4 characters of English, or about ¾ of a word. Long histories get expensive fast, which is why trimming old turns matters.
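Trimming can be sketched with the 4-characters-per-token rule of thumb from above. Both estimate_tokens and trim_history are hypothetical helpers, and the estimate is deliberately crude; for real budgets, count with your provider's tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token for English."""
    return max(1, len(text) // 4)


def trim_history(messages: list[dict], budget: int) -> list[dict]:
    """Keep the system message plus the newest turns that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    used = sum(estimate_tokens(m["content"]) for m in system)
    kept: list[dict] = []
    for message in reversed(turns):  # walk from newest to oldest
        cost = estimate_tokens(message["content"])
        if used + cost > budget:
            break  # everything older than this is dropped
        kept.append(message)
        used += cost
    return system + list(reversed(kept))


history = [
    {"role": "system", "content": "s" * 20},     # ~5 tokens
    {"role": "user", "content": "a" * 40},       # ~10 tokens, oldest turn
    {"role": "assistant", "content": "b" * 40},  # ~10 tokens
    {"role": "user", "content": "c" * 40},       # ~10 tokens, newest turn
]
trimmed = trim_history(history, budget=26)
print([m["role"] for m in trimmed])
```

This version keeps the system message unconditionally and drops the oldest turns first. It does not try to keep user/assistant pairs or tool-call/tool-result pairs together, which a production version would need to do.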

Streaming the response

Waiting ten seconds for a full paragraph feels broken. Streaming flips a flag and the server pushes tokens as they are generated, so text appears word by word, which is the experience you know from every chat UI. You assemble the chunks yourself.

stream.py

python

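from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# In a real app this is the running history from the previous example.
messages = [
    {"role": "user", "content": "What is a vector database, in plain English?"},
]

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    stream=True,   # the only change
)

full_reply = []
for chunk in stream:
    if not chunk.choices:    # some chunks arrive with an empty choices list
        continue
    delta = chunk.choices[0].delta.content
    if delta:                # some chunks carry no text (role, finish)
        print(delta, end="", flush=True)
        full_reply.append(delta)

# Re-join the pieces to store the complete turn in history.
reply = "".join(full_reply)
messages.append({"role": "assistant", "content": reply})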

Watch out

Streaming responses usually do not include a usage/token count on the chunks. If you need the count, either request it explicitly (some providers support a stream option for this) or count tokens yourself with the provider's tokenizer library.
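As a sketch of the "request it explicitly" route: with the OpenAI API, passing stream_options={"include_usage": True} alongside stream=True adds one final chunk whose choices list is empty and whose usage field is filled in. Other providers may not support this option, so treat it as an assumption to verify. The collect_stream helper below is hypothetical, and the fake chunks only mimic the SDK's objects so the snippet runs offline.

```python
from types import SimpleNamespace


def collect_stream(stream):
    """Join streamed text deltas and pick up usage if a chunk carries it."""
    parts = []
    usage = None
    for chunk in stream:
        if getattr(chunk, "usage", None) is not None:
            usage = chunk.usage
        if not chunk.choices:  # the usage-only chunk has no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
    return "".join(parts), usage


def fake_chunk(text=None, usage=None, with_choice=True):
    """Stand-in for an SDK chunk object (shape assumed, for offline use)."""
    choice = SimpleNamespace(delta=SimpleNamespace(content=text))
    return SimpleNamespace(choices=[choice] if with_choice else [], usage=usage)


# With the real SDK the stream would come from something like:
#   client.chat.completions.create(..., stream=True,
#                                  stream_options={"include_usage": True})
fake_stream = [
    fake_chunk(None),      # first chunk: role only, no text
    fake_chunk("Hello"),
    fake_chunk(" world"),
    fake_chunk(usage=SimpleNamespace(total_tokens=12), with_choice=False),
]
reply, usage = collect_stream(fake_stream)
print(reply, usage.total_tokens)
```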

Getting structured JSON output

Free-form prose is great for humans and terrible for code. When the next step is data["price"], you need guaranteed, parseable JSON, not "Sure! Here is the JSON: ..." wrapped in markdown. Modern APIs let you pin the output to a JSON schema, and the model is constrained to produce exactly that shape.

structured.py

python

import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

schema = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
        "confidence": {"type": "number"},
        "keywords": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["sentiment", "confidence", "keywords"],
    "additionalProperties": False,
}

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Classify the review. Reply only with JSON."},
        {"role": "user", "content": "The battery dies in two hours but the screen is gorgeous."},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "review", "schema": schema, "strict": True},
    },
)

# The schema pins the shape, so a complete reply is safe to parse.
data = json.loads(response.choices[0].message.content)
print(data["sentiment"], data["confidence"])

The strict: True flag is what turns a polite request into a hard guarantee: on providers that support it, decoding is constrained so the model cannot emit a token that breaks the schema. Still wrap json.loads in a try/except in production, because a truncated response (you hit max_tokens mid-object) produces incomplete JSON. Want to go deeper on schemas and letting the model call your code? That is its own topic; see Structured Output & Tool Calling.
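A guarded parse might look like the sketch below. It assumes the OpenAI-style finish_reason field, where "length" means the reply was cut off at max_tokens. StructuredOutputError and parse_json_reply are hypothetical names.

```python
import json


class StructuredOutputError(Exception):
    """Raised when the reply cannot be used as structured data."""


def parse_json_reply(content, finish_reason):
    """Parse a model reply as JSON, failing loudly on truncation."""
    if finish_reason == "length":
        # The model ran out of tokens, so the JSON is almost certainly cut off.
        raise StructuredOutputError("reply was truncated at max_tokens")
    if not content:
        raise StructuredOutputError("reply had no content")
    try:
        return json.loads(content)
    except json.JSONDecodeError as exc:
        raise StructuredOutputError(f"invalid JSON: {exc}") from exc


# With the real SDK the two inputs would come from:
#   choice = response.choices[0]
#   parse_json_reply(choice.message.content, choice.finish_reason)
data = parse_json_reply('{"sentiment": "negative", "confidence": 0.7}', "stop")
print(data["sentiment"])
```

Raising a single, specific exception type lets the caller decide what to do next, for example retrying once with a larger max_tokens.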

A first look at tool / function calling

On their own, models cannot check today's weather, query your database, or do reliable arithmetic. Tool calling fixes that: you describe a function with a JSON schema, and when the model needs it, it returns a structured request instead of an answer, such as "call get_weather with {city: 'Amsterdam'}". Your code runs the real function, appends the result as a tool message, and calls the API again. The model weaves the result into a natural answer.

1. Describe the tool. Pass a tools list: each entry is a name, a description, and a JSON schema for its arguments. The description tells the model when to use it.
2. Model decides. If the request needs the tool, the response contains a tool_call (name + JSON arguments) instead of normal content.
3. You execute. Parse the arguments, run your actual function, and capture the return value. The provider never runs your code; you do.
4. Feed it back. Append a message with role tool containing the result, then call the API again. The model now has the data it was missing and writes the final reply.
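The four steps above can be sketched as a single hop. This assumes the OpenAI-style wire format: tools wrapped as {"type": "function", ...}, arguments delivered as a JSON string, and each tool message tied to its request by tool_call_id. The get_weather function, its return values, and the handle_tool_calls helper are hypothetical. The assistant message is written as a plain dict, as in the raw JSON response; the SDK exposes the same fields as object attributes.

```python
import json


def get_weather(city: str) -> dict:
    """Hypothetical tool; a real one would call a weather service."""
    return {"city": city, "temp_c": 18, "conditions": "cloudy"}


# Step 1: describe the tool. This list is passed as tools=TOOLS on the request.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

REGISTRY = {"get_weather": get_weather}


def handle_tool_calls(messages: list[dict], assistant_message: dict) -> list[dict]:
    """Steps 3 and 4: run each requested tool and append the results."""
    messages.append(assistant_message)  # keep the request itself in history
    for call in assistant_message.get("tool_calls") or []:
        name = call["function"]["name"]
        args = json.loads(call["function"]["arguments"])  # arrives as a JSON string
        result = REGISTRY[name](**args)
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],  # ties the result to the request
            "content": json.dumps(result),
        })
    return messages


# Step 2, simulated: what the model's reply looks like when it wants the tool.
assistant_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": '{"city": "Amsterdam"}'},
    }],
}
messages = [{"role": "user", "content": "What is the weather in Amsterdam?"}]
messages = handle_tool_calls(messages, assistant_message)
print(messages[-1]["content"])
```

After this, the extended messages list goes back to the API in a second call, and the model writes the final answer from the tool output.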

Note

Tool calling is the foundation of every "agent." An agent is just this loop running until the model stops asking for tools. Master the single hop here first; the loop is the easy part.

Errors, retries, and rate limits

Your happy-path code works on the first try in a notebook and falls over the moment real traffic hits it. LLM APIs are network calls to a busy, shared service, so failure is normal, not exceptional. Plan for it.

`429 Too Many Requests`: you hit a rate limit. Limits are usually measured in both requests per minute and tokens per minute. The fix is exponential backoff: wait 1s, then 2s, then 4s (plus a little random jitter) before retrying.
`5xx` server errors: transient overload on the provider's side. Safe to retry with the same backoff. Most official SDKs retry these automatically; check before rolling your own.
`400 Bad Request`: your fault. Malformed messages, a prompt that exceeds the context window, or a bad schema. Do not retry; it will fail identically. Fix the input.
`401 / 403`: bad or missing API key, or no access to that model. Retrying never helps.
Context-length errors: your messages plus the requested completion exceed the model's window. Trim old turns, summarize history, or pick a model with a larger context.
Timeouts: set a sane client timeout. A streamed request can legitimately run for many seconds; a non-streamed one that hangs for a minute is usually stuck, so fail and retry.

Always cap retries (e.g. 3–5 attempts) and respect any Retry-After header the API sends; it tells you how long to wait.
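Those rules fit in a short retry wrapper. The sketch below is not any SDK's built-in behavior: ApiError is a hypothetical exception standing in for whatever your client raises, and the sleep function is injectable so the logic can be tested without waiting.

```python
import random
import time


class ApiError(Exception):
    """Hypothetical error type carrying the HTTP status of a failed call."""

    def __init__(self, status, retry_after=None):
        super().__init__(f"HTTP {status}")
        self.status = status
        self.retry_after = retry_after  # seconds, from a Retry-After header


def is_retryable(status):
    """429 and 5xx are transient; 400, 401, and 403 are not."""
    return status == 429 or 500 <= status < 600


def call_with_backoff(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying transient failures with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ApiError as err:
            last_attempt = attempt == max_attempts - 1
            if not is_retryable(err.status) or last_attempt:
                raise  # deterministic failure, or retries are used up
            if err.retry_after is not None:
                delay = err.retry_after  # the server said how long to wait
            else:
                # 1s, 2s, 4s, ... plus up to half a second of jitter
                delay = base_delay * 2 ** attempt + random.uniform(0, 0.5)
            sleep(delay)
```

A call site would wrap the request in a zero-argument function, for example call_with_backoff(lambda: send_request(payload)), where send_request is whatever function performs your API call and raises ApiError on failure.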

Pro tip

Two cheap wins: stay under your limits by batching or queueing requests rather than firing them in parallel, and request a quota increase early, because provider rate limits often scale with your usage tier and account age.
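If your code is async, one simple way to queue rather than flood is a semaphore that caps how many requests are in flight. This is a minimal sketch using only the standard library; run_limited is a hypothetical helper, and each job is assumed to be a zero-argument coroutine function that performs one API call.

```python
import asyncio


async def run_limited(jobs, limit=5):
    """Run zero-argument coroutine functions, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(job):
        async with semaphore:  # waits here while `limit` jobs are running
            return await job()

    # gather preserves the order of `jobs` in the returned list
    return await asyncio.gather(*(guarded(job) for job in jobs))


async def fake_request():
    """Stand-in for one API call."""
    await asyncio.sleep(0.01)
    return "ok"


results = asyncio.run(run_limited([fake_request] * 10, limit=3))
print(len(results))
```

A small limit combined with backoff on 429 responses usually behaves better than a large burst that trips the rate limit and retries all at once.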

Common mistakes that cost hours

1. Expecting the API to remember. It is stateless. If you do not resend the history, every turn starts from a blank slate.
2. Forgetting to append the assistant reply. Skip this and the model loses the thread of its own previous answers.
3. Hardcoding the API key in source. Read it from an environment variable or secret manager. Never commit it, and never ship it to the browser.
4. Ignoring token usage until the bill arrives. Long histories and verbose system prompts are billed on every single call. Trim ruthlessly.
5. Parsing JSON without a schema or a try/except. Without strict mode the model may wrap JSON in prose; even with it, truncation breaks json.loads.
6. Retrying `400`s and `401`s. They are deterministic failures. Retrying just burns time and quota.
7. Setting `max_tokens` too low, which silently truncates answers mid-sentence (or mid-JSON) instead of erroring loudly.
8. Firing hundreds of parallel requests and tripping the rate limit, instead of using a small concurrency limit with backoff.

Takeaways

The whole article in nine lines

An LLM call is a function: send the whole conversation, get the next message back.
The API is stateless: you keep the history and resend it every turn.
Four roles: system (rules), user (human), assistant (model), tool (function results).
Set stream=True to render tokens as they arrive; reassemble the chunks for history.
Use a JSON schema with strict mode when downstream code must parse the output.
Tool calling = the model asks you to run a function; you run it and feed the result back. That loop is an agent.
You pay per token in both directions, so watch usage and trim long histories.
Retry 429 and 5xx with exponential backoff and jitter; never retry 400 or 401.
Cap retries, respect Retry-After, and always guard json.loads in production.

Where to go next

You can now hold a real conversation with a model from code, stream it, force it into JSON, and survive a bad night on the provider's servers. Two directions build directly on this.

Go deeper on schemas, multi-tool loops, and agent patterns in Structured Output & Tool Calling.
Curious why any of this works at all, what "tokens" and "context windows" really are under the hood? Read How LLMs Actually Work.
See where these skills fit in the bigger picture on the AI Engineer career path.


Frequently asked questions

Does the LLM API remember my previous messages?

No. An API call is a pure-ish function that remembers nothing between calls, so you resend the full history every time. There is no session or server-side memory: you are the database, and the conversation is just an array you keep appending to and resending.

What are the message roles I need to know?

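There are only four roles you need, and each has a distinct job: system (standing rules), user (what the human said), assistant (what the model said previously), and tool (the output of a function the model asked to call). Keep exactly one system message and keep it first; if you need to change the rules mid-conversation, edit that system message and resend rather than stacking several of them.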

How do I switch between different LLM providers?

Almost every modern provider speaks the same dialect, so you can often move between them by swapping a base URL and a model name. Using an OpenAI-style SDK, the same code points at a different provider by changing base_url and model.

Why does my LLM bill grow so fast with long conversations?

You pay per token both for what you send (prompt_tokens) and what you get back (completion_tokens), and a token is roughly 4 characters of English, or about three-quarters of a word. Since you resend the whole history each turn, long conversations get expensive fast, which is why trimming old context matters.

