The hands-on basics every AI engineer needs: the messages format, streaming, structured JSON output, tool calling, and surviving tokens, errors, and rate limits.
Every demo makes it look easy: one line of code, the model replies, the crowd cheers. Then you open the docs and drown in roles, streaming chunks, tool schemas, token counts, and a 429 error you have never seen before. The chatbot you typed into hid all of this. The API does not.
Here is the good news: under the surface, most modern LLM providers speak much the same dialect. Learn the shape once and you can often move between providers by swapping a base URL and a model name. This article is that shape: the handful of moves you will use in nearly every AI feature you build.
Who this is for
Engineers who can write a bit of Python and have made an HTTP request before, but have never called an LLM from code. By the end you will be able to send a chat request, stream the reply, force JSON output, let the model call a function, and handle the errors that show up in production. No ML background is required: we treat the model as a service you call.
The one mental model: it is just a function over a list
An LLM API call is a pure-ish function: you hand it the entire conversation so far, and it returns the next message. It remembers nothing between calls; you resend the history every time.
That last sentence trips up everyone. There is no session, no server-side memory of your last question. You are the database. The "conversation" is just an array you keep appending to and resending. Once that clicks, the rest is detail.
A waiter who forgets you the moment they leave the table: the model is stateless. It only knows what is in the messages you sent on this call.
Reading the whole email thread before replying: you resend the full message history on every request.
A house style guide pinned above the desk: the system role, standing instructions that shape every reply.
A specialist you can phone mid-task: tool / function calling. The model asks you to run code, and you hand back the result.
Mapping the API onto things you already understand.
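To make the "you are the database" idea concrete, here is a minimal offline sketch. `fake_model` and `ask` are hypothetical stand-ins, not part of any SDK: `fake_model` only reports what it can see in the list it was handed, which is exactly the position a real model is in.

```python
def fake_model(messages):
    """Hypothetical stand-in for a real LLM call: it sees only what is passed in."""
    user_turns = [m["content"] for m in messages if m["role"] == "user"]
    return (f"I can see {len(user_turns)} user message(s); "
            f"the latest is: {user_turns[-1]!r}")


def ask(history, text):
    """Append the user turn, call the model with the FULL history, store the reply."""
    history.append({"role": "user", "content": text})
    reply = fake_model(history)
    history.append({"role": "assistant", "content": reply})
    return reply


history = [{"role": "system", "content": "You are a concise assistant."}]
first = ask(history, "My name is Sam.")
second = ask(history, "What is my name?")

# A call that does not resend the history starts from a blank slate.
amnesiac = fake_model([{"role": "user", "content": "What is my name?"}])
```

The second call can "remember" the first only because `ask` resent it. The `amnesiac` call sees a single message and nothing else.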
The request/response lifecycle
Before any code, hold the picture in your head. Your app builds a messages array, sends it to the API, and gets tokens back, usually streamed one chunk at a time. Sometimes the reply is plain text. Sometimes the model branches: instead of answering, it asks you to run a tool, you run it, and you loop the result back in.
One request, two possible endings: a streamed text answer, or a tool-call detour that loops back.
1. Assemble the messages. Start with a system message (the rules), add the running history, then append the new user message. This array is the entire input.
2. Send the request. POST the messages, the model name, and parameters like temperature and max_tokens to the chat endpoint with your API key in the header.
3. Receive the reply. Either wait for the full response, or stream it: the server pushes partial tokens so you can render text as it is generated.
4. Branch if needed. If you provided tools and the model decides to use one, it returns a tool-call request instead of text. You run the tool and append its output to the messages.
5. Loop or finish. After a tool result you call the API again with the extended history. When the model returns plain text, you append it and you are done, ready for the next user turn.
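The five steps can be sketched as one small loop. This is a simplified illustration, not a real wire format: the `tool_request` key, `scripted_model`, and `run_tool` are all hypothetical stand-ins so the control flow can run without a network call.

```python
def run_turn(call_model, run_tool, messages, user_text, max_hops=5):
    """One user turn: call the model, detour through tools if asked, then finish."""
    messages.append({"role": "user", "content": user_text})   # step 1: assemble
    for _ in range(max_hops):
        reply = call_model(messages)                           # steps 2-3: send, receive
        messages.append(reply)
        request = reply.get("tool_request")
        if not request:                                        # plain text: done
            return reply["content"]
        result = run_tool(request["name"], request["arguments"])  # step 4: branch
        messages.append({"role": "tool", "content": str(result)})
        # step 5: loop, calling the model again with the extended history
    raise RuntimeError("model kept asking for tools; giving up")


def scripted_model(messages):
    """Hypothetical model: asks for a tool once, then answers from its result."""
    if messages[-1]["role"] == "tool":
        return {"role": "assistant",
                "content": f"The weather is {messages[-1]['content']}."}
    return {"role": "assistant", "content": None,
            "tool_request": {"name": "get_weather",
                             "arguments": {"city": "Amsterdam"}}}


def run_tool(name, arguments):
    """Hypothetical tool runner returning canned data."""
    return f"14C and cloudy in {arguments['city']}"


messages = [{"role": "system", "content": "You are a concise assistant."}]
answer = run_turn(scripted_model, run_tool, messages, "Weather in Amsterdam?")
```

The `max_hops` cap is a design choice of this sketch: it keeps a confused model from looping on tool requests forever.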
The four message roles
Every entry in the messages array has a role. There are only four you need, and each has a distinct job. Get the roles right and the model behaves; mix them up and you get confused, off-topic replies.
| Role | Purpose |
| --- | --- |
| `system` | Standing instructions and persona set once at the top: tone, format, rules, constraints. The model treats this as highest-priority context for the whole conversation. |
| `user` | What the human said. Each turn the user takes adds one of these. This is the question or instruction the model should respond to. |
| `assistant` | What the model said previously. You append the model's replies back into the array so it can see its own prior turns; this is how multi-turn memory works. |
| `tool` | The output of a function the model asked to call. You add this after running a tool so the model can use the result to finish its answer. |
The message roles and what each one is for.
Pro tip
Keep exactly one `system` message and keep it first. Need to change the rules mid-conversation? Edit the system message and resend; do not stack five of them.
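One way to enforce that tip is a small helper that rebuilds the list with a single system message at the front. `set_system_prompt` is a hypothetical name for this sketch, not an SDK function.

```python
def set_system_prompt(messages, text):
    """Return a new list with exactly one system message, placed first."""
    rest = [m for m in messages if m["role"] != "system"]
    return [{"role": "system", "content": text}] + rest


messy = [
    {"role": "user", "content": "Hi"},
    {"role": "system", "content": "Old rules."},
    {"role": "assistant", "content": "Hello!"},
    {"role": "system", "content": "Even older rules."},
]
clean = set_system_prompt(messy, "You are a concise assistant.")
```

Returning a new list rather than mutating in place keeps the original history available if you want to compare or roll back.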
A basic chat request
Here is the whole thing in Python, using the OpenAI-style SDK that most providers are compatible with. The same code points at a different provider by changing base_url and model.
chat.py
python
from openai import OpenAI
client = OpenAI() # reads OPENAI_API_KEY from the environment
messages = [
{"role": "system", "content": "You are a concise assistant. Answer in one sentence."},
{"role": "user", "content": "What is a vector database, in plain English?"},
]
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=messages,
temperature=0.2, # lower = more deterministic
max_tokens=200, # cap the reply length (and your bill)
)
reply = response.choices[0].message.content
print(reply)
# Crucial: to continue the conversation, append the reply
# and the next user turn, then call create() again.
messages.append({"role": "assistant", "content": reply})
# Token accounting comes back on every response:
usage = response.usage
print(f"prompt={usage.prompt_tokens} "
f"completion={usage.completion_tokens} "
f"total={usage.total_tokens}")
Two things to internalize. First, the reply lives at choices[0].message.content, a nested path because the API can return multiple choices. Second, you pay per token, both for what you send (prompt_tokens) and what you get back (completion_tokens). A token is roughly 4 characters of English, or about ¾ of a word. Long histories get expensive fast, which is why trimming old turns matters.
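Here is a rough sketch of trimming old turns to a token budget. It leans on the 4-characters-per-token rule of thumb above, which is only an approximation; for real accounting you would use the provider's tokenizer. `estimate_tokens` and `trim_history` are hypothetical helpers, and the sketch assumes every message has plain string content.

```python
def estimate_tokens(text):
    """Very rough estimate: about 4 characters of English per token."""
    return max(1, len(text) // 4)


def trim_history(messages, budget):
    """Keep the system message plus the most recent turns that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    used = sum(estimate_tokens(m["content"]) for m in system)
    kept = []
    for message in reversed(turns):          # walk from newest to oldest
        cost = estimate_tokens(message["content"])
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return system + kept[::-1]               # restore chronological order


history = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "a" * 40},
    {"role": "assistant", "content": "b" * 40},
    {"role": "user", "content": "c" * 40},
]
trimmed = trim_history(history, budget=25)
```

With a budget of 25 estimated tokens, the oldest user turn is dropped and the system message plus the two newest turns survive. Summarizing dropped turns instead of discarding them is a common refinement.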
Streaming the response
Waiting ten seconds for a full paragraph feels broken. Streaming flips a flag and the server pushes tokens as they are generated, so text appears word-by-word, the experience you know from every chat UI. You assemble the chunks yourself.
stream.py
python
stream = client.chat.completions.create(
model="gpt-4o-mini",
messages=messages,
stream=True, # the only change
)
full_reply = []
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # some chunks carry no text (role, finish)
        print(delta, end="", flush=True)
        full_reply.append(delta)
# Re-join the pieces to store the complete turn in history.
reply = "".join(full_reply)
messages.append({"role": "assistant", "content": reply})
Watch out
Streaming responses usually do **not** include a usage/token count on the chunks. If you need the count, either request it explicitly (some providers support a stream option for this) or count tokens yourself with the provider's tokenizer library.
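As a sketch of the "request it explicitly" route: on the OpenAI API the option is `stream_options={"include_usage": True}`, which adds a final chunk that carries `usage` and an empty `choices` list. Treat that as an assumption for any other provider. The helper below, `collect_stream` (a hypothetical name), tolerates such a chunk; the fake chunks exist only so the example runs offline.

```python
from types import SimpleNamespace


def collect_stream(stream):
    """Join text deltas and pick up the usage object if the stream carries one."""
    parts, usage = [], None
    for chunk in stream:
        if getattr(chunk, "usage", None) is not None:
            usage = chunk.usage
        if not chunk.choices:        # a usage-only chunk has no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
    return "".join(parts), usage


# With the OpenAI SDK the request would look roughly like this:
# stream = client.chat.completions.create(
#     model="gpt-4o-mini", messages=messages, stream=True,
#     stream_options={"include_usage": True},
# )


def text_chunk(text):
    """Build a fake chunk shaped like a streamed delta (offline stand-in)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)], usage=None)


fake_stream = [
    text_chunk(None),                # role-only chunk, no text
    text_chunk("Hel"),
    text_chunk("lo"),
    SimpleNamespace(choices=[], usage=SimpleNamespace(
        prompt_tokens=12, completion_tokens=2, total_tokens=14)),
]
reply, usage = collect_stream(fake_stream)
```

Note that a loop indexing `chunk.choices[0]` unconditionally would raise an `IndexError` on the usage-only chunk, which is why the guard is there.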
Getting structured JSON output
Free-form prose is great for humans and terrible for code. When the next step is data["price"], you need guaranteed, parseable JSON, not "Sure! Here is the JSON: ..." wrapped in markdown. Modern APIs let you pin the output to a JSON schema, and the model is constrained to produce exactly that shape.
structured.py
python
import json
schema = {
"type": "object",
"properties": {
"sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
"confidence": {"type": "number"},
"keywords": {"type": "array", "items": {"type": "string"}},
},
"required": ["sentiment", "confidence", "keywords"],
"additionalProperties": False,
}
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": "Classify the review. Reply only with JSON."},
{"role": "user", "content": "The battery dies in two hours but the screen is gorgeous."},
],
response_format={
"type": "json_schema",
"json_schema": {"name": "review", "schema": schema, "strict": True},
},
)
# With strict mode the shape is enforced, so this parse is safe unless the reply was truncated.
data = json.loads(response.choices[0].message.content)
print(data["sentiment"], data["confidence"])
The strict: True flag is what turns a polite request into a hard guarantee: decoding is constrained so the model cannot emit a token that breaks the schema. Still wrap json.loads in a try/except in production: a truncated response (you hit max_tokens mid-object) produces incomplete JSON. Want to go deeper on schemas and letting the model call your code? That is its own topic; see Structured Output & Tool Calling.
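A minimal sketch of that guard follows. `parse_json_reply` is a hypothetical helper; it assumes you pass in `response.choices[0].message.content` and `response.choices[0].finish_reason`, where a finish reason of `"length"` means the reply was cut off by the token cap.

```python
import json


def parse_json_reply(content, finish_reason):
    """Parse a model reply as JSON, failing loudly on truncation or bad output."""
    if finish_reason == "length":
        raise ValueError("reply was cut off by max_tokens; JSON is likely incomplete")
    try:
        return json.loads(content)
    except (json.JSONDecodeError, TypeError) as exc:
        # TypeError covers content=None, e.g. when the model returned no text.
        raise ValueError(f"model reply was not valid JSON: {exc}") from exc


good = parse_json_reply(
    '{"sentiment": "negative", "confidence": 0.7, "keywords": ["battery"]}',
    "stop",
)
```

Raising a single exception type keeps the caller simple: catch `ValueError`, then decide whether to retry with a larger `max_tokens` or surface the failure.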
A first look at tool / function calling
Models cannot check today's weather, query your database, or do reliable arithmetic. Tool calling fixes that: you describe a function with a JSON schema, and when the model needs it, instead of answering it returns a structured request, "call get_weather with {city: 'Amsterdam'}". Your code runs the real function, appends the result as a tool message, and calls the API again. The model weaves the result into a natural answer.
1. Describe the tool. Pass a `tools` list: each entry is a name, a description, and a JSON schema for its arguments. The description tells the model when to use it.
2. Model decides. If the request needs the tool, the response contains a tool_call (name + JSON arguments) instead of normal content.
3. You execute. Parse the arguments, run your actual function, and capture the return value. The provider never runs your code; you do.
4. Feed it back. Append a message with role `tool` containing the result, then call the API again. The model now has the data it was missing and writes the final reply.
Note
Tool calling is the foundation of every "agent." An agent is just this loop running until the model stops asking for tools. Master the single hop here first; the loop is the easy part.
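Here is a sketch of that single hop in the OpenAI-style format. The `tools` entry shape, `tool_calls`, and `tool_call_id` follow the OpenAI Chat Completions API; other providers may differ. `get_weather` returns canned data, and `REGISTRY`, `execute_tool_calls`, and `answer_with_tools` are hypothetical names for this example. Only the offline part at the bottom actually runs here.

```python
import json
from types import SimpleNamespace


def get_weather(city):
    """Hypothetical tool: a real one would call a weather service."""
    return {"city": city, "temp_c": 14, "conditions": "cloudy"}


TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
REGISTRY = {"get_weather": get_weather}


def execute_tool_calls(tool_calls):
    """Run each requested tool and build the matching `tool` messages."""
    results = []
    for call in tool_calls:
        func = REGISTRY.get(call.function.name)
        if func is None:
            output = {"error": f"unknown tool {call.function.name}"}
        else:
            args = json.loads(call.function.arguments)  # arguments are a JSON string
            output = func(**args)
        results.append({"role": "tool", "tool_call_id": call.id,
                        "content": json.dumps(output)})
    return results


def answer_with_tools(client, messages, model="gpt-4o-mini"):
    """One tool hop: ask, run any requested tools, then ask again."""
    response = client.chat.completions.create(
        model=model, messages=messages, tools=TOOLS)
    message = response.choices[0].message
    if not message.tool_calls:
        return message.content
    messages.append(message)                 # keep the tool request in history
    messages.extend(execute_tool_calls(message.tool_calls))
    followup = client.chat.completions.create(
        model=model, messages=messages, tools=TOOLS)
    return followup.choices[0].message.content


# Offline check of the execution step, using a fake tool call object.
fake_call = SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(name="get_weather",
                             arguments='{"city": "Amsterdam"}'),
)
tool_messages = execute_tool_calls([fake_call])
```

Returning an error payload for an unknown tool name, rather than raising, lets the model see what went wrong and recover on the next call.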
Errors, retries, and rate limits
Your happy-path code works on the first try in a notebook and falls over the moment real traffic hits it. LLM APIs are network calls to a busy, shared service; failure is normal, not exceptional. Plan for it.
`429 Too Many Requests`: you hit a rate limit. Limits are usually measured in both requests-per-minute and tokens-per-minute. The fix is exponential backoff: wait 1s, then 2s, then 4s (plus a little random jitter) before retrying.
`5xx` server errors: transient overload on the provider's side. Safe to retry with the same backoff. Most official SDKs retry these automatically; check before rolling your own.
`400 Bad Request`: your fault. Malformed messages, a context window that is too long, or a bad schema. Do not retry; it will fail identically. Fix the input.
`401 / 403`: bad or missing API key, or no access to that model. Retrying never helps.
Context-length errors: your messages plus the requested completion exceed the model's window. Trim old turns, summarize history, or pick a model with a larger context.
Timeouts: set a sane client timeout. A streamed request can legitimately run for many seconds; a non-streamed one that hangs for a minute is usually stuck, so fail and retry.
Always cap retries (e.g. 3 to 5 attempts) and respect any Retry-After header the API sends; it tells you exactly how long to wait.
Pro tip
Two cheap wins: stay under your limits by batching or queueing requests rather than firing them in parallel, and request a quota increase early. Provider rate limits often scale with your usage tier and account age.
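The retry rules above can be sketched as one wrapper. This is a generic illustration, not any SDK's built-in behavior: `send` is a hypothetical callable returning `(status_code, headers, body)`, and the sketch assumes `Retry-After` arrives as a number of seconds, falling back to backoff if it does not parse.

```python
import random
import time

RETRYABLE = {429, 500, 502, 503, 504}


def call_with_retries(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it succeeds, retrying only transient failures."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status < 400:
            return body
        if status not in RETRYABLE:
            raise RuntimeError(f"non-retryable error {status}: fix the request")
        if attempt == max_attempts - 1:
            break                                    # out of attempts
        delay = base_delay * 2 ** attempt + random.uniform(0, 0.5)  # backoff + jitter
        retry_after = headers.get("Retry-After")
        if retry_after is not None:
            try:
                delay = float(retry_after)           # the server knows best
            except ValueError:
                pass                                 # not in seconds: keep backoff
        sleep(delay)
    raise RuntimeError(f"gave up after {max_attempts} attempts")


# Offline demo: two transient failures, then success. Waits are recorded
# instead of slept so the example finishes instantly.
responses = iter([
    (429, {"Retry-After": "2"}, None),
    (503, {}, None),
    (200, {}, "ok"),
])
waits = []
result = call_with_retries(lambda: next(responses), sleep=waits.append)
```

Passing `sleep` as a parameter is a design choice for testability. Before using something like this, check whether your SDK already retries for you, so you do not multiply the attempts.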
Common mistakes that cost hours
Expecting the API to remember. It is stateless. If you do not resend the history, every turn starts from a blank slate.
Forgetting to append the assistant reply. Skip this and the model loses the thread of its own previous answers.
Hardcoding the API key in source. Read it from an environment variable or secret manager; never commit it and never ship it to the browser.
Ignoring token usage until the bill arrives. Long histories and verbose system prompts are billed on every single call. Trim ruthlessly.
Parsing JSON without a schema or a try/except. Without strict mode the model may wrap JSON in prose; even with it, truncation breaks json.loads.
Retrying `400`s and `401`s. They are deterministic failures. Retrying just burns time and quota.
Setting `max_tokens` too low, which silently truncates answers mid-sentence (or mid-JSON) instead of erroring loudly.
Firing hundreds of parallel requests and tripping the rate limit, instead of using a small concurrency limit with backoff.
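For the last mistake, a small concurrency limit might look like the sketch below, using `asyncio.Semaphore` from the standard library. `run_bounded` and `fake_request` are hypothetical names; `fake_request` stands in for a real async API call and records how many requests are in flight at once.

```python
import asyncio


async def run_bounded(items, worker, limit=5):
    """Run worker(item) for every item with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item):
        async with semaphore:            # wait here if `limit` calls are in flight
            return await worker(item)

    # gather preserves input order in its results
    return await asyncio.gather(*(guarded(item) for item in items))


stats = {"in_flight": 0, "peak": 0}


async def fake_request(prompt):
    """Hypothetical stand-in for an async LLM call."""
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)            # simulate network latency
    stats["in_flight"] -= 1
    return prompt.upper()


prompts = [f"q{i}" for i in range(20)]
results = asyncio.run(run_bounded(prompts, fake_request, limit=3))
```

In a real client you would combine this with backoff inside the worker, so a 429 slows one slot down instead of failing the whole batch.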
Takeaways
The whole article in eight lines
An LLM call is a function: send the whole conversation, get the next message back.
The API is stateless: **you** keep the history and resend it every turn.
Set `stream=True` to render tokens as they arrive; reassemble the chunks for history.
Use a JSON schema with `strict` mode when downstream code must parse the output.
Tool calling = the model asks you to run a function; you run it and feed the result back. That loop is an agent.
You pay per token in both directions; watch `usage` and trim long histories.
Retry `429` and `5xx` with exponential backoff and jitter; never retry `400` or `401`.
Cap retries, respect `Retry-After`, and always guard `json.loads` in production.
Where to go next
You can now hold a real conversation with a model from code, stream it, force it into JSON, and survive a bad night on the provider's servers. Two directions build directly on this.