Most LLM projects disappoint — not because the model was bad, but because it was asked to do the wrong thing.
The pitch is usually compelling. A system that understands your data, answers your analysts' questions, and surfaces the right action at the right time. A general-purpose intelligence layer on top of your business.
The reality tends to be messier: confident answers that are occasionally wrong in ways that are hard to detect, brittle behavior that shifts with small prompt changes, and evaluation frameworks that no one can quite agree on.
When I see this pattern, the failure is almost never the model.
It's the architecture.
What LLMs are genuinely good at
LLMs are excellent at a specific class of tasks. They are not general reasoning engines. They are extraordinarily capable transformation machines.
What they do well:
- Text to structure. Parsing messy, unstructured input into a clean, typed representation a downstream system can use.
- Structure to text. Turning a system's output — numbers, codes, flags — into a clear explanation a human can act on.
- Routing. Classifying intent and deciding which tool, handler, or path applies.
- Bridging. Connecting human language to system logic. Translating a business question into a query, a command, a structured request.
These are coordination tasks. They are genuinely hard for traditional software. LLMs handle them well.
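The text-to-structure pattern is worth making concrete. Here is a minimal sketch, with a stub standing in for the real model call (the `call_llm` function, the `RefundRequest` schema, and the field names are all hypothetical): the key move is that the model's output is never trusted until it survives a typed validation step.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class RefundRequest:
    order_id: str
    amount_cents: int
    reason: str

def call_llm(user_message: str) -> str:
    # Stub standing in for a real model call prompted to return JSON.
    return '{"order_id": "A-1001", "amount_cents": 2599, "reason": "damaged"}'

def parse_request(user_message: str) -> RefundRequest:
    """Turn messy text into a typed object -- or fail loudly."""
    raw = json.loads(call_llm(user_message))      # malformed JSON raises here
    req = RefundRequest(**raw)                     # unexpected fields raise here
    if req.amount_cents <= 0:
        raise ValueError("amount must be positive")
    return req

req = parse_request("My order A-1001 arrived damaged, refund $25.99 please")
print(req.order_id, req.amount_cents)
```

Everything downstream of `parse_request` deals with a known shape, not with prose.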
What LLMs are not good at
The failure modes show up predictably:
- Multi-step reasoning where each step must be reliable. LLMs can produce plausible-looking chains of logic that break at one link — and the break is invisible until something goes wrong downstream.
- Knowing what they don't know. A model rarely signals uncertainty, so from the outside a confident hallucination is indistinguishable from a correct answer.
- Numerical computation and constraint satisfaction. Anything where exact correctness is required and "approximately correct" is not acceptable.
- Consistency at scale. Identical inputs can produce meaningfully different outputs across sessions. This is often fine for generation tasks. It is not fine for decision logic.
The common thread: these are not limitations that better prompting fixes. They are structural properties of how these models work.
The glue insight
The teams getting real, sustainable value from LLMs are not treating them as the intelligence layer.
They are using them as connective tissue between well-designed components.
In practice, this looks like:
- An LLM that parses a user's natural-language request into a structured query — which a validated, deterministic system then executes.
- An LLM that classifies and routes incoming documents — which a downstream pipeline then processes with rule-based or ML-based logic.
- An LLM that explains what a model or system decided — after the decision has already been made by auditable business logic.
- An LLM that extracts structured fields from unstructured text — feeding clean inputs into an optimization or forecasting model.
In each case, the LLM handles the part where human language meets machine logic. The "brain" — the part that makes consequential decisions — is still a deterministic, inspectable, testable system.
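The first pattern above, parse then execute deterministically, can be sketched in a few lines. This is an illustrative toy, not a real API: `comprehend` stubs out the LLM extraction step, and the in-memory `ORDERS` dict stands in for a validated data store.

```python
from dataclasses import dataclass

ORDERS = {"A-1001": "shipped", "A-1002": "pending"}  # toy data store

@dataclass(frozen=True)
class StatusQuery:
    order_id: str

def comprehend(text: str) -> StatusQuery:
    # Stand-in for an LLM that extracts a structured query from free text.
    # A real system would validate the model's JSON before constructing this.
    return StatusQuery(order_id="A-1001")

def execute(query: StatusQuery) -> str:
    # Deterministic, inspectable execution: no model past this point.
    if query.order_id not in ORDERS:
        raise KeyError(f"unknown order: {query.order_id}")
    return ORDERS[query.order_id]

print(execute(comprehend("Where is my order A-1001?")))
```

The boundary between the two functions is the whole point: `comprehend` can be fuzzy, `execute` cannot, and each can be tested on its own terms.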
The failure pattern
When LLM systems disappoint, the architecture usually looks like this: the LLM is given both comprehension and decision authority.
There is no separation between "understand the request" and "execute the action." The LLM is asked to reason about business rules, hold state, satisfy constraints, and produce reliable outputs — often in a single pass.
This creates three problems that compound each other:
- Evaluation is hard. When behavior is emergent, it is difficult to know whether the system is working correctly across the full input space.
- Recovery is hard. When a decision is wrong, there is no component to inspect or fix. The model is a black box with respect to the business logic it was supposed to encode.
- Trust erodes. Stakeholders discover edge cases where the system behaves unexpectedly. Without a clear model of why, confidence collapses — often unfairly, because the underlying capability was real.
A design principle worth keeping
Separate comprehension from execution.
Let the LLM layer do what it is good at: understand intent, extract structure, produce explanations, route requests.
Let your logic layer — validated ML models, rules, constraints, optimization — do what it is good at: execute decisions reliably and consistently, in ways that can be tested and audited.
This is not a limitation on what LLMs can do. It is a recognition of where their value is highest.
An LLM that correctly routes and frames 95% of requests is genuinely valuable. The same LLM, asked to be the final decision authority, becomes a liability.
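One way to encode the principle in code: the LLM layer may only choose among handlers, never implement one. A hypothetical sketch, where `classify_intent` stubs out a real model call and the handler bodies stand in for audited business logic:

```python
def classify_intent(text: str) -> str:
    # Stub for an LLM classifier; a real one returns a label, which is
    # checked against the allow-list of handlers before use.
    return "refund" if "refund" in text.lower() else "status"

def handle_refund(text: str) -> str:
    return "refund ticket opened"     # audited business logic lives here

def handle_status(text: str) -> str:
    return "status lookup started"

HANDLERS = {"refund": handle_refund, "status": handle_status}

def route(text: str) -> str:
    label = classify_intent(text)
    if label not in HANDLERS:         # never let the model invent a path
        raise ValueError(f"unroutable intent: {label}")
    return HANDLERS[label](text)

print(route("I want a refund for order A-1001"))
```

A misclassification here sends a request down the wrong (but still safe, still auditable) path; it cannot produce an action that no handler defines.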
Why this matters beyond architecture
There is a deeper point here, one that connects to the rest of the decision-making problems I write about on this site.
The hard part of any AI system is not the model. It is the design of the system that uses the model.
LLMs make this more true, not less. The expressiveness of language, the fluency of outputs, the apparent generality — all of these make it easier to skip the design work. The model seems smart enough to handle it.
It usually isn't.
And the systems that fail are rarely the ones where the LLM was bad. They are the ones where the surrounding system was never designed.
Closing thoughts
LLMs are genuinely powerful. That is not in question.
But the teams that are getting durable value from them right now are not the ones with the best prompts or the largest context windows. They are the ones who designed systems where the LLM does what it is good at — and something else handles the rest.
Glue is not a consolation prize. In most software systems, good glue is the difference between a collection of components and something that actually works.