TheMurrow

What “AI Agents” Actually Are

A plain-English guide to how agent loops work, why tool access changes everything, where systems fail quietly, and when to deploy agents with restraint.

By TheMurrow Editorial
February 25, 2026

Key Points

  1. Define agents precisely: they run an action/observation loop with goals, tools, and state—far beyond a chatbot with integrations.
  2. Prefer constrained tool calling over UI “computer use,” and demand full traces, validation, permissions, and safe fallbacks before real autonomy.
  3. Expect predictable failures—wrong tools, bad parameters, tool-bypass, and error cascades—then engineer guardrails so mistakes can’t scale quietly.

The first time an “AI agent” makes a mistake, it rarely looks like a mistake. It looks like momentum.

A cheerful assistant promises to “handle it,” then quietly schedules the wrong meeting, pulls the wrong customer record, or files a support ticket under the wrong product line. Nobody screams “hallucination.” The damage arrives as ordinary operational friction—an hour lost here, an awkward follow-up there, and a growing suspicion that the software is confident in ways it hasn’t earned.

The conversation about AI agents has drifted into two unhelpful extremes: breathless futurism or knee-jerk dismissal. Both miss what matters. Agents are neither magic nor mere chatbots. They are a specific software pattern—powerful when engineered with restraint, risky when shipped with vibes.

“The most dangerous thing an AI agent can do is fail quietly—because quiet failures scale.”

— TheMurrow Editorial

What people actually mean by “AI agents” (minus the hype)

The term “AI agent” gets used loosely, but it points to a real and increasingly common software pattern. The confusion comes from the fact that many products labeled “agent” are still essentially conversational interfaces with a few integrations. The more precise meaning—the one that matters for engineering, procurement, and risk—describes a system that can decide what to do next, take actions through tools, and repeat that process across multiple steps.

This section sets the baseline: what an agent is in plain English, and what separates an agent from a chatbot wearing an “autonomy” costume. The emphasis is on loops, tools, state, and traces—because those are the operational realities that determine whether an agent is useful, auditable, and safe in production.

If you’re evaluating agent products, this definition is the difference between buying a helpful dispatcher and buying a fluent improviser with access to your calendar and CRM.

The plain-English definition

An AI agent is software that uses an AI model to decide what to do next—often over multiple steps—and can call tools or take actions to reach a goal. The crucial idea isn’t that the model “knows everything.” The crucial idea is that the model runs a loop: it tries something, observes results, then tries the next thing.

OpenAI’s developer materials describe “agentic applications” as systems where “a model can use additional context and tools,” can “hand off to other specialized agents,” and can keep “a full trace of what happened.” That emphasis on tool use and traceability points to a mature view: an agent is a system, not a single model prompt.

Anthropic’s documentation for “computer use” makes the structure even clearer. It describes an “agent loop” where the model requests tool actions and the application executes them, returning results for the next decision. In other words: the agent decides; the system does; the agent reacts.

The line between an agent and a dressed-up chatbot

A lot of products called “agents” are closer to chatbots with a couple of integrations. The stronger meaning—what engineers typically mean when they’re being precise—includes:

- a goal (explicit or implied),
- autonomy across multiple steps (not just one response),
- tool or action access (search, databases, calendar, CRM, browser control),
- and some notion of state (memory, workflow context, or a record of prior steps).

You can feel the difference in practice. A chatbot answers. An agent attempts.

“A chatbot talks. An agent attempts.”

— TheMurrow Editorial

The agent loop: the hidden machinery behind the buzzword

Once you stop thinking of agents as “smart chat” and start thinking of them as a loop, most of the hype dissolves into mechanics—and that’s a good thing. The loop explains why agents can feel powerful (they keep going) and why they can be fragile (each step can introduce error). It also clarifies what real autonomy looks like in production systems: the model is rarely doing work directly. Instead, it’s orchestrating work across tools and systems that already exist.

In practice, the agent loop becomes the backbone of everything that matters operationally: observability, debugging, compliance, and incident response. Without a clear loop and a record of actions, you can’t answer basic questions like “what happened?” and “why did it do that?”—especially once you introduce handoffs between specialized agents.

This is where “agent” stops being a marketing label and becomes an architecture decision.

The five-step pattern most agents share

Strip away branding and most “AI agent” products boil down to the same loop:

1. A user provides a goal.
2. The model proposes an action (often a tool call).
3. The system executes the action.
4. The system returns results.
5. The model decides the next step.

That loop matters because it explains both the promise and the fragility. Each cycle adds opportunities to correct course—and opportunities to compound errors.
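The five steps above can be sketched in a few lines of Python. This is a toy illustration, not any vendor's implementation: the "model" is a hard-coded stub standing in for an LLM call, and the tool name is hypothetical.

```python
# Minimal sketch of the agent loop: propose -> execute -> observe -> repeat.
# "propose_action" stands in for an LLM call; the tool is hypothetical.

TOOLS = {
    "lookup_weather": lambda city: f"Sunny in {city}",
}

def propose_action(goal, history):
    # Stub model: with no observations yet, call a tool; otherwise finish.
    if not history:
        return {"tool": "lookup_weather", "args": {"city": "Oslo"}}
    return {"tool": None, "final": f"Done: {history[-1]}"}

def run_agent(goal, max_steps=5):
    history = []
    for _ in range(max_steps):            # bound the loop: no unbounded autonomy
        action = propose_action(goal, history)
        if action["tool"] is None:        # the model decides it is finished
            return action["final"]
        result = TOOLS[action["tool"]](**action["args"])  # the system executes
        history.append(result)            # the observation feeds the next decision
    return "Stopped: step limit reached"

print(run_agent("What's the weather in Oslo?"))
```

Note the `max_steps` bound: even in a sketch, the loop terminates by design rather than by hope.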

One underappreciated detail: even when an agent looks autonomous, it’s usually not “doing the work” directly. It’s orchestrating the work. The real world is full of databases, UIs, and rules engines. Agents are a new kind of dispatcher.

Why “full trace” isn’t a developer luxury

When OpenAI highlights keeping “a full trace of what happened,” it’s addressing a hard operational truth: agent systems produce long chains of decisions. Without traceability, you can’t audit failures, fix workflows, or prove compliance.

Traceability becomes even more important as agents “hand off to other specialized agents.” Multi-step handoffs can improve performance, but they also create ambiguity: which component made the wrong call? A trace is how you answer that question without guessing.
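In practice, a trace can be as simple as an append-only log of every decision, tool result, and handoff. The field names below are illustrative, not any vendor's schema:

```python
# A trace as an append-only record of decisions and actions.
# Event and field names here are illustrative only.
import json
import time

trace = []

def record(event_type, payload):
    trace.append({"ts": time.time(), "type": event_type, **payload})

# During a run, every step gets recorded:
record("model_decision", {"agent": "triage", "tool": "search_kb",
                          "args": {"q": "refund policy"}})
record("tool_result", {"tool": "search_kb", "ok": True, "result": "14-day window"})
record("handoff", {"from": "triage", "to": "ticketing"})

# After an incident, the trace answers "which component made the call?"
handoffs = [e for e in trace if e["type"] == "handoff"]
print(json.dumps(handoffs, indent=2))
```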

“If you can’t trace it, you can’t trust it.”

— TheMurrow Editorial

LLMs as planners, not omniscient brains

The easiest way to misunderstand agents is to overestimate what the model itself provides. LLMs are impressive, but they remain probabilistic systems optimized for producing plausible language. In agent settings, that means they can be excellent planners and routers—selecting likely next steps, translating goals into actions, and stitching together tool outputs. But they are not, on their own, reliable sources of truth, compliance, or safe execution.

This is why tool use isn’t a “nice to have.” It’s a design response to known weaknesses: models guess when they should query; they sound confident when they should verify; they produce formatting that looks right until it breaks downstream. Agents can mitigate those tendencies when the system forces the model to consult authoritative sources and when execution is constrained by validations and permissions.

Evaluating an agent, then, is less about whether the model sounds smart and more about whether the system makes smart behavior the path of least resistance.

What LLMs are good at—and what they aren’t

Modern large language models (LLMs) are strong at language reasoning, summarizing, and selecting plausible next actions. They are unreliable for:

- up-to-date facts (unless connected to retrieval),
- arithmetic (unless checked),
- strict formatting and compliance (unless enforced),
- and safe, correct real-world actions (unless constrained).

That list explains why tool use isn’t a bonus feature. It’s an attempt to make a probabilistic text model behave like a dependable software component.

Tool use as a design response, not a party trick

The research direction behind tool use has been explicit for years. The 2023 paper Toolformer describes an approach where language models learn to use external tools, motivated by the idea that models shouldn’t “guess” when they can query.

A well-designed agent treats the model as a planner and router: it decides which system should answer the question. A calculator answers math. A search API answers “what’s new.” A CRM answers “what did we promise this client.” The LLM becomes the glue between specialized capabilities.

The practical implication for readers evaluating agent products: reliability depends less on the poetry of the model and more on the quality of the tools, the guardrails, and the execution layer.

The agent patterns engineers actually use (and why they matter)

Behind the scenes, most agent systems are variations on a few patterns that researchers and engineers have been refining for years. Knowing these patterns helps you recognize what a product is actually doing, what kinds of failures to expect, and how to improve reliability.

These patterns also hint at a broader shift: the industry is moving away from “one model to rule them all” and toward systems that combine models with specialized modules, explicit action/observation cycles, and sometimes multiple cooperating agents. The upside is better performance on real tasks; the downside is complexity, coordination, and more surface area for errors.

Understanding ReAct, MRKL, and multi-agent collaboration won’t make you an agent engineer overnight—but it will make demos easier to interrogate and architectures easier to sanity-check.

ReAct: Reason + Act, step by step

One influential pattern is ReAct (“Reason + Act”), published in October 2022. The idea is simple: interleave small “thinking” steps with actions, so the model can observe real results before continuing.

The ReAct paper reports improvements across several settings and argues for a pragmatic benefit: consulting external sources can reduce hallucinations and error propagation. A model that can check Wikipedia via an API is less likely to invent a citation—at least in theory.

For editorially minded readers, ReAct also clarifies what “agentic” means: not a single grand answer, but a chain of smaller, checkable moves. That is the kind of structure organizations can audit and improve.
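The ReAct idea can be reduced to a toy sketch: each step pairs a short "thought" with an action against an external source, and the observation is recorded before the next decision. The knowledge store below is a stand-in for a real retrieval API:

```python
# Toy ReAct-style step: interleave a "thought" with an action, and keep
# the (thought, observation) pair so each step is checkable afterward.
# KNOWLEDGE stands in for a real search or retrieval API.
KNOWLEDGE = {"capital of France": "Paris"}

def react_step(question, scratchpad):
    thought = f"I should look up: {question}"
    observation = KNOWLEDGE.get(question, "not found")  # Act: consult a source
    scratchpad.append((thought, observation))           # Observe before answering
    return observation

scratchpad = []
answer = react_step("capital of France", scratchpad)
print(answer)           # the answer comes from the source, not from a guess
```

The point of the scratchpad is exactly the auditability the paragraph above describes: a chain of small, checkable moves rather than one opaque answer.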

MRKL systems: modular routing to specialized modules

Another common approach comes from the MRKL paper (May 2022), which outlines a modular system combining an LLM with specialized components such as search and symbolic solvers. The goal is straightforward: overcome model limitations by delegating.

The MRKL framing—“modular” and “neuro-symbolic”—signals a design philosophy that feels almost old-fashioned in a good way. Instead of hoping the model can do everything, you build a system where different parts do what they’re best at.

Multi-agent collaboration: when one agent isn’t enough

A third pattern is multi-agent collaboration, where multiple specialized agents converse or hand off tasks. Microsoft Research’s AutoGen work (presented at COLM 2024) positions multi-agent conversation as a general framework for building applications.

The promise is specialization: one agent writes code, another verifies, another handles user communication. The risk is coordination overhead and blame diffusion. When a system fails, “the agents disagreed” isn’t an explanation. It’s a liability unless you have strong traces and clear authority rules.

Tool calling vs. “computer use”: two very different kinds of power

Not all “agent actions” are created equal. Some actions are structured, permissioned, and easy to audit. Others are effectively giving the model a pair of hands on a user interface. Both can be useful, but they carry very different operational risk.

Tool calling is generally the enterprise-friendly approach: the model asks for an action in a structured way, the system validates it, and the tool executes with controlled permissions. Computer use, by contrast, is attractive because it can automate anything a person can click—especially legacy systems without APIs. But that same generality creates brittleness and a larger blast radius for mistakes.

If you remember only one procurement lesson from this section, make it this: “can click around” is not a feature you accept casually. It’s a capability you sandbox, log, and constrain—or you avoid.

Tool calling: structured, auditable, and easier to constrain

Tool calling means the model outputs a structured request—“call tool X with arguments Y”—and the application executes it. It’s common in enterprise systems because it’s easier to log, validate, and sandbox.

Tool calling also supports strong design hygiene:

- arguments can be schema-validated,
- tools can enforce permissions,
- outputs can be checked before the model continues.

When readers hear that an agent can “use your CRM,” tool calling is often what’s meant: a controlled bridge between the model and a system of record.
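That design hygiene can be made concrete with a small validation layer: the model's structured request is checked against a schema before anything executes. Tool and field names below are hypothetical:

```python
# Sketch of tool-call validation: reject unknown tools, missing required
# arguments, and unexpected arguments before execution. Names are hypothetical.

SCHEMAS = {
    "create_ticket": {
        "required": {"title", "product_line"},
        "allowed": {"title", "product_line", "priority"},
    },
}

def validate_call(name, args):
    schema = SCHEMAS.get(name)
    if schema is None:
        return f"rejected: unknown tool {name!r}"      # wrong-tool guard
    missing = schema["required"] - args.keys()
    unknown = args.keys() - schema["allowed"]
    if missing or unknown:
        return f"rejected: missing={sorted(missing)} unknown={sorted(unknown)}"
    return "ok"

print(validate_call("create_ticket", {"title": "Login fails", "product_line": "Web"}))
print(validate_call("create_ticket", {"title": "Login fails", "urgency": "high"}))
```

A rejected call never reaches the system of record; it goes back to the model (or a human) as an observation instead.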

Computer use: screenshots, clicks, and a bigger blast radius

Computer use (sometimes called browser or desktop use) is different. Anthropic describes it as screenshot-based perception plus mouse/keyboard control. OpenAI similarly positions “computer use” as a built-in tool alongside web search and file search.

Computer use is attractive because it can automate legacy interfaces without custom APIs. It’s also brittle: UIs change, buttons move, pop-ups appear, and the model can misclick. Anthropic’s documentation emphasizes heightened security considerations, recommending sandboxing via VM/container and avoiding sensitive credentials.

That warning deserves to be read as policy, not footnote. Once an agent can click around a real system, mistakes stop being hypothetical. They become operational incidents.

Where agents break in practice: the failure modes you should expect

Agent failures are rarely cinematic. They’re mundane, compounding, and expensive in the way operational mistakes are expensive: they waste time, create cleanup work, and gradually erode trust. What changes with agents is that errors don’t stay in the chat window. They propagate into calendars, ticket queues, emails, and databases.

Tool access doesn’t eliminate hallucinations—it relocates them into tool choice, parameter construction, and claims about what the system did. Meanwhile, the very loop that makes agents adaptable also gives small errors multiple chances to snowball into a workflow.

Planning for these failure modes isn’t pessimism. It’s basic hygiene for any system that can take actions in production. The right expectation is routine failure with bounded impact—not rare catastrophe with unbounded access.

Hallucination, now with tool access

Hallucinations don’t disappear when you add tools. They change shape.

In agent systems, “hallucination” can mean:

- choosing the wrong tool,
- inventing or mangling tool parameters,
- or pretending to have used a tool (“tool bypass”) when it didn’t.

Research attention has started to shift toward these problems. A 2026 arXiv paper studies hallucinations in agent tool selection, including malformed parameters and tool-bypass behaviors. The message is blunt: once an agent can act, hallucinations stop being embarrassing and start being expensive.
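One concrete guardrail against tool bypass: compare the tools the agent claims to have used against the execution log, and flag any claim with no matching entry. The log format below is illustrative:

```python
# Sketch of a tool-bypass check: a claimed tool use must match an entry
# in the execution log. The log format is illustrative only.

def detect_bypass(claimed_tools, execution_log):
    executed = {entry["tool"] for entry in execution_log if entry.get("ok")}
    return sorted(set(claimed_tools) - executed)  # claimed but never run

log = [{"tool": "search_kb", "ok": True}]
print(detect_bypass(["search_kb"], log))                # claims check out
print(detect_bypass(["search_kb", "update_crm"], log))  # bypass detected
```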

Error propagation: small mistakes become workflows

The agent loop that makes systems flexible also creates failure cascades. A wrong tool call can produce wrong data, which becomes the basis for the next decision, which triggers another action. Unlike a single chatbot answer, agent output can alter systems of record—calendars, tickets, emails, or databases.

Brittleness and overconfidence

Agents also fail in non-mysterious ways. They struggle with edge cases, unexpected UI states, and ambiguous instructions. The tone of modern LLMs—fluent, assured—adds a social hazard: users may assume competence because the writing sounds competent.

Organizations deploying agents should plan for routine failure, not rare catastrophe. Routine failure is what drains trust.

What AI agents are good for (today) and how to evaluate them

Agents are most valuable when they operate inside a narrow lane: clear goals, limited tools, and outputs you can verify. That’s not a limitation to apologize for—it’s the difference between reliable automation and unpredictable autonomy.

To evaluate an agent product, focus on boundaries and evidence. Where does autonomy stop? What approvals exist before external messages go out or systems of record change? What’s logged? What happens when a tool fails—does the system stop, ask, or guess?

The best use cases tend to look unglamorous: ticket triage, routing, structured updates, information gathering from approved sources. These align with how agent loops behave and with what organizations can realistically monitor and constrain.

Realistic use cases: automation where the steps are known

Agents perform best when a task has:

- a clear goal,
- a limited toolset,
- and a workflow that can be verified.

Think: triaging support tickets, drafting and routing emails for approval, gathering information from approved sources, or updating structured fields in internal systems. These are not glamorous examples, but they match how agent loops actually behave.

A practical way to evaluate an “agentic” product is to ask: where does the autonomy stop? If the system can take action, what approvals exist? What can it do without a human? What’s logged?

Case study pattern: support triage and routing (why it works)

Support triage is a useful illustration because it’s naturally modular:

- classify the issue,
- retrieve relevant account context,
- suggest next steps,
- and file/update a ticket.

A tool-calling agent can call a knowledge base search tool, then call a ticketing API with validated fields. The system can require human review before sending customer-facing messages. That combination—tool calling plus verification—fits the MRKL mindset: let the model route, let tools do the authoritative work.
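The triage pattern above can be sketched end to end: classify, retrieve context, draft, and gate the result behind human approval before anything is filed. Every name here is hypothetical; the CRM lookup is a stub:

```python
# Sketch of support triage with a human-approval gate. All function and
# field names are hypothetical; retrieve_context stands in for a CRM call.

def classify(message):
    return "billing" if "charge" in message.lower() else "general"

def retrieve_context(account_id):
    return {"account": account_id, "plan": "pro"}   # stub for a CRM lookup

def triage(message, account_id, approve):
    category = classify(message)
    context = retrieve_context(account_id)
    draft = f"[{category}] {message[:40]} (plan={context['plan']})"
    if not approve(draft):                           # human review gate
        return {"status": "held_for_review", "draft": draft}
    return {"status": "filed", "ticket": draft}

result = triage("Unexpected charge on my card", "acct-42",
                approve=lambda draft: False)         # reviewer declines
print(result["status"])                              # nothing leaves unreviewed
```

The `approve` callback is the whole point: the model routes and drafts, but a person (or a stricter policy) decides whether anything customer-facing actually happens.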

Case study pattern: legacy UI automation (why it’s tempting—and risky)

Computer use shines when no API exists: an agent can “drive” a browser to pull a report or enter a form. The same power makes it dangerous. UI-driven automation can fail when a dialog box appears or when the agent encounters unexpected content.

Anthropic’s explicit guidance to sandbox computer use and avoid sensitive credentials should frame procurement conversations. If a vendor proposes “we’ll just let the agent log in and click around,” ask what isolation they provide, what credentials are exposed, and what happens when the UI changes.

Practical takeaways: how to buy, build, or deploy agents without regret

If agents have a reputation problem, it’s because the easiest version to demo is not the safest version to ship. Real deployments need boring fundamentals: logs, validation, permissioning, and sane fallback behavior when tools fail.

This section translates the earlier concepts—agent loops, traces, tool calling, computer use, and multi-agent coordination—into practical demands and build habits. The goal is not to slow adoption; it’s to keep autonomy bounded and behavior legible.

The overarching procurement and engineering principle is simple: a system that can act must also be a system you can audit. If a vendor can’t show failure handling, or if a team can’t explain what the agent does when uncertainty rises, you’re not looking at automation—you’re looking at theater.

What to demand from vendors

Readers evaluating agent products—especially in enterprise settings—should insist on basics that sound boring because they are:

- Full traces and logs of decisions and actions (OpenAI emphasizes traceability for a reason).
- Tool-call validation (schemas, argument checks, error handling).
- Permissioning that maps to real roles (what can the agent read/write?).
- Fallback behavior when tools fail (does it stop, ask, or guess?).

A vendor demo that never shows failure handling is not a demo. It’s theater.

What to build into your own agent systems

Engineering teams can reduce risk with a few design habits drawn directly from the agent patterns discussed:

- Use the model as a router (MRKL) rather than a monolith.
- Prefer tool calling for critical systems; treat computer use as last resort.
- Use stepwise action/observation loops (ReAct) with checkpoints.
- Consider multi-agent setups (AutoGen-style) only when traceability and ownership are clear.

Multiple perspectives: the optimism and the caution are both rational

The optimistic view is that agents finally turn LLMs into productive software components by giving them tools and constraints. The cautious view is that adding actions increases the cost of error, and real systems aren’t forgiving.

Both can be true. Agents can meaningfully reduce busywork in narrow domains. Agents can also create new classes of operational risk, especially when “autonomy” becomes a selling point rather than a carefully bounded feature.

A sober stance is not anti-agent. It’s pro-engineering.

Conclusion: the future of agents will be decided by restraint

AI agents are best understood as a loop: decide, act, observe, repeat. That loop can deliver real value—especially when it routes tasks to authoritative tools instead of improvising answers. The same loop can also turn small model errors into system-level consequences.

OpenAI’s focus on tool use, handoffs to specialized agents, and “full trace” reflects where the field is heading: toward systems that can be audited. Anthropic’s emphasis on sandboxing for computer use reflects the other half of the truth: the moment an agent can click, it can also misclick.

The next year of “agent” adoption will reward teams that treat autonomy as a privilege earned through constraints, logs, and verification. The teams that treat autonomy as a vibe will ship confident software that quietly breaks things.

Frequently Asked Questions

1) What is an AI agent in plain English?

An AI agent is software that uses an AI model to decide what to do next over multiple steps, often by calling tools or taking actions like updating a system or sending a message. The key feature is the agent loop: it acts, observes results, then chooses the next step until it reaches a goal or gets stuck.

2) How is an AI agent different from a chatbot?

A chatbot mainly generates responses. An agent goes further: it has a goal and can take multi-step actions, such as searching a database, creating a ticket, or scheduling a meeting. Many products blur the terms, so look for evidence of tool access, state/memory, and traceable actions—not just conversation.

3) What does “tool calling” mean?

Tool calling means the model outputs a structured request to call a specific tool with specific arguments, and the application executes it. Tool calling is easier to audit and constrain than free-form “do it yourself” behavior because arguments can be validated and permissions can be enforced.

4) What is “computer use,” and why do people warn about it?

Computer use lets an agent operate a UI by viewing screenshots and controlling mouse/keyboard. It’s useful for automating legacy systems without APIs. It’s also riskier and more brittle: UI changes can break workflows, and misclicks can cause real harm. Anthropic recommends sandboxing (VM/container) and avoiding sensitive credentials.

5) Do agents reduce hallucinations?

Sometimes, but not automatically. Tool-based designs (as discussed in research like ReAct and Toolformer) can reduce factual guessing by letting the system check sources. Agents introduce new failure modes too: choosing the wrong tool, inventing parameters, or claiming to have used a tool when they didn’t.

6) What are the most common real-world failure modes for agents?

Common failures include tool-selection errors, malformed tool parameters, “tool bypass” (pretending a tool was used), and error propagation across steps. UI-based “computer use” adds brittleness: pop-ups, layout changes, or unexpected states can derail the workflow. Strong logging and guardrails matter more than polished demos.

7) When should a team consider multi-agent systems?

Multi-agent systems can help when tasks benefit from specialization—planning, execution, verification, and communication handled by different agents. Microsoft Research’s AutoGen frames this as a flexible application pattern. Multi-agent setups increase coordination complexity, so they make sense only when traces, ownership, and escalation rules are clearly designed.

Key Insight

Agents are neither magic nor mere chatbots: they are a looped system (decide → act → observe) whose reliability depends on tools, guardrails, and traceability.

The agent loop (core pattern)

  1. A user provides a goal.
  2. The model proposes an action (often a tool call).
  3. The system executes the action.
  4. The system returns results.
  5. The model decides the next step.

Chatbot vs. agent (behavioral difference)

Chatbot
  • Answers a prompt
  • Single-turn response
  • Limited or no action execution
Agent
  • Pursues a goal
  • Multi-step autonomy
  • Calls tools/takes actions with state and logs

Vendor must-haves for agent deployments

  • Full traces and logs of decisions and actions
  • Tool-call validation (schemas, argument checks, error handling)
  • Permissioning mapped to real roles (read/write boundaries)
  • Fallback behavior when tools fail (stop, ask, or escalate—don’t guess)

Editor's Note

A vendor demo that never shows failure handling is not a demo. It’s theater.
  • 5 steps: most agent systems reduce to a five-step loop (goal → action → execution → results → next step); each cycle adds correction opportunities and error risks.
  • 2022: key agent patterns emerged (MRKL in May; ReAct in October), reflecting a shift toward modular tools plus stepwise action/observation loops.
  • 2023: Toolformer popularized the idea that models shouldn’t guess when they can query—tool use as reliability engineering, not a novelty feature.
  • 2024: AutoGen (presented at COLM) framed multi-agent conversation as a general application pattern—useful for specialization, risky without traces and authority rules.
About the Author
TheMurrow Editorial is a writer for TheMurrow covering explainers.

