AI Agents Are Becoming Your Middleman—But Here’s the 2-Line Web ‘Handshake’ That Determines Whether They Can Buy, Book, or Break Things
A plain-text file at your domain root still decides what many automated systems can reach—just as agents shift from reading pages to taking actions. The catch: it’s a handshake, not a lock.

Key Points
- Understand robots.txt’s “two-line handshake”: `User-agent` plus `Allow`/`Disallow` can broadly shape what compliant agents can even reach.
- Treat RFC 9309 as a policy baseline, not protection: robots.txt is advisory, so real security still requires authentication and authorization.
- Separate bot purposes before blocking: training crawlers, search/retrieval crawlers, and user-triggered browsing may use different user agents and rules.
A plain-text file at the root of your website now sits uncomfortably close to the front door of the AI economy.
For decades, `/robots.txt` was a quiet agreement between publishers and crawlers: a simple set of instructions, checked early, that suggested what a bot should and shouldn’t fetch. It wasn’t glamorous. It rarely made headlines. It also helped keep the web legible at scale.
Now the stakes have changed. AI systems are no longer limited to reading pages for indexing and summaries. They are increasingly positioned as intermediaries that browse on a user’s behalf—and, in some cases, take actions. Booking, buying, changing settings, following links: tasks that turn “access” from an abstract question into a practical one.
The irony is that the web’s most familiar “handshake” with machines remains almost comically small: two lines of text that can allow or block entire classes of automated agents. That minimalism is both its power and its problem.
“A two-line file can decide whether an automated agent gets to read your work—or even reach the pages where actions begin.”
TheMurrow Editorial
The two-line handshake that still runs the web
In practice, many site operators rely on an almost minimalist pattern—a “two-line handshake” that grants or denies broad access:
- `User-agent: *`
- `Allow: /` or `Disallow: /`
Those two directives can permit or forbid a class of crawlers from accessing all paths. That’s not a trick or a hack; RFC 9309 describes how a robots.txt file is composed of groups that begin with one or more `User-agent:` lines, followed by `Allow:` and/or `Disallow:` rules.
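As a minimal sketch of the pattern described above, Python's standard-library `urllib.robotparser` can evaluate a two-line file. The wildcard `*` group, the bot name `ExampleBot`, and the `example.com` URL are illustrative assumptions, not values taken from any provider's documentation.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Two hypothetical two-line files: one blocks everything, one allows everything.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
ALLOW_ALL = "User-agent: *\nAllow: /\n"

URL = "https://example.com/articles/some-page"  # placeholder URL

print("block-all file permits fetch:", can_fetch(BLOCK_ALL, "ExampleBot", URL))
print("allow-all file permits fetch:", can_fetch(ALLOW_ALL, "ExampleBot", URL))
```

The result only describes what a compliant crawler would decide for itself; nothing in this check stops a client that never reads the file.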
Why two lines carry so much weight
The convenience of a two-line rule also produces overreach. A two-line decision is blunt: it may keep out a training crawler but also reduce discoverability in certain search-like AI experiences, depending on how a provider uses its crawlers.
The critical limitation: advisory, not enforcement
Robots.txt is advisory: it works only when a crawler chooses to comply. RFC 9309 notes that anyone who needs true access control should use real application-layer security, for example authentication. The file can shape good-faith behavior, but it cannot compel it.
“Robots.txt is a handshake, not a lock.”
TheMurrow Editorial
RFC 9309: when a convention became an internet standard
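RFC 9309, the Robots Exclusion Protocol, was published in September 2022 and turned a long-running convention into an internet standard. For publishers and developers, the standardization is more than bureaucratic housekeeping. It means disputes can increasingly be framed in reference to a shared text: what a crawler should do, how it should decide which rule applies, and how groups should be evaluated.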
How robots.txt matches and resolves rules
RFC 9309 describes a robots.txt file as a series of groups. Each group consists of:
- One or more `User-agent:` lines
- Followed by rules such as `Allow:` and `Disallow:`
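Matching is based on the crawler’s declared product token (its user-agent token), compared case-insensitively. Within the matching group, the standard formalizes a longest-match approach: the most specific rule, meaning the one with the longest matching path, applies, and when an `Allow` and a `Disallow` rule are equally specific, the `Allow` rule should be used. That matters because modern sites often combine broad rules (“Disallow all”) with narrow exceptions (“Allow this path”); under RFC 9309 it is specificity, not the order in which rules appear, that determines the outcome.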
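The longest-match idea can be sketched in a few lines of Python. This is a simplified illustration written by hand rather than a full RFC 9309 implementation: it uses plain prefix matching, ignores the `*` and `$` special characters, and measures length in characters. The function name `is_allowed` and the sample paths are hypothetical. In this sketch the most specific rule wins regardless of where it appears in the group, and a tie goes to `Allow`.

```python
from typing import List, Tuple

Rule = Tuple[str, str]  # (directive, path), e.g. ("Disallow", "/private/")


def is_allowed(rules: List[Rule], path: str) -> bool:
    """Resolve one group's rules for a path using longest-match.

    Simplification: prefix matching only; wildcards are not handled.
    """
    best_len = -1
    allowed = True  # if no rule matches, the path may be fetched
    for directive, rule_path in rules:
        # An empty rule path matches nothing in this sketch.
        if not rule_path or not path.startswith(rule_path):
            continue
        is_allow = directive.lower() == "allow"
        longer = len(rule_path) > best_len
        tie_prefers_allow = len(rule_path) == best_len and is_allow
        if longer or tie_prefers_allow:
            best_len = len(rule_path)
            allowed = is_allow
    return allowed


# A broad block with one narrow exception (hypothetical paths).
rules = [("Disallow", "/"), ("Allow", "/public/")]
print(is_allowed(rules, "/public/post"))
print(is_allowed(rules, "/account"))
```

Writing the resolution out by hand makes the tie-breaking explicit; parsers in the wild do not all resolve rules the same way, so it is worth checking how a given library behaves before relying on it.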
Why standardization matters for “agent middlemen”
The standard doesn’t settle the policy questions that agent intermediaries raise, but it clarifies the technical base layer. That clarity is a necessary precondition for the harder governance debate now arriving.
When agents shift from reading to doing, access becomes risk
OpenAI has described the security risks that emerge when systems retrieve web content and could be manipulated by crafted pages or links—especially around prompt-injection and URL-based data exfiltration attempts. These are not abstract concerns. They reflect a practical reality: when an agent “browses,” the content it reads can try to influence what it does next.
The “agent as browser” threat model
An agent that browses like a user creates new risk surfaces, including:
- Prompt injection embedded in web pages that tries to override instructions
- Data exfiltration attempts that entice an agent to reveal or transmit sensitive information
- Confusion between a trusted tool instruction and a malicious web instruction
OpenAI’s link-safety discussion frames these risks as central to agent design: if models retrieve web content, they can be tricked into leaking or mishandling data unless mitigations are in place.
Why robots.txt becomes more consequential—without becoming stronger
If an agent is only indexing public pages, a permissive robots policy is a visibility choice. If an agent might follow links into transactional flows, the same permissive policy can become a safety choice. Publishers face a modern dilemma: how to remain discoverable without exposing sensitive surfaces to automated browsing in ways they didn’t anticipate.
“As soon as software can act, ‘who is allowed to fetch’ turns into ‘who is allowed to affect.’”
TheMurrow Editorial
The bot taxonomy problem: crawler, search, training, or user-triggered fetch?
The ecosystem often conflates at least two categories:
1) Crawlers used for indexing, training collection, or search-style retrieval
2) User-triggered agent actions where a model visits a URL because a person asked it to
Those categories can look identical from a server log: both involve automated requests. Yet their purposes—and the policies around them—can differ sharply.
Why the distinction matters
OpenAI’s own documentation underscores the stakes: it documents multiple crawlers and frames robots.txt as the mechanism by which webmasters manage how their sites interact with OpenAI systems.
That sounds simple until you ask: which system are you controlling—training, search, or user browsing?
A real-world governance dilemma (without the drama)
Publishers deciding whether to block a given bot run into questions such as:
- Will blocking a crawler stop training collection?
- Will it also reduce answer quality or visibility in search-like experiences?
- Does a user-triggered fetch behave like a crawler or like a browser?
The honest answer is that different providers may draw the lines differently. The web wants one switch labeled “AI,” but it has several electrical circuits—and robots.txt can only label the wires it knows.
Targetable user agents: what OpenAI publishes, and why it matters
OpenAI documents multiple bots and explicitly positions robots.txt as the way webmasters manage interactions with OpenAI systems. Two names recur in the documentation and public references:
- OAI-SearchBot, a named crawler with a published user-agent string (per OpenAI documentation)
- GPTBot, widely referenced as the training-oriented crawler that can be disallowed via robots.txt
That separation is the beginning of a more granular approach: publishers can choose to allow one function (for example, search retrieval) while restricting another (for example, training collection). It isn’t perfect, but it is more nuanced than a single “block AI” toggle.
Practical example: the minimal rules that shape visibility
- Allow broad access:
  - `User-agent: *`
  - `Allow: /`
- Block broad access:
  - `User-agent: *`
  - `Disallow: /`
But granular control comes from targeting specific user agents with distinct groups. RFC 9309 supports group-based structures, and providers that publish distinct tokens make that structure actionable.
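To illustrate group-based targeting, the sketch below parses a robots.txt with one group per token and asks what each token may fetch. It uses Python's standard-library `urllib.robotparser` and the two token names mentioned in this article; the specific policy (block `GPTBot`, allow `OAI-SearchBot`) and the `example.com` URL are only assumptions chosen for the example, not a recommendation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: one group per published token.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""


def decisions(robots_txt: str, tokens, url: str) -> dict:
    """Map each user-agent token to whether it may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {token: parser.can_fetch(token, url) for token in tokens}


URL = "https://example.com/articles/some-page"  # placeholder URL
result = decisions(ROBOTS_TXT, ["GPTBot", "OAI-SearchBot"], URL)
for token, allowed in result.items():
    print(f"{token}: {'allowed' if allowed else 'disallowed'}")
```

Because each group names a single token, the two functions can be governed independently, which is the practical benefit of providers publishing distinct tokens.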
The hard part: clarity for humans, not just machines
If you manage a site with a public editorial surface, a subscriber-only tier, and a transactional account area, you may want different behavior in each zone. Robots.txt can express some of that with path rules, but it cannot authenticate. It cannot distinguish a paid subscriber from a random bot. It cannot enforce “read but don’t act.”
Which leads to the next point: robots.txt is necessary, but it can’t carry the entire burden alone.
Robots.txt isn’t security—so what should publishers actually do?
Robots.txt is not a security control. That sounds obvious, yet the agent era tempts people to treat robots.txt like a firewall: “If I disallow it, I’m safe.” The standard itself rejects that assumption.
What robots.txt is good for
- Signaling preferences to compliant crawlers
- Reducing load from unwanted crawling
- Separating different classes of automated access when user agents are distinct and honest
It also creates a public, auditable statement of intent. That matters when norms are still forming.
What robots.txt cannot do
- Verify a requester’s identity
- Prevent a non-compliant bot from fetching content
- Protect sensitive information if it’s publicly reachable
- Guarantee an agent won’t take action in an interface that allows it
Publishers who have genuine risk—account pages, internal tools, admin panels, personal data—need authentication, authorization, and conventional security controls. Robots.txt can complement those controls, but it cannot replace them.
A practical posture: “polite exclusion + real locks”
1) Use robots.txt as a policy signal for well-behaved crawlers
2) Use authentication and careful design for anything that should not be accessed or acted upon by unknown clients—human or automated
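The "real locks" half of that posture might look like the following sketch: a check that decides access from a credential and deliberately ignores the self-declared `User-Agent` header. Everything here is hypothetical and simplified for illustration, including the `is_authorized` name, the bearer-token scheme, the placeholder secret, and the assumption that headers arrive as a plain dictionary with canonical casing.

```python
import hmac
from typing import Mapping

# Placeholder secret for illustration only; a real deployment would load
# credentials from secure configuration, never from source code.
EXPECTED_TOKEN = "replace-with-a-real-secret"


def is_authorized(headers: Mapping[str, str],
                  expected_token: str = EXPECTED_TOKEN) -> bool:
    """Allow a request only if it carries the expected bearer token.

    The User-Agent header is not consulted: it is self-declared and
    says nothing reliable about who is calling.
    """
    auth = headers.get("Authorization", "")
    scheme, _, token = auth.partition(" ")
    if scheme.lower() != "bearer" or not token:
        return False
    # Constant-time comparison to avoid leaking information via timing.
    return hmac.compare_digest(token.encode(), expected_token.encode())


print(is_authorized({"User-Agent": "GPTBot"}))
print(is_authorized({"Authorization": "Bearer replace-with-a-real-secret"}))
```

The point of the design is that the decision is identical for a human browser, a compliant crawler, and a bot that ignores robots.txt: without the credential, none of them gets in.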
The agent era doesn’t change the fundamentals of security. It changes how quickly the fundamentals become relevant to ordinary publishing decisions.
“If it matters, lock it. If it’s a preference, publish it.”
TheMurrow Editorial
What robots.txt can and can’t do
- ✓ Signal preferences to compliant crawlers
- ✓ Reduce load from unwanted crawling
- ✓ Separate automated access classes when tokens are distinct and honest
- ✗ Verify identity
- ✗ Prevent non-compliant bots from fetching
- ✗ Protect sensitive data if it’s publicly reachable
Case study: the “buy, book, or break things” problem
Agentic systems blur the boundary between reading a page and acting on it. A user might ask an assistant to “find the best option and book it.” A system might browse listings and then proceed into steps that look like ordinary web navigation, but at machine speed and with different failure modes.
Where robots.txt helps—and where it doesn’t
Robots.txt can ask compliant crawlers to stay out of the paths where transactions begin. It cannot ensure that the next request isn’t a user-triggered agent fetch, and it cannot prevent a malicious actor from ignoring the file. In a world where browsing can lead into action, publishers need stronger boundaries than “please don’t.”
Link safety and why providers care
The publisher’s role is not to solve model safety. The publisher’s role is to avoid mistaking a courtesy protocol for an access control system—and to recognize that “publicly reachable” increasingly means “reachable by automation.”
Practical takeaways: how to think like an editor and an operator
A checklist for sane robots decisions
- Use specific user agents when possible: providers that publish tokens make targeted rules feasible
- Avoid relying on robots.txt for sensitive areas: put authentication in front of anything that truly matters
- Segment your site by intent: public editorial content vs. transactional flows vs. private dashboards
- Monitor and revisit: as providers change user agents and products, your assumptions can drift
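For the "monitor and revisit" item, one low-effort approach is to tally which declared tokens show up in your logs. The sketch below is a rough illustration: it assumes the user-agent strings have already been extracted from log lines, the token list contains only the two names mentioned in this article and would need to be kept current, and the sample strings are made up for the example rather than copied from any provider's published user-agent strings.

```python
from collections import Counter
from typing import Iterable, Sequence

# Tokens named in this article; extend as providers publish or change them.
KNOWN_TOKENS = ("GPTBot", "OAI-SearchBot")


def tally_bot_tokens(user_agents: Iterable[str],
                     tokens: Sequence[str] = KNOWN_TOKENS) -> Counter:
    """Count requests per known token; everything else lands in 'other'."""
    counts: Counter = Counter()
    for ua in user_agents:
        lowered = ua.lower()
        for token in tokens:
            if token.lower() in lowered:
                counts[token] += 1
                break
        else:
            counts["other"] += 1
    return counts


# Made-up sample strings, for illustration only.
sample = [
    "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)",
    "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]
counts = tally_bot_tokens(sample)
print(dict(counts))
```

A tally like this reflects only what clients claim to be. User-agent strings are self-declared, so the counts are a signal for revisiting policy, not proof of identity.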
Multiple perspectives worth taking seriously
Providers argue that clear, standard mechanisms like robots.txt help maintain a functional web: machine-readable rules, respected at scale, reduce friction. Skeptics respond that “respected at scale” is still voluntary—and that the incentives of AI intermediaries do not always align with creators’.
Both perspectives share one truth: the old handshake is now doing new work. Publishers should treat it with the seriousness of any public-facing policy—because, increasingly, it is one.
Conclusion: the handshake is still small, but the door is bigger
RFC 9309 gave the protocol a formal spine in September 2022, and the AI boom gave it renewed relevance. The file can still do something important: it can express intent in a shared language that many automated systems recognize early. That is real power, even if it is not enforcement.
Publishers now face a more adult version of the old choice: decide what you welcome, decide what you refuse, and decide what must be protected by actual locks. A two-line handshake can signal your terms. It cannot defend them.
The next phase of the web will not be shaped only by models and product launches. It will be shaped by thousands of quiet decisions made in plain text—at the root of a domain—about who gets to knock, and what happens when the door opens.
Frequently Asked Questions
What is the “two-line handshake” in robots.txt?
The phrase refers to the simplest useful robots.txt structure: a `User-agent:` line followed by an `Allow: /` or `Disallow: /` rule. Those two lines can broadly permit or block a class of crawlers across an entire site. RFC 9309 formalizes how these directives are structured and interpreted, making the minimalist approach both common and standards-based.
Is robots.txt legally or technically enforceable?
Robots.txt is advisory, not enforcement. RFC 9309 describes it as a protocol that depends on crawler compliance. A well-behaved bot may check it early and follow it, but a malicious or non-compliant bot can ignore it. Anyone needing real protection should use authentication and application-layer security controls rather than relying on robots.txt.
Why did RFC 9309 matter, and when was it published?
RFC 9309 (Robots Exclusion Protocol) was published in September 2022. It matters because it formalized a long-running convention into an internet standard, clarifying mechanics such as group structure, user-agent matching, and rule resolution behavior. That clarity helps publishers and bot operators point to a shared baseline when policies are disputed.
If I block AI bots, will I disappear from AI answers?
It depends on which bots you block and how an AI provider sources information. The ecosystem includes different automated agents—training crawlers, search/retrieval crawlers, and user-triggered browsing. OpenAI, for example, documents multiple bots and encourages using robots.txt for management. Blocking one user agent may affect one function more than another.
What’s the difference between a crawler and a user-triggered agent visit?
A crawler typically fetches content systematically for indexing, training collection, or search-style retrieval. A user-triggered agent visit happens because a person asked an AI system to open a specific URL as part of a task. From your server’s perspective, both are automated requests, but their purpose and policy treatment can differ—one reason governance is still confusing.
Which OpenAI bots can webmasters target in robots.txt?
OpenAI documents distinct user agents and positions robots.txt as the control mechanism. Two widely referenced examples are OAI-SearchBot (a named crawler with a published user-agent string in OpenAI documentation) and GPTBot, widely referenced as the training-oriented crawler that can be disallowed via robots.txt. Targetable tokens enable more granular choices than a blanket allow or deny.















