Network Working Group G. Illyes
Internet-Draft Google LLC.
Intended status: Informational 18 April 2026
Expires: 20 October 2026
Robots Exclusion Protocol Extension for URL Level Control
draft-illyes-repext-03
Abstract
This document extends RFC 9309 by specifying additional URL level
controls through an HTTP response header and, for historical reasons,
through HTML meta tags originally developed in 1996. Additionally,
it moves the HTTP response header out of the experimental header
space (i.e., "X-") and defines how multiple headers are combined,
which was previously unspecified.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on 20 October 2026.
Copyright Notice
Copyright (c) 2026 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents (https://trustee.ietf.org/
license-info) in effect on the date of publication of this document.
Please review these documents carefully, as they describe your rights
and restrictions with respect to this document.
Illyes Expires 20 October 2026 [Page 1]
Internet-Draft REPext for URL level April 2026
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2
2. Conventions and Definitions . . . . . . . . . . . . . . . . . 2
3. Specification . . . . . . . . . . . . . . . . . . . . . . . . 3
3.1. Robots control . . . . . . . . . . . . . . . . . . . . . 3
3.1.1. HTTP Response Header . . . . . . . . . . . . . . . . 3
3.1.2. HTML Meta Element . . . . . . . . . . . . . . . . . . 4
3.1.3. Robots control rules  . . . . . . . . . . . . . . . . 5
3.1.4. Caching of values . . . . . . . . . . . . . . . . . . 5
4. Security Considerations . . . . . . . . . . . . . . . . . . . 5
5. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 6
5.1. HTTP Field Name Registration . . . . . . . . . . . . . . 6
5.2. Robots Control Rules Registry . . . . . . . . . . . . . . 6
5.3. Deprecation of X-Robots-Tag . . . . . . . . . . . . . . . 7
6. References . . . . . . . . . . . . . . . . . . . . . . . . . 7
6.1. Normative References . . . . . . . . . . . . . . . . . . 7
6.2. Informative References . . . . . . . . . . . . . . . . . 8
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . 8
Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 8
1. Introduction
While the Robots Exclusion Protocol [ROBOTSTXT] enables service
owners to control how, if at all, automated clients known as crawlers
may access the URLs on their services as defined by [WEBLINKING], it
does not provide controls on how the data returned by their service
may be used once access is allowed.
That use-case control is instead left to URL level controls
implemented in response headers or, in the case of HTML, in a meta
tag as defined by [HTML-META]; these controls were originally
developed in 1996 and have been widely adopted since. This document
specifies these control tags and, in the case of the response header
field, brings it into compliance with [HTTP-SEMANTICS].
Application developers are requested to honor these tags. Note,
however, that the tags are not a form of access authorization.
2. Conventions and Definitions
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in
BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
capitals, as shown here.
This specification uses the following terms from
[STRUCTURED-FIELD-VALUES]: List, String, Parameter.
3. Specification
3.1. Robots control
The URL level crawler controls are a key-value pair that can be
specified two ways:
* an HTTP response header structured field as specified by
[STRUCTURED-FIELD-VALUES].
* for historical reasons, in the case of HTML, one or more meta tags
as defined by the [HTML-META] specification.
3.1.1. HTTP Response Header
The robots-tag field is a List as defined in
[STRUCTURED-FIELD-VALUES]. Each member of the List is an Item
representing a product token. Rules applicable to a product token
are defined as Parameters of that Item. For historical reasons,
implementors SHOULD also support the experimental field name X-
Robots-Tag.
The product token is either a specific string as defined in
Section 2.2.1 of [ROBOTSTXT] or the global identifier "*". All rules
defined in this specification are restrictive. The absence of a rule
for a specific instruction constitutes no instruction.
If a product token appears multiple times in the field, or if rules
are provided for both a specific product token and the global *
identifier, the implementor MUST apply the union of all restrictive
rules. A restriction applied to the global * token applies to all
accessors regardless of whether a specific product token is present.
For example, the following response header field specifies a global
nosnippet rule for all accessors, and an additional noindex rule
specifically for ExampleBot:
Robots-Tag: *;nosnippet, ExampleBot;noindex
The structured field in the example above is deserialized into the
following objects:
"*" = [
["nosnippet", true]
],
"ExampleBot" = [
["noindex", true],
["nosnippet", true]
]
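The deserialization and union logic above can be sketched as follows.
This is a simplified parser covering only the boolean-parameter
subset used in the example, not a full [STRUCTURED-FIELD-VALUES]
implementation, and the function names are illustrative:

```python
def parse_robots_tag(field_value):
    """Parse a simplified Robots-Tag field: a comma-separated List
    whose members are a product token followed by ;-separated boolean
    rules.  Not a full RFC 8941 parser; bare-key parameters only."""
    rules = {}
    for member in field_value.split(","):
        parts = [p.strip() for p in member.strip().split(";")]
        token, params = parts[0], parts[1:]
        rules.setdefault(token, set()).update(p.lower() for p in params if p)
    return rules

def effective_rules(rules, product_token):
    """Union of the global '*' rules and the accessor-specific rules,
    as required for restrictive rules."""
    return rules.get("*", set()) | rules.get(product_token, set())
```

For the example field above, the effective rules for ExampleBot are
the union {nosnippet, noindex}, while any other accessor receives
only {nosnippet}.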
Implementors SHOULD impose a parsing limit on the field value to
protect their systems. The parsing limit MUST be at least 8
kibibytes [KiB].
For a robots-tag field that exceeds the implementor's parsing limit,
the implementor MUST process the data up to that limit. Any complete
and valid List members found within the processed bytes MUST be
honored. Any partially transmitted or truncated List member at the
limit, and all subsequent bytes in that field, MUST be ignored. This
ensures consistency with the processing of robots.txt files as
specified in Section 2.1.1 of [ROBOTSTXT].
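One possible way to apply the truncation rule, assuming the limit is
checked against the raw field bytes (the function name is
illustrative):

```python
PARSING_LIMIT = 8 * 1024  # 8 KiB lower bound required by this document

def truncate_field(raw, limit=PARSING_LIMIT):
    """Process at most `limit` bytes of a robots-tag field.  If the
    field was cut mid-member, drop the trailing partial member so that
    only complete, valid List members are honored."""
    if len(raw) <= limit:
        return raw
    kept = raw[:limit]
    # A truncation point inside a member invalidates that member;
    # keep only the bytes up to the last complete member separator.
    last_sep = kept.rfind(b",")
    return kept[:last_sep] if last_sep != -1 else b""
```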
3.1.2. HTML Meta Element
For historical reasons the robots-tag values may also be specified
by HTML service owners as an HTML meta tag. In the case of the meta
tag, the name attribute is used to specify the product token, and the
content attribute to specify the comma separated robots-tag rules.
As with the header, the product token may be the global token
"robots", which signifies that the rules apply to all requestors, or
a specific product token applicable to a single requestor. For
example:

   <meta name="robots" content="nosnippet">
   <meta name="ExampleBot" content="noindex">
Multiple robots meta elements may appear in a single HTML document.
The implementor MUST apply the union of all rules found in all
applicable elements. This includes rules specified for the global
robots token and rules specified for the accessor's specific product
token.
Because all rules specified in this document are restrictive, they
are inherently additive. An accessor cannot "opt-out" of a
restriction defined in a global 'robots' tag by being mentioned in a
specific tag without that restriction. If any applicable tag
contains a restrictive rule, that rule MUST be honored.
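A sketch of collecting and combining robots meta rules with a
standard-library HTML parser follows. It assumes rules appear as a
comma-separated list in the content attribute; for brevity it does
not restrict collection to the head element as Section 4 recommends:

```python
from html.parser import HTMLParser

class RobotsMetaCollector(HTMLParser):
    """Collect rules from <meta name=... content=...> elements,
    keyed by the lowercased product token in the name attribute."""
    def __init__(self):
        super().__init__()
        self.rules = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        name, content = a.get("name"), a.get("content")
        if name and content:
            self.rules.setdefault(name.lower(), set()).update(
                r.strip().lower() for r in content.split(",") if r.strip())

def meta_rules(html, product_token):
    """Union of the global 'robots' rules and accessor-specific rules
    found anywhere in the document."""
    collector = RobotsMetaCollector()
    collector.feed(html)
    return (collector.rules.get("robots", set())
            | collector.rules.get(product_token.lower(), set()))
```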
3.1.3. Robots control rules
The possible values of the rules are:
* noindex - instructs the parser to not store the served data in its
publicly accessible index.
* nosnippet - instructs the parser to not reproduce any stored data
as an excerpt snippet.
The values are case insensitive.
Implementors may support other rules as specified in Section 2.2.4 of
[ROBOTSTXT].
3.1.4. Caching of values
Implementors SHOULD link the validity of robots-tag rules to the
freshness of the associated resource. Rules extracted from an HTTP
response header or an HTML meta tag remain in effect until the
resource is recrawled or the cached representation expires.
Implementors SHOULD determine the freshness of the rules using
standard HTTP cache directives, such as Cache-Control or Expires, as
defined in [HTTP-CACHING]. In the absence of explicit cache
directives, implementors MAY use heuristics to determine the refresh
interval, typically matching the scheduled recrawl frequency of the
resource.
If a resource is not recrawled due to a lack of perceived change, the
last known robots-tag rules MUST continue to be honored.
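The freshness linkage described in this section might be derived from
a Cache-Control response header as follows; the fallback interval is
an assumed illustrative value, not one mandated by this document:

```python
import re

DEFAULT_RECRAWL_SECONDS = 24 * 3600  # assumed heuristic, not normative

def rules_ttl(cache_control):
    """Derive how long cached robots-tag rules stay fresh from a
    Cache-Control response header value, falling back to a recrawl
    heuristic when no explicit directive is present."""
    if cache_control:
        m = re.search(r"max-age=(\d+)", cache_control)
        if m:
            return int(m.group(1))
    return DEFAULT_RECRAWL_SECONDS
```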
4. Security Considerations
The robots-tag is not a substitute for valid content security
measures. To control access to the URL paths where the robots-tag
appears, service owners SHOULD employ a valid security measure such
as HTTP Authentication as defined in [HTTP-SEMANTICS].
The content of the robots-tag header field is not secure, private or
integrity-guaranteed, and due caution should be exercised when using
it. Use of Transport Layer Security ([TLS]) with HTTP
([HTTP-SEMANTICS]) is currently the only end-to-end way to provide
such protection.
In the case of a robots-tag specified in an HTML meta element,
implementors should consider only the meta elements specified in the
head element of the HTML document, which is generally only accessible
to the service owner.
Implementors who execute client-side code MUST NOT treat the post-
execution DOM state as the exclusive source of truth. Instead,
implementors MUST apply the union of all restrictive rules identified
in both the initial HTML source and the final DOM state. A
restriction present in either state MUST be honored.
To protect against memory overflow attacks, implementors should
enforce a limit on how much data they will parse; see Section 3.1.1
for the lower limit.
5. IANA Considerations
5.1. HTTP Field Name Registration
IANA is requested to register the following HTTP field name in the
"Hypertext Transfer Protocol (HTTP) Field Name Registry" according to
the procedures defined in Section 18.4 of [HTTP-SEMANTICS]:
Field Name: Robots-Tag
Template: None
Status: permanent
Reference: [This document]
Comments: This field name supersedes the experimental X-Robots-Tag
field.
5.2. Robots Control Rules Registry
IANA is requested to create a new "Robots Control Rules" registry.
This registry manages the tokens used as parameters within the
Robots-Tag HTTP field and, by extension, the content attribute of the
robots meta element.
The registration policy for this registry is "IETF Review" or "Expert
Review" as defined in [IANA-GUIDELINES].
The initial entries for this registry are:
Rule Name: noindex
Description: Instructs the parser to not store the served data in its
publicly accessible index.
Reference: [This document, Section 3.1.3]
Rule Name: nosnippet
Description: Instructs the parser to not reproduce any stored data as
an excerpt snippet.
Reference: [This document, Section 3.1.3]
5.3. Deprecation of X-Robots-Tag
The X-Robots-Tag field name was used prior to the standardization of
the Robots-Tag field. New implementations MUST NOT use the X- prefix
for this field, adhering to the principles in [X-DEPRECATION]. While
parsers SHOULD continue to support X-Robots-Tag for backward
compatibility with legacy systems, it is formally deprecated by this
document.
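A consumer supporting both field names for backward compatibility
might merge the parsed rules as sketched below; the parse step is
assumed to yield a mapping from product token to a set of rules, and
the union is always safe because all rules are restrictive:

```python
def merge_field_rules(robots_tag, x_robots_tag):
    """Merge rules parsed from Robots-Tag and the deprecated
    X-Robots-Tag into one {product_token: rules} mapping.  Since all
    rules defined by this document are restrictive, taking the union
    never weakens a restriction."""
    merged = {}
    for source in (robots_tag, x_robots_tag):
        for token, rules in source.items():
            merged.setdefault(token, set()).update(rules)
    return merged
```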
6. References
6.1. Normative References
[HTTP-CACHING]
Fielding, R., Ed., Nottingham, M., Ed., and J. Reschke,
Ed., "HTTP Caching", STD 98, RFC 9111,
DOI 10.17487/RFC9111, June 2022,
<https://www.rfc-editor.org/info/rfc9111>.
[HTTP-SEMANTICS]
Fielding, R., Ed., Nottingham, M., Ed., and J. Reschke,
Ed., "HTTP Semantics", STD 97, RFC 9110,
DOI 10.17487/RFC9110, June 2022,
<https://www.rfc-editor.org/info/rfc9110>.
[IANA-GUIDELINES]
Cotton, M., Leiba, B., and T. Narten, "Guidelines for
Writing an IANA Considerations Section in RFCs", BCP 26,
RFC 8126, DOI 10.17487/RFC8126, June 2017,
<https://www.rfc-editor.org/info/rfc8126>.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997,
<https://www.rfc-editor.org/info/rfc2119>.
[RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
May 2017, <https://www.rfc-editor.org/info/rfc8174>.
[ROBOTSTXT]
Koster, M., Illyes, G., Zeller, H., and L. Sassman,
"Robots Exclusion Protocol", RFC 9309,
DOI 10.17487/RFC9309, September 2022,
<https://www.rfc-editor.org/info/rfc9309>.
[STRUCTURED-FIELD-VALUES]
Nottingham, M. and P. Kamp, "Structured Field Values for
HTTP", RFC 8941, DOI 10.17487/RFC8941, February 2021,
<https://www.rfc-editor.org/info/rfc8941>.
[TLS] Rescorla, E., "The Transport Layer Security (TLS) Protocol
Version 1.3", RFC 8446, DOI 10.17487/RFC8446, August 2018,
<https://www.rfc-editor.org/info/rfc8446>.
[WEBLINKING]
Nottingham, M., "Web Linking", RFC 8288,
DOI 10.17487/RFC8288, October 2017,
<https://www.rfc-editor.org/info/rfc8288>.
[X-DEPRECATION]
Saint-Andre, P., Crocker, D., and M. Nottingham,
"Deprecating the "X-" Prefix and Similar Constructs in
Application Protocols", BCP 178, RFC 6648,
DOI 10.17487/RFC6648, June 2012,
<https://www.rfc-editor.org/info/rfc6648>.
6.2. Informative References
[HTML-META]
"HTML Meta Element", 14 April 2026,
<https://html.spec.whatwg.org/multipage/semantics.html#the-meta-element>.
[KiB] "KibiByte", 14 October 2022.
Acknowledgments
TODO acknowledge.
Author's Address
Gary Illyes
Google LLC.
Brandschenkestrasse 110
CH-8002 Zürich
Switzerland
Email: garyillyes@google.com