Licensing the Web: The Future of Crawlers
Two emerging standards promise to reshape how bots access content.
Yesterday, one of my team members shared a link on our Slack to an Ars Technica article about a new standard released on Wednesday called Really Simple Licensing. It was developed by Eckart Walther, a co-creator of RSS; Doug Leeds, former CEO of Ask.com; and Geraud Boyer, a former engineering director at Twitter. It is backed by a group of organizations calling itself the RSL Collective, which includes content publishers and search companies.
“Pay-per-output? AI firms blindsided by beefed-up robots.txt instructions” on Ars Technica
The standard itself builds on existing protocols such as OAuth, XML, and Schema.org, and explains how you can grant a license (paid or free) to crawlers through robots.txt files, HTTP headers, or terms embedded directly in ebooks, media, and other content files. The main parts of the new standard are:
The license file. An XML-based document that specifies the terms under which digital assets can be used. It can describe permitted and prohibited uses (such as AI training, search indexing, or summarization), payment models (free, subscription, per-crawl, etc.), and standard legal terms (e.g., Creative Commons or MIT licenses).
The license server. If one is referenced, the license server issues and enforces licenses (even free ones) and, when content is encrypted, provides the keys needed to access it. RSL License Servers implement the Open Licensing Protocol (OLP), which handles license management, billing, and verification.
Authentication and encryption. RSL requires crawlers to present a valid license. If a crawler fails to do so, the server responds with an HTTP 401 Unauthorized and guidance on how to obtain a license. Additionally, content can be distributed in an encrypted form, with keys only available through the license server.
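To make the license file concrete, here is a hedged sketch of what one might look like. The element names, attributes, and URLs below are my own illustrative assumptions, not quoted from the RSL specification; a site would reference a file like this from its robots.txt and serve it as XML:

```xml
<!-- Hypothetical license.xml: element and attribute names are
     illustrative guesses at the RSL vocabulary, not the real schema. -->
<rsl xmlns="https://rslstandard.org/rsl">
  <content url="https://example.com/articles/">
    <license>
      <!-- Permitted and prohibited uses -->
      <permits type="usage">search</permits>
      <prohibits type="usage">ai-training</prohibits>
      <!-- Payment model: free, subscription, per-crawl, etc. -->
      <payment type="subscription">
        <terms>https://example.com/subscribe</terms>
      </payment>
    </license>
  </content>
</rsl>
```

The key design idea, as the standard describes it, is that one machine-readable document can carry both the usage rules and the business terms, so a crawler can discover what it is allowed to do before it fetches anything.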
RSL moves beyond being a purely descriptive framework and lays a foundation for enforceable licensing on the web. Enforcement, however, still depends on partners. The RSL Collective is currently collaborating with Fastly, a content delivery network, so that websites can grant or deny crawler access based on the site’s stated license.
However, RSL is hardly alone in attempting to solve (and monetize!) the crawler traffic that is saturating the internet. There is, of course, Cloudflare’s pay per crawl model.
“Introducing pay per crawl: Enabling content owners to charge AI crawlers for access” on Cloudflare’s blog
Cloudflare’s solution is essentially an experiment in reviving the long-dormant HTTP 402 “Payment Required” status code. Instead of giving crawlers either free rein or no access at all, site owners can configure rules that charge a flat fee every time a crawler requests a page. Publishers can still allow some crawlers through for free, block others entirely, or monetize access.
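The negotiation can be sketched as client-side logic. The sketch below is my own illustration, not Cloudflare’s actual API: the `crawler-price` and `crawler-max-price` header names are assumptions based on the blog post’s description, and `fetch` is a stand-in for a real HTTP client.

```python
# Sketch of a crawler negotiating pay-per-crawl. Header names and the
# fetch() stub are illustrative assumptions, not Cloudflare's actual API.

def crawl(url, max_price_usd, fetch):
    """Attempt a free fetch; on HTTP 402, retry while declaring a price ceiling."""
    status, headers, body = fetch(url, headers={})
    if status != 402:
        return body  # free access, or blocked outright

    # The 402 response advertises the per-request price.
    price = float(headers.get("crawler-price", "inf"))
    if price > max_price_usd:
        return None  # too expensive; skip this page

    # Retry, signalling the maximum we are willing to pay.
    status, headers, body = fetch(
        url, headers={"crawler-max-price": f"USD {max_price_usd:.2f}"}
    )
    return body if status == 200 else None
```

The appeal of reusing HTTP 402 is that the price negotiation happens in-band, per request, with no side-channel contract needed between publisher and crawler.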
To make this work securely, Cloudflare ties directly into the IETF’s Web Bot Auth (webbotauth) proposals. Web Bot Auth is about creating a standard way to cryptographically authenticate non-browser clients like crawlers, AI agents, or web archivers, so publishers and content creators can be sure they’re dealing with the real thing, not a spoofed bot. Cloudflare’s system requires crawlers to register keys and sign requests in line with these IETF proposals, ensuring that only verified bots can participate in the pay per crawl ecosystem.
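For a flavor of what a signed crawler request looks like under these proposals, here is a sketch based on the Web Bot Auth drafts, which build on HTTP Message Signatures (RFC 9421). The header values are truncated placeholders, and the exact covered components may differ in practice:

```http
GET /article HTTP/1.1
Host: example.com
Signature-Agent: "https://crawler.example.com"
Signature-Input: sig1=("@authority" "signature-agent");created=1735689600;expires=1735690200;keyid="ba3e64==";tag="web-bot-auth"
Signature: sig1=:SIGNATURE_BYTES_BASE64:
```

The publisher (or a CDN acting on its behalf) verifies the signature against the crawler’s registered public key, so a bot cannot simply spoof another operator’s User-Agent string.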
RSL and Web Bot Auth ultimately differ in scope, but they are complementary. RSL couples rules with enforcement via partner organizations, allowing sites to define licensing terms, require license tokens, and use encryption to ensure only authorized use. Web Bot Auth, by contrast, focuses on the identity of the crawler, supplying the cryptographic backbone to confirm that a crawler is truly who it claims to be. The benefit of Web Bot Auth is, of course, that its partner organization — Cloudflare — is already starting to offer the solution.
All of this innovation around mitigating web crawlers is fascinating, but the real question is which of these approaches big AI companies will actually adopt. If the history of the internet teaches us anything, it’s that experimentation eventually coalesces around a standardized approach to doing things. The open question for me is which standard will prevail, and which approach will best serve libraries and open infrastructure in the long run.
I’d love to hear your thoughts: which of these approaches do you think best benefits libraries and open infrastructure? Share your opinion in my subscriber chat.


RSL and Web Bot Auth both seem interesting and I'm looking forward to seeing how they develop. But, as someone whose site has been taken offline by over-zealous scrapers more than once now, I have little hope that AI companies will be good enough citizens to actually respect something like RSL unless folks are willing to blockade their content with logins and/or paywalls. Robots.txt is now pretty much routinely ignored IME, and even Cloudflare's bot protection struggles to keep up with the strategies employed by scraper bots. We've resorted to rate-limiting and invoking JS challenges for specific ASNs to keep things running on our end, though I'm sure we'll have to change those tactics as scrapers change theirs.
Interesting! I am curious where W3C’s work on the Open Digital Rights Language (ODRL) fits into this picture. I think ODRL is more about metadata descriptions and less about automation or enforcement. The real challenge seems to be coupling rules with enforcement (as you point out). But setting up license servers and related infrastructure could be a hurdle for many.
Your point about adoption is important: which of these approaches will the big AI companies actually take on board? This reminds me a bit of how independent mail servers used to be common (I maintained one myself years ago), but over time most people shifted to a few large providers. I wonder if we’ll see a similar centralisation dynamic here.