The Cost of Open by Default in the AI Era
Can We Protect Donor Materials from Generative AI?
In my recent post, Openness Has Limits, I argued that the “open by default” mantra cultural heritage organizations have espoused is buckling under the pressure of the generative AI boom. The feedback made it clear that many of you wanted to dig deeper into the why.
As a technologist, my perspective on openness is rooted in the daily realities of managing systems and protecting the cultural heritage artifacts entrusted to libraries and museums. While the open web has brought incredible progress, we are now facing four distinct pressures that make me question where we draw the line:
The Intellectual Harvest of Cultural Heritage: AI companies are no longer just looking for content; they are harvesting the intellectual frameworks we’ve spent decades building. As I’ve noted before, these companies are not only crawling authors’ content. They also stand to profit from the curation cultural heritage organizations have carried out over thousands of years, and from the frameworks we have built to describe, relate, and explain that content.
The Infrastructure Tax: Scraping bots aren’t always polite. Ironically, while I was drafting this post, my team was battling bots from ByteDance servers crawling our catalog; they ultimately caused a full application outage. I’ve said before that these outages constitute a denial-of-service attack against our human users. Anyone running cultural heritage infrastructure on the web is essentially being forced to pay an infrastructure tax (computing costs plus staff time) to keep our digital doors open.
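Defenses against impolite crawlers usually come down to some form of per-client throttling. As a rough illustration (not our actual stack, and the rates and capacity below are invented for the sketch), here is a minimal token-bucket rate limiter in Python:

```python
import time

class TokenBucket:
    """Minimal token bucket: refill `rate` tokens/sec up to `capacity`."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill based on elapsed time, then spend one token per request.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per client address; a burst of six back-to-back requests
# from the same client gets five allowed and the sixth denied.
buckets: dict[str, TokenBucket] = {}

def check(client_ip: str) -> bool:
    bucket = buckets.setdefault(client_ip, TokenBucket(rate=1.0, capacity=5))
    return bucket.allow()
```

In production this logic typically lives at the reverse proxy or CDN rather than in application code, but the principle (and the tension: legitimate human bursts versus bot floods) is the same.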
The Harvesting of Physical Collections: The Google Books Library Project has recently been reinvigorated, and it has been reaching out to large research libraries to identify books it hasn’t scanned yet. My understanding is that some libraries have successfully negotiated one-time payments from Google, but in my opinion that approach is short-sighted. Libraries either forget or don’t understand that these companies don’t just “read” the data once. They revisit it every time they build a new model, effectively getting a perpetual license to our collections for a one-time entry fee.
The Erosion of Trust: Our entire model of stewardship relies on trust. When a donor, writer, or artist hands over their life’s work, they expect libraries and museums to serve as protectors of their works. Currently, we cannot provide them with any true guarantees that their materials won’t be ingested, synthesized, and commodified by AI companies. If we can’t offer guarantees, the pipeline for materials may dry up, and the cultures we seek to document and preserve may be poorer for it.
This last point—the fragile relationship between the donor, the library or museum, and the technology—is exactly what I want to dig into today.
Contractual Barriers to Donor Deposits
I recently reviewed a donor contract that really put the fourth pressure above (the erosion of trust) into focus for me.
Below is a summarized and reworded version of the language I reviewed, based on information from the Authors Guild, the Society of Authors, and the Australian Society of Authors websites.
Organizations must safeguard materials against their use for:
Generative AI Development: Training, fine-tuning, or otherwise developing generative artificial intelligence models.
Mimicry of Authors: Creating outputs such as chatbots, conversational agents, or other simulated responses that emulate, mimic, or synthesize the author’s literary voice, style, or persona.
Derivative Works: Using the materials to enable the generation of new expressive works that are derived from or substantially similar to the materials.
I should also say that the contract includes positive language allowing us to use AI for internal purposes, such as OCR and metadata creation. But as a technologist, I came away from my first read concerned that the language creates a significant implementation challenge.
How do technologists adhere to these terms in a world where the internet is essentially an open buffet of content?
The End of Public Availability?
To ensure absolute compliance with the terms outlined above, I see only two measures my team could implement to safeguard the materials. And guess what: neither of them feels open.
Removing the content from the public internet entirely. I’ve had extensive conversations with my team about the application that would serve up the material in question, and they are confident that our technical blocks prevent the vast majority of automated crawlers from downloading our materials. I’ve even verified that our content is successfully hidden from the Wayback Machine (which, fun fact, uses data from Common Crawl, which I’ve mentioned before in relation to AI). However, “automated” is the keyword there. If an AI company is motivated enough, it can simply pay humans to build targeted scrapers (there’s a subreddit for that) or maybe vibe code something to bypass these blocks. If the content is on the open web, a determined actor with enough resources will eventually get it.
Requiring a legal waiver from every single user. This measure is a direct response to the shifting legal ground around “meaningful consent.” For years, a case called hiQ Labs, Inc. v. LinkedIn Corporation, 17-16783 (via CourtListener.com), served as the north star for scrapers; it essentially held that content on the public web could be harvested without legal repercussions. But that legacy has shifted since the AI boom began; courts have started to emphasize that institutions must take “sufficient” and “affirmative steps” to signal that bots are unwelcome. A simple robots.txt file (historically used as a “please don’t” sign) can be easily ignored and is not a legally binding barrier. Without an explicit, user-facing legal waiver that forces a visitor to agree to terms, AI companies can continue to hide behind the ambiguity of public access to claim they have a website’s consent.
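The advisory nature of robots.txt is easy to demonstrate: Python’s standard library ships a parser that compliant crawlers use to check the file before fetching, but nothing in the protocol compels that check. The paths and bot name below are illustrative:

```python
from urllib import robotparser

# A robots.txt like the one a collections site might publish,
# asking all crawlers to stay out of the digitized-collections path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /collections/
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A *compliant* crawler consults the file before each fetch...
print(rp.can_fetch("SomeBot", "https://example.org/collections/item1"))  # False
print(rp.can_fetch("SomeBot", "https://example.org/about"))              # True
# ...but an impolite crawler simply skips this step. The file is a
# request, not a barrier: there is no enforcement mechanism at all.
```

That gap between “please don’t” and “you can’t” is exactly why courts are now asking what counts as an affirmative step.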
This feels like a step backward for cultural heritage institutions. We spent decades trying to tear down the walled gardens around content and criticizing publishers for locking away knowledge. Now, the predatory nature of AI scraping is forcing us to build our own walls.
But maybe there is a middle path? I’ve previously written about emerging standards such as Really Simple Licensing (RSL) and Web Bot Auth. These protocols aim to move beyond the binary of open or blocked by creating enforceable, machine-actionable ways to prohibit bots from using content on the open web, or to make them pay for it. If one of these standards prevails, we might eventually be able to offer openness with intention, allowing authenticated, responsible crawlers while enforcing the strict prohibitions our donors require.
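The core idea behind crawler authentication can be sketched in a few lines. To be clear, this is a conceptual stand-in, not either protocol: Web Bot Auth uses public-key HTTP message signatures, and the registry, bot name, and shared-secret HMAC below are invented to keep the sketch dependency-free. The shape of the decision, though, is the same: an unregistered or unverifiable bot gets refused rather than served.

```python
import hashlib
import hmac

# Hypothetical registry of crawlers that have agreed to our terms.
# (A real deployment would hold public keys, not shared secrets.)
TRUSTED_CRAWLERS = {"good-bot": b"shared-secret"}

def verify_crawler(bot_id: str, payload: bytes, signature: str) -> bool:
    """Serve a request only if the bot is registered and its signature verifies."""
    secret = TRUSTED_CRAWLERS.get(bot_id)
    if secret is None:
        return False  # unknown bot: refuse, or route to a paywall/license offer
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```

The point is the inversion of the default: instead of serving everyone and hoping bots behave, the server serves only crawlers that can prove who they are and, by registering, have accepted the license terms.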
Until then, we are essentially waiting for the final word from the landmark lawsuit The New York Times Company v. Microsoft Corporation, 1:23-cv-11195 (via CourtListener.com). The Times is suing OpenAI/Microsoft, claiming that millions of its articles were used without permission to train AI models, essentially “free-riding” (a common term in the open source universe) on its massive investment in journalism to create a product that now competes directly for its audience’s attention without driving views (and the ad revenue those views generate).
As of early 2026, the discovery phase has already proven that “regurgitation” (where an AI outputs copyrighted content word-for-word) is a documented reality of generative AI chatbots. This challenges the industry’s defense that AI training is a “transformative” fair use that doesn’t harm the original market. Until the legal dust settles, though, the only way to adhere to the terms I described above may be to keep donated materials under lock and key.
Technical Sufficiency vs. Legal Reality
When I meet with General Counsel next month, I’ll ask: What does “sufficient protection” mean for the university? How much risk is the institution willing to take with our current infrastructure and legal agreements adhering to these terms? What must my team implement to meet the legal thresholds still being established? If a donor is highly litigious, will our “best effort” be enough? Or are we entering an era where high-value digital collections must be siloed behind click-through agreements or, worse, stored away, never to see the digital light of day?
As always, I’m curious to hear your thoughts. Are we being too protective, or are we the only ones taking the stewardship part of our job seriously?



If I had cleaned up my email before writing this post, I would have found this Washington Post article through the Data & Society (an independent non-profit research organization) newsletter I receive. I would have included a link in the section about harvesting books.
https://wapo.st/4a6aaIo (gift link)
I'm personally not hopeful about things like RSL. Realistically, if you want to prevent something from being scraped, you either put it behind *good* access control or you don't put it online at all. Obviously that means significantly limiting public access.
ETA: Also, if you're preventing the Wayback Machine from archiving your content, I hope you have strong long-term sustainability plans and procedures.