For 30 years, the San Francisco-based non-profit Archive.org has archived the public web through its Wayback Machine, preserving more than a trillion web pages that journalists, researchers, historians and lawyers rely on to view deleted or altered online content.
Now that archive faces an existential threat, and it comes from the very media that depend on it. Research by the Nieman Foundation at Harvard found that at least 241 news outlets across nine countries are blocking Archive.org’s crawlers. The list includes the Guardian, the New York Times, Le Monde and USA Today.
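How does an outlet shut the archive out? In practice, such blocks are typically written into a site’s robots.txt file, the long-standing convention that tells crawlers which parts of a site they may fetch. A minimal sketch, assuming the site targets ia_archiver and archive.org_bot, the user-agent tokens commonly associated with the Internet Archive’s crawling (which exact strings any given outlet blocks is an assumption here):

    # robots.txt (illustrative): deny the Internet Archive's crawlers
    # access to the entire site. The user-agent tokens below are the ones
    # commonly associated with the Wayback Machine; which strings a
    # particular outlet actually targets is an assumption.
    User-agent: ia_archiver
    Disallow: /

    User-agent: archive.org_bot
    Disallow: /

Notably, the protocol is voluntary: it keeps out crawlers that choose to honor it, as the archive historically has, while doing nothing against scrapers that ignore it.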
The paradox is stark. USA Today recently used the Wayback Machine to expose efforts by U.S. immigration authorities to withhold information about detention policy; yet the company has blocked the archive’s access to its own content. The driving fear among publishers is artificial intelligence: they worry that AI firms such as OpenAI and Google will harvest journalistic content via the archive to train large language models without permission or compensation. A New York Times spokesperson said the paper’s content on the Internet Archive is being used by AI companies “in violation of copyright law to directly compete with us.”
The Archive has itself faced large-scale automated scraping. Mark Graham, director of the Wayback Machine, told Wired that several companies intermittently generated tens of thousands of requests per second, at times overloading Archive.org’s servers. The organization’s open-access mission makes it reluctant to block such crawlers, and that is precisely what alarms publishers: content preserved in the archive remains reachable by AI scrapers, so outlets have begun cutting off the archive itself.
Critics warn the consequences are serious. The Electronic Frontier Foundation compared blocking the Internet Archive to forbidding libraries from keeping copies of newspapers. More than 100 journalists have signed an open letter urging support for the archive, noting that in a digital media landscape marked by link rot, consolidation and cost-cutting, reporters often rely on the Wayback Machine to recover pages that would otherwise vanish. Without routine web preservation, large swathes of recent journalistic history could be lost.
Archive.org is negotiating with publishers to restore access, Graham said, but the outcome is uncertain. Observers like Martin Fehrensen, founder of socialmediawatchblog.de, argue that web archiving is the closest thing to a chain of custody for the open web; losing it would damage Wikipedia’s sourcing, platform accountability research, and the availability of digital evidence that can be used in court. Fehrensen recommends a publisher dialogue that creates a clear technical separation between archiving and AI training, and ultimately a special legal status for web archives. He also argues web archiving should be treated as public infrastructure rather than hinging on a single NGO.
The Archive has endured other crises in recent years: a September 2024 cyberattack that exposed data from 31 million user accounts, and the loss of the Hachette v. Internet Archive case, which forced removal of more than 500,000 e-books from its lending program and left the organization facing substantial damage claims. Those setbacks were damaging, but the current wave of publisher-imposed blocks poses a different, arguably deeper threat because it stems from coordinated corporate choices that undermine the Wayback Machine’s core mission of comprehensive web preservation.
It remains unclear how the conflict between publishers protecting their content from AI training and the public interest in preserving the historical web will be resolved. For now, the trend toward locking down parts of the public web raises hard questions about society’s ability to understand and document what happens online.
This article was originally published in German.