For nearly three decades, the nonprofit Internet Archive has preserved the public web; its Wayback Machine saves snapshots that journalists, researchers, historians and lawyers use to recover deleted or changed pages. Now that archive faces an existential threat from many of the news organizations that depend on it.
Research by Harvard’s Nieman Foundation found that at least 241 news outlets across nine countries have blocked the Internet Archive’s crawlers. The list includes high-profile titles such as The Guardian, The New York Times, Le Monde and USA Today. The result is a sharp paradox: USA Today recently relied on the Wayback Machine to reveal how U.S. immigration authorities tried to withhold detention-policy documents, yet the paper has restricted the archive’s access to its own site.
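Such blocks are typically expressed in a site’s robots.txt file, the standard mechanism for telling well-behaved crawlers what they may fetch. The excerpt below is a purely illustrative sketch, not taken from any named publisher; the user-agent strings are the names commonly associated with the Internet Archive’s crawlers and are assumed here for the example.

    # Illustrative robots.txt excerpt (hypothetical configuration)
    # Block the crawlers commonly attributed to the Internet Archive
    User-agent: ia_archiver
    Disallow: /

    User-agent: archive.org_bot
    Disallow: /

    # Leave all other crawlers unrestricted
    User-agent: *
    Disallow:

A publisher that later reverses course can simply delete the two blocking entries, but pages the crawler missed while the block was in place are not retroactively archived.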
Publishers say their moves are driven by fear of artificial intelligence companies harvesting journalistic content to train large language models without permission or compensation. A New York Times spokesperson told reporters the paper’s material on the Internet Archive is being used by AI firms “in violation of copyright law to directly compete with us.”
The Internet Archive has also endured heavy automated scraping from commercial actors. Mark Graham, director of the Wayback Machine, told Wired that some companies intermittently generated tens of thousands of requests per second, sometimes overloading Archive.org’s servers. Because the archive’s mission emphasizes open access, it has been reluctant to block crawlers wholesale—but that openness has prompted publishers to impose access restrictions themselves.
Critics warn the consequences are grave. The Electronic Frontier Foundation compared blocking the Internet Archive to forbidding libraries from keeping newspaper copies. More than 100 journalists have signed an open letter urging support for the archive, noting that, in an era of link rot, consolidation and newsroom cutbacks, reporters often rely on the Wayback Machine to recover pages that would otherwise vanish. Without routine web preservation, large swaths of recent journalistic history could disappear.
The Internet Archive says it is negotiating with publishers to restore access, but the outcome is uncertain. Observers such as Martin Fehrensen argue that web archiving functions as a de facto chain of custody for the open web; losing it would weaken Wikipedia’s sourcing, platform-accountability research and the availability of digital evidence used in court. Fehrensen recommends dialogue with publishers to establish a clear technical separation between archiving and AI training, a special legal status for web archives, and treating web preservation as public infrastructure rather than leaving it to a single NGO.
The archive has faced other recent crises—a September 2024 cyberattack that exposed data from 31 million user accounts and the court loss in Hachette v. Internet Archive, which forced removal of more than 500,000 e-books from its lending program and left the organization facing significant damage claims. Those setbacks harmed the nonprofit, but the current wave of publisher-imposed blocks poses a different threat: coordinated corporate choices that undermine the Wayback Machine’s core mission of comprehensive web preservation.
How to reconcile publishers’ efforts to protect their work from AI training with the public interest in preserving the historical web remains unresolved. For now, the trend toward locking down portions of the public web raises stark questions about society’s ability to document and understand what happens online.
This article was originally published in German.