Artificial intelligence web crawlers are running amok

By Bobby Allyn

Published July 5, 2024 at 3:14 PM CDT

AILSA CHANG, HOST:

On every website, there's a message that contains a hidden stop sign. It's intended for bots, not humans, a way of saying, do not scan this part of the website. The artificial intelligence industry is ignoring these stop signs, and understanding why sheds light on how AI companies are turning the web upside down. NPR's Bobby Allyn reports.

BOBBY ALLYN, BYLINE: The story starts in the mid-'90s, the days of dial-up internet. The web was slow, and maintaining a site was expensive, especially when bots scanned your whole website, as they often did to create a copy for, say, askjeeves.com. Overwhelmed with requests from automated bots, web servers started to crash, and internet bills spiked. So developers came up with a solution, a hidden plain text file in the back-end software code of every website. It was intended for bots. It became known as robots.txt.

COLLEEN CHIEN: And a robot.txt file then puts a sign in front of that website to say, if you're a robot, you know, sort of this visitor, you need to abide by the rules here. This is, you know, where you are or aren't welcome. This is what you can and can't do.

ALLYN: That's Colleen Chien of UC Berkeley Law School, who teaches classes on how AI is changing the web. Over the years, the robots.txt page became something of a social contract for the entire internet. Tech giants like Google and Facebook adopted it. And even though it had no legal teeth, it was respected. Say there's a corporate or administrative page you don't want showing up on Google, put it in the file. It helped hold the entire internet together, says former Google engineer Jacob Hoffman-Andrews.
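For readers who want to see the mechanics, here is a minimal sketch of how a well-behaved bot might check the file before visiting a page. It uses Python's standard urllib.robotparser module. The robots.txt contents, the /admin/ path, the example.com URLs and the bot name ExampleBot are all assumptions made up for illustration; they do not come from the report.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents (illustration only): every bot is asked
# to stay out of the /admin/ section of the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Load robots.txt rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ExampleBot" is a made-up crawler name.
admin_allowed = parser.can_fetch("ExampleBot", "https://example.com/admin/settings")
home_allowed = parser.can_fetch("ExampleBot", "https://example.com/")

print("admin page allowed:", admin_allowed)
print("home page allowed:", home_allowed)
```

The check only has an effect if the bot runs it and honors the answer, which is the "no legal teeth" point made above.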

JACOB HOFFMAN-ANDREWS: That system has remarkably worked well for 30 years.

ALLYN: Till now. In response to data-hungry AI companies gobbling up every corner of the internet, websites have started to put AI companies in this file, a way of telling ChatGPT, stop, do not scrape here. But here's the problem. The AI industry is ignoring it. Just recently, Amazon Web Services announced it is investigating popular AI search engine Perplexity over this. Officials from Perplexity wouldn't talk to me for the story, but in a statement, the company said, quote, "robots.txt is not a legal framework." That might sound like a, OK, who cares kind of thing at first, but Jacob Hoffman-Andrews says breaking this norm could change the entire internet.
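As a rough sketch of what putting an AI company in the file can look like, the example below names one crawler and turns it away while leaving the site open to everyone else. The crawler names ExampleAIBot and ExampleSearchBot and the example.com URL are hypothetical, chosen only for illustration; real crawlers publish their own user-agent names.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt (illustration only): one named AI crawler is
# asked to stay out entirely; all other bots are welcome.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())


def polite_crawler_may_fetch(user_agent: str, url: str) -> bool:
    """Return what a crawler that honors robots.txt would decide."""
    return rules.can_fetch(user_agent, url)


ARTICLE_URL = "https://example.com/articles/1"
ai_ok = polite_crawler_may_fetch("ExampleAIBot", ARTICLE_URL)
search_ok = polite_crawler_may_fetch("ExampleSearchBot", ARTICLE_URL)

print("AI crawler allowed:", ai_ok)
print("search crawler allowed:", search_ok)
```

Nothing in the file stops a crawler that never performs this check. The server will still answer its requests unless the site adds some other barrier, such as a login.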

HOFFMAN-ANDREWS: There's a chance for that whole kind of open-web-based order to break down. The websites that do exist could retreat behind logins and become private communities. The concept of the internet as the world's biggest library would start to fail.

ALLYN: And if that happened on a wide scale, navigating the web could become really annoying. You probably have noticed this already - more and more websites requiring accounts and logins. Sometimes that's about paying for content, but increasingly, it's about fighting back against AI companies. As they explode norms in search of more data, the AI firms are getting richer. But those being mined for content aren't getting much in return. That's why something seemingly small like ignoring a stop sign for bots has become a rallying cry in Silicon Valley against the whole AI industry, says legal scholar Colleen Chien.

CHIEN: These models become more and more powerful, the question of well, who gets to sort of keep the riches that are generated by these amazing new technologies is increasingly important.

ALLYN: It's that question that's tapping into angst shared by so many creatives and website publishers right now. When, say, Google scrapes your website, you get, in return, web traffic. But when an AI tool scrapes your website, you're not really getting much in return, which is why the robots.txt file has become a way of saying, no thanks, do not do that here. With the AI industry scraping away anyway, more and more corners of the internet may soon become harder to access for everyone. Bobby Allyn, NPR News. Transcript provided by NPR, Copyright NPR.
