© 2024 KUAF
NPR Affiliate since 1985

Artificial intelligence web crawlers are running amok

AILSA CHANG, HOST:

On every website, there's a message that contains a hidden stop sign. It's intended for bots, not humans, a way of saying, do not scan this part of the website. The artificial intelligence industry is ignoring these stop signs, and understanding why sheds light on how AI companies are turning the web upside down. NPR's Bobby Allyn reports.

BOBBY ALLYN, BYLINE: The story starts in the mid-'90s, the days of dial-up internet. The web was slow, and maintaining a site was expensive, especially when bots scanned your whole website, as they often did to create a copy for, say, askjeeves.com. Overwhelmed with requests from automated bots, web servers started to crash, and internet bills spiked. So developers came up with a solution: a plain text file, invisible to ordinary visitors, placed at the root of a website and intended for bots. It became known as robots.txt.

COLLEEN CHIEN: And a robots.txt file then puts a sign in front of that website to say, if you're a robot, you know, sort of this visitor, you need to abide by the rules here. This is, you know, where you are or aren't welcome. This is what you can and can't do.

ALLYN: That's Colleen Chien of UC Berkeley Law School, who teaches classes on how AI is changing the web. Over the years, the robots.txt file became something of a social contract for the entire internet. Tech giants like Google and Facebook adopted it. And even though it had no legal teeth, it was respected. Say there's a corporate or administrative page you don't want showing up on Google; put it in the file. It helped hold the entire internet together, says former Google engineer Jacob Hoffman-Andrews.
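[Editor's note: The file described above is a short list of rules. A minimal sketch of what one might look like, using hypothetical paths for illustration:]

```
# Hypothetical robots.txt served at https://example.com/robots.txt
# "User-agent" names which bot a rule applies to; "*" means all of them.
User-agent: *
Disallow: /admin/
Disallow: /internal/
```

A crawler that honors the convention reads this file before fetching anything else and skips the listed paths; nothing in the protocol enforces it.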

JACOB HOFFMAN-ANDREWS: That system has worked remarkably well for 30 years.

ALLYN: Till now. In response to data-hungry AI companies gobbling up every corner of the internet, websites have started to put AI companies in this file, a way of telling ChatGPT, stop, do not scrape here. But here's the problem. The AI industry is ignoring it. Just recently, Amazon Web Services announced it is investigating popular AI search engine Perplexity over this. Officials from Perplexity wouldn't talk to me for this story, but in a statement, the company said, quote, "robots.txt is not a legal framework." That might sound like a, OK, who cares kind of thing at first, but Jacob Hoffman-Andrews says breaking this norm could change the entire internet.
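[Editor's note: The voluntary nature of the rule is the whole point here. A compliant crawler asks the file for permission before fetching; an ignoring one simply never performs the check. A minimal sketch using Python's standard-library robots.txt parser, with a hypothetical rule set that names OpenAI's GPTBot crawler:]

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt telling one AI crawler to stay out
# while leaving the rest of the site open to everyone else.
rules = """
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# A well-behaved bot makes this check before every fetch.
print(parser.can_fetch("GPTBot", "https://example.com/articles/"))        # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/articles/"))  # True
```

The check is purely advisory: a crawler that skips the `can_fetch` call faces no technical barrier, which is why the dispute in this story is about norms, not mechanisms.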

HOFFMAN-ANDREWS: There's a chance for that whole kind of open-web-based order to break down. The websites that do exist could retreat behind logins and become private communities. The concept of the internet as the world's biggest library would start to fail.

ALLYN: And if that happened on a wide scale, navigating the web could become really annoying. You probably have noticed this already - more and more websites requiring accounts and logins. Sometimes that's about paying for content, but increasingly, it's about fighting back against AI companies. As they explode norms in search of more data, the AI firms are getting richer. But those being mined for content aren't getting much in return. That's why something seemingly small like ignoring a stop sign for bots has become a rallying cry in Silicon Valley against the whole AI industry, says legal scholar Colleen Chien.

CHIEN: As these models become more and more powerful, the question of, well, who gets to sort of keep the riches that are generated by these amazing new technologies is increasingly important.

ALLYN: It's that question that's tapping into angst shared by so many creatives and website publishers right now. When, say, Google scrapes your website, you get, in return, web traffic. But when an AI tool scrapes your website, you're not really getting much in return, which is why the robots.txt file has become a way of saying, no thanks, do not do that here. With the AI industry scraping away anyway, more and more corners of the internet may soon become harder to access for everyone. Bobby Allyn, NPR News. Transcript provided by NPR, Copyright NPR.

NPR transcripts are created on a rush deadline by an NPR contractor. This text may not be in its final form and may be updated or revised in the future. Accuracy and availability may vary. The authoritative record of NPR’s programming is the audio record.

Bobby Allyn is a business reporter at NPR based in San Francisco. He covers technology and how Silicon Valley's largest companies are transforming how we live and reshaping society.