How OpenAI’s bot crashed seven-person company’s website ‘like a DDoS attack’


On Saturday, Triplegangers CEO Oleksandr Tomchuk was alerted that his company’s e-commerce site was down. It looked like some kind of distributed denial-of-service attack.

He soon discovered that the cause was a bot from OpenAI that was relentlessly trying to scrape his entire, massive site.

“We have over 65,000 products, each product has a page,” Tomchuk told TechCrunch. “Each page has at least three pictures.”

OpenAI was sending “tens of thousands” of requests to the server in an attempt to download all of it: hundreds of thousands of photos, along with their detailed descriptions.

“OpenAI used 600 IPs to scrape the data, and we’re still analyzing the logs from last week, maybe more,” he said of the IP addresses the bot used in its attempt to consume his site.

“Their crawlers crashed our site,” he said. “It’s a DDoS attack.”

The Triplegangers website is its business. The seven-employee company has spent more than a decade assembling what it calls the largest database of “human digital doubles” on the web, meaning 3D image files scanned from actual human models.

It sells 3D object files, as well as photos (everything from hands to hair, skin, and whole bodies) to 3D artists, video game makers, and anyone else who needs to digitally recreate authentic human characteristics.

Tomchuk’s team, based in Ukraine but also licensed in the US out of Tampa, Florida, has a terms of service page on its site that forbids bots from taking its images without permission. But that alone did nothing. Websites must use a properly configured robots.txt file with tags that specifically tell OpenAI’s bot, GPTBot, to leave the site alone. (OpenAI also has a couple of other bots, ChatGPT-User and OAI-SearchBot, with their own tags, according to its information page on its crawlers.)

Robots.txt, also known as the Robots Exclusion Protocol, was created to tell search engines what not to crawl as they index the web. OpenAI says on its information page that it honors such files when they are configured with its own set of do-not-crawl tags, though it also warns that its bots can take up to 24 hours to recognize an updated robots.txt file.
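For site owners who want to check their own configuration, here is a minimal sketch of such a file and a way to test it with Python's standard `urllib.robotparser` module. The three user-agent names are the ones mentioned above; the rules, the `is_allowed` helper, and the `example.com` URL are illustrative assumptions, not Triplegangers' actual setup.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block the three OpenAI crawlers named in the
# article and leave every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/products/1"  # placeholder URL
    for agent in ("GPTBot", "ChatGPT-User", "OAI-SearchBot", "Googlebot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, page))
```

A check like this only confirms what the file says. Whether a given crawler actually obeys it is up to the crawler's operator.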

As Tomchuk has experienced, if a site isn’t using robots.txt properly, OpenAI and others take that to mean they can scrape to their hearts’ content. This is not an opt-in system.

To add insult to injury, not only was Triplegangers knocked offline by OpenAI’s bot during US business hours, but Tomchuk is expecting a jacked-up AWS bill thanks to all the CPU and downloading activity from the bot.

Robots.txt is also not a failsafe. AI companies comply with it voluntarily. Another AI startup, Perplexity, was pretty famously called out last summer by a Wired investigation when some evidence suggested that Perplexity was not honoring it.

Triplegangers product page
Each of these is a product, with a product page that includes multiple photos. Used by permission. Image Credits: Triplegangers

Not sure what was taken

By Wednesday, after days of the OpenAI bot returning, Triplegangers had a properly configured robots.txt file in place, and also a Cloudflare account set up to block GPTBot and several other bots Tomchuk discovered, such as Barkrowler (an SEO crawler) and Bytespider (a TikTok crawler). Tomchuk also hopes he has blocked crawlers from other AI model companies. As of Thursday morning, the site had not crashed, he said.

But Tomchuk still doesn’t have a reasonable way to find out exactly what OpenAI successfully took or to get that material removed. He found no way to contact OpenAI and ask. OpenAI did not respond to TechCrunch’s request for comment. And OpenAI has so far failed to deliver its long-promised opt-out tool, as TechCrunch recently reported.

This is a particularly difficult issue for Triplegangers. “We’re in a business where rights are a serious issue, because we’re scanning actual people,” he said. With laws like GDPR in Europe, “they can’t take anyone’s photo on the web and use it.”

The Triplegangers website is also a tasty find for AI crawlers. Multibillion-dollar-valued startups, like Scale AI, have been created where people painstakingly tag images to train AI. The Triplegangers site has photos tagged in detail: ethnicity, age, tattoos vs. scars, all body types, and more.

The irony is that the greed of the OpenAI bot is what alerted Triplegangers to how exposed it was. If the bot had scraped more gently, Tomchuk would never have known, he said.

“This is scary because there seems to be a loophole that these companies use to crawl the data by saying ‘you can opt out if you update your robots.txt with our tag,’” Tomchuk said, but that puts the onus on the business owner to figure out how to block them.

openai crawler log
Triplegangers’ server logs show how mercilessly an OpenAI bot accessed the site, from hundreds of IP addresses. Used by permission.

He wants other small online businesses to know that the only way to tell if an AI bot is taking copyrighted property from a website is to actively look. He is certainly not alone in being terrorized by them. Owners of other websites recently told Business Insider how OpenAI bots crashed their sites and ran up their AWS bills.

The problem grew in 2024. New research from digital advertising company DoubleVerify found that AI crawlers and scrapers caused an 86% increase in “general invalid traffic” in 2024, meaning traffic that doesn’t come from a real user.

However, “most sites remain unaware that they have been scraped by bots,” warns Tomchuk. “Now we have to monitor log activity daily to detect these bots.”
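The kind of daily log check Tomchuk describes can be approximated with a short script. The sketch below assumes an access log in the common "combined" format, where the user agent is the last quoted field on each line. The marker list uses the bot names from this article, while the `summarize_bots` helper, the sample lines, and their IP addresses are made up for illustration.

```python
import re
from collections import defaultdict

# Substrings that identify the crawlers named in the article.
BOT_MARKERS = ("GPTBot", "ChatGPT-User", "OAI-SearchBot", "Bytespider", "Barkrowler")

# Assumes "combined" log format: client IP first, user agent as the
# final quoted field on the line.
LOG_LINE = re.compile(r'^(?P<ip>\S+) .*"(?P<agent>[^"]*)"$')


def summarize_bots(lines):
    """Map each bot marker seen to (request count, distinct IP count)."""
    hits = defaultdict(int)
    ips = defaultdict(set)
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # skip lines that do not fit the assumed format
        agent = match.group("agent").lower()
        for marker in BOT_MARKERS:
            if marker.lower() in agent:
                hits[marker] += 1
                ips[marker].add(match.group("ip"))
                break
    return {marker: (hits[marker], len(ips[marker])) for marker in hits}


# Fabricated sample lines using documentation-only IP ranges.
SAMPLE = [
    '203.0.113.5 - - [11/Jan/2025:10:00:00 +0000] "GET /p/1 HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.6 - - [11/Jan/2025:10:00:01 +0000] "GET /p/2 HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.5 - - [11/Jan/2025:10:00:02 +0000] "GET /p/3 HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [11/Jan/2025:10:00:03 +0000] "GET / HTTP/1.1" 200 900 "-" "Mozilla/5.0 (Windows NT 10.0)"',
]

if __name__ == "__main__":
    for bot, (requests, addresses) in summarize_bots(SAMPLE).items():
        print(f"{bot}: {requests} requests from {addresses} IPs")
```

User-agent strings can be spoofed, so a count like this is a starting point for investigation rather than proof of who is behind the traffic.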

When you think about it, the whole model works like a mafia shakedown: AI bots will take what they want unless you have protection.

“They should be asking permission, not just scraping data,” Tomchuk said.
