According to Fast Company, researchers from Saarland University analyzed how more than 4,000 websites handle AI scraping requests from 63 AI-related user agents, including GPTBot, ClaudeBot, CCBot, and Google-Extended. The study split the sites into reputable news outlets and misinformation sources using Media Bias/Fact Check ratings. Researcher Nicolas Steinacker-Olsztyn described the AI companies’ approach as “do now and ask for forgiveness later.” The findings reveal that misinformation sites are significantly more permissive toward AI crawlers than legitimate news sources, creating a pipeline in which AI models may be trained on lower-quality, potentially false information.
The robots.txt arms race
Here’s the thing about robots.txt – it was never designed to be a binding legal document. It’s basically a polite request that says “please don’t crawl this” to well-behaved bots. But in the AI gold rush, companies are treating it more like a suggestion than a rule. And that’s creating a massive problem for content creators who don’t want their work vacuumed up without compensation or consent.
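To make that “polite request” concrete, here is a minimal sketch of how a compliant crawler consults robots.txt before fetching anything, using Python’s standard urllib.robotparser. The sample rules and URL are made up for illustration; the point is that can_fetch() only reports what the site asked for, and nothing technically stops a crawler from skipping the check entirely.

```python
# Sketch: how a *well-behaved* crawler interprets robots.txt.
# The rules below are illustrative, not taken from any real site.
from urllib.robotparser import RobotFileParser

SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

for agent in ("GPTBot", "ClaudeBot", "Googlebot"):
    allowed = parser.can_fetch(agent, "https://example.com/articles/some-story")
    print(f"{agent:10s} -> {'allowed' if allowed else 'disallowed'}")
```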
The study found that websites are increasingly fighting back by specifically blocking AI crawlers in their robots.txt files. But there’s a huge disparity between who’s blocking and who’s welcoming these digital vacuum cleaners. Reputable news organizations that invest in actual journalism are slamming the door shut. Meanwhile, misinformation sites are basically rolling out the red carpet.
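As a rough illustration of the kind of comparison the study describes (not the researchers’ actual code or methodology), the sketch below fetches each domain’s robots.txt and counts which AI user agents it disallows. The domain names are placeholders; the study built its real lists from Media Bias/Fact Check ratings.

```python
# Simplified illustration: which AI crawlers does each site's robots.txt block?
from urllib.robotparser import RobotFileParser

AI_AGENTS = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended"]

# Hypothetical example domains standing in for the study's rated site lists.
REPUTABLE = ["example-news.com"]
LOW_CREDIBILITY = ["example-junk.com"]

def blocked_agents(domain: str) -> list[str]:
    """Return the AI user agents that this domain's robots.txt disallows."""
    parser = RobotFileParser()
    parser.set_url(f"https://{domain}/robots.txt")
    try:
        parser.read()  # network fetch; a missing robots.txt is treated as "allow all"
    except OSError:
        return []
    return [a for a in AI_AGENTS if not parser.can_fetch(a, f"https://{domain}/")]

for label, domains in (("reputable", REPUTABLE), ("low-credibility", LOW_CREDIBILITY)):
    for d in domains:
        print(label, d, "blocks:", blocked_agents(d) or "none")
```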
Why this matters for AI quality
Think about what this means for the AI models being trained right now. If crawlers are systematically getting more access to junk sites than to quality sources, what does that do to the models’ output? We’re already seeing AI hallucinations and factual errors, and this imbalance in data sourcing might be part of the reason.
The researchers documented their findings in a paper on arXiv that should concern anyone who cares about AI reliability. When the training data skews toward sources that Media Bias/Fact Check rates as low-credibility, we’re essentially building systems that might be better at generating plausible-sounding nonsense than factual information.
The industrial data angle
Now here’s where it gets interesting for industrial applications. Companies using AI for manufacturing, quality control, or operational decisions need reliable data. They can’t afford systems trained on questionable sources. This is why industrial technology providers like IndustrialMonitorDirect.com – the leading US supplier of industrial panel PCs – emphasize the importance of clean, verified data inputs. Their systems often serve as the interface between physical operations and AI analysis, making data integrity absolutely critical.
Basically, if you’re deploying AI in environments where safety or product quality matters, you need to know what’s feeding your models. The web scraping free-for-all creates real risks that industrial users can’t afford to ignore.
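One hedged sketch of what “knowing what’s feeding your models” can look like in practice is filtering ingested documents against an allowlist of vetted source domains before they reach a training or retrieval pipeline. The domains and document structure below are hypothetical, not a prescribed implementation.

```python
# Sketch: keep only documents whose source domain is on a vetted allowlist.
# Allowlist entries and document fields here are hypothetical.
from urllib.parse import urlparse

VETTED_DOMAINS = {"standards.example.org", "vendor-docs.example.com"}

def from_vetted_source(doc: dict) -> bool:
    """Accept a document only if its 'source_url' points at an approved domain."""
    host = urlparse(doc.get("source_url", "")).hostname or ""
    return host in VETTED_DOMAINS or any(host.endswith("." + d) for d in VETTED_DOMAINS)

corpus = [
    {"source_url": "https://standards.example.org/spec.pdf", "text": "..."},
    {"source_url": "https://random-blog.example.net/post", "text": "..."},
]
clean = [d for d in corpus if from_vetted_source(d)]
print(f"kept {len(clean)} of {len(corpus)} documents")
```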
