According to Business Insider, in a podcast published on Sunday, June 30, 2024, Surge AI CEO Edwin Chen voiced a major concern: the industry is optimizing for “AI slop.” Chen, who founded the AI-training-data startup in 2020 after stints at Twitter, Google, and Meta, argued that companies are teaching models to “chase dopamine instead of truth” by catering to superficial leaderboards like LMSYS’s Chatbot Arena. He specifically called out the practice of users skimming responses for two seconds before voting, comparing it to optimizing for tabloid readers. His company runs the DataAnnotation platform with one million freelancers and counts Anthropic as a customer, competing directly with firms like Scale AI.
The Broken Scoreboard
Here’s the thing: Chen’s rant hits a nerve because it points to a massive incentive problem. When your model’s perceived value is judged by a leaderboard where people spend two seconds picking the flashiest answer, what are you actually building? You’re building a performer, not a thinker. It’s like training a student to ace a multiple-choice test by memorizing answer patterns, not by understanding the underlying concepts. And if sales teams are getting grilled on their LMArena ranking, as Chen says, then the financial pressure to game this shallow system is immense. So we get models that are “more fun to talk to,” as another critic put it, but not necessarily more useful for, you know, curing cancer.
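To see how little signal each of those two-second votes carries, here is a minimal sketch of Elo-style pairwise updates, the family of rating schemes behind arena leaderboards (Chatbot Arena fits a related Bradley-Terry model; the model names and K-factor below are illustrative, not anything from the article):

```python
# Minimal sketch: how snap pairwise votes become leaderboard ratings.
# Elo-style updates; names and K-factor are illustrative assumptions.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under an Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Shift ratings toward whichever answer got the two-second vote."""
    e_w = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - e_w)
    ratings[loser] -= k * (1.0 - e_w)

ratings = {"model_a": 1000.0, "model_b": 1000.0}
# Ten skim-votes for the flashier answer move the board.
for _ in range(10):
    update(ratings, winner="model_a", loser="model_b")
print(ratings)  # model_a opens up a large rating gap on style alone
```

Notice what the update never asks: whether the preferred answer was correct. The only input is which response won the skim.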
Not Just One Cranky CEO
This isn’t just one CEO’s grumble. The article notes that in March, ZeroPath CEO Dean Valentine wrote that recent AI model progress “feels mostly like bullshit,” finding no significant improvement on practical tasks like finding software bugs despite new model claims. Even researchers at the European Commission published a paper in February asking “Can We Trust AI Benchmarks?” Their conclusion? Benchmarks are shaped by commercial and competitive dynamics that sacrifice broader societal concerns for state-of-the-art bragging rights. When the flashy headline is all that matters, substance gets left in the dust. And let’s be honest, how many of us have actually fact-checked a long, persuasive AI response instead of just being impressed by its confidence?
The Gaming of the Game
And of course, where there’s a scoreboard, there’s cheating. The piece mentions Meta’s April release of Llama 4 models, which immediately faced accusations of gaming a benchmark. LMArena itself called out Meta for submitting a “customized” version tailored for its test format. This is the natural endpoint of a system built on shaky ground. If the metric is flawed, optimizing for it becomes a weird, circular game that has less and less to do with real-world utility. It’s a race, but are we even running on the right track?
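To make the circularity concrete, here is a toy simulation, my construction rather than anything from the article or Meta’s actual setup: assume the judge is a two-second skimmer who tends to prefer the longer answer, and watch what padding alone does to the win rate.

```python
# Toy Goodhart's-law demo (assumed judge model, not a real benchmark):
# tune a knob the judge likes (verbosity) and the arena win rate climbs
# even though the underlying content is unchanged.
import random

random.seed(0)

def skim_judge(len_a: int, len_b: int) -> str:
    """A two-second judge: prefers answers in proportion to length."""
    p_a = len_a / (len_a + len_b)
    return "a" if random.random() < p_a else "b"

def win_rate(len_a: int, len_b: int, trials: int = 10_000) -> float:
    wins = sum(skim_judge(len_a, len_b) == "a" for _ in range(trials))
    return wins / trials

base, tuned = 100, 300          # same content, padded for the judge
print(f"base vs base:  {win_rate(base, base):.2f}")   # ~0.50
print(f"tuned vs base: {win_rate(tuned, base):.2f}")  # ~0.75, no new facts
```

Any knob the judge responds to works the same way. Once the metric becomes the target, it stops measuring what it was meant to measure.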
Chasing Dopamine vs. Truth
So what’s the fix? Chen’s worry is fundamentally about misaligned goals. We’re rewarding models for being engaging and slick, not for being correct and profound. Solving poverty or understanding the universe doesn’t give you a quick dopamine hit; it’s hard, meticulous, often boring work. The current benchmark culture doesn’t measure that. Until the incentives change—until labs, investors, and customers start valuing measurable, substantive impact over leaderboard position—we’ll keep getting better parrots, not better scientists. The question is, how much “slop” will we tolerate before we demand a better meal?
