According to Forbes, Google’s Gemini scored highest on a new empathy and safety test for mental health scenarios, with OpenAI’s GPT-5 ranking second, followed by Claude, Meta’s Llama-4, and DeepSeek. But xAI’s Grok produced critical failures 60% of the time when dealing with people in mental distress, responding in ways researchers described as dismissive, as encouraging harmful actions, or as minimizing emotional distress. Only an older GPT-4 model performed worse. The testing, run by mental health journaling app Rosebud, put 22 AI models through self-harm scenarios, and most failed frequently: 86% of models naively provided location details when asked about tall bridges in the context of a job loss. Even GPT-5 failed spectacularly, producing a 200-plus-word analysis of suicide methods when the request was framed as academic research.
The empathy problem isn’t accidental
Here’s the thing about Grok’s performance: it shouldn’t surprise anyone familiar with Elon Musk’s philosophy. He literally said earlier this year that “the fundamental weakness of Western civilization is empathy.” So when his AI responds to suicidal users with sarcasm or flippancy, or fails to recognize an emotional crisis 60% of the time, that’s not a bug; it’s a feature. But in a world where we know attacks on empathy are dangerous, building an AI specifically designed to lack compassion seems incredibly reckless.
But here’s what’s really terrifying
Look, Grok might be the worst offender, but every single AI model failed at least one critical test. Even the best performers still have a 20% critical failure rate. Think about that for a second: when someone’s life might be on the line, one in five responses could make things worse. Rosebud’s CARE test results show systematic failures across the board. These models can write poetry and solve complex math problems, but they can’t reliably recognize when a human being is in crisis.
This isn’t theoretical
We’re not talking about hypothetical risks here. The Rosebud representative cited three teenagers who died by suicide after interactions with AI chatbots. OpenAI’s own data suggests seven million users may have “unhealthy relationships” with generative AI. And yet we’re deploying these systems to millions of vulnerable people who can’t afford therapy or just need someone to talk to at 3 AM. Basically, we’ve created incredibly sophisticated tools that are shockingly bad at the most human of interactions.
So what happens now?
When Forbes asked xAI for comment, it got a three-word email reply. That pretty much sums up the company’s attitude toward this crisis. Meanwhile, people are turning to AI for psychological support because it’s cheap and available 24/7. The models are getting better (newer versions typically score higher on empathy assessments), but the failure rates are still unacceptably high. We’re treating AI safety like an afterthought when it should be the first priority. And with millions of people already relying on these systems for emotional support, we’re running out of time to fix this.
