Automated Large-Scale Analysis of User Logs for Insight Discovery: What Are People Asking AI About Their Health?

By Gerrita Bikker

Posted on May 15, 2026

Health questions no longer arrive only as search terms typed into a browser after a doctor’s visit. They appear late at night on phones, in follow-up chats about symptoms, and in quiet conversations where users may not know how much context matters. That shift is already measurable: 32% of adults nationally have turned to AI chatbots in the past year for health information or advice.

Supriya Vijay is a Senior Software Engineer whose research sits at the intersection of mental health and AI safety. With seven publications and over 50 citations, including peer-reviewed work on computational analysis of psychosis symptoms and expert systems for mental health assessment, she studies how large language models behave when users bring them their most vulnerable questions. We spoke with Vijay about the state of the field and where the gaps remain.

The Questions People Ask Are Not Always the Questions They Mean

“A health conversation may look ordinary at first, but the risk can emerge only after several turns,” says Vijay. “If we judge one prompt at a time, we can miss the point where uncertainty becomes dependency or distress becomes a safety issue.”

A Nature Health study analyzing 617,827 health conversations with an AI chatbot found that the share of emotional well-being queries rose from 3.3% in the morning to 5.2% at night, exactly the kind of pattern that single-turn testing misses. The implication for the industry is clear: health conversations carry latent risk that only surfaces over multiple exchanges.

This challenge is a primary focus of a peer-reviewed CHI 2026 study on context-seeking in health conversations, which Vijay co-authored. The research demonstrated that AI health assistants can be designed to proactively seek context, asking targeted clarifying questions rather than answering immediately. Users in the study highly valued this multi-turn clarification and found the resulting information significantly more relevant and tailored.

Vijay’s research has focused on this problem, applying classification, clustering, and filtering methods to conversational patterns in order to model how mental health crisis signals emerge in interactive AI. One pattern she describes: a conversation that begins as a routine query about managing stress or sleep, but slowly reveals deeper distress or emotional dependency over multiple turns. The system cannot simply treat each prompt as an isolated question. It needs more than a better answer: it has to recognize the emotional shape and direction of the entire conversation to guide the user safely.
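To make the difference between judging a single prompt and judging a whole conversation concrete, here is a minimal Python sketch. It is not Vijay's method or any production system: the cue list, the weights, the thresholds, and the names `turn_score` and `assess_conversation` are all hypothetical, chosen only for illustration. A real system would rely on trained, clinically reviewed classifiers rather than keywords.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical cue weights, for illustration only. A real system would use a
# trained, clinically reviewed classifier instead of a keyword list.
CUE_WEIGHTS = {
    "stressed": 0.2,
    "can't sleep": 0.2,
    "hopeless": 0.6,
    "only one i can talk to": 0.5,
    "no point": 0.7,
}


def turn_score(text: str) -> float:
    """Score one user turn as the sum of matched cue weights, capped at 1.0."""
    lowered = text.lower()
    total = sum(weight for cue, weight in CUE_WEIGHTS.items() if cue in lowered)
    return min(total, 1.0)


@dataclass
class TrajectoryReport:
    scores: list[float]  # per-turn scores, in conversation order
    peak: float          # highest single-turn score
    rising: bool         # True if the later half scores clearly higher
    flagged: bool        # True if either the peak or the trend triggers


def assess_conversation(
    user_turns: list[str],
    turn_threshold: float = 0.8,   # assumed cutoff for a single alarming turn
    trend_threshold: float = 0.3,  # assumed cutoff for a rising trend
) -> TrajectoryReport:
    """Flag a conversation on its peak turn or on its overall direction."""
    scores = [turn_score(turn) for turn in user_turns]
    peak = max(scores, default=0.0)

    # Compare the mean score of the later half with that of the earlier half.
    half = len(scores) // 2
    if half == 0:
        rising = False
    else:
        early = sum(scores[:half]) / half
        late = sum(scores[half:]) / (len(scores) - half)
        rising = (late - early) >= trend_threshold

    return TrajectoryReport(scores, peak, rising, peak >= turn_threshold or rising)


# An invented conversation in which no single turn crosses the per-turn cutoff.
example_turns = [
    "How can I manage stress before exams?",
    "I've been stressed and can't sleep for weeks.",
    "Honestly I feel hopeless lately.",
    "You're the only one I can talk to.",
]
report = assess_conversation(example_turns)
print(report.scores, report.peak, report.rising, report.flagged)
```

In this invented example, every turn stays below the per-turn cutoff, yet the conversation is flagged because the later turns score clearly higher than the earlier ones. That is the gap a prompt-by-prompt check would leave open.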

Nearly Half of Mental Health Chatbots Fail Basic Safety Tests

The stakes are not theoretical. A 2025 evaluation published in Scientific Reports tested 29 AI-powered mental health chatbot agents in simulated suicidal-risk scenarios and rated nearly half of them (48.28%) inadequate. Not a single tested chatbot achieved an ‘adequate’ safety rating across all scenarios; the remainder scored only as marginal. The lesson is specific: safety testing across the industry has to measure refusal, escalation, and context across the full exchange, not whether a single response sounds calm.

This is a central theme in Vijay’s research. She has studied how automated evaluation methods can assess whether a model response is safe across the arc of a conversation, not just in a single turn. The challenge, she says, is building evaluators that understand trajectory.

“The evaluator has to understand the conversation’s direction,” she says. “A response that seems gentle in isolation can be unsafe if it reinforces the wrong belief or keeps a vulnerable person inside the chat instead of guiding them toward help.”

Clinical Judgment Cannot Be an Afterthought

More than one billion people worldwide live with mental health disorders, and AI assistants are increasingly part of how people seek information, reassurance, and next steps. That scale makes safety evaluation an industry-wide obligation, especially when a conversation involves dependency, crisis language, or a user under 18.

The gap Vijay points to is specific: most release processes for large language models do not incorporate clinical expertise early enough. Safety benchmarks in mental health need input from clinicians and policy experts, not just engineers, to distinguish genuinely unsafe responses from ones that merely sound careful.

This perspective is grounded in her academic research background in clinical informatics and computational health. Her prior work focused on bridging clinical standards and automated models, exploring how computational systems can accurately map and measure indicators of human distress. That background informs how she views today’s safety evaluations: evaluation rubrics for mental health cannot be built in a vacuum by software engineers alone.

The Gap Between Detection and Improvement

A benchmark matters only if it changes model behavior. The 988 Suicide & Crisis Lifeline operates through a national network of more than 200 local crisis contact centers and received more than eight million contacts from help-seekers in 2025. That volume shows how many people need timely direction to human support, and why the gap between detecting unsafe AI behavior and actually fixing it cannot remain open.

Vijay’s research in this area has focused on how post-training methods can measurably reduce unsafe model responses in mental health scenarios. The broader question for the industry, she says, is whether teams can move from anecdotal review to evidence-based comparison.
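One way to picture the move from anecdotal review to evidence-based comparison is to count unsafe responses on a fixed scenario set before and after a post-training change, then report the difference with an uncertainty range. The Python sketch below does that with a simple normal-approximation (Wald) interval. The function name `compare_unsafe_rates` and the counts in the example are placeholders rather than results from any study, and the sketch treats the two runs as independent samples, which is a simplification when the same scenarios are reused.

```python
import math


def compare_unsafe_rates(
    unsafe_before: int,
    n_before: int,
    unsafe_after: int,
    n_after: int,
    z: float = 1.96,  # z-value for an approximate 95% interval
) -> tuple[float, tuple[float, float]]:
    """Return the drop in unsafe-response rate and an approximate interval.

    Uses the Wald interval for a difference of two independent proportions.
    """
    rate_before = unsafe_before / n_before
    rate_after = unsafe_after / n_after
    drop = rate_before - rate_after

    # Standard error of the difference between two independent proportions.
    std_err = math.sqrt(
        rate_before * (1 - rate_before) / n_before
        + rate_after * (1 - rate_after) / n_after
    )
    return drop, (drop - z * std_err, drop + z * std_err)


# Placeholder counts for illustration only, not measured results.
drop, (low, high) = compare_unsafe_rates(
    unsafe_before=60, n_before=400, unsafe_after=30, n_after=400
)
print(f"drop in unsafe rate: {drop:.3f} (approx. 95% interval {low:.3f} to {high:.3f})")
```

If the whole interval sits above zero, as it does with these placeholder counts, the reduction is unlikely to be noise. An interval that straddles zero would mean the team cannot yet claim the change made the model safer.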

“The goal is not to make the model sound comforting,” Vijay says. “The goal is to make it safer under pressure, especially when the user’s need is not stated in obvious words.”

Evidence Has to Come Before Scale

The AI in mental health market was valued at approximately $1.5 billion in 2024 and is projected to reach nearly $12 billion by 2034. As health assistants move from information retrieval into guided conversations, they will face, and should face, more scrutiny. The user may be a teenager, an adult in crisis, or someone asking on behalf of a family member.

Industry investment reflects the urgency. Google.org has publicly committed $30 million over three years to support crisis hotlines and mental health organizations, part of a broader push to ensure AI safety work is matched by real-world support infrastructure.

Vijay sees the fundamental challenge as a sequencing problem. Safety evidence has to be built before deployment reaches full scale, not assembled in response to failures.

“Distressed users do not arrive in tidy test cases,” Vijay says. “The evidence has to be ready before the volume is.”