Research evaluating emotional-support conversations found that several advanced AI chatbots can match or exceed the average human response when judged on perceived empathy. Using a structured benchmark called HEART, developed with contributions from Hippocratic AI scientist Kriti Aggarwal, evaluators compared paired responses within the same conversation and selected the one that felt more supportive. The benchmark measures multiple aspects of supportive dialogue across an entire exchange, including tone alignment, conversational awareness, and whether the response stays aligned with the user’s goals. In many cases, top AI systems were rated as supportive as, or more supportive than, the typical human reply, and human and AI evaluators agreed on the better response roughly 80% of the time.
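The pairwise judging scheme described above can be sketched in a few lines: for each conversation, two candidate responses are compared, each judge picks the one that feels more supportive, and agreement is the fraction of comparisons where the human and AI judges pick the same side. This is a minimal illustrative sketch; the function name and the sample data are invented for the example and are not drawn from the HEART benchmark itself.

```python
def agreement_rate(human_picks, ai_picks):
    """Fraction of paired comparisons where two judges prefer the same response."""
    if len(human_picks) != len(ai_picks):
        raise ValueError("pick lists must be the same length")
    matches = sum(h == a for h, a in zip(human_picks, ai_picks))
    return matches / len(human_picks)

# Illustrative data only: 'A' or 'B' marks which of the two paired
# responses each judge preferred for a given conversation.
human = ['A', 'B', 'A', 'A', 'B', 'A', 'B', 'A', 'A', 'B']
ai    = ['A', 'B', 'A', 'B', 'B', 'A', 'B', 'A', 'B', 'B']

print(f"agreement: {agreement_rate(human, ai):.0%}")  # → agreement: 80%
```

An agreement rate near 80%, as reported in the study, would mean the AI judge sided with the human judge on roughly four out of five paired comparisons.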
The HEART benchmark also evaluated Hippocratic AI’s Polaris system, which ranked among the highest-performing models while producing responses in under one second. Despite these results, humans still performed better when conversations became tense or resistant, where subtle reframing and tone shifts were needed. Researchers emphasized that safety remains critical, as emotionally supportive responses can still cause harm if AI systems cross clinical boundaries or offer overly confident guidance. The framework is now being used to study whether highly rated responses actually improve how supported users feel over time, while also exploring voice-based interactions and cultural differences in supportive communication.