TLDR: A recent study presented at The Menopause Society 2025 Annual Meeting found that AI platforms, including ChatGPT and Google Gemini, answer menopause and hormone therapy questions with low accuracy. ChatGPT 3.5, the best performer on patient-level questions, still achieved only 55% accuracy, raising significant concerns about the reliability of AI in medical education for both patients and clinicians.
Artificial intelligence (AI) platforms demonstrate concerningly low accuracy when answering questions about menopause and hormone therapy, according to new research presented at The Menopause Society 2025 Annual Meeting in Orlando, Florida. The study, which evaluated four large language models (LLMs), found that even the best-performing AI tools struggled to provide consistently correct information.
Among the platforms tested, ChatGPT 3.5 showed the highest accuracy on typical patient questions, correctly answering just over half (55%) of them. Its performance dropped significantly on clinician-level questions, however, with only a third (33%) of responses being accurate. The paid version, ChatGPT 4.0, fared worse on patient queries, answering only 40% correctly, with an overall inaccuracy rate of 33%. Google’s Gemini platform performed the most poorly, with less than a third (30%) of its patient-level answers deemed accurate; on clinician questions, Gemini produced as many incomplete answers (40%) as correct ones. Another platform, OpenEvidence, gave incorrect answers in more than half (53%) of its responses. Across all platforms, incomplete responses ranged from 7% to 40%.
Dr. Jana Karam, a postdoctoral research fellow at the Mayo Clinic in Jacksonville, Florida, emphasized the importance of these findings. “Generative artificial intelligence has rapidly advanced and is now explored in healthcare as a resource for both patient and clinician education,” Dr. Karam stated. “As large language models are increasingly used to answer medical queries, evaluating their performance in providing accurate and reliable information is essential.”
Mindy Goldman, MD, chief clinical officer at Midi Health and clinical professor emeritus at the University of California, San Francisco, who was not involved in the study, expressed little surprise at the results. “Although most everyone in medicine now uses AI in some contexts, my understanding has been that one cannot always be sure of the accuracy of responses, and clinicians should always check the references,” Dr. Goldman commented. She recounted her own experiment, asking an LLM about its accuracy, to which it responded that “generative AI’s accuracy is highly inconsistent and varies drastically by domain, task complexity, and the specific model used.” The AI’s response further warned that it could generate “plausible but potentially inaccurate information” and even “hallucination,” referring to the creation of entirely false data.
The study underscores a critical need for caution. For healthcare providers, Dr. Goldman advises, “this study highlights the need to check the references and additional sources, such as The Menopause Society and ACOG [American College of Obstetricians and Gynecologists], as well as doing their own PubMed searches before assuming something is true.” The research also assessed the readability of patient-level responses using metrics such as word count and the Flesch Reading Ease score, finding that some AI-generated content may be too complex for patients to easily understand.
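For context, the Flesch Reading Ease score referenced in the study is a standard readability formula based on average sentence length and average syllables per word; higher scores indicate easier reading, with roughly 60–70 corresponding to plain English at an 8th–9th grade level. The sketch below is a minimal illustration of that standard formula, not the study’s actual methodology, and the vowel-group syllable counter is a rough heuristic assumed here for demonstration.

```python
import re

def count_syllables(word: str) -> int:
    """Rough syllable estimate: count runs of consecutive vowels.
    A heuristic for illustration, not a dictionary-accurate count."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text: str) -> float:
    """Standard Flesch Reading Ease formula:
    206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words).
    Higher scores mean easier text."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words)

# A dense, clinical-style sentence scores lower (harder to read)
# than a short, plain-language one.
print(round(flesch_reading_ease(
    "Menopausal hormone therapy requires individualized risk stratification."), 1))
print(round(flesch_reading_ease(
    "Hormone therapy is not right for everyone."), 1))
```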


