JAMA Network Open. Evaluating Artificial Intelligence Responses to Public Health Questions. Ayers JW, Zhu Z, Poliak A, et al. June 7, 2023
Artificial intelligence (AI) assistants have the potential to transform public health by offering accurate and actionable information to the general public. Unlike web-based knowledge resources (eg, Google Search) that return numerous results and require the searcher to synthesize information, AI assistants are designed to receive complex questions and provide specific answers. However, AI assistants often fail to recognize and respond to basic health questions.1,2
ChatGPT is part of a new generation of AI assistants built on advancements in large language models that generate nearly human-quality responses for a wide range of tasks. Although studies3 have focused on using ChatGPT as a supporting resource for healthcare professionals, it is unclear how well ChatGPT handles general health inquiries from the lay public. In this cross-sectional study, we evaluated ChatGPT responses to public health questions.
This study did not require institutional review board review per 45 CFR § 46 and followed the STROBE reporting guideline. Our study replicates research by Miner et al1 and Noble et al2 on other AI assistants so that our results are comparable to these benchmarks. We evaluated ChatGPT responses to 23 questions grouped into 4 categories (addiction, interpersonal violence, mental health, and physical health). Questions used a common help-seeking structure (eg, “I am smoking; can you help me quit?”). Each question was put to a fresh ChatGPT session on December 19, 2022, thereby avoiding bias from previous conversations and enabling reproducibility of our results. The corresponding responses were saved.
Two study authors (J.W.A. and Z.Z.), blinded to each other’s responses, evaluated the ChatGPT responses as follows: (1) Was the question responded to? (2) Was the response evidence-based? (3) Did the response refer the user to an appropriate resource? Disagreements were resolved through deliberation, and Cohen κ was used to measure interrater reliability. The percentage corresponding to each outcome (overall and by category) was calculated with bootstrapped 95% CIs. The word count and reading level of ChatGPT responses were assessed using the Automated Readability Index.4 Analyses were performed with R statistical software version 4.2.2.
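To make the analysis concrete, the sketch below shows how an outcome percentage with a bootstrapped 95% CI and Cohen κ could be computed. It is an illustrative example written in Python rather than the R code used in the study, and the label vectors are hypothetical placeholders, not study data.

import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(seed=0)

# Hypothetical consensus labels for one outcome (1 = evidence based, 0 = not),
# mirroring the reported 21 of 23 evidence-based responses.
labels = np.array([1] * 21 + [0] * 2)

# Percentile bootstrap of the proportion (10,000 resamples).
boot_means = [rng.choice(labels, size=labels.size, replace=True).mean()
              for _ in range(10_000)]
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"{labels.mean():.0%} (95% CI, {ci_low:.0%}-{ci_high:.0%})")

# Interrater reliability between two raters on the same items
# (hypothetical vectors; the study compared 92 labels per rater).
rater1 = np.array([1, 1, 0, 1, 1, 0, 1, 1])
rater2 = np.array([1, 1, 0, 1, 0, 0, 1, 1])
print(f"Cohen kappa = {cohen_kappa_score(rater1, rater2):.2f}")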
ChatGPT responses were a median (IQR) of 225 (183-274) words, and reading levels ranged from 9th grade to 16th grade. Example responses are shown in Table 1.
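For reference, the Automated Readability Index used to score reading level is computed from character, word, and sentence counts, with the result approximating a US grade level:

\mathrm{ARI} = 4.71\left(\frac{\text{characters}}{\text{words}}\right) + 0.5\left(\frac{\text{words}}{\text{sentences}}\right) - 21.43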
ChatGPT recognized and responded to all 23 questions in 4 public health domains. Evaluators disagreed on 2 of the 92 labels (κ = 0.94). Of the 23 responses, 21 (91%; 95% CI, 71%-98%) were determined to be evidence based. For instance, the response to a query about quitting smoking echoed steps from the US Centers for Disease Control and Prevention guide to smoking cessation, such as setting a quit date, using nicotine replacement therapy, and monitoring cravings (Table 2).
Only 5 responses (22%; 95% CI, 8%-44%) made referrals to specific resources (2 of 14 queries related to addiction, 2 of 3 for interpersonal violence, 1 of 3 for mental health, and 0 of 3 for physical health). The resources included Alcoholics Anonymous, The National Suicide Prevention Hotline, The National Domestic Violence Hotline, The National Sexual Assault Hotline, The National Child Abuse Hotline, and the Substance Abuse and Mental Health Services Administration National Helpline.
ChatGPT consistently provided evidence-based answers to public health questions, although it primarily offered advice rather than referrals. ChatGPT outperformed benchmark evaluations of other AI assistants from 2017 and 2020.1,2 Given the same addiction questions, Amazon Alexa, Apple Siri, Google Assistant, Microsoft’s Cortana, and Samsung’s Bixby collectively recognized 5% of the questions and made 1 referral, compared with 91% recognition and 2 referrals for ChatGPT.2
Although search engines sometimes highlight specific search results relevant to health, many resources remain underpromoted.5 AI assistants may have a greater responsibility to provide actionable information, given their single-response design. Partnerships between public health agencies and AI companies must be established to promote public health resources with demonstrated effectiveness. For instance, public health agencies could disseminate a database of recommended resources, especially because AI companies may lack the subject matter expertise to make these recommendations, and these resources could be incorporated into fine-tuning responses to public health questions. New regulations, such as limiting liability for AI companies that implement these recommendations, could encourage AI companies to adopt government-recommended resources, since these companies may not be protected by 47 US Code § 230.6
Limitations of this study include relying on an abridged sample of questions whose standardized language may not reflect how the public would seek help (ie, asking follow-up questions). Additionally, ChatGPT responses are probabilistic and the underlying model is continually refined; hence, responses may vary across users and over time.
Article Information
Accepted for Publication: April 20, 2023.
Published: June 7, 2023. doi:10.1001/jamanetworkopen.2023.17517
Open Access: This is an open access article distributed under the terms of the CC-BY License. © 2023 Ayers JW et al. JAMA Network Open.
Corresponding Author: John W. Ayers, PhD, MA, Qualcomm Institute, University of California San Diego, 3390 Voigt Dr, Room 311, La Jolla, San Diego, CA 92121 (ayers.john.w@gmail.com).
Author Contributions: Dr Ayers had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
Concept and design: Ayers, Poliak, Leas, Dredze, Smith.
Acquisition, analysis, or interpretation of data: Ayers, Zhu, Poliak, Leas, Hogarth, Smith.
Drafting of the manuscript: Ayers, Zhu, Poliak, Dredze, Hogarth, Smith.
Critical revision of the manuscript for important intellectual content: Ayers, Poliak, Leas, Hogarth, Smith.
Statistical analysis: Zhu, Leas.
Obtained funding: Smith.
Administrative, technical, or material support: Poliak, Leas, Smith.
Supervision: Dredze, Smith.
Conflict of Interest Disclosures: Dr Ayers reported owning equity in Health Watcher and Good Analytics outside the submitted work. Dr Leas reported receiving consulting fees from Good Analytics outside the submitted work. Dr Dredze reported receiving personal fees from Bloomberg LP and Good Analytics outside the submitted work. Dr Hogarth reported being an advisor to and owning equity in LifeLink. Dr Smith reported receiving grants from the National Institutes of Health; receiving consulting fees from Pharma Holdings, Bayer Pharmaceuticals, Evidera, Linear Therapies, and Vx Biosciences; and owning stock options in Model Medicines and FluxErgy outside the submitted work. No other disclosures were reported.
Funding/Support: This work was supported by the Burroughs Wellcome Fund, UCSD PREPARE Institute, and National Institutes of Health. Dr Leas acknowledges salary support from grant K01DA054303 from the National Institute on Drug Abuse.
Role of the Funder/Sponsor: The funders had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.
Data Sharing Statement: See the Supplement.