AI chatbot gives wrong cancer treatment recommendations

In a recent article published in JAMA Oncology, researchers evaluate whether chatbots built on large language models (LLMs), a form of artificial intelligence (AI), can provide accurate and reliable cancer treatment recommendations.

Study: Using AI chatbots for cancer treatment insights. Image Credit: greenbutterfly/


LLMs have shown promise in coding clinical data and making diagnostic recommendations, and some of these systems have recently taken and passed the United States Medical Licensing Examination (USMLE). Similarly, OpenAI's ChatGPT application, which is part of the Generative Pre-trained Transformer (GPT) family of models, has been used to identify potential research topics and to update doctors, nurses, and other healthcare professionals on recent developments in their respective fields.

LLMs can also mimic human dialogue and provide quick, detailed, and coherent answers to questions. However, in some cases, LLMs may provide less reliable information, which can mislead people who often use AI to educate themselves. Even when these systems are trained on reliable, high-quality data, AI remains vulnerable to bias, which limits its suitability for medical applications.

The researchers anticipate that general users may turn to an LLM chatbot with medical questions related to cancer. A chatbot that gives a seemingly correct but actually wrong or inaccurate answer about cancer diagnosis or treatment could therefore mislead the user and generate and amplify misinformation.

About the study

In the present study, researchers evaluate the performance of an LLM chatbot in providing treatment recommendations for prostate, lung, and breast cancer in accordance with National Comprehensive Cancer Network (NCCN) guidelines.

Because the LLM chatbot's knowledge cutoff date was September 2021, its treatment recommendations were assessed against the 2021 NCCN guidelines.

Four zero-shot prompt templates were developed and applied to each of 26 cancer diagnosis descriptions, yielding a final total of 104 prompts. These prompts were then submitted as input to GPT-3.5 via the ChatGPT interface.
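The prompt construction described above can be illustrated with a minimal sketch. The template wording and diagnosis descriptions below are hypothetical placeholders, not the study's actual text; only the counts (4 templates, 26 descriptions, 104 prompts) come from the article.

```python
from itertools import product

# Hypothetical placeholder templates; the study's real wording is not reproduced here.
TEMPLATES = [
    "Template A: {diagnosis}",
    "Template B: {diagnosis}",
    "Template C: {diagnosis}",
    "Template D: {diagnosis}",
]

# Hypothetical stand-ins for the 26 cancer diagnosis descriptions.
DIAGNOSES = [f"diagnosis description {i}" for i in range(1, 27)]


def build_prompts(templates, diagnoses):
    """Fill every template with every diagnosis description."""
    return [t.format(diagnosis=d) for d, t in product(diagnoses, templates)]


prompts = build_prompts(TEMPLATES, DIAGNOSES)
print(len(prompts))  # 104
```

Crossing every template with every description is what makes it possible to compare how the same diagnosis is handled under different phrasings.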

The study team included four board-certified oncologists, three of whom assessed the chatbot’s results for agreement with the NCCN 2021 guidelines based on five scoring criteria developed by the researchers. Majority rule was used to determine the final score.

The fourth oncologist helped the other three resolve disagreements, which mainly arose when the output of the LLM chatbot was unclear. For example, the LLM sometimes did not specify which treatments to combine for a specific type of cancer.
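As a rough illustration of the scoring procedure, the sketch below applies majority rule to three annotator scores and falls back to an adjudicator when no majority exists. The function name, the score values, and the exact way the fourth oncologist stepped in are assumptions made for illustration; the article does not describe the mechanics in this detail.

```python
from collections import Counter


def majority_score(scores, adjudicate=None):
    """Return the score chosen by more than half of the annotators.

    If no score has a strict majority, defer to the hypothetical
    `adjudicate` callable (standing in for the fourth oncologist).
    """
    top_score, top_count = Counter(scores).most_common(1)[0]
    if top_count > len(scores) / 2:
        return top_score
    if adjudicate is None:
        raise ValueError("no majority and no adjudicator supplied")
    return adjudicate(scores)


# Two of three annotators agree, so majority rule decides.
print(majority_score(["yes", "yes", "no"]))  # yes

# All three disagree, so the adjudicator's choice is used.
print(majority_score(["a", "b", "c"], adjudicate=lambda s: "b"))  # b
```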

Study results

A total of 104 unique prompts, each evaluated against five scoring criteria, yielded 520 scores, of which all three annotators agreed on 322, or 61.9%. Furthermore, the LLM chatbot provided at least one treatment recommendation for 98% of the prompts.
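The figures reported here can be checked with simple arithmetic. The sketch below uses only numbers stated in the article; the helper name `pct` is an illustrative choice.

```python
def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


n_prompts = 104      # prompts submitted to the chatbot
n_criteria = 5       # scoring criteria per prompt
full_agreement = 322  # scores on which all three annotators agreed
with_recommendation = 102  # responses containing a treatment recommendation

total_scores = n_prompts * n_criteria
print(total_scores)                          # 520
print(pct(full_agreement, total_scores))     # 61.9
print(pct(with_recommendation, n_prompts))   # 98.1
```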

All responses with a treatment recommendation included at least one treatment concordant with NCCN guidelines. However, 35 of the 102 responses also recommended one or more non-concordant treatments. For 34.6% of the cancer diagnosis descriptions, all four prompt templates received the same scores on all five scoring criteria.

Over 12% of chatbot responses included treatments that are not recommended by the NCCN. These responses, which the researchers described as “hallucinations,” primarily involved immunotherapy, localized treatment of advanced disease, or other targeted therapies.

The LLM chatbot's recommendations also varied based on how the researchers phrased their questions. In some cases, the chatbot produced unclear results, leading to disagreements among the three annotators.

Other disagreements arose from differing interpretations of the NCCN guidelines. Together, these disagreements highlight the difficulty of reliably interpreting LLM output, especially descriptive output.


The LLM chatbot evaluated in this study mixed incorrect cancer treatment recommendations with correct ones, making the errors difficult to detect, even for experts. Overall, 33.33% of its treatment recommendations were at least partially non-concordant with NCCN guidelines.

The study results indicate that the LLM chatbot performed below average in providing reliable and accurate cancer treatment recommendations.

Given the increasingly widespread use of artificial intelligence, it is imperative that healthcare professionals educate their patients about the misinformation this technology can produce. These findings also underscore the importance of federal regulation of artificial intelligence and other technologies that can harm the general public through their inherent limitations and inappropriate use.

Journal reference:

  • Chen, S., Kann, BH, Foote, MB, et al. (2023). Using AI chatbots for cancer treatment insights. JAMA Oncology. doi:10.1001/jamaoncol.2023.2954
