A new study from MIT researchers reveals that large language models (LLMs) used for medical treatment recommendations can be swayed by nonclinical factors in patient messages, such as typos, extra spaces, missing gender markers, or informal and dramatic language. These stylistic quirks can lead the models to mistakenly advise patients to self-manage serious health conditions instead of seeking medical care. The inconsistencies caused by nonclinical language become even more pronounced in conversational settings where an LLM interacts with a patient, a common use case for patient-facing chatbots.

Published ahead of the ACM Conference on Fairness, Accountability, and Transparency, the research shows a 7-9% increase in self-management recommendations when patient messages are altered with such variations. The effect is particularly pronounced for female patients, with models making about 7% more errors and disproportionately advising women to stay home, even when gender cues are absent from the clinical context.

“This is strong evidence that models must be audited before use in health care, where they’re already deployed,” said Marzyeh Ghassemi, an MIT associate professor and the study’s senior author. “LLMs take nonclinical information into account in ways we didn’t previously understand.”

Lead author Abinitha Gourabathina, an MIT graduate student, noted that LLMs are often trained on medical exam questions but then used for tasks like assessing clinical severity, where their limitations are less well studied. “There’s still so much we don’t know about LLMs,” she said.
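In practice, an audit of the kind Ghassemi describes means perturbing patient messages with clinically irrelevant edits and checking whether the model’s triage advice changes. The following is a minimal sketch of that idea, assuming a generic `model_fn` placeholder that stands in for any deployed LLM; the perturbation functions and the toy triage model are illustrative assumptions, not the researchers’ actual code or data.

```python
# Minimal sketch of a perturbation audit: does the model's triage advice change
# when a patient's message is rewritten with typos and extra whitespace?
# `model_fn` is a placeholder for whatever LLM-based triage system is being audited.
import random
from typing import Callable

def add_extra_spaces(text: str, rate: float = 0.15) -> str:
    """Randomly double some spaces to mimic messy formatting."""
    return "".join(c + " " if c == " " and random.random() < rate else c for c in text)

def add_typos(text: str, rate: float = 0.05) -> str:
    """Swap adjacent letters at a low rate to mimic typing errors."""
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and random.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def audit(message: str, model_fn: Callable[[str], str], n: int = 20) -> float:
    """Return the fraction of perturbed messages whose recommendation flips."""
    baseline = model_fn(message)
    flips = sum(
        model_fn(add_typos(add_extra_spaces(message))) != baseline
        for _ in range(n)
    )
    return flips / n

if __name__ == "__main__":
    # Toy stand-in for an LLM; a real audit would query the deployed model instead.
    def toy_model(msg: str) -> str:
        return "seek care" if "chest pain" in msg.lower() else "self-manage"

    msg = "I've had chest pain and shortness of breath since last night."
    print(f"Recommendation flip rate: {audit(msg, toy_model):.0%}")
```

The sketch only illustrates the principle the study tests: edits that carry no clinical information should not change the advice a patient receives.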
The study found that colorful language, such as slang or dramatic expressions, had the greatest impact on model errors. In follow-up research, human clinicians were unaffected by the same message variations. “LLMs weren’t designed to prioritize patient care,” Ghassemi added, urging caution in their use for high-stakes medical decisions.

The researchers plan to further investigate how LLMs infer gender and to design tests that capture vulnerabilities in other patient groups, with the aim of improving the reliability of AI in health care.