Disclaimer: this whole article was written by a human without the use of any large language model.
One year ago, an OBGYN physician friend of mine was venting to me about just how much time he spent writing administrative documents and boilerplate responses instead of having meaningful conversations with patients. He was frustrated and looking for a solution. He tried software, scribes, and medical students – all with mixed success. I recently showed him a note that GPT-4 generated about a fake patient in a few seconds. His response: “How long until it can deliver the baby for me?”
Six months ago, ChatGPT took most of the professional world by storm. Professionals in education, law, and software started experimenting with incorporating it into their daily work. I started receiving suspiciously long, typo-free emails from marketers. Medicine was mostly spared at the time given the limited performance of GPT-3.5. It was generally understood that GPT-3.5 was not relevant to clinical care due to: (1) the unique level of risk in medical communication; (2) the extremely specialized knowledge context of medicine; and (3) GPT-3.5 failing the medical Turing test: was this information written by a real human physician? (Nov et al. 2023).
In the last month, GPT-4 has changed the game. It produces shockingly convincing conversational responses to very detailed clinical questions. It attempts to pass not only as a physician, but potentially as a specialist with more knowledge and certainty than the actual physician using it. The problem: it can be very wrong. Our team of expert clinicians in obstetrics and maternal-fetal medicine has been evaluating its answers on a number of pregnancy topics including preeclampsia and gestational diabetes. While GPT-4 sometimes answers correctly and with surprising depth, it typically switches between competing theories of disease and management in response to slight variations in user prompts. It also fabricates incorrect or unproven knowledge in ways that could harm patients if it were fully relied upon. Annoyingly for clinicians looking to use it in a research capacity, GPT-4 also still invents its sources.
In obstetrics in particular, we see specific challenges for applying Large Language Models (LLMs) to clinical use cases. We lack scientific consensus on clinical etiologies and treatment paradigms for many obstetric conditions. This uncertainty amplifies the performance variability of LLMs trained on the corpus of established knowledge. Risk mitigation as a core tenet of obstetrics also mandates a thoughtful approach to any new technology, particularly one as potentially transformative as LLMs. The massive racial disparities in maternal health for Black and Indigenous patients further necessitate rigorous validation before applying LLMs to obstetrics.
The web-based data on which LLMs like GPT-4 are trained contain substantial biases, including racial bias. The use of algorithms in maternal healthcare already poses significant challenges. Algorithms can reinforce existing racial biases in care, producing worse pregnancy outcomes for people of color. For example, the Vaginal Birth After Cesarean (VBAC) risk algorithm predicted a significantly lower likelihood of successful VBAC for Black and Hispanic patients. This algorithm was widely used for decades, and contributed to higher C-section rates for people of color (Vyas et al. 2020). In 2021, the VBAC algorithm was updated to remove race as a variable. LLMs, by contrast, have complex, human-like biases that can be difficult to identify and eliminate. Unlike in the case of the VBAC algorithm, we have no simple fix ready for a world in which LLMs may amplify racial disparities in healthcare.
Last week, Google enabled distribution of Med-PaLM 2, their recently released LLM, to healthcare customers. Not to be outdone, this week at the HIMSS conference, Microsoft announced that GPT-4 will be integrated into Epic at three sites. The potential efficiency gains for our healthcare system could save lives if implemented effectively – and my friend will be thrilled to have automatically generated messages and documentation. However, before LLMs can be safely deployed, we need to train them to fully mitigate the risks of adverse outcomes they may cause – risks we know to be disproportionately borne by patients of color.
I hope that these LLMs, when appropriately trained, may one day help obstetricians access and grow the global fund of knowledge in caring for their patients. So as I told my OBGYN friend, while GPT won’t be delivering the baby anytime soon, we must adapt these models so that they help obstetricians take their practice to the next level – securely and equitably for all their patients.