The artificial intelligence (AI) text chatbot ChatGPT misdiagnosed 83% of children’s health problems in a case challenge issued by doctors at a New York children’s hospital, a new study showed.

The study, published Jan. 2 in JAMA Pediatrics, a peer-reviewed journal of the American Medical Association, was led by Joseph Barile of Cohen Children’s Medical Center in New Hyde Park, New York.

Barile and other researchers challenged ChatGPT version 3.5 to diagnose children’s illnesses by feeding it randomly selected pediatric cases from Massachusetts General Hospital in Boston, reported over the past 10 years in JAMA Pediatrics and The New England Journal of Medicine.

The study’s authors challenged ChatGPT to make a diagnosis in 100 cases of children’s health problems. In many cases, the AI chatbot failed even to identify the correct organ system of the child’s affliction.

The results were graded by two researcher physicians who found the chatbot made 72 incorrect diagnoses. Another 11 diagnoses “were clinically related but too broad to be considered a correct diagnosis.”
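
The researchers used ChatGPT’s chat interface, but for readers curious how such a benchmark might look in practice, below is a minimal, illustrative sketch of the same workflow against the OpenAI API. The prompt wording, model settings and function name are assumptions for illustration, not the researchers’ actual protocol.

```python
# Illustrative sketch only: feed each published case challenge to GPT-3.5
# and store its answer for later grading by physician researchers.
# The prompt text and settings are assumptions, not the study's protocol.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chatbot_diagnosis(case_text: str) -> str:
    """Ask the model for a differential and a single final diagnosis."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # deterministic output keeps grading reproducible
        messages=[
            {"role": "system",
             "content": "You are assisting with a pediatric diagnostic exercise."},
            {"role": "user",
             "content": case_text + "\n\nGive a differential diagnosis and "
                                    "your single most likely final diagnosis."},
        ],
    )
    return response.choices[0].message.content

# Two physician researchers would then score each answer as correct,
# incorrect, or "clinically related but too broad," as the study describes.
```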

In 43.3% of cases (36 of 83) where ChatGPT erred, the chatbot failed to identify the correct organ system of the patient’s affliction.

“Most of the incorrect diagnoses generated by the chatbot (47 of 83 or 56.7%) belonged to the same organ system as the correct diagnosis,” the study reported, “but were not specific enough to be considered correct,” for example confusing psoriasis and seborrheic dermatitis.

Despite the chatbot’s high diagnostic failure rate, the authors concluded that “physicians should continue to investigate the applications” of AI chatbots to medicine, citing AI’s growing ability “to process information and provide users with insights from vast amounts of data.”

Generative AI is increasingly being used in healthcare, with predictions that 2024 may be the year “artificial intelligence transforms medicine.”

But the JAMA Pediatrics study also showed the superior training and enduring value of physicians, the authors said.

“The underwhelming diagnostic performance of the chatbot observed in this study underscores the invaluable role that clinical experience holds,” they wrote.

“The chatbot evaluated in this study — unlike physicians — was not able to identify some relationships, such as that between autism and vitamin deficiencies.”

Dr. Ryan Cole, a pathologist and COVID-19 researcher trained at the Mayo Clinic and Columbia School of Medicine who founded a large diagnostic medical laboratory in Boise, Idaho, was not surprised the chatbot made so many diagnostic mistakes.

He cited “The Truth About the Drug Companies: How They Deceive Us and What to Do About It,” the 2005 book Dr. Marcia Angell wrote after stepping down as editor-in-chief of The New England Journal of Medicine, where she had spent two decades.

“Marcia wrote a kind of tell-all reporting that the majority of medical studies were corrupted and pharma-funded and less than half the medical literature can be trusted,” he said.

AI is only as good as its source and its programmers, Cole said.

“If you look at the claims in the media that the COVID shots saved millions of lives, that’s a flawed mathematical conclusion based on the flawed mathematical model of Neil Ferguson at Imperial College London,” he said.

“If you’re using a source like the medical journals that are at least 50% incorrect based on the people who run these institutions, of course the machine will reach incorrect conclusions.”

The JAMA Pediatrics researchers said a representative misdiagnosis by ChatGPT was a case of rash and arthralgias, or joint pain, in a teenager with autism. The chatbot said it was “immune thrombocytopenic purpura,” a rare autoimmune disorder in which a person’s blood doesn’t clot properly.

A physician researcher correctly diagnosed the problem as scurvy.

A representative example of ChatGPT’s diagnostic success was a 15-year-old girl with unexplained intracranial hypertension, a build-up of pressure around the brain that can cause headaches and vision loss.

The physician diagnosed the problem as “primary adrenal insufficiency (Addison disease).” The chatbot said it was “adrenal insufficiency (Addison disease).”

The authors said the study, published as a research letter in JAMA Pediatrics, was the first to investigate the accuracy of AI chatbots “in solely pediatric scenarios, which require the consideration of the patient’s age alongside symptoms.”

The authors said an earlier study, published June 15, 2023, in JAMA, found that an “artificial intelligence (AI) chatbot rendered a correct diagnosis in 39% of New England Journal of Medicine (NEJM) case challenges,” demonstrating the potential value of AI as a diagnostic tool.

“Chatbots have potential as an administrative tool for physicians, demonstrating proficiency in writing research articles and generating patient instructions,” the authors of the JAMA Pediatrics study said.

For example, electronic health record vendor Epic has collaborated with Microsoft to incorporate OpenAI’s GPT-4 with “the goal of automating components of clinical documentation.”
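
Epic’s integration is proprietary, but as a rough illustration of what “automating components of clinical documentation” can mean, here is a hedged sketch that drafts plain-language patient instructions from a visit note. The prompt and function name are invented for illustration and do not reflect Epic’s implementation.

```python
# Hypothetical example of LLM-assisted documentation, not Epic's actual
# integration: draft after-visit instructions from a clinician's note.
from openai import OpenAI

client = OpenAI()

def draft_patient_instructions(visit_note: str) -> str:
    """Turn a clinician's shorthand note into plain-language instructions.
    A physician would still review and edit the draft before release."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Rewrite the clinical note as clear after-visit "
                        "instructions for the patient. Do not add medical "
                        "advice beyond what the note contains."},
            {"role": "user", "content": visit_note},
        ],
    )
    return response.choices[0].message.content
```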

But AI chatbots need more training to become better diagnosticians, the authors concluded, a process called “tuning,” in which physicians take part in training the bot.

Chatbots “are typically non-specifically trained on a massive amount of internet data, which can often be inaccurate,” the authors said. They “do not discriminate between reliable and unreliable information but simply regurgitate text from the training data to generate a response.”

An AI chatbot also typically lacks real-time access to new research, current health trends and disease outbreaks. But some new chatbots, “like Google’s Med-PaLM 2, have been specifically trained on medical data and may be better equipped to provide accurate diagnoses.”
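
The “tuning” the authors mention generally means supervised fine-tuning on curated, physician-vetted examples rather than raw internet text. As a hedged sketch, the snippet below prepares such examples in OpenAI’s published JSONL chat fine-tuning format; the case data are invented placeholders, not real records or the study’s materials.

```python
# Sketch of preparing physician-vetted examples for supervised fine-tuning,
# the "tuning" the authors describe. Case data below are invented
# placeholders, not real patient records or the study's materials.
import json

vetted_cases = [
    {"presentation": "Teenager with autism and a restricted diet presents "
                     "with rash, joint pain and gum disease.",
     "diagnosis": "Scurvy (vitamin C deficiency)"},
    # ... more physician-reviewed case/diagnosis pairs ...
]

with open("medical_finetune.jsonl", "w") as f:
    for case in vetted_cases:
        record = {"messages": [
            {"role": "system", "content": "You are a pediatric diagnostic aid."},
            {"role": "user", "content": case["presentation"]},
            {"role": "assistant", "content": case["diagnosis"]},
        ]}
        f.write(json.dumps(record) + "\n")

# The resulting JSONL could then be submitted to a fine-tuning job so the
# model learns from vetted medical signal instead of unvetted internet text.
```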

Cole said AI is a potentially useful tool but is not close to replicating the knowledge and intuition of an experienced clinician.

“In medicine, we have a saying, ‘If you’re under a bridge and you hear hoofbeats overhead, it’s a horse. But then again, it may be a zebra.’ Diagnosis is a skill set built up over experience and time. An autopilot may do a fine job flying a plane. But when it’s time to take off or land, it’s the pilot with his years of experience in unpredictable situations that counts,” Cole said.