As more machine learning tools reach patients, developers are starting to reckon with their potential for bias. But a growing body of research emphasizes that even carefully trained models, including models built to ignore race, can create inequalities in care.
Researchers at the Massachusetts Institute of Technology and IBM Research recently showed that algorithms based on clinical notes, the free-form notes clinicians jot down during patient visits, could predict a patient’s self-identified race, even when the data had been stripped of explicit mentions of race. It’s a clear sign of a deeper problem: race is so entrenched in clinical information that straightforward approaches like redaction fail to guarantee that algorithms aren’t biased.
“People have the misconception that simply including or excluding race as a variable is enough to consider a model fair or unfair,” said Suchi Saria, director of the machine learning and health care lab at Johns Hopkins University and CEO of Bayesian Health. “And the paper makes it clear that it’s not just the explicit mention of race that matters. Race information can be gleaned from all the other data out there.”
In the paper, which has not yet been peer-reviewed, the researchers collected clinical nursing notes from Beth Israel Deaconess Medical Center and Columbia University Medical Center for patients who self-reported their race as white or black. After removing racial terms, they trained four different machine learning models to predict a patient’s race based solely on the notes. The models performed astonishingly well: each achieved an area under the curve (AUC), a measure of a model’s performance, greater than 0.7, with the best models in the 0.8 range.
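The study’s basic setup can be sketched in a few lines. The snippet below is a hedged illustration using scikit-learn and made-up stand-in notes, not the paper’s actual data or models: fit a text classifier on race-redacted notes and score how well it predicts self-reported race with AUC.

```python
# A minimal sketch of the study's setup, on toy stand-in notes rather than
# the actual Beth Israel / Columbia data: train a classifier on
# race-redacted text and measure how well it predicts self-reported race.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical redacted notes; labels are self-reported race (1 = black, 0 = white).
notes = [
    "sickle cell crisis pain managed with analgesia",
    "keratosis on forearm dermatology consult placed",
    "sarcoidosis follow up pulmonary function stable",
    "basal cell carcinoma excision site healing well",
] * 10
labels = [1, 0, 1, 0] * 10

# Bag-of-words features over the redacted text.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(notes)

model = LogisticRegression().fit(X, labels)
scores = model.predict_proba(X)[:, 1]

# AUC: probability the model ranks a random positive above a random negative.
auc = roc_auc_score(labels, scores)
print(f"AUC: {auc:.2f}")
```

The paper’s models cleared 0.7 AUC on real patients; here the score is computed on the training set purely to show the mechanics of the pipeline.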
In itself, the fact that machine learning models can pick up on a patient’s self-reported race isn’t all that surprising. For example, the models picked up on words associated with comorbidities that are more common in black patients, and with skin conditions that are more commonly diagnosed in white patients. In some cases, that may not be harmful. “I would argue that there are cases where you want to include race as a variable,” said Saria, who was not involved in the study.
But the fact that the models picked up subtle racial differences ingrained in doctors’ notes illustrates how difficult, if not impossible, it is to design a truly race-agnostic algorithm. Race is imprinted throughout medical records: not just in the words doctors use, but also in the vital signs and medical images they collect using devices designed with “typical” patients in mind. “There’s no way you can erase race from the dataset,” said MIT computational physiology researcher and co-author Leo Anthony Celi. “Don’t even try; it’s not going to work.”
To emphasize that point, the researchers tried to hobble their race-prediction models by removing the words that were most predictive of each race. But even when they stripped those 25 top words from the clinical notes, the best-performing model’s AUC only dropped from 0.83 to 0.73.
The results echoed another paper in the researchers’ series, recently published in The Lancet Digital Health, which examined machine learning models similarly trained to predict self-reported race based on CT scans and X-rays. The predictions also held up well, even when the images were so blurry and gray that radiologists couldn’t identify anatomical features — a result that still can’t be explained.
“The models can see things that we humans can’t appreciate or judge,” said lead author Judy Gichoya, a radiologist and machine learning researcher at Emory University. “It wasn’t just the detection of race that was surprising. It’s that it’s hard to identify how, even when it happens.”
Compounding the problem, human experts looking at the same redacted notes and images were unable to detect a patient’s race.
“I think that’s the biggest concern,” said Marzyeh Ghassemi, leader of the Healthy ML group at MIT and co-author of both papers. “If I, as a third-party software purchaser or as a clinician, deployed a model trained on these race-redacted notes, it could give me much worse performance for all the black patients in my dataset, and I’d have no idea.”
Because there are no requirements for clinical AI tools to report their performance in different subgroups (most report a single, aggregated performance figure), “it will fall to model users to perform these internal audits,” Ghassemi said. In a synthetic experiment, she and her colleagues also showed how a model trained on race-redacted notes could perpetuate care disparities, recommending analgesia as a treatment for acute pain less frequently for black patients than for white patients.
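What such an internal audit looks like in miniature, assuming self-reported group labels are available (the predictions and labels here are invented for illustration): compute the performance metric per subgroup instead of reporting one aggregate number.

```python
# Hedged sketch of a subgroup audit on invented predictions: an aggregate
# AUC can look healthy while one group's performance is far worse.
from collections import defaultdict

# (self-reported group, true outcome, model score) -- toy values.
records = [
    ("white", 1, 0.9), ("white", 0, 0.2), ("white", 1, 0.8), ("white", 0, 0.3),
    ("black", 1, 0.4), ("black", 0, 0.5), ("black", 1, 0.6), ("black", 0, 0.7),
]

def auc(pairs):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in pairs if y == 1]
    neg = [s for y, s in pairs if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

by_group = defaultdict(list)
for group, y, score in records:
    by_group[group].append((y, score))

# One aggregated number hides the failure on the black subgroup.
overall = auc([(y, s) for _, y, s in records])
print(f"overall AUC: {overall:.2f}")         # 0.81
for group, pairs in sorted(by_group.items()):
    print(f"{group} AUC: {auc(pairs):.2f}")  # black: 0.25, white: 1.00
```

The overall number looks acceptable, but the per-group breakdown shows the model ranking black patients worse than chance, exactly the kind of gap a single aggregated metric conceals.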
If machine learning developers can’t rely on end-users to sound the alarm about how algorithms perform in different settings, Saria said, “the question is, how can we think better about evaluating whether the information being inferred leads to disparate treatment and unequal care?”
The field will likely need to take a top-down approach to investigating the safety, efficacy, and fairness of clinical algorithms. Otherwise, human experts will only find the biases they already think to look for. “This is where end-to-end, checklist-like work comes in,” said Saria. “It lays out, from start to finish, the many sources of bias that can arise. How do you look for them? How do you find the signal and the questions to identify them?”
Only then can developers implement nuanced solutions, such as collecting more data from underrepresented patient groups, recalibrating the model, or developing clinical policies that require health care providers to account for an algorithm’s poor performance when making decisions.
“The main point is that nothing simple is going to work here,” Ghassemi said, whether that’s removing race-related words or penalizing an algorithm for using race as a variable. “Health care data is generated by people who serve, work on, or care for other people, and it is going to carry the exhaust of that process.”
The results of both studies underscore the need to collect more self-reported demographic information: not just race, but characteristics such as socioeconomic status, gender identity, sexual orientation, and a variety of social determinants of health. “That information is important for us to make sure you don’t have any unintended consequences from your algorithm,” Celi said. Indeed, this study would not have been possible without clear data on patients’ self-reported race.
Until that kind of auditing is standardized, however, Celi urges caution. “We’re not ready for AI, no industry is really ready for AI, until they find out that the computers are learning things they shouldn’t.”