Like many fields in science, machine learning and AI have been facing a major reproducibility crisis over the past two years.
"If a doctor is using machine learning or an Artificial Intelligence tool to aid in patient care , and that tool does not perform up to the standards reported during research process ,then that could risk harm to the patient, and it could generally lower the quality of care. " Marzyeh Ghassemi (University of Toronto).
In a paper describing her team's analysis of 511 other papers, Ghassemi reported that machine learning papers in healthcare are reproducible far less often than those in other machine learning subfields. The group's report was published this week in Science Translational Medicine. And in a systematic review in Nature Machine Intelligence, 85 percent of studies using machine learning to detect COVID-19 in chest scans failed reproducibility and quality checks; none of the models was close to ready for clinical use, the authors say.
"We are surprised at how far the models are from being ready for development." Derek Driggs, co-author of the papers from the lab of Carola-Bibiane Schönlieb (University of Cambridge).
At the beginning of the pandemic, Schönlieb and colleagues formed a multidisciplinary team, the AIX-COVNET collaboration, to develop a model that uses chest X-rays to predict COVID-19 severity. But after a literature review, the team found that many published models contained biases that made them unfit for the clinic. So instead of building their own model, they turned their attention to the literature itself.
Driggs says " We realized the best way to help would be by setting rigid research standards that could help people develop models that could actually be useful to clinicians."
The team collected 2,212 machine learning studies and narrowed them down to 415 models for detecting or predicting COVID-19 infection from chest scans. Only 62 of those 415 models passed the two standard reproducibility and quality checklists, CLAIM and RQS.
According to Driggs, there is a huge reproducibility problem because many studies simply did not report enough of their methodology for others to recreate their models.
Among the 62 models, including two currently in use in clinics, the team found that none was developed in a way that would allow it to actually be deployed in medicine, owing to biases and methodological flaws in the studies.
For example, 16 of those 62 models used dataset images of children's lungs as healthy controls, without mentioning it in the methodology, and then tested the algorithms on images of adults' lungs with COVID-19, thus training the models to differentiate between children's and adults' lungs rather than between healthy and virus-infected ones. Moreover, some papers failed to describe their datasets in detail or even to name their source.
At the University of Toronto, Ghassemi and her team evaluated 511 machine learning papers presented at machine learning conferences from 2017 to 2019, annotating each paper for the types of reproducibility it supported.
In terms of technical reproducibility, the ability to fully replicate the code against the same dataset used by the authors, only 55 percent of machine learning for health (MLH) papers made their code available and used public datasets, a far lower share than among computer vision and natural language processing papers.
"What is worrying is that the datasets are not available, " says Ghassemi ."I did not realize quite how bad it was until we rea through the papers."
In terms of conceptual reproducibility, the ability to reproduce results using different datasets, only 23 percent of MLH papers used multiple datasets to confirm their results, compared with 80 percent of computer vision studies and 58 percent of natural language processing studies.
Machine learning is especially challenging in healthcare because datasets are restricted by health privacy concerns, and even experts can disagree over how to interpret a given scan or patient. Still, researchers remain optimistic that the field will do better.
Driggs and his multidisciplinary team have a set of recommendations to fix the problem. Building teams that combine machine learning researchers and clinicians, for example, could bridge the disconnect between the medical and machine learning communities.
Some research teams are also creating diverse and representative datasets for the machine learning health community to use.
The researchers also suggest that health organizations apply data standards such as the Observational Medical Outcomes Partnership (OMOP) standard and the Fast Healthcare Interoperability Resources (FHIR) standard to MLH research.
Source: IEEE Spectrum