Steps to avoid overuse and misuse of ML in clinical research (Review)
 Information
link: https://www.nature.com/articles/s41591-022-01961-6
Introduction
- The lackluster performance of many machine learning (ML) systems in healthcare has been well documented.
- In healthcare, AI algorithms can even perpetuate human prejudices, such as sexism and racism, when trained on biased datasets.
 
 
 
Box 1 | Recommendations to avoid overuse and misuse of AI in clinical research
- Whenever appropriate, (predefined) sensitivity analyses using traditional statistical models should be presented alongside ML models.
- Protocols should be published and peer reviewed whenever possible, and the choice of model should be stated and substantiated.
- All model performance parameters should be disclosed and, ideally, the dataset and analysis script should be made public.
- Publications using ML algorithms should be accompanied by disclaimers about their decision-making process, and their conclusions should be carefully formulated.
- Researchers should commit to developing interpretable and transparent ML algorithms that can be subjected to checks and balances.
- Datasets should be inspected for sources of bias and necessary steps taken to address biases.
- The type of ML technique used should be chosen taking into account the type, size and dimensionality of the available dataset.
- ML techniques should be avoided when dealing with very small, but readily available, convenience clinical datasets.
- Clinician–researchers should aim to procure and utilize large, harmonized multicenter or international datasets with high-resolution data, if feasible.
- A guideline on the choice of statistical approach, whether ML or traditional statistical techniques, would aid clinical researchers and highlight proper choices.

Failure to replicate
- At the beginning of the COVID-19 pandemic, the development of ML algorithms to estimate the probability of infection was a hot topic. These algorithms based their predictions on various data elements captured in electronic health records, such as chest radiographs.
- Despite promising initial validation results, the success of numerous artificial neural networks trained on chest X-rays was largely not replicated when the models were applied in different hospital settings, in part because the models failed to learn the true underlying pathology of COVID-19.
- These ML algorithms were not explainable and were inferior to traditional diagnostic techniques such as RT-PCR, obviating their usefulness.
- More than 200 prediction models were developed for COVID-19, some using ML, and virtually all suffer from poor reporting and a high risk of bias.
 
Avoiding overuse
- The term ‘overuse’ refers to the unnecessary adoption of AI or advanced ML techniques where alternative, reliable or superior methodologies already exist. In such cases, the use of AI and ML techniques is not necessarily inappropriate or unsound, but the justification for such research is unclear or artificial: for example, a novel technique may be proposed that delivers no meaningful new answers.
 
 
- A high AUC is not necessarily a mark of quality, as the ML model might be over-fit (Fig. 1).
- Figure 1: Model fitting
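A minimal sketch of this over-fitting pattern, using Python with scikit-learn on simulated data (both are illustrative choices, not part of the original article): an unconstrained model can look near-perfect in-sample while generalizing poorly, whereas a simple regression shows a much smaller train–test gap.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Illustrative over-fitting demo on a small, noisy, simulated dataset
X, y = make_classification(n_samples=200, n_features=20, n_informative=3,
                           flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for name, model in [("unpruned decision tree", DecisionTreeClassifier(random_state=0)),
                    ("logistic regression", LogisticRegression(max_iter=1000))]:
    model.fit(X_tr, y_tr)
    auc_train = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
    auc_test = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    # A large train-test AUC gap (e.g., 1.00 in-sample vs. much lower out-of-sample)
    # is the signature of over-fitting sketched in Fig. 1.
    print(f"{name}: train AUC = {auc_train:.2f}, test AUC = {auc_test:.2f}")
```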

- When a traditional regression technique is applied and compared against ML algorithms, the more sophisticated ML models often offer only marginal accuracy gains, presenting a questionable trade-off between model complexity and accuracy.
 
- Even very high AUCs are no guarantee of robustness: an AUC of 0.99 is possible with an overall event rate of <1%, in which case all negative cases may be predicted correctly while the few positive events are missed (see the sketch below).
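A small Python/scikit-learn sketch of that scenario (the simulated scores and event rate are assumptions for illustration): the model ranks the rare events almost perfectly, yet at a conventional 0.5 threshold every patient is called negative and all events are missed.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, recall_score

rng = np.random.default_rng(0)
y = np.zeros(10_000, dtype=int)
y[:50] = 1  # 0.5% event rate

# Positives always score higher than negatives (near-perfect ranking),
# but no score ever crosses the conventional 0.5 decision threshold.
scores = np.where(y == 1,
                  rng.uniform(0.10, 0.40, y.size),
                  rng.uniform(0.00, 0.05, y.size))

pred = (scores >= 0.5).astype(int)
print("AUC:", roc_auc_score(y, scores))       # ~1.0
print("Accuracy:", (pred == y).mean())        # 0.995: every negative case is correct
print("Sensitivity:", recall_score(y, pred))  # 0.0: every positive event is missed
```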
 
- Many simple medical prediction problems are inherently linear, with features that are chosen because they are known to be strong predictors, usually on the basis of prior research or mechanistic considerations. In these cases, it is unlikely that ML methods will provide a substantial improvement in discrimination.
 
 
- Modest improvements in medical prediction accuracy are unlikely to yield a difference in clinical action.
 
- ML techniques should be evaluated against traditional statistical methodologies before they are deployed.
- If the objective of a study is to develop a predictive model, ML algorithms should be compared to a predefined set of traditional regression techniques for Brier score (an evaluation metric similar to the mean squared error, used to check the goodness of a predicted probability score; see https://data.library.virginia.edu/a-brief-on-brier-scores/), discrimination (or AUC) and calibration. The model should then be externally validated.
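A minimal sketch of such a predefined comparison in Python with scikit-learn (an illustrative choice of tooling; the simulated data stand in for a real development cohort, and external validation on a separate cohort would still be required):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.calibration import calibration_curve

# Simulated stand-in for a clinical development cohort
X, y = make_classification(n_samples=2000, n_features=10, n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "logistic regression (traditional)": LogisticRegression(max_iter=1000),
    "gradient boosting (ML)": GradientBoostingClassifier(random_state=0),
}

for name, model in models.items():
    prob = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    brier = brier_score_loss(y_te, prob)   # mean squared error of predicted probabilities
    auc = roc_auc_score(y_te, prob)        # discrimination
    obs, pred = calibration_curve(y_te, prob, n_bins=10)  # calibration: observed vs. predicted risk
    print(f"{name}: Brier = {brier:.3f}, AUC = {auc:.3f}")
```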
 
 
Rationalize usage
- Researchers should start any ML project with clear project goals and an analysis of the advantages that AI, ML or conventional statistical techniques deliver in the specific clinical use case.
 
- If the objective of a study is to develop a new prognostic nomogram or predictive model, there is little evidence that ML will fare better than traditional statistical models, even when dealing with large and highly dimensional datasets.
- If the purpose of a study is to infer a causal treatment effect of a given exposure, many well-established traditional statistical techniques, such as structural equation modelling, propensity-score methodology, instrumental variables analysis and regression discontinuity analysis, yield readily interpretable and rigorous estimates of the treatment effect.
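As one hedged illustration of the propensity-score approach mentioned above, the Python/NumPy/scikit-learn sketch below applies inverse-probability-of-treatment weighting to simulated data (the dataset, variable names and effect size are assumptions for demonstration only, not from the article):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5_000
X = rng.normal(size=(n, 3))                      # simulated confounders
t = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))  # treatment assignment depends on confounders
y = 2.0 * t + X[:, 0] + rng.normal(size=n)       # outcome; true treatment effect = 2.0

# 1) Propensity model: P(treated | confounders)
ps = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]

# 2) Inverse-probability-of-treatment weights
w = np.where(t == 1, 1 / ps, 1 / (1 - ps))

# 3) Weighted difference in mean outcomes estimates the average treatment effect
ate = (np.average(y[t == 1], weights=w[t == 1])
       - np.average(y[t == 0], weights=w[t == 0]))
print(f"Estimated treatment effect: {ate:.2f} (true value: 2.0)")
```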
 
 
Avoiding misuse
- The term ‘misuse’ connotes more egregious usages of ML, ranging from problematic methodology that engenders spurious inferences or predictions, to applications of ML that endeavor to replace the role of physicians in situations that still require human input.
 
 
- Indiscriminately accepting an AI algorithm purely based on its performance, without scrutinizing its internal workings, represents a misuse of ML, although it is questionable to what extent every clinician decision is robustly explainable.
 
 
- Many groups have called for explainable ML or the incorporation of counterfactual reasoning in order to disentangle correlation from causation.
- The notion of a ‘black box’ that underpins clinical decision-making is an antithesis to the modern practice of medicine and is increasingly inaccurate, given the growing armamentarium of techniques, such as saliency maps and generative adversarial networks, that can be used to probe the reasoning made by neural networks.
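For instance, a gradient-based saliency map can be sketched as follows, assuming PyTorch and torchvision (my choice of framework, not the article's); the random input and untrained weights are placeholders, so in practice a trained clinical imaging model and a real preprocessed image would be used.

```python
import torch
from torchvision import models

# Placeholder model and input; substitute a trained clinical imaging network
# and a real preprocessed image in practice.
model = models.resnet18(weights=None)
model.eval()
x = torch.rand(1, 3, 224, 224, requires_grad=True)

# Forward pass, then backpropagate the top-class score to the input pixels
logits = model(x)
top_class = logits.argmax(dim=1).item()
logits[0, top_class].backward()

# Saliency: largest absolute gradient across colour channels per pixel,
# highlighting the regions that most influence the prediction
saliency = x.grad.abs().max(dim=1).values.squeeze()  # shape: (224, 224)
print(saliency.shape)
```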
 
 
- Researchers should commit to developing ML models that are interpretable, with their reasoning standing up to scrutiny by human experts, and to sharing de-identified data and scripts that would allow external replication and validation.
 
 
 
Data constraints
- Usage of ML in spite of data constraints, such as biased data and small datasets, is another misuse of AI.
 
- Deep learning techniques are known to require large amounts of data, but many publications in the medical literature feature techniques with much smaller sample and feature-set sizes than are typically available in other technological industries.
- Meta’s Facebook trained its facial recognition software using photos from more than 1 billion users; autonomous automobile developers use billions of miles of road traffic video recordings from hundreds of thousands of individual drivers in order to develop software to recognize road objects; and Deep Blue and AlphaGo learned from millions or billions of played games of chess and Go.
 
 
- In contrast, clinical research studies involving AI generally use thousands or hundreds of radiological and pathological images, and surgeon–scientists developing software for surgical phase recognition often work with no more than several dozen surgical videos.
 
 
- These observations underscore the relative poverty of big data in healthcare and the importance of working toward sample sizes like those attained in other industries, as well as the importance of a concerted, international big-data sharing effort for health data.
 
 
 
Human–machine collaboration
- The respective functions of humans and algorithms in delivering healthcare are not the same.
- ML algorithms can complement, but not replace, physicians in most aspects of clinical medicine, from history-taking and physical examination to diagnosis, therapeutic decisions and performing procedures.
 
 
- Clinician–investigators must therefore forge a cohesive framework whereby big data propels a new generation of human–machine collaboration.
 
 
 
- ML applications are likely to exist as discrete decision-support modules to support specific aspects of patient care, rather than competing against their human counterparts.
 
- Human patients are likely to want human doctors to continue making medical decisions, no matter how well an algorithm can predict outcomes.
 
- ML should, therefore, be studied and implemented as an integral part of a complete system of care.
 
 
- The clinical integration of ML and big data is poised to improve medicine.
- ML researchers should recognize the limits of their algorithms and models in order to prevent their overuse and misuse, which would otherwise sow distrust and cause patient harm.
 
 
 
 
Table 1: Definitions of several key terms in machine learning
