Introduction

Personalised medicine represents a new frontier in healthcare. Data-driven approaches are crucial in optimising individualised rehabilitation pathways by providing reliable, interpretable, and patient-centric predictions1. Moreover, there is a pressing demand for trustworthy prognostic solutions, enabling users to understand and interpret automatic decisions2. However, while tools for personalised treatment decisions are becoming more prevalent in healthcare, their clinical validation and impact on treatment improvement remain largely underexplored3.

Treatment personalisation is particularly relevant in rehabilitative medicine4, where the goal is to adapt the rehabilitation plan to the unique needs of each patient, given the evidence of its positive effects on recovery5. According to Kokkotis et al.6, machine learning (ML) tools can be applied to predict long-term recovery rates from the earliest hours of hospitalisation after a stroke. This suggests ML can assist medical practitioners in deploying novel, individualised rehabilitation approaches, to enhance the quality of life for survivors and the overall quality of care. However, examples of technological tools that support personalised post-stroke rehabilitation treatments are still scarce7.

The use of ML technologies in healthcare presents pitfalls, such as prediction inaccuracy, privacy vulnerabilities, and data scarcity, that can hinder the attainment of results applicable to real-life settings8. A critical challenge is collecting high-dimensional, high-quality data for reliable and reproducible predictions, given the limitations in sample size and data quality in real-world scenarios8. In this context, missing data may represent a significant technical problem9, because it can result in a loss of information, reduced sample size, biased results, and underestimated uncertainty9,10,11. When missing values cannot be avoided by optimising data collection, Multiple Imputation (MImp) is a suitable method for obtaining unbiased results while appropriately accounting for variability11,12. In single imputation, only one value is imputed for each missing entry, so statistical analyses overlook the uncertainty around the unobserved values10. MImp, by contrast, is a statistical technique that generates multiple plausible estimates for each missing value, allowing a correct quantification of the uncertainty associated with missing observations in the data13.
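The contrast between single and multiple imputation can be illustrated with a minimal sketch. It is not the procedure used in this study: it assumes a single numeric variable with values missing completely at random, a simple normal imputation model with bootstrapped parameters, and pooling by Rubin's rules. All function names and the example scores are hypothetical.

```python
import random
import statistics


def mean_and_variance(values):
    """Return the sample mean and its squared standard error."""
    return statistics.fmean(values), statistics.variance(values) / len(values)


def single_mean_imputation(data):
    """Fill every missing entry (None) with the mean of the observed values."""
    observed = [x for x in data if x is not None]
    fill = statistics.fmean(observed)
    return [fill if x is None else x for x in data]


def pool_rubin(estimates, variances):
    """Pool m estimates with Rubin's rules: T = W + (1 + 1/m) * B."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)        # pooled point estimate
    within = statistics.fmean(variances)       # W: average within-imputation variance
    between = statistics.variance(estimates)   # B: variance between imputations
    return q_bar, within + (1 + 1 / m) * between


def multiple_imputation(data, m=20, seed=0):
    """Create m completed datasets and pool the estimated mean."""
    rng = random.Random(seed)
    observed = [x for x in data if x is not None]
    estimates, variances = [], []
    for _ in range(m):
        # Bootstrap the observed values so that uncertainty about the
        # imputation model's parameters is propagated (illustrative choice).
        boot = rng.choices(observed, k=len(observed))
        mu, sigma = statistics.fmean(boot), statistics.stdev(boot)
        # Draw a different plausible value for each missing entry.
        completed = [rng.gauss(mu, sigma) if x is None else x for x in data]
        q, u = mean_and_variance(completed)
        estimates.append(q)
        variances.append(u)
    return pool_rubin(estimates, variances)


if __name__ == "__main__":
    # Hypothetical scores; None marks a missing entry.
    scores = [52.0, None, 61.0, 47.0, None, 58.0, 66.0, None, 49.0, 55.0, None, 63.0]
    print("single:", mean_and_variance(single_mean_imputation(scores)))
    print("multiple:", multiple_imputation(scores))
```

In this sketch the pooled variance cannot be smaller than the naive variance obtained after mean imputation, because the imputed values vary around the observed mean instead of sitting exactly on it, and the between-imputation term adds the disagreement among the completed datasets.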

In ML, missing data is often handled by simply removing the affected entries or by filling them with a single imputation procedure13. However, integrating MImp techniques with ML methods is possible, although rarely addressed, and may lead to superior results by enhancing prediction performance14. A few pioneering contributions specifically address the use of MImp techniques in ML, exploring alternative procedures and their feasibility15,16. Rios et al.15 evaluated the impact of missing values on the accuracy estimates of ML models, employing seven distinct methods for missing data management, including MImp, cluster-based imputation, and regression-based imputation. In that work, MImp emerged as a promising compromise between feasibility and accuracy in predicting the patient-specific risk of adverse cardiac events.

Despite the increasing prevalence of ML methods applied to stroke and ambulation recovery studies17, to the best of our knowledge no attention has been given to integrating advanced missing data management techniques with ML methods. It is therefore urgent to explore and evaluate methods that ensure robust and reliable missing data handling without compromising the overall effectiveness of the analytical process.

This study focuses on the development of predictive models for the prognosis of stroke rehabilitation outcomes, based on the datasets of two multisite observational studies that prospectively and systematically enrolled all adults admitted to intensive inpatient rehabilitation within 30 days after stroke18,19. The recovery of independent ambulation is a key stroke rehabilitation outcome, directly related to community mobility and participation20 and to improved quality of life in the chronic stage of stroke, as well as a determinant of caregiver’s burden21. Furthermore, independent walking is a well-known top priority of stroke patients and their families, with a relevant impact on the patients’ mobility and social destination after discharge. For these reasons, the recovery of independent ambulation can be considered one of the most relevant patient-centred outcomes, as also reported in the International Standard Set of Patient-Centered Outcome Measures After Stroke22. Thus, we focused on the recovery of independent ambulation at discharge from rehabilitation in the subset of stroke survivors who ambulated independently before stroke but lost that ability afterwards. After a careful data pre-processing phase, this study integrated MImp techniques with a cross-validated ML-based predictive model. Influential predictors of ambulation outcomes were then identified using explainable Artificial Intelligence (AI) techniques.