Predicting HIV testing status in pregnant women using balanced machine learning models: Insights from Sierra Leone's demographic health survey

Soladoye, Afeez A.; Olawade, David; Bello, Oluwakemi Jumoke; Analikwu, Claret Chinenyenwa; Daniel, Raphael Igbarumah Ayo; Osborne, Augustus

Soladoye, Afeez A., Olawade, David ORCID: https://orcid.org/0000-0003-0188-9836, Bello, Oluwakemi Jumoke ORCID: https://orcid.org/0009-0007-1435-8766, Analikwu, Claret Chinenyenwa, Daniel, Raphael Igbarumah Ayo and Osborne, Augustus (2026) Predicting HIV testing status in pregnant women using balanced machine learning models: Insights from Sierra Leone's demographic health survey. Decoding Infection and Transmission, 4. p. 100078.

Text
1-s2.0-S2949924026000054-main.pdf - Published Version
Available under License Creative Commons Attribution Non-commercial No Derivatives.

Official URL: https://doi.org/10.1016/j.dcit.2026.100078

Abstract

Objective
Preventing vertical HIV transmission requires comprehensive testing programmes for pregnant women, yet coverage gaps persist across Sub-Saharan Africa. In Sierra Leone, approximately one-third of pregnant women remain untested for HIV, creating substantial public health challenges. Conventional predictive models often exhibit bias towards the majority class in imbalanced datasets, hindering accurate identification of untested women who require urgent intervention. To address the need for models that reliably identify pregnant women at risk of not being tested, this study develops and validates diagnostic machine learning prediction models of HIV testing status among pregnant women in Sierra Leone, emphasising class-balancing techniques to improve minority-class detection and support targeted intervention strategies.

Methods
We analysed data on 990 pregnant women (aged 15-49) from the 2019 Sierra Leone Demographic and Health Survey. Our preprocessing pipeline included categorical variable encoding, feature normalisation via Min-Max scaling, and application of the Synthetic Minority Oversampling Technique (SMOTE) to balance the dataset. Model development employed four supervised learning algorithms: Random Forest, XGBoost, Logistic Regression, and K-Nearest Neighbors. Model performance was evaluated on a 70-30 train-test split using accuracy and macro-averaged precision, recall, and F1-score.
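
The two preprocessing steps named above, Min-Max scaling and SMOTE, can be illustrated with a minimal sketch. This is not the authors' code: it is a simplified, standard-library-only version of the SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours), and the function names, the two toy features, and all data values are hypothetical. In common practice oversampling is applied to the training portion only, so that the test set contains no synthetic records; the abstract does not state which order the authors used.

```python
import math
import random


def min_max_scale(rows):
    """Rescale every column to the [0, 1] range (constant columns become 0.0)."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def smote(minority, n_new, k=5, seed=0):
    """Simplified SMOTE: create n_new synthetic minority samples.

    Each synthetic sample lies on the line segment between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        others = [p for i, p in enumerate(minority) if i != base_idx]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        partner = rng.choice(neighbours)
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + gap * (p - b) for b, p in zip(base, partner)])
    return synthetic


# Hypothetical toy records: [age, number of antenatal visits].
tested = [[25, 3], [31, 1], [40, 4], [22, 0], [35, 2], [28, 3]]  # majority class
untested = [[19, 0], [45, 5], [17, 1]]                           # minority class

scaled = min_max_scale(tested + untested)
tested_scaled = scaled[:len(tested)]
untested_scaled = scaled[len(tested):]

# Oversample the minority class until both classes have the same size.
extra = smote(untested_scaled, n_new=len(tested_scaled) - len(untested_scaled), k=2)
balanced_untested = untested_scaled + extra
```

Because every synthetic point is a weighted average of two scaled minority points, it stays inside the [0, 1] range produced by the scaling step. A production pipeline would more likely rely on a maintained library implementation than on a hand-written one like this.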

Results
Models trained on the imbalanced dataset performed poorly, with macro F1-scores between 0.46 and 0.57. After SMOTE was applied, macro F1-scores rose to 0.55-0.72. Random Forest achieved the highest macro F1-score (0.72), representing a 56% improvement over standard approaches.
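
To show why the macro-averaged F1-score is informative here, the sketch below computes it by hand for a hypothetical classifier that labels every woman as tested. The labels (1 = tested, 0 = untested), the ten toy records, and the function names are illustrative assumptions, not study data. The final lines check one possible reading of the 56% figure: the abstract does not name the baseline, but the relative gain from the lowest imbalanced score (0.46) to 0.72 comes out close to that value.

```python
def f1_for_class(y_true, y_pred, label):
    """F1-score for one class, treating that class as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0  # no correct predictions for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores."""
    labels = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, c) for c in labels) / len(labels)


def accuracy(y_true, y_pred):
    """Share of records whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_improvement(before, after):
    """Relative change from `before` to `after`, as a fraction."""
    return (after - before) / before


# Hypothetical labels: 7 tested (1) and 3 untested (0) women.
y_true = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
# A majority-biased model that predicts "tested" for everyone.
y_pred = [1] * 10

acc = accuracy(y_true, y_pred)
macro = macro_f1(y_true, y_pred)
print(f"accuracy = {acc:.2f}, macro F1 = {macro:.2f}")

# Assumed baseline: the lowest imbalanced macro F1 reported in the abstract.
gain = relative_improvement(0.46, 0.72)
print(f"relative improvement = {gain:.3f}")
```

The majority-biased model looks acceptable on accuracy but scores poorly on macro F1, because the untested class contributes an F1 of zero and each class counts equally in the average.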

Conclusions
Class imbalance mitigation through SMOTE substantially enhances diagnostic prediction model performance for HIV testing status classification, facilitating targeted public health strategies in resource-constrained environments.