Kwok, Wing S. (ORCID: https://orcid.org/0000-0002-6806-3011), Wallbank, Geraldine (ORCID: https://orcid.org/0000-0001-7914-6149), Hodgson, Philip, Schrader, Thomas, Shao, Lexuan (ORCID: https://orcid.org/0009-0007-3914-6532), Elkins, Mark, Fandim, Junior (ORCID: https://orcid.org/0000-0002-3317-0167), Scott, Julia, Sherrington, Cathie and Traeger, Adrian (2026) Automated approaches to identifying clinical trials based on title and abstract in the field of physiotherapy: a comparative analysis. Journal of Clinical Epidemiology. Article 112309. (In Press)
Abstract
Objective
To compare the accuracy, precision, recall, F1 score and time spent when using commercial tools to identify physiotherapy trials based on title and abstract, against a reference human approach.
Study Design
This study compared two approaches for title and abstract screening of 10,793 newly published records. In the reference standard human approach, two reviewers independently screened records using pre-specified rules to assess relevance to physiotherapy, with a third person resolving disagreements. We evaluated three LLMs (gpt-4o, gpt-4.5, gpt-4-turbo) within two commercial, web-based tools (ChatGPT and Copilot). Outcomes were accuracy (proportion of records the model correctly identified as relevant or irrelevant), precision (proportion of records identified as relevant that were judged relevant by the human approach), recall (proportion of all truly relevant records that the model successfully identified), F1 (harmonic mean of precision and recall) and time spent. Exploratory analyses compared the performance of the commercial tools with local approaches, including local LLM implementations, machine learning and natural language processing.
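The four outcome metrics above are standard classification measures. As an illustrative sketch only (not the authors' code), they can be computed from paired human and model relevance labels like so:

```python
def screening_metrics(human, model):
    """Compute accuracy, precision, recall and F1 for binary
    relevance labels (1 = relevant, 0 = irrelevant).

    `human` is the reference-standard labelling; `model` is the
    automated tool's labelling of the same records.
    """
    tp = sum(1 for h, m in zip(human, model) if h == 1 and m == 1)
    fp = sum(1 for h, m in zip(human, model) if h == 0 and m == 1)
    fn = sum(1 for h, m in zip(human, model) if h == 1 and m == 0)
    tn = sum(1 for h, m in zip(human, model) if h == 0 and m == 0)

    accuracy = (tp + tn) / len(human)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1
```

Note that with a low prevalence of relevant records (as in field-specific screening), accuracy can be high even when precision is modest, which is why all four metrics are reported.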
Results
Commercial tools showed comparable performance across all metrics (ChatGPT vs Copilot: accuracy: 83% vs 86%; precision: 44% vs 48%; recall: 88% vs 87%; F1: 59% vs 62%). The total time spent using commercial tools with a labelled dataset was equivalent to 37% of the time required for the human-only screening process. Exploratory analysis showed that the API-based implementation had comparable performance (accuracy: 82%; precision: 42%; recall: 93%; F1: 58%). However, LLM-based models performed worse than other local, custom-adapted automation approaches such as machine learning and natural language processing.
Conclusion
This proof-of-concept study demonstrates that commercial web-based LLMs may have sufficient accuracy to support title and abstract screening and substantially reduce the time needed to identify field-specific trials. Alternative approaches, including machine learning and natural language processing, could achieve screening performance similar to or slightly higher than that of commercial tools, but they require a series of pre-processing steps for implementation.
| Item Type: | Article |
|---|---|
| Status: | In Press |
| DOI: | 10.1016/j.jclinepi.2026.112309 |
| School/Department: | School of Science, Technology and Health |
| URI: | https://ray.yorksj.ac.uk/id/eprint/14725 |