Combining machine learning and crowdsourcing for systematic and scoping review citation screening: Derivation and validation of two hybrid algorithms


The COVID-19 pandemic demonstrated that accepted approaches and research methodologies often do not align with the timelines necessary for responding to public health emergencies. This includes comprehensive knowledge synthesis efforts (systematic or scoping reviews, SRs), which often represent the critical first step in prioritizing research, setting policy, and preparing clinical guidelines. SRs often require months or even years to complete because small teams of 2 to 3 reviewers are tasked with assessing thousands of citations, a major rate-limiting step. A large-team (crowdsourcing) approach has been demonstrated as a valid methodology for rapid completion of comprehensive SRs. This study used data from recent crowdsourced SRs to evaluate whether machine learning (ML) could reduce human workload with minimal loss of sensitivity.


This study used data from 11 SRs performed on an online platform (insightScope) developed to facilitate crowdsourcing. All crowd members participating in an SR passed a test set prior to participation (sensitivity ≥80%). Step one of derivation evaluated five different ML models (TF-IDF, BOW, FastText, Word2Vec, and Doc2Vec) against data from 6 SRs, using the text of the study goal and eligibility criteria to score the title and abstract of each citation. Step two of derivation incorporated the two top-performing ML models into two hybrid human-machine algorithms designed to reduce human workload by 35% (conservative) and 50% (aggressive). The top-performing ML model was then applied within the two hybrid algorithms and evaluated against 5 SRs representing the validation set.
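The scoring idea in step one (comparing the study goal and eligibility criteria text with each citation's title and abstract) can be illustrated with a minimal TF-IDF sketch. This is not the authors' implementation: the tokenizer, the smoothed IDF formula, the use of cosine similarity, the function names, and the example texts are all assumptions made for illustration. How the hybrid algorithms combine such scores with human votes is not described here, so the sketch stops at ranking.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (an assumed, simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed inverse document frequency (an assumed variant).
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: cnt * idf[t] for t, cnt in Counter(toks).items()} for toks in tokenized]


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_citations(criteria_text, citations):
    """Score each citation against the criteria text; return (indices by descending score, scores)."""
    vectors = tfidf_vectors([criteria_text] + citations)
    query, docs = vectors[0], vectors[1:]
    scores = [cosine(query, d) for d in docs]
    order = sorted(range(len(citations)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Hypothetical criteria and citations, invented for illustration only.
criteria = "randomized trials of vitamin D supplementation in critically ill children"
citations = [
    "Vitamin D supplementation in critically ill children: a randomized trial",
    "Soil erosion rates in alpine meadows",
    "",  # a citation with no title or abstract text available
]
order, scores = rank_citations(criteria, citations)
```

In this sketch a citation with no text receives a score of zero, which is one way to see why missing abstracts could lead a text-similarity model to rank an eligible study too low.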


The derivation set included 34 042 total citations, with 911 eligible studies, evaluated by 139 unique reviewers (average team size: 25; range: 12 to 40). The validation set included 17 972 citations, with 259 eligible studies, evaluated by 75 unique reviewers (average team size: 13; range: 8 to 40). Step one of derivation identified TF-IDF and BOW as the top-performing ML models for unaided machine screening at the 70% threshold (TF-IDF: sensitivity 84.7%, range: 73.9–94.5%; BOW: sensitivity 86.6%, range: 81.3–97.3%). In step two of derivation, application of the 35% and 50% work-saving hybrid algorithms determined TF-IDF to be superior, with sensitivities of 95.3% (range: 88.1–98.9%) and 93.7% (range: 86.3–98.0%), respectively. Application of the two hybrid algorithms to the validation set produced sensitivities of 99.0% (range: 96.1–100.0%) and 95.7% (range: 88.3–100.0%) for the 35% and 50% work-saving algorithms, respectively. Evaluation of false negatives identified missing abstracts and non-English text as risk factors. Following validation, the two largest SRs prospectively applied the hybrid algorithms to their remaining unassessed citations (n = 18 300 and 12 900) and reduced human workload by 47% and 44%, respectively.
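The two outcome measures reported above, sensitivity and human workload reduction, can be computed as in the following sketch. The function names and the example numbers are hypothetical and are not taken from the study data; the sketch assumes sensitivity is the fraction of truly eligible citations retained by screening, and workload reduction is the relative drop in human assessments compared with a fully human baseline.

```python
def sensitivity(eligible_ids, retained_ids):
    """Fraction of truly eligible citations that the screening process retained."""
    eligible = set(eligible_ids)
    if not eligible:
        raise ValueError("sensitivity is undefined when there are no eligible citations")
    return len(eligible & set(retained_ids)) / len(eligible)


def workload_reduction(baseline_assessments, hybrid_assessments):
    """Relative reduction in human assessments versus a fully human baseline."""
    if baseline_assessments <= 0:
        raise ValueError("baseline_assessments must be positive")
    return (baseline_assessments - hybrid_assessments) / baseline_assessments


def summarize(per_review_values):
    """Return (minimum, maximum) across reviews, matching the 'range' style of reporting."""
    return min(per_review_values), max(per_review_values)


# Hypothetical example: 4 eligible citations, 3 of them retained after screening.
example_sensitivity = sensitivity({1, 2, 3, 4}, {1, 2, 3, 9})

# Hypothetical example: 1000 human assessments at baseline, 650 under a hybrid approach.
example_saving = workload_reduction(1000, 650)
```

Computing these values per review and then summarizing them across reviews is also a natural way to check a hybrid approach against an initial set of citations that human reviewers have assessed in full.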


Findings suggest potential for a hybrid machine-human crowdsourcing approach to SR screening. Following derivation, validation demonstrated that both algorithms achieved sensitivity at or above 95% while reducing human workload by up to 50%. While these results are encouraging, we recommend that investigators translating this research first validate the algorithms against an initial set of citations reviewed fully by human reviewers. Pending further study, investigators should remove non-English citations and consider an alternative threshold for citations missing abstracts.

Dayre McNally slides
