Streamlining systematic reviews with large language models using prompt engineering and retrieval augmented generation

Table 1 Summary and comparison of the manual method, Rayyan thresholds A and B and the LLM method

Step	Metric	Manual	Rayyan Threshold A	Rayyan Threshold B	LLM
Title/Abstract	Articles to screen*	14,439	14,439	14,439	14,439
	Inclusion Threshold	-	“Undecided”	“Likely to Exclude”	-
	Articles Remaining after automated screening (AER)	-	3,470 (72.1%)	6,131 (50.7%)	3,280 (77.2%)
	Total Articles to Manually Screen	14,439	5,470	8,131	-
	Time taken for all Manual Screening Articles	144.4 h	54.7 h	81.3 h	-
	Time for automated screening	-	-	-	2 h
	True Positives (FNR)	N/A (Gold Standard)	19 (5%)	20 (0%)	20 (0%)
	Total time for Step	144.4 h	54.7 h	81.3 h	2 h
	Total Time Saved compared to manual method (%)	N/A (Gold Standard)	89.7 h (62.1%)	63.1 h (43.7%)	-
Full Text**	Articles to screen	1,680	-	-	3,280
	Time to run automated screening	-	-	-	4 h
	Articles Remaining (AER)	-	-	-	78 (97.6%)
	Time to manually screen remaining articles	420 h	-	-	19.5 h
	True Positives (FNR)	N/A (Gold Standard)	-	-	20 (0%)
	Total time for step	420 h	-	-	23.5 h
Total	Total Time for both steps	564.4 h	-	-	25.5 h
Total	Total Time Saved compared to manual method (%)	N/A (Gold Standard)	-	-	538.9 h (95.5%)

AER: Article Exclusion Rate, FNR: False Negative Rate
*Of the original 17,776 citations, 430 were excluded because their results were inadvertently not saved. Of the remaining 17,346 articles, 2,907 were removed as duplicates, leaving 14,439 articles to screen
**Rayyan was excluded from the full-text comparison, as its article classification feature is not yet supported in its full-text screening platform
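
The derived figures in Table 1 (time saved, AER and the LLM full-text manual screening time) can be checked with a few lines of arithmetic. The following is a minimal sketch in Python, not the authors' code: the function names are hypothetical, and the full-text rate of 0.25 h per article is an assumption inferred from the manual column (420 h for 1,680 articles) rather than a value stated in the table.

```python
# Values are copied from Table 1. The per-article full-text rate is an
# assumption inferred from the manual column, not stated in the source.

def exclusion_rate(screened: int, remaining: int) -> float:
    """Article Exclusion Rate (AER): percentage of screened articles removed."""
    return 100 * (screened - remaining) / screened


def time_saved(baseline_hours: float, method_hours: float) -> tuple[float, float]:
    """Return (hours saved, percentage saved) relative to the manual baseline."""
    saved = baseline_hours - method_hours
    return saved, 100 * saved / baseline_hours


manual_ta_hours = 144.4   # manual title/abstract screening
manual_ft_hours = 420.0   # manual full-text screening
manual_total = manual_ta_hours + manual_ft_hours

# Inferred rate: 420 h / 1,680 articles = 0.25 h per full-text article.
ft_hours_per_article = manual_ft_hours / 1680

# LLM workflow: 2 h + 4 h of automated screening, then 78 articles by hand.
llm_ft_manual = 78 * ft_hours_per_article
llm_total = 2 + 4 + llm_ft_manual

rayyan_a_saved = time_saved(manual_ta_hours, 54.7)
rayyan_b_saved = time_saved(manual_ta_hours, 81.3)
llm_saved = time_saved(manual_total, llm_total)
llm_ft_aer = exclusion_rate(3280, 78)

print(f"Rayyan A saved: {rayyan_a_saved[0]:.1f} h ({rayyan_a_saved[1]:.1f}%)")
print(f"Rayyan B saved: {rayyan_b_saved[0]:.1f} h ({rayyan_b_saved[1]:.1f}%)")
print(f"LLM full-text AER: {llm_ft_aer:.1f}%")
print(f"LLM total saved: {llm_saved[0]:.1f} h ({llm_saved[1]:.1f}%)")
```

Under these assumptions the script reproduces the time-saved entries for both Rayyan thresholds, the 97.6% full-text AER and the 538.9 h (95.5%) overall saving reported for the LLM method.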

ISSN: 1471-2288