Assessing the representativeness of large medical data using population stability index
BMC Medical Research Methodology volume 25, Article number: 44 (2025)
Abstract
Background
Understanding sample representativeness is key to interpreting findings from epidemiological research and applying these findings to broader populations. Although techniques for assessing sample representativeness are available, they rely on access to raw data describing the population of interest, which are often not readily available, and they may not be suitable for comparing large datasets. In reality, population-based data are often only available in an aggregated format. In this study, we aimed to examine the capability of the population stability index (PSI), a popular metric for assessing data drift in artificial intelligence studies, to detect sample differences using population-based data.
Method
We obtained United States cancer statistics from the National Cancer Institute's Surveillance, Epidemiology, and End Results (SEER) database. We queried the SEER 17-registry research database to obtain cancer count data by age, sex, and cancer site group from the rate sessions of the SEER*Stat incidence database for 2000 and 2015–2020. We then calculated PSI scores to estimate the yearly data distribution shift from 2015 to 2020 for each variable. We compared the PSI results to Chi-Square tests and Cramér's V for the same comparisons.
Results
PSI scores comparing age, sex, and cancer site distributions between years ranged widely, from less than 0.01 to 2.96. In line with our expectations, PSI indicated moderate to substantial differences in cancer population characteristics between 2000 and all other included years. Chi-Square tests were significant for most comparisons despite small effect sizes (Cramér's V 0.01–0.09), suggesting that significance was driven by our large sample rather than by meaningful differences.
Conclusions
Population stability index can be used to examine sample differences in healthcare studies where only binned data are available or where large datasets may reduce the reliability of other metrics. Inclusion of PSI in epidemiological research will give greater confidence that results are representative of the general population.
Background
In descriptive epidemiological studies, the representativeness of study samples is a cornerstone of generalizability and of the application of findings to wider populations [1, 2]. It is well established in the literature that poor sample representativeness leads to biased associations and/or suboptimal policy decision-making [3, 4]. Although some researchers have argued that sample representativeness is not essential for studies of causal relationships, assessing representativeness remains necessary to ensure that the resulting inferences are applied appropriately [5].
As data volume and the availability of population-based, real-world data increase, classic approaches to evaluating sample representativeness suffer from an overpowering issue: they detect subtle, clinically meaningless differences between samples [6]. Epidemiologists have developed representativeness assessment metrics that provide better estimates of differences in large samples [7,8,9,10]. An example is representativity indicators (R-indicators), which were first developed to estimate differences between responders and non-responders in survey studies [8] and were later applied to other studies for sample representativeness assessment [6]. R-indicators estimate overall sample representativeness from the standard deviation of sample propensities [7]. However, R-indicators require raw data, which is a challenge when comparing study samples to population-based data, as most population-based data, such as the National Cancer Institute's Surveillance, Epidemiology, and End Results (SEER) data, are only available in an aggregated format without additional access permission.
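For orientation, a common formulation of the R-indicator described in [7, 8] is \(R(\widehat{\rho })=1-2S(\widehat{\rho })\), where \(S(\widehat{\rho })\) is the standard deviation of the estimated response (or sample-inclusion) propensities \({\widehat{\rho }}_{i}\) across units and values close to 1 indicate a representative sample; the unit-level propensities are precisely the raw-data requirement noted above.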
Despite the use of different terminology, the importance of sample representativeness is also highlighted by artificial intelligence (AI) and machine learning (ML) studies leveraging high-dimensional, huge-volume data [11, 12]. AI and ML fields use the term “data drift” or “concept drift” to describe the existence of differences in variable distributions between the sample used to train a model and the sample fed to the model for prediction. Therefore, concept drift and lack of sample representativeness are conceptually the same. As model performance can significantly decrease when feature drift happens [12], most, if not all, ML solutions highlight the need to detect drift after model deployment and offer various ways for automatic concept drift detection [13].
The population stability index (PSI) is a distribution distance-based statistic for measuring sample similarity [14, 15]. PSI measures the distribution difference in each class of a variable between samples and provides an overall score for the variable by summing the scores of the classes. As such, PSI accepts only categorical variables, and numeric variables need to be binned before PSI can be used. PSI scores range from 0 upward with no fixed upper bound, with larger values representing greater differences in the variable between samples. A general rule adopted in practice to interpret a PSI result is: PSI < 0.1 represents no difference, PSI >= 0.25 indicates a significant difference, and any score between the two represents a moderate difference [14].
The population stability index is widely used in the AI and ML fields to determine whether a predictive model needs refinement because of data changes over time [14]. However, discussion of using PSI in healthcare research is limited. It is unclear whether PSI can be an alternative to established representativeness metrics when raw reference sample data are unavailable. The purpose of this study was to examine the capability of PSI to detect differences in population-based samples. Specifically, we applied PSI to assess distribution changes in age, sex, and cancer type in the U.S. cancer population over time using SEER data.
Methods
For this study, we extracted sex, age, and cancer type data on the U.S. cancer population from the Surveillance, Epidemiology, and End Results (SEER) database [16]. We calculated PSI for each variable to compare the populations between all possible year-pairs from 2015 to 2020. We also extracted and compared data from the year 2000 to all other years to evaluate whether PSI could capture differences in data distributions that we hypothesized were likely to have occurred over a 15–20 year timeframe. We examined the PSI results by comparing them to the results of Chi-Square tests. This research was deemed exempt from ethical review because it used publicly available anonymous data without human subject involvement.
Data
We obtained U.S. cancer population statistics from the SEER database for this study. The SEER database, supported and maintained by the National Cancer Institute, has collected comprehensive, population-based U.S. cancer incidence and survival data, alongside cancer patient demographics, tumor information, diagnosis, and treatment data, since 1973. The database is updated yearly and is used to support oncology research and inform policy decision-making throughout the USA [17]. We downloaded aggregated data using the SEER*Stat software (version 8.4.3).
We queried the SEER 17-registry research database submitted in November 2022. We obtained cancer count data by age, sex, and cancer site group from the rate sessions of the SEER*Stat incidence database. For cancer site groups, we extracted incidence rate and patient count data for lung, breast, colorectal, genitourinary, and melanoma cancers. We obtained data for 2000 and 2015–2020.
Population Stability Index (PSI)
We calculated PSI for sex, age group, and cancer type using the equation \(PSI={\sum }_{i=1}^{k}\left({O}_{i}-{E}_{i}\right)\times \ln\left(\frac{{O}_{i}}{{E}_{i}}\right)\), where k is the total number of categories of the variable of interest, \({O}_{i}\) is the proportion of patients in category i in the scoring sample, and \({E}_{i}\) is the proportion of patients in category i in the reference sample [15]. We used this equation to calculate PSI scores estimating the yearly data distribution shift from 2015 to 2020 for each variable. For instance, to estimate the change in the age-group distribution, we used the 2015 age-group data as the reference sample and the 2016 data as the scoring sample. As no definitive gold standard for sample representativeness exists, we used PSI to compare the distribution differences in the selected variables between years, with the earlier year in each comparison serving as the reference sample, to depict the relative sample similarity for each year-pair.
We computed PSI scores manually to ensure the calculation was compatible with the aggregated data we obtained from the SEER database. We interpreted PSI results using the cut-off points of 0.1 and 0.25 that are widely used in the informatics literature: PSI < 0.1 represents no distribution difference between samples, 0.1 <= PSI < 0.25 a moderate difference, and PSI >= 0.25 a significant difference [18, 19].
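As a minimal sketch of this manual calculation, the following R code computes a PSI from aggregated age-group counts for a reference year and a scoring year and applies the 0.1/0.25 cut-offs; the counts and the `psi` helper are hypothetical illustrations, not figures from this study.

```r
# Minimal PSI sketch from aggregated category counts (hypothetical numbers,
# not SEER data). Every category must have a non-zero count in both samples,
# otherwise log(O/E) is undefined.
psi <- function(ref_counts, score_counts) {
  e <- ref_counts / sum(ref_counts)        # E_i: reference-sample proportions
  o <- score_counts / sum(score_counts)    # O_i: scoring-sample proportions
  sum((o - e) * log(o / e))                # summed over the k categories
}

ref_2015   <- c(`<50` = 3200, `50-64` = 5100, `65-79` = 6900, `80+` = 2300)
score_2016 <- c(`<50` = 3000, `50-64` = 5200, `65-79` = 7600, `80+` = 2500)

psi_value <- psi(ref_2015, score_2016)

# Interpret with the 0.1 / 0.25 cut-offs used in this study
cut(psi_value, breaks = c(-Inf, 0.1, 0.25, Inf), right = FALSE,
    labels = c("no difference", "moderate difference", "significant difference"))
```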
We tested PSI in a scenario where differences in the age, sex, and cancer group composition of the U.S. cancer population between years are expected and well-studied, ensuring that there were real sample differences for PSI to detect. We expected data distribution changes between consecutive years to be far less notable and of little practical concern. Therefore, we anticipated that PSI would not flag action-requiring data changes in any consecutive-year comparison, whereas Chi-Square tests might still flag differences because of their large statistical power. To demonstrate the capability of PSI in detecting data changes, we compared data from each year to data from 2000 for each variable, under the assumption that the composition of the U.S. cancer population differs substantially between 2000 and recent years.
Analyses
We conducted all analyses using the R statistical software, version 4.2.1 [20]. To demonstrate the advantage of PSI in detecting distribution differences between large samples, we compared the PSI results to the results of Chi-Square tests. We calculated Chi-Square tests for the same comparisons used to compute PSI scores and adjusted the p-values with the Bonferroni approach to account for multiple comparisons. For comparisons showing significance in the Chi-Square test, we used Cramér's V to estimate the size of the difference between the samples [21]. In this study, we considered Cramér's V <= 0.2 a small effect size, > 0.2 and <= 0.6 a moderate effect size, and > 0.6 a large effect size [22].
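A minimal R sketch of these comparison statistics follows, assuming hypothetical aggregated counts for two years arranged as the columns of a contingency table; the number of comparisons used in the Bonferroni correction is also illustrative.

```r
# Chi-Square test, Bonferroni-adjusted p-value, and Cramér's V from aggregated
# counts (hypothetical numbers, not SEER data).
tab <- cbind(`2015` = c(3200, 5100, 6900, 2300),
             `2016` = c(3000, 5200, 7600, 2500))

test <- chisq.test(tab)

n_comparisons <- 21                             # illustrative number of year-pairs tested
p_adj <- min(test$p.value * n_comparisons, 1)   # Bonferroni adjustment

# Cramér's V = sqrt(chi-square / (n * (min(rows, cols) - 1)))
cramers_v <- sqrt(unname(test$statistic) / (sum(tab) * (min(dim(tab)) - 1)))
```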
Results
We present a yearly summary of U.S. cancer population counts and percentages by age, sex, and cancer site group in Table 1.
PSI scores comparing age, sex, and cancer site distributions between years ranged widely, from less than 0.01 to 2.96 (Fig. 1). The scores indicate moderate to significant differences in cancer population characteristics between 2000 and all other included years. PSI scores were less likely to reach the moderate or significant difference thresholds when the reference and scoring years were closer together.
The largest PSI was 2.96, for the age group comparison between 2000 and 2016. Inspection of the components of this PSI score reveals that there were notably more individuals with cancer in the 60–64, 65–69, and 75–79 year age groups in 2016 (Table 2). The PSI calculations for all comparisons are included in the Online Appendix.
We included the Chi-Square test and effect size results in Fig. 2. The Chi-Square tests showed significance for most comparisons. On the other hand, Cramér's V scores for the comparisons with significant Chi-Square scores revealed that the effect sizes were all small, ranging from less than 0.01 to 0.09.
Discussion
Quantitatively assessing differences between research samples provides a means to describe sample representativeness accurately in observational studies and allows proper evaluation and informed use of scientific evidence. Many retrospective cohort studies in healthcare leverage electronic health record (EHR) data and discover knowledge from massive datasets with much larger sample sizes than before. However, traditional tools for examining sample differences, such as Pearson's Chi-Square test and Student's t-test, become overpowered once the sample size grows beyond roughly a thousand people, flagging subtle differences that may not be clinically meaningful [6, 10]. In this study, we examined the capacity of the population stability index (PSI) to detect sample differences and compared the PSI results to Chi-Square test results. Our results suggest that PSI can detect differences in the distribution of given variables between two large samples and estimate those differences without being affected by the overpowering issue.
Our PSI results suggest that the U.S. cancer population from 2015 onward differs substantially from the population in 2000 in terms of sex, age, and cancer site groups, while the differences between any two consecutive years are ignorable, aligning with previous epidemiological surveillance reports [23]. In contrast, the traditional approach showed significance for most comparisons even after a conservative p-value adjustment, while the accompanying Cramér's V scores were uniformly small, indicating ignorable differences. This is problematic for comparisons between 2000 and recent years, such as 2015, as evidence has shown that the populations in these years differ [24]. Our findings suggest that PSI provides a better estimate of sample differences when the sample size is large and the overpowering issue is unavoidable.
The PSI score is the sum of the scores for each category of the variable used to examine sample differences. These breakdown scores provide additional information, enabling identification of the categories that contribute most to the sample differences. The example in the Results section demonstrates that using PSI to examine age group differences between the 2000 and 2016 U.S. cancer populations reveals that the U.S. cancer population is notably older than before. Although the age group comparison between 2000 and 2016 may not itself provide new information, the same analysis can be applied in other scenarios to guide further investigation or to adjust the analysis approach. Example scenarios include assessing cancer-type differences between immunotherapy patients with and without adverse events using population-based data, comparing control and intervention groups in large, multi-institutional clinical trials, and evaluating whether a machine learning model is applicable to a population.
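A brief R sketch of such a breakdown, again with hypothetical counts rather than the figures in Table 2: computing each PSI term separately and sorting shows which categories drive the overall score.

```r
# Per-category PSI contributions (hypothetical counts; the published breakdown
# for the 2000 vs 2016 age-group comparison is in Table 2).
ref   <- c(`<50` = 3200, `50-64` = 5100, `65-79` = 6900, `80+` = 2300)
score <- c(`<50` = 3000, `50-64` = 5200, `65-79` = 7600, `80+` = 2500)

e <- ref / sum(ref)
o <- score / sum(score)
contrib <- (o - e) * log(o / e)     # one non-negative PSI term per category
sort(contrib, decreasing = TRUE)    # largest contributors to the overall PSI appear first
```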
Researchers have argued that p-value-based hypothesis testing should not be used to assess sample differences between large datasets without adjustment [25]. Our rationale for including the Chi-Square test was to emphasize this limitation and to highlight that PSI can be used for sample representativeness estimation in a big data context. We were not able to compare PSI with other sample representativeness metrics designed for large sample comparisons because we lacked access to suitable large raw datasets. However, this limitation highlights a notable advantage of PSI: it can be computed from aggregated data. Most population databases, such as SEER, the International Agency for Research on Cancer (IARC), and other disease-specific registries, are only available in an aggregated format without further permission. Therefore, few research teams can access and leverage raw population data to evaluate the representativeness of their samples with metrics that require raw data, such as R-indicators and standardized mean difference (SMD) approaches [7,8,9]. Our findings suggest that PSI can arguably serve as an alternative to these representativeness metrics developed by epidemiologists when raw population data are unavailable. When raw population data are available, PSI can complement metrics that provide overall representativeness scores, because PSI also provides information about differences in the category distribution of a variable between two samples.
Given the popularity of PSI in the AI and ML industry for monitoring feature drift, PSI has been implemented in many ML toolkits, such as Azure AI, Evidently AI, and Neptune AI. The current implementations of PSI in these packages require raw, continuous data; allowing aggregated data as input would enable more research teams to leverage publicly available population data for sample representativeness assessment. These toolkits also provide other sample similarity metrics, such as Kullback–Leibler (KL) divergence and Jensen-Shannon (JS) distance, which likewise require raw data. Like PSI, these metrics estimate sample similarity from differences in variable distributions [14, 26]. Further research is needed to compare PSI with these other distance-based metrics and correlate their results.
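To make the relationship concrete, the sketch below (hypothetical binned counts) computes KL divergence, JS distance, and PSI from the same proportions; algebraically, PSI equals the symmetrized KL divergence, KL(O||E) + KL(E||O).

```r
# PSI, KL divergence, and Jensen-Shannon distance from the same binned
# proportions (hypothetical counts, not SEER data).
ref   <- c(3200, 5100, 6900, 2300)
score <- c(3000, 5200, 7600, 2500)
e <- ref / sum(ref)
o <- score / sum(score)

kl  <- function(p, q) sum(p * log(p / q))            # Kullback-Leibler divergence (asymmetric)
psi <- sum((o - e) * log(o / e))                     # equals kl(o, e) + kl(e, o)

m <- (o + e) / 2
js_distance <- sqrt(0.5 * kl(o, m) + 0.5 * kl(e, m)) # square root of the JS divergence
```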
It is also essential to discuss the limitations of using PSI to examine sample differences. First, PSI requires the variable of interest to be categorical, so numeric variables must be binned before the score is calculated. Information loss may therefore occur when discretizing numeric variables [27], and the choice of bin size can affect PSI scores [28], much as it affects a histogram of a numeric variable from two samples. Second, although PSI is widely used in industry and researchers have tried to define the statistical properties of the PSI score [28, 29], there is little discussion of the metric and the interpretation of its scores in the literature. Further, PSI cannot detect selection bias if the same selection bias exists in both samples.
The PSI score was designed for univariate comparisons, so multivariate conditions were not considered. It has mostly been used to detect data drift in the AI/ML industry, with the primary goal of detecting notable changes in the distribution of any single variable [28]. It is possible to concatenate multiple binned variables into a single composite variable for each individual in the sample and calculate PSI scores for the concatenated variable. In this way, multiple variables are considered at once, which may provide further information about sample representativeness (a hypothetical sketch is shown below). However, this approach requires access to raw data and therefore could not be conducted in this analysis. Future experiments are needed to explore the use of PSI for multivariate analyses.

Given these limitations, we suggest that PSI is a good alternative for population representativeness evaluation when raw data are not available for other approaches, such as R-indicators, and when the sample size is so large that traditional statistics inevitably flag significant differences with clinically ignorable effect sizes.
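The following R sketch illustrates the concatenation idea mentioned above; it needs row-level records, which were not available in this study, so the patient-level data below are simulated.

```r
# Multivariate PSI via concatenated bins (simulated row-level data; this could
# not be done with the aggregated SEER extracts used in this study).
set.seed(1)
ref_sample <- data.frame(sex = sample(c("F", "M"), 500, replace = TRUE),
                         age = sample(c("<65", "65+"), 500, replace = TRUE))
new_sample <- data.frame(sex = sample(c("F", "M"), 500, replace = TRUE, prob = c(0.6, 0.4)),
                         age = sample(c("<65", "65+"), 500, replace = TRUE))

# One composite category per patient, e.g. "F|65+"
combo_ref <- table(paste(ref_sample$sex, ref_sample$age, sep = "|"))
combo_new <- table(paste(new_sample$sex, new_sample$age, sep = "|"))

e <- combo_ref / sum(combo_ref)
o <- combo_new[names(combo_ref)] / sum(combo_new)   # align category order
multivariate_psi <- sum((o - e) * log(o / e))
```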
This study has limitations. First, because of the use of SEER data, all data accessible to us were aggregated and categorical. We therefore could not apply PSI to a numeric variable and compare the results to Student's t-test or ANOVA, nor could we compare PSI to R-indicators. Further, the dataset we used contains data from over 240,000 patients per year. This sample size may be larger than that of common big data studies, which often range from 1,000 to 15,000 patients. It is unclear whether the inflated-power issues of traditional approaches persist at those sample sizes, which requires further investigation. There is also a need to examine the applicability of PSI for detecting sample differences in datasets of sizes typical of big data healthcare research.
Conclusions
Sample representativeness is a key determinant of the generalizability and applicability of study findings. In this study, we examined the use of PSI to capture differences between large samples and compared its results to those of traditional statistics. Our findings suggest that PSI can be used to examine sample differences in healthcare studies leveraging big data. Further research is needed to compare PSI to other sample representativeness metrics and correlate their results. Implementations that accept aggregated data for PSI calculation would enable research teams to apply the metric to aggregated population-based datasets.
Data availability
All data used in this study are publicly available from the Surveillance Epidemiology and End Results (SEER) Program. We published all our analysis data in this manuscript.
Abbreviations
- AI: Artificial intelligence
- EHR: Electronic health record
- IARC: International Agency for Research on Cancer
- JS: Jensen-Shannon distance
- KL: Kullback–Leibler divergence
- ML: Machine learning
- PSI: Population stability index
- SEER: Surveillance, Epidemiology, and End Results
- SMD: Standardized mean difference
References
Rothman KJ, Gallacher JEJ, Hatch EE. Why representativeness should be avoided. Int J Epidemiol. 2013;42:1012–4.
Penberthy LT, Rivera DR, Lund JL, Bruno MA, Meyer A. An overview of real-world data sources for oncology and considerations for research. CA Cancer J Clin. 2022;72:287–300.
Ebrahim S, Smith GD. Commentary: Should we always deliberately be non-representative? Int J Epidemiol. 2013;42:1016–7.
Nathan H, Pawlik TM. Limitations of claims and registry data in surgical oncology research. Ann Surg Oncol. 2008;15:415–23.
Nohr EA, Olsen J. Commentary: Epidemiologists have debated representativeness for more than 40 years—has the time come to move on? Int J Epidemiol. 2013;42:1016–7.
Kuijper SC, Besseling J, Klausch T, Slingerland M, van der Zijden CJ, Kouwenhoven EA, et al. Assessing real-world representativeness of prospective registry cohorts in oncology: insights from patients with esophagogastric cancer. J Clin Epidemiol. 2023;164:65–75.
Schouten B, Bethlehem J, Beullens K, Kleven Ø, Loosveldt G, Luiten A, et al. Evaluating, Comparing, Monitoring, and Improving Representativeness of Survey Response Through R-Indicators and Partial R-Indicators. Int Stat Rev. 2012;80:382–99.
Schouten B, Bureau Voor De Statistiek C, Cobben F, Bethlehem J. Indicators for the Representativeness of Survey Response. Surv Methodol. 2009;35(1):101–13.
Derksen JWG, Vink GR, Elferink MAG, Roodhart JML, Verkooijen HM, van Grevenstein WMU, et al. The Prospective Dutch Colorectal Cancer (PLCRC) cohort: real-world data facilitating research and clinical care. Sci Rep. 2021;11.
Cleophas TJ. Clinical trials and p-values, beware of the extremes. Clin Chem Lab Med (CCLM). 2004;42(3):300–4.
Lu SC, Swisher CL, Chung C, Jaffray D, Sidey-Gibbons C. On the importance of interpretable machine learning predictions to inform clinical decision making in oncology. Front Oncol. 2023;13.
Muslim Jameel S, Ahmed Hashmani M, Alhussain H, Rehman M, Arif Budiman M. A Critical Review on Adverse Effects of Concept Drift over Machine Learning Classification Models. 2020.
Walsh B. Productionizing AI. Apress; 2023.
Ashok S, Ezhumalai S, Patwa T. Remediating data drifts and re-establishing ML models. In: Procedia Computer Science. Elsevier B.V.; 2022. p. 799–809.
Khademi A, Hopka M, Upadhyay D. Model Monitoring and Robustness of In-Use Machine Learning Models: Quantifying Data Distribution Shifts Using Population Stability Index. 2023.
Surveillance Epidemiology and End Results (SEER) Program. SEER*Stat Database: Incidence - SEER Research Data, 17 Registries, Nov 2022 Sub (2000–2020) - Linked To County Attributes - Time Dependent (1990–2021) Income/Rurality, 1969–2021 Counties. National Cancer Institute, DCCPS, Surveillance Research Program. 2023.
Yu JB, Gross CP, Wilson LD, Smith Benjamin D. NCI SEER Public-Use Data: Applications and Limitations in Oncology Research. Oncology. 2009;23.
Becker A, Becker J. Dataset shift assessment measures in monitoring predictive models. In: Procedia Computer Science. Elsevier B.V.; 2021. p. 3391–402.
Karakoulas G. Empirical Validation of Retail Credit-Scoring Models. RMA J. 2004;87:56–60.
R Core Team. R: A Language and Environment for Statistical Computing. 2021.
Tomczak M, Tomczak E. The need to report effect size estimates revisited. An overview of some recommended measures of effect size. 2014.
Lee DK. Alternatives to P value: Confidence interval and effect size. Korean J Anesthesiol. 2016;69:555–62.
Surveillance Research Program NCI. SEER*Explorer: An interactive website for SEER cancer statistics. 2023. https://seer.cancer.gov/statistics-network/explorer/. Accessed 25 Mar 2024.
Kehm RD, Yang W, Tehranifar P, Terry MB. 40 Years of Change in Age- and Stage-Specific Cancer Incidence Rates in US Women and Men. JNCI Cancer Spectr. 2019;3.
Lin M, Lucas HC, Shmueli G. Too big to fail: Large samples and the p-value problem. Inf Syst Res. 2013;24:906–17.
Whitney HM, Baughan N, Myers KJ, Drukker K, Gichoya J, Bower B, et al. Longitudinal assessment of demographic representativeness in the Medical Imaging and Data Resource Center open data commons. J Med Imag. 2023;10.
Sang Y, Qi H, Li K, Jin Y, Yan D, Gao S. An effective discretization method for disposing high-dimensional data. Inf Sci (N Y). 2014;270:73–91.
Taplin R, Hunt C. The population accuracy index: A new measure of population stability for model monitoring. Risks. 2019;7.
Yurdakul B, Naranjo J. Statistical properties of the population stability index. The Journal of Risk Model Validation. 2020. https://doi.org/10.21314/JRMV.2020.227.
Acknowledgements
Not applicable.
Clinical trial number
Not applicable.
Funding
The authors received no financial support for this study.
Author information
Authors and Affiliations
Contributions
S.C.L: Conceptualization, Methodology, Formal analysis, Software, Visualization, Writing – original draft. W.S.: Data curation, Writing – review & editing. A.P.: Formal analysis, Software, Writing – review and editing. C.G.: Conceptualization, Methodology, Supervision, Resources, Project administration, Writing – review and editing.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
This research was deemed exempt from ethical review because of the use of publicly-available anonymous data without human subject involvement.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Lu, SC., Song, W., Pfob, A. et al. Assessing the representativeness of large medical data using population stability index. BMC Med Res Methodol 25, 44 (2025). https://doi.org/10.1186/s12874-025-02474-9
DOI: https://doi.org/10.1186/s12874-025-02474-9