Detecting cardiovascular diseases using unsupervised machine learning clustering based on electronic medical records

Hu, Ying; Yan, Hai; Liu, Ming; Gao, Jing; Xie, Lianhong; Zhang, Chunyu; Wei, Lili; Ding, Yinging; Jiang, Hong

doi:10.1186/s12874-024-02422-z

Research
Open access
Published: 19 December 2024

Ying Hu ORCID: orcid.org/0009-0008-4587-3311^1,2^na1,
Hai Yan³^na1,
Ming Liu^2,6,
Jing Gao⁴,
Lianhong Xie⁴,
Chunyu Zhang¹,
Lili Wei⁴,
Yinging Ding⁵ &
…
Hong Jiang ORCID: orcid.org/0000-0002-7260-9646^1,2

BMC Medical Research Methodology volume 24, Article number: 309 (2024)

Abstract

Background

Machine learning (ML) models trained on electronic medical records (EMR) have potential in cardiovascular disease (CVD) risk prediction: by integrating a range of patient medical data, they can facilitate timely diagnosis and classification of CVDs. We tested the hypothesis that an unsupervised ML approach using EMR data could be used to develop a new model for detecting prevalent CVD in clinical settings.

Methods

We included 155,894 patients (aged ≥ 18 years) discharged between January 2014 and July 2022 from Xuhui Hospital, Shanghai, China, including 64,916 CVD cases and 90,979 non-CVD cases. K-means clustering was used to generate the clustering models, with k = 2, 4, and 8 as the predetermined numbers of clusters. Bayesian theorem was used to estimate the models’ predictive accuracy.

Results

The overall predictive accuracy of the 2-, 4-, and 8-classification clustering models in the training set was 0.856, 0.8634, and 0.8506, respectively. Similarly, the predictive accuracy of the 2-, 4-, and 8-classification clustering models in the testing set was 0.8598, 0.8659, and 0.8525, respectively. After reducing from 19 dimensions to 2 dimensions by principal component analysis, significant separation was observed for CVD cases and non-CVD cases in both training and testing sets.

Conclusion

Our findings indicate that EMR data can support the development of a robust model for CVD detection through an unsupervised ML approach. Further investigation using a longitudinal design is needed to refine the model for application in clinical settings.

Introduction

Cardiovascular diseases (CVD) are the leading cause of death globally, accounting for approximately 18 million deaths annually [1], and this number is expected to rise to 23.6 million by 2030. In China, two out of every five deaths are attributed to CVD, which affects an estimated 330 million people [2]. Traditional statistics-based tools for predicting future CVD [3], such as the Framingham Risk Score [4], Systematic Coronary Risk Evaluation [5] and QRISK scores [6, 7], are commonly used in primary prevention settings. However, these methods rely on a common set of risk factors, their overall accuracy remains unsatisfactory, and their application to early detection is limited [3, 8]. Clinicians diagnose CVD by evaluating patients’ clinical symptoms and signs and by using auxiliary diagnostic methods, such as blood tests and imaging (non-invasive and invasive) examinations. These procedures are expensive and time-consuming and often require specialized expertise. Asymptomatic individuals may be overlooked during routine physical examinations or hospitalization for unrelated diseases. An automated CVD detection tool that helps identify high-risk individuals quickly and accurately is needed.

Machine learning (ML), a technique used to realize artificial intelligence, broadens the scope of traditional statistics by identifying nonlinear relationships and higher-order interactions among numerous variables. It can be categorized into supervised and unsupervised learning [8]. Supervised ML builds models by associating a set of features with known outcomes (labeled data) in order to predict outcomes for new data; examples include naive Bayes, random forest, Logistic regression, support vector machines (SVM), K-Nearest Neighbor (KNN), artificial neural networks [9] and genetic algorithms [10]. Unsupervised ML, on the other hand, focuses on identifying the underlying patterns in unlabeled data and includes clustering, association and dimensionality reduction. Clustering analysis identifies distinct subgroups within large and complex data. K-means clustering is an unsupervised approach that groups objects into K clusters based on their features, such that each data point is closer to the centroid of its assigned cluster than to the centroid of any other cluster [11]. Dimension reduction maps high-dimensional data to a low-dimensional representation while preserving the inherent variation and structure of the original full-dimensional data. A recent study [12] employed an unsupervised ML approach, specifically multiple kernel learning-based dimension reduction and K-means clustering, to combine echocardiographic data and clinical parameters to phenotype heart failure patients.
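The centroid property of K-means described above can be illustrated with a minimal sketch of the standard Lloyd iteration. This is an illustration only and not the implementation used in this study; the function names, the optional `init` argument and the toy two-dimensional points are assumptions introduced for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)),
               key=lambda j: squared_distance(point, centroids[j]))


def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Plain Lloyd iteration; `init` (hypothetical) fixes the starting centroids."""
    rng = random.Random(seed)
    centroids = [list(c) for c in (init if init is not None
                                   else rng.sample(points, k))]
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest_centroid(p, centroids) for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members)
                                for col in zip(*members)]
    return centroids, labels


# Toy data (assumed): two well-separated groups of three points each.
points = [(0.0, 0.0), (0.3, 0.1), (0.1, 0.4),
          (8.0, 9.0), (8.3, 9.2), (7.9, 9.5)]
centroids, labels = kmeans(points, k=2, init=[points[0], points[3]])
```

At convergence every point carries the label of its nearest centroid, which is the property stated in the text.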

ML has been increasingly utilized to improve the accuracy and speed of CVD prediction and diagnosis [13]. Nevertheless, the majority of ML-based prediction models are built on community-based populations that share similar features [14,15,16,17], and the prevalence and severity of CVD may also affect the models’ accuracy, limiting their clinical application [8]. Importantly, electronic medical records (EMR), a digital version of paper records, were initially introduced in hospitals to improve healthcare efficiency and promote patient care. EMR contain a wide variety of data, such as demographics, diagnoses, medications, and laboratory and imaging tests. With the growing availability of rich, large-sample data recorded in EMR, there is increasing interest in translating these data into clinical practice through ongoing advances in machine learning and AI [18]. EMR-trained ML models have potential in CVD risk prediction: by integrating a range of patient medical data, they can facilitate timely diagnosis and classification of CVDs [19]. Nevertheless, few studies have used EMR data to construct CVD prediction models [20, 21].

Thus, using EMR data, we employed K-means clustering and Bayesian theorem to construct a model that can accurately identify patients with a high probability of having CVD in clinical settings. K-means clustering was used to generate the clustering models, and Bayesian theorem was used to estimate their predictive accuracy. Our work provides an example of applying EMR-based ML to develop a prediction model for assessing the likelihood of having CVD.

Methods

Data source

The study obtained data from the electronic medical record (EMR) system and clinical laboratory information system (LIS) of Xuhui Central Hospital, an affiliate of Fudan University in China. The data consisted of diagnostic information and laboratory test results for adult patients who were discharged from January 2014 to July 2022. This study was performed in accordance with the guidelines of the Declaration of Helsinki. The study design was approved by the Ethics Committee of Shanghai Xuhui Central Hospital (approval no: 2023033), and the institutional review board waived the requirement to obtain informed consent. The medical record number, gender, age and ICD-10 diagnostic information were extracted from the EMR system using SQL statements. A total of 155,894 patients were included.

The primary outcome of this study was the presence of CVD in each subject. CVD was defined based on the primary symptoms recorded in the International Classification of Diseases, 10th Revision (ICD-10) diagnostic information. These symptoms included “coronary heart disease arrhythmia”, “coronary artery insufficiency”, “coronary heart disease”, “coronary artery slow flow”, “coronary artery bypass surgery status”, “coronary artery stent thrombosis”, “coronary artery stent implantation status”, “coronary artery stenosis”, “coronary artery fistula”, “coronary atherosclerosis”, and “coronary atherosclerotic heart disease” [22]. Patients exhibiting any of these symptoms were categorized as CVD cases (n = 64916) (Table 1), while patients who did not display them were classified as non-CVD cases (n = 90979).

Table 1 Data overview of main CVD diseases

We searched the LIS system for various laboratory test results upon admission, including total cholesterol (TC), triglyceride (TG), high-density lipoprotein (HDL), low-density lipoprotein (LDL), blood glucose, creatine kinase (CK), CK-MB isoenzyme (CK-MB), troponin (Tn), myoglobin (Mb), angiotensin (I/II), aldosterone, hemorheology, brain natriuretic peptide (BNP), glycosylated hemoglobin (GHB), homocysteine (HCY), tumor necrosis factor (TNF), interleukin, C-reactive protein (CRP), D-dimer, fibrinogen, creatinine, urea nitrogen, uric acid, glomerular filtration rate (GFR), plasma viscosity, erythrocyte aggregation index, hemoglobin, blood sodium, blood potassium, and other relevant test results.

Data preprocessing and variable selection

During data cleaning, the incomplete, incorrect, inaccurate, and irrelevant parts of the 155,894 patients’ data were identified and then replaced, modified, or deleted. Due to the inherent characteristics of the mining process, the vast majority of data attributes used in this method were quantitative, specifically integer or real-valued. Gender was excluded as a variable because of its binary nature. Three medical experts with experience in the diagnosis of CVD selected the predictor variables (features) based on a comprehensive review of the relevant literature. In addition, features with missing data in ≥ 20% of patients were removed, and features with missing data in < 20% of patients were subjected to multiple imputation. The removed features included angiotensin, aldosterone, brain natriuretic peptide, homocysteine, free triiodothyronine, free tetraiodothyronine, and thyroid stimulating hormone.
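The 20% missingness rule can be sketched as follows. This is a minimal illustration under stated assumptions: missing values are represented as `None`, the column names and values are invented toy data, and the multiple imputation step applied to the retained features is not shown.

```python
def missing_fraction(values):
    """Share of entries that are missing (represented here as None)."""
    return sum(v is None for v in values) / len(values)


def drop_sparse_features(columns, threshold=0.20):
    """Split features into kept (< threshold missing) and dropped (>= threshold)."""
    kept, dropped = {}, []
    for name, values in columns.items():
        if missing_fraction(values) >= threshold:
            dropped.append(name)
        else:
            kept[name] = values
    return kept, dropped


# Toy columns (assumed values): TC has 10% missing, BNP has 20% missing.
columns = {
    "TC": [5.1, None, 4.8, 5.0, 4.7, 5.2, 4.9, 5.3, 5.0, 4.6],
    "BNP": [None, 120.0, 95.0, None, 80.0, 110.0, 70.0, 130.0, 90.0, 100.0],
}
kept, dropped = drop_sparse_features(columns)
```

With these toy values the feature at exactly 20% missingness is removed, matching the "≥ 20%" wording of the rule.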

The preliminary list focused on 15 variables that are clearly implicated in the pathogenesis of CVD [23], including blood lipids (TC, TG, HDL, LDL), cardiac markers (CK, CK-MB, Mb, Tn), renal function (creatinine, urea nitrogen, uric acid, GFR) and blood glucose markers (glucose, GHB). Four additional variables that have previously been associated with CVD but lack robust clinical evidence were also included in this study. These variables included coagulation markers such as D-dimer and fibrinogen, as well as other biomarkers including hemoglobin, blood sodium, and blood potassium. Finally, 19 features were selected as input for the ML algorithm. Table 2 shows the description of the selected variables. Z-score normalization was used to standardize the numerical variables.
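Z-score normalization rescales each feature to zero mean and unit standard deviation. The sketch below is illustrative only; the use of the population standard deviation and the sample values are assumptions, since the text does not specify them.

```python
from statistics import mean, pstdev


def zscore(values):
    """Standardize a feature: subtract the mean, divide by the population SD."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:  # constant feature: no spread to scale by
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Toy measurements (assumed values) for a single feature.
standardized = zscore([1.0, 2.0, 3.0, 4.0, 5.0])
```

Standardizing matters for K-means because the Euclidean distance would otherwise be dominated by features measured on larger numerical scales.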

Table 2 Dataset features description

Statistical machine learning analysis

The entire dataset was randomly split into two non-overlapping sets: a training set (90%, n = 140304) and a testing set (10%, n = 15590). We first ran our unsupervised ML algorithm on the training set to generate the prediction model (i.e., create clusters), and then tested the models using the features of the testing set to assess their ability to accurately infer the class labels of the patients in the testing set. The predictive accuracy of the clusters and models was then estimated using the Bayesian theorem. Principal component analysis (PCA) was additionally employed to reduce the 19 features to 2 dimensions in both the training and testing sets, which allowed the samples to be visualized as projections onto the first two components [24]. Because the principal components are the continuous solutions to the discrete cluster membership indicators for K-means clustering, PCA can serve as a tool to evaluate the 2-classification clustering model from a different angle [25]. The modeling process is depicted in Fig. 1.
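A random 90/10 split of patient indices can be sketched as below. The function name, the seed and the choice to round the testing-set size up are assumptions made for illustration; rounding up is chosen only because it reproduces the set sizes reported above.

```python
import math
import random


def split_indices(n_samples, test_fraction=0.10, seed=0):
    """Shuffle indices and split them into non-overlapping train and test lists."""
    n_test = math.ceil(n_samples * test_fraction)  # assumption: round up
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(155894)
```

With 155,894 patients this yields 140,304 training and 15,590 testing indices.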

K-means clustering and Bayesian theorem

K-means clustering was used to partition the dataset into a fixed number (K) of distinct clusters. We selected k = 2, 4, and 8 as the predetermined numbers of clusters and iterated 1 million times to guarantee the stability of the results. The input of the model was a normalized vector of 19 parameters, and the output was whether CVD was present. We used K-means clustering to group patients and then treated each cluster as indicating either the presence or the absence of CVD. Ideally, in each of the 2-, 4-, and 8-classification models, patients with CVD would be grouped in some clusters and patients without CVD in the others. In reality, this ideal state cannot be achieved. Our data only covered the major symptoms of patients diagnosed at a given time in the hospital and may reflect only their condition at that time. Furthermore, not all of the 19 features are strongly related to CVD pathogenesis. In practice, the more uneven the ratio of CVD to non-CVD cases within a cluster, the better that cluster can determine whether CVD is present, and the more such clusters a clustering model contains, the better the model as a whole can determine whether CVD is present.
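One plausible reading of how clusters are turned into a CVD / non-CVD output is sketched below: each cluster takes the majority status of its training patients, and a new patient inherits the status of the nearest centroid. The function names, centroids and labels are hypothetical, and the study's actual procedure may differ in detail.

```python
from collections import Counter


def majority_status_per_cluster(cluster_ids, has_cvd):
    """Map each cluster id to the more frequent CVD status among its members."""
    counts = {}
    for cid, status in zip(cluster_ids, has_cvd):
        counts.setdefault(cid, Counter())[status] += 1
    return {cid: c.most_common(1)[0][0] for cid, c in counts.items()}


def predict_status(point, centroids, cluster_status):
    """Assign the point to its nearest centroid and return that cluster's status."""
    nearest = min(
        range(len(centroids)),
        key=lambda j: sum((x - y) ** 2 for x, y in zip(point, centroids[j])),
    )
    return cluster_status[nearest]


# Toy training assignments (assumed): cluster 0 is mostly CVD, cluster 1 mostly not.
cluster_ids = [0, 0, 0, 1, 1, 1]
has_cvd = [True, True, False, False, False, True]
cluster_status = majority_status_per_cluster(cluster_ids, has_cvd)

centroids = [(0.0, 0.0), (5.0, 5.0)]  # hypothetical centroids in 2 dimensions
new_patient_status = predict_status((0.2, 0.1), centroids, cluster_status)
```

The more uneven the CVD ratio inside a cluster, the fewer patients this majority rule misclassifies, which is the intuition given in the text.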

The aim of the model was to classify patients accurately, ideally ascertaining their disease status (i.e., CVD or non-CVD) with 100% probability. Therefore, after calculating the proportion of CVD in each cluster of the clustering model, the prediction accuracy of each single cluster in the three clustering models was calculated using the inverse probability principle of the Bayesian theorem, and the overall prediction accuracy of each of the three clustering models was then calculated using the Bayesian theorem. The predictive accuracy of the clustering model was determined by dividing the sum of the size of the bigger group in each cluster by the total number of samples. The specific method for calculating the accuracy using the Bayesian theorem was as follows:

The predictive probability for each cluster by the Bayesian theorem was:

$$P_{n}=\frac{X_{n}^{max}}{X_{n}^{all}}$$

$X_{n}^{max}$ refers to the size of the bigger group (CVD cases or non-CVD cases) in cluster $n$, and $X_{n}^{all}$ refers to the total number of patients in cluster $n$.

The overall prediction accuracy (model performance) of our model by Bayesian theorem was:

$$P_{all}=\sum_{n=1}^{K}\frac{X_{n}^{max}}{X_{n}^{all}}\times\frac{X_{n}^{all}}{X_{all}^{all}}=\frac{\sum_{n=1}^{K}X_{n}^{max}}{X_{all}^{all}}$$

$K$ refers to the number of clusters, and $X_{all}^{all}$ refers to the number of all subjects in the sample.

This shows that the predictive accuracy of the clustering model is determined by dividing the sum of the size of the bigger group in each cluster by the total number of samples.
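The two formulas above can be checked numerically with a short sketch. The cluster counts below are invented for illustration and are not taken from Tables 3 or 4; each cluster is represented as a pair (number of CVD cases, number of non-CVD cases).

```python
def cluster_accuracy(n_cvd, n_non_cvd):
    """P_n: size of the bigger group divided by the cluster size."""
    return max(n_cvd, n_non_cvd) / (n_cvd + n_non_cvd)


def overall_accuracy(clusters):
    """P_all: sum of the bigger groups divided by the total number of subjects."""
    total = sum(n_cvd + n_non for n_cvd, n_non in clusters)
    return sum(max(n_cvd, n_non) for n_cvd, n_non in clusters) / total


def weighted_average_accuracy(clusters):
    """Same quantity written as the cluster-size-weighted average of P_n."""
    total = sum(n_cvd + n_non for n_cvd, n_non in clusters)
    return sum(
        cluster_accuracy(n_cvd, n_non) * (n_cvd + n_non) / total
        for n_cvd, n_non in clusters
    )


# Hypothetical counts: (CVD cases, non-CVD cases) for two clusters.
clusters = [(850, 150), (100, 900)]
p_all = overall_accuracy(clusters)
p_weighted = weighted_average_accuracy(clusters)
```

For these counts both expressions give (850 + 900) / 2000 = 0.875, illustrating that the overall accuracy is the size-weighted average of the single-cluster accuracies.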

Model performance

The predictive probability of detecting the existence of CVD for a single cluster was calculated as the number of patients with prevalent CVD in that cluster divided by the total number of patients in that cluster. The predictive probability of detecting prevalent CVD in each cluster was obtained for the k = 2, 4, and 8 classifications, respectively. After calculating the proportions of CVD and non-CVD cases in each cluster for the k = 2, 4, and 8 classifications, the predictive accuracy of each cluster was calculated by Bayesian theorem. We then calculated the predictive accuracy (performance) of the overall model, which is equivalent to the cluster-size-weighted average of the predictive accuracies of the single clusters, as shown above.

Comparisons of K-means clustering with other ML algorithms

We conducted a comparative experiment with three traditional ML methods to evaluate the performance of our K-means clustering approach. The models included in this comparison were SVM, K-Nearest Neighbor (KNN), and Logistic regression. After establishing the models, we calculated area under the curve (AUC) of the models separately. Finally, we plotted the Receiver Operating Characteristic (ROC) curve. AUC served as the main indicator of model performance.
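For reference, the AUC can be computed directly from its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below is a generic illustration with invented labels and scores, not the evaluation code used in this study.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:  # ties count as half a correctly ranked pair
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy labels (1 = CVD) and model scores, assumed for illustration.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In this toy case three of the four positive-negative pairs are ordered correctly, giving an AUC of 0.75. The pairwise loop is quadratic in the number of patients, so a sort-based implementation would be preferable at the scale of this dataset.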

Results

Characteristics of study subjects

Of the 155,894 patients included, 41.64% had already experienced a CVD outcome (during or before baseline). The remaining patients (90,979) did not experience any CVD outcome. Coronary atherosclerotic heart disease was the most common CVD (61,951 patients), followed by atherosclerosis (1,226 patients) and arrhythmia type of coronary heart disease (892 patients). Table 1 shows the number of patients according to different CVD symptoms.

Predictive probability of each cluster in 2-, 4-, and 8-classification clustering models

K-means clustering was used to classify the patients in the training set, with 2, 4, and 8 chosen as the predetermined numbers of clusters. As shown in Fig. 2 and Table 3, in the 2-classification clustering model, the predictive probabilities of detecting prevalent CVD in clusters 1 and 2 were 0.8473 and 0.1384, respectively. In the 4-classification clustering model, the predictive probabilities of detecting prevalent CVD in clusters 1, 2, 3 and 4 were 0.4418, 0.1288, 0.8899 and 0, respectively. In the 8-classification clustering model, the predictive probabilities of detecting prevalent CVD in clusters 1, 2, 3, 4, 5, 6, 7 and 8 were 0.0938, 0.6252, 0.8958, 0.4400, 0.3333, 0, 0.4271, and 0.2056, respectively. For each clustering model, the cluster with the highest probability was the one most likely to have prevalent CVD.
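As a rough consistency check, the two cluster proportions of the 2-classification model can be combined with the overall training accuracy of 0.856 reported in this article to back out the implied share of patients in cluster 1. This is a back-of-envelope sketch, not a result of the study: it assumes the overall accuracy is the size-weighted average of max(p, 1 − p) across clusters, and the outputs are sensitive to rounding of the reported figures.

```python
# Reported CVD proportions in clusters 1 and 2 (2-classification, training set).
p1, p2 = 0.8473, 0.1384
overall_accuracy = 0.856  # reported overall training accuracy

# Single-cluster accuracy is the share of the bigger group in each cluster.
acc1 = max(p1, 1 - p1)
acc2 = max(p2, 1 - p2)

# Solve share1 * acc1 + (1 - share1) * acc2 = overall_accuracy for share1.
share1 = (acc2 - overall_accuracy) / (acc2 - acc1)

# Implied overall CVD prevalence under the same weights.
implied_prevalence = share1 * p1 + (1 - share1) * p2
```

Under these assumptions cluster 1 would hold roughly 39% of the training patients, and the implied CVD prevalence of roughly 0.42 is close to the 41.64% reported for the whole cohort.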

Table 3 Distribution of CVD and non-CVD cases in each cluster with different predetermined number of clusters in the training set

The clustering models were further evaluated in the testing set. As shown in Fig. 3 and Table 4, in the 2-classification clustering model, the predictive probabilities of detecting prevalent CVD in clusters 1 and 2 were 0.8518 and 0.1351, respectively. In the 4-classification clustering model, the predictive probabilities of detecting prevalent CVD in clusters 1, 2, 3 and 4 were 0.4480, 0.1261, 0.8906, and 0, respectively. In the 8-classification clustering model, the predictive probabilities of detecting prevalent CVD in clusters 1, 2, 3, 4, 5, 6, 7 and 8 were 0.0916, 0.6287, 0.8943, 1, 1, 0, 0.4065 and 0.2109, respectively.

Table 4 Distribution of CVD and non-CVD cases in each cluster with different predetermined number of clusters in the testing set

It should be noted that in the 4- and 8-clustering models, two clusters accounting for the majority of the total samples provided the main information needed to determine whether or not CVD was present, whereas other clusters accounting for a relatively small proportion of the overall samples provided minimal information.

Model performance of 2-, 4-, and 8- classification clustering models

Bayesian theorem was used to assess the predictive accuracy of the 2-, 4-, and 8-classification clustering models as the measure of model performance. The overall predictive accuracy of the 2-, 4-, and 8-classification clustering models in the training set was 0.856, 0.8634, and 0.8506, respectively, while the predictive accuracy of the 2-, 4-, and 8-classification clustering models in the testing set was 0.8598, 0.8659, and 0.8525, respectively (Table 5). All values from the training and testing sets were similar and above 0.85, showing that the models performed well in detecting CVD.

Table 5 Comparative model performance in the testing sets

Clustering visualization

Because predictive accuracy did not depend on the number of clusters, as shown above, the 2-classification clustering model is the simplest and was therefore considered optimal. PCA was conducted to reduce 19 dimensions (features) down to two dimensions. PCA plots of the samples projected onto the first two principal components in the training and testing sets are shown in Figs. 4 and 5, respectively. Significant separation was observed for CVD cases and non-CVD cases in both training and testing sets.
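Projection onto the first two principal components can be sketched with the standard library alone, using power iteration with deflation on the covariance matrix. This is an illustrative sketch and not the study's implementation; the function names, the fixed iteration count and the toy data are assumptions, and a numerical library would normally be used for 19 features.

```python
def covariance_matrix(rows):
    """Return mean-centered rows and their sample covariance matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return centered, cov


def top_eigenpair(matrix, iterations=500):
    """Dominant eigenvalue and eigenvector of a symmetric matrix (power iteration)."""
    d = len(matrix)
    vec = [1.0 / (i + 1) for i in range(d)]  # arbitrary non-zero start vector
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in nxt) ** 0.5
        if norm == 0:
            break
        vec = [x / norm for x in nxt]
    value = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(d))
                for i in range(d))
    return value, vec


def pca_2d(rows):
    """Project rows onto their first two principal components."""
    centered, cov = covariance_matrix(rows)
    d = len(cov)
    components = []
    for _ in range(2):
        value, vec = top_eigenpair(cov)
        components.append(vec)
        # Deflation: remove the component just found before the next search.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(d)]
               for i in range(d)]
    return [[sum(c[j] * comp[j] for j in range(d)) for comp in components]
            for c in centered]


# Toy data (assumed): variance is largest along the first axis.
projected = pca_2d([[2.0, 0.0], [0.0, 1.0], [-2.0, 0.0], [0.0, -1.0]])
```

Each patient is thereby reduced to two coordinates, which can be plotted and colored by CVD status to inspect the separation visually.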

Performance of other models

The evaluation of models of KNN, SVM and Logistic regression was based on the testing set, and the results are presented in Table 6. The predictive accuracy for each model was as follows: K-means clustering achieved the highest accuracy of 0.8598, followed by KNN with a predictive accuracy of 0.846, SVM with a predictive accuracy of 0.819, and Logistic regression with a predictive accuracy of 0.7992 (Fig. 6).

Table 6 Model performance with different predetermined number of clusters in the training and testing sets, respectively

Discussion

In this study, data retrieved from the EMR were used to construct a CVD detection model with an unsupervised ML algorithm, and the model’s predictive accuracy was then assessed using the Bayesian theorem. Our study confirms the efficacy of unsupervised ML as a new approach for identifying individuals at high risk of having CVD by utilizing routine blood tests conducted during physical examinations or hospitalization for other medical conditions. This can assist healthcare providers in assessing the necessity for additional health examinations or appropriate treatment, thereby facilitating early detection of CVD and reducing unnecessary medical expenses.

Unsupervised clustering algorithms, which do not require labeled input data, have proven useful in disease detection, diagnosis and classification [26]. In a recent work, hierarchical clustering analysis was used to evaluate numerous clinical variables and discovered new clinical phenotypes of atrial fibrillation [27]. Another study used K-means clustering to detect the varied etiology and prognosis of heart failure with preserved ejection fraction [28]. Our investigation showed that, by extracting information from underutilized EMR data, the K-means clustering models surpassed the performance of the SVM, KNN and Logistic regression models, with a predictive accuracy of over 85% in both the training and testing sets. Our findings suggest that an unsupervised ML approach may yield novel tools for detecting CVD with high accuracy. Furthermore, since the patient’s data can be obtained from the EMR without gathering additional health information, which matters when medical expenditure is limited, this strategy is simple and efficient to adopt.

Various CVD guidelines recommend different CVD risk prediction tools. The most commonly used tool is the Framingham risk score, which incorporates age, sex, diabetes, smoking, systemic blood pressure, and body mass index [29]. The QRISK2 score, another frequently used prediction tool, incorporates many factors such as age, gender, race, blood pressure, diabetes, family history of coronary heart disease, chronic renal disease, blood lipids, rheumatoid arthritis, medication use, weight, smoking, etc. [30]. ML-based prediction models, by contrast, often incorporate a diverse array of variables. An ML-based model for CVD prediction was developed using a dataset from the UK BioBank, which consisted of 423,604 CVD-free patients. The model was built using 473 variables [31]. However, because many such variables lack a solid pathological basis and are difficult for clinicians to interpret, models of this kind are rarely used in clinical settings. The 19 variables in our selection from EMR data were chosen based on their clinical significance. Specifically, TC, TG, HDL, and LDL are key components of blood lipid profiles. Glucose and GHB are linked to diabetes, whereas creatinine, urea nitrogen, uric acid, and GFR are associated with chronic kidney disease. Mb, Tn, and CK-MB are important in diagnosing coronary heart disease since their levels are typically elevated in those with acute coronary syndrome. The current guidelines incorporate these variables, but do not include D-dimer, fibrinogen, hemoglobin, blood sodium, and blood potassium [32, 33]. It has been noted that the coagulation indicators D-dimer and fibrinogen are elevated during thromboembolism. During CVD events, the blood is hypercoagulable as a result of activation of coagulation mechanisms [34]. Hemoglobin is an indicator of blood viscosity, and an increase in blood viscosity has been linked to CVD events [35]. Elevated sodium levels have a direct influence on the progression of hypertension, which is considered a notable risk factor for ischemic heart disease, stroke, and other conditions [36]. According to previous reports, serum potassium levels were associated with CVD events and mortality [37]. Collectively, we believe that these variables may have some pathological basis for contributing to the development of CVD. Therefore, our model may be useful in assessing the likelihood of having CVD.

Several limitations should be acknowledged. First, this was a cross-sectional analysis of input features and prevalent CVD status recorded in EMR, so the temporal order of causality could not be determined. Second, this was a single-institution study, and our models should be externally validated. In addition, we focused on variables that are often recorded in EMR; other major CVD risk factors such as BMI and family history of CVD were not incorporated in the analysis because they are not consistently recorded in EMR. However, the prediction accuracy as estimated by Bayesian theorem was deemed satisfactory, and thus the findings should not be severely affected.

In conclusion, this study demonstrates the application of an ML approach that integrates K-means clustering and Bayesian theorem with EMR data to develop an automated model for evaluating the likelihood of having CVD. Additional longitudinal investigations including more characteristics (e.g., comorbidities, medication use, and CVD events) across several institutions are needed to improve the model’s accuracy and facilitate its potential implementation in clinical settings.

Data availability

The datasets used and analyzed during the current study are available from the corresponding author on reasonable request.

References

Ruan Y, Guo Y, Zheng Y, et al. Cardiovascular disease (CVD) and associated risk factors among older adults in six low-and middle-income countries: results from SAGE Wave 1. BMC Public Health. 2018;18(1):778. https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s12889-018-5653-9.
Summary of China Cardiovascular Health and Diseases Report 2020. Chin Circulation J. 2021;36(06):521–45. https://doiorg.publicaciones.saludcastillayleon.es/10.3969/j.issn.1000-3614.2021.06.001.
Dimopoulos A C, Nikolaidou M, Caballero F F, et al. Machine learning methodologies versus cardiovascular risk scores in predicting disease risk. BMC Med Res Methodol. 2018;18(1):179. https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s12874-018-0644-1.
Greenland P, Alpert J S, Beller G A, et al. 2010 ACCF/AHA guideline for assessment of cardiovascular risk in asymptomatic adults: a report of the American College of Cardiology Foundation/American Heart Association Task Force on Practice guidelines. Circulation. 2010;122(25):e584–636. https://doiorg.publicaciones.saludcastillayleon.es/10.1016/j.jacc.2010.09.001.
Piepoli M F, Hoes A W, Agewalls S, et al. 2016 European guidelines on cardiovascular disease prevention in clinical practice: the Sixth Joint Task Force of the European Society of Cardiology and Other Societies on Cardiovascular Disease Prevention in Clinical Practice (constituted by representatives of 10 societies and by invited experts)developed with the special contribution of the European Association for Cardiovascular Prevention & Rehabilitation (EACPR). Eur Heart J. 2016;37(29):2315–81. https://doiorg.publicaciones.saludcastillayleon.es/10.1093/eurheartj/ehw106.
Hippisley-Cox J, Coupland C, Vinogradova Y, et al. Predicting cardiovascular risk in England and Wales: prospective derivation and validation of QRISK2. BMJ. 2008;336(7659):1475–82. https://doiorg.publicaciones.saludcastillayleon.es/10.1136/bmj.39609.449676.25.
Hippisley-Cox J, Coupland C, Brindle P. Development and validation of QRISK3 risk prediction algorithms to estimate future risk of cardiovascular disease: prospective cohort study. BMJ. 2017;357:j2099. https://doiorg.publicaciones.saludcastillayleon.es/10.1136/bmj.j2099.
Shu S, Ren J. Clinical application of machine learning-based Artificial Intelligence in the diagnosis, prediction, and classification of Cardiovascular diseases. Circ J. 2021;85(9):1416–25. https://doiorg.publicaciones.saludcastillayleon.es/10.1253/circj.CJ-20-1121.
Trayanova N A, Popescu D M, Shade J K. Machine learning in Arrhythmia and Electrophysiology. Circ Res. 2021;128(4):544–66. https://doiorg.publicaciones.saludcastillayleon.es/10.1161/CIRCRESAHA.120.317872.
Ordikhani M, Saniee Abadeh M, Prugger C, et al. An evolutionary machine learning algorithm for cardiovascular disease risk prediction. PLoS ONE. 2022;17(7):e0271723. https://doiorg.publicaciones.saludcastillayleon.es/10.1371/journal.pone.0271723.
Dalmaijer E S, Nord C L, Astle D E. Statistical power for cluster analysis. BMC Bioinformatics. 2022;23(1):205. https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s12859-022-04675-1.
Cikes M, Sanchez-Martinez S, Claggett B, et al. Machine learning-based phenogrouping in heart failure to identify responders to cardiac resynchronization therapy. Eur J Heart Fail. 2019;21(1):74–85. https://doiorg.publicaciones.saludcastillayleon.es/10.1002/ejhf.1333.
Gautam N, Mueller J, Alqaisi O, Gandhi T, Malkawi A, Tarun T, Alturkmani HJ, Zulqarnain MA, Pontone G, Al’Aref SJ. Machine Learning in Cardiovascular Risk Prediction and Precision Preventive approaches. Curr Atheroscler Rep. 2023;25(12):1069–81. https://doiorg.publicaciones.saludcastillayleon.es/10.1007/s11883-023-01174-3.
Song H, Koh Y, Rhee T M, et al. Prediction of incident atherosclerotic cardiovascular disease with polygenic risk of metabolic disease: analysis of 3 prospective cohort studies in Korea. Atherosclerosis. 2022;348:16–24. https://doiorg.publicaciones.saludcastillayleon.es/10.1016/j.atherosclerosis.2022.03.021.
Klooster C C V, Bhatt D L, Steg P G, et al. Predicting 10-year risk of recurrent cardiovascular events and cardiovascular interventions in patients with established cardiovascular disease: results from UCC-SMART and REACH. Int J Cardiol. 2021;325:140–8. https://doiorg.publicaciones.saludcastillayleon.es/10.1016/j.ijcard.2020.09.053.
Lu P, Guo S, Zhang H, et al. Research on improved depth Belief Network-based prediction of Cardiovascular diseases. J Healthc Eng. 2018;2018:8954878. https://doiorg.publicaciones.saludcastillayleon.es/10.1155/2018/8954878.
Li Y, Sperrin M, Ashcroft DM, Van Staa TP. Consistency of variety of machine learning and statistical models in predicting clinical risks of individual patients: longitudinal cohort study using cardiovascular disease as exemplar. BMJ. 2020;371:m3919. https://doiorg.publicaciones.saludcastillayleon.es/10.1136/bmj.m3919.
Tang AS, Woldemariam SR, Miramontes S, et al. Harnessing EHR data for health research. Nat Med. 2024;30:1847–55. https://doi.org/10.1038/s41591-024-03074-8.
Ward A, Sarraju A, Chung S, Li J, Harrington R, Heidenreich P, Palaniappan L, Scheinker D, Rodriguez F. Machine learning and atherosclerotic cardiovascular disease risk prediction in a multi-ethnic population. NPJ Digit Med. 2020;3:125. https://doi.org/10.1038/s41746-020-00331-1.
Qiu Y, Wang W, Wu C, et al. A risk factor attention-based model for cardiovascular disease prediction. BMC Bioinformatics. 2022;23(Suppl 8):425. https://doi.org/10.1186/s12859-022-04963-w.
Li Q, Campan A, Ren A, Eid WE. Automating and improving cardiovascular disease prediction using machine learning and EMR data features from a regional healthcare system. Int J Med Inf. 2022;163:104786.
Meng H, Ruan J, Yan Z, et al. New Progress in early diagnosis of atherosclerosis. Int J Mol Sci. 2022;23(16):8939. https://doi.org/10.3390/ijms23168939.
Francula-Zaninovic S, Nola IA. Management of Measurable Variable Cardiovascular Disease’ risk factors. Curr Cardiol Rev. 2018;14(3):153–63. https://doi.org/10.2174/1573403X14666180222102312.
Ringnér M. What is principal component analysis? Nat Biotechnol. 2008;26:303–4. https://doi.org/10.1038/nbt0308-303.
Ding C, He X. K-means Clustering via Principal Component Analysis. Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004.
Frades I, Matthiesen R. Overview on techniques in cluster analysis. Methods Mol Biol. 2010;593:81–107. https://doi.org/10.1007/978-1-60327-194-3_5.
Inohara T, Shrader P, Pieper K, et al. Association of Atrial Fibrillation Clinical Phenotypes with treatment patterns and outcomes: a Multicenter Registry Study. JAMA Cardiol. 2018;3(1):54–63. https://doi.org/10.1001/jamacardio.2017.4665.
Harada D, Asanoi H, Noto T, et al. Different pathophysiology and outcomes of heart failure with preserved ejection Fraction Stratified by K-Means clustering. Front Cardiovasc Med. 2020;7:607760. https://doi.org/10.3389/fcvm.2020.607760.
Petruzzo M, Reia A, Maniscalco GT, et al. The Framingham cardiovascular risk score and 5-year progression of multiple sclerosis. Eur J Neurol. 2021;28(3):893–900. https://doi.org/10.1111/ene.14608.
Brunström M, Andersson J, Eliasson M, et al. [SCORE2 - an updated model for cardiovascular risk prediction]. Lakartidningen. 2021;118:21164.
Alaa AM, Bolton T, Di Angelantonio E, et al. Cardiovascular disease risk prediction using automated machine learning: a prospective study of 423,604 UK Biobank participants. PLoS ONE. 2019;14(5):e0213653. https://doi.org/10.1371/journal.pone.0213653.
Virani SS, Newby LK, Arnold SV, et al. 2023 AHA/ACC/ACCP/ASPC/NLA/PCNA Guideline for the management of patients with chronic coronary disease: a report of the American Heart Association/American College of Cardiology Joint Committee on Clinical Practice guidelines. Circulation. 2023. https://doi.org/10.1161/CIR.0000000000001168.
Knuuti J, Wijns W. 2019 ESC guidelines for the diagnosis and management of chronic coronary syndromes. Eur Heart J. 2020;41(3):407–77. https://doi.org/10.1093/eurheartj/ehz425.
Lindahl B. Acute coronary syndrome - the present and future role of biomarkers. Clin Chem Lab Med. 2013;51(9):1699–706. https://doi.org/10.1515/cclm-2013-0074.
Canaud B, Rodriguez A. Whole-blood viscosity increases significantly in small arteries and capillaries in hemodiafiltration. Does acute hemorheological change trigger cardiovascular risk events in hemodialysis patient? Hemodial Int. 2010;14(4):433–40. https://doi.org/10.1111/j.1542-4758.2010.00496.x.
Zhou B, Perel P, Mensah GA, et al. Global epidemiology, health burden and effective interventions for elevated blood pressure and hypertension. Nat Rev Cardiol. 2021;18(11):785–802. https://doi.org/10.1038/s41569-021-00559-8.
Liu S, Zhao D, Wang M, et al. Association of Serum Potassium Levels with Mortality and Cardiovascular events: findings from the Chinese multi-provincial cohort study. J Gen Intern Med. 2022;37(10):2446–53. https://doi.org/10.1007/s11606-021-07111-x.


Acknowledgements

Not applicable.

Funding

This work was supported by the Shanghai Aging and Maternal and Child Health Research Project (No. 2020YJZX0141) and the Clinical Special Project of the Shanghai Municipal Health Commission, China (No. 202040083). The funding bodies played no role in the design of the study; the collection, analysis, and interpretation of data; or the writing of the manuscript.

Author information

Ying Hu and Hai Yan are co-first authors.

Authors and Affiliations

Department of Cardiology, National Clinical Research Center for Interventional Medicine, Shanghai Institute of Cardiovascular Diseases, Zhongshan Hospital, Fudan University, Shanghai, 200032, China
Ying Hu, Chunyu Zhang & Hong Jiang
Shanghai Engineering Research Center of AI Technology for Cardiopulmonary Diseases, Zhongshan Hospital, Fudan University, Shanghai, 200032, China
Ying Hu, Ming Liu & Hong Jiang
Department of General Surgery, Center for Bariatric and Hernia Surgery, Huashan Hospital, Fudan University, Shanghai, 200040, China
Hai Yan
Shanghai Xuhui Central Hospital, Zhongshan-Xuhui Hospital, Fudan University, Shanghai, 200031, China
Jing Gao, Lianhong Xie & Lili Wei
Department of Epidemiology, School of Public Health, and Key Laboratory of Public Health Safety of Ministry of Education, Fudan University, Shanghai, 200032, China
Yinging Ding
Department of Health Management Center, Zhongshan Hospital, Fudan University, Shanghai, 200032, China
Ming Liu

Authors

Ying Hu
Hai Yan
Ming Liu
Jing Gao
Lianhong Xie
Chunyu Zhang
Lili Wei
Yinging Ding
Hong Jiang

Contributions

Ying Hu and Hong Jiang contributed to the study conception and design. Material preparation, data collection, and analysis were performed by Ming Liu, Jing Gao, and Lianhong Xie. Data curation was managed by Lili Wei and Chunyu Zhang. The first draft of the manuscript was written by Ying Hu and Hai Yan. Review and editing were completed by Hong Jiang and Yinging Ding. All authors have read and agreed to the published version of the manuscript.

Corresponding authors

Correspondence to Yinging Ding or Hong Jiang.

Ethics declarations

Ethics approval and consent to participate

This study was performed in accordance with the guidelines of the Declaration of Helsinki. The study design was approved by the Ethics Committee of Shanghai Xuhui Central Hospital (approval no.: 2023033), and the institutional review board waived the requirement for informed consent.

Consent for publication

All authors have approved the manuscript for publication.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article

Cite this article

Hu, Y., Yan, H., Liu, M. et al. Detecting cardiovascular diseases using unsupervised machine learning clustering based on electronic medical records. BMC Med Res Methodol 24, 309 (2024). https://doi.org/10.1186/s12874-024-02422-z


Received: 18 October 2023
Accepted: 25 November 2024
Published: 19 December 2024
DOI: https://doi.org/10.1186/s12874-024-02422-z
