Multiple states clustering analysis (MSCA), an unsupervised approach to multiple time-to-event electronic health records applied to multimorbidity associated with myocardial infarction

Abstract

Multimorbidity is characterized by the accrual of two or more long-term conditions (LTCs) in an individual. This state of health is increasingly prevalent and poses public health challenges. Adapting approaches to effectively analyse electronic health records is needed to better understand multimorbidity. We propose a novel unsupervised clustering approach to multiple time-to-event health records denoted as multiple state clustering analysis (MSCA). In MSCA, patients’ pairwise dissimilarities are computed using patients’ state matrices which are composed of multiple censored time-to-event indicators reflecting patients’ health history. The use of state matrices enables the analysis of an arbitrary number of LTCs without reducing patients’ health trajectories to a particular sequence of events. MSCA was applied to analyse multimorbidity associated with myocardial infarction using electronic health records of 26 LTCs, including conventional cardiovascular risk factors (CVRFs) such as diabetes and hypertension, collected from south London general practices between 2005 and 2021 in 5087 patients using the MSCA R library. We identified a typology of 11 clusters, characterised by age at onset of myocardial infarction, sequences of conventional CVRFs and non-conventional risk factors including physical and mental health conditions. Interestingly, multivariate analysis revealed that clusters were also associated with various combinations of socio-demographic characteristics including gender and ethnicity. By identifying meaningful sequences of LTCs associated with myocardial infarction and distinct socio-demographic characteristics, MSCA proves to be an effective approach to the analysis of electronic health records, with the potential to enhance our understanding of multimorbidity for improved prevention and management.

Background

The notion that life-course health trajectories are influenced by early life events and the ever-changing historical context [1], or by later lifestyle-associated exposure [2], has dominated the epidemiological field so far.

Accordingly, life-course epidemiology has developed the environmental model-based approach to chronic diseases, focusing on lifestyles and previous exposures that may explain future health conditions. Conceptual life-course models include, for instance, the critical period model [3], the accumulation risk model [4] or the chain risk model [5]. These models emphasize various aspects of life experiences such as the impact of early-life exposures on later health outcomes, the cumulative effects of multiple risk factors and exposure over time, and the interdependence of health events respectively. Conversely, we can assume that the main combinations of early life events and specific socio-demographic contexts result in distinct patterns of life-course health trajectories characterised by specific patterns of long-term conditions (LTCs).

Progress in information technologies and the availability to academic research of publicly funded health databases, including routinely collected general practice records, represent an opportunity to better evaluate this hypothesis [6]. Electronic health records are digital versions of patients’ medical history, containing healthcare-related data such as diagnoses and treatments. The widespread adoption of electronic health records in healthcare systems has led to the accumulation of vast amounts of longitudinal data, which can be harnessed to analyse health outcomes throughout an individual’s lifetime.

However, analyzing health records can be challenging. In this setting, the use of unsupervised exploratory analysis may be required to facilitate description of the main patterns of multimorbidity, guiding further research towards relevant underlying conceptual models.

Importantly, the analysis of electronic health records enables data interpretation within patients’ socio-demographic contexts, thereby supporting the development of public health preventive strategies targeting potential modifiable risk factors in identified subpopulations.

We propose in this paper a novel unsupervised clustering approach to multiple time-to-event records, denoted as multiple state analysis. In the setting of observational studies, for instance, this method makes it possible to handle a potentially large number of LTCs for each analysed patient, in order to obtain clusters of health trajectories characterised by major sequences of the analysed LTCs.

We applied multiple state analysis to electronic health records of LTCs associated with myocardial infarction, including the conventional cardiovascular risk factors (CVRFs) such as diabetes and hypertension, collected from south London general practices between April 2005 and April 2021. In addition to well-documented trajectories, such as the association of hypertension and diabetes with cardiovascular diseases [7], this research aims at identifying less common patterns of multimorbidity characterized by non-conventional sequences of LTCs.

Myocardial infarction, leading to “heart attack”, is one of the leading causes of death in high-income countries [8]. Myocardial infarction is caused by decreased or complete cessation of blood flow in the myocardium and results in irreversible damage to the heart muscle [9]. Most of the time, myocardial infarction is due to underlying coronary artery disease [10]. Conventional modifiable risk factors associated with coronary artery disease and myocardial infarction include smoking, abnormal blood lipid profile, hypertension, diabetes, abdominal obesity, psycho-social factors, diet, physical inactivity and alcohol consumption (protective) [11, 12]. Some non-modifiable risk factors associated with myocardial infarction include advanced age, male gender (males tend to have myocardial infarction earlier in life) and genetics [12, 13].

This paper is organized as follows: in the Methods section, multiple state analysis is presented. In the Multiple state analysis of multimorbidity associated with myocardial infarction section, an application of multiple state analysis to electronic health records of 26 LTCs in patients with myocardial infarction is conducted. Finally, in the Conclusions section, the proposed method is briefly discussed in light of the study results.

Methods

Multiple state analysis is an unsupervised clustering approach to multiple time-to-event records. It is designed to handle and analyse series of events associated with each instance of a population, such as a set of LTCs recorded in a cohort of patients. In the epidemiological setting, the objective is to create a typology of the main patterns of multimorbidity, thereby allowing a simplified description of analysed cohorts in terms of these patterns. This objective is achieved by applying the principles of unsupervised clustering analyses [14]: given a relevant metric, pairwise patients’ dissimilarities are computed before a clustering method is applied and a typology of the main health trajectories is defined. Further investigations include evaluating the association between identified clusters and socio-demographic characteristics and other CVRFs.

Individual patient’s state matrix and group summaries

In order to compute patients’ pairwise dissimilarities, multiple state analysis requires patients’ health records to be formatted into multiple indicator tracks, stacked in \({t \times k}\) state matrices, t and k being the maximum age observed in the cohort and the total number of recorded LTCs, respectively.

If onset of disease l for patient j is \(t_{l,j}^o\), this patient’s state matrix is defined column-wise as \(\text {M}_{.l}^j = \text {I}_{[t \ge t_{l,j}^o]}\), where \(\text {M}_{.l}^j\) stands for column l of matrix \(\text {M}^j\), and I is the indicator function over integer time units t. For instance, in the example hereafter, examination of patient j’s state matrix (\(\text {M}^j\)) indicates that the analysis is conducted on LTCs a, b, c, d and e from 0 to 99 years, the onsets of diseases a, b, c and e being at 2, 46, 47 and 48 years, respectively. Patients may also be censored, experience a competing event such as death, or be lost to follow-up.

\(\tau\) being the vector of patients’ censoring times, \(\text {c}^j = \text {I}_{[t \ge \tau _j]}\) represents patient j’s censoring indicator and \(\bar{\text {c}}^j = \text {I}_{[t < \tau _j]}\) the follow-up period indicator for patient j. In the proposed example, patient j is censored at age 83.

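$$\begin{aligned} \begin{array}{c|ccccc|c} t & a & b & c & d & e & \bar{\text {c}}^j \\ \hline 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 & 0 & 0 & 1 \\ 2 & 1 & 0 & 0 & 0 & 0 & 1 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 45 & 1 & 0 & 0 & 0 & 0 & 1 \\ 46 & 1 & 1 & 0 & 0 & 0 & 1 \\ 47 & 1 & 1 & 1 & 0 & 0 & 1 \\ 48 & 1 & 1 & 1 & 0 & 1 & 1 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 82 & 1 & 1 & 1 & 0 & 1 & 1 \\ 83 & 1 & 1 & 1 & 0 & 1 & 0 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 99 & 1 & 1 & 1 & 0 & 1 & 0 \end{array} \end{aligned}$$
(1)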

Alternatively, in cases where electronic health records (EHRs) show evidence that patient j is cured of long-term condition l at time \(t_{l,j}^c < \tau _j\), their state matrix would be modified such that \(\text {M}_{.l}^j = \text {I}_{[ t_{l,j}^o \le t \le t_{l,j}^c]}\). This latter expression of M is a more general formula for a patient’s state matrix. From the previous example, if EHRs indicate that patient j is cured of disease b at 48 years, for instance, their state matrix \(M^{'j}\) would be:

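$$\begin{aligned} M^{'j} = \begin{array}{c|ccccc} t & a & b & c & d & e \\ \hline 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 \\ 2 & 1 & 0 & 0 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 45 & 1 & 0 & 0 & 0 & 0 \\ 46 & 1 & 1 & 0 & 0 & 0 \\ 47 & 1 & 1 & 1 & 0 & 0 \\ 48 & 1 & 1 & 1 & 0 & 1 \\ 49 & 1 & 0 & 1 & 0 & 1 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 99 & 1 & 0 & 1 & 0 & 1 \end{array} \end{aligned}$$
(2)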
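As an illustration of the definitions in this section, the following Python sketch builds a state matrix from ages at onset, with optional ages at cure, together with the follow-up indicator. The function names (`state_matrix`, `follow_up_indicator`) and the dictionary-based input are assumptions made for this example; they are not taken from the MSCA R library.

```python
def state_matrix(onsets, t_max, cures=None):
    """Return a t_max x k state matrix as a list of rows (one row per age).

    onsets: dict mapping each LTC name to its age at onset (None if never recorded).
    cures:  optional dict mapping an LTC name to the age at which it was cured.
    Column l is I[onset <= t <= cure], or I[t >= onset] when no cure is recorded.
    """
    cures = cures or {}
    rows = []
    for t in range(t_max):
        row = []
        for ltc, onset in onsets.items():
            affected = onset is not None and t >= onset
            if affected and ltc in cures and t > cures[ltc]:
                affected = False
            row.append(int(affected))
        rows.append(row)
    return rows


def follow_up_indicator(censoring_age, t_max):
    """Return the follow-up indicator I[t < tau] for t = 0, ..., t_max - 1."""
    return [int(t < censoring_age) for t in range(t_max)]


# Patient j from the text: LTCs a, b, c, d, e observed from 0 to 99 years,
# onsets at 2, 46, 47 and 48 for a, b, c and e (d never recorded), censored at 83.
onsets_j = {"a": 2, "b": 46, "c": 47, "d": None, "e": 48}
M_j = state_matrix(onsets_j, t_max=100)
M_j_cured = state_matrix(onsets_j, t_max=100, cures={"b": 48})
c_bar_j = follow_up_indicator(83, t_max=100)

print(M_j[47])        # states of LTCs a-e at age 47: [1, 1, 1, 0, 0]
print(M_j_cured[49])  # b is back to 0 after the cure at 48: [1, 0, 1, 0, 1]
```

Storing the follow-up indicator separately from the state matrix mirrors the notation used here, where censoring is applied only when dissimilarities are computed.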

Dissimilarity index matrix

The objective pursued when creating a typology of patients based on their profiles of LTCs is to categorise patients such that patients belonging to the same cluster share similar trajectories relative to patients belonging to other clusters. This process therefore implies the use of a dissimilarity (or distance) metric in order to obtain pairwise dissimilarities between patients’ profiles. Based on the resulting distance matrix or dissimilarity index matrix, a clustering method can then be applied to define a typology.

As multiple state analysis deals with state indicators, the relationship between two patients, regarding a given LTC can be summarised using a 2 \(\times\) 2 contingency table such as \(\begin{pmatrix} p & r \\ s & q \end{pmatrix}\),

  (i) q being the number of matching time units during which both patients were affected by a given LTC,
  (ii) p being the number of matching time units during which both patients were free from the considered LTC,
  (iii) s and r being the number of matching time units during which both patients were in different states of health regarding the specified LTC and,
  (iv) t being the length of the sequence.

In this setting, the dissimilarity between profiles \(x^i\) and \(x^j\) can be written as:

$$\begin{aligned} \left\{ \begin{array}{l} d(x^i,x^j) = \frac{ r + s }{ q + r + s + p }. \\ q + r + s + p = t \end{array}\right. \end{aligned}$$

Considering states of illness as more informative than healthy states and omitting p, the number of negative matches in the denominator, leads to the Jaccard dissimilarity index [15]:

$$\begin{aligned} d(x^i,x^j) = \frac{ r + s }{ q + r + s } = 1 - \frac{ q }{ q + r + s }. \end{aligned}$$

Since \(r + s = t - p - q\), we can rewrite the above relation as follows:

$$\begin{aligned} d(x^i,x^j) = 1 - \frac{ q }{ t - p }. \end{aligned}$$
(3)

A composite analogue to the Jaccard dissimilarity index when considering indicators from multiple states can be derived as:

$$\begin{aligned} \left\{ \begin{array}{l} d(M^i,M^j) = 1 - \frac{ Q }{ t^* - P },\\ Q = \sum _{l=1}^k q_l,\\ P = \sum _{l=1}^k p_l,\\ t^* = kt. \end{array}\right. \end{aligned}$$
(4)

i.e. Q and P are the sum of q and p over all considered LTCs.

Using the matrix notations (1 or 2) from Individual patient’s state matrix and group summaries section and setting \(\bar{\text {C}^i}\) as the \(t \times t\) diagonal matrix with \(\bar{\text {c}^i}\) as diagonal entries, the censored quantities defined above can be conveniently computed as:

$$\begin{aligned} Q=&\;\text {tr} \left( \left( \text {M}^i{'} \bar{\text {C}^i} \bar{\text {C}^j} \right) \left( \text {M}^j{'} \bar{\text {C}^i} \bar{\text {C}^j} \right) ' \right) \\ P= &\;\text {tr} \left( \left( \left( \text {M}^i - \mathbbm {1} \right) ' \bar{\text {C}^i} \bar{\text {C}^j} \right) \left( \left( \text {M}^j - \mathbbm {1} \right) ' \bar{\text {C}^i} \bar{\text {C}^j} \right) ' \right) \\ t^*= &\;\text {tr}\left( \left( \mathbbm {1}' \bar{\text {C}^i} \bar{\text {C}^j} \right) \left( \mathbbm {1}' \bar{\text {C}^i} \bar{\text {C}^j} \right) '\right) , \end{aligned}$$

where tr denotes the trace operator and \(\mathbbm {1}\) is the \(t \times k\) matrix with all entries set to 1.
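Because \(\bar{\text {C}^i}\) and \(\bar{\text {C}^j}\) are diagonal with binary entries, the trace expressions above reduce to element-wise sums over time units and LTCs. The expansion below is a reading aid derived from the definitions above; the entry-wise notation \(m^i_{t,l}\) for the entries of \(\text {M}^i\) and \(\bar{c}^i_t\) for the entries of \(\bar{\text {c}}^i\) is introduced here for that purpose only.

$$\begin{aligned} Q=&\;\sum _{l=1}^k \sum _{t} m^i_{t,l} \, m^j_{t,l} \, \bar{c}^i_t \, \bar{c}^j_t \\ P=&\;\sum _{l=1}^k \sum _{t} \left( 1 - m^i_{t,l} \right) \left( 1 - m^j_{t,l} \right) \bar{c}^i_t \, \bar{c}^j_t \\ t^*=&\;k \sum _{t} \bar{c}^i_t \, \bar{c}^j_t \end{aligned}$$

In words, only the time units during which both patients are under follow-up contribute to Q, P and \(t^*\).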

As an illustration, let’s consider health records of two patients, patient i having a record of diabetes (dm) at 55 and a record of hypertension (hyp) at 53, and patient j having a record of diabetes at 52 and a record of hypertension at 59.

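The state matrices of these patients, over the 11 time units considered (ages 50 to 60), are:

$$\begin{aligned} \begin{array}{c|cc|cc} t & \text {M}_{.dm}^i & \text {M}_{.hyp}^i & \text {M}_{.dm}^j & \text {M}_{.hyp}^j \\ \hline 50 & 0 & 0 & 0 & 0 \\ 51 & 0 & 0 & 0 & 0 \\ 52 & 0 & 0 & 1 & 0 \\ 53 & 0 & 1 & 1 & 0 \\ 54 & 0 & 1 & 1 & 0 \\ 55 & 1 & 1 & 1 & 0 \\ 56 & 1 & 1 & 1 & 0 \\ 57 & 1 & 1 & 1 & 0 \\ 58 & 1 & 1 & 1 & 0 \\ 59 & 1 & 1 & 1 & 1 \\ 60 & 1 & 1 & 1 & 1 \end{array} \end{aligned}$$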

The composite Jaccard dissimilarity index proposed in (4) between patients i and j for diabetes and hypertension is computed as follows:

  • compute the total number of time units considered:

    $$\begin{aligned} t^* = 11 \times 2 = 22, \end{aligned}$$
  • compute the number of matching time units during which both patients were affected by the considered LTCs:

    $$\begin{aligned} Q = q_{dm} + q_{hyp} = 6 + 2 = 8, \end{aligned}$$
  • compute the number of matching time units during which both patients were free from the considered LTCs:

    $$\begin{aligned} P = p_{dm} + p_{hyp} = 2 + 3 = 5, \end{aligned}$$
  • get the composite Jaccard dissimilarity index using (4):

    $$\begin{aligned} d(M^i,M^j) = 1 - \frac{Q}{t^*-P} = 1 - \frac{8}{22-5} = \frac{9}{17}. \end{aligned}$$
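The worked example can be checked with a short Python sketch of the composite Jaccard dissimilarity. The function name `composite_jaccard`, the list-of-rows representation and the age window 50 to 60 (inferred from \(t^* = 11 \times 2\)) are assumptions made for illustration, as is the convention of returning 0 when neither patient ever has a record.

```python
from fractions import Fraction


def composite_jaccard(M_i, M_j, c_bar_i=None, c_bar_j=None):
    """Composite Jaccard dissimilarity 1 - Q / (t* - P) between two state matrices.

    M_i, M_j: lists of rows (one row per time unit, one column per LTC).
    c_bar_i, c_bar_j: optional follow-up indicators; time units during which
    either patient is no longer followed up are ignored.
    """
    n_rows = len(M_i)
    c_bar_i = c_bar_i or [1] * n_rows
    c_bar_j = c_bar_j or [1] * n_rows
    Q = P = t_star = 0
    for row_i, row_j, ci, cj in zip(M_i, M_j, c_bar_i, c_bar_j):
        if not (ci and cj):
            continue  # censored time unit for at least one patient
        for a, b in zip(row_i, row_j):
            t_star += 1
            Q += int(a == 1 and b == 1)  # both affected
            P += int(a == 0 and b == 0)  # both free
    if t_star == P:
        return Fraction(0)  # assumed convention: no record for either patient
    return 1 - Fraction(Q, t_star - P)


# Columns are (diabetes, hypertension); rows are ages 50 to 60.
ages = range(50, 61)
M_i = [[int(t >= 55), int(t >= 53)] for t in ages]
M_j = [[int(t >= 52), int(t >= 59)] for t in ages]

d = composite_jaccard(M_i, M_j)
print(d)  # 9/17
```

Exact fractions are used here only so that the result can be compared directly with the value derived by hand.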

Clustering method

Although multiple state analysis is not restricted to a specific clustering procedure, we used Ward’s hierarchical clustering method in this paper [16]. At the starting point of this procedure, each instance is considered as a cluster of its own; clusters are then recursively merged such that the resulting cluster structure presents the minimum cost in terms of within-cluster dissimilarity, which often results in compact and well-defined clusters [17]. Although originally proposed in a Euclidean setting, Ward’s method can be generalized to dissimilarity measures, such as the Jaccard dissimilarity, using the Lance-Williams formula and the coefficients associated with Ward’s method [18], provided the dissimilarity measure satisfies non-negativity and symmetry.

The Lance-Williams formula is given by the following expression:

$$\begin{aligned} d(C_i \cup C_j, C_k) = \alpha _i d(C_i, C_k) + \alpha _j d(C_j, C_k) + \beta d(C_i, C_j) + \gamma |d(C_i, C_k) - d(C_j, C_k)|, \end{aligned}$$

in which the dissimilarity between the cluster obtained by merging clusters i and j and cluster k (\(d(C_i \cup C_j, C_k)\)) is expressed as a linear combination of the dissimilarities between the clusters involved, without explicit reference to the type of metric used to compute the dissimilarity matrix between instances. The Lance-Williams coefficients associated with Ward’s method are expressed as follows:

$$\begin{aligned} \left\{ \begin{array}{l} \alpha _i = \frac{|C_i| + |C_k|}{|C_i| + |C_j| + |C_k|} \\ \alpha _j = \frac{|C_j| + |C_k|}{|C_i| + |C_j| + |C_k|} \\ \beta = -\frac{|C_k|}{|C_i| + |C_j| + |C_k|} \\ \gamma = 0 \end{array}\right. , \end{aligned}$$

where \(|C_i|\) represents the size of cluster i.
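A single Lance-Williams update with the Ward coefficients given above can be sketched as follows. The function name `ward_lance_williams` and the toy dissimilarities are hypothetical and serve only to illustrate the recurrence; this is not the routine used by any particular clustering library.

```python
def ward_lance_williams(d_ik, d_jk, d_ij, n_i, n_j, n_k):
    """Dissimilarity between the merged cluster (C_i union C_j) and C_k.

    Uses the Lance-Williams recurrence with the Ward coefficients
    (gamma = 0, so the absolute-difference term vanishes).
    """
    total = n_i + n_j + n_k
    alpha_i = (n_i + n_k) / total
    alpha_j = (n_j + n_k) / total
    beta = -n_k / total
    return alpha_i * d_ik + alpha_j * d_jk + beta * d_ij


# Three singleton clusters: C_i and C_j are close to each other (0.2)
# and both are far from C_k (0.9 and 0.7).
d_new = ward_lance_williams(d_ik=0.9, d_jk=0.7, d_ij=0.2, n_i=1, n_j=1, n_k=1)
print(round(d_new, 4))  # 1.0
```

In the Euclidean setting this recurrence is usually applied to squared distances; whether the dissimilarities are squared before the update depends on the implementation and is not specified here.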

MSCA algorithm time complexity

A critical point inherently associated with MSCA is its computational cost. Instead of clustering a reduced set of long-term conditions based on a set of patients, MSCA is a patient-oriented method that produces a typology of patients sharing similar health trajectories. Although this strategy allows patients’ pairwise longitudinal dissimilarities to be captured through the computation of state matrices, the computational cost associated with Ward’s clustering method (theoretical time complexity: \(O(n^3)\)) and with the patient-wise dissimilarity matrix (time complexity: \(O(n^2)\)) represents a significant consideration when applying MSCA.

Practically, the use of the fastcluster C++-implemented routines [19] to perform Ward’s clustering method, through its interfaces to R’s hclust or Python’s scipy.cluster.hierarchy.linkage, reduces the observed runtime to \(O(n^2)\) in many cases [20], making it scalable to larger datasets.

For n patients, given a symmetric dissimilarity measure, the pairwise dissimilarities between all patients can be stored in an upper triangular matrix of size n, corresponding to \(\frac{n(n-1)}{2}\) distinct elements (hence the \(O(n^2)\) time complexity). The time complexity associated with the computation of the Jaccard dissimilarity between two time-aligned sequences of length t scales linearly with t, i.e. O(t). Therefore, the time complexity associated with the computation of the dissimilarity matrix is \(O(n^2 \cdot k \cdot t)\), \(k\) being the number of long-term conditions considered. This can be approximated as \(O(n^2)\) when \(k \cdot t\) is much smaller than the number of patients \(n\). In scenarios where \(n\) is large, it may therefore be relevant to coarsen the time granularity (i.e. reduce \(t\)) to effectively reduce the computational cost of MSCA.
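To give an order of magnitude, the counts discussed above can be evaluated with a few lines of Python. The helper names and the settings (26 LTCs, yearly versus five-yearly time units over 105 years) are hypothetical and chosen for illustration only; k denotes the number of LTCs and t the number of time units.

```python
def n_pairs(n):
    """Number of distinct patient pairs, i.e. entries of the condensed dissimilarity matrix."""
    return n * (n - 1) // 2


def n_comparisons(n, k, t):
    """Elementary state comparisons needed to fill the dissimilarity matrix."""
    return n_pairs(n) * k * t


n = 5087  # patients with a record of myocardial infarction
print(n_pairs(n))  # 12936241

# Hypothetical settings: 26 LTCs tracked yearly over 105 years versus 5-year bins.
yearly = n_comparisons(n, k=26, t=105)
five_yearly = n_comparisons(n, k=26, t=21)
print(yearly // five_yearly)  # 5
```

The number of pairs depends only on n, so coarsening the time axis reduces the cost per pair without changing the quadratic growth in the number of patients.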

In the next section, we propose an application of MSCA to the analysis of multimorbidity associated with myocardial infarction.

Multiple state analysis of multimorbidity associated with myocardial infarction

To illustrate multiple state analysis we have conducted an analysis of multimorbidity associated with myocardial infarction using electronic health records of 26 long-term conditions including conventional cardiovascular risk factors such as diabetes and hypertension, collected from south London general practices between 2005 and 2021 in 5087 patients.

Patients and method

Primary care registry

We considered electronic health records of 27 common LTCs (including myocardial infarction) in adult patients aged over 18 and registered in 41 general practices in south London between April 2005 and April 2021. Recorded LTCs are listed in Table 1. Briefly, the proposed list includes conventional CVRFs such as hypertension, and diabetes, but also LTCs a priori less related or non-directly related to myocardial infarction such as cancers, chronic kidney disease, asthma, and chronic obstructive pulmonary disease (COPD). Patients’ electronic files included the date at which any of the considered LTCs were first ever recorded.

Table 1 Long term conditions analysed: long-term conditions are classified according to the 10th international classification of diseases (ICD-10)

Socio-demographic variables and risk factors collected were: age, gender, ethnicity (asian, black, mixed and other, and white), polymedication status (defined as eight or more different medications in different BNF (British National Formulary) chapters and subheadings), quintile of locally calculated index of multiple deprivation (IMD) 2019, hypercholesterolaemia (total cholesterol over 5.0 mmol/L) and current or ex-smoking habits. Data were provided by the Lambeth DataNet, and approval for the analysis of fully anonymised data was granted by Lambeth DataNet Clinical Commissioning Group and Information Governance Steering Group.

Statistical analysis

The analysis was conducted on patients with a record of myocardial infarction, according to the three steps described in the Methods section: i) arrange patients’ individual records into multiple time-to-event indicators stacked in individual patients’ state matrices and censoring indicators, ii) compute pairwise patients’ dissimilarities on individual state matrices (and censoring indicators) and apply a clustering method, and iii) define a typology.

State matrices were computed considering records associated with the 27 LTCs displayed in Table 1. Figure 1 displays state matrices computed using records of 8 LTCs associated with nine patients randomly sampled from the analysed cohort.

Fig. 1

Examples of patients’ state matrices: censored statuses of 8 long-term conditions are considered from 0 to 104 years old in nine patients randomly sampled from the cohort. Patients’ statuses are represented by state indicators for each displayed long-term condition. If a patient remains free from a given long-term condition during follow-up, the corresponding state indicator remains zero from the age of zero to the age at which the patient’s follow-up ends. If, for instance, a patient is diagnosed with hypertension at 50 years, the patient’s state indicator for hypertension is zero from zero to 49 years old, and one from 50 years old to the end of the patient’s follow-up

Pairwise dissimilarities between patients were computed using the Jaccard dissimilarity index [15] as described in the Dissimilarity index matrix section. The Jaccard dissimilarity index is a measure of dissimilarity between binary samples, such as state indicators. Finally, agglomerative hierarchical clustering was computed using the Lance-Williams formula with coefficients corresponding to Ward’s method, and the point biserial correlation was used to determine the optimal size for the typology within a convenient and workable range (from two to 14 clusters) [21].
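The point biserial criterion can be sketched as follows, assuming its common formulation in cluster validation: the Pearson correlation between the pairwise dissimilarities and a binary indicator equal to 1 when two patients fall in different clusters. The exact definition used in [21] may differ, and the function name `point_biserial` and the toy data are hypothetical.

```python
from itertools import combinations
from math import sqrt


def point_biserial(dist, labels):
    """Correlation between pairwise dissimilarities and a between-cluster indicator.

    dist: dict mapping each pair (i, j), i < j, to its dissimilarity.
    labels: list of cluster labels, one per patient.
    """
    pairs = list(combinations(range(len(labels)), 2))
    d = [dist[p] for p in pairs]
    b = [int(labels[i] != labels[j]) for i, j in pairs]
    n = len(pairs)
    mean_d, mean_b = sum(d) / n, sum(b) / n
    cov = sum((x - mean_d) * (y - mean_b) for x, y in zip(d, b))
    var_d = sum((x - mean_d) ** 2 for x in d)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_d * var_b)


# Toy example with four patients: 0 and 1 are close, 2 and 3 are close.
dist = {(0, 1): 0.1, (2, 3): 0.2, (0, 2): 0.8, (0, 3): 0.9, (1, 2): 0.7, (1, 3): 0.8}
good = point_biserial(dist, [1, 1, 2, 2])  # partition matching the structure
poor = point_biserial(dist, [1, 2, 1, 2])  # partition cutting across it
print(good > poor)  # True
```

Under this formulation, the criterion would be evaluated for each candidate partition size (here from two to 14 clusters) and a size corresponding to a maximum would be retained.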

Graphical representation

To better understand the link between onset of myocardial infarction and associated LTCs in the defined clusters, we used a graphical representation displaying health conditions as dots whose x-coordinate represents the median age at onset. On the y axis, dots are assigned to layers such that significant transitions are represented by edges oriented from higher to lower levels [22].

Statistical analysis of cluster trajectories

Socio-demographics, conventional CVRFs and LTC indicators were displayed as frequencies and percentages or as medians and interquartile ranges, as appropriate. Associations between variables and clusters were tested using the Fisher exact test for categorical variables or the Kruskal-Wallis test for numeric variables (Table 2). Multivariate associations between clusters and socio-demographic variables, conventional CVRFs, and LTCs were estimated using logistic regressions where cluster indicators were explained by tested variables. Results were displayed using heatmaps where estimates associated with P values greater than 5% were omitted (Fig. 3). All computations were performed using the R language and environment for statistical computing (version 4.3.0 (2023-04-21)) [23]. State matrices and patients’ pairwise distances were computed using the MSCA R library [24].

Table 2 Distribution of socio-demographic variables and risk factors in myocardial infarction patients across defined clusters

Results

The study workflow is displayed in Fig. 2. Of 856,342 registered patients, 5087 (0.59%) had a record of myocardial infarction (Tables 2 and 3). This corresponds to an incidence rate of 100.95 cases per 100,000 person-years. The median age at onset of myocardial infarction was 61.5 years. Among patients with a record of myocardial infarction, 32.2% were female, 49.3% were white and the majority (66.7%) belonged to IMD quintiles 1–2 (most deprived, as opposed to quintiles 3–5, less deprived). The main LTCs associated with myocardial infarction were hypertension (67.2%) and diabetes (39.2%), and the median number of LTCs associated with myocardial infarction was 5 (IQR: 3). The median follow-up was 9.6 years; 61.7% of patients were censored or died at end of follow-up (39.3%) (Table 2).

Fig. 2

Study workflow

Table 3 Distribution of the 26 analyzed comorbidities in myocardial infarction patients across defined clusters

After pairwise patients’ dissimilarity computation and clustering, an 11-cluster partition was retained. This partition size corresponds to a local maximum of the point biserial correlation for a range extending from two to 14 clusters [21].

Typology annotation

Clusters were ordered by decreasing frequency and numbered from 1 to 11. Cluster 1 represents 1017 patients (20.0%) and cluster 11 represents 172 patients (3.4%). Table 2 displays the distribution of socio-demographic variables and risk factors across clusters. Alternatively, Fig. 3 displays the multivariate log-odds ratios of socio-demographic variables and risk factors (upper panel) and LTCs (lower panel) across clusters. Additionally, a graphical representation of time at onset of myocardial infarction and highly associated LTCs is proposed in Fig. 4, and Fig. 5 displays an interpretation of the proposed typology according to clusters’ prevalence of hypertension and diabetes. Finally, Table 4 displays the number of sequences of length 2 across the different clusters and details the sequences shared by at least 30% of patients in the corresponding cluster, as well as the percentage of patients following each sequence in that cluster.

Fig. 3

Log-odds ratios of socio-demographic variables and risk factors (upper panel), and acute and long-term conditions (lower panel) across clusters: values are derived from multivariate logistic regressions where clusters’ indicators were explained by displayed variables. Positive log-odds ratios indicate an over-representation of the corresponding traits in a given cluster as compared to other analysed patients. Estimates associated with P values greater than 5% were omitted

Fig. 4

Graphical representation of clusters: long-term conditions are represented by colored dots. Each dot’s x-coordinate represents the median age at onset of the given long-term condition. Dots are arranged on the y axis in layers, such that no significant transitions exist between dots from the same layer. Significant transitions are represented by edges oriented from a higher level to a lower level

Fig. 5

Graphical typology annotation: The typology is divided into two main branches. The conventional cardiovascular risk factors branch includes clusters associated with hypertension (clusters 2, 4 and 5) as well as clusters associated with diabetes (clusters 6 and 10). The other main branch is composed of clusters characterised by non-conventional risk factors (clusters 1, 3, 7, 8, 9 and 11), among which clusters 7 and 9 present a lower prevalence of hypertension and a lower number of recorded long-term conditions

Table 4 Main sequences of length 2 across clusters

Clusters examination

Clusters showed distinct demographic characteristics and levels of the conventional CVRFs such as age at onset (clusters 5 and 10), hypertension (clusters 2, 4 and 5) and diabetes (clusters 6 and 10). Conversely, other clusters were highly associated with specific non-conventional CVRFs, such as a higher number of recorded LTCs (clusters 1, 3 and 11), mental health conditions (cluster 3), osteoarthritis (cluster 8) and asthma (cluster 11). Finally, other clusters (7 and 9) were characterised by a significantly lower prevalence of hypertension and/or diabetes, a lower number of recorded LTCs and a lower prevalence of mental health conditions. Non-conventional CVRFs associated with these clusters were peripheral vascular disease and COPD (cluster 7) and atrial fibrillation (cluster 9) (Tables 2, 3 and Fig. 5).

Conclusions

We have presented in this paper a novel approach to the unsupervised analysis of multiple time-to-event records denoted as multiple state analysis with application to the analysis of electronic health records in patients with myocardial infarction.

Similar to clustering methodologies developed in other fields of research [25,26,27], multiple state analysis follows the general principle of cluster analysis, where individuals are grouped according to a systematic rule. However, unlike the analysis of sequences developed in the social sciences, in which multiple aspects of the social experience, such as marital and employment status, are jointly analysed [26, 27], the proposed method does not deal primarily with constructed sequences of data but with the binary coding of individual records into state matrices and associated follow-up indicators. Importantly, although meaningful sequences of LTCs may be derived from the resulting clusters, electronic health records and associated state matrices are not sequences of events.

This aspect is important in the context of epidemiological research, where patients’ states of health are often characterized by the accrual of two or more LTCs that may occur simultaneously or within a short period of time [28]. In this setting, the simple record of events and the subsequent analysis of resulting sequences may be misleading and fail to fully render the evolution of patients’ states of health. This limitation is due to i) the multiplicity of sequences that can be derived from a unique set of electronic health records, ii) the non-univocal interpretation of electronic health records in the presence of ties, and iii) the non-mutual exclusivity of events under investigation, such as the onset of multiple LTCs. On the other hand, the definition and use of state matrices make it possible to conveniently measure distances between patients without reducing patients’ health trajectories to a particular sequence. This makes it possible to handle an arbitrarily large number of potentially concomitant and non-exclusive LTCs without making a priori choices regarding the importance of input variables, as is the case in sequence analyses.

Other important aspects of unsupervised clustering methods involve the choice of a dissimilarity metric and a clustering algorithm. In our study, we proposed the Jaccard dissimilarity index associated with Ward’s hierarchical clustering approach. Also referred to as the binary metric, the Jaccard index makes it possible to conveniently compare multiple instances based on binary attributes such as state indicators. In the clinical setting, the Jaccard index also provides a meaningful epidemiological interpretation: it represents, for two patients and for a given long-term condition, the number of time units these patients spent in different states of health relative to the period during which either patient presented a record of the investigated disease. Conversely, the longer two patients remained simultaneously in the same state of health, the smaller the Jaccard index and the more likely these patients are to be co-clustered, which meets the assigned objectives. Another interesting feature associated with the use of the Jaccard index is a certain robustness to censored measurements. This is due to the implicit censoring of periods simultaneously free from a given condition when computing the index between two patients, much like when one of them is actually censored. Ward’s method, used in this paper, aims at minimizing the within-cluster dissimilarity, often results in compact and well-defined clusters [17], and represents a consensual choice in clustering analyses [26, 27, 29, 30].

In an application of multiple state analysis to multimorbidity associated with myocardial infarction, we analysed 5087 patients with respect to up to 26 LTCs and various follow-up times. Following our strategy, patients were mapped back to a limited number of clusters, characterised by distinct patterns of multimorbidity and sequences of LTCs. Interestingly, the resulting clusters also had distinctive socio-demographic characteristics and levels of the conventional CVRFs such as age at onset of myocardial infarction (clusters 5 and 10), hypertension (clusters 2, 4 and 5) and diabetes (clusters 6 and 10). Conversely, other clusters were highly associated with specific non-conventional risk factors, such as a higher number of recorded LTCs (clusters 1, 3 and 11), mental health conditions (cluster 3), osteoarthritis (cluster 8) and asthma (cluster 11). Of note, two clusters (7 and 9) were characterised by a significantly lower prevalence of hypertension and/or diabetes, a lower number of recorded LTCs and a lower prevalence of mental health conditions. Non-conventional risk factors associated with these clusters were peripheral vascular disease and COPD (cluster 7) and atrial fibrillation (cluster 9).

Associations between these non-conventional risk factors and myocardial infarction have been previously reported with various levels of evidence [31,32,33,34,35]. Conventional assumptions regarding these possible associations include i) the shared risk factors hypotheses [31, 36], ii) the complications induced by long-term exposure to pharmacotherapies used in the treatment of these LTCs, such as non-steroidal anti-inflammatory drugs used in the treatment of osteoarthritis [37], beta-2 agonists and systemic corticosteroids used in the treatment of asthma [38,39,40] and antipsychotic drugs used in the treatment of mental health disorders [41], and iii) induced restrictions on physical activity [36], which in turn are associated with increased levels of CVRFs such as hypercholesterolaemia, hypertension and diabetes.

Examination of the typology created using MSCA shows that the resulting clusters of patients were discriminated not only in terms of the main LTCs driving myocardial infarction but also with respect to major socio-demographic variables including age, gender and ethnicity, as well as other CVRFs such as hypercholesterolaemia or ever-smoking status. Therefore, assuming that different early-life lifestyles and exposures, possibly mediated by genotype and environmental factors, result in distinct patterns of multimorbidity that are traceable through the continuous record of health conditions, our results support both the environmental and the genotype/environment hypotheses of life course epidemiology. From a public health perspective, however, underlying models are not of prime importance, as public health policies would rather focus on modifiable risk factors. In this regard, although multiple state clustering analysis presents itself as a relevant methodological approach to electronic health records, allowing patients’ stratification in terms of both the main driving LTCs and associated socio-demographic variables, further research is needed. This includes a deeper assessment of the typologies resulting from this methodology, particularly in terms of their clinical and public health utility, to better evaluate its potential for informing patient-oriented health policies.

Data availability

Restrictions apply to the availability of the data supporting the findings of this study. The data were used under license, following project approval by the Lambeth Clinical Commissioning Group, and therefore are not publicly available. The R library MSCA, which implements user-friendly routines for performing MSCA, will soon be available in the CRAN repository (https://cran.r-project.org/).

Abbreviations

LTC: Long-term condition

MSCA: Multiple state clustering analysis

CVRF: Cardiovascular risk factor

IMD: Index of multiple deprivation

References

  1. Elder GH Jr. The life course as developmental theory. Child Dev. 1998;69(1):1–12.

  2. Hill AB, Doll R. Smoking habits of doctors. Br Med J. 1957;2(5054):1173.

  3. Rutter M, Sroufe LA. Developmental psychopathology: Concepts and challenges. Dev Psychopathol. 2000;12(3):265–96.

  4. Brown GW, Harris T. Social origins of depression: A study of psychiatric disorder in women. J Psychiatr Res. 1978;12(3):189.

  5. Younossi ZM, Koenig AB, Abdelatif D, Fazel Y, Henry L, Wymer M. Global epidemiology of nonalcoholic fatty liver disease-meta-analytic assessment of prevalence, incidence, and outcomes. Hepatology. 2016;64(1):73–84.

  6. Herrett E, Gallagher AM, Bhaskaran K, Forbes H, Mathur R, van Staa T, Smeeth L. Data resource profile: Clinical Practice Research Datalink (CPRD). Int J Epidemiol. 2015;44(3):827–36.

  7. Carter BD, Abnet CC, Feskanich D, Freedman ND, Hartge P, Lewis CE, Ockene JK, Prentice RL, Speizer FE, Thun MJ, et al. Smoking and mortality-beyond established causes. N Engl J Med. 2015;372(7):631–40.

  8. Roth GA, Huffman MD, Moran AE, Feigin V, Mensah GA, Naghavi M, Murray CJL. Global and regional patterns in cardiovascular mortality from 1990 to 2013. Circulation. 2015;132(17):1667–78.

  9. Jaffe AS. Third universal definition of myocardial infarction. Clin Biochem. 2013;46(1–2):1–4.

  10. Libby P. Mechanisms of acute coronary syndromes and their implications for therapy. N Engl J Med. 2013;368:2004–13.

  11. Yusuf S, Hawken S, Ôunpuu S, Dans T, Avezum A, Lanas F, McQueen M, Budaj A, Pais P, Varigos J, Lisheng L. Effect of potentially modifiable risk factors associated with myocardial infarction in 52 countries (the INTERHEART study): case-control study. Lancet. 2004;364:937–52.

  12. Anand SS, Islam S, Rosengren A, Franzosi MG, Steyn K, Yusufali AH, Keltai M, Diaz R, Rangarajan S, Yusuf S, INTERHEART Investigators. Risk factors for myocardial infarction in women and men: insights from the INTERHEART study. Eur Heart J. 2008;29:932–40.

  13. Nielsen M, Andersson C, Gerds TA, Andersen PK, Jensen TB, Køber L, Gislason G, Torp-Pedersen C. Familial clustering of myocardial infarction in first-degree relatives: a nationwide study. Eur Heart J. 2013;34:1198–203.

  14. Kaufman L, Rousseeuw PJ. Finding Groups in Data: An Introduction to Cluster Analysis. 1st ed. Hoboken: Wiley; 2009.

  15. Jaccard P. Étude comparative de la distribution florale dans une portion des alpes et des jura. Bull Soc Vaudoise Sci Nat. 1901;37:547–79.

  16. Ward JH Jr. Hierarchical grouping to optimize an objective function. J Am Stat Assoc. 1963;58(301):236–44.

  17. Murtagh F, Legendre P. Ward’s hierarchical agglomerative clustering method: which algorithms implement ward’s criterion? J Classif. 2014;31(3):274–95.

  18. Lance GN, Williams WT. A general theory of classificatory sorting strategies: 1. hierarchical systems. Comput J. 1967;9(4):373–80.

  19. Müllner D. fastcluster: Fast hierarchical, agglomerative clustering routines for r and python. J Stat Softw. 2013;53(9):1–18.

  20. Müllner D. Modern hierarchical, agglomerative clustering algorithms. arXiv preprint arXiv:1109.2378. 2011.

  21. Milligan GW, Cooper MC. An examination of procedures for determining the number of clusters in a data set. Psychometrika. 1985;50:159–79.

  22. Sugiyama K, Tagawa S, Toda M. Methods for visual understanding of hierarchical system structures. IEEE Trans Syst Man Cybern. 1981;11(2):109–25.

  23. R Core Team. R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing; 2023.

  24. Delord M. MSCA: Clustering of multiple censored time-to-event endpoints. R package version 1.0 (beta version); 2025. Available at https://cran.r-project.org/package=MSCA.

  25. Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci. 1998;95(25):14863–8.

  26. Pollock G. Holistic trajectories: a study of combined employment, housing and family careers by using multiple-sequence analysis. J R Stat Soc Ser A (Stat Soc). 2007;170(1):167–83.

  27. Gauthier J-A, Widmer ED, Bucher P, Notredame C. Multichannel sequence analysis applied to social science data. Sociol Methodol. 2010;40(1):1–38.

  28. Barnett K, Mercer SW, Norbury M, Watt G, Wyke S, Guthrie B. Epidemiology of multimorbidity and implications for health care, research, and medical education: a cross-sectional study. Lancet. 2012;380:37–43.

  29. Ng SK, Tawiah R, Sawyer M, Scuffham P. Patterns of multimorbid health conditions: a systematic review of analytical methods and comparison analysis. Int J Epidemiol. 2018;47(5):1687–704.

  30. Delord M, Sun X, Learoyd A, Curcin V, Wolfe C, Ashworth M, Douiri A. Patient-oriented unsupervised learning to uncover the patterns of multimorbidity associated with stroke using primary care electronic health records. BMC Prim Care. 2024;25(1):419.

  31. Hall AJ, Stubbs B, Mamas MA, Myint PK, Smith TO. Association between osteoarthritis and cardiovascular disease: systematic review and meta-analysis. Eur J Prev Cardiol. 2016;23(9):938–46.

  32. Iribarren C, Tolstykh IV, Miller MK, Sobel E, Eisner MD. Adult asthma and risk of coronary heart disease, cerebrovascular disease, and heart failure: a prospective study of 2 matched cohorts. Am J Epidemiol. 2012;176(11):1014–24.

  33. Bang DW, Wi CI, Kim EN, Hagan J, Roger V, Manemann S, Lahr B, Ryu E, Juhn YJ. Asthma status and risk of incident myocardial infarction: a population-based case-control study. J Allergy Clin Immunol Pract. 2016;4(5):917–23.

  34. Nicholson A, Kuper H, Hemingway H. Depression as an aetiologic and prognostic factor in coronary heart disease: a meta-analysis of 6362 events among 146 538 participants in 54 observational studies. Eur Heart J. 2006;27(23):2763–74.

  35. Roest AM, Martens EJ, de Jonge P, Denollet J. Anxiety and risk of incident coronary heart disease: a meta-analysis. J Am Coll Cardiol. 2010;56(1):38–46.

  36. Stubbs B, Hurley M, Smith T. What are the factors that influence physical activity participation in adults with knee and hip osteoarthritis? a systematic review of physical activity correlates. Clin Rehabil. 2015;29(1):80–94.

  37. Schmidt M. Cardiovascular risks associated with non-aspirin non-steroidal anti-inflammatory drug use. Dan Med J. 2015;62:pii–B4987.

  38. Appleton SL, Ruffin RE, Wilson DH, Taylor AW, Adams RJ, North West Adelaide Cohort Health Study Team, et al. Cardiovascular disease risk associated with asthma and respiratory morbidity might be mediated by short-acting β2-agonists. J Allergy Clin Immunol. 2009;123(1):124–30.

  39. Zeiger R, Sullivan P, Chung Y, Kreindler JL, Zimmerman NM, Tkacz J. Systemic corticosteroid-related complications and costs in adults with persistent asthma. J Allergy Clin Immunol Pract. 2020;8(10):3455–65.

  40. Cazzola M, Rogliani P, Calzetta L, Matera MG. Bronchodilators in subjects with asthma-related comorbidities. Respir Med. 2019;151:43–8.

  41. Nielsen RE, Banner J, Jensen SE. Cardiovascular disease in patients with severe mental illness: a review. Nat Rev Cardiol. 2021;18(2):136–45.

Acknowledgements

We dedicate this work to the memory of Professor Mark Ashworth, whose unwavering support and encouragement profoundly shaped our efforts.

Funding

This project is funded by King’s Health Partners / Guy’s and St Thomas Charity ‘MLTC Challenge Fund’ (grant number EIC180702) and supported by the National Institute for Health and Care Research (NIHR) under its Programme Grants for Applied Research (NIHR202339).

Author information

Contributions

Conceptualization: Abdel Douiri and Marc Delord; Formal analysis and investigation: Marc Delord; Writing - original draft preparation: Marc Delord; Writing - review and editing: Abdel Douiri; Funding acquisition: Abdel Douiri; Supervision: Abdel Douiri. Both authors approved the final manuscript.

Corresponding author

Correspondence to Marc Delord.

Ethics declarations

Ethics approval and consent to participate

This study was conducted in accordance with the Declaration of Helsinki. Ethics approval was granted by the Health Research Authority (HRA) ethics committees (IRAS: 282174) and the Confidentiality Advisory Group (CAG: 22/CAG/0022), as well as the local Research Ethics Committees (RECs) at Guy’s and St Thomas’ NHS Foundation Trust (London, UK), King’s College Hospital NHS Foundation Trust (London, UK), Queen Square (London, UK), and Westminster Hospital (London, UK) (REC: 22/SC/0043).

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Delord, M., Douiri, A. Multiple states clustering analysis (MSCA), an unsupervised approach to multiple time-to-event electronic health records applied to multimorbidity associated with myocardial infarction. BMC Med Res Methodol 25, 32 (2025). https://doi.org/10.1186/s12874-025-02476-7

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s12874-025-02476-7

Keywords