Incorporating latent survival trajectories and covariate heterogeneity in time-to-event data analysis: a joint mixture model approach

Liang, Fu-Wen; Chan, Wenyaw; Swartz, Michael D.; Dabaja, Bouthaina S.

doi:10.1186/s12874-025-02580-8

Research
Open access
Published: 14 May 2025

Incorporating latent survival trajectories and covariate heterogeneity in time-to-event data analysis: a joint mixture model approach

Fu-Wen Liang^1,2,3,
Wenyaw Chan⁴,
Michael D. Swartz⁴ &
…
Bouthaina S. Dabaja⁵

BMC Medical Research Methodology volume 25, Article number: 132 (2025) Cite this article

343 Accesses
Metrics details

Abstract

Background

Finite mixture models have been recently applied in time-to-event data to identify subgroups with distinct hazard functions, yet they often assume differing covariate effects on failure times across latent classes but homogeneous covariate distributions. This study aimed to develop a method for analyzing time-to-event data while accounting for unobserved heterogeneity within a mixture modeling framework.

Methods

A joint model was developed to incorporate latent survival trajectories and observed information for the joint analysis of time-to-event outcomes, correlated discrete and continuous covariates, and a latent class variable. It assumed covariate effects on survival times and covariate distributions vary across latent classes. Unobservable trajectories were identified by estimating the probability of belonging to a particular class based on observed information. This method was applied to a Hodgkin lymphoma study, identifying four distinct classes in terms of long-term survival and distributions of prognostic factors.

Results

Results from simulation studies and the Hodgkin lymphoma study demonstrated the superiority of our joint model compared with the conventional survival model. Four unobserved subgroups were identified, each characterized by distinct survival parameters and varying distributions of prognostic factors. A notable decreasing trend in the incidence of second malignancy over time was noted, along with different effects of second malignancy and relapse on survival across subgroups, providing deeper insights into disease progression over time.

Conclusions

The proposed joint model effectively identifies latent subgroups, revealing unobserved heterogeneity in survival outcomes and prognostic factors. Its flexibility enables more precise estimation of survival trajectories, with broad applicability in survival analysis.

Peer Review reports

Background

Mixture models have been used to model time-to-event data in situations where a homogeneous probability density function cannot adequately account for the heterogeneity of failure time [1,2,3,4,5,6,7,8,9]. Mixture distributions are generally applied to three situations. The first situation is the analysis of competing risks survival data with masked cause of death. The overall hazard function can be modeled as a weighted sum of the cause specific death density conditional on the type of failure, where the weight is given by the partial information about the prior probability of the corresponding type of failure [10, 11]. The second situation is the assessment of the proportion of patients cured of disease [12]. The third situation is modeling a disease with a multi-phase progression, where the time to failure in each phase follows a different parametric distribution [2, 13]. In cancer survival analysis, there has been increasing attention to these models. However, all aforementioned models assume that the effects of the covariates involved in failure times differ across latent classes, but the covariate distributions are homogeneous. In the situation where some covariate information that is important to the outcome of interest is unknown, the assumption of a homogeneous covariate distribution would be problematic.

As a motivating example, we consider a Hodgkin lymphoma (HL) study. Over the past 50 years, curative therapies for HL have evolved significantly with the introduction of new drugs, treatment modalities, and diagnostic procedures. Many HL studies have used diagnostic time periods as a surrogate variable to assess the impact of these therapeutic advancements on mortality. Specifically, since 1980, significant progress has been made: ABVD chemotherapy was established as a standard of care in the 1980s, combined modality therapy was refined in the 1990s to reduce long-term toxicity, and advanced imaging techniques such as PET scans and risk-adapted treatment strategies were developed in the 2000s [14]. Patients diagnosed in different time periods would have undergone varying treatment protocols leading to substantial heterogeneity within the population. This variation introduces bias into estimates if treatment heterogeneity is not properly accounted for. Long-term complications from different treatments further increase the uncertainty regarding how prognostic factors-such as genetic and environment interactions- affect survival. Therefore, it is essential to investigate both the long-term and late effects of treatment, specifically focusing on the patterns of unobserved heterogeneity that arise from these differing treatment eras. In this context, assuming a homogeneous covariate distribution across diagnostic time periods as the latent class would be unrealistic, given the variability in long-term outcomes and the complex interactions between prognostic factors across treatment eras.

To directly address the research question concerning unobserved heterogeneity, we need a model that is capable of both grouping survival patterns into meaningful latent classes and estimating the effects of covariates on the survival trajectories and their association with the distal outcome. We propose a joint model consisting of a mixture survival model, mixture covariate distributions, and a latent class variable. Two simulation studies are presented to examine the performance of the proposed model and compare it with the common mixture survival model. We apply this new model to an HL study to explore unobserved heterogeneity and to examine the relationship between survival outcome and important prognostic factors.

Methods

In this paper, we describe the formulation of our joint model for a time-to-event outcome, incorporating both continuous and discrete mixture covariates within a mixture modeling framework. This model integrates a mixture survival model with a mixture of joint discrete distributions and a mixture of multivariate normal distributions. The mixture model characterizes subpopulations that exhibit heterogeneity not only in their survival time but also in the distributions of their prognostic variables, which include correlated continuous and discrete variables.

Assume that there are C unobserved classes in the population and each of N individuals is taken from one of the C subgroups. Let $\left( {Z_{i1} ,Z_{i2} , \ldots ,{\text{Z}}_{iC} } \right)^{t}$ represent hidden binary random variables indicating the subpopulation that subject $i$ is taken from. Then $\left( {Z_{i1} ,Z_{i2} , \ldots ,{\text{Z}}_{iC} } \right)^{t}$ follows a multinomial distribution with parameters $\left( {\pi_{1} ,\pi_{2} , \ldots ,\pi_{C} } \right)^{t}$. If a subject $i$ belongs to the $j^{th}$ class, then ${\text{Z}}_{ij} = 1$ and ${\text{Z}}_{ik} = 0$ for $k \ne j$. For each individual $i$$(i = 1, \ldots ,N)$, the continuous survival time $t$, the failure indicator $\delta$, discrete covariates ${\mathbf{x}}$, and continuous covariates ${\mathbf{y}}$ are observed. Suppose that there are G binary covariates and H normally distributed continuous covariates. The function of parameters $\eta_{lj}$ and random effects $a_{i}$ is the class-specific probability for the binary covariate $x_{l}$, where $\gamma a_{i}$ is the subject-specific random effect used to allow for correlation between the binary covariates and the continuous covariates, and $a_{i}$is assumed to follow a normal distribution with mean 0 and variance 1.

$$p_{ilj} = \frac{{e^{{\eta_{lj} + \gamma a_{i} }} }}{{1 + e^{{\eta_{lj} + \gamma a_{i} }} }},i = 1, \ldots ,N,l = 1, \ldots ,G,{\text{ j}} = 1, \ldots ,C$$

We further assume that if a subject $i$ belongs to the $j^{th}$ class, conditional on class membership $Z_{ij}$ and the subject’s characteristic $a_{i}$, the probability distribution for $G$ binary covariates is

$${\text{Pr}}_{X|Z}\left({\text{X}}|{Z}_{ij}=1,{a}_{i}\right)=\prod_{l=1}^{G}{p}_{ilj}^{{x}_{li}}{q}_{ilj}^{{1-x}_{li}}\left[1+{\rho }_{x}\sum_{l<m}({{p}_{ilj}{q}_{ilj}{p}_{imj}{q}_{imj})}^{-1/2}\left({x}_{li}-{p}_{ilj}\right)\left({x}_{mi}-{p}_{imj}\right)\right]$$

(1)

where $x_{li} = 0,1,$ and $q_{ilj} = 1 - p_{ilj} , \, j = 1, \ldots ,C, \,$ is a function of the parameter $\eta_{lj}$. Note that $\rho_{x}$ denotes the correlation between two x’s and is constrained to ensure the correlation matrix of the binary variables satisfies positive semidefiniteness, as described in [15]. This ensures that the pairwise correlations are consistent with a valid joint probability distribution. We simplify Bahadur’s general form of a G-dimensional binary distribution [15] by including only the correlation between any two component covariates and assuming all higher-level correlations of Bahadur’s to be 0. Another advantage of this approach is that it permits a negative correlation between two component covariates.

Similarly, for $H$ normally distributed continuous variables ${\mathbf{y}}$, the conditional distribution is

$$f_{{{\mathbf{Y}}|Z}} \left( {{\mathbf{y}}|Z_{ij} = 1,a_{i} } \right) = \frac{1}{{(2\pi )^{H/2} |\Sigma_{j} |^{1/2} }} \times \exp \left[ {\frac{ - 1}{2}\left( {{\mathbf{y}}_{{\mathbf{i}}} - {{\varvec{\upmu}}}_{{\mathbf{j}}} - a_{i} \xi } \right)^{T} \Sigma_{j}^{ - 1} \left( {{\mathbf{y}}_{{\mathbf{i}}} - {{\varvec{\upmu}}}_{{\mathbf{j}}} - a_{i} \xi } \right)} \right]$$

(2)

where ${{\varvec{\upxi}}} = \left( {\xi_{1} ,\xi_{2} , \cdots ,\xi_{H} } \right)$ is an H-dimensional vector of constants used to allow various variances of the random effects and various correlation structures for the components of ${\mathbf{y}}$, ${{\varvec{\upmu}}}_{{\mathbf{j}}}$ denotes the class-specific H-dimensional mean vector, $a_{i} \xi$ is the subject-specific random effect created to account for the correlation between ${\mathbf{y}}$ and the binary covariates and $\Sigma_{j}$ is the class-specific covariance matrix, for class j.

The observed continuous time, $t$, is the time to a certain event. The survival times are right-censored for individuals who do not have an event occur by the end of the study. The term $\delta$ is an indicator variable where $\delta = 1$ when an event is observed and $\delta = 0$ when the event is not observed and, thus, a right-censored time is observed. Assuming that the survival times follow a Weibull distribution, then the class-specific distribution of the survival times for individuals in latent class $j$ are characterized by a shape parameter $\alpha_{j}$, a scale parameter $\lambda_{j}$, and the covariate coefficients ${{\varvec{\upbeta}}}_{{{\mathbf{Gj}}}}$ for discrete covariates ${\mathbf{x}}_{{\mathbf{i}}}$, and ${{\varvec{\upbeta}}}_{{{\mathbf{Hj}}}}$ for continuous covariates ${\mathbf{y}}_{{\mathbf{i}}}$. Note that all parameters in the distribution are assumed to be class-specific. The hazard function for the survival time of the $i^{th}$ individual taken from the $j^{th}$ class is of the form

$$h_{j} \left( {t|{\mathbf{x}}_{{\mathbf{i}}} {\mathbf{,y}}_{{\mathbf{i}}} } \right){ = }\left( {\lambda_{j} \alpha_{j} t_{i}^{{\alpha_{j} - 1}} } \right) \times \exp \left( {{\mathbf{\beta^{\prime}}}_{{{\mathbf{Gj}}}} {\mathbf{x}}_{{\mathbf{i}}} + {\mathbf{\beta^{\prime}}}_{{{\mathbf{Hj}}}} {\mathbf{y}}_{{\mathbf{i}}} } \right), \, j = 1, \ldots ,C$$

(3)

where ${\mathbf{x}}_{{\mathbf{i}}}$ is a $G \times 1$ random vector and ${{\varvec{\upbeta}}}_{{{\mathbf{Gj}}}}$ is the vector of corresponding class-specific effects for discrete prognostic variables, ${\mathbf{y}}_{{\mathbf{i}}}$ is a $H \times 1$ random vector and ${{\varvec{\upbeta}}}_{{{\mathbf{Hj}}}}$ is the vector of corresponding class-specific effects for continuous prognostic variables. The survival function for time to failure is

$$S_{j} \left( {t|{\mathbf{x}}_{{\mathbf{i}}} {\mathbf{,y}}_{{\mathbf{i}}} } \right) = {\text{exp}}\left[ { - \lambda_{j} t_{i}^{{\alpha_{j} }} \exp \left( {{\mathbf{\beta^{\prime}}}_{{{\mathbf{Gj}}}} {\mathbf{x}}_{{\mathbf{i}}} + {\mathbf{\beta^{\prime}}}_{{{\mathbf{Hj}}}} {\mathbf{y}}_{{\mathbf{i}}} } \right)} \right]$$

(4)

The conditional probability that individual $i$ belongs to class $j$ is given by

$$\Pr \left( {Z_{ij} = 1\left| {t_{i} ,\delta_{i} ,{\mathbf{x}}_{{\mathbf{i}}} ,{\mathbf{y}}_{{\mathbf{i}}} ;{{\varvec{\uptheta}}}_{{\mathbf{j}}} } \right.} \right) = \int\limits_{ - \infty }^{\infty } {\frac{{\pi_{j} \Pr \left( {t_{i} ,\delta_{i} ,{\mathbf{x}}_{{\mathbf{i}}} ,{\mathbf{y}}_{{\mathbf{i}}} \left| {Z_{ij} = 1,a_{i} } \right.} \right)}}{{\sum\limits_{l = 1}^{C} {\pi_{j} \Pr \left( {t_{i} ,\delta_{i} ,{\mathbf{x}}_{{\mathbf{i}}} ,{\mathbf{y}}_{{\mathbf{i}}} \left| {Z_{il} = 1,a_{i} } \right.} \right)} }}\frac{1}{{\sqrt {2\pi } }}} {\text{exp}}\left( { - \frac{{a_{i}^{2} }}{2}} \right)da_{i}$$

(5)

where $\pi_{j}$ is the proportion of individuals belonging to class $j$, and ${{\varvec{\uptheta}}}_{{\mathbf{j}}}$ is the vector of all class-specific parameters in the model, consisting of $\alpha_{j} {, }\lambda_{j} {, }{{\varvec{\upbeta}}}_{{{\mathbf{Gj}}}} {, }{{\varvec{\upbeta}}}_{{{\mathbf{Hj}}}} {, }\eta_{lj} {, }\Sigma_{j} {\text{, and }}{{\varvec{\upmu}}}_{{\mathbf{j}}} .$

Under our hypothesis that important covariates and their corresponding distributions affect both the survival and the probability that subjects belong to a specific class, the complete-data likelihood function can be expressed as

$$\begin{gathered} L_{c} \left( {{\varvec{\Psi}}} \right) = \prod\limits_{i = 1}^{N} {f\left( {t_{i} ,\delta_{i} ,{\mathbf{x}}_{i} {\mathbf{,y}}_{i} ;{{\varvec{\Psi}}}} \right)} = \prod\limits_{i = 1}^{N} {\prod\limits_{j = 1}^{C} {\left[ {\pi_{j} f\left( {t_{i} ,\delta_{i} ,{\mathbf{x}}_{i} {\mathbf{,y}}_{i} ;{{\varvec{\uptheta}}}_{{\mathbf{j}}} } \right)} \right]} }^{{Z_{ij} }} \hfill \\ = \prod\limits_{i = 1}^{N} {\prod\limits_{j = 1}^{C} {\left[ \begin{gathered} \pi_{j} \hfill \\ \times \left[ {\lambda_{j} \alpha_{j} t_{i}^{{\alpha_{j} - 1}} \exp ({\mathbf{\beta^{\prime}}}_{{{\mathbf{Gj}}}} {\mathbf{x}}_{{\mathbf{i}}} + {\mathbf{\beta^{\prime}}}_{{{\mathbf{Hj}}}} {\mathbf{y}}_{{\mathbf{i}}} )} \right]^{{\delta_{i} }} \exp \left[ { - \lambda_{j} t_{i}^{{\alpha_{j} }} \exp ({\mathbf{\beta^{\prime}}}_{{{\mathbf{Gj}}}} {\mathbf{x}}_{{\mathbf{i}}} + {\mathbf{\beta^{\prime}}}_{{{\mathbf{Hj}}}} {\mathbf{y}}_{{\mathbf{i}}} )} \right] \hfill \\ \times \int\limits_{ - \infty }^{\infty } {\prod\limits_{l = 1}^{G} \begin{gathered} p_{lj}^{{x_{li} }} q_{lj}^{{1 - x_{li} }} \left[ {1 + \rho_{x} \sum\limits_{l < k} {(p_{lj} q_{lj} p_{kj} q_{kj} )^{ - 1/2} (x_{li} - p_{lj} )} (x_{ki} - p_{kj} )} \right] \hfill \\ \frac{1}{{\sqrt {2\pi } }}{\text{exp}}\left( { - \frac{{a_{i}^{2} }}{2}} \right){\text{da}}_{i} \hfill \\ \end{gathered} } \hfill \\ \times \int\limits_{ - \infty }^{\infty } \begin{gathered} \frac{1}{{(2\pi )^{H/2} |\Sigma_{j} |^{1/2} }} \times \exp \left[ {\frac{ - 1}{2}\left( {{\mathbf{y}}_{{\mathbf{i}}} - {{\varvec{\upmu}}}_{{\mathbf{j}}} - a_{i} {{\varvec{\upxi}}}} \right)^{T} \Sigma_{j}^{ - 1} \left( {{\mathbf{y}}_{{\mathbf{i}}} - {{\varvec{\upmu}}}_{{\mathbf{j}}} - a_{i} {{\varvec{\upxi}}}} \right)} \right] \hfill \\ \frac{1}{{\sqrt {2\pi } }}{\text{exp}}\left( { - \frac{{a_{i}^{2} }}{2}} \right){\text{da}}_{i} \hfill \\ \end{gathered} \hfill \\ \end{gathered} \right]^{{Z_{ij} }} } } \hfill \\ \end{gathered}$$

where the vector ${{\varvec{\Psi}}}$ containing all the unknown parameters in the mixture model can be partitioned as $\left( {{{\varvec{\Psi}}}_{1}^{T} ,{{\varvec{\Psi}}}_{2}^{T} } \right)^{T}$, with ${{\varvec{\Psi}}}_{1}^{T} = \left( {\pi_{1} , \cdots ,\pi_{c - 1} } \right)^{T}$ and ${{\varvec{\Psi}}}_{2}^{T}$ is the vector containing all the parameters in ${{\varvec{\uptheta}}}_{{\mathbf{1}}} , \cdots ,{{\varvec{\uptheta}}}_{{\mathbf{c}}}$.

The joint likelihood function is derived for joint modeling of the time to event, its continuous and discrete covariate distributions (and their correlations), and a latent class variable. The first part of the likelihood denotes the proportion of subjects belonging to class $j$. The second part of the likelihood is constructed from the survival component using the expression in Eqs. (3) and (4). The baseline hazard and the effects of covariates on survival vary across different latent classes. The last two parts of the likelihood are from the distributions of binary covariates and continuous covariates using the expressions in Eqs. (1) and (2), respectively. The characteristics of the observed information are assumed to vary across the subgroups. Note that the correlation between any two binary covariates and the correlation between any two continuous covariates are included and in a general form.

Results

Simulation studies

To examine the performance of the proposed method, we conducted a simulation study, generating 1500 data sets with N = 1000 individuals to mirror the sample size of our data example. It is assumed that there were four latent classes and survival time is associated with two bivariate normally distributed covariates and two correlated binary covariates. Each of the latent classes was characterized by covariates that have distinct distributions and are pairwise correlated. A random effect $a_{i}$ following a normal distribution with mean 0 and variance 1 was generated first. Two class-dependent binomially distributed covariates $X_{1} ,X_{2}$ with $E[x_{ilj} ] = p_{ilj} {\text{ and corr}}(x_{1,} x_{2} ) = \rho_{x}$, where $p_{ilj} = \frac{{e^{{\eta_{lj} + \gamma a_{i} }} }}{{1 + e^{{\eta_{lj} + \gamma a_{i} }} }},i = 1, \ldots ,1000,l = 1,2, \, j = 1, \ldots ,4$, were generated. Then two continuous covariates following a bivariate normal distribution with mean $\mu_{lj}$, variance $\sigma_{lj}^{2}$ and correlation, where $l = 1,2, \, j = 1, \ldots ,4$, were generated. Finally, the survival time was generated given the four correlated covariates. The values used for generating the data are listed in Table 2. We first fitted the proposed model with 1 to 5 classes to each data set to demonstrate whether the number of latent classes can be correctly identified using PROC NLMIXED. The Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) were used to determine the best-fitting model. Table 1 shows that the correct class number has the lowest average Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC). In addition, BIC can correctly predict 97.3% of the class number and AIC can correctly predict 74.9% of the class number. Forty datasets in which BIC failed to correctly identify the true number of classes were excluded from subsequent calculations of bias, standard deviations, and coverage rates. This ensures that the reported performance metrics reflect model fit under the correct number of classes and prevents overstating model performance due to mismatches in the number of parameters.

Table 1 Model fitness on the number of latent classes for simulation results

Full size table

For each parameter, the bias, the mean square error, the empirical standard error of the estimate over all simulations, and the coverage rate are reported in Table 2. Across the remaining 1460 (97.3%) replicates, the biases of all estimated coefficients are very small, and the coverage rate for most of parameters is approximately 95%. Notably, the parameter $\lambda$ has the largest mean square error among all parameter estimates. The greater variability may be due to the linear transformation of the Weibull scale parameter. In the Weibull regression model, the covariates are treated as multiplicative modifiers of the underlying scale parameter which is equal to the parameter $\lambda$ multiplied by the function of covariates. Therefore, the accuracy of the covariates is likely to influence the variability of the parameter $\lambda$. To best estimate the true latent classes, we selected the one that has the highest posterior probability of class membership. To further assess the stability of this mapping across simulations, we calculated the Rand Index (RI) over 1500 simulation runs, obtaining a mean RI of 0.95 with a 95% confidence interval of (0.9521,0.9527). This high agreement suggests that the estimated latent class reliably match the true classes, supporting the validity of our bias, standard deviation, and coverage calculations. Additionally, we conducted a simulation study, where the data were analyzed without accounting for the distributions of covariates for each latent class. The results in Table 3 show that for each estimate, the bias is unacceptably large and the coverage is particularly poor. The performance comparison of these two simulation studies further confirms the importance of including the distributions of covariates in the latent class model.

Table 2 Simulation results based on 1500 replications of the proposed joint model incorporating 4 latent classes and distributions of four correlated covariates

Full size table

Table 3 Simulation results based on 1500 replications of the conventional survival mixture model including 4 latent classes only

Full size table

Application to Hodgkin lymphoma data

We illustrate the proposed method using data from a HL study supported in part by Cancer Center Support (core) Grant CA016672 to The University of Texas MD Anderson Cancer Center. The HL study was a retrospective cohort study where 1050 patients diagnosed with HL between January 1970 and June 2009 were included. The median age at diagnosis was 27.5 years and median follow-up time was 14.0 years. In this study population, 46.9% were women, 25.9% experienced a recurrence of the disease, and 15.4% developed a second malignancy. Our objective is to explore unobserved heterogeneity and examine the relationship between time to death and prognostic factors under the framework of mixture modeling. The probability of class membership depends on both observed covariates and the time-to-event outcome. While chemotherapy and radiation therapy were involved, treatment information was incomplete, especially for patients diagnosed in earlier years. Moreover, since 1980, numerous diagnostic and therapeutic advancements have been made, including the establishment of ABVD chemotherapy as a standard of care and the adoption of prognostic factor models to guide clinical trial design and interpretation [14]. Therefore, instead of using treatment as a direct predictor, we used the year of diagnosis as a surrogate variable to assess the impact of these advances on survival. Our preliminary analysis identified that age at diagnosis, relapse, occurrence of a second malignancy, and the time period at diagnosis were significantly associated with survival. Consequently, the joint model includes two continuous variables (age at diagnosis and year of diagnosis) and two discrete variables (occurrence of a second malignancy and relapse). The survival times were assumed to follow a Weibull distribution. We compared two different models: Model One includes two discrete covariates following a correlated binomial distribution, two continuous covariates following a bivariate normal distribution, and a random effect with a normal distribution with mean 0 and variance 1 to model the correlation between discrete covariates and continuous covariates; Model Two, on the other hand, assumes that the correlated discrete covariates and the correlated continuous ones are independent. Both AIC and BIC indicated that Model Two, which accounts for only the two correlations, best fits the data (Model I: AIC = 22160, BIC = 22372; Model II: AIC = 28602, BIC = 28868). In addition, both AIC and BIC suggested that the model with four latent classes provided the best fit to the data (AIC = 22166 and BIC = 22431 for 3 classes; AIC = 22160 and BIC = 22372 for 4 classes; AIC = 22203 and BIC = 22539 for 5 classes). Among 1050 subjects, 274 were classified in Class 1 (26.1%), 286 were in Class 2 (27.2%), 136 were in Class 3 (13.0%), and the rest 354 were in Class4 (33.7%). The results from Model Two are presented in Table 4. Comparing the average age of diagnosis among the four latent classes, patients in Class 4 were diagnosed at the youngest age whereas patients in Class 2 were at the oldest age. Among the four classes, Class 1 shows the smallest estimated proportion of patients who developed a second malignancy. This may be due to an insufficiently long follow-up period for developing a second malignancy. Comparing with the other three groups, Class 3 has the highest percentage of patients developing a second malignancy (46%) and the highest proportion of relapsed patients (38%) which may result from the fact that these patients were treated with the oldest therapy (year of diagnosis = 1975). The results from the survival part of this joint model show that the impact of prognostic factors on survival varies across different classes. The greatest effect of a second malignancy on survival appears in Class 2 (hazard ratio (HR) = 2.11, 95% CI = 1.21–3.69) whereas the greatest impact of relapse was in Class 4 (HR = 26.07, 95%CI = 6.97–68.14). These findings support our hypothesis that higher risk of developing a second malignancy is associated with being diagnosed at younger ages, as well as receiving older treatment.

Table 4 The parameter estimates from the joint analysis of the survival mixture model and distributions of the prognostic factors for the Hodgkin lymphoma study

Full size table

For calculating the predicted survival, we first fit our model to HL data to obtain the group-specific parameter estimates. Then using the group-specific estimates, each subject’s information, and their latent class probabilities as a weight, we calculated the predicted survival. Figure 1 shows the predicted covariate-adjusted survival curve over 30 years of the cohort. The predicted covariate-adjusted survival function obviously distinguishes four subpopulations with respect to the slopes of the survival curves. In general, within the first six years after diagnosis, there is hardly any difference among the survival curves. Class 1 and Class 4 showed a relatively steady decrease in survival. The survival probabilities for Class 2 and Class 3 decreased at a much faster rate 15 years after diagnosis. When comparing patients who had no second malignancy (Fig. 1a) with patients who had a second malignancy (Fig. 1b), we can see a dramatically decreasing survival curve in Class 3 for patients who developed a second malignancy that may be due to the long-term complication caused by the more toxic treatment in the early treatment era. Figure 2 presents the predicted covariate-adjusted survival curve using the common survival mixture model. Within the first twenty years after diagnosis, except for Class 2, there is hardly any difference among the survival curves regardless of the occurrence of a second malignancy. In addition, the survival probability for patients in Class 2 seems underpredicted, especially for patients with a second malignancy. The comparison of Figs. 1 and 2 further demonstrates the importance of taking into consideration the unobserved class-specific covariate distributions. The common method usually ignores the class-specific distribution of the covariate when estimating the probability that subjects belong to a specific class.

Discussion

In this study, we proposed a joint model that includes a time-to-event variable with its continuous and discrete covariates under a framework of mixture modeling to explore the unobserved heterogeneity. To our knowledge, this is the first study that takes the class-specific covariate distributions into consideration under the mixture modeling framework and characterizes subgroups that are heterogeneous not only in their survival time, but in their distributions of prognostic variables. For example, in the HL study, one may model both time to death and occurrence of a second malignancy as primary outcomes jointly linked by a shared random effect where only the correlation of these two outcomes and their separate determinants are examined. But the impact of the distribution of second malignancy on the survival model was not taken into consideration. The proposed model considers the joint distribution of all outcome and covariates in a latent framework. It is like using a weighted likelihood on the survival model in a latent framework weighted on its distributional weight of the covariate vector. The proposed approach simultaneously estimates the effect of covariates on the survival trajectory, their association with the distal outcome, as well as the clustering of survival patterns into latent classes. All correlations between covariates are also taken into consideration. However, for the scenario that the variance of the random effect depends on the latent class, we have a concern about an identifiability problem. The results were poor when performing the simulation in which all the correlations depend on the latent class to allow different correlations for different latent class. Another attractive feature of the joint mixture model is that it can be used to model time to event in the presence of competing risks without making the untenable assumption about the independence of competing risks. The traditional approach to modeling failure time with competing risks is to postulate the existence of so-called latent failure times, $T_{1} {\text{ and }}T_{2}$, corresponding to the two causes and to proceed with the modeling of $T = \min \left( {T_{1} ,T_{2} } \right)$ on the basis that the two causes are independent of each other [10]. As pointed out by Lagakos [16], the assumption of independence of risks seems questionable in most real-life situations. Compared to the latent failure-time approach [17, 18], the postulated component survival functions and mixing proportions of the mixture model are able to be estimated directly from the observed data. Applying the proposed method in the case of competing risk, the survival function of T is modeled as $S\left( {t;x} \right) = \pi_{1} (x)S_{1} (t;x) + \pi_{2} (x)S_{2} (t;x)$, where the $i^{{{\text{th}}}}$ component-survival function $S_{i}$ is the conditional survival function, given that failure is due to the $i^{{{\text{th}}}}$ cause, and $\pi_{i} \left( x \right)$ is the probability of failure from the $i^{{{\text{th}}}}$ cause ($i = 1,2$). Development of flexible parametric inference procedures for modeling competing risks data provides more accurate information, and it may play an important role in improving patient care in clinical settings.

When studying a highly curable disease with late complications from treatment, a certain level of error is introduced. The late complications make it difficult to precisely measure interactions between risk factors. The method developed in this study is capable of accommodating unobservable heterogeneity among individuals and finding the unobserved heterogeneity in terms of survival trajectories and their corresponding characteristics. One of the examples would be a situation in which the unobserved genetic variants (latent classes) are associated with the observed covariates (height, weight, blood type, etc.) and a time-to-event outcome (death). In the HL study, we can imagine that there are four untyped risk genetic variants and their corresponding probabilities of survival after adjusting for potential confounders were shown in Figs. 1 and 2 under different conditions (with versus without second malignancy). The results from the two simulation studies, as well as the comparison of Figs. 1 and 2, demonstrate that the joint model performs better than the conventional survival mixture model when the observed covariates are differently distributed across subgroups. Although there are several strengths in the proposed model, the strategy for maximum likelihood estimation requires more attention. Different starting values and stopping values would lead to quite different parameter estimates via the Expectation–Maximization algorithm, especially when explicit formulas for parameter estimates are unavailable. Local maxima can be another issue when the mixture distribution is highly skewed. In our study, the initial values were obtained separately from the covariate distributions with equal mixing proportions to stabilize algorithm convergence, avoid bias toward specific classes, and ensure computational efficiency, as supported by previous research on the EM algorithm for finite mixture models [19]. This approach provides a neutral and practical starting point that avoids bias toward any specific class while stabilizing the convergence of the Expectation–Maximization algorithm. In addition, when assessing the size of latent classes, we suggest using two or more criteria as one criterion may have some drawbacks in some conditions. For example, AIC tends to overfit models (too many classes), whereas BIC may underestimate the true number of latent groups. In the case when a future subject with only covariates, not the survival information, are observed, we suggest using the general mixture model first to identify the latent class and then do the post-hoc comparison once the survival information is observed. One limitation of our approach is the assumption that latent classes are meaningfully distinct. In cases where two or more classes exhibit highly similar parameter values, they may be considered part of the same latent group, potentially leading to an underestimation of the true number of classes. While this aligns with the principle of parsimony, future studies could explore alternative methods to quantify class similarity and assess the impact of potential class merging on model performance. Additionally, this study assumes specific distributional forms for both the covariates and the time-to-event outcome, which may influence model performance if these assumptions are violated. While we did not conduct additional sensitivity analyses, future research should explore alternative model specifications, such as flexible parametric or non-parametric approaches, to assess robustness.

From a statistical perspective, the joint mixture model has much of the flexibility of nonparametric approaches, while retaining the advantages of parametric approaches, ensuring that the number of parameters down to a reasonable size. From an epidemiological standpoint, this method allows for illuminating the underlying roles prognostic factors play in disease severity and survival. For example, it can be used to identify unobserved subpopulations of patients with diabetes who are particularly susceptible to developing cancer. By linking latent class probabilities to observed covariates and the time to onset of conditions like end-stage renal disease, the identified latent classes can then be incorporated as a covariate in modeling the risk of secondary outcomes, such as cancer.

Conclusions

In conclusion, this joint mixture model can be treated as a personalized risk prediction tool, offering deeper insights into how behavioral, environmental, and genetic factors influence health. Once the model is estimated, a patient’s latent group can be identified, allowing for precise risk predictions tailored to the individual’s unique profile.

Data availability

No datasets were generated or analysed during the current study.

Abbreviations

HL:: Hodgkin lymphoma
HR:: Hazard ratio
AIC:: Akaike Information Criterion
BIC:: Bayesian information criterion

References

De Angelis R, Capocaccia R, Hakulinen T, Soderman B, Verdecchia A. T, Soderman B, Verdecchia A. Mixture models for cancer survival analysis: application to populati. Stat Med. 1999;18:441–54.
Article PubMed Google Scholar
Kuk AYC, Chen C-H. A Mixture Model Combining Logistic Regression with Proportional Hazards Regression. Biometrika. 1992;79:531–41.
Article Google Scholar
McLachlan G, McGiffin D. On the role of finite mixture models in survival analysis. Stat Methods Med Res. 1994;3:211–26.
Article CAS PubMed Google Scholar
Phillips N, Coldman A, McBride ML. Estimating cancer prevalence using mixture models for cancer survival. Stat Med. 2002;21:1257–70.
Article PubMed Google Scholar
Lin H, Turnbull BW, McCulloch CE, Slate EH. Latent Class Models for Joint Analysis of Longitudinal Biomarker and Event Process Data: Application to Longitudinal Prostate-Specific Antigen Readings and Prostate Cancer. J Am Stat Assoc. 2002;97:53–65.
Article Google Scholar
Garre FG, Zwinderman AH, Geskus RB, Sijpkens YWJ. A joint latent class changepoint model to improve the prediction of time to graft failure. J R Stat Soc A Stat Soc. 2008;171:299–308.
Article Google Scholar
Proust-Lima C, Taylor JM. Development and validation of a dynamic prognostic tool for prostate cancer recurrence using repeated measures of posttreatment PSA: a joint modeling approach. Biostatistics. 2009;10:535–49.
Article PubMed PubMed Central Google Scholar
Proust-Lima C, Joly P, Dartigues J-F, Jacqmin-Gadda H. Joint modelling of multivariate longitudinal outcomes and a time-to-event: A nonlinear latent class approach. Comput Stat Data Anal. 2009;53:1142–54.
Article Google Scholar
Kuo JC, Chan W, Leon-Novelo L, Lairson DR, Brown A, Fujimoto K. Latent classification model for censored longitudinal binary outcome. Stat Med. 2024;43(20):3943–57.
Article PubMed Google Scholar
Blackstone EH, Naftel DC Jr, MET. The Decomposition of Time-Varying Hazard Into Phases, Each Incorporating a Separate Stream of Concomitant Information. J Am Stat Assoc. 1986;81:615–24.
Article Google Scholar
Sen A, Banerjee M, Li Y, Noone AM. A Bayesian approach to competing risks analysis with masked cause of death. Stat Med. 2010;29:1681–95.
Article PubMed Google Scholar
Gordon NH. Application of the theory of finite mixtures for the estimation of “cure” rates of treated cancer patients. Stat Med. 1990;9:397–407.
Article CAS PubMed Google Scholar
Larsen K. Joint analysis of time-to-event and multiple binary indicators of latent classes. Biometrics. 2004;60:85–92.
Article PubMed Google Scholar
Connors JM, Noordijk EM, Horning SJ. Hodgkin’s Lymphoma: Basing the Treatment on the Evidence. ASH Education Program Book. 2001;2001:178–93.
Google Scholar
Bahadur RR, COLL. CUNYT. A Representation of the Joint Distribution of Responses to N Dichotomous Items. Defense Technical Information Center, 1959.
Lagakos SW. General right censoring and its impact on the analysis of survival data. Biometrics. 1979;35:139–56.
Article CAS PubMed Google Scholar
Heckman JJ, Honoré BE. The Identifiability of the Competing Risks Model. Biometrika. 1989;76:325–30.
Article Google Scholar
Slud EV. Nonparametric identifiability of marginal survival distributions in the presence of dependent competing risks and a prognostic covariate. In: Klein JP, Goel PK, editors. Survival Analysis: State of the Art. Kluwer Academic Publisher; Netherland: 1992. 355–368.
Karlis D, Xekalaki E. Choosing initial values for the EM algorithm for finite mixtures. Comput Stat Data Anal. 2003;41:577–90.
Article Google Scholar

Download references

Acknowledgements

The author is grateful to the late Dr. Carol Etzel for her valuable discussions and support during the preparation of this work.

Funding

This study was funded by the National Science and Technology Council (grant No. MOST 111–2314-B-037–034-MY2 and NSTC 113–2314-B-037–072).

Author information

Authors and Affiliations

Department of Public Health, College of Health Sciences, Kaohsiung Medical University, Kaohsiung, Taiwan
Fu-Wen Liang
Center for Big Data Research, Kaohsiung Medical University, Kaohsiung, Taiwan
Fu-Wen Liang
Department of Medical Research, Kaohsiung Medical University Hospital, Kaohsiung Medical University, Kaohsiung, Taiwan
Fu-Wen Liang
Department of Biostatistics and Data Science, School of Public Health, University of Texas Health Science Center at Houston, Houston, TX, USA
Wenyaw Chan & Michael D. Swartz
Department of Radiation Oncology, University of Texas MD Anderson Cancer Center, Houston, TX, USA
Bouthaina S. Dabaja

Authors

Fu-Wen Liang
View author publications
You can also search for this author inPubMed Google Scholar
Wenyaw Chan
View author publications
You can also search for this author inPubMed Google Scholar
Michael D. Swartz
View author publications
You can also search for this author inPubMed Google Scholar
Bouthaina S. Dabaja
View author publications
You can also search for this author inPubMed Google Scholar

Contributions

FWL and WC developed the theoretical framework, conceptualized the study design, and interpreted the results. BSD provided the Hodgkin lymphoma data. FWL performed data analysis and drafted the manuscript. MDS provided the methodological suggestion. WC supervised the project and critically revised the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Wenyaw Chan.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Liang, FW., Chan, W., Swartz, M.D. et al. Incorporating latent survival trajectories and covariate heterogeneity in time-to-event data analysis: a joint mixture model approach. BMC Med Res Methodol 25, 132 (2025). https://doi.org/10.1186/s12874-025-02580-8

Download citation

Received: 28 September 2024
Accepted: 24 April 2025
Published: 14 May 2025
DOI: https://doi.org/10.1186/s12874-025-02580-8

Your privacy, your choice

Incorporating latent survival trajectories and covariate heterogeneity in time-to-event data analysis: a joint mixture model approach