A comparison of joint models for longitudinal and competing risks data, with application to an epilepsy drug randomized controlled trial

Joint modelling of longitudinal data and competing risks has grown over the past decade. Despite the recent methodological developments, there are still limited options for fitting these models in standard statistical software programs, which prohibits their adoption by applied biostatisticians. We summarize four published models, each of which has software available for model estimation. Each model features a different hazard function, latent association structure between the submodels, estimation approach and software implementation. Of the four models considered here, the model specifications and association structures are substantially different, thus complicating model‐to‐model comparison. The models are applied to the ‘Standard and new anti‐epileptic drugs’ trial of anti‐epileptic drugs to investigate the effect of drug titration on the treatment effects of lamotrigine and carbamazepine on the mode of treatment failure. Notwithstanding the vastly different association structures, we show that the inference from each model is consistent, namely, that there is a beneficial effect of lamotrigine on unacceptable adverse events over carbamazepine and a non‐significant effect on the hazard of inadequate seizure control. The association between anti‐epileptic drug titration and treatment failure was significant in most models. To allow for the routine adoption of joint modelling of competing risks and longitudinal data in the analysis of clinical data sets, further work is required on the development of model diagnostics to aid model choice.


Introduction
In many clinical studies, it is common that both longitudinal measurement data and time-toevent data are collected during follow-up (Ibrahim et al., 2010;Asar et al., 2015). When focus lies on the time-to-event process, so-called naive approaches such as Cox regression with a time varying covariate or application of the two-stage model have historically been applied (Cox, 1972;Tsiatis et al., 1995;Kalbfleisch and Prentice, 2002). However, numerous studies have demonstrated that it is preferable to model both outcomes jointly, particularly when the censoring mechanism is informative or the time varying covariate is measured with error, as efficiency is increased and bias decreased (Ibrahim et al., 2010;Chen et al., 2011). Joint models of longitudinal and time-to-event data can also be exploited to improve inference about longitudinal profiles with informative drop-out (Henderson et al., 2000). This has been an active research area for the past two decades. The literature on this topic is extensive, with comprehensive reviews given by Hogan and Laird (1997), Tsiatis and Davidian (2004) and Gould et al. (2015). Reviews specific to joint latent class models and multivariate longitudinal data are given by Proust-Lima et al. (2012) and Hickey et al. (2016) respectively.
Much of the research has been focused on data with a single event time and a single mode of failure, combined with an assumption of independent censoring of event times (Tsiatis and Davidian, 2004). However, in some situations interest lies with more than one possible cause of event or where the censoring is informative. Such data are termed competing risks data (Bakoyannis and Touloumi, 2012). For example, the time to treatment failure (here defined as the withdrawal of a randomized drug or addition of another) has been recommended by the International League Against Epilepsy to be one of the primary end points for clinical trials of anti-epileptic drugs (AEDs) (Commission on Antiepileptic Drugs, 1998). Patients may decide to switch to an alternative AED, or to begin an additional AED, because of inadequate seizure control (ISC). Alternatively, patients may withdraw from a treatment because of an unacceptable adverse effect (UAE). Overall analysis of treatment failure may miss differential effects of AEDs on the reasons for withdrawal, which may differ in terms of their relative importance for patients (Williamson et al., 2007b).
Radically different approaches have also been proposed for joint modelling of longitudinal and competing risks outcomes. Yu and Ghosh (2010) proposed a joint model comprised of a random-change-point submodel with a mixture time-to-event distribution for the competing risks, which also allowed for interval censoring on one of the event times. Proust-Lima et al. (2016 proposed the use of joint latent class models (JLCMs) instead of the classical shared random-effects models. Hu et al. (2016) proposed a multiple-imputation technique allowing for longitudinal and time-to-event data to be jointly imputed, thus permitting valid separate model inferences, which in principle is computationally simpler.
Research into joint modelling of longitudinal data and competing risks data has now changed direction away from modelling towards issues such as misspecification of distributional assumptions, model diagnostics and prediction assessment Ko, 2014;Andrinopoulou et al., 2017;Blanche et al., 2015). However, penetration of this relatively new modelling framework into practical use is precluded by a lack of awareness and understanding between the different existing models that are available and the ability to fit the models by using standard statistical software packages-an issue pertinent to univariate joint modelling as well (Gould et al., 2015). In this paper, we juxtapose four joint models of longitudinal and competing risks outcomes that have been published in recent years. In particular, we examine the submodel specification, latent association structure, model fitting algorithm, parameter uncertainty estimation and software implementation. We focus on these four particular models because each has available software, which enables them to be readily applied to alternative clinical data sets. To the best of our knowledge, this is the first time that multiple joint models for competing risks and longitudinal data have been presented side by side with the objective of enhancing the practical researcher's awareness and understanding between the models and their fitting options. To illustrate each model in turn, we fit them to a data set from one of the largest trials to date comparing AEDs, which was previously analysed by Williamson et al. (2008). In keeping with the analysis that was undertaken by Williamson and colleagues, the data set will be analysed with the objective of investigating the effect of drug titration on the relative effects of a standard and new AED on initial treatment failure, with interest lying in the two competing reasons for failure: a UAE or ISC.
The paper is structured as follows. We describe the data in Section 2. In Section 3 we describe the separate models. In Section 4 we describe the four joint models from the literature and fit these to the AED trial data in Section 5. We conclude with a discussion in Section 6.

Application: epilepsy drug trial data
The 'Standard and new anti-epileptic drugs' (SANAD) study was a non-blinded randomized controlled trial recruiting patients with epilepsy to test AEDs (Marson et al., 2007). This study was registered as an international standard randomized controlled trial, number ISRCTN383 54748 (http://www.isrctn.com/; digital object identifier 10.1186/ISRCTN38354748). Patients were randomized to either the standard treatment (carbamazepine, CBZ) or to lamotrigine, LTG, gabapentin, oxcarbazepine or topiramate. Time to treatment failure was a primary outcome of the trial. For the study here, we consider only patients who were assigned to either CBZ or LTG. An important secondary objective was the analysis of competing risks for treatment failure: treatment failure due to either ISC or UAE. The primary analysis of patients with partial epilepsy concluded that the newer drug LTG was superior in terms of treatment failure to CBZ, which had been the standard for many years (Williamson et al., 2007a). LTG was significantly better in terms of withdrawal for a UAE and not significantly worse in terms of withdrawal for ISC. Following publication of the trial results, it became evident that there was concern that differential titration rates may have been to the disadvantage of CBZ (Cross et al., 2007). An AED that is titrated more quickly may bring benefits in terms of seizure control but be more likely to cause adverse effects.
The data set that is analysed here includes 605 patients, comparing CBZ (n = 292) and LTG (n = 313). This data set was first analysed by using a joint model framework by Williamson et al. (2008) and is publicly available in the R package joineR (Philipson et al., 2017). During a maximum follow-up time of 6.6 years (median = 2.9 years), 94 patients withdrew from the randomized drug due to a UAE whereas 120 withdrew due to ISC. Withdrawals due to other reasons were treated as non-informative, and patients were censored at this time. To compare the two AEDs after adjusting for titration rate, dose calibration was first undertaken by standardizing the dose of both drugs relative to the midpoint of its maintenance dose range (Faught, 2007). The maintenance dose recommended in the SANAD trial was independently deemed to be reasonable and the approach to calibration considered sensible. These calibrated doses are taken to be the longitudinal measurements within the competing risks joint model. Patients had an average of 4.6 longitudinal measurements recorded, ranging from 1 to 15. Fig. 1 shows the subject-specific longitudinal profiles for calibrated dose of drugs against reverse time, stratified by drug type and event time mechanism. A large variation across patients in the initial calibrated dose was observed. Additionally, treatment failure due to ISC was seen to occur in patients with increasing calibrated doses, whereas treatment failure due to a UAE seemed to happen following a dose reduction.

Separate submodels
We consider a repeated continuous longitudinal outcome y i .t/ measured at time t for subject i = 1, : : : , m, where m is the total number of subjects in the study, and t = 0 denotes the time of randomization. These measurements are obtained at specific times t ij which can vary between subjects, yielding data y i = {y i .t ij /, j = 1, 2, : : : , n i }. We let T i denote the observed failure time for subject i, taken as T i = min.T Å 1i , T Å 2i , : : : , T Å Gi , C i /, with T Å gi denoting the true failure time of subject i for each event type g = 1, 2, : : : , G, and C i the censoring time. In addition to observing T i , we also observe the event indicator δ i , which equals 0 if censored, or g indicating that subject i failed from cause g. We assume that the censoring mechanism is independent of the event time.

Longitudinal submodel
The standard model for a continuous longitudinal outcome is the Laird and Ware (1982) linear mixed model where μ i .t ij / is the linear predictor term (or mean response): and " ij ∼ N.0, σ 2 / are the measurement errors, which are assumed to be independently and identically distributed. We let X .1/ i .t ij / and Z i .t ij / be time varying vectors of covariates for subject i associated with fixed, β .1/ , and subject-specific random, b i , effects respectively. The standard model assumes that random effects are distributed as multivariate normal with mean 0 and covariance matrix Σ. Extensions to include a separate Gaussian process have been proposed; for example, Henderson et al. (2000) proposed a stationary Gaussian process, and Proust-Lima et al. (2016) proposed including either a Brownian motion or auto-regressive process Additionally, we assume that b i and " ij are independent.

Competing risks submodel
The standard approach of modelling competing risks data includes cause-specific hazards (Putter et al., 2007), in which, for subject i, the instantaneous rate for failure of type g at time t is given by where X .2/ i .t/ is a (possibly time varying) vector of covariates, and β .2/ g is a vector of fixed effect parameters for the gth event. The baseline hazard function λ 0g .t/ is cause specific and can either be left unspecified or modelled parametrically. In the extension to joint modelling, we consider a model of the form where W ig .t/ denotes a mean 0 latent term and X .2/ i is a vector of covariates known at baseline.

Joint models
The basic premise of joint models is that the separate longitudinal and time-to-event models that are described in equations (1)  In all models, except model 4, we assume the same submodel for the longitudinal data. Therefore, most of the joint models that are described here can be distinguished in terms of (a) the specification of the baseline hazard functions λ 0g .t/, (b) specification of the latent processes W ig .t/, (c) the estimation approach (including parameter uncertainty estimation) and (d) software implementation.
A summary of the various model properties is given in Table 1.
Lagged effects parameterization Cumulative effects parameterization Weighted cumulative effects parameterization Special case of the random-effects parameterization (with fixed component) .1/ 1 and b i1 denote the fixed and random-effect coefficients associated with the linear time covariate in a random-intercepts and random-slopes linear mixed model 4a Weibull Association between submodels accounted entirely for by latent classes 4b Piecewise constant Each subject i belongs to a single latent class, a i ∈ {1, 2, : : : , R}, with submodels

Baseline hazard functions
Models 1 and 2 leave the baseline hazard functions completely unspecified, opting for a semiparametric joint model, which is consistent with the classical Wulfsohn and Tsiatis (1997) joint model approach. Models 3 and 4 propose spline models for the baseline hazard, namely B-splines (on log-hazard) and either M-splines or piecewise constant (on the untransformed hazard scale). Provided that a sufficient number of knots are specified, each model in principle offers good flexibility to capture a wide range of hazard shapes (Rutherford et al., 2015). Model 4 also considers a third specification for the hazard: the Weibull distribution. Although less flexible than the semiparametric and spline models, it can potentially reduce the computational time to fit the model. See Table 1 for a summary of the model specifications.

Latent association structure
Different characteristics of the longitudinal profile may influence the cause-specific hazards submodel. Within the joint model literature, several association structures have been proposed, but little attention has been given to the challenge of choosing the most appropriate structure for a given data set (Andrinopoulou and Rizopoulos, 2016). Among the models that are examined here, various latent association structures have been considered, including several alternatives for some models (Table 1). Model 1 assumes that W ig .t/ is proportional to the latent term in the linear mixed model, Z i .t/ T b i , which extends the model that was proposed by Henderson et al. (2000) (Table 1, model 1) This is a form of the current values parameterization, whereby the 'value' is the subject-specific deviation from the average trajectory. Model 2 applies W ig .t/ = α g θ i where θ i is a zero-mean subject-specific random effect (Table 1, model 2). The correlation between the submodels is induced by assuming that .θ T i , θ i / T are correlated. Unlike model 1, where association between submodels is tested on the basis entirely of the hypothesis of α g = 0, the null hypothesis of Σ bθ = 0 provides a useful assessment of association between outcomes for model 2.
Model 3 (Rizopoulos, 2012) has been developed with several useful structures, including the current value parameterization (Table 1,  Model 4 is a JLCM (Proust-Lima et al., 2016), which is fundamentally different from models 1-3. This model assumes that each subject i belongs to a single latent class, a i ∈ {1, 2, : : : , R}. It is assumed that subjects within the same latent class share the same mean longitudinal trajectory and hazard risk (Proust-Lima et al., 2012). Hence, the linear mixed model is class specific and the baseline hazard function is assumed to be class and cause specific. From this model, the association between the longitudinal and time-to-event submodels is entirely captured by the latent classes. Each class membership probability is modelled by a multinomial logit: i is a vector of baseline covariates corresponding to class-specific coefficient parameters, ϕ r .

Estimation
All models here are fitted by maximum likelihood estimation using different algorithms. Different approaches to standard error estimation are also taken. Details about the estimation algorithms including numerical quadrature approaches are given in the on-line supplementary data Table S1.

Software
Models 1 and 2 are available in the form of scripts for the languages R and C respectively (see Section 8 for details). Models 3 and 4 can be fitted by using R packages. Details on the general software implementation are given in the on-line Table S1. User instructions are provided with the source code for models 2-4. We have also integrated the scripts for model 1 into the R package joineR (Philipson et al., 2017) including a complete manual. It is beyond the scope of this paper to detail every step of the software implementation including data preprocessing; however, as a demonstration, we have made software code that was used to fit the models to SANAD data available on line (see the data sharing details in Section 8).

Application of models to epilepsy drug trial data
To allow for comparison, we assumed the same functional form for each model as that in the original analysis undertaken by Williamson et al. (2008). Namely, the following linear mixed model for the longitudinal measurements submodel was assumed: .4/ with X i denoting a binary time-independent treatment effect (X i = 1 if patient i was randomized to LTG, and 0 if randomized to CBZ), and " ij and .b i0 , b i1 / T are distributed N.0, σ 2 " / and N 2 .0, Σ/ respectively. Although Williamson et al. (2008) also considered a piecewise linear spline model, for our comparison, we focused only on the above simple linear model. They further specified a cause-specific hazard model for the competing risks data by All models were fitted by using the same computing environment with predominantly default settings, except where reported otherwise. Technical details for implementation of each model are given in the on-line supplementary data.

Interpretation of the treatment effects
The direct treatment effect on UAEs, β .2/ UAE , was statistically significant for all joint models considered ( Table 2). The direct treatment effect on ISC, β .2/ ISC , was non-significant for all models except model 3b. Hence, the results from all models (with the single exception of model 3b, which we discuss below) suggest that, if LTG is titrated at the same rate as CBZ, the beneficial effect of LTG on a UAE would still be evident, and there remains no evidence of a difference in seizure control between the two drugs. This conclusion is consistent with the original trial analysis (Williamson et al., 2007a), which we demonstrate by fitting the separate models for longitudinal outcome (4) and competing risks ((3), without adjustment for titration) (Table 2, model: separate).
The estimated treatment effect on calibrated dose, β .1/ 2 , from models 1, 2 and 4a, indicates that patients who were assigned to LTG were titrated significantly more slowly than those allocated to CBZ. The estimated effects from model 3 were generally closer to the separate linear mixed  ) are reported in the main text. ‡The association parameters, although capturing the strength of association between the submodels, have widely different interpretations between models and so should not be directly contrasted. See the main text for an explanation of each model. Note also that models 3b and 4 have two association parameters per cause. §Time does not take into account the grid search algorithm or fitting times of models with other numbers of classes. model submodel than those of models 1, 2 and 4 (Table 3). Treatment-and-time interaction effects β .1/ 3 were broadly similar for models 1 and 3, but attenuated towards 0 for model 2. The estimated fixed effects for time were also similar, confirming that the average calibrated dose was increased over the study period with the exception of model 3b, which was larger. Model 4 was fitted to allow for class-specific intercept, time and treatment-and-time interaction terms; hence, the estimates are not comparable with the average effects of the other models. We note that the parameter estimates from the longitudinal submodel were not of primary interest in the SANAD clinical trial; however, this example has highlighted that, among the joint models fitted, the association structures and method of estimation and software used to fit the joint models can have different influences on the longitudinal submodel estimates.  ‡CIs are reported for the Cholesky decomposition of the variance-covariance matrix, rather than directly for the covariance parameters. We have not attempted to back-transform these parameters.

Fig. 2.
Causal diagrams for models 1 and 3 (Y.t/, observed longitudinal data; μ.t/, mean trajectory function; ", error; T , failure time; X , treatment; Z.t/ T b, mean 0 latent process; .β .1/ 2 , β .1/ 3 /, treatment effects on longitudinal processes; α g , effect of longitudinal process on treatment failure (superscripts denote multiple effects); β .2/ g , treatment effect on failure time; !, effect via a transformation or operator function; for model 3c, the log-hazard of treatment failure is associated with a lagged value of the mean trajectory function in addition to the direct treatment effect; for models 3b, 3d and 3e, the log-hazard of treatment failure is associated with a linear sum of μ.t/ and the first-order derivate or (weighted) integral of μ.t/, in addition to the direct treatment effect; for model 3f, the log-hazard depends on the subject-specific random slope for time in addition to the direct treatment effect; in models 1 and 3f, the treatment effect is both direct and overall): (a) model 1; (b) models 3a and 3c; (c) models 3b, 3d and 3e; (d) model 3f

Interpretation of latent association parameters
A summary of the time-to-event submodel parameter estimates is given in Table 2. We strongly emphasize that caution is required when contrasting treatment and association parameter effects between models as each has a different latent association structure, despite the shared common notation (for example, see Fig. 2 for a contrast of models 1 and 3). We describe the interpretation for each model in turn below.

Model 1
The parameter estimates that are associated with changes in W ig .t/ = b i0 + b i1 t were significant for both events, leading to the inference that calibrated drug dose is associated with time to treatment failure for both failure reasons. Moreover, the parameters had opposite signs. The clinical interpretation, as discussed in Williamson et al. (2008), was that patients on higher doses may be more likely to withdraw from treatment because of ISC since the reason that they are on higher doses is that they are having continued seizures. Also, patients on higher doses may be less likely to be withdrawn from treatment because of a UAE since the reason that a required dose increase is possible is because no such events have occurred.

Model 2
The shared random effect (frailty term) in the competing risks submodel had an estimated variance σ 2 θ = 0:791 (95% confidence interval (CI), from 0.571 to 1.010). The two covariance terms between longitudinal submodel random effects and the frailty term, Σ T bθ , were both negative: −0:093 (95% CI, from −0:187 to 0.001) and −0:402 (95% CI, from −0:495 to −0:310) respectively, implying that an increase in calibrated dose relative to the population average (which corresponds to an increase in the components of b i / is associated with a decrease in frailty θ i . The coefficient α ISC multiplies the frailty for the event due to ISC relative to a UAE (and α UAE multiplies the frailty for the event due to a UAE, but is constrained to 1 for identifiability). Asα ISC was negative, when interpreted alongside the negative correlation between b i and θ i , it implies that an increase in calibrated dose is associated with a decrease in risk of a UAE (due to decreasing frailty), and an increase in risk of ISC. This is consistent with the original analysis conclusions.

Model 3a
Model 3a assumes that the hazard is associated with the current expected value of the calibrated drug dose at time t, i.e. β whereas model 1 assumed that the association depended on only the random subject-specific deviation from the current expected value, i.e. b i0 + b i1 t. Hence, the overall treatment effect on the event hazard is decomposed into the direct effect β .2/ g and the indirect effect α g .β Therefore, the direct treatment effect must be interpreted by also adjusting for treatment-specific intercept and slope of dose titration in the hazard model.

Model 3b
Rapid changes in titration might be indicative of risk of treatment failure, i.e. any two patients having the same mean value at a given time t might have very different slopes if their random intercepts also vary. Therefore, a separate term that captures the changes in slope might be required. Contrary to the other models and to our clinical understanding, this results in a statistically significant treatment effect for failure due to ISC. The model assumes that the instantaneous hazard of dropout is associated with a linear combination of current mean value and its first-order derivative, namely W ig .t/ = α .1/ 3 X i , and .α .1/ g , α .2/ g / are a pair of association parameters corresponding to the current value and first-derivate terms for cause g. Expanding W ig .t/, we can decompose the treatment effects ( Fig. 2) in the cause-specific hazards submodel into a direct effect β .2/ g and an indirect effect, Hence, the disparity of direct treatment effect from other models is purely an artefact caused by indirect treatment effect adjustments.

Model 3c
The risk of treatment failure might depend, not on the current value of the marker, but on a historical calibrated dose, e.g. if the AED takes a long period to have a physiological effect on patients. In this case, we might consider the calibrated dose at a previous time point, t − c (for some choice of c, taken to be c = 6 months in this example for illustration). Inferences remained consistent with the non-lagged model, model 3a.

Models 3d and 3e
Expanding on the clinical rationale for models 3a and 3c, similar arguments might lead us to hypothesize that the complete history of the calibrated dose up to time t affects the risk of treatment failure, where for example pharmacokinetic-pharmacodynamic properties can lead to cumulative toxicity effects. Hence, α g captures the association between the risk of event and the (weighted) cumulative effect up to time t. The choice of weight function will be application specific. Here, we use a Gaussian density function, as in Rizopoulos (2012), which gives less weight to points the further away from the current time they are. For both models, cumulative (calibrated) drug dose was significantly associated with ISC, which is consistent with the other models. However, cumulative (calibrated) drug dose was not significantly associated with a UAE, although the direction of effects remained consistent.

Model 3f
A simple time-independent association is specified, whereby the risk of treatment failure is associated with the subject-specific random slope for t. This is a special case of random-effects parameterization. Both latent association parameters were significantly different from 0 and in the same direction as for other models; for example model 3a indicates that patients who are titrated more quickly are at increased risk of ISC but reduced risk of a UAE. However, the direct treatment effect for treatment failure also coincides with the overall treatment effect in this model. It was also observed that the latent association parameters were much larger than for other models. This is explained because the variances of the random effects are relatively small, here 0.684 and 0.200 respectively, and hence a unit change in slope is quite extreme. Rizopoulos (2012), page 114, therefore recommended scaling W ig .t/ by this standard deviation to facilitate interpretation. However, to avoid confusion in the comparison between models, we report unscaled estimates.

Models 4a-4c
There are no parameter estimates for inference about the degree of association between the submodels. Instead, the patients are assumed to be members of one of five latent classes. For example, model 4a comprised five classes with 22.8%, 6.6%, 58.3%, 7.4% and 4.8% of subjects allocated, with good discrimination (the mean of the posterior probabilities in each class ranged from 76% (class 1) to 93% (class 2)). The treatment effects were modelled as cause specific only (i.e. independent of class), which led to consistent inferences with the other models. Figs S1 and S2 in the on-line supplementary material summarize the fitted submodels for each class. Model 4b also yielded similar effect sizes, albeit smaller in absolute magnitude. However, model 4b offered increased flexibility in the baseline hazard function compared with model 4a, which was constrained to follow a Weibull model. We were unable to fit model 4c because of lack of convergence.

Model assessment and comparison
Software implementations of models 1 and 2 do not incorporate any features that facilitate model assessment or comparison. However, we have recently updated the R package joineR (Philipson et al., 2017) to report the log-likelihood for model 1.
The software implementation of model 3 reports several statistics that are useful for model comparison, including the log-likelihood at the maximum likelihood estimate, the Akaike information criterion (AIC) and the Bayes information criterion (BIC). For models 3a-3f, the AIC values were 7099.34, 7061.75, 7106.35, 7163.33, 7139.52 and 7029.09 respectively, thus suggesting that model 3f is the preferred model. The same conclusion is obtained when contrasting the BIC for each model (7161.25 for model 3f), and the log-likelihood (−3484:55). In addition, several residual diagnostics for the fitted longitudinal data submodel can be readily viewed from a fitted model (e.g. Fig. S3 in the on-line supplementary material). Although the residual plots indicated some systematic trend, the non-random dropout mechanism can mean that such plots are potentially misleading (Rizopoulos (2012), page 157). However, the solution that was proposed by Rizopoulos et al. (2010) to overcome this-multiple-imputation residuals-is not available in the software for competing risks models. Moreover, residuals for the event time submodel appear to be unavailable for competing risks joint models.
The software implementation of model 4 (Proust-Lima et al., 2017) provides a score test for the independence assumption (Jacqmin-Gadda et al., 2010) between the submodels (conditional on the latent classes)-both globally and cause specific-which JLCMs depend on. Model 4a yielded a statistically significant score test statistic (P = 0:012); however, the causespecific score tests were non-significant (P = 0:23 (UAE); P = 0:13 (ISC)). A six-class model (the results are not shown here) did not reject the hypothesis, and estimated treatment effectŝ β 2 UAE = −0:839 (95% CI, from −1:522 to −0:157) andβ .2/ ISC = 0:018 (95% CI, from −0:563 to 0.599), which yield broadly similar inferences as per the estimates from the five-class model. Therefore, the rejection of the independence assumption does not appear to have an effect on the inferences in this instance. Model 4b did not reject the conditional independence assumption (P = 0.12). However, it should be noted that score tests are distribution dependent, which might explain the different test conclusions between models 4a and 4b. Posterior class membership tables can also be extracted from fitted models, which enable the user to assess the goodness of fit of the model and model discrimination (Proust-Lima et al., 2012). As per model 3, the AIC, BIC and log-likelihood statistics are given as output. The five-class Weibull model had the lowest AIC (6787.68 versus 6805.78) and BIC (6955.08 versus 6990.80), and maximum log-likelihood (−3355:84 versus −3360:89) relative to the five-class piecewise constant model. Although model 4a had a lower AIC than did model 3f, caution should be taken in choosing one over the other, as each offers a fundamentally different structure. Finally, the software for model 4 offers several residual diagnostics (e.g. Fig. S4 in the on-line supplementary material) from the fitted longitudinal data submodel, which can be readily viewed from a fitted model.

Discussion
Over the past decade, research into joint modelling of longitudinal and competing risks data has grown. We described four models (Elashoff et al., 2008;Williamson et al., 2008;Rizopoulos, 2012;Proust-Lima et al., 2017), each of which has provided software to enable model fitting.
The models that were explored here encompassed several association structures. Misspecification of the association structure might lead to biased effects. Ideally, a pragmatic balance between biologically plausible associations and a parsimonious model should be used; however, this might still be challenging. To date, there is limited research on this particular issue (Andrinopoulou and Rizopoulos, 2016). Although each model can be used for inference, if interest lies in risk prediction for personalized treatments, then model 4 would perhaps be most appropriate. The reasons for this are twofold: first, no consideration on the correct functional form of the latent association structure is required; second, the software implementation (Proust-Lima et al., 2017) can readily produce predictions. Although the JLCM is potentially robust to misspecification of the latent association structure, which might be exploitable for inference, it conversely precludes inference on that aspect of the joint model, which might be of interest to researchers. At the expense of the flexibility that is offered by model 2, the interpretation of the latent association is also less intuitive. Models 2, 3f and 4 also assume time-independent associations, whereas models 1 and 3a-3e consider different time-dependent association structures.
Among the proposals for the cause-specific hazards competing risks submodel, all models allowed for the specification of flexible baseline hazard functions, including spline and semiparametric models. In principle therefore, if the hazard is void of abrupt changes and a sufficient number of knots are specified for the spline models (Rutherford et al., 2015), then the differences in hazards should only subtly affect the model estimates. Notwithstanding calls for both causespecific hazard and cumulative incidence function models to be used in analyses (Latouche et al., 2013), the latter have received relatively little attention in the joint model framework. Deslandes and Chevret (2010) proposed modelling of competing risks data using the subdistribution hazard model of Gray (1999), andProust-Lima et al. (2016) constructed cumulative incidence graphs from fitted JLCMs.
Estimation approaches and software implementations also varied substantially. A summary of the merits and limitations of each model implementation is given in Table S1 in the on-line supplementary material. One very practical limitation is that the software for models 1 and 2 permits only two causes of failure, although the models are presented in full generality. Despite some variation in the model fits to the SANAD trial data note that we did not 'validate' the software for each model, as this has already been done separately in each of the individual methodological research papers. Moreover, a grand simulation analysis would be conflated by the different 'true' association structures. Nonetheless, understanding whether there is a gain in efficiency by using one particular model, or whether any are more robust to misspecification than others, would be a useful direction of future research. We are also aware that two new software applications have recently been published: Andrinopoulou et al. (2014) has recently made available a WinBUGS script for fitting a joint model of bivariate longitudinal data (including an ordinal outcome) and competing risks data, and Wang et al. (2017) published an SAS macro for fitting parametric joint models involving competing risks data. We illustrated the model fits and differences in interpretation by using a real world data set from an AED trial. Unlike research in other disease areas, there is very limited use of joint modelling in the disease area of epilepsy. The focus of this study was on the direct treatment effects, β .2/ ISC and β .2/ UAE , following a clinical hypothesis that differential titration between the two AED drugs, LTG and CBZ, resulted in an unfair comparison in a separate model competing risks analysis (Williamson et al., 2008). We note that a fundamental complication in contrasting models and treatment effects is that some models estimate an overall treatment effect, whereas others estimate separate direct and indirect treatment effects (Ibrahim et al., 2010). Titration of AEDs is a complex issue that has far-ranging implications for the design and interpretation of drug trials. Joint modelling provides a sensible analytical method to handle this issue. We found that adjustment for calibrated dose in a joint model framework resulted in similar conclusions to those found in the original analysis of the SANAD trial data.
To allow for the routine adoption of joint modelling of competing risks and longitudinal data in the analysis of clinical data sets, further availability of model diagnostics, model comparisons and statistical tools for study design are necessary. We noted that some diagnostic and goodnessof-fit assessment methods are available in the JM (Rizopoulos, 2010) and lcmm (Proust-Lima et al., 2017) R packages (see models 3 and 4 respectively). Prediction methods and assessment tools are also emerging. Blanche et al. (2015) have proposed dynamic area under the receiver operating characteristic curve and Brier score statistics applicable to these models, and the lcmm R package can generate dynamic predictions (see model 4). Finally, in our application the measurement schedule was fixed by the trial protocol; however, in other cases, particularly observational studies, the measurement times may be informative. One possible future solution would be the addition of a third submodel to capture the recurrent events process (Liu et al., 2008).

Conclusion
Although the models that were investigated here represent progress within the field, joint modelling of competing risks and longitudinal data is still very much in its infancy in terms of methodological development and is particularly restricted in terms of its ease of application to analyse clinical data sets. There is a real need to increase the availability and accessibility of software to allow the routine adoption of the methodologies when designing and analysing clinical studies. Further work is also required on the development of model diagnostics to aid model choice. Despite the vastly different association structures, the inference remains consistent across fitted models when applied to the SANAD data. The association between AED titration and treatment failure was significant in most models, and the estimated treatment effects implied a beneficial effect of LTG on UAEs over CBZ, but the two drugs were similar in terms of the hazard for ISC.

Data sharing
The data that are analysed in this paper are freely available from the R package joineR. The software and code that were used to fit the models are available from https://github. com/graemeleehickey/comprisk.