Showing posts with label longitudinal data.

Tuesday, 5 February 2013

The gradient function as an exploratory goodness-of-fit assessment of the random-effects distribution in mixed models


Geert Verbeke and Geert Molenberghs have a new paper in Biostatistics. The paper proposes using the gradient function (or equivalently the directional derivatives) of the marginal likelihood with respect to the random-effects distribution as a way of assessing goodness-of-fit in a mixed model. They concentrate on cases arising in standard longitudinal data analysis using linear (or generalized linear) mixed models, but the method can be extended to other mixed models, such as clustered multi-state models with multivariate (log-)normal random effects.

If we consider data from units i with observations x_i, then given a mixing distribution G, the marginal density is given by f(x_i) = ∫ f(x_i | u) dG(u).

The gradient function is then taken as Δ(u) = (1/N) Σ_i f(x_i | u) / f(x_i), where N is the total number of independent clusters.

The use of the gradient function stems from finite mixture models and in particular the problem of finding the non-parametric maximum likelihood estimate of the mixing distribution. At the NPMLE the gradient function has a supremum of 1. If instead we assume there is a parametric mixing distribution, under correct specification the gradient function should be close to 1 across all values of u. Verbeke and Molenberghs use this property to construct an informal graphical diagnostic of the appropriateness of the proposed random effects distribution.
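The diagnostic is cheap to compute in simple cases. Below is a minimal numerical sketch for a hypothetical toy model in which each cluster contributes a single summary measurement x_i = u_i + e_i with known error standard deviation; the model, parameter values and grid are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N, sigma, tau = 20000, 1.0, 2.0

# one summary measurement per cluster: x_i = u_i + e_i,
# with u_i ~ N(0, tau^2) (the true mixing distribution), e_i ~ N(0, sigma^2)
u = rng.normal(0.0, tau, N)
x = u + rng.normal(0.0, sigma, N)

def gradient_function(u_grid, tau_assumed):
    # Delta(u) = (1/N) * sum_i f(x_i | u) / f(x_i), with f(x_i) the
    # marginal density under an assumed Gaussian mixing distribution
    f_marg = norm.pdf(x, 0.0, np.sqrt(sigma**2 + tau_assumed**2))
    fu = norm.pdf(x[None, :], u_grid[:, None], sigma)
    return (fu / f_marg[None, :]).mean(axis=1)

grid = np.linspace(-4.0, 4.0, 9)
delta_ok = gradient_function(grid, tau)    # correctly specified mixing sd
delta_bad = gradient_function(grid, 1.0)   # understated mixing variance
```

Under the correct mixing distribution the estimated gradient function hovers around 1 across the whole grid; with an understated mixing variance it dips well below 1 near the centre of the random-effects distribution.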

An advantage of the approach is that essentially no additional calculations are required to compute the measure, above and beyond those already needed for estimation of the parametric mixture model itself. A current limitation of the approach is that there is no formal test to assess whether the observed deviation is statistically significant. It is stated that this is ongoing work. It seems reasonably straightforward to show that the gradient function will tend to a Gaussian process with mean 1 but with a quite complicated covariance structure. Obtaining some nice asymptotics for a statistic based either on the maximum deviation from 1 or some weighted integral of the distance from 1 therefore seems unlikely. However, it may be possible to obtain a simulation based p-value by simulating from the limiting Gaussian process.

Sunday, 28 October 2012

Survival analysis with time varying covariates measured at random times by design


Stephen Rathbun, Xiao Song, Benjamin Neustifter and Saul Shiffman have a new paper in Applied Statistics (JRSS C). This considers estimation of a proportional hazards survival model in the presence of time-dependent covariates which are only intermittently sampled at random time points. Specifically, they are interested in examples relating to ecological momentary assessment, where data collection may be via electronic devices like smart phones and the decision on sampling times can be automated. They consider a self-correcting point-process sampling design, in which the intensity of the sampling process depends on the past history of sampling times; this allows an individual to have random sampling times that are more regular than would be achieved from a Poisson process.

The proposed method of estimation is to use inverse intensity weighting to obtain an estimate of an individual's integrated hazard up to the event time. Specifically, for an individual with sampling times t_i1, ..., t_im generated by a point process with intensity lambda_i(t), the integrated hazard ∫ h_i(t) dt is estimated by Σ_j h_i(t_ij) / lambda_i(t_ij). This then replaces the integrated hazard in an approximate log-likelihood.
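The weighting idea can be checked in isolation: for any integrand g and any sampling process with known intensity rho bounded away from zero, the sum of g(t_j)/rho(t_j) over the sampled times is unbiased for the integral of g over the observation window (Campbell's theorem). A sketch using an inhomogeneous Poisson sampling process, with all functions and constants hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 10.0
g = lambda t: np.exp(0.2 * np.sin(t))   # hypothetical integrand (e.g. hazard term)
rho = lambda t: 2.0 + np.cos(t)         # known sampling intensity, >= 1 everywhere
rho_max = 3.0

# reference value of the integral via a fine midpoint rule
m = 200000
mid = (np.arange(m) + 0.5) * (T / m)
truth = g(mid).sum() * (T / m)

def sample_times():
    # thinning: simulate an inhomogeneous Poisson process with intensity rho on [0, T]
    n = rng.poisson(rho_max * T)
    cand = rng.uniform(0.0, T, n)
    return np.sort(cand[rng.uniform(0.0, rho_max, n) < rho(cand)])

# inverse-intensity-weighted estimates over many replications
est = []
for _ in range(2000):
    t = sample_times()
    est.append(np.sum(g(t) / rho(t)))
est = np.asarray(est)
```

Averaged over replications the weighted sum recovers the integral, which is the property the estimator above exploits at the subject level.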

In part of the simulation study and in the application, the exact point-process intensity is unknown and is replaced by empirical estimates from the sample. Estimating the sampling intensity did not seem to have major consequences for the integrity of the model estimates. This suggests the approach might be applicable in other survival models where the covariates are sampled longitudinally in a subject-specific manner, provided a reasonable model for the sampling can be devised.

Drawbacks of the method seem to be that covariates measured at baseline (which is not a random time point) cannot be incorporated in the estimator, and that the covariates must apparently be measured at the event time, which may not be the case in medical contexts. The underlying hazard also needs to be specified parametrically, although as the authors state, flexible spline models can be used.

Sunday, 29 April 2012

Nonparametric multistate representations of survival and longitudinal data with measurement error


Bo Hu, Liang Li, Xiaofeng Wang and Tom Greene have a new paper in Statistics in Medicine. This develops approaches for summarizing longitudinal and survival data in terms of marginal prevalences. The authors' use of "multistate" is perhaps not entirely in line with its typical usage. They consider data consisting of right censored competing risks data plus additional continuous longitudinal measurements which persist until a competing event has occurred (or censoring). For the purpose of creating a summary measure, the longitudinal measurement can be partitioned into a set of discrete states. There are thus states corresponding to absorbing competing risks plus a series of transient states corresponding to the longitudinal measurements. The aim of the paper is to develop a nonparametric estimate of the marginal probability of being in a particular state at a particular time.

The approach taken is firstly to use standard non-parametric estimates for competing risks data to get estimates of the probability of being in each of the absorbing states. For the longitudinal part, it is assumed that the "true" longitudinal process is not directly observed but is instead observed with measurement error. As a consequence, the authors propose to use smoothing splines to get an individual estimate of each subject's true trajectory. The combined state occupancy probability at time t for a longitudinal state then consists of the overall survival probability from the competing risks analysis multiplied by the proportion of subjects still at risk at time t who are estimated (on the basis of their spline smooth) to be within that interval. The probability of being in an absorbing state is computed directly from the competing risks estimates. Overall, a stacked probability plot can then be produced, consisting of the stacked CIFs for each of the competing risks plus the (not necessarily monotonic) partition of the longitudinal states.
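A rough sketch of the occupancy calculation, assuming (hypothetically) that each subject's spline smooth has already been reduced to a fitted linear trajectory; a Kaplan-Meier estimate stands in for the competing-risks machinery, so this illustrates only the single-event case:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400

# simulated survival data (single event type for simplicity)
event_time = rng.exponential(8.0, n)
cens_time = rng.uniform(5.0, 15.0, n)
obs_time = np.minimum(event_time, cens_time)
died = event_time <= cens_time

# hypothetical stand-in for each subject's spline smooth: a fitted linear trajectory
intercept = rng.normal(70.0, 10.0, n)
slope = rng.normal(-2.0, 0.5, n)
traj = lambda t: intercept + slope * t

def km_survival(t):
    # Kaplan-Meier estimate of P(T > t)
    s = 1.0
    for u in np.sort(np.unique(obs_time[died & (obs_time <= t)])):
        s *= 1.0 - np.sum(died & (obs_time == u)) / np.sum(obs_time >= u)
    return s

def occupancy(t, lo, hi):
    # overall survival times the proportion of at-risk subjects whose
    # smoothed trajectory falls in [lo, hi) at time t
    at_risk = obs_time >= t
    in_state = (traj(t)[at_risk] >= lo) & (traj(t)[at_risk] < hi)
    return km_survival(t) * in_state.mean()
```

By construction, the occupancy probabilities over a partition of the longitudinal scale sum to the overall survival probability, which is what makes the stacked probability plot coherent.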

The use of individual smoothing splines seems to present practical problems. Firstly, it assumes that the true longitudinal process is itself in some way "smooth". In some cases the change in state in a biological process may manifest itself in a rapid collapse of a biomarker. Secondly, it seems to require a relatively large number of longitudinal measurements per person in order to get a reasonable estimate of their "true" process. Presumably the level of the longitudinal measure is likely to have a bearing on the cause-specific hazards of the competing risks. The occurrence of one of the competing risks is thus informative for the longitudinal process. The authors claim to have got around this by averaging only over people currently in the risk set at time t. However, the longitudinal measurements are intermittent. If they are sparse then someone may be observed at, say, 1 year and 2 years and then die at 10 years. The method would estimate a smooth spline based on years 1 and 2 and extrapolate up to 10 years without using the fact that the subject died at 10 years. Similarly, there might be one or fewer longitudinal observations before a competing event for some patients, making estimation of the true trajectory near impossible. Also, the estimator as it stands attempts no weighting to take account of the relative uncertainties about different individuals' true trajectories at particular times. Overall, as a descriptive tool it may be useful in some circumstances, primarily when subjects have regular longitudinal measurements. In this respect it is similar to the "prevalence counts" (Gentleman et al, 1994) method of obtaining non-parametric prevalence estimates for interval-censored multi-state data.

In the appendix, a brief description is given of an approach to allowing the transition probabilities between states to be calculated. They only illustrate the method for a case of going from a longitudinal state to an absorbing state (presumably the procedure for transitions between longitudinal states would be different). Nevertheless, there doesn't seem to be any guarantee that estimated transition probabilities will lie in [0,1].

Tuesday, 27 March 2012

Modeling left-truncated and right-censored survival data with longitudinal covariates

Yu-Ru Su and Jane-Ling Wang have a new paper in the Annals of Statistics. This considers the problem of modelling survival data in the presence of intermittently observed time-varying covariates when the survival times are both left-truncated and right-censored. They consider a joint model which assumes there exists a random effect that influences both the longitudinal covariate values (which are assumed to be a function of the random effects plus Gaussian error) and the survival hazard. Considerable work has been done in this area in cases where the survival times are merely right-censored (e.g. Song, Davidian and Tsiatis, Biometrics 2002). The authors show that the addition of left-truncation complicates inference quite considerably; firstly because the parameters affecting the longitudinal component may not be identifiable, and secondly because the score equations for the regression and baseline hazard parameters become much more complicated than in the right-censoring case. To alleviate this problem, the authors propose to use a modified likelihood rather than either the full or conditional likelihood. The full likelihood can be expressed in terms of an integral over the conditional distribution of the random effect, given that the event time occurred after the truncation time. The proposed modification is to instead integrate over the unconditional random-effect distribution; heuristically this is justified by noting that replacing the conditional distribution of the random effect by its unconditional counterpart changes the likelihood contribution in a way that is asymptotically negligible. The authors also show that inference based on this modified likelihood gives consistent and asymptotically efficient estimators of the regression parameters and the baseline survival hazard.

An EM algorithm to obtain the modified maximum likelihood estimate (MMLE) is outlined, in which the E-step involves a multi-dimensional integral that the authors evaluate through Monte Carlo approximation. The implementation of the EM algorithm is simplified if the random effect is assumed to have a multivariate Normal distribution.
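The Monte Carlo E-step can be illustrated with importance sampling from the random-effect distribution: draw b from its normal distribution and weight each draw by the conditional likelihood of the data. In the toy conjugate model below (all values hypothetical, one cluster, identity link) the answer can be checked against the closed-form posterior mean.

```python
import numpy as np

rng = np.random.default_rng(3)
tau, sigma = 1.5, 1.0                  # random-effect sd, residual sd
y = np.array([1.2, 0.7, 1.9, 1.4])    # toy within-cluster observations

# E-step by importance sampling: draw b from its prior, weight by the likelihood
b = rng.normal(0.0, tau, 200000)
logw = -0.5 * np.sum((y[None, :] - b[:, None]) ** 2, axis=1) / sigma**2
w = np.exp(logw - logw.max())          # stabilised weights
post_mean_mc = np.sum(w * b) / np.sum(w)

# closed-form posterior mean available in this conjugate toy model
m = len(y)
post_mean_exact = (m / sigma**2) / (m / sigma**2 + 1 / tau**2) * y.mean()
```

In the joint model itself no closed form exists, which is precisely why the Monte Carlo approximation is needed; the toy case just shows the mechanics.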

Thursday, 29 September 2011

A multi-state model for the analysis of changes in cognitive scores over a fixed time interval

Arnold Mitnitski, Nader Fallah, Charmaine Dean and Kenneth Rockwood have a new paper in Statistical Methods in Medical Research. The paper develops a model to describe the trajectory of cognitive function test data. The responses are test scores out of 100, but the authors choose to group the responses into 12 states. Additionally, subjects may die before the next assessment. Assessments occur at (roughly) equally spaced intervals, so a discrete-time model is adopted. Essentially the data are then ordinal longitudinal data.

The novel aspect of the model is to assume that, conditional on survival between times j-1 and j, the state at time j follows a truncated Poisson distribution on the finite set of states, with the mean of the Poisson distribution taken as a linear function of the state at time j-1 and covariates (including age and/or time since baseline measurement). A separate logistic regression model is applied to the deaths. This avoids having to pretend the data are continuous, as one might if a linear mixed model were used, and also avoids there being a very large number of unknown parameters, as there would be if a general discrete-time Markov model were applied. However, the truncated Poisson model makes strong assumptions about the conditional distribution of the states, which may or may not be well supported by the data. The authors note that the model fits better if 12 states are used rather than 16. Whether accommodating the proposed model should be a criterion for choosing the number of states to use in the model is questionable.
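A sketch of the conditional distribution this model assumes, with hypothetical regression coefficients: a Poisson pmf truncated to the 12 states and renormalised.

```python
import numpy as np
from scipy.stats import poisson

K = 12  # number of cognitive states, labelled 0..11

def next_state_pmf(prev_state, beta0=0.5, beta1=0.9):
    # Poisson mean as a linear function of the previous state
    # (beta0 and beta1 are hypothetical coefficients, not from the paper)
    lam = max(beta0 + beta1 * prev_state, 1e-8)
    p = poisson.pmf(np.arange(K), lam)
    return p / p.sum()   # truncate to the K states and renormalise

pmf = next_state_pmf(5)
```

The single-parameter Poisson shape is what makes the model parsimonious, and also what makes its conditional-distribution assumption strong: the conditional variance cannot be varied independently of the conditional mean.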

It is not clear that a multi-state model is particularly appropriate for modelling a response with such a large number of levels. It might be better to follow a latent trait approach, in which state k is observed when a_{k-1} < Z_t <= a_k for some Normally distributed latent variable Z_t that evolves deterministically with time, with the addition of stationary Gaussian noise (not necessarily independent), and where the a_k are boundary values to be estimated. Similarly, existing approaches to modelling MMSE based on a much simpler classification into cognitively normal and cognitively impaired, with the possibility of misclassification (see e.g. Van den Hout and Matthews), are likely to give more meaningful results, even if some of the information in the raw MMSE scores is sacrificed.
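The thresholding step of such a latent trait approach is straightforward to sketch; the cut-points, trend and noise level below are arbitrary illustrations, not estimates.

```python
import numpy as np

rng = np.random.default_rng(5)

# hypothetical cut-points: state k is observed when alpha[k] < Z <= alpha[k+1]
alpha = np.array([-np.inf, -1.0, 0.0, 1.0, np.inf])

def latent_to_state(z):
    # side='left' puts a value equal to a cut-point in the lower state,
    # matching the "alpha[k] < z <= alpha[k+1]" convention
    return np.searchsorted(alpha, z, side='left') - 1

# latent trait: deterministic drift plus stationary Gaussian noise
t = np.arange(10)
z = 0.5 - 0.15 * t + rng.normal(0.0, 0.3, 10)
states = latent_to_state(z)
```

Estimation would then amount to fitting the drift, the noise covariance and the cut-points jointly, e.g. by maximum likelihood over the ordinal observations.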

Monday, 8 August 2011

Joint modelling of longitudinal outcome and interval-censored competing risk dropout in a schizophrenia clinical trial

Ralitza Gueorguieva, Robert Rosenheck and Haiqun Lin have a new paper in JRSS A. The paper concerns the joint modelling of a longitudinal outcome and an interval-censored competing risks outcome that explains drop-out. As is common in these joint longitudinal-survival models, the two processes are linked via a normally distributed vector of random effects. The novelty of the paper is that the survival part is a competing risks process and that the event time is interval censored. The authors adopt a parametric model for the competing risks, using the family of distributions proposed by Sparling et al (Biostatistics, 2006). This makes inference somewhat more straightforward than it would be if non-parametric baseline cause-specific hazards were used. As recently noted, parametric treatment of competing risks data is surprisingly rare. One problem faced by the authors is that the hazard family of Sparling, while allowing closed-form expressions for interval-censored univariate survival data, does not yield closed-form expressions for interval-censored competing risks data (except in special cases). Instead, a numerical integral has to be computed. The presence of the overall random effects would mean the likelihood requires nested integration. To avoid this problem, the authors adopt an approximation to the true likelihood for competing risks data: if a patient is known to have had a failure of type j in the interval [t0,t1], they assume that the patient is censored for all risks except risk j at time t0. It is clear that this approximation will lead to systematic bias, since the time at risk from each failure type will be underestimated and so the hazards will tend to be overestimated. The amount of bias will depend on the typical length of the intervals [t0,t1].
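The direction of the bias is easy to see in a stylised simulation (this is not the authors' actual likelihood, just the exposure-truncation effect in isolation): with exponential competing risks observed on a unit assessment grid, counting each subject's exposure only up to the left endpoint of the event interval inflates an occurrence/exposure hazard estimate.

```python
import numpy as np

rng = np.random.default_rng(4)
lam1, lam2, n = 0.3, 0.2, 200000   # hypothetical cause-specific hazards

# latent cause-specific event times; the observed event is the earlier one
t_1 = rng.exponential(1 / lam1, n)
t_2 = rng.exponential(1 / lam2, n)
T = np.minimum(t_1, t_2)
cause = np.where(t_1 <= t_2, 1, 2)

# assessments on a unit grid: the event is only known to lie in (t0, t0 + 1]
t0 = np.floor(T)

# occurrence/exposure estimates of the cause-1 hazard
n1 = np.sum(cause == 1)
lam1_exact = n1 / T.sum()    # exposure counted to the true event time
lam1_approx = n1 / t0.sum()  # exposure truncated at the interval's left endpoint
```

With these values the truncated-exposure estimate is noticeably above the true hazard, and the inflation grows with the interval width, consistent with the bias argument above.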

For the CATIE data example the proposed approximation is probably not an issue. The drop-out (competing risks) part of the model is not the primary focus of the inference, and it is really the relative hazards of the different types of drop-out, rather than their absolute values, that matters in determining the trajectories of the longitudinal measure without drop-out. For instance, the estimates for simulated data of a similar type are close to unbiased. However, in extreme cases such as current status competing risks data the approximation will do extremely badly.