## Tuesday, 26 October 2010

### Multiple imputation for estimating the risk of developing dementia and its impact on survival

Yu, Saczynski and Launer have a new paper in Biometrical Journal. It proposes an approach to fitting Cox-Markov or Cox-semi-Markov illness-death models subject to interval censoring, via iterative multiple imputation.

In essence the paper generalizes the method for interval-censored survival data proposed by Pan (2000; Biometrics). The idea is to start by imputing the mid-point of the censoring interval for each unknown event time and to calculate the parameter estimates treating these times as known. Then, based on the new estimates, impute e.g. 20 different complete datasets, estimate the parameters from these complete datasets, and re-impute. This process continues until the parameter estimates converge.
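The scheme can be sketched as follows, with a constant-hazard (exponential) fit standing in for the Cox model and truncated inverse-CDF draws for the imputation step; the toy data and all names are illustrative assumptions, not the authors' code.

```python
import math
import random

random.seed(1)

# Toy interval-censored onset times: (left, right) bounds for each subject.
intervals = [(2.0, 4.0), (1.0, 3.0), (5.0, 7.0), (0.5, 2.5), (3.0, 6.0)]

def fit_rate(times):
    """Stand-in for the Cox fit: MLE of a constant hazard (exponential model)."""
    return len(times) / sum(times)

# Step 1: initialise by imputing the interval mid-points.
rate = fit_rate([(a + b) / 2 for a, b in intervals])

# Steps 2-3: impute M complete datasets from the current fit (drawing each
# time from the fitted distribution truncated to its interval), refit on
# each, pool, and iterate until the pooled estimate stabilises.
M = 20
for iteration in range(100):
    estimates = []
    for _ in range(M):
        times = []
        for a, b in intervals:
            fa = 1 - math.exp(-rate * a)   # F(a) under the current fit
            fb = 1 - math.exp(-rate * b)   # F(b)
            u = fa + random.random() * (fb - fa)
            times.append(-math.log(1 - u) / rate)  # inverse-CDF draw in (a, b)
        estimates.append(fit_rate(times))
    new_rate = sum(estimates) / M
    if abs(new_rate - rate) < 1e-4:
        break
    rate = new_rate
```

In the real method each refit is a Cox model with a Breslow baseline, but the loop structure is the same.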

The authors consider data regarding mortality and the onset of dementia. They model post-dementia hazard as being proportional to pre-dementia hazard and additionally allow onset age to affect hazard (for the semi-Markov model).

A drawback of the approach, mentioned in Pan's paper but seemingly not addressed by Yu et al., is that the support points for the baseline intensities depend entirely on the initial values chosen, i.e. the final baseline intensities will have support at a subset of the original support points. For small datasets, the number of support points may decay all the way to 1. Pan suggested using a Link estimate of the baseline hazard (i.e. a linearly smoothed Breslow estimate) and imputing from this to avoid the problem.

A further issue is determining what exactly constitutes convergence of the parameter estimates. After all, at each iteration a fresh random imputation of M datasets takes place, so the new parameter estimates can be expected to show some random variation relative to the previous iteration.
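One pragmatic stopping rule is to compare the change in the pooled estimate against its own Monte Carlo standard error; the rule and the threshold `k` below are my assumptions, not a criterion taken from the paper.

```python
import math

def converged(old, new_estimates, k=0.5):
    """Heuristic check: declare convergence only when the change in the
    pooled estimate is small relative to the Monte Carlo standard error
    of the mean over the M imputed datasets."""
    M = len(new_estimates)
    new = sum(new_estimates) / M
    var = sum((e - new) ** 2 for e in new_estimates) / (M - 1)
    mc_se = math.sqrt(var / M)
    return abs(new - old) < k * mc_se, new

# Example: five per-dataset estimates pooling to roughly the previous value.
done, pooled = converged(0.30, [0.29, 0.31, 0.30, 0.28, 0.32])
```

A rule of this kind at least acknowledges that the iteration is a stochastic process and will never settle exactly.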

The authors should be commended for providing code and a subset of the full data in the Supporting Material of the paper. However, there seems to be an error in the code which results in an overestimation of the cumulative disease onset hazard. The code obtains the Breslow estimate of the cumulative baseline hazard and then estimates the hazard between event times as $h(t) = (H(t_{i+1}) - H(t_i))/(t_{i+1} - t_i)$. But instead of then simulating new event times from a piecewise-exponential distribution, it ascribes $h(t)$ to all event times (e.g. previously imputed onset times and death times) between $t_i$ and $t_{i+1}$. The coding error also seems to produce considerable bias in the other estimates, as corrected code (assuming support for the hazard only at imputed onset times) gives quite different results.
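For comparison, the correct step would draw new event times from the piecewise-exponential distribution implied by the estimated cumulative hazard, i.e. invert $H$ at an Exp(1) draw. A minimal sketch follows; the grid and $H$ values are made up for illustration and this is not the authors' code.

```python
import math
import random

random.seed(2)

# A Breslow-type step estimate of the cumulative hazard at grid times.
grid = [0.0, 1.0, 2.5, 4.0, 6.0]
H = [0.0, 0.10, 0.35, 0.60, 1.20]

def sample_piecewise_exponential(grid, H):
    """Draw T with piecewise-constant hazard (H[i+1]-H[i])/(grid[i+1]-grid[i])
    on (grid[i], grid[i+1]], by solving H(T) = E for E ~ Exp(1).
    Returns None if E exceeds H at the last grid point (no event on the grid)."""
    e = -math.log(random.random())  # Exp(1) draw
    for i in range(len(grid) - 1):
        if e <= H[i + 1]:
            h = (H[i + 1] - H[i]) / (grid[i + 1] - grid[i])
            return grid[i] + (e - H[i]) / h
    return None

draws = [sample_piecewise_exponential(grid, H) for _ in range(1000)]
```

Times simulated this way are consistent with the fitted cumulative hazard, whereas attaching the local $h(t)$ to a fixed set of previously imputed times is not.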

The analysis also seems to be incorrect in that it treats all patients as left-truncated (i.e. known healthy) at age 50 (presumably time 0 in the data is age 50). Yet no deaths occur before age 71.6 (21.6 on the time scale used in the data), and even prevalent cases enter the study conditional on survival to their entry age.

A comparison with existing methods can be made by computing the NPMLE under a Markov assumption, using methods developed by Gauzere (2000; PhD thesis) and Frydman and Szarek (2009; Biometrics). With the prevalent cases removed, a corrected version of the imputation code gives estimates reasonably similar to the NPMLE (see figure), whereas the uncorrected version substantially overestimates the cumulative incidence of dementia.

The results of the full-data analysis reported in the paper are close to those obtained by running the supplied code on the subsample, suggesting the same code was used for the full data. It is unclear whether the same code was also used for the simulation study.

Finally, it should be noted that the method, while undoubtedly more straightforward to implement than either a semi-parametric maximum likelihood estimate or the penalized likelihood approach of Joly and Commenges, is nevertheless quite slow. Moreover, as with the other methods, the complexity increases rapidly with the number of states in the model.

#### 1 comment:

annes said...

Thanks for these notes, which have been very helpful.
I am working on the method proposed by Yu et al. (2010) and would be curious to know if you have any further information concerning the R code or the method.