Friday, 17 December 2010

Estimating the distribution of the window period for recent HIV infections: A comparison of statistical methods

Michael Sweeting, Daniela De Angelis, John Parry and Barbara Suligoi have a new paper in Statistics in Medicine. It considers estimating, from doubly-censored data, the time from seroconversion to a biomarker crossing a threshold (denoted T). Such methods have been used in the past to model the time from infection to seroconversion. Sweeting et al state in the introduction that a key motivation is to allow the prevalence of recent infection to be estimated. This is defined as

$$P(d) \;=\; \int_0^{\infty} h(d-t)\, S(t)\, \mathrm{d}t$$

where h denotes the incidence of seroconversion over calendar time and S(t) is the survival function of T, the time from seroconversion to biomarker crossing. Here we see there is a clear assumption that T is independent of calendar time d. This is consistent with the framework first proposed by De Gruttola and Lagakos (Biometrics, 1989), where the time to the first event is assumed to be independent of the time between the first and second events; the data essentially arise from a three-state progressive semi-Markov model in which interest lies in estimating the sojourn distribution in state 2. Alternatively, the methods of Frydman (JRSSB, 1992) are based on a Markov assumption; here the marginal distribution of the sojourn in state 2 could be estimated by integrating over the estimated distribution of times of entry into state 2. Sweeting et al, however, treat the data as standard bivariate survival data (where X denotes the time to seroconversion and Z the time to biomarker crossing), using MLEcens for estimation. It's not clear whether this is sensible if the aim is to estimate P(d) as defined above. Moreover, Betensky and Finkelstein (Statistics in Medicine, 1999) suggested that an alternative algorithm would be required for doubly-censored data because of the inherent ordering (i.e. Z > X). Presumably, as long as the intervals for seroconversion and crossing are disjoint, the NPMLE for S is guaranteed to have support only at non-negative times.
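
As a rough illustration of the bivariate formulation, here is a minimal sketch (not the authors' code, and assuming the MLEcens interface in which computeMLE() takes an n x 4 matrix of observation-rectangle bounds; the toy data are made up):

library(MLEcens)

# Toy doubly-censored data: (L.x, R.x) bounds the seroconversion time X and
# (L.z, R.z) bounds the biomarker-crossing time Z, so each subject contributes
# an observation rectangle in the (X, Z) plane.
dat <- data.frame(L.x = c(0, 2, 1), R.x = c(3, 5, 4),
                  L.z = c(4, 6, 5), R.z = c(7, 9, 8))
rects <- as.matrix(dat[, c("L.x", "R.x", "L.z", "R.z")])

mle <- computeMLE(rects)  # NPMLE of the joint distribution of (X, Z)
mle$p                     # probability masses
mle$rects                 # maximal intersections (regions) carrying the mass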

Sweeting et al make a great deal of the fact that, since the NPMLE of the joint distribution of (X, Z) is non-unique (it can only ascribe mass to regions of the (X, Z) plane) and T = Z - X, there is huge uncertainty about the distribution of T, and hence about S, between the extremes of assuming the mass lies only in the lower-left corners or only in the upper-right corners of the support rectangles. They provide a plot to show this in their data example. The original approach to doubly-censored data of De Gruttola and Lagakos assumes a set of discrete time points of mass for S (the generalisation to higher-order models was given by Sternberg and Satten). This largely avoids any problems of non-uniqueness (though some problems remain; see e.g. Griffin and Lagakos 2010).
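
Continuing the toy sketch above, that sensitivity could be visualised by pushing the NPMLE mass of each maximal intersection to its lower-left or upper-right corner and comparing the implied distributions of T = Z - X. This is again only a hedged sketch: the column ordering of mle$rects is assumed to be x1, x2, y1, y2, and the evaluation points are arbitrary.

p  <- mle$p
rc <- mle$rects                 # columns assumed ordered as x1, x2, y1, y2

T.ll <- rc[, 3] - rc[, 1]       # lower-left corner (x1, y1): T = y1 - x1
T.ur <- rc[, 4] - rc[, 2]       # upper-right corner (x2, y2): T = y2 - x2

# Weighted empirical CDFs of T under the two corner assignments
cdf  <- function(tv, w) function(t) sum(w[tv <= t])
F.ll <- cdf(T.ll, p); F.ur <- cdf(T.ur, p)
sapply(c(2, 4, 6), function(t) c(lower.left = F.ll(t), upper.right = F.ur(t)))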

For the non-parametric approach, Sweeting et al seem extremely reluctant to make any assumptions whatsoever. In contrast, they go to town on assumptions once they get onto their preferred Bayesian method. It should first be mentioned that the outcome is the time to crossing of a biomarker threshold and that there is considerable auxiliary data available in the form of intermediate biomarker measurements. Thus, under any kind of monotonicity assumption, extra modelling of the growth process of the biomarker has merit. Sweeting et al use a parametric mixed-effects growth model for the biomarker. The growth is measured in terms of time since seroconversion, which is unknown. They consider two methods: a naive method that assumes seroconversion occurs at the midpoint of its censoring interval, and a uniform-prior method that assumes a priori that the seroconversion time is uniformly distributed over the interval within which it is censored (i.e. between the last time the subject was known to be seronegative and the first biomarker measurement time). The first method is essentially like imputing the interval midpoint for X; the second is like assuming the marginal distribution of X is uniform.
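
For concreteness, here is a toy version of the 'midpoint' strategy only. This is not the authors' growth model: the simulated data, variable names and the simple log-linear mixed model fitted with lme4 are purely illustrative.

library(lme4)

set.seed(1)
# Toy long-format data: 30 subjects with seroconversion interval-censored in
# (L.x, R.x) and a biomarker measured at a few visit times after R.x.
n   <- 30
L.x <- runif(n, 0, 50); R.x <- L.x + runif(n, 10, 60)
long <- do.call(rbind, lapply(1:n, function(i) {
  visits <- R.x[i] + cumsum(runif(4, 20, 60))
  true.x <- runif(1, L.x[i], R.x[i])
  data.frame(id = i, L.x = L.x[i], R.x = R.x[i], visit.time = visits,
             biomarker = exp(0.5 + 0.01 * (visits - true.x) + rnorm(4, 0, 0.1)))
}))

# 'Midpoint' strategy: impute seroconversion at the centre of its interval and
# fit a simple random-intercept/random-slope growth curve on the log scale.
long$t.since <- long$visit.time - (long$L.x + long$R.x) / 2
fit <- lmer(log(biomarker) ~ t.since + (1 + t.since | id), data = long)
summary(fit)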

Overall, the paper comes across as a straw man argument against non-parametric methods appended onto a Bayesian analysis that makes multiple (unverified) assumptions.

Thursday, 16 December 2010

A Measure of Explained Variation for Event History Data

Stare, Perme and Henderson have a new paper in Biometrics. This develops a measure of explained variation for survival and event history data, in some sense analogous to R^2 for linear regression. The measure is based on considering, at each event time, all individuals at risk and the rank of the individual that had the event in terms of the estimated intensities under a null model (e.g. assuming homogeneity of intensities across subjects), the current model (e.g. a Cox regression or Aalen additive hazards model) and a perfect model (where the individual who had the event always has the greatest intensity). In the case of complete observation, the measure is the ratio of the sum of the differences in ranks between the null and current models to the sum of the differences in ranks between the null and the perfect model. Thus a value of 1 represents perfect prediction, whilst 0 implies a model that predicts no better than the null model. Note that the measure can be negative when the predictions are worse than under the null model.
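
A toy implementation, following the verbal description above rather than the authors' exact definition or their code, might look like this for a Cox model on the survival package's lung data. Censoring is simply ignored here, i.e. this is the complete-observation version applied to censored data purely for illustration.

library(survival)

fit <- coxph(Surv(time, status) ~ age + sex, data = lung)
lp  <- predict(fit, type = "lp")            # higher lp = higher estimated intensity

event.times <- sort(unique(lung$time[lung$status == 2]))
num <- den <- 0
for (t in event.times) {
  at.risk <- which(lung$time >= t)
  failed  <- which(lung$time == t & lung$status == 2)
  n <- length(at.risk)
  r.null    <- (n + 1) / 2                  # expected rank when all intensities are equal
  r.perfect <- 1                            # the failing subject is always ranked first
  for (i in failed) {
    r.model <- 1 + sum(lp[at.risk] > lp[i]) # rank of the failing subject by fitted intensity
    num <- num + (r.null - r.model)
    den <- den + (r.null - r.perfect)
  }
}
num / den   # explained variation on the rank scale (can be negative)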

When there is not complete observation a weighted version is proposed. Weighting is based on the inverse probability of being under observation. For data subject to right-censoring independent of covariates, this can be estimated using a 'backwards' Kaplan-Meier estimate of the censoring distribution. The weighting occurs in two places. Firstly, the contribution of each event time is weighted to account for missing event times. Secondly, at each event time the contribution to the ranking of each individual is weighted by the probability of observation. This latter weighting is relevant when censoring is dependent on covariates.
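
The 'backwards' (reverse) Kaplan-Meier step itself is easy to sketch. The following is only illustrative, again using the lung data: censorings are treated as events to estimate G(t) = P(still under observation beyond t), and each observed event time t is then weighted by 1/G(t-).

library(survival)

cens.fit <- survfit(Surv(time, status == 1) ~ 1, data = lung)   # status == 1 means censored here
G <- stepfun(cens.fit$time, c(1, cens.fit$surv))                # reverse Kaplan-Meier

event.t <- lung$time[lung$status == 2]
w <- 1 / G(event.t - 1e-8)   # approximate 1/G(t-); guard against tiny G in practice
head(w)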

A very nice property of the measure is that a local version, relevant to a subset of the observation period, is possible. This is a useful alternative way of diagnosing, for instance, time-dependent covariate effects: lack of fit will manifest itself as a deterioration in the local measure over time.

One practical drawback of the measure is the requirement to model the "under observation" probability. For instance, left-truncated data would require some joint model of the left-truncation and right-censoring times. In the context of Cox-Markov multi-state models, it would in principle be possible to compute a separate measure for the model for each transition intensity. However, there will be inherent left-truncation, and it's not clear whether weighting to recover the data under complete observation makes sense in this case, because complete observation is unattainable in reality: subjects can only occupy one state at any given time.

The authors provide R code to calculate the measure for models fitted in the survival package. However, left-truncated data are not currently accommodated.

Friday, 3 December 2010

Interpretability and importance of functionals in competing risks and multi-state models

Per Kragh Andersen and Niels Keiding have a new paper currently available as a Department of Biostatistics, Copenhagen research report. They argue that three principles should be adhered to when constructing functionals of the transition intensities in competing risks and illness-death models. The principles are:

1. Do not condition on the future.
2. Do not regard individuals at risk after they have died.
3. Stick to this world.

They identify several existing ideas that violate these principles. Unsurprisingly, the latent failure times model for competing risks rightly comes under fire for violating (3): to say anything about hypothetical survival distributions in the absence of the other risks requires making untestable assumptions. Semi-competing risks analysis, where one seeks the survival distribution for illness in the absence of death, has the same problem.

The subdistribution hazard from the Fine-Gray model violates principle 2 because its risk set retains individuals who have already failed from a competing cause: it involves the form $\tilde\lambda_1(t) = \lim_{\Delta t \to 0} \Pr\{t \le T < t+\Delta t,\ \text{cause}=1 \mid T \ge t \ \text{or}\ (T < t \ \text{and cause} \ne 1)\}/\Delta t$. Andersen and Keiding say this makes interpretation of regression parameters difficult because they are log(subdistribution hazard ratios). The problem seems to be that many practitioners interpret the coefficients as if they were standard hazard ratios. The authors go on to say that linking covariates directly to cumulative incidence functions is useful. The distinction between this and the Fine-Gray model is rather subtle since, in the Fine-Gray model (when covariates are not time dependent), $\log\{-\log(1 - F_1(t \mid Z))\} = \log\{-\log(1 - F_{1,0}(t))\} + bZ$, i.e. b is essentially interpreted as a parameter in a cloglog model.
The conditional probability function recently proposed by Allignol et al has similar problems with principle 2.
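
To spell out the subtle distinction mentioned above, here is a short derivation in generic notation (not necessarily that of the report). Under proportional subdistribution hazards with a time-fixed covariate Z,

$$\tilde\lambda_1(t \mid Z) = \tilde\lambda_{1,0}(t)\, e^{bZ}, \qquad F_1(t \mid Z) = 1 - \exp\Bigl\{-\int_0^t \tilde\lambda_1(u \mid Z)\,\mathrm{d}u\Bigr\},$$

so that

$$\log\bigl\{-\log\bigl(1 - F_1(t \mid Z)\bigr)\bigr\} = \log \tilde\Lambda_{1,0}(t) + bZ,$$

i.e. b acts additively on the cloglog scale of the cumulative incidence function, which is why linking covariates directly to the cumulative incidence and the Fine-Gray model almost coincide when covariates are not time dependent.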

Principle 1 is violated by the pattern-mixture parametrisation, where we consider the distribution of event times conditional on the event type, e.g. the sojourn time in state i given that the subject moved to state j. This parametrisation is used, for instance, in flowgraph semi-Markov models.

A distinction that isn't really made clear in the paper is between violating the principles for mathematical convenience, e.g. to ease model fitting, and violating them in the actual inferential output. The functionals to be avoided should perhaps be those for which no easily interpretable transformation to a sensible measure is available. Thus a pattern-mixture parametrisation for a semi-Markov model without covariates seems unproblematic, since we can retrieve the transition intensities (see the sketch below). However, when covariates are present, the transition intensities will have complicated relationships to the covariates without an obvious interpretation.
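
As a sketch (in generic notation, not the paper's) of why the intensities are recoverable in the no-covariate case: if $p_{ij}$ is the probability that the next state after state $i$ is $j$, and $f_{ij}$ (with distribution function $F_{ij}$) is the density of the sojourn time in $i$ given a move to $j$, then

$$\alpha_{ij}(u) \;=\; \frac{p_{ij}\, f_{ij}(u)}{\sum_{k} p_{ik}\,\bigl\{1 - F_{ik}(u)\bigr\}},$$

so each transition intensity is a simple functional of the pattern-mixture quantities. Once covariates enter through the $p_{ij}$ and the $F_{ij}$, however, their effect on $\alpha_{ij}$ no longer has any simple form.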

**** UPDATE: The paper is now published in Statistics in Medicine. ****