## Friday, 17 December 2010

### Estimating the distribution of the window period for recent HIV infections: A comparison of statistical methods

Michael Sweeting, Daniela De Angelis, John Parry and Barbara Suligoi have a new paper in Statistics in Medicine. This considers estimating, from doubly-censored data, the time from seroconversion to a biomarker crossing a threshold (denoted T). Such methods have been used in the past to model time to seroconversion from infection. Sweeting et al state in the introduction that a key motivation is to allow the prevalence of recent infection to be estimated. This is defined as

$P(d) = \int_{0}^{d} I(t)S(d-t) dt$

where S(t) is the survivor function of the time from seroconversion to biomarker crossing. Here we see there is a clear assumption that T is independent of calendar time d. This is consistent with the framework first proposed by De Gruttola and Lagakos (Biometrics, 1989), where the time to first event is assumed to be independent of the time between first and second events. The data essentially arise from a three-state progressive semi-Markov model where interest lies in estimating the sojourn distribution in state 2. Alternatively, the methods of Frydman (JRSSB, 1992) are based on a Markov assumption. Here the marginal distribution of the sojourn in state 2 could be estimated by integrating over the estimated distribution of times to entry into state 2. Sweeting et al, however, treat it as standard bivariate survival data (where X denotes the time to seroconversion and Z denotes the time to biomarker crossing), using MLEcens for estimation. It's not clear whether this is sensible if the aim is to estimate P(d) as defined above. Moreover, Betensky and Finkelstein (Statistics in Medicine, 1999) suggested an alternative algorithm would be required for doubly-censored data because of the inherent ordering (i.e. Z>X). Presumably, as long as the intervals for seroconversion and crossing are disjoint, the NPMLE for S is guaranteed to have support only at non-negative times.
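To make the role of the calendar-time independence assumption in P(d) concrete, here is a minimal numerical sketch of the integral. It reads I(t) as the incidence at calendar time t and S as the survivor function of T. The constant incidence, the exponential window period, and all names and values in the code are assumptions chosen for illustration only; none of them come from the paper.

```python
import math


def prevalence_of_recent_infection(d, incidence, survivor, n_steps=1000):
    """Trapezoidal approximation to P(d) = int_0^d I(t) S(d - t) dt.

    `survivor` is applied to the elapsed time d - t only, which is exactly
    the assumption that T does not depend on calendar time.
    """
    h = d / n_steps
    total = 0.0
    for k in range(n_steps + 1):
        t = k * h
        weight = 0.5 if k in (0, n_steps) else 1.0
        total += weight * incidence(t) * survivor(d - t)
    return total * h


# Hypothetical inputs (not from the paper): constant incidence and an
# exponentially distributed window period with mean 0.5 years.
RATE = 0.01
MEAN_WINDOW = 0.5


def constant_incidence(t):
    return RATE


def exponential_survivor(u):
    return math.exp(-u / MEAN_WINDOW)


d = 5.0
p_numeric = prevalence_of_recent_infection(d, constant_incidence, exponential_survivor)

# Closed form for this special case, used only as a sanity check.
p_closed = RATE * MEAN_WINDOW * (1.0 - math.exp(-d / MEAN_WINDOW))
```

Under these assumed inputs P(d) approaches incidence times mean window period as d grows, which is why the distribution of T matters so much for this application.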

Sweeting et al make a great deal of the fact that, since the NPMLE of (X,Z) is non-unique (i.e. the NPMLE can only ascribe mass to regions of the (X,Z) plane) and T = Z - X, there will be huge uncertainty over the distribution of S between the extremes of assuming mass only in the lower left corner or only in the upper right corner of the support rectangles. They provide a plot to show this in their data example. The original approach to doubly-censored data of De Gruttola and Lagakos assumes a set of discrete time points of mass for S (the generalization to higher order models was given by Sternberg and Satten). This largely avoids any problems of non-uniqueness (though some problems remain; see e.g. Griffin and Lagakos, 2010).

For the non-parametric approach, Sweeting et al seem extremely reluctant to make any assumptions whatsoever. In contrast, they go completely to town on assumptions once they get onto their preferred Bayesian method. It should first be mentioned that the outcome time is the time to crossing of a biomarker, and there is considerable auxiliary data available in the form of intermediate measurements. Thus, under any kind of monotonicity assumption, we can see that extra modelling of the growth process of the biomarker has merit. Sweeting et al use a parametric mixed-effects model for the growth of the biomarker. The growth is measured in terms of time since seroconversion, which is unknown. They consider two methods: a naive method that assumes seroconversion occurs at the midpoint of the time interval, and a uniform prior method that assumes a priori that the seroconversion time is distributed uniformly within the interval in which it is censored (i.e. between the last time before seroconversion and the first biomarker measurement time). The first method is essentially like imputing the interval midpoint for X. The second method is like assuming the marginal distribution for X is uniform.
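The difference between the two treatments of X can be sketched in a few lines. This is only an illustration of midpoint imputation versus draws from a uniform prior; in the paper the uniform prior sits inside a full Bayesian growth model and would be updated by the biomarker data, which is not attempted here. The function names, the interval and the measurement time are all hypothetical.

```python
import random


def midpoint_seroconversion(last_negative, first_positive):
    """Naive method: impute X as the midpoint of its censoring interval."""
    return 0.5 * (last_negative + first_positive)


def uniform_seroconversion_draws(last_negative, first_positive, n_draws, seed=0):
    """Uniform prior method: draw X uniformly on its censoring interval.

    These are prior draws only; no likelihood contribution from the
    biomarker measurements is used in this sketch.
    """
    rng = random.Random(seed)
    return [rng.uniform(last_negative, first_positive) for _ in range(n_draws)]


# Hypothetical subject: seroconversion censored to (0, 1) years, with a
# biomarker measurement taken at 1.5 years.
last_negative, first_positive = 0.0, 1.0
measurement_time = 1.5

# Time since seroconversion at the measurement, under each treatment of X.
naive_elapsed = measurement_time - midpoint_seroconversion(last_negative, first_positive)
draws = uniform_seroconversion_draws(last_negative, first_positive, n_draws=2000)
elapsed_draws = [measurement_time - x for x in draws]
mean_elapsed = sum(elapsed_draws) / len(elapsed_draws)
```

The midpoint method feeds a single elapsed time into the growth model, whereas the uniform prior spreads it over the whole range implied by the censoring interval, so the two agree on average but differ in how much uncertainty they carry forward.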

Overall, the paper comes across as a straw man argument against non-parametric methods appended onto a Bayesian analysis that makes multiple (unverified) assumptions.