Monday, 10 September 2012

Practicable confidence intervals for current status data


Byeong Yeob Choi, Jason P. Fine and M. Alan Brookhart have a new paper in Statistics in Medicine. Essentially, the paper clarifies the practical implications of results from the asymptotic theory for current status data. In particular, it is known that the non-parametric bootstrap is inconsistent for current status data when the distribution of sampling times is continuous. The authors note that a previous study by Ghosh et al concluded that constructing confidence intervals by inverting the likelihood ratio statistic (as originally proposed in Banerjee and Wellner (2001)) gave the best results, particularly for smaller sample sizes. However, they also note that this approach is difficult to implement (e.g. because of a lack of available software). They therefore pursue Wald-type confidence intervals based on the limiting Chernoff distribution, using cloglog or logit transformations to get better coverage, and also look at the performance of the simple non-parametric bootstrap.
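To make the idea of a transformed Wald-type interval concrete, here is a minimal sketch using the delta method. It is not the authors' code: the function name, the `scale` argument (standing for n^(-1/3) times an estimate of the constant in the limiting distribution, however that is obtained) and the quoted Chernoff quantile are all assumptions made for illustration.

```python
import math

# Approximate 0.975 quantile of Chernoff's distribution. This value is quoted
# from memory of published tables; treat it as an assumption and verify it.
CHERNOFF_Q975 = 0.998181


def wald_interval(f_hat, scale, q=CHERNOFF_Q975, transform="identity"):
    """Wald-type interval for F(t) from a point estimate and a scale.

    `scale` is a hypothetical input: n**(-1/3) times an estimate of the
    constant in the limiting distribution. Its estimation is left open.
    """
    if not 0.0 < f_hat < 1.0:
        raise ValueError("f_hat must lie strictly between 0 and 1")
    half = q * scale
    if transform == "identity":
        return f_hat - half, f_hat + half
    if transform == "logit":
        centre = math.log(f_hat / (1.0 - f_hat))
        deriv = 1.0 / (f_hat * (1.0 - f_hat))  # derivative of logit at f_hat

        def inverse(x):
            return 1.0 / (1.0 + math.exp(-x))
    elif transform == "cloglog":
        centre = math.log(-math.log(1.0 - f_hat))
        # derivative of log(-log(1 - F)) at f_hat
        deriv = 1.0 / ((1.0 - f_hat) * -math.log(1.0 - f_hat))

        def inverse(x):
            return 1.0 - math.exp(-math.exp(x))
    else:
        raise ValueError("transform must be 'identity', 'logit' or 'cloglog'")
    # Build the interval on the transformed scale, then map it back.
    lower, upper = centre - half * deriv, centre + half * deriv
    return inverse(lower), inverse(upper)
```

With an estimate near the boundary, e.g. `wald_interval(0.05, 0.1)`, the untransformed interval has a negative lower limit, whereas the logit and cloglog versions stay inside (0, 1) by construction. That is one reason the transformations would be expected to help coverage in the tails.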

Perhaps unsurprisingly, they find that when sample sizes are relatively small, the performance of all methods depends on the quantile of the failure time distribution at which the confidence interval is computed (e.g. the methods perform much better when t is close to the median) and on the observation density at the time point considered (performance is poorer at times with a lower observation density). Using cloglog or logit transformations is found to improve coverage, but the non-parametric bootstrap tended to outperform this approach for smaller sample sizes, suggesting that the non-parametric bootstrap is an acceptable choice in practice.
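For readers who want to see what the simple non-parametric bootstrap involves here, the following is a rough sketch. The NPMLE of the distribution function is computed by pooling adjacent violators on the event indicators ordered by observation time, and a percentile interval is taken over resamples. The function names, the choice of a percentile interval and the defaults are assumptions of mine rather than details from the paper.

```python
import math
import random


def npmle_current_status(times, deltas):
    """Sorted observation times and the isotonic estimate of F at each.

    deltas[i] is 1 if the event had occurred by times[i], else 0.
    """
    # Within tied times, put 1s before 0s so that ties are pooled together.
    order = sorted(range(len(times)), key=lambda i: (times[i], -deltas[i]))
    blocks = []  # each block holds [sum of indicators, number of points]
    for i in order:
        blocks.append([float(deltas[i]), 1])
        # Merge blocks while the previous block mean exceeds the last one.
        while (len(blocks) > 1 and
               blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return [times[i] for i in order], fitted


def estimate_at(t, times, deltas):
    """Estimate of F(t): fitted value at the last observation time <= t."""
    value = 0.0
    for u, f in zip(*npmle_current_status(times, deltas)):
        if u > t:
            break
        value = f
    return value


def bootstrap_interval(t, times, deltas, n_boot=500, level=0.95, seed=0):
    """Percentile interval for F(t) from resampling (time, delta) pairs."""
    rng = random.Random(seed)
    n = len(times)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(estimate_at(t, [times[i] for i in idx],
                                 [deltas[i] for i in idx]))
    stats.sort()
    alpha = 1.0 - level
    lower = stats[math.floor(alpha / 2 * (n_boot - 1))]
    upper = stats[math.ceil((1 - alpha / 2) * (n_boot - 1))]
    return lower, upper
```

Note that this is exactly the procedure that is known to be inconsistent when the observation times are continuous, so the sketch should be read as an illustration of what was compared rather than a recommendation.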

An apparent omission in the paper is any mention of the altered asymptotics in the case where the observation distribution has support at a finite set of time points (or indeed where the rate of increase of the number of points is less than $n^{1/3}$). This issue is most comprehensively discussed in Tang et al, which didn't come out until after the paper was apparently submitted. However, the basic issue of standard asymptotics (and by implication a consistent bootstrap) when there is a finite set of observation points is discussed in Maathuis and Hudgens (2011), which the authors cite. For instance, in the Hoel and Walburg mice dataset used as illustration, the resolution of the data is to the nearest day. It is therefore reasonable to assume that, were the sample size to increase, the number of unique observation times would either be bounded by a fixed value (e.g. ~1000) or else would increase at a rate much less than $n^{1/3}$.

Friday, 7 September 2012

Effect of vitamin A deficiency on respiratory infection: Causal inference for a discretely observed continuous time non-stationary Markov process


Mingyuan Zhang and Dylan Small have a paper to appear in The Canadian Journal of Statistics, currently available here. The paper uses a multi-state model approach to obtain estimates of the causal effects of vitamin A deficiency on respiratory infection.

The observed data consist of respiratory infection status, vitamin A deficiency status and whether the child is stunted at time t. Each of these is a binary variable, leading to 8 possible observation patterns.

The data are assumed to be generated from an underlying non-homogeneous Markov chain on 32 states, consisting of a latent 4-level definition of vitamin A deficiency, the stunting variable, the observed respiratory infection status and additionally a counter-factual respiratory infection status, defined as the status at time t that would have occurred had the child maintained the lowest level of vitamin A deficiency from time 0 to time t.
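The state counts can be checked by enumerating the combinations directly. This is just bookkeeping under my own labelling (0 to 3 for the latent deficiency level, 0/1 for each binary variable), not notation from the paper.

```python
from itertools import product

binary = (0, 1)
deficiency_levels = range(4)  # latent 4-level deficiency; labels are assumed

# Latent state: (deficiency level, stunted, observed infection,
#                counter-factual infection)
latent_states = list(product(deficiency_levels, binary, binary, binary))

# Observation pattern: (infection, deficiency status, stunted), all binary
observed_patterns = list(product(binary, binary, binary))

print(len(latent_states), len(observed_patterns))
```

Dropping the observed infection status from the latent tuple leaves 4 x 2 x 2 = 16 states.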

The observed infection status, the underlying vitamin deficiency level and the counter-factual infection status at a given time are linked through an assumed set of conditional probabilities, with a separate parameter for each deficiency level j>0. That parameter measures the additional risk of having a respiratory infection at a particular time, given a current vitamin A deficiency at level j>0.
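One way a relationship of this kind could be written is sketched below. The notation is assumed purely for illustration and is not taken from the paper: $R_t$ for the observed infection status, $V_t$ for the latent deficiency level, $R^0_t$ for the counter-factual infection status and $\gamma_j$ for the additional-risk parameter at level j.

```latex
\begin{align*}
P(R_t = 1 \mid R^0_t = 1,\; V_t = j) &= 1 \quad \text{for all } j, \\
P(R_t = 1 \mid R^0_t = 0,\; V_t = 0) &= 0, \\
P(R_t = 1 \mid R^0_t = 0,\; V_t = j) &= \gamma_j \quad \text{for } j > 0 .
\end{align*}
```

Under this assumed form, a child who would have been infected anyway is always observed as infected, a child at the lowest deficiency level is observed exactly as in the counter-factual, and $\gamma_j$ is the probability that deficiency at level j produces an infection that would not otherwise have occurred.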

The overall model is a pretty innovative use of a hidden Markov model structure to obtain those causal estimates. The true process is assumed to occur in continuous time. However, the authors want the underlying transition intensities to be able to vary with time. As a result, they choose to approximate the process by one in discrete time (with some similarities to the approach of Bacchetti et al 2010).
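As a toy illustration of why discrete time makes time-varying transitions easy to handle, the sketch below builds one-step transition matrices for a two-state chain whose transition probabilities depend on time through a logit link, and multiplies them to get multi-step probabilities. The two-state setting, the function names and the coefficient values are all hypothetical; the model in the paper has 32 states and its own parameterisation.

```python
import math


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def step_matrix(t, a01=-2.0, b01=0.1, a10=-1.0, b10=-0.05):
    """One-step transition matrix from time t to t + 1 (made-up values)."""
    p01 = expit(a01 + b01 * t)  # probability of moving 0 -> 1 at step t
    p10 = expit(a10 + b10 * t)  # probability of moving 1 -> 0 at step t
    return [[1.0 - p01, p01], [p10, 1.0 - p10]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transition_matrix(s, t):
    """Transition probabilities from integer time s to integer time t >= s."""
    result = [[1.0, 0.0], [0.0, 1.0]]
    for u in range(s, t):
        result = matmul(result, step_matrix(u))
    return result
```

In continuous time the analogous calculation with time-varying intensities would generally require solving differential equations numerically, which is presumably part of the appeal of the discrete approximation.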

In practical terms, the weakness of the model seems to be the assumption that the relative effect of a current vitamin deficiency, compared to a perfect record of vitamin A levels, is both constant in time and independent of the past history of vitamin A deficiency. The latter assumption is essentially the Markov assumption and would be quite difficult to relax. The observed infection status is effectively a misclassified version of the counter-factual infection status. As a result, the former assumption could be relaxed by letting the additional-risk parameters depend on time, perhaps in a piecewise constant fashion.

The definition of the process as initially being continuous is a little artificial and ill-specified in places. For instance, the transition intensities are defined in terms of a logit transformation from the outset. Also, it is stated that the underlying 32-state process (including both the observed and the counter-factual infection status) is a continuous-time Markov process. However, if the observed infection status is defined only through the misclassification equations, its transition intensities have no well-defined limit as the time increment tends to zero. Once the process is in discrete time this is not a problem, but it would be more sensible to define the underlying continuous-time process on 16 states, excluding the observed infection status, and then specify that the observed infection statuses are observations in a hidden Markov model.