Friday 23 November 2012

Ties between event times and jump times in the Cox model


Xin, Horrocks and Darlington have a new paper in Statistics in Medicine. This considers approaches for dealing with ties in Cox proportional hazard models, not between event times but between event times and changes to time dependent covariates.

If a change in a time-dependent covariate coincides with a failure time, there is ambiguity over which value of the covariate, z(t+) or z(t-), should be taken for the risk set at time t. By convention, it is usually assumed that z(t-) should be taken, i.e. that the change in the covariate occurs after the failure time. The authors demonstrate that for small sample sizes and/or a large proportion of ties the estimates can be sensitive to the convention chosen. They restrict attention to cases where z(t) is a binary indicator that jumps from 0 to 1 at some point and cannot make the reverse jump. Obviously this magnifies the potential for bias, because the "change after" convention will always underestimate the true risk whereas the "change before" convention will always overestimate it.

The authors consider some simple adjustments for the problem: compute the "change before" and "change after" estimates and take their average or use random jittering. A problem with the averaging approach is estimating the standard error of the resulting estimator. An upper bound can be obtained by assuming the two estimators have perfect correlation. The jittering estimator obviously has the problem that different random jitters will give different results, though in principle the jittering could be repeated multiple times and combined in a fashion akin to multiple imputation.

It is surprising that the further option of adopting a method akin to the Efron correction for tied event times was not considered. Essentially, at each failure time there is an associated risk set, and it could be argued that each tied covariate jump time has a 50% chance of occurring before or after the failure time. The expected contribution from a particular risk set, for failing subject $i$ at time $t$, could then be taken as
$$\frac{\exp\{\beta z_i(t)\}}{\sum_{j \in R(t)}\left[\tfrac{1}{2}\exp\{\beta z_j(t^-)\} + \tfrac{1}{2}\exp\{\beta z_j(t^+)\}\right]},$$
so that each tied jumper contributes both of its possible covariate values to the denominator with weight 0.5 (for subjects without a tied jump, $z_j(t^-) = z_j(t^+)$ and the term is unchanged).

It should also be possible to apply this approach using standard software, e.g. coxph() in R. It is simply necessary to replace any (start, stop) interval that ends with a tied "stop" with two intervals, (start, stop - 0.00001) and (start, stop + 0.00001), each of which is associated with a case weight of 0.5.
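
To make this concrete, here is a minimal R sketch (my own toy example, not taken from the paper) of the 50/50 weighting idea using coxph() with case weights. The interval around the tie is split slightly more finely than described above, so that the subject keeps full weight at event times away from the tie and carries both covariate values, each with weight 0.5, only at the tied time itself.

library(survival)

eps <- 1e-5

# Toy counting-process data: subject 2's binary covariate z jumps from 0 to 1
# at time 5, which ties with subject 1's failure time.
dat <- data.frame(
  id    = c(1, 2, 2, 3, 4),
  start = c(0, 0, 5, 0, 0),
  stop  = c(5, 5, 9, 7, 6),
  event = c(1, 0, 1, 1, 1),
  z     = c(0, 0, 1, 0, 1),
  w     = 1
)

# Replace subject 2's rows around the tie at t = 5 so that at the tied failure
# time both z(5-) = 0 and z(5+) = 1 appear in the risk set with weight 0.5 each,
# while full weight is retained elsewhere.
split_rows <- data.frame(
  id    = 2,
  start = c(0,       5 - eps, 5 - eps, 5 + eps),
  stop  = c(5 - eps, 5 + eps, 5 + eps, 9),
  event = c(0,       0,       0,       1),
  z     = c(0,       0,       1,       1),
  w     = c(1,       0.5,     0.5,     1)
)
dat2 <- rbind(dat[dat$id != 2, ], split_rows)

fit <- coxph(Surv(start, stop, event) ~ z, data = dat2, weights = w)
summary(fit)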

Sunday 28 October 2012

Survival analysis with time varying covariates measured at random times by design


Stephen Rathbun, Xiao Song, Benjamin Neustiftler and Saul Shiffman have a new paper in Applied Statistics (JRSS C). This considers estimation of a proportional hazards survival model in the presence of time dependent covariates which are only intermittently sampled at random time points. Specifically they are interested in examples relating to ecological momentary assessment where data collection may be via electronic devices like smart phones and the decision on sampling times can be automated. They consider a self-correcting point-process sampling design, where the intensity of the sampling process depends on the past history of sampling times, which allows an individual to have random sampling times that are more regular than would be achieved from a Poisson process.

The proposed method of estimation is to use inverse intensity weighting to obtain an estimate of an individual's integrated hazard up to the event time. Specifically, for an individual with sampling times $t_{i1}, \ldots, t_{im_i}$ and point-process sampling intensity $\rho_i(t)$, the integrated hazard $\int_0^{\tau_i} \lambda\{t; Z_i(t)\}\,dt$ up to the event (or censoring) time $\tau_i$ is estimated by $\sum_{j=1}^{m_i} \lambda\{t_{ij}; Z_i(t_{ij})\}/\rho_i(t_{ij})$. This estimate then replaces the integrated hazard in an approximate log-likelihood.
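
As a concrete illustration (a minimal R sketch, not the authors' code), the inverse-intensity-weighted estimate of the integrated hazard is just a Horvitz-Thompson-type sum over the observed sampling times; the particular hazard and sampling intensity functions below are arbitrary placeholders.

# Inverse-intensity-weighted estimate of an individual's integrated hazard:
# sum over sampling times of lambda(t, Z(t)) / rho(t).
lambda <- function(t, z, beta = 0.5, shape = 1.3, rate = 0.2) {
  rate * shape * t^(shape - 1) * exp(beta * z)   # placeholder parametric hazard
}
rho <- function(t) rep(2, length(t))             # placeholder sampling intensity

t_obs <- c(0.4, 1.1, 1.9, 2.6, 3.2)              # random sampling times before the event
z_obs <- c(0, 0, 1, 1, 1)                        # covariate values recorded at those times

Lambda_hat <- sum(lambda(t_obs, z_obs) / rho(t_obs))
Lambda_hat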

In part of the simulation study, and in an application, the exact point-process intensity is unknown and is replaced by empirical estimates from the sample. Estimating the sampling intensity did not seem to have major consequences for the quality of the model estimates. This suggests the approach might be applicable in other survival models where the covariates are sampled longitudinally in a subject-specific manner, provided a reasonable model for the sampling process can be devised.

Drawbacks of the method seem to be that covariates measured at baseline (which is not a random time point) cannot be incorporated in the estimate, and that the covariates apparently must be measured at the event time, which may not be the case in medical contexts. The underlying hazard also needs to be specified parametrically, although, as stated in the paper, flexible spline models can be used.

Friday 26 October 2012

Estimating parametric semi-Markov models from panel data using phase-type approximations


Andrew Titman has a new paper in Statistics and Computing. This extends previous work on fitting semi-Markov models to panel data using phase-type distributions. Here, rather than assume a model in which each state is assumed to have a Coxian phase-type sojourn distribution, the model assumes a standard parametric sojourn distribution (e.g. Weibull or Gamma). The computational tractability of phase-type distributions is exploited by approximating the parametric semi-Markov model by a model in which each state has a 5-phase phase-type distribution. In order to achieve this, a family of approximations to the parametric distribution, with scale parameter 1, is developed by solving a relatively large one-off optimization assuming the optimal phase-type parameters for given shape parameters evolve as B-spline functions. The resulting phase-type approximations can then be scaled to give an approximation for any combination of scale and shape parameter and then embedded into the overall semi-Markov process. The resulting approximate likelihood appears to be very close to the exact likelihood, both in terms of shape and magnitude.
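
To give a rough idea of the kind of approximation involved, here is a hedged R sketch (my own, much cruder than the paper's approach: an arbitrary 3-phase Coxian fitted by least squares on a time grid, with no B-spline machinery and no scaling family) of approximating a Weibull sojourn distribution by a phase-type distribution, using the expm package for the matrix exponential.

library(expm)

# Survival function of a Coxian phase-type distribution with progression rates
# lam[1..k-1] between phases and exit (absorption) rates mu[1..k].
ph_surv <- function(t, lam, mu) {
  k <- length(mu)
  S <- matrix(0, k, k)
  diag(S) <- -(c(lam, 0) + mu)
  if (k > 1) S[cbind(1:(k - 1), 2:k)] <- lam
  alpha <- c(1, rep(0, k - 1))        # start in the first phase
  sapply(t, function(u) sum(alpha %*% expm(S * u)))
}

# Least-squares fit of a 3-phase Coxian to a Weibull(shape 1.5, scale 1)
# survival curve on a grid of times (log-parameterised to keep rates positive).
tgrid  <- seq(0.05, 4, by = 0.05)
target <- pweibull(tgrid, shape = 1.5, scale = 1, lower.tail = FALSE)
obj <- function(par) {
  lam <- exp(par[1:2]); mu <- exp(par[3:5])
  sum((ph_surv(tgrid, lam, mu) - target)^2)
}
fit <- optim(rep(0, 5), obj, method = "BFGS")
max(abs(ph_surv(tgrid, exp(fit$par[1:2]), exp(fit$par[3:5])) - target))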

In general a 2-phase Coxian phase-type model is likely to give similar results to a Weibull or Gamma model. The main advantages of the Weibull or Gamma model are that its parameters are identifiable under the null (Markov) model, so that standard likelihood ratio tests can be used to compare models (unlike for the 2-phase model), and that it requires fewer parameters, which may be useful for smaller sample sizes.

Constrained parametric model for simultaneous inference of two cumulative incidence functions


Haiwen Shi, Yu Cheng and Jong-Hyeon Jeong have a new paper in Biometrical Journal. This paper is somewhat similar in aims to the pre-print by Hudgens, Li and Fine in that it is concerned with parametric estimation in competing risks models. In particular, the focus is on building models for the cumulative incidence functions (CIFs) but ensuring that the CIFs sum to less than 1 at the asymptote as time tends to infinity. Hudgens, Li and Fine dealt with interval censored data but without covariates. Here, the data are assumed to be observed up to right-censoring but the emphasis is on simultaneously obtaining regression models directly for each CIF in a model with two risks.

The approach taken in the current paper is to assume that the CIFs sum to 1 at the asymptote and to model the cause 1 CIF using a modified three-parameter logistic function, with covariates entering via an appropriate link function. The CIF for the second competing risk is assumed to also have a three-parameter logistic form, but covariates only affect this CIF through the probability of this risk ever occurring.
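
To illustrate the general structure (in my own notation, which need not match the exact parameterisation of Shi et al, and with covariates entering the cause 1 CIF only through its asymptote for simplicity), one might take logistic-type curves in log time whose asymptotes are tied together so that they sum to 1 for every covariate value:

$$F_1(t \mid x) = \frac{p_1(x)\, t^{c_1}}{b_1 + t^{c_1}}, \qquad F_2(t \mid x) = \frac{\{1 - p_1(x)\}\, t^{c_2}}{b_2 + t^{c_2}}, \qquad p_1(x) = g^{-1}(\alpha + x^\top \beta),$$

so that $F_1(\infty \mid x) + F_2(\infty \mid x) = 1$ for all $x$, and covariates affect the second CIF only through the probability $1 - p_1(x)$ of that risk ever occurring.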

When a particular risk in a competing risks model is of primary interest, the Fine-Gray model is attractive because it makes interpretation of the covariate effects straightforward. The model of Shi et al seems to be for cases where both risks are considered important, but still seems to require that one risk be considered more important. The main danger of the approach seems to be that the model for the effect of covariates on the second risk may be unrealistic, but will have an effect on the estimates for the first risk. If we only care about the first risk the Fine-Gray model would be a safer bet. If we care about both risks it might be wiser to choose a model based on the cause-specific hazards, which are guaranteed to induce a model with well behaved CIFs albeit at the expense of some interpretability of the resulting CIFs.

Obtaining a model with a direct CIF effect for each cause seems an almost impossible task because, if we allow a covariate to affect the first CIF in such a way that a sufficiently extreme covariate value leads to a CIF arbitrarily close to 1, it must have a knock-on effect on the other CIF. The only way around this would be a model that assigns maximal asymptote probabilities to the CIFs at infinity that are independent of any covariates, e.g. $F_k(t \mid x) = \pi_k G_k(t \mid x)$, where the $G_k$ are increasing functions taking values in [0,1] and $\pi_1 + \pi_2 \leq 1$. The need to restrict the $\pi_k$ to be independent of covariates would, however, make the model quite inflexible.

Sunday 14 October 2012

Assessing age-at-onset risk factors with incomplete covariate current status data under proportional odds models


Chi-Chung Wen and Yi-Hau Chen have a new paper in Statistics in Medicine. This considers estimation of a proportional odds regression model for current status data in cases where a subset of the covariates may be missing at random for a subset of the patient population.

It is assumed that the probability that a portion of the covariates is missing depends on all the other observable outcomes (the failure status, the survey time and the rest of the covariate vector). The authors propose to fit a logistic regression model, involving all subjects in the dataset, for this probability of missingness. To fit the regression model for the current status data itself, they propose to use what they term a "validation likelihood estimator." This involves only working with the subset of patients with complete data but maximizing a likelihood that conditions on the fact that the whole covariate vector was observed. An advantage of using the proportional odds model over other candidate models (e.g. proportional hazards) is that the resulting likelihood remains of the proportional odds form.
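
A sketch of why this works (in my notation, which need not match the paper's): write $\Delta = I(T \leq C)$, let $X_1$ denote the always-observed covariates, let $\pi(\delta, C, X_1) = P(R = 1 \mid \Delta = \delta, C, X_1)$ be the fitted probability of being a complete case, and suppose the proportional odds model gives $\operatorname{logit} P(\Delta = 1 \mid C, X) = h(C) + \beta^\top X$. Then conditioning on being a complete case simply adds an offset on the logit scale:

$$\operatorname{logit} P(\Delta = 1 \mid C, X, R = 1) = h(C) + \beta^\top X + \log\frac{\pi(1, C, X_1)}{\pi(0, C, X_1)},$$

so the validation likelihood retains the proportional odds structure, with the offset supplied by the logistic missingness model.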

Clearly a disadvantage of this "validation likelihood estimator" is that the data from subjects who have incomplete covariates is not used directly in the regression model. As a result the estimator is likely to be less efficient than approaches that effectively attempt to impute the missing covariate values. The authors argue that the validation likelihood approach will tend to be more robust since it is not necessary to make (parametric) assumptions about the conditional distribution of the missing covariates.

Friday 12 October 2012

Nonparametric estimation of the cumulative intensities in an interval censored competing risks model


Halina Frydman and Jun Liu have a new paper in Lifetime Data Analysis. This concerns non-parametric estimation for competing risks models under interval censoring. The problem of estimating the cumulative incidence functions (or sub-distribution functions) under interval censoring has been considered by Hudgens et al (2001) and involves an extension of the NPMLE for standard survival data under interval censoring.

The resulting estimates of the cumulative incidence functions are only defined up to increments on intervals. Moreover, the intervals on which the CIFs are defined are not the same for each competing risk. This causes problems if one wants to convert the CIFs into estimates of the cumulative cause-specific hazards. Frydman and Liu propose estimating the cumulative cause-specific hazards by first constraining the NPMLEs of the CIFs to have the same intervals of support (NB: this is just a sub-set of the set of all NPMLEs, obtained by sub-dividing the intervals) and then adopting a convention to distribute each increment within the resulting sub-intervals (they assume an equal distribution across the sub-intervals).

In addition they show that an ad-hoc estimator of the cumulative hazards, based on the convention that the mass within each support interval of the NPMLE of each CIF is placed at its midpoint, leads to biased results. They also show that their estimator has standard $n^{1/2}$ convergence when the support of the observation time distribution is discrete and finite.

Tuesday 2 October 2012

Effect of an event occurring over time and confounded by health status: estimation and interpretation. A study based on survival data simulations with application on breast cancer


Alexia Savignoni, David Hajage, Pascale Tubert-Bitter and Yann De Rycke have a new paper in Statistics in Medicine. This considers developing illness-death type models to investigate the effect of pregnancy on the risk of recurrence of cancer amongst breast cancer patients. The authors give a fairly clear account of the different potential models, with particular reference to the hazard ratio for recurrence comparing pregnant with non-pregnant women. The simplest model to consider is a Cox model with a single time-dependent covariate $P(t)$ representing pregnancy, $\lambda(t \mid P, x) = \lambda_0(t)\exp\{\beta P(t) + \gamma^\top x\}$, giving a constant hazard ratio $\exp(\beta)$. This can be extended by assuming non-proportional hazards, which effectively makes the effect time dependent, i.e. a hazard ratio $\exp\{\beta(t)\}$. Alternatively, an unrestricted Cox-Markov model could be fitted, with separate covariate effects and non-parametric baseline hazards for recurrence from each pregnancy state, yielding (in an obvious notation) a hazard ratio of $\lambda_{0}^{(1)}(t)\exp(\gamma_1^\top x)/\lambda_{0}^{(0)}(t)\exp(\gamma_0^\top x)$. This model can be restricted by allowing a shared baseline hazard for the two recurrence transitions, giving either $\exp\{\beta + (\gamma_1 - \gamma_0)^\top x\}$ under a Cox model with a fixed pregnancy effect, or $\exp\{\beta(t) + (\gamma_1 - \gamma_0)^\top x\}$ for a time-dependent effect.

If we were only interested in this hazard ratio, and any of these models seems feasible, there doesn't actually seem to be much point in formulating the model as an illness-death model. Note that the transition intensity into the pregnancy state does not feature in any of the above quantities but would be estimated in the illness-death model. The above models can all be fitted by a Cox model with a time-dependent covariate (representing pregnancy) that has an interaction with the time-fixed covariates, as in the sketch below. The real power of a multi-state model approach would only become apparent if we were interested in the overall survival for different covariate values, treating pregnancy as a random event.
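
Here is a hedged R sketch (simulated toy data and a toy generating mechanism, not the authors' analysis) of fitting such models with coxph(): a shared-baseline model with a pregnancy-by-covariate interaction, and the 'unrestricted' version with separate baseline hazards for each pregnancy state via stratification.

library(survival)

set.seed(1)
n      <- 200
x      <- rnorm(n)                                # baseline covariate
t_rel  <- rexp(n, rate = 0.05 * exp(0.3 * x))     # recurrence time (toy mechanism)
t_cens <- runif(n, 5, 15)                         # censoring time
t_preg <- rexp(n, rate = 0.10)                    # time of pregnancy
t_obs  <- pmin(t_rel, t_cens)
d      <- as.numeric(t_rel <= t_cens)

# Counting-process format: split each subject's follow-up at the pregnancy time.
pre  <- data.frame(id = 1:n, start = 0, stop = pmin(t_preg, t_obs),
                   event = d * (t_obs <= t_preg), preg = 0, x = x)
post <- data.frame(id = 1:n, start = t_preg, stop = t_obs,
                   event = d, preg = 1, x = x)[t_preg < t_obs, ]
cp   <- rbind(pre, post)

# Shared baseline hazard, fixed pregnancy effect, covariate effect allowed to
# differ by pregnancy state through the interaction:
fit_shared <- coxph(Surv(start, stop, event) ~ preg * x, data = cp)

# "Unrestricted" Cox-Markov version: separate non-parametric baseline hazards
# for recurrence from each pregnancy state, with state-specific covariate effects:
fit_strat <- coxph(Surv(start, stop, event) ~ x + preg:x + strata(preg), data = cp)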

The time dependent effects are represented simply via a piecewise constant time indicator in the model. The authors do acknowledge that a spline model would have been better. The other issue that could have been considered is whether the effect of pregnancy depends on time since initiation of pregnancy (i.e. a semi-Markov effect). An issue in their data example is that pregnancy is only determined via a successful birth meaning there may be some truncation in the sample (through births prevented due to relapse/death).

Applying competing risks regression models: an overview

Bernhard Haller, Georg Schmidt and Kurt Ulm have a new paper in Lifetime Data Analysis. This reviews approaches to building regression models for competing risks data. In particular, they consider cause-specific hazard regression, subdistribution hazard regression (via both the Fine-Gray model and pseudo-observations), mixture models and vertical modelling. The distinction between mixture models and vertical modelling is the order of the conditioning. Mixture models factorize the joint distribution of the event time $T$ and cause $D$ as $P(T, D) = P(D)\,P(T \mid D)$, implying a separate time-to-event model for each cause of death, whereas vertical modelling uses $P(T, D) = P(T)\,P(D \mid T)$, meaning there is an overall "all cause" model for survival together with a time-dependent model for the conditional risk of the different causes. Vertical modelling fits in much more naturally with the standard hazard-based formulation used in classical competing risks, and Haller et al also prefer it to mixture modelling for computational reasons. The authors conclude, however, that vertical modelling's main purpose is as an exploratory tool for checking modelling assumptions that might be made in a more standard competing risks analysis. They suggest that in a study, particularly a clinical trial, it would be more appropriate to use a Cox model either on the cause-specific hazards or on the subdistribution hazard, the choice between the two depending on the particular research question of interest.

Monday 10 September 2012

Practicable confidence intervals for current status data


Byeong Yeob Choi, Jason P. Fine and M. Alan Brookhart have a new paper in Statistics in Medicine. Essentially the paper clarifies the practical implications of results relating to the asymptotic theory for current status data. In particular, it is known that the nonparametric bootstrap is inconsistent for current status data when the distribution of sampling times is continuous. The authors note that a previous study by Ghosh et al concluded that construction of confidence intervals based on inversion of the likelihood ratio statistic (as originally proposed in Banerjee and Wellner (2001)) gave the best results, particularly for smaller sample sizes. However, they also note that this approach is difficult to implement (e.g. there is a lack of available software). Here they therefore pursue approaches based on using the limiting Chernoff distribution to construct Wald-type confidence intervals, using cloglog or logit transformations to get better coverage, and also look at the performance of the simple non-parametric bootstrap.

Perhaps unsurprisingly, they find that when sample sizes are relatively small the performance of all methods depends on the quantile of the failure time distribution at which the confidence interval is computed (performance is much better when t is close to the median) and on the observation density at the time point considered (performance is poorer at times with a lower observation density). Using cloglog or logit transformations is found to improve coverage of the Wald-type intervals, but the non-parametric bootstrap tended to outperform them for smaller sample sizes, suggesting that, despite its theoretical inconsistency, the non-parametric bootstrap remains a reasonable practical choice.
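
For what it's worth, the simple non-parametric bootstrap is easy to try out; the R sketch below (mine, not the authors') resamples current status pairs and recomputes the estimate, using the fact that the NPMLE of F is (essentially) the isotonic regression of the failure indicators on the observation times.

# Percentile bootstrap interval for F(t0) from current status data.
npmle_cs <- function(cc, delta, t0) {
  fit <- isoreg(cc, delta)            # isotonic least squares ~ NPMLE of F
                                      # (ties introduced by resampling are handled only approximately)
  xs  <- sort(cc)
  if (t0 < xs[1]) return(0)
  fit$yf[max(which(xs <= t0))]        # step-function convention at t0
}

set.seed(1)
n     <- 200
tt    <- rweibull(n, shape = 1.5, scale = 3)   # true failure times (unobserved)
cc    <- runif(n, 0, 6)                        # observation (survey) times
delta <- as.numeric(tt <= cc)                  # current status indicators
t0    <- 3

est  <- npmle_cs(cc, delta, t0)
boot <- replicate(500, {
  idx <- sample.int(n, replace = TRUE)
  npmle_cs(cc[idx], delta[idx], t0)
})
c(estimate = est, quantile(boot, c(0.025, 0.975)))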

An apparent omission in the paper is any mention of the altered asymptotics in the case where the observation time distribution has support on a finite set of time points (or indeed where the number of distinct observation times grows more slowly than $n^{1/3}$). This issue is most comprehensively discussed in Tang et al, which didn't come out until after the paper was apparently submitted. However, the basic point that standard asymptotics (and, by implication, a consistent bootstrap) apply when there is a finite set of observation points is discussed in Maathuis and Hudgens (2011), which the authors cite. For instance, in the Hoel and Walburg mice dataset used as an illustration, the resolution of the data is to the nearest day. It is therefore reasonable to assume that, were the sample size to increase, the number of distinct observation times would either be bounded by a fixed value (e.g. ~1000) or else would increase at a rate much slower than $n^{1/3}$.

Friday 7 September 2012

Effect of vitamin A deficiency on respiratory infection: Causal inference for a discretely observed continuous time non-stationary Markov process


Mingyuan Zhang and Dylan Small have a paper to appear in The Canadian Journal of Statistics, currently available here. The paper uses a multi-state model approach to obtain estimates of the causal effects of vitamin A deficiency on respiratory infection.

The observed data consist of observations of respiratory infection status, vitamin A deficiency status and whether the child is stunted at time t. Each of these is a binary variable, leading to 8 possible observation patterns.

The data are assumed to be generated from an underlying non-homogeneous Markov chain on 32 states, consisting of a latent 4-level classification of vitamin A deficiency, the stunting variable, the observed respiratory infection status and, additionally, a counter-factual respiratory infection status, defined as the status at time t that would have occurred had the child maintained the lowest level of vitamin A deficiency from time 0 to time t.

Let $Y(t)$ represent the observed infection status, $V(t)$ the underlying vitamin A deficiency level and $Y^{*}(t)$ the counter-factual infection status. The assumed relationship links $Y(t)$ to $Y^{*}(t)$ and the current deficiency level, with a parameter, $\gamma_j$ say, for each level $j>0$; $\gamma_j$ then measures the additional risk of having a respiratory infection at a particular time, given a current vitamin A deficiency at level $j>0$.

The overall model is a pretty innovative use of a hidden Markov model structure to obtain these causal estimates. The true process is assumed to occur in continuous time. However, the underlying transition intensities are not assumed to be constant in time, and as a result the authors choose to approximate the process by one in discrete time (with some similarities to the approach of Bacchetti et al 2010).

In practical terms, the weakness of the model seems to be the assumption that the relative effect of a current vitamin A deficiency, compared with a perfect record of vitamin A levels, is both constant in time and independent of the past history of vitamin A deficiency. The latter assumption is essentially the Markov assumption and would be quite difficult to relax. The observed infection status is effectively a misclassified version of the counter-factual infection status, so the former assumption could be relaxed by letting the $\gamma_j$ depend on time, perhaps in a piecewise constant fashion.

The definition of the process as initially being in continuous time is a little artificial and ill-specified in places. For instance, the transition intensities are defined in terms of a logit transformation from the outset. Also, it is stated that the underlying 32-state process (including both the observed and counter-factual infection statuses) is a continuous-time Markov process. However, if the observed status $Y(t)$ is defined only through the misclassification equations, there is no limiting transition intensity for $Y(t)$ as the time increment tends to zero. Once the process is in discrete time this is not a problem, but it would be more sensible to define the underlying continuous-time process on the 16 states that exclude $Y(t)$, and then specify that the observed $Y(t)$ are observations from a hidden Markov model.

Wednesday 22 August 2012

Bootstrap confidence bands for sojourn distributions in multistate semi-Markov models with right censoring


Ronald Butler and Douglas Bronson have a new paper in Biometrika. This develops approaches to non-parametric estimation of survival or hitting time distributions in multi-state semi-Markov models with particular emphasis on cases where (multiple) backward transitions are possible. The paper is an extension of the authors' previous JRSS B paper. Here the methodology additionally allows for right-censored observations.

The key property used throughout is that, under a semi-Markov model, the data can be partitioned into separate independent sets relating to the sojourn times and destination states for each state.

A straightforward, but computationally intensive, approach to finding bootstrap confidence bands for the survival distribution of a semi-Markov process involves a nested re-sampling scheme: first sampling with replacement from the set of pairs of sojourn times and destination states for each state in the network, and then, for each such set of data, simulating a large number of walks through the network (starting from the initial state and ending at the designated 'hitting' state) from the empirical distributions implied by the first layer of re-sampling. An alternative, faster approach proposed involves replacing the inner step of the bootstrap with a step that computes saddlepoint approximations of the empirical sojourn distributions and then uses flowgraph techniques to find the implied hitting time distributions.

A drawback of both approaches is that they require all the mass of the sojourn distributions to be allocated. Hence, in the presence of right-censoring, some convention for the mass in the tail of each distribution must be adopted. Here, the convention of "redistribute-to-the-right" is used, which effectively re-weights the probability mass of the observed sojourn times so that it sums to 1. On the face of it this seems a rash assumption. However, right-censoring in the sample only occurs at the end of a sequence of sojourns in several states, so in most cases where this technique would be used the censoring in any particular state is likely to be quite light, even if the overall observed hitting times are heavily censored. Alternative conventions for the tail probability mass (for instance assuming a constant hazard beyond some point) could be adopted; all would be somewhat arbitrary, but hopefully would have little impact on the overall estimates.
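
As a small aside, the "redistribute-to-the-right" convention for a single state's censored sojourn times is easy to mimic; the R sketch below (mine, not the authors' code) simply renormalises the Kaplan-Meier jump masses so that they sum to 1 over the observed sojourns, giving an empirical distribution from which the inner bootstrap walks could be simulated.

library(survival)

rtr_weights <- function(time, status) {
  km   <- survfit(Surv(time, status) ~ 1)
  jump <- -diff(c(1, km$surv))            # KM probability mass at each distinct time
  keep <- jump > 0
  data.frame(time = km$time[keep], mass = jump[keep] / sum(jump[keep]))
}

set.seed(1)
soj    <- rexp(50, 0.3)                   # true sojourn times in a given state
cen    <- rexp(50, 0.1)                   # censoring times
obs    <- pmin(soj, cen)
status <- as.numeric(soj <= cen)

w <- rtr_weights(obs, status)
sum(w$mass)                               # all mass placed on observed sojourns
# Resampling sojourn times for the inner bootstrap step could then use
# sample(w$time, size, replace = TRUE, prob = w$mass).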

Unlike non-parametric estimation under a Markov assumption, for which the overall survival distribution will essentially equal the overall Kaplan-Meier estimate, with increasing uncertainty as time increases, under a (homogeneous) semi-Markov assumption the estimated hazard tends to a constant limit, and the hitting time distribution can hence be estimated with a relatively high degree of precision at arbitrarily large times.

Saturday 18 August 2012

A semi-Markov model for stroke with piecewise-constant hazards in the presence of left, right and interval censoring


Venediktos Kapetanakis, Fiona Matthews and Ardo van den Hout have a new paper in Statistics in Medicine. This develops a progressive three-state illness-death model for stroke with interval-censored data. The proposed model is a time non-homogeneous Markov model. The main approach to computation is to assume a continuous effect of age (time) on the transition intensities but to use a piecewise-constant approximation to actually fit the model. The intensity of death from the stroke state additionally depends on the time since entry into that state (equivalently, the age at stroke); since the exact time of stroke is typically unknown, it is necessary to numerically integrate over the possible range of transition times (here using Simpson's rule).

The data include subjects whose time of stroke is left-censored because they had already suffered a stroke before the baseline measurement. The authors state that they cannot integrate out the unknown time of stroke because the left end-point of the censoring interval (i.e. the last age at which the subject is known to have been healthy) is unknown. They then proceed to propose a seemingly unnecessary ad-hoc EM-type approach based on estimating the stroke age for these individuals, which requires the arbitrary choice of an age at which it can be assumed the subject was stroke free. However, surely if we are prepared to assume such an age, $a_0$ say, we can just use $a_0$ as the lower limit in the integral for the likelihood?

The real issue seems to be that all subjects are effectively left-truncated at the time of entry into the study (in the sense that they are only sampled because they have not died before their current age). For subjects who are healthy at baseline, this left-truncation is accounted for by simply integrating the hazards of transition out of state 1 from their age at baseline rather than from age 0. For subjects who have already had a stroke, things are more complicated because the fact that they have survived provides information on the timing of the stroke (e.g. if stroke increases the hazard of death, the fact that they have survived implies the stroke is likely to have occurred more recently than one would assume if no information on survival were available). Essentially the correct likelihood is conditional on survival to the age at entry, $t_0$ say, and so the unconditional probability of the observed data needs to be divided by the unconditional probability of survival to $t_0$. For instance, writing state 1 for stroke-free, state 2 for post-stroke, $\lambda_{12}(u)$ for the stroke intensity and $P_{rs}(s,t)$ for the transition probabilities, a subject who is in state 2 at baseline age $t_0$ and is last seen alive at age $t_1$ should have a likelihood contribution of the form
$$\frac{\int_0^{t_0} P_{11}(0,u)\,\lambda_{12}(u)\,P_{22}(u,t_1)\,du}{P_{11}(0,t_0) + \int_0^{t_0} P_{11}(0,u)\,\lambda_{12}(u)\,P_{22}(u,t_0)\,du}.$$
The authors claim that their convoluted EM-approach has "bypassed the problem of left truncation". In reality, they have explicitly corrected for left-truncation (because the expected transition time is conditional on being in state 2 at baseline), but in a way that is seemingly much more computationally demanding than directly computing the left-truncated likelihood would be.

Tuesday 14 August 2012

Absolute risk regression for competing risks: interpretation, link functions, and prediction

Thomas Gerds, Thomas Scheike and Per Andersen have a new paper in Statistics in Medicine. To a certain extent this is a review paper; it considers models for direct regression on the cumulative incidence function for competing risks data, specifically models of the form $g\{F_1(t \mid X)\} = \eta_0(t) + X^\top\beta$, where $g$ is a known link function and $F_1(t \mid X)$ is the cumulative incidence function for event 1 given covariates $X$. The Fine-Gray model is the special case in which a complementary log-log link is adopted. Approaches to estimation based on inverse probability of censoring weighting and on jackknife-based pseudo-observations are considered. Model comparison based on predictive accuracy, as measured through the Brier score, and model diagnostics based on extended models allowing time-dependent covariate effects are also discussed.

The discussion gives a clear account of the various pros and cons of direct regression of the cumulative incidence functions. In particular, an obvious, though perhaps not always sufficiently emphasized, issue is that if, in a model with two causes, a Fine-Gray (or other direct) model is fitted to the first cause and another to the second cause, the resulting predictions will not necessarily satisfy $\hat{F}_1(t \mid X) + \hat{F}_2(t \mid X) \leq 1$ for all $t$ and $X$. This is not problematic if the second cause is essentially a nuisance, but it obviously is if both causes are of interest. In such cases regression on the cause-specific hazards is preferable, even if it makes interpreting the effects on the cumulative incidence functions more difficult.

Friday 10 August 2012

Semiparametric transformation models for current status data with informative censoring

Chyong-Mei Chen, Tai-Fang Lu, Man-Hua Chen and Chao-Min Hsu have a new paper in Biometrical Journal. In common with the recent paper by Wang et al, this considers the estimation of current status data under informative censoring. Here, rather than using a copula to describe the dependence between the censoring and failure time distributions, a shared frailty is assumed between the censoring intensity and the failure intensity. The frailty is assumed to be log-normal and to act multiplicatively on both intensities, so that, conditional on the frailty $u$, the failure and censoring intensities are of the form $u\,\lambda_T(t)$ and $u\,\lambda_C(t)$ respectively. Covariate effects are also allowed for via a general class of transformation models. For estimation, the authors approximate the semi-parametric maximum likelihood estimate by assuming that the conditional intensities for the censoring and failure events are piecewise constant functions with an arbitrarily chosen set of change points. Since maximization of the likelihood requires estimation of a large number of unknown parameters and integration over the frailty distribution, the authors propose an EM algorithm.

The method attempts to estimate the failure and censoring time distributions non-parametrically and also the variance of the frailty term. The assumed form of dependence between T and C is reasonably restrictive; the frailty could feasibly have entered the intensity for C raised to a power, for instance, allowing other types of dependence. Nevertheless, even with the restrictions it is not clear how the overall model is identifiable. We can only observe $(C, \Delta)$, where $\Delta = I(T \leq C)$ is an indicator of whether the failure has occurred by the censoring time $C$. Log-normal frailties are not particularly nice computationally, whereas a Gamma frailty would allow some tractability. In the case of a shared Gamma frailty with mean 1 and variance $v$, and cumulative intensities $\Lambda_T$ and $\Lambda_C$, it can be shown that the marginal distribution of the censoring times satisfies $P(C > c) = \{1 + v\Lambda_C(c)\}^{-1/v}$ and that the conditional probability of a failure by an observation time $c$ is
$$P(T \leq c \mid C = c) = 1 - \left\{\frac{1 + v\Lambda_C(c)}{1 + v\Lambda_C(c) + v\Lambda_T(c)}\right\}^{1/v + 1}.$$
The problem is that we can vary the value of $v$ and find new cumulative intensity functions that result in the same two observable functions. The addition of covariates via a particular model facilitates some degree of identifiability but, in a similar way to frailty terms in univariate Cox proportional hazards models, this could just as easily be describing misspecification of the covariate model as true dependence.