Friday 17 December 2010

Estimating the distribution of the window period for recent HIV infections: A comparison of statistical methods

Michael Sweeting, Daniela De Angelis, John Parry and Barbara Suligoi have a new paper in Statistics in Medicine. This considers estimating the time from seroconversion to a biomarker crossing a threshold (denoted T) from doubly-censored data. Such methods have been used in the past to model the time from infection to seroconversion. Sweeting et al state in the introduction that a key motivation is to allow the prevalence of recent infection to be estimated. This is defined as

[equation defining P(d), the prevalence of recent infection at calendar time d, in terms of the survivor function S(t)]

where S(t) is the survivor function of the time from seroconversion to biomarker crossing. Here we see there is a clear assumption that T is independent of calendar time d. This is consistent with the framework first proposed by De Gruttola and Lagakos (Biometrics, 1989), where the time to the first event is assumed to be independent of the time between the first and second events; the data essentially arise from a three-state progressive semi-Markov model in which interest lies in estimating the sojourn distribution in state 2. Alternatively, the methods of Frydman (JRSSB, 1992) are based on a Markov assumption; here the marginal distribution of the sojourn in state 2 could be estimated by integrating over the estimated distribution of times of entry into state 2. Sweeting et al, however, treat the data as standard bivariate survival data (where X denotes the time to seroconversion and Z denotes the time to biomarker crossing), using MLEcens for estimation. It's not clear whether this is sensible if the aim is to estimate P(d) as defined above. Moreover, Betensky and Finkelstein (Statistics in Medicine, 1999) suggested an alternative algorithm would be required for doubly-censored data because of the inherent ordering (i.e. Z > X). Presumably, as long as the intervals for seroconversion and crossing are disjoint, the NPMLE for S is guaranteed to have support only at non-negative times.

Sweeting et al make a great deal of the fact that the NPMLE of the joint distribution of (X, Z) is non-unique (it can only ascribe mass to rectangular regions of the (X, Z) plane, without determining how the mass is distributed within them) and that, since T = Z - X, there will be huge uncertainty over the distribution of T between the extremes of assuming mass only in the lower left corners or only in the upper right corners of the support rectangles. They provide a plot to show this in their data example. The original approach to doubly-censored data of De Gruttola and Lagakos assumes a set of discrete time points of mass for T (the generalization to higher-order models was given by Sternberg and Satten). This largely avoids any problems of non-uniqueness (though some problems remain, see e.g. Griffin and Lagakos 2010).

For the non-parametric approach, Sweeting et al seem extremely reluctant to make any assumptions whatsoever. In contrast, they go completely to town on assumptions once they get onto their preferred Bayesian method. It should firstly be mentioned that the outcome is the time to a biomarker crossing a threshold and that there is considerable auxiliary data available in the form of intermediate measurements. Thus, under any kind of monotonicity assumption, extra modelling of the growth process of the biomarker has merit. Sweeting et al use a parametric mixed-effects growth model for the biomarker, with growth measured in terms of time since seroconversion, which is unknown. They consider two methods: a naive method that assumes seroconversion occurs at the midpoint of its censoring interval, and a uniform prior method that assumes a priori that the seroconversion time is distributed uniformly within the interval in which it is censored (i.e. between the last time known to precede seroconversion and the first biomarker measurement time). The first method is essentially like imputing the interval midpoint for X; the second is like assuming the marginal distribution for X is uniform.
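As a rough illustration of the naive midpoint method (simulated data and a simple linear mixed-effects growth model; the authors' actual model specification will differ):

library(lme4)
set.seed(1)

n <- 100
# hypothetical censoring interval for each subject's seroconversion time:
# last seronegative test (L) and first seropositive test / first biomarker measurement (R)
L <- runif(n, 0, 2)
R <- L + runif(n, 0.2, 1.5)
sero <- runif(n, L, R)                      # true (unobserved) seroconversion times

# three biomarker measurements per subject after R, growing roughly linearly
# in time since seroconversion with subject-specific random slopes
obs_times <- R + matrix(rexp(3 * n, 1), n, 3)
slope <- rnorm(n, 1, 0.25)
dat <- data.frame(
  id   = rep(1:n, each = 3),
  time = as.vector(t(obs_times)),
  y    = as.vector(t(slope * (obs_times - sero))) + rnorm(3 * n, 0, 0.25)
)

# naive method: impute the midpoint of (L, R) as the seroconversion time
dat$t_since_sero <- dat$time - rep((L + R) / 2, each = 3)

# mixed-effects growth model on the imputed time scale (random slopes)
fit <- lmer(y ~ t_since_sero + (0 + t_since_sero | id), data = dat)
summary(fit)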

Overall, the paper comes across as a straw man argument against non-parametric methods appended onto a Bayesian analysis that makes multiple (unverified) assumptions.

Thursday 16 December 2010

A Measure of Explained Variation for Event History Data

Stare, Perme and Henderson have a new paper in Biometrics. This develops a measure of explained variation for survival and event history data, in some sense analogous to R^2 for linear regression. The measure is based on considering all individuals at risk at each event time and the rank of the individual who had the event, in terms of the estimated intensities under a null model (e.g. assuming homogeneity of intensities across subjects), the current model (e.g. a Cox regression or Aalen additive hazards model) and a perfect model (where the individual who had the event always has the greatest intensity). In the case of complete observation, the measure is the ratio of the sum of the differences in ranks between the null and current models to the sum of the differences in ranks between the null and perfect models. Thus a value of 1 would represent perfect prediction, whilst 0 would imply a model only as good as the null model. Note that the measure can be negative when the predictions are worse than under the null model.

When there is not complete observation a weighted version is proposed. Weighting is based on the inverse probability of being under observation. For data subject to right-censoring independent of covariates, this can be estimated using a 'backwards' Kaplan-Meier estimate of the censoring distribution. The weighting occurs in two places. Firstly, the contribution of each event time is weighted to account for missing event times. Secondly, at each event time the contribution to the ranking of each individual is weighted by the probability of observation. This latter weighting is relevant when censoring is dependent on covariates.
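For the simple case of censoring independent of covariates, the 'backwards' Kaplan-Meier ingredient can be computed along these lines (a generic sketch using the survival package on simulated data, not the authors' exact weighting scheme):

library(survival)
set.seed(1)

n <- 200
ev  <- rexp(n, 0.10)               # event times
cen <- rexp(n, 0.05)               # censoring times
time   <- pmin(ev, cen)
status <- as.numeric(ev <= cen)

# 'backwards' Kaplan-Meier: treat censoring as the event of interest
Gfit <- survfit(Surv(time, 1 - status) ~ 1)

# left-continuous step function giving P(still under observation just before t)
G <- stepfun(Gfit$time, c(1, Gfit$surv), right = TRUE)

# inverse-probability-of-censoring weight for each observed event
w <- ifelse(status == 1, 1 / G(time), 0)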

A very nice property of the measure is that a local version, relevant to a subset of the observation period, is possible. This is a useful alternative way of diagnosing, for instance, time-dependent covariate effects: lack of fit will manifest itself as a deterioration in the local measure.

One practical drawback of the measure is the requirement to model the "under observation" probability. For instance, left-truncated data would require some joint model of the left-truncation and right-censoring times. In the context of Cox-Markov multi-state models, it would in principle be possible to compute a separate measure for the model for each intensity. However, there will be inherent left-truncation and it's not clear whether weighting to recover the data under complete observation makes sense in this case, because complete observation is unattainable in reality: subjects can only occupy one state at any given time.

The authors provide R code to calculate the measure for models fitted in the survival package. However, left-truncated data are not currently accommodated.

Friday 3 December 2010

Interpretability and importance of functionals in competing risks and multi-state models

Per Kragh Andersen and Niels Keiding have a new paper currently available as a Department of Biostatistics, Copenhagen research report. They argue that three principles should be adhered to when constructing functionals of the transition intensities in competing risks and illness-death models. The principles are:

1. Do not condition on the future.
2. Do not regard individuals at risk after they have died.
3. Stick to this world.

They identify several existing ideas that violate these principles. Unsurprisingly, the latent failure times model for competing risks rightly comes under fire for violating (3): to say anything about hypothetical survival distributions in the absence of the other risks requires making untestable assumptions. Semi-competing risks analysis, where one seeks the survival distribution for illness in the absence of death, has the same problem.

The subdistribution hazard from the Fine-Gray model violates principle 2 because it involves the form P(t <= T < t+dt, D = 1 | T >= t or (T < t and D != 1)), i.e. individuals who have already failed from a competing cause remain in the risk set. Andersen and Keiding say this makes interpretation of regression parameters difficult because they are log(subdistribution hazard ratios). The problem seems to be that many practitioners interpret the coefficients as if they were standard hazard ratios. The authors go on to say that linking covariates directly to cumulative incidence functions is useful. The distinction between this and the Fine-Gray model is rather subtle since, in the Fine-Gray model (when covariates are not time dependent), 1 - F_1(t;x) = (1 - F_{1,0}(t))^{exp(b'x)}, i.e. b is essentially interpreted as a parameter in a cloglog model for the cumulative incidence function.
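To spell out the last point (a standard derivation rather than anything taken from the paper): writing the subdistribution hazard as the derivative of -log(1 - F_1) and assuming proportionality with time-fixed covariates,

\[
\tilde{\lambda}_1(t;x) = -\frac{\mathrm{d}}{\mathrm{d}t}\log\{1 - F_1(t;x)\} = \tilde{\lambda}_{1,0}(t)\exp(b^\top x)
\;\Rightarrow\;
1 - F_1(t;x) = \{1 - F_{1,0}(t)\}^{\exp(b^\top x)},
\]

so that \(\log[-\log\{1 - F_1(t;x)\}] = \log[-\log\{1 - F_{1,0}(t)\}] + b^\top x\), a cloglog-type model for the cumulative incidence function.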
The conditional probability function recently proposed by Allignol et al has similar problems with principle 2.

Principle 1 is violated in the pattern-mixture parametrisation. This is where we consider the distribution of event times conditional on the event type, e.g. the sojourn in state i given the subject moved to state j. This is used for instance in flow-graph Semi-Markov models.

A distinction that is perhaps needed, and isn't really made clear in the paper, is between violating the principles for mathematical convenience (e.g. for model fitting) and violating them in the actual inferential output. Functionals to be avoided should perhaps be those for which no easily interpretable transformation to a sensible measure is available. Thus a pattern-mixture parametrisation for a semi-Markov model without covariates seems unproblematic, since we can retrieve the transition intensities. However, when covariates are present the transition intensities will have complicated relationships with the covariates, without an obvious interpretation.

**** UPDATE : The paper is now published in Statistics in Medicine. *****

Sunday 21 November 2010

The analysis of tumorigenicity data using a frailty effect

Yang-Jin Kim, Chung Mo Nam, Youn Nam Kim, Eun Hee Choi and Jinheum Kim have a new paper in the Journal of the Korean Statistical Society. This concerns the analysis of tumorigenicity data where a three-state model with states representing tumour-free, with tumour and death is used. The main characteristic of the data is that disease status (presence of tumour) can only be determined at either death or sacrifice - meaning we have data with possible observations: sacrifice without tumour, sacrifice with tumour, death without tumour and death with tumour.
Note that while the time of tumour onset is always interval censored, if the mouse dies before sacrifice the death time is taken to be known exactly.

Kim et al extend the model proposed by Lindsey and Ryan (1993, Applied Statistics) by allowing a shared-frailty term that affects both tumour onset rate and the hazard of death given the presence of a tumour. Like Lindsey and Ryan, piecewise constant intensities are used to model the baseline intensities, with a Cox proportional hazards model for the effect of covariates. The same baseline hazard is used for pre- and post-tumour hazard of death, with the presence of tumour changing the effect of covariates. For some reason the introduction talks about n^{1/3} convergence rates of the non-parametric survival distribution for current status data. This doesn't seem relevant here given that the piecewise constant intensities model is parametric.

An EM algorithm is used to maximize the likelihood. Gauss-Hermite quadrature is required to perform the M-step due to the presence of the frailty terms. The authors appear to be basing standard error estimates on the complete data (rather than observed data) likelihood.

The new model is applied to the same data as used by Lindsey and Ryan. In addition, a simulation shows that there is some reduction in bias compared to Lindsey and Ryan's method if the proposed model is correctly specified.

Monday 8 November 2010

Accounting for bias due to a non-ignorable tracing mechanism in a retrospective breast cancer cohort study

Titman, Lancaster, Carmichael and Scutt have a new paper in Statistics in Medicine. This applies methods developed by Copas and Farewell (Biostatistics, 2001) to data from a retrospective cohort study where patients were traced conditional on survival up to a certain time. The authors note that the resulting observed process can be viewed as a purged process (Hoem, 1969). In addition to the pseudo-likelihood method of Copas and Farewell (which requires specification of the entry time distribution of patients), a full likelihood approach based on piecewise constant intensities under a Markov assumption is also applied. The term to take into account the conditional survival involves a transition probability, so estimation has similar difficulties to interval-censored data. For the breast cancer study considered, the two methods give very similar results.

A semi-competing risks model for data with interval-censoring and informative observation

Jessica Barrett, Fotios Siannis and Vern Farewell have a new paper in Statistics in Medicine. Essentially the paper uses similar methods to those in Siannis et al 2006 to investigate informative loss-to-follow-up (LTF) in a study of aging and cognitive function.

LTF here refers to loss-to-follow-up of the monitoring of cognitive impairment - crucially, survival continues to be monitored. LTF (from healthy) is modelled as a separate state in the process with its own transition intensity. Once lost to follow-up, the subject experiences different intensities of becoming cognitively impaired or dying. An unidentifiable parameter k determines the relative rate at which people who are lost to follow-up (before becoming cognitively impaired) proceed to the cognitively impaired state rather than the death state, compared to those not lost to follow-up.
k can be varied to see what impact assumptions about those LTF have on overall estimates. In the current study k has quite a large impact on estimates of cumulative incidence of cognitive impairment. This is in contrast to the Whitehall study where these methods were applied to right-censored data, where k had little effect.

A parametric Markov model with Weibull intensities is used to model the data. Due to the interval censoring, computation of the likelihood requires numerical integration.

As an informal goodness-of-fit test, the authors compare a Cox model of overall survival with the corresponding survival estimates from the multi-state model with proportional intensity models on each transition. The authors note that the Cox proportional hazards model for overall survival should be unbiased. Of course, it is only unbiased if the proportional hazards assumption holds on the overall hazard of death. If, however, the covariate effects are proportional on the individual intensities, as assumed in the multi-state model, the Cox model on overall survival will be biased. This approach is therefore more a test of robustness to assumptions about the covariates than a goodness-of-fit test, because the models aren't nested. The same method was used in Siannis et al 2006 and in Van den Hout et al, 2009.

Monday 1 November 2010

A regression model for the conditional probability of a competing event: application to monoclonal gammopathy of unknown significance

Arthur Allignol, Aurélien Latouche, Jun Yan and Jason Fine have a new paper in Applied Statistics (JRSS C). The paper concerns competing risks data and develops methods for regression analysis of the probability of a competing event conditional on no competing event having occurred. In terms of the cumulative incidence functions, for the case of two competing events, this can be written as F_1(t)/(1 - F_2(t)), where F_1 and F_2 are the two cumulative incidence functions. In some applications this quantity may be more useful than either the cause-specific hazards or the cumulative incidence functions themselves. One approach to regression in this scenario might be to compute pseudo-observations and perform the regression using those. The authors instead propose the use of temporal process regression (Fine, Yan and Kosorok 2004), allowing estimation of time-dependent regression parameters by considering the cross-sectional data at each event time.
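As a quick illustration of the quantity itself (a minimal sketch using the cmprsk package on simulated data; the temporal process regression machinery of the paper is not attempted here):

library(cmprsk)
set.seed(1)

n <- 200
t1 <- rexp(n, 0.10)                           # event of type 1
t2 <- rexp(n, 0.05)                           # event of type 2
cens <- runif(n, 0, 15)
time   <- pmin(t1, t2, cens)
status <- ifelse(cens <= pmin(t1, t2), 0, ifelse(t1 <= t2, 1, 2))

fit  <- cuminc(time, status, cencode = 0)     # cumulative incidence function estimates
tt   <- 1:10
cifs <- timepoints(fit, tt)$est               # rows "1 1" (cause 1) and "1 2" (cause 2)
F1 <- cifs["1 1", ]
F2 <- cifs["1 2", ]

# conditional probability of a type 1 event by t, given no type 2 event by t
CP <- F1 / (1 - F2)
round(CP, 3)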

Tuesday 26 October 2010

Multiple imputation for estimating the risk of developing dementia and its impact on survival

Yu, Saczynski and Launer have a new paper in Biometrical Journal. This proposes an approach to fitting Cox-Markov or Cox-semi-Markov models to illness-death models subject to interval censoring via the use of iterative multiple imputation.

In essence this paper is a generalization of the method for interval censored survival data proposed by Pan (2000; Biometrics). The idea is to start by imputing the mid-point of the censoring interval for each unknown event time and to calculate the parameter estimates treating these times as known. Then, based on the new estimates, impute e.g. 20 different complete datasets, estimate the parameters from these complete datasets, and then re-impute. This process continues until the parameter estimates converge.
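To make the algorithm concrete, here is a rough sketch for the simpler case of interval-censored survival data (essentially Pan's setting rather than the paper's multi-state setting); the simulated data, interval structure, number of imputations and convergence rule are all illustrative assumptions rather than anything taken from the paper:

library(survival)
set.seed(1)

## illustrative interval-censored data: inspections every 2 time units on a
## subject-specific grid (random offset), so the censoring intervals overlap
n   <- 300
x   <- rbinom(n, 1, 0.5)
T0  <- rexp(n, rate = 0.1 * exp(0.7 * x))     # true (unobserved) event times
off <- runif(n, 0, 2)
R   <- off + 2 * ceiling((T0 - off) / 2)      # event known to lie in (R - 2, R]
L   <- pmax(R - 2, 0)

## draw an event time for each subject from the current fitted Cox model,
## conditional on the event lying in (L, R]
impute_times <- function(fit, L, R, x, Tcur) {
  H0 <- basehaz(fit, centered = FALSE)        # Breslow cumulative baseline hazard (step function)
  Hstep <- function(t) approx(H0$time, H0$hazard, xout = t, method = "constant",
                              yleft = 0, rule = 2)$y
  lp <- as.vector(coef(fit) * x)
  Tnew <- Tcur
  for (i in seq_along(Tcur)) {
    s <- H0$time[H0$time > L[i] & H0$time <= R[i]]   # support points inside the interval
    if (length(s) == 0) next                         # keep the current value if there are none
    Ssup  <- exp(-Hstep(s) * exp(lp[i]))             # S(s_j | x_i)
    Sprev <- c(exp(-Hstep(L[i]) * exp(lp[i])), Ssup[-length(Ssup)])
    p <- pmax(Sprev - Ssup, 0)                       # conditional mass at each support point
    if (sum(p) > 0) Tnew[i] <- s[sample.int(length(s), 1, prob = p)]
  }
  Tnew
}

## start from mid-point imputation
Timp <- (L + R) / 2
fit  <- coxph(Surv(Timp, rep(1, n)) ~ x)
beta <- coef(fit)
M    <- 10

for (iter in 1:50) {
  beta_m <- numeric(M)
  for (m in 1:M) {
    Timp <- impute_times(fit, L, R, x, Timp)
    beta_m[m] <- coef(coxph(Surv(Timp, rep(1, n)) ~ x))
  }
  fit <- coxph(Surv(Timp, rep(1, n)) ~ x)     # refit to update the baseline for the next round
  beta_new <- mean(beta_m)
  ## with Monte Carlo noise any strict tolerance is somewhat arbitrary (see the discussion below)
  if (abs(beta_new - beta) < 0.005) break
  beta <- beta_new
}
beta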

The authors consider data regarding mortality and the onset of dementia. They model post-dementia hazard as being proportional to pre-dementia hazard and additionally allow onset age to affect hazard (for the semi-Markov model).

A drawback of the approach, which was mentioned in the paper by Pan but not seemingly referred to by Yu et al, is that the support points for the baseline intensities will depend entirely on the initial values chosen, i.e. the final baseline intensities will have support at a subset of the original support points. For small datasets, the number of support points may decay down to 1. Pan suggested using a Link estimate of the baseline hazard (i.e. a linearly smoothed Breslow estimate) and imputing from this to avoid the problem.

A further issue is determining what exactly constitutes convergence of the parameter estimates. After all, at each iteration a random imputation of M datasets takes place. Hence, we can expect the new parameter estimates to have a certain degree of random variation compared to the previous iteration.

The authors should be commended for having provided code and a subset of the full data in the Supporting Material of the paper. However, there seems to be an error in the code which results in an overestimation of the cumulative disease onset hazard. The code obtains the Breslow estimate of the cumulative baseline hazard and then estimates the hazard between consecutive event times t_i and t_{i+1} as h(t) = (H(t_{i+1}) - H(t_i))/(t_{i+1} - t_i). But instead of then simulating new event times from a piecewise-exponential distribution, it ascribes h(t) to all event times (e.g. previously imputed onset times and death times) between t_i and t_{i+1}. The coding error also seems to result in considerable bias in the other estimates, as corrected code (assuming support for the hazard only at imputed onset times) gives quite different results.

The analysis seems to be incorrect in that it treats all patients as left-truncated (i.e. healthy) at age 50 (presumably time 0 in the data is age 50). No deaths occur before age 71.6 (or 21.6 on the time scale used in the data). Even prevalent cases enter the study conditional on survival until their entry age.

A comparison with existing methods can be made by computing the NPMLE under a Markov assumption. For this we use methods developed by Gauzere (2000; PhD Thesis) and Frydman and Szarek (2009; Biometrics). For data removing the prevalent cases, a corrected version of the imputation code gives estimates that are reasonably similar to the NPMLE (see figure), whereas the uncorrected version substantially overestimates the cumulative incidence of dementia.

[Figure: estimated cumulative incidence of dementia - NPMLE compared with the corrected and uncorrected imputation code]

The results of the analysis of the full data in the paper are close to those obtained by applying the supplied code to the subsample data, suggesting the same code was used for the full data. It is unclear whether the same code was used for the simulation study.

Finally, it should be noted that the method, while undoubtedly more straightforward to implement than either a semi-parametric maximum likelihood estimate or the penalized likelihood approach of Joly and Commenges, is nevertheless still quite slow. Moreover, as with the other methods, the complexity increases rapidly as the number of states in the model increases.

Parameterization of treatment effects for meta-analysis in multi-state Markov models

Malcolm Price, Nicky Welton and Tony Ades have a new paper in Statistics in Medicine. This considers how to parameterize Markov models of clinical trials so that meta-analysis can be performed. They consider data that are panel observed (at regular time intervals) but aggregated into the number of transitions of each type among all patients between two consecutive observations. As a result, it is necessary to assume a time-homogeneous Markov model, since there is no information with which to consider time inhomogeneity or patient inhomogeneity. These types of cost-effectiveness trials are usually analyzed in a discrete-time framework (a Markov transition model), with the assumption that only one transition is allowed between observations. Price et al advocate a continuous-time approach. The main advantage of this is that fewer parameters are required, e.g. rare jumps to distant states can be explained by the presence of multiple jumps rather than having to include the corresponding transition probability as an extra parameter in the model.
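The point about multiple jumps is easy to see numerically; a small sketch using the expm package with a made-up three-state intensity matrix in which a direct 1 -> 3 transition has zero intensity:

library(expm)

# made-up intensity matrix for a three-state model with no direct 1 -> 3 transition
Q <- rbind(c(-0.20,  0.20,  0.00),
           c( 0.05, -0.15,  0.10),
           c( 0.00,  0.00,  0.00))            # state 3 absorbing

# transition probability matrix over one observation cycle of length 1
P <- expm(Q * 1)
round(P, 3)
P[1, 3]    # positive: a 1 -> 3 move within a cycle arises from passing through state 2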

Price et al consider asthma trial data where they are interested in combining data from 5 distinct two-arm RCTs; there is some overlap between the treatments tested, so evidence networks comparing treatments can be constructed.

The main weakness of the approach is the reliance on the DIC. Different models for the set of non-zero transition intensities are compared using the DIC, with the models allowing different intensities for each trial treatment arm. While later in the paper there are benefits to adopting the Bayesian paradigm, here it doesn't seem useful. For these fixed effects models, and given the flat priors chosen, DIC should (asymptotically) be the same as AIC. However, AIC is also dubious here because of the issue of testing on the boundary of the parameter space, and is likely to favour simpler models too heavily in this context. Likelihood ratio tests based on mixtures of Chi-squared distributions (Self and Liang, 1987) could be applied, as in Gentleman et al 1994. However, judging from the example data given for treatments A and D, the real issue is lack of data, e.g. there are very few transitions to and from state X.

Thursday 30 September 2010

Flexible hazard ratio curves for continuous predictors in multi-state models: an application to breast cancer data

Cadarso-Suarez, Meira-Machado, Kneib and Gude have a long-awaited paper (it was accepted in October 2008) newly published in Statistical Modelling. This proposes the use of P-splines to allow smooth transition intensities and smooth functionals of covariates in a Cox-Markov or Cox-semi-Markov multi-state model.

The principal difficulty in fitting these models lies in obtaining appropriate values for the penalization parameters. However, since for right-censored data, both the Cox-Markov and Cox-semi-Markov models allow factorization of the likelihood into parts relating to the constituent transition intensities, methods from univariate survival analysis can be used. The R packages survival and BayesX are both used in the analysis.

Monday 27 September 2010

maxLik: A package for maximum likelihood estimation

Arne Henningsen and Ott Toomet have a new paper in Computational Statistics. The paper isn't directly related to multi-state modelling, but rather concerns their package, maxLik, for general maximum likelihood estimation. Primarily the package is a wrapper for existing optimization routines in R such as optim and nlm. However, they additionally provide an implementation of the BHHH (Berndt, Hall, Hall and Hausman) algorithm. This is a quasi-Newton algorithm in a similar vein to Fisher scoring, but rather than using the expected Fisher information, it uses the mean of the outer products of the scores of each observation. Like the BFGS algorithm, a line search is performed to find the step length at each iteration.
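A toy example of the BHHH interface in maxLik (a normal-sample likelihood, nothing to do with multi-state models); the key requirement is that the log-likelihood is supplied observation by observation, so that the outer products of the individual scores can be formed:

library(maxLik)
set.seed(1)
x <- rnorm(200, mean = 2, sd = 1.5)

# observation-wise log-likelihood contributions (required by method = "BHHH")
loglik <- function(theta) {
  mu    <- theta[1]
  sigma <- exp(theta[2])                      # log scale keeps sigma positive
  dnorm(x, mean = mu, sd = sigma, log = TRUE)
}

fit <- maxLik(loglik, start = c(mu = 0, logsigma = 0), method = "BHHH")
summary(fit)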

For panel observed Markov multi-state models, the Fisher scoring algorithm proposed by Kalbfleisch and Lawless (1985, JASA) and generalised by Gentleman et al (1994, Stat Med) is superior to BHHH. However, for data with a mixture of panel observations and exactly observed times of absorption (e.g. death), the Fisher scoring algorithm cannot be applied. Here BHHH seems to perform significantly better than the BFGS algorithm supplied with first derivatives.

Tuesday 17 August 2010

A novel semiparametric regression method for interval-censored data


[Figure: baseline survivor function estimated from pseudo-observations for simulated current status data; see the example described below]

Seungbong Han, Adin-Cristian Andrei and Kam-Wah Tsui have a paper currently available as a University of Wisconsin Biostatistics and Medical Informatics Department working paper. This essentially extends the concept of pseudo-observations, which up till now has only been applied to right-censored data, to interval-censored survival data. The idea remains the same except that S(t) is estimated using the NPMLE of the survival function (e.g. via the Turnbull self-consistency algorithm or the iterative convex minorant algorithm).

The paper is a little disappointing in only providing a very brief heuristic justification for using pseudo-observations in the interval-censoring case. For right-censored data, an estimate of the baseline survival function can be obtained as well as the regression parameters. No discussion of whether this is possible in the interval-censored case is given. However, the baseline estimates are likely to be highly unreliable (e.g. non-monotonic) because particular subjects may have extreme influence, since they affect where the mass points of the NPMLE occur. For example, the plot above is based on 1000 subjects with survival times generated from an exponential with rate 0.25, subject to independent current status observation (uniformly on (0,10)). Estimating the baseline survival from the pseudo-observations (calculated at times 1,2,3,...,10) leads to a survivor function which increases at time 5. There should be more consideration of this issue, as well as of the choice of how many time points to evaluate the pseudo-observations at.

The authors choose to transform the pseudo-observations before regressing on the covariates, rather than using a link function in a GLM. One problem with this approach is presumably that if the estimate of S(t) is 0 or 1, g(S(t))=-Inf or Inf.

On a practical level the authors use Icens to calculate the NPMLE. As noted previously the MLEcens package seems to perform considerably better than Icens and would presumably speed up computation of the pseudo-observations method.

A natural next step would be to consider pseudo-observations for interval censored multi-state data. The lack of non-parametric methods except in a few simple cases is an obvious bar to development in this direction.

Update: The paper has now been published in Communications in Statistics - Simulation and Computation.

Thursday 12 August 2010

Two Pitfalls in Survival Analyses of Time-Dependent Exposure: A Case Study in a Cohort of Oscar Nominees

Wolkewitz, Allignol, Schumacher and Beyersmann have a new paper in The American Statistician. This uses data on the survival outcomes of Oscar nominees as a nice illustrative example of the possibility of length bias and time-dependent bias in survival analysis problems. This was originally considered by Redelmeier and Singh in an Annals of Internal Medicine paper, where it was claimed that winning an Oscar significantly increased survival prospects. Possible pitfalls of such an analysis include assuming everyone is at risk of death from birth rather than from when they enter the study (e.g. at nomination), assuming people who win an Oscar experience the hazard relating to the win from the time of first nomination rather than from the time they actually won, and so forth. The authors present the correct model, as well as a series of possible incorrect models, in terms of multi-state models.

Tuesday 29 June 2010

Hidden Markov models with arbitrary state dwell-time distributions

Langrock and Zucchini have a new paper in Computational Statistics and Data Analysis. This develops methodology for fitting discrete-time hidden semi-Markov models. They demonstrate that hidden semi-Markov models can be approximated by hidden Markov models with aggregate blocks of states: the dwell time in a particular state is made up of a series of latent states. An individual enters a state in latent state 1. After k steps in the state, where k < m, a subject either leaves the state with probability c(k) or progresses to the next latent state with probability 1 - c(k); c(k) therefore represents the hazard rate of the dwell-time distribution. If latent state m is reached then the subject either stays in latent state m (with probability 1 - c(m)) or leaves the state, so the tail of the dwell-time distribution is constrained to be geometric. By choosing m sufficiently large, a good approximation to the desired distribution can be found.
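A sketch of the within-state structure just described, as a function that builds the m x m transition block among the latent phases of a single state (here c[k] denotes the probability of leaving the state after k steps):

# transition probabilities among the m latent phases of one state; row k has total
# mass 1 - c[k], the remaining c[k] being the probability of leaving the state
# (distributed over other states' entry phases when the full matrix is assembled)
dwell_block <- function(cvec) {
  m <- length(cvec)
  B <- matrix(0, m, m)
  for (k in seq_len(m - 1)) B[k, k + 1] <- 1 - cvec[k]   # progress to the next latent phase
  B[m, m] <- 1 - cvec[m]                                  # geometric tail in the final phase
  B
}

# e.g. an increasing dwell-time hazard approximated with m = 5 latent phases
dwell_block(c(0.05, 0.10, 0.20, 0.30, 0.40))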

A multistate model for events defined by prolonged observation

Vern Farewell and Li Su have a new paper in Biostatistics. This models remission in psoriatic arthritis, based on panel data assessing joint counts at clinic visits. Existing analyses use a two-state model. However, a spell in remission should last a discernible amount of time. Rather than specify an artificial length of time (e.g. 6 months) as a guarantee time in a state, Farewell and Su adopt a model with two states referring to remission, which they call "early stage remission" and "established remission". It is assumed an individual must progress through both early and established remission before returning to active disease. A subject is observed to be in early stage remission if they have no active joints at a visit that is not preceded by at least 2 other zero-count visits; they are observed to be in established remission if there is a zero count and at least 2 previous zero counts. State misclassification is allowed in the model through misclassification of the active joint count, i.e. patients may have 0 active joints without being in remission. It is assumed that misclassification to early stage remission is possible but not to established remission. In the example, the misclassification probability is also allowed to depend on whether the previously observed count was zero or not.

The basic problem with the method is that having the states defined by the pattern of previous observations means that it is essentially impossible for the observed data (in terms of the three-state model) to come from the claimed Markov model: In the observed data, an established remission stage must be preceded by two early stage remission observations. Yet the actual Markov model allows the passage time from active disease to established remission to be arbitrarily close to zero (e.g. just the sum of two independent exponential - or perhaps piecewise exponential - distributions). As a result it is not clear how to interpret the resulting transition intensity estimates since the estimated process will not reproduce the original data.

Misclassification is effectively dealt with twice in the model: firstly in an ad hoc way, through the rules defining early and established remission, and then by allowing these observed states to have classification error relative to the true states. But in fitting a hidden Markov model the misclassification is assumed independent conditional on the underlying state. There is then an inherent contradiction, because on the one hand the model says P(observed zero | active disease) > 0, but at the same time P(observed zero and two previous zeros | active disease) = 0.

An approach using a guarantee time (e.g. Kang and Lagakos (2007)) or perhaps an Erlang distribution through latent states would be far more satisfactory even if it might require "special software". Potentially the guarantee time could be dependent on covariates.

Thursday 17 June 2010

Estimating summary functionals in multistate models with an application to hospital infection data

Arthur Allignol, Martin Schumacher and Jan Beyersmann have a new paper in Computational Statistics. This considers methods for obtaining outcome measures based on estimates of the cumulative transition intensities in nonhomogeneous Markov models. Specifically they consider hospital length of stay data and consider estimating the expected excess time spent in hospital given an infection has occurred by time s, compared to if no infection has occurred by time s. For nonhomogeneous models this is clearly a function of the time s. They consider possible weighting methods to get a reasonable summary measure.

The plausibility of a Markov model is tested informally by including time since infection as a time dependent covariate in a Cox PH model. Some discussion of the extension of the methods to the non-Markov case is given.

Monday 14 June 2010

Estimation from aggregate data

Gouno, Coutrai and Fredette have a new paper in Computational Statistics and Data Analysis. This proposes a new approach to estimation of transition rates in multi-state models panel observed in the form of aggregate prevalence counts. Kalbfleisch and Lawless dealt with this problem in the 1980s under a time-homogeneous Markov assumption. The full likelihood is very difficult to compute (for large N), but a least-squares approach is quite effective.

Gouno et al's approach is somewhat different. Essentially they make a discrete-time approximation to a continuous-time process by assuming that only one transition may occur in a time unit and, moreover, that transitions between observation times occur at the midpoint of the interval. The first assumption is convenient because it means the transition counts between states can be established from the prevalence counts. The second assumption allows the sojourn time distributions to be characterised as multinomial random variables. "Complete" data, in the form of the number of sojourns of different lengths, can be obtained via an expectation or Monte Carlo expectation step in an EM or MCEM algorithm; in practice an MCEM algorithm is required.

The authors suggest that the estimated sojourn distributions in each state could be compared using a log-rank test. A log-rank test would obviously be inappropriate; a likelihood ratio test between the full model and a model where the two sojourn distributions are constrained to be equal might be more suitable. However, an issue not addressed is identifiability: regardless of the overall number of units from which the aggregate data are taken, the actual degrees of freedom of the data are fixed. The model fitted to the example data appears to be saturated.

There is a lack of explanation in places in the paper. For instance, it is unclear why several of the estimated survival functions for the example dataset start from values other than 1. It is also unclear why the estimated survival in state 1 drops to 0 at 28 time units, when 10 individuals in the original data survive to t=31.

An application of hidden Markov models to the French variant Creutzfeldt-Jakob disease epidemic

Chadeau-Hyam et al have a new paper in Applied Statistics (JRSS C). This is concerned with modelling vCJD in France. A 5-state multi-state model is assumed, with states representing susceptible to infection, asymptomatic infection, clinical vCJD, death from vCJD and death from causes other than vCJD. The data available are extremely sparse since no reliable test is available to distinguish the susceptible from the asymptomatically infected. Indeed the only data actually observed are the yearly transitions from infected to clinical vCJD and from clinical vCJD to death. As a result, pseudo-observed quantities, estimated in previous studies or from general population data, are used to obtain quantities such as the numbers susceptible. Various approximations in terms of the number and type of transitions possible for an individual in one year are also made. Some simulations are performed which suggest the results are reasonably robust to these approximations.

The most interesting methodological aspect of the paper is the use of an (approximately) Erlang distribution for the incubation time (rather than an Exponential). This is achieved by assuming that the incubation state is made up of 11 latent phases.

Regression analysis of censored data using pseudo-observations

Erik Parner and Per Kragh Andersen have a new paper available as a research report at the Department of Biostatistics, University of Copenhagen. This develops STATA routines for implementing the pseudo-observations method of performing direct regression modelling of complicated outcome measures (such as cumulative incidence functions or overall survival times) for multi-state models subject to right censoring. The paper is essentially the STATA equivalent of the 2008 paper by Klein et al which developed similar routines for SAS and R.

[Update: The paper is now published in The STATA Journal]

Tuesday 1 June 2010

Progressive multi-state models for informatively incomplete longitudinal data

Chen, Yi and Cook have a new paper in the Journal of Statistical Planning and Inference. This covers similar ground to the paper by the same authors in Statistics in Medicine, being concerned with missing not at random (MNAR) data. This article deals with the discrete time case, whereas the other paper modelled the process in continuous time.

Thursday 27 May 2010

Incorporating frailty in a multi-state model: application to disease natural history modelling of adenoma-carcinoma in the large bowel

Yen, Chen, Duffy and Chen have a new paper in SMMR. This considers methods for incorporating frailty into a progressive multi-state model under interval censoring (the data resemble current status data being a case-cohort design with known sampling probabilities). They note that the accommodation of frailty is made easier by considering a model where only one transition intensity is subject to heterogeneity. The main deficiency of the paper is the lack of reference to the fairly wide existing literature on frailty and random effects models for panel observed multi-state models. In particular, the tracking model of Satten is directly applicable to their data. In the tracking model, a common frailty affects all transition intensities and results in a likelihood that doesn't require numerical integration. Some comment on the extendability of their model to the case of multiple observations per patient would also have been useful.

Wednesday 12 May 2010

Estimation of overall survival in an "illness-death" model with application to vertical transmission of HIV-1

Halina Frydman and Michael Szarek have a new paper in Statistics in Medicine. This presents an additional application of their previous work on non-parametric estimation in a Markov illness-death model under interval censoring. Specifically, they use estimates of survival pre- and post-disease to provide insights into the effect of disease on overall survival. Sub-sampling, rather than a simple bootstrap, is used to obtain estimates of standard errors.

Wednesday 5 May 2010

Analysis of interval-censored disease progression data via multi-state models under a nonignorable inspection process

Chen, Yi and Cook have a new paper in Statistics in Medicine. This covers similar ground to the paper of Sweeting et al as it is also concerned with a nonignorable inspection process for a panel observed progressive Markov multi-state model. Again there are a set of known planned observation times, but unlike Sweeting et al there is no auxiliary data. A MNAR model where the missingness probability depends on the true state at the potential observation time is used. This is similar to the MNAR model in Sweeting et al, except that here piecewise constant transition intensities are used rather than a time homogeneous model. The model is fitted using an EM algorithm.

Both this study and Sweeting et al's depend on there being a known set of planned examination times. In the majority of cases, only the actual examination times are known and the observation process is not fully understood. As Chen et al note, informative observation times could be dealt with by determining a stochastic model for the observation process, but would require assumptions about the process.

Multistate Markov models for disease progression in the presence of informative examination times

Sweeting, Farewell and De Angelis have a new paper in Statistics in Medicine. This deals with the problem of a panel observed disease process where the examination times are generated by a non-ignorable mechanism. This is in general a very difficult problem. The authors consider a special case where, while the disease process is only observed at informative examination times, an auxiliary variable can be observed at a full set of planned (or ignorable) times. They consider a model where the disease process is missing at random (MAR) conditional on the values of the auxiliary variable. The likelihood then takes the form of a (partially) hidden Markov model. An alternative missing not at random (MNAR) model, with missingness dependent on the current state, is also considered. However, the comparison between the MAR and MNAR models is somewhat unfair. In the simulations and the application, the disease process is only observed at a minority of the planned examination times. The auxiliary variable has a fairly strong correlation with the disease process, meaning significant information about the process can be obtained by using the auxiliary variable in a hidden Markov model. Indeed, even without any suspicion of informative missingness, the HMM would be worth using for improved efficiency (provided the strong assumptions about the relationship with the auxiliary variable could be relied upon). However, the MNAR model does not use the auxiliary variable at all. A better comparison would be with a hybrid of the two, where a partially hidden Markov model is used in conjunction with the logistic model for the MNAR mechanism (although this wouldn't necessarily be consistent under AD-MAR unless the logistic model were correctly specified). This does not seem like a difficult extension.

A further shortcoming of the paper is the lack of any suggestions for model checking. Measurements from the auxiliary variable are assumed to be Normally distributed and independent conditional on the underlying disease states at the examination times. This seems fairly unrealistic and should at least be supported by the data.

Tuesday 20 April 2010

Comparison of state occupation, entry, exit and waiting times in two or more groups based on current status data in a multistate model

Lan and Datta have a paper in Statistics in Medicine. This complements their previous work on non-parametric estimation for current status data from multi-state models. Specifically, they develop a K-sample test to compare state occupation probabilities, state entry time distributions or state waiting time distributions between groups. The test statistic is based on the integrated L1 distance between the quantities of interest, and a p-value is found by bootstrap resampling. This is computationally feasible because of the relative simplicity of the estimators, which require only the pool-adjacent-violators algorithm and kernel smoothing. While no Markov assumption is required for the state occupation probabilities, the entry and waiting time distributions do require a Markov assumption.

The method is demonstrated using a 5-state progressive model on data relating to pubertal development in children. They confirm that there is a significant difference in pubertal development between girls and boys, with boys reaching stages of development later.

Friday 9 April 2010

Cause-specific cumulative incidence estimation and the Fine and Gray model under both left truncation and right censoring

Ronald Geskus has a new paper in Biometrics. This extends the Fine and Gray model, for regression on the subdistribution hazard in competing risks models, to the case of left-truncated (and right-censored) data. Essentially it is shown that the standard estimator of the cumulative incidence function (CIF) can be derived as an inverse-probability-weighted estimator. Thus the Fine-Gray model for left-truncated and right-censored data can be obtained as a weighted Cox model where individuals who are censored or experience a competing risk at time s have weights at time t > s given by G(t)H(t)/{G(s)H(s)}, where G(t) is the empirical CDF of the censoring distribution and H(t) is the empirical survivor distribution of the truncation distribution. This form appears to imply that independence is assumed between the censoring and truncation distributions. However, the equivalence of estimates of the CIF in the no-covariates case holds regardless of the relationship between censoring and truncation - only independence from the failure times is required.

Experimenting with the Coxian Phase-Type distribution to uncover suitable fits

Marshall and Zenga have a new paper in Methodology and Computing in Applied Probability. This considers the best approach to fitting Coxian phase-type distributions to fully observed survival or duration data. Several previous studies have shown that the EM algorithm, Nelder-Mead and quasi-Newton approaches give estimates that are highly dependent on initial values. In a series of simulations it is shown that the quasi-Newton algorithm is preferable to Nelder-Mead in converging to values closer to the true parameters. However, in the absence of a criterion for global convergence it still seems wise to consider several starting values to try to ensure a global rather than local optimum is obtained.

Thursday 18 March 2010

Multi-state Markov models in cancer screening evaluation: a brief review and case study

Uhry et al have a new paper in Statistical Methods in Medical Research. This looks at the use of multi-state Markov models for evaluating cancer screening regimes. Time-homogeneous three-state and five-state models are used to describe progression through asymptomatic (but detectable by screening) stages of cancer. A difficulty with this type of modelling is that screening sensitivity either has to be assumed to be 100% or else has to be accounted for in the likelihood. Moreover, the progression rates from asymptomatic to symptomatic disease can only be observed indirectly (i.e. via interval cancers), meaning estimates have substantial uncertainty. Estimates of mean sojourn times will be quite highly dependent on the time-homogeneous Markov assumption made. The authors note that there is evidence the 5-state model is unrealistic, as sojourn times in the preclinical states (node negative and node positive) are likely to be correlated, but generalising to non-homogeneous and/or non-Markov models would be difficult given the lack of data.

Nonparametric estimator for the survival function of quality adjusted lifetime (QAL) in a three-state illness–death model

Biswabrata and Dewanji have a new paper in the Journal of the Korean Statistical Society. This considers estimation of quality adjusted lifetime (QAL) in illness-death models under right censoring. In contrast to their previous work, estimation is non-parametric rather than parametric. A semi-Markov assumption, where sojourn times in each state are assumed independent, is taken.

Monday 22 February 2010

Non-Markov Multistate Modeling Using Time-Varying Covariates

Bacchetti et al have a new paper in The International Journal of Biostatistics. This considers modelling progression of liver fibrosis due to hepatitis C following liver transplant using a 5-state progressive multi-state model. The data are panel observed at irregular time points, so it would be most natural to model the data in continuous time. In addition, the observed states are subject to classification error. To avoid having to make either Markov or time homogeneity assumptions, the authors adopt a discrete-time approach, assuming 4 time periods per year. They then model the transition probabilities as linear on the log-odds scale, depending on covariates such as medical center, donor age and year of transplant, as well as log time since entry into the current state (relaxing the Markov assumption). Potentially, time since transplant could also be included in the model (allowing for non-homogeneity).

The main challenge in fitting the models is to enumerate all possible complete "paths" of true states at the discrete time points that could result in the observed data. Obviously an iterative algorithm is needed for this. The computational complexity will depend on the complexity of the state transition matrix, the misclassification probability matrix and the degree of discretization.

The main drawback of approximating a continuous time process by a discrete-time process is the restriction that only 1 transition may occur between time points. While this can be acceptable for progressive models such as the one considered here, it may be more problematic when backward transitions are allowable.

The authors have developed an R package called mspath, designed to complement msm, the existing package for continuous-time Markov and hidden Markov models, by allowing non-Markov models through the discrete-time approximation.

Monday 8 February 2010

A Note on Variance Estimation of the Aalen-Johansen Estimator of the Cumulative Incidence Function in Competing Risks

Allignol, Schumacher and Beyersmann have a new paper in Biometrical Journal. This considers variance estimation for the cumulative incidence functions in a cause-specific hazards competing risks analysis. They compare a Greenwood-type estimator of variance with the counting process derived estimator for left-truncated and right-censored data. They find that the Greenwood-type estimator is generally preferable for finite sample sizes.

Friday 5 February 2010

Modelling the likely effect of the increase of the upper age limit from 70 to 73 for breast screening in the UK National Programme

Duffy, Sasieni, Olsen and Cafferty have a new paper in Statistical Methods in Medical Research. This considers the possible effect of increasing the upper age of breast cancer screening from 70 years to 73 years. Two approaches are taken. Firstly, a 4-state time homogeneous Markov model is considered, with the states representing healthy, asymptomatic breast cancer (detectable by screening), symptomatic breast cancer and death. Secondly, a discrete-time model is used in which the incidence of breast cancer and mortality from other causes are assumed to be uniform within each year of age; other rates are still considered to be time homogeneous. The benefit in life years gained up to 88 years of age is considered, with the intensities estimated from external data. Both approaches give similar results, with about 1 life-year gained per 1000 women screened.

Wednesday 3 February 2010

Estimating Transition Probabilities for Ignorable Intermittent Missing Data in a Discrete-Time Markov Chain

Yeh, Chan, Symanski and Davis have a new paper in Communications in Statistics - Simulation and Computation. This considers data from a discrete-time Markov model where some observations are missing in an ignorable way. This is a very common situation in longitudinal data: if the transition matrix is P for a single time period, then it is just P^2, P^3, etc. for 2, 3, ... time periods. For some reason the authors think that the lack of a closed form expression for the MLE is a problem. They consider a naive approach based on only the observed one-step transitions, an EM algorithm approach and what they term a non-linear equations method. This latter approach essentially computes the maximum likelihood estimate. However, the authors have failed to consider the possibility of boundary maxima. They have sought points where the gradient is 0 with respect to transition probabilities lying between 0 and 1, and run into problems in some cases simply because the MLE is at p=0, meaning there is no solution to the gradient equations within the allowable limits, so negative values are suggested.
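The point about powers of P is easily made concrete (a small made-up example): with an ignorably missing intermediate observation, an observed pair of states two periods apart contributes the corresponding entry of P^2 to the likelihood, the missing state having been summed out.

P <- rbind(c(0.8, 0.2, 0.0),
           c(0.1, 0.7, 0.2),
           c(0.0, 0.0, 1.0))     # state 3 absorbing

P2 <- P %*% P                    # two-period transition probabilities
P2[1, 3]                         # positive even though P[1, 3] = 0: the missing
                                 # intermediate state has simply been summed out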

A further problem with the paper is that the standard errors under missing data are incorrectly estimated in the EM algorithm, by plugging estimates of the marginal counts into the complete-data formula. This is equivalent to using the complete-data likelihood information rather than the observed-data information; hence the standard errors will be underestimated.

In topic, the paper has some similarities to Deltour et al, Biometrics (1999). However, that paper considered the non-trivial problem of non-ignorable missing data, for which a Stochastic EM algorithm was proposed.

Wednesday 27 January 2010

A nonstationary Markov transition model for computing the relative risk of dementia before death

Yu et al have a new paper in Statistics in Medicine. This is concerned with estimating the risk of dementia before death. A 5-state multi-state model is used, with three transient states representing levels of cognitive impairment, plus two absorbing states, dementia and death. Unlike other recent dementia studies using multi-state models, improvements, as well as deterioration, in cognitive ability are assumed possible. A discrete-time approach is taken. This has some advantages in terms of modelling flexibility, for instance in incorporating non-homogeneity over time and a frailty term (relaxing the Markov assumption). A drawback of using a discrete-time approach in the current study is that the study did not have equally spaced observation times, but it was necessary to assume this was the case for the discrete-time model. This is likely to cause some bias.

The main theoretical development in the paper is expressions for the mean and variance of the time spent in each transient state before absorption in the non-homogeneous case.

Tuesday 26 January 2010

Vertical modeling: A pattern mixture approach for competing risks modeling

Nicolaie, van Houwelingen and Putter have a new paper in Statistics in Medicine. This presents a new approach to modeling competing risks data. In essence this involves splitting the model into two parts: firstly, a model for all-cause survival, e.g. P(T >= t); secondly, a model for P(D | T = t), the probability of a particular type of failure given a failure at time t, which they term the relative hazard. The cause-specific hazards and cumulative incidence functions can be retrieved under this formulation. A multinomial-type model is applied to the relative hazards, with time dependency modelled using either piecewise constant functions or cubic splines. All-cause survival can be modelled through any standard survival model, e.g. a proportional hazards model. Vertical modeling provides a third approach, particularly useful in cases where a proportional hazards assumption is not appropriate for either the cause-specific hazards (the classical approach) or the sub-distribution hazard (the Fine-Gray model).

Whilst it would be relatively straightforward to implement a vertical model using existing R packages, there are plans to include vertical modeling within the mstate package.
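For instance, something along the following lines seems to capture the two building blocks using survival and nnet (a rough sketch on simulated data, not the paper's implementation):

library(survival); library(nnet); library(splines)
set.seed(1)

n <- 500
t1 <- rexp(n, 0.10)                          # cause 1
t2 <- rexp(n, 0.05)                          # cause 2
cens <- runif(n, 0, 15)
time   <- pmin(t1, t2, cens)
status <- ifelse(cens <= pmin(t1, t2), 0, ifelse(t1 <= t2, 1, 2))
dat <- data.frame(time, status)

## building block 1: all-cause survival (here a Kaplan-Meier; could be a Cox model)
km <- survfit(Surv(time, status > 0) ~ 1, data = dat)

## building block 2: the relative hazard P(D = k | T = t), modelled on the observed
## failures with a smooth effect of time
rel <- multinom(factor(status) ~ ns(time, df = 3),
                data = subset(dat, status > 0), trace = FALSE)

## combine: cumulative incidence of cause 1 = all-cause event mass times pi_1(t)
jump <- -diff(c(1, km$surv))                 # all-cause probability mass at each time
p2   <- predict(rel, newdata = data.frame(time = km$time), type = "probs")
pi1  <- 1 - p2                               # with only two causes, multinom returns P(D = 2)
F1   <- cumsum(jump * pi1)
tail(F1, 1)                                  # estimated overall probability of a cause 1 event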

Tuesday 12 January 2010

Estimating disease progression using panel data

Micha Mandel has a new paper in Biostatistics. This considers estimation from panel observed data on MS progression. The underlying process can have backward transitions, so reaching a higher state is not in itself considered evidence of progression. Instead, the patient is required to have stayed in the higher state for some period of time (e.g. 6 months). The quantity of interest in the study is therefore the time taken to first have stayed in state 3 for 6 months. For a continuous-time, time-homogeneous Markov process, expressions for the mean and the distribution function of this time are obtained.

As noted by Mandel, the estimates obtained are strongly dependent on the Markov and time homogeneity assumptions. This is demonstrated in a simulation study. Mandel notes that more methods for semi-Markov models are required; the recent paper on phase-type semi-Markov models may be of use. Indeed, in the simulations Mandel actually uses phase-type distributions to create a semi-Markov process. However, the MS dataset may be too small to reliably estimate a semi-Markov model.

Informative observation times are a potential problem in the MS study. Mandel shows that observations occurring away from the scheduled 26-week gap time are more likely to involve a transition, suggesting these observation times are informative. As a result, only observations within a 4-week window of the scheduled visit time are included in the analysis.

A comparison with a discrete-time Markov model showed that assuming a discrete-time process gave larger estimates of the hitting times.