Tuesday, 29 June 2010
Hidden Markov models with arbitrary state dwell-time distributions
Langrock and Zucchini have a new paper in Computational Statistics and Data Analysis. This develops methodology for fitting discrete-time hidden semi-Markov models. They demonstrate that hidden semi-Markov models can be approximated by hidden Markov models with aggregate blocks of states. The dwell time in a particular state is made up of a series of latent states. An individual enters a state in latent state 1. After k steps in a state, where k < m, a subject either leaves the state with probability c(k) or progresses to the next latent state with probability 1-c(k). c(k) therefore represents the hazard rate of the dwell-time distribution. If time m in the state is reached then the subject may either leave the state with probability c(m) or else stay in latent state m with probability 1-c(m). Hence the tail of the dwell-time distribution is constrained to be geometric. By choosing m sufficiently large a good approximation to the desired distribution can be found.
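As a rough illustration of the construction described above, the following Python sketch derives the hazards c(k) from a target dwell-time distribution and evaluates the dwell-time distribution implied by m latent states. It assumes that c(m) is the probability of leaving from the last latent state (so the state is retained with probability 1-c(m)); the function names and the shifted Poisson target are hypothetical choices for illustration, not taken from the paper.

```python
import math

def hazards_from_pmf(pmf, m):
    """Return c(1..m), where c(k) = P(D = k | D >= k) and pmf[k-1] = P(D = k)."""
    hazards, surv = [], 1.0  # surv tracks P(D >= k)
    for k in range(1, m + 1):
        hazards.append(pmf[k - 1] / surv)
        surv -= pmf[k - 1]
    return hazards

def dwell_pmf(hazards, r):
    """P(D = r) implied by the latent-state block; geometric for r > m."""
    m = len(hazards)
    p = 1.0
    for k in range(1, min(r, m)):      # survive steps 1 .. min(r, m) - 1
        p *= 1.0 - hazards[k - 1]
    if r <= m:
        return p * hazards[r - 1]
    # Beyond m the subject sits in latent state m, leaving with hazard c(m).
    return p * (1.0 - hazards[m - 1]) ** (r - m) * hazards[m - 1]

def state_block(hazards):
    """Within-state block of the transition matrix (exits are not included)."""
    m = len(hazards)
    block = [[0.0] * m for _ in range(m)]
    for k in range(m - 1):
        block[k][k + 1] = 1.0 - hazards[k]   # progress to next latent state
    block[m - 1][m - 1] = 1.0 - hazards[m - 1]  # stay in last latent state
    return block

# Hypothetical target: shifted Poisson dwell time on 1, 2, ...
lam, m = 3.0, 10
target = [math.exp(-lam) * lam ** (r - 1) / math.factorial(r - 1)
          for r in range(1, 41)]
c = hazards_from_pmf(target, m)
approx = [dwell_pmf(c, r) for r in range(1, 41)]
block = state_block(c)
```

Under these assumptions the approximation matches the target exactly for dwell times up to m, and only the tail beyond m is forced to be geometric.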
A multistate model for events defined by prolonged observation
Vern Farewell and Li Su have a new paper in Biostatistics. This models remission in psoriatic arthritis, based on panel data assessing joints at clinic visits. Existing models use a two-state model. However, a spell in remission should last a discernible amount of time. Rather than specify an artificial length of time (e.g. 6 months), i.e. a guarantee time in a state, Farewell and Su adopt a model with two states referring to remission: they refer to these as "early stage remission" and "established remission". It is assumed an individual must progress through both early and established remission before returning to active disease. A subject is observed to be in early stage remission if they have no active joints at a visit not preceded by at least 2 other zero count visits. They are in established remission if there is a zero count and at least 2 previous zero counts. State misclassification is allowed in the model through misclassification of the active disease count, i.e. patients may have 0 active joints without being in remission. It is assumed that misclassification to the early stage remission state is possible but not to the established remission state. In the example the misclassification probability is also allowed to depend on whether the previous observed count was zero or not.
The basic problem with the method is that having the states defined by the pattern of previous observations means that it is essentially impossible for the observed data (in terms of the three-state model) to come from the claimed Markov model: In the observed data, an established remission stage must be preceded by two early stage remission observations. Yet the actual Markov model allows the passage time from active disease to established remission to be arbitrarily close to zero (e.g. just the sum of two independent exponential - or perhaps piecewise exponential - distributions). As a result it is not clear how to interpret the resulting transition intensity estimates since the estimated process will not reproduce the original data.
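To make the point about the passage time concrete, here is a small Python sketch of the distribution function of the sum of two independent exponential sojourns (active disease to early remission, then early to established remission). The rates are hypothetical values chosen only for illustration; the point is that the probability of completing both transitions within any short time t is strictly positive, whereas the observed-data definition requires at least two earlier zero-count visits.

```python
import math

def passage_cdf(t, rate1, rate2):
    """P(T1 + T2 <= t) for independent exponentials with distinct rates."""
    if rate1 == rate2:
        raise ValueError("this closed form assumes distinct rates")
    return 1.0 - (rate2 * math.exp(-rate1 * t)
                  - rate1 * math.exp(-rate2 * t)) / (rate2 - rate1)

# Hypothetical transition intensities (per year).
rate_active_to_early = 0.5
rate_early_to_established = 1.0

# Probability of reaching established remission within about five weeks.
p_short = passage_cdf(0.1, rate_active_to_early, rate_early_to_established)
```

For small t this probability behaves like rate1 * rate2 * t^2 / 2, so it is small but never zero.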
Misclassification is effectively dealt with twice in the model: firstly in an ad hoc way, through the rules on what constitutes early and established remission, and then by allowing these observed states to have classification error over some true states. But in fitting a hidden Markov model the misclassification is assumed independent conditional on the underlying state. There is then an inherent contradiction because on the one hand the model says P(Observed zero | Active disease) > 0, but at the same time P(Observed zero and two previous zeros | Active disease) = 0.
An approach using a guarantee time (e.g. Kang and Lagakos (2007)) or perhaps an Erlang distribution through latent states would be far more satisfactory even if it might require "special software". Potentially the guarantee time could be dependent on covariates.
Thursday, 17 June 2010
Estimating summary functionals in multistate models with an application to hospital infection data
Arthur Allignol, Martin Schumacher and Jan Beyersmann have a new paper in Computational Statistics. This considers methods for obtaining outcome measures based on estimates of the cumulative transition intensities in nonhomogeneous Markov models. Specifically, they use hospital length of stay data to estimate the expected excess time spent in hospital given that an infection has occurred by time s, compared with the case where no infection has occurred by time s. For nonhomogeneous models this is clearly a function of the time s. They consider possible weighting methods to get a reasonable summary measure.
The plausibility of a Markov model is tested informally by including time since infection as a time dependent covariate in a Cox PH model. Some discussion of the extension of the methods to the non-Markov case is given.
Monday, 14 June 2010
Estimation from aggregate data
Gouno, Coutrai and Fredette have a new paper in Computational Statistics and Data Analysis. This proposes a new approach to estimating transition rates in multi-state models when the panel-observed data take the form of aggregate prevalence counts. Kalbfleisch and Lawless dealt with this problem in the 1980s under a time-homogeneous Markov assumption. The full likelihood is very difficult to compute (for large N), but a least-squares approach is quite effective.
Gouno et al's approach is somewhat different. Essentially they make a discrete-time approximation to a continuous time process by assuming that only 1 transition may occur in a time unit and moreover, that transitions between observation times occur at the midpoint of the interval. The first assumption is convenient because it means the transition counts between states can be established from the prevalence counts. The second assumption allows the sojourn time distributions to be characterised as multinomial random variables. "Complete" data of the form of the number of sojourns of different lengths can be obtained via an Expectation or Monte-Carlo Expectation step in an EM or MCEM algorithm. An MCEM algorithm is required
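To see why the first assumption makes the transition counts recoverable from prevalence counts, consider the following Python sketch. It additionally assumes a progressive chain of states 1 -> 2 -> ... -> K with no backward moves, which is an assumption made here for illustration and is not stated in the post; the function name is hypothetical.

```python
def transition_counts(prev_now, prev_next):
    """Flows j -> j+1 between two consecutive times in a progressive chain.

    With at most one transition per individual per time unit, the number
    moving out of state j equals the cumulative drop in prevalence over
    states 1..j.
    """
    if sum(prev_now) != sum(prev_next):
        raise ValueError("total number of individuals must be constant")
    flows, cumulative_drop = [], 0
    for j in range(len(prev_now) - 1):
        cumulative_drop += prev_now[j] - prev_next[j]
        if cumulative_drop < 0 or cumulative_drop > prev_now[j]:
            raise ValueError("counts inconsistent with the assumptions")
        flows.append(cumulative_drop)
    return flows

# Hypothetical prevalence counts in states 1, 2, 3 at two consecutive times.
flows = transition_counts([10, 5, 0], [7, 6, 2])
```

In this made-up example three individuals move from state 1 to state 2 and two move from state 2 to state 3, which reproduces the second set of prevalence counts.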
The authors suggest that estimated sojourn distributions in each state could be compared using a log-rank test. A log-rank test would obviously be inappropriate. A likelihood ratio test between the full model and a model where the two sojourn distributions were constrained to be equal might be appropriate. However, an issue not addressed is identifiability. Regardless of the overall number of units from which the aggregate data are taken, the actual degrees of freedom of the data are fixed. The model fitted to example data appears to be saturated.
There is a lack of explanation in places in the paper. For instance, it is unclear why several of the estimated survival functions for the example dataset start from values other than 1. Also it is unclear why the estimated survival in state 1 drops to 0 at 28 time units, when 10 individuals in the original data survive to t=31.
An application of hidden Markov models to French variant Creutzfeldt-Jakob disease epidemic
Chadeau-Hyam et al have a new paper in Applied Statistics (JRSS C). This is concerned with modelling vCJD in France. A 5 state multi-state model is assumed, with states representing susceptible to infection, asymptomatic infection, clinical vCJD, death from vCJD and death from causes other than vCJD. The data available are extremely sparse since no reliable test is available to distinguish susceptible from asymptomatic. Indeed the only data actually observed are the yearly transitions from infected to clinical vCJD and clinical vCJD to death. As a result, pseudo-observed quantities, estimated in previous studies or from general population data are used to get quantities such as the numbers susceptible. Various approximations in terms of the number and type of transitions possible by an individual in one year are also made. Some simulations are performed which suggest the results are reasonably robust to these approximations.
The most interesting methodological aspect of the paper is the use of an (approximately) Erlang distribution for the incubation time (rather than an Exponential). This is achieved by assuming that the incubation state is made up of 11 latent phases.
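The effect of splitting a state into latent phases can be sketched in Python as follows. If each of k phases is left at the same constant rate, the total time in the state is Erlang with k phases; the mean incubation time used below is a hypothetical value, and the sketch ignores whatever makes the paper's distribution only approximately Erlang.

```python
import math

def erlang_survival(t, phases, mean):
    """P(T > t) when T is the sum of `phases` iid exponential phase times."""
    rate = phases / mean            # per-phase rate giving the requested mean
    x = rate * t
    return math.exp(-x) * sum(x ** n / math.factorial(n) for n in range(phases))

def erlang_cv(phases):
    """Coefficient of variation of an Erlang distribution."""
    return 1.0 / math.sqrt(phases)

mean_incubation = 10.0              # hypothetical mean, in years
s_erlang = erlang_survival(2.5, 11, mean_incubation)
s_expon = erlang_survival(2.5, 1, mean_incubation)   # one phase = exponential
```

With 11 phases very short incubation times become far less likely than under an exponential distribution with the same mean, and the coefficient of variation falls from 1 to 1/sqrt(11).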
Regression analysis of censored data using pseudo-observations
Erik Parner and Per Kragh Andersen have a new paper available as a research report at the Department of Biostatistics, University of Copenhagen. This develops STATA routines for implementing the pseudo-observations method of performing direct regression modelling of complicated outcome measures (such as cumulative incidence functions or overall survival times) for multi-state models subject to right censoring. The paper is essentially the STATA equivalent of the 2008 paper by Klein et al which developed similar routines for SAS and R.
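The idea behind pseudo-observations can be sketched in a few lines of Python (rather than Stata, SAS or R). The sketch uses the Kaplan-Meier estimate of the survival probability at a fixed time t as the summary measure, which is a simplifying choice made here; the routines discussed above handle more complicated functionals. The i-th pseudo-observation is n times the full-sample estimate minus (n-1) times the estimate with subject i left out.

```python
def km_survival(times, events, t):
    """Kaplan-Meier estimate of P(T > t); events[i] is 1 for an observed event."""
    surv = 1.0
    event_times = sorted({u for u, e in zip(times, events) if e and u <= t})
    for u in event_times:
        at_risk = sum(1 for v in times if v >= u)
        deaths = sum(1 for v, e in zip(times, events) if v == u and e)
        surv *= 1.0 - deaths / at_risk
    return surv

def pseudo_observations(times, events, t):
    """Jackknife pseudo-observations for the survival probability at time t."""
    n = len(times)
    full = km_survival(times, events, t)
    pseudo = []
    for i in range(n):
        loo_times = times[:i] + times[i + 1:]
        loo_events = events[:i] + events[i + 1:]
        pseudo.append(n * full - (n - 1) * km_survival(loo_times, loo_events, t))
    return pseudo

# Hypothetical data: with no censoring the pseudo-observations reduce to the
# indicators I(T_i > t); with censoring they are defined for every subject.
uncensored = pseudo_observations([1, 2, 3, 4, 5], [1, 1, 1, 1, 1], 2.5)
censored = pseudo_observations([1, 2, 3, 4, 5], [1, 0, 1, 0, 1], 3.5)
```

The pseudo-observations can then be used as the response in a regression model (for example via generalised estimating equations), which is what makes direct modelling of the summary measure possible.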
[Update: The paper is now published in The STATA Journal]
Tuesday, 1 June 2010
Progressive multi-state models for informatively incomplete longitudinal data
Chen, Yi and Cook have a new paper in the Journal of Statistical Planning and Inference. This covers similar ground to the paper by the same authors in Statistics in Medicine, being concerned with missing not at random (MNAR) data. This article deals with the discrete time case, whereas the other paper modelled the process in continuous time.