Monday, 14 June 2010

Estimation from aggregate data

Gouno, Coutrai and Fredette have a new paper in Computational Statistics and Data Analysis. It proposes a new approach to estimating transition rates in multi-state models from panel data observed in the form of aggregate prevalence counts. Kalbfleisch and Lawless dealt with this problem in the 1980s under a time-homogeneous Markov assumption. The full likelihood is very difficult to compute for large N, but a least-squares approach is quite effective.
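To give a feel for the least-squares idea, here is a minimal sketch for a two-state time-homogeneous Markov model, where the transition probabilities have a closed form. It is not the estimator of Kalbfleisch and Lawless: the sum of squares is unweighted, the minimisation is a crude grid search, and the function names, rates and counts are all hypothetical, chosen only for illustration.

```python
import math

def expected_counts(n0, a, b, t):
    """Expected prevalence counts at time t for a two-state Markov model.

    n0 = (count in state 1, count in state 2) at time 0,
    a = rate 1 -> 2, b = rate 2 -> 1 (both assumed > 0).
    """
    s = a + b
    decay = math.exp(-s * t)
    p11 = b / s + (a / s) * decay   # P(in state 1 at t | state 1 at 0)
    p22 = a / s + (b / s) * decay   # P(in state 2 at t | state 2 at 0)
    n1 = n0[0] * p11 + n0[1] * (1.0 - p22)
    return (n1, sum(n0) - n1)

def sum_of_squares(rates, times, counts):
    """Unweighted squared distance between observed and expected counts."""
    a, b = rates
    n0 = counts[0]                  # baseline counts are treated as fixed
    total = 0.0
    for t, obs in zip(times[1:], counts[1:]):
        exp1, exp2 = expected_counts(n0, a, b, t - times[0])
        total += (obs[0] - exp1) ** 2 + (obs[1] - exp2) ** 2
    return total

def fit_grid(times, counts, grid):
    """Return the (a, b) pair on the grid with the smallest sum of squares."""
    candidates = ((a, b) for a in grid for b in grid)
    return min(candidates, key=lambda r: sum_of_squares(r, times, counts))

# Hypothetical aggregate data: 1000 units, all in state 1 at time 0.
times = [0, 1, 2, 3, 4, 5]
counts = [(1000, 0), (753, 247), (587, 413), (476, 524), (401, 599), (351, 649)]
grid = [round(0.05 * k, 2) for k in range(1, 11)]
print(fit_grid(times, counts, grid))
```

Only the prevalence counts at each time enter the criterion, so the cost does not grow with N, which is what makes this kind of approach attractive when the full likelihood is impractical.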

Gouno et al.'s approach is somewhat different. Essentially they make a discrete-time approximation to a continuous-time process by assuming that only one transition may occur in a time unit and, moreover, that transitions between observation times occur at the midpoint of the interval. The first assumption is convenient because it means the transition counts between states can be established from the prevalence counts. The second assumption allows the sojourn time distributions to be characterised as multinomial random variables. "Complete" data in the form of the number of sojourns of different lengths can be obtained via an Expectation or Monte-Carlo Expectation step in an EM or MCEM algorithm. An MCEM algorithm is required
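To see why the one-transition assumption pins down the transition counts, consider a minimal sketch. I am assuming, purely for illustration, a progressive three-state model 1 -> 2 -> 3 observed at every time unit; the paper's setting may differ, and the function name and data are hypothetical.

```python
def transition_counts(prevalence):
    """Recover per-interval transition counts from prevalence counts.

    prevalence: list of (n1, n2, n3) at consecutive unit times for an
    assumed progressive model 1 -> 2 -> 3 with at most one transition
    per unit per time step. Returns a list of (d12, d23) per interval.
    """
    result = []
    for prev, curr in zip(prevalence, prevalence[1:]):
        if sum(prev) != sum(curr):
            raise ValueError("total number of units must be constant")
        d12 = prev[0] - curr[0]   # units leaving state 1 must enter state 2
        d23 = curr[2] - prev[2]   # units entering state 3 must come from state 2
        if d12 < 0 or d23 < 0:
            raise ValueError("counts are inconsistent with a progressive model")
        if d23 > prev[1]:
            # A 2 -> 3 move needs the unit to have been in state 2 already,
            # otherwise it would have made two transitions in one time unit.
            raise ValueError("counts imply more than one transition per unit")
        result.append((d12, d23))
    return result

# Hypothetical prevalence counts for 10 units at times 0, 1, 2, 3.
print(transition_counts([(10, 0, 0), (7, 3, 0), (5, 4, 1), (5, 2, 3)]))
# prints [(3, 0), (2, 1), (0, 2)]
```

Without the one-transition restriction a unit could pass from state 1 to state 3 within a single interval, and the same prevalence counts would then be compatible with several different sets of transition counts.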

The authors suggest that estimated sojourn distributions in each state could be compared using a log-rank test. A log-rank test would obviously be inappropriate, since it needs individually observed sojourn times and aggregate data do not provide them. A likelihood ratio test between the full model and a model where the two sojourn distributions are constrained to be equal might be appropriate. However, an issue not addressed is identifiability. Regardless of the overall number of units from which the aggregate data are taken, the actual degrees of freedom of the data are fixed. The model fitted to the example data appears to be saturated.
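The degrees-of-freedom point can be made concrete with a small counting sketch. It assumes the baseline state distribution is treated as fixed, so that each later observation time contributes one fewer free count than there are states (the counts must sum to N). The function names and the numbers in the example are hypothetical and are not taken from the paper.

```python
def data_degrees_of_freedom(n_states, n_followup_times):
    """Free quantities in aggregate prevalence data, whatever the value of N.

    At each follow-up time the n_states counts sum to N, leaving
    n_states - 1 free counts. Baseline counts are assumed fixed.
    """
    return n_followup_times * (n_states - 1)

def is_saturated(n_free_parameters, n_states, n_followup_times):
    """True if the model has at least as many parameters as the data have
    degrees of freedom, in which case it can reproduce the counts exactly."""
    return n_free_parameters >= data_degrees_of_freedom(n_states, n_followup_times)

# Hypothetical example: 3 states observed at 5 follow-up times.
print(data_degrees_of_freedom(3, 5))   # prints 10
print(is_saturated(10, 3, 5))          # prints True
```

Increasing N makes each observed proportion more precise but adds no new free quantities, so a model with many sojourn-distribution parameters can reach this bound however large the study is.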

There is a lack of explanation in places in the paper. For instance, it is unclear why several of the estimated survival functions for the example dataset start from values other than 1. It is also unclear why the estimated survival in state 1 drops to 0 at 28 time units, when 10 individuals in the original data survive to t=31.
