How to publish a model that hasn’t converged

The article by Matsuo et al. (2019) appeared in my newsletter. It is another attempt to sell deep learning (DL) as a promising alternative to traditional survival analysis. To this end, the authors compare the performance of their DL model to Cox proportional hazards regression (CPH) when predicting survival for women with newly diagnosed cervical cancer. More precisely, they compare a series of survival models (CPH, Cox lasso, random survival forest, and Cox boosting) to their proposed DL model. Not surprisingly, they conclude that the DL model performs better.

Nevertheless, the paper is full of misconceptions and misreporting, which makes its conclusions debatable. Let’s start with a few of their claims:

First, they claim, “Unlike CPH and its variants, deep-learning approaches can model nonlinear risk functions …”. This is not true; both random survival forests and Cox boosting allow for nonlinear functions of the risk factors. (In fact, even CPH allows for nonlinear functions, although they need to be pre-specified; see the sketch below.) Second, they argue that “deep-learning models […] easily can handle censoring in survival data”. Again, not entirely true; all the models mentioned above handle censored data. Third, they claim “… the performance of the deep-learning neural network model will perform better when large feature sets are used”. This is generally true (using more features can improve performance), but it is also clinically useless and practically infeasible. It could be quite costly and time-consuming to collect an ever-increasing number of features to achieve fractional increases in some performance metric.
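To make the first point concrete, here is a minimal sketch of a pre-specified nonlinear term in an ordinary CPH fit. It assumes the Python lifelines library and uses entirely synthetic data (all variable names and numbers are made up for illustration); the point is simply that a quadratic effect of age can be handed to a standard Cox model by the analyst.

```python
# Minimal sketch (lifelines, synthetic data): a pre-specified nonlinear term
# in a plain Cox proportional hazards fit.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(42)
n = 500
age = rng.uniform(30, 80, n)

# Simulate event times whose log-hazard depends on age quadratically.
linpred = 0.002 * (age - 55.0) ** 2
T = rng.exponential(scale=np.exp(-linpred) * 60)   # latent event times (months)
C = rng.exponential(scale=60, size=n)              # censoring times (months)
time = np.minimum(T, C)
event = (T <= C).astype(int)

df = pd.DataFrame({
    "time": time,
    "event": event,
    "age": age,
    "age_sq_c": (age - 55.0) ** 2,  # centred quadratic term, pre-specified by the analyst
})

# Both the linear and the quadratic term enter a standard CPH fit; the
# nonlinearity is specified by hand rather than learned automatically.
cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
cph.print_summary()
```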

Beyond all this, I have two more serious concerns: (1) the use of the mean absolute deviation (MAD) as a performance metric and (2) the problem of algorithmic convergence.

MAD is not properly defined

The authors define the mean absolute deviation as “the absolute difference between the original survival time (ground truth) and the model’s predicted survival time measured in months.” This definition is ambiguous, since a time-to-event model outputs the probability that a patient survives (or dies) up to a given timepoint, not a single survival time. Of course, we could derive a predicted time from the survival curve, for example the time at which the predicted survival probability drops below some threshold (the median survival time being a common choice). But for some subjects this time can be larger than the study follow-up time. In other words, not all subjects will necessarily experience the outcome within the follow-up period. This creates two problems.
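The following toy example (again lifelines, with synthetic data rather than the authors’ cohort) shows the ambiguity: under heavy censoring the estimated survival curve never drops below 50% within follow-up, so even the median survival time, an obvious candidate for a “predicted survival time”, is undefined without extrapolation.

```python
# Illustration (lifelines, synthetic data): with administrative censoring at
# 60 months, the Kaplan-Meier curve never crosses 0.5, so the median survival
# time -- one natural "predicted time" -- does not exist within follow-up.
import numpy as np
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(1)
n = 300
T = rng.exponential(scale=120, size=n)   # latent event times (months)
C = np.full(n, 60.0)                     # administrative censoring at 60 months
time = np.minimum(T, C)
event = (T <= C).astype(int)

kmf = KaplanMeierFitter()
kmf.fit(time, event_observed=event)

print(kmf.survival_function_.tail())   # the curve stops at the last observed time
print(kmf.median_survival_time_)       # inf: the median is never reached
```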

First, we would be extrapolating time: there is no way of knowing how the survival curves behave beyond the observed survival times. Second, if we do not want to extrapolate, what happens to the censored individuals? These are people who experience the event after the last follow-up evaluation or who are lost to follow-up, so their event time is unobserved. It is unclear how the data from censored individuals were analysed. Of course, one option is to simply remove all the censored individuals. However, this will likely bias the results. There are two general approaches to overcome this issue: weighting and imputation (the sketch below illustrates the weighting idea). Both methods have been explored in survival settings, for example by Graf et al. (1999), Schemper and Henderson (2000) and Gerds and Schumacher (2006).
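Here is a rough sketch of the weighting approach, inverse probability of censoring weighting in the spirit of Graf et al. (1999), applied to an absolute-error metric. This is my illustration of the general idea, not the authors’ method: the function name and inputs are hypothetical, and the censoring distribution is estimated with a Kaplan–Meier fit from lifelines.

```python
# Sketch of inverse probability of censoring weighting (IPCW) for an
# absolute-error metric: uncensored subjects are up-weighted by the inverse
# probability of remaining uncensored up to their event time, compensating
# for the censored subjects that drop out of the error calculation.
import numpy as np
from lifelines import KaplanMeierFitter

def ipcw_mad(time, event, predicted_time):
    """IPCW-weighted mean absolute deviation, computed over observed events only."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    predicted_time = np.asarray(predicted_time, dtype=float)

    # Estimate the censoring distribution G(t) = P(censoring time > t)
    # by treating censoring as the "event" in a Kaplan-Meier fit.
    km_cens = KaplanMeierFitter()
    km_cens.fit(time, event_observed=1 - event)

    obs = event == 1
    G = km_cens.predict(time[obs]).values   # G(T_i) at the observed event times
    G = np.clip(G, 1e-8, None)              # guard against division by zero

    weights = 1.0 / G
    abs_err = np.abs(time[obs] - predicted_time[obs])
    return np.sum(weights * abs_err) / np.sum(weights)
```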

To converge or not to converge … does it matter?

Another concern is the convergence of the CPH model. To my surprise, the authors report a MAD of 316 months (!) for the CPH model (Table 2), which they attribute to failed convergence. Fair enough, but this raises two further questions: why did the algorithm fail in the first place, and why was it not fixed? More importantly, since the algorithm did not converge, its output cannot be trusted.

Usually, when we write an algorithm, we want to know whether the solution it returns is the correct one for the problem it is meant to solve. This assurance often comes in the form of a convergence check. In iterative algorithms (such as the one used to fit CPH), every iteration produces a new set of parameter estimates with an associated error, and the algorithm tries to make that error smaller and smaller. We say that the algorithm converges if its sequence of errors converges. If the algorithm does not converge, a different solution could still give a lower error; in other words, our solution is not optimal.
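A toy example of what this means in code (a generic gradient-descent loop on a made-up quadratic loss, not the actual Newton–Raphson routine used to fit CPH): the fit is only trustworthy when the stopping rule is triggered by the errors settling down, not by running out of iterations.

```python
# Toy sketch of convergence in an iterative fitting routine: gradient descent
# on a quadratic loss, stopping when successive losses stop changing. If
# max_iter is hit first, the returned estimate is not the minimiser, and
# anything derived from it should be treated with suspicion.
import numpy as np

def minimise(grad, loss, x0, lr=0.1, tol=1e-8, max_iter=1000):
    x = np.asarray(x0, dtype=float)
    prev = loss(x)
    for it in range(max_iter):
        x = x - lr * grad(x)
        cur = loss(x)
        if abs(prev - cur) < tol:      # the sequence of errors has settled
            return x, True, it + 1
        prev = cur
    return x, False, max_iter          # ran out of iterations: not converged

# Example: minimise (x - 3)^2. With too few iterations the routine reports
# non-convergence and the estimate is still well away from the true minimiser, 3.
x_hat, converged, n_iter = minimise(lambda x: 2 * (x - 3), lambda x: (x - 3) ** 2,
                                    x0=0.0, max_iter=5)
print(x_hat, converged, n_iter)
```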

As a result, we cannot trust the parameter estimates from an algorithm that has not converged, and even less other derived quantities such as standard errors or p-values (which are reported in Table 4 and discussed throughout the paper). So why do the authors interpret the results in the main text as “the deep-learning model had significantly better predictions compared with the CPH model, with >10-fold difference between the 2 analytic approaches (mean absolute error for CPH vs deep-learning: 316.2 vs 29.3)”?

It is disheartening to know that such work passes both the editorial team and the peer-review process and gets published.

References

Gerds, Thomas A, and Martin Schumacher. 2006. “Consistent Estimation of the Expected Brier Score in General Survival Models with Right-Censored Event Times.” Biometrical Journal 48 (6): 1029–40.
Graf, Erika, Claudia Schmoor, Willi Sauerbrei, and Martin Schumacher. 1999. “Assessment and Comparison of Prognostic Classification Schemes for Survival Data.” Statistics in Medicine 18 (17-18): 2529–45.
Matsuo, Koji, Sanjay Purushotham, Bo Jiang, Rachel S Mandelbaum, Tsuyoshi Takiuchi, Yan Liu, and Lynda D Roman. 2019. “Survival Outcome Prediction in Cervical Cancer: Cox Models Vs Deep-Learning Model.” American Journal of Obstetrics and Gynecology 220 (4): 381.e1.
Schemper, Michael, and Robin Henderson. 2000. “Predictive Accuracy and Explained Variation in Cox Regression.” Biometrics 56 (1): 249–55.
Solon Karapanagiotis
Research Associate
MRC Biostatistics Unit
