DeepProfile: Deep learning of patient molecular profiles for precision medicine in acute myeloid leukemia

Update: The preprint discussed in this post has been updated since these reviews were posted. These reviews may no longer apply to the current version of the manuscript.

“DeepProfile: Deep learning of patient molecular profiles for precision medicine in acute myeloid leukemia” by Ayse Berceste Dincer, Safiye Celik, Naozumi Hiranuma, and Su-In Lee. https://doi.org/10.1101/278739

I selected this article for review for a few reasons:

I was intrigued by the compilation of a large dataset from many small datasets before applying unsupervised learning, which we have had success with in other settings.
I have an interest in methods that use variational autoencoders to learn low-dimensional representations.
I was curious about the translational utility of such methods.

I personally thought one of the neatest aspects of this paper was the construction of a data compendium by aggregating many different studies. The compendium was then used to train unsupervised models with a degree of resolution that would be impossible at the single-dataset level.

Both reviewers found the variational autoencoder methodologically interesting. However, both still had questions about the extent to which performance improved with this method as opposed to linear alternatives. One reviewer would have liked to see source code available (personally, I would also like to see models available - Kipoi was just announced and might be a good home for models).

The second reviewer raises a number of substantial issues that relate to the translational (as opposed to ML/genomics research) value of the work. It may be helpful for the authors to compare against approaches that are currently used or to reframe the work to focus on the take-home message for a genomics audience.

I want to thank the authors for sharing their work on bioRxiv before it has undergone peer review. I also want to take an opportunity to thank the reviewers who have donated their time to evaluate the novelty, strengths, and limitations of this work. Both reviewers are either faculty or professional scientists. One chose to be named and one chose to remain anonymous. Both would be willing to consider invitations to review this manuscript, especially in a revised form, if asked by an editor. Both reviews are available in full below.

Reviewer 1 - Mikael Huss (Peltarion)

Summary

The authors propose a method for extracting latent variables underlying gene expression in publicly available cancer datasets using variational autoencoders. A variational autoencoder (VAE) is a computational model that can be used to learn a latent feature representation that follows a specified statistical distribution (usually Gaussian). This means that after successful training of the model, examples (gene expression profiles) that are similar will also be similar in the latent feature space. The authors apply their VAE, which was trained on gene expression data from thousands of leukemia samples from public sources, on a separate set of samples where the task is to predict response to 160 different drugs, and to a complete-remission prediction problem. That is, the VAE is trained in an unsupervised way, but its learned feature representation can be successfully applied to a supervised learning problem.
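For readers less familiar with VAEs, the objective usually maximized during training is the evidence lower bound, shown here in its standard textbook form. The notation is chosen for illustration only (x is an expression profile, z the latent vector, q_phi the encoder, p_theta the decoder); the preprint's exact loss, including any weighting of the KL term, is not described in this review.

```latex
\mathcal{L}(\theta, \phi; x)
  = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  - \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big),
\qquad p(z) = \mathcal{N}(0, I)
```

The first term rewards accurate reconstruction of the input profile, and the second pulls the encoder's distribution toward the Gaussian prior, which is what gives the latent space the specified statistical distribution mentioned above.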

Major comments

The most glaring omission in the paper is that code is not provided. We are only told that the model is implemented in Keras. The parameters of the learned representation are also not given (possibly because each training run will give different parameters). I would suggest that, at a minimum, code is provided in the form of a GitHub repo or similar.
The performance using the VAE representation is not all that much better than the performance using K-means based dimensionality reduction. In Figure 4, the ROC-AUC for VAE is 0.787 and the one for K-means is 0.764 (and with only 30 samples the estimates are noisy), and in Figure 3, VAE seems to have an MSE of about 0.86 while K-means has about 0.88. I wonder if it would be possible to find some sort of “best of both worlds” solution between clustering and VAE by adapting the VQ-VAE technique (https://arxiv.org/pdf/1711.00937.pdf), even though gene expression might not be a natural fit given its continuous nature.
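The remark about noisy estimates can be made concrete with a back-of-the-envelope calculation. The sketch below uses the Hanley-McNeil (1982) approximation for the standard error of an AUC. The 15/15 split between positives and negatives is an assumption made purely for illustration, since the class balance of the 30 samples is not stated here.

```python
import math


def hanley_mcneil_se(auc, n_pos, n_neg):
    """Approximate standard error of an AUC (Hanley & McNeil, 1982)."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    variance = (
        auc * (1.0 - auc)
        + (n_pos - 1) * (q1 - auc ** 2)
        + (n_neg - 1) * (q2 - auc ** 2)
    ) / (n_pos * n_neg)
    return math.sqrt(variance)


# ASSUMPTION: 15 positives and 15 negatives; only the total of 30 is known.
N_POS, N_NEG = 15, 15

se_vae = hanley_mcneil_se(0.787, N_POS, N_NEG)
se_kmeans = hanley_mcneil_se(0.764, N_POS, N_NEG)
gap = 0.787 - 0.764

print(f"VAE     AUC 0.787, approx. SE {se_vae:.3f}")
print(f"K-means AUC 0.764, approx. SE {se_kmeans:.3f}")
print(f"Gap between the two AUCs: {gap:.3f}")
```

Under that assumed split, both standard errors come out at roughly 0.08 to 0.09, several times larger than the 0.023 gap between the two AUCs. Because both methods are scored on the same samples, a paired comparison (for example DeLong's test) would be the more appropriate formal check.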

Minor comments

P1: “Most patients with advanced cancer…” would seem to belong to the Motivation paragraph.
P1: The complete remission problem is not discussed in the abstract.
P2: I think the sentence “Our approach, namely DeepProfile, is different from the past studies in that, to our knowledge, DeepProfile is the first attempt to use deep learning to learn a feature representation from a large number of unlabeled (i.e, without phenotype) expression samples that are not incorporated to the prediction problem and use the feature representation to solve prediction tasks” would benefit from rewriting. Perhaps it could be broken up into two sentences, with a clearer explanation of how the representation learned by the unsupervised VAE from one set of samples is “transferred” to a different, supervised problem with different samples.
P2: The three unique aspects - I wonder if (2) is really unique (“ (2) DeepProfile uses deep learning in order to learn nonlinear mappings between genes and latent variables which might reveal deeper structures within the data and potentially capture complex, nonlinear relationships between gene expression and their complex traits (drug sensitivity)”). As the reference list shows, there exists previous work on this. Also, mentioning “deep learning” here seems a bit redundant.
P3: It would be good to get more detail on how the batch correction was done. Which type of method?
P4: As mentioned above, code should be provided in some form.
P4: Why use sparse regression when the feature space has already been compressed down to just 8 dimensions? Is it to be comparable to L1 regularized regression using all genes? In that case this information could perhaps be added.
P8: The information on how the K-means based dimensionality reduction is done is a bit sparse. My guess is that the data are clustered into K=8 clusters, and then each sample is represented by its distance from each cluster centroid, resulting in an 8-dimensional representation.
P8: Regarding the sentence “This is potentially because non-linear dimensionality reduction of VAE produces more informative LDR relative to the linear methods.”: I wouldn’t really call K-means a linear method (although I would call PCA one). Anyway, this sentence does not really add anything substantial to the results.
P10: I assume the jagged shapes of the ROC curves come from the low number of samples in this classification problem. Perhaps that could be explained.
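The guess in the first P8 comment above (cluster into K=8, then describe each sample by its distance to every centroid) can be sketched as follows. This is a minimal illustration of that guess, not the authors' implementation: the toy data, the function names, and the plain Lloyd's algorithm are all assumptions, and in practice a library K-means would likely be used.

```python
import math
import random


def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm; returns k centroids as lists of floats."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        # Assignment step: group each point with its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        updated = [
            [sum(col) / len(members) for col in zip(*members)]
            if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def distance_features(points, centroids):
    """Represent each sample by its Euclidean distance to every centroid."""
    return [[math.dist(p, c) for c in centroids] for p in points]


# Synthetic stand-in for an expression matrix: 40 samples x 5 "genes".
data_rng = random.Random(1)
samples = [[data_rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(40)]

centroids = kmeans(samples, k=8)
ldr = distance_features(samples, centroids)  # 40 samples x 8 features
print(len(ldr), len(ldr[0]))
```

Each row of `ldr` is then an 8-dimensional representation that could be passed to the same downstream regression or classification step as the PCA and VAE features.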

Evaluation

An interesting paper, but the methods, including code, need to be made available. The authors need to argue why someone with a similar problem setting should go for the more complicated task of training a VAE rather than using the standard K-means method.

Reviewer 2 (Anonymous Reviewer)

In this work, the authors describe a variational auto-encoder (VAE) approach to learn low-dimensional representations (LDRs) of mRNA expression data from acute myeloid leukemia (AML) patients, a group with appalling relapse rates and dismal overall survival. The authors find that the VAE’s LDR is relevant for unsupervised class discovery and translation of the discovered factors into drug response prediction, with superior results when compared to principal component analysis (PCA) or k-means, two traditional methods to discover groups in data. If the VAE approach is superior – whether due to its ability to learn nonlinear relationships (i.e. manifold embedding) or other differences from traditional approaches – it will benefit the authors to demonstrate this against tools considered state-of-the-art for the purpose of drug response prediction. The authors mention Bayesian multi-kernel learning (MKL) in passing; indeed the choice of mRNA expression microarrays as the benchmark data source is partly justified by their superior contribution to the winning MKL performances.

One interesting wrinkle in AML is that, while approved AML treatments up until a year or so ago were quite limited, the past year saw 5 drugs approved. If this method can indeed better predict response to new agents, that would represent a noteworthy and clinically relevant advance indeed! The one near-constant in AML clinical trials is that proposed biomarkers (whether genetic, transcriptomic, or immunophenotype-based) fail to deliver, as often as not. Mutant FLT3 inhibitors are particularly infamous examples (e.g. http://www.bloodjournal.org/content/117/26/6987), as is mylotarg (an antibody-drug conjugate targeting CD33). If an unsupervised manifold embedding offers additional traction against this problem, especially compared to the state of the art, the implications for future work are substantial. If time to relapse (rather than CR/not-CR) can be predicted better by a VAE-AML score for a profile-drug[s] combination than by other approaches and inputs, that too is of great interest (most patients with AML will achieve an initial CR; the durability of the remission is the item of greatest interest to most clinical hematologist-oncologists).

However, I have some concerns with the presentation of the work.

Some of the comparisons (e.g. using k-means with k=8 to compare with 8 leading PCs from PCA and an 8-dimensional LDR from the VAE) are not only a bit anachronistic, but also favor the simpler methods – in Figure 4, the performance of the latent representation “learned” by the centroids of k-means with k=8 is superior to that of VAE-AML with 7 layers. Other popular approaches and relatively simple methods such as factor analysis are not compared to VAE-AML. For reasons not entirely clear, given that the complete Affy hgu133plus2 is essentially the standard for AML studies, a mixture of hgu133a/b and hgu133plus2 arrays was extracted from GEO to train the VAE, and only their intersection was retained for training and benchmarking. There are a number of fiddly bits involved in this type of study, and given the slight AUC edge demonstrated by VAE-AML compared to (e.g.) k-means, these bits may matter.

For example, batch effect “removal” was performed with little comment upon choice of method or SNR (see https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1212-5 ), and imputation does not appear to have been considered, let alone compared, which is also odd. Some of the largest expression Affy Plus2 datasets from AML patients (e.g. GSE10358, GSE15061, and GSE22056, roughly equivalent to the entirety of the HGU133A/B haul) were not included in training, another strange omission; there is no discussion of whether the features missing from the intersection of the two platforms are missing at random, or systematically enriched for any particular qualities. It is entirely possible that, with careful training, the latent representation learned from VAE-AML may blow all other methods out of the water in predicting in vitro response to chemotherapeutic drugs. But with the slim margin, odd choice of competitors, and curious decisions taken in training the VAE, this particular comparison is not (yet) convincing, particularly with regard to how general the results may be. Most of these comparisons are against approaches that were considered obsolete 15 years ago. The authors selected a particular dimensionality to benchmark VAE-AML against other LDR approaches, but this is not representative of most modern studies (nor indeed is the use of just one assay, without any attempt to leverage standard-of-care tests, as the authors clearly indicate in their MERGE paper). A more convincing study would pit the VAE-AML LDR of microarray against other work, perhaps including:

standard backbone response & time to relapse from a reduced (17- or 3-gene score) representation of gene expression: https://ash.confex.com/ash/2017/webprogram/Paper106035.html
another validated reduced feature subset approach: http://www.haematologica.org/content/haematol/early/2017/12/11/haematol.2017.178442.full.pdf
sparse grouped factor analysis, e.g. https://academic.oup.com/bioinformatics/article/32/16/2457/2240476

There also does not seem to be an attempt to compare against existing standards. For example, “Patient XYZ [will/will not] respond to [midostaurin/a PARP inhibitor/decitabine/fludarabine]” is in some cases predicted by information collected as part of the standard of care (e.g. FISH for KMT2A rearrangements, FLT3-ITD, CEBPA compound heterozygous mutation, etc). If a learned embedding from one assay contributes substantially to improving over the standard of care for backbone induction, that’s important. If not, that too is (for better or worse) important. Another issue is that all of the AUC figures are insufficient for clinical relevance and would be ignored by most clinicians with whom I’ve worked. Another concern is that, while the existence of clinical drug synergy is an open question, some combinations of drugs appear to work better together (e.g. 7+3 or ADE for induction in AML; ATRA+As2O3 for promyelocytic leukemia). Thus, judging the performance of a representation based on its marginal 0/1 response prediction for specific in vitro response to individual drugs may discard an enormous amount of potentially actionable information.

I do not wish to minimize the contributions of this work. AML, perhaps more so than many diseases, is awash in newly approved and prospective treatments with relatively sparse (supervised) data usable to predict responses. It seems to remain an open question whether the VAE, perhaps with some additional feature engineering, does in fact produce a more useful LDR for this task than other approaches. However, the current framing of the work emphasizes certain aspects of potential clinical relevance that don’t yet appear to be justified. The authors’ previous MERGE work took a principled, iterative approach to learning a probabilistic graphical model for drug responses (similar to Bayesian MKL or sparse grouped factor approaches). This paper presents a (seemingly quite marginal) improvement in the utility of a latent representation of a single “view” of the data, with many subtle inputs to the comparison (preprocessing, imputation or lack thereof, selection of dimensionality) remaining unexamined. The authors do an admirable and concise job of demonstrating how big[ger] data plays to the VAE’s strengths. As clinical sequencing for minimal residual disease appears to be the new standard for relapse risk assessment (http://www.nejm.org/doi/full/10.1056/NEJMoa1716863) in AML, methods that can better cope with complicated relationships in semi-supervised “big data” may pull away from traditional, mostly linear, statistical learning approaches. This early investigation of such benefits would be timely even if the results were profoundly negative. I expect that a more rigorous treatment will be of wide interest.
