**Update: This paper has now been published in Nature Communications.**

“Single cell RNA-seq denoising using a deep count autoencoder” by Gokcen Eraslan, Lukas M. Simon (both first-authors contributed equally), Maria Mircea, Nikola S. Mueller, Fabian J. Theis. https://doi.org/10.1101/300681

I selected this article for review for two reasons:

- I work with single-cell RNA-Sequencing (scRNA-seq) data quite a bit and I have a general interest in methods that remove technical sources of variation from it.
- I have a general interest in methods that use machine learning approaches (in this case an autoencoder network) with genomic data.

Several recent methods attempt to correct for dropout events in scRNA-seq data, including scImpute, Single-cell Analysis Via Expression Recovery (SAVER), and Markov Affinity-based Graph Imputation of Cells (MAGIC). This is not an exhaustive list, however, and I would be curious to see how DCA compares to other imputation methods with preprints available, such as single-cell Variational Inference (scVI) and kNN-smoothing.

In this manuscript, Eraslan and Simon *et al.* compare their proposed method, DCA, to the first three methods listed above. They argue that while scImpute, MAGIC, and SAVER differ in their approaches, they all rely on estimating the correlation structure of the single-cell gene expression data between cells and/or genes, which limits their scalability. Furthermore, they argue that linear methods, such as scImpute, fail to capture the “underlying complexity of scRNA-seq” data. They therefore propose a non-linear approach to imputation using a “deep count autoencoder” (DCA).

Both reviewers thought this was an interesting application of machine learning to correct for dropouts in scRNA-seq data. However, they shared concerns about potential overfitting and over-imputation in the autoencoder framework and asked whether the authors could expand on these points in the manuscript. Each reviewer also made suggestions to improve the manuscript. Specifically, it would be great if the authors could (1) use more stringent mathematical language in the description of their approach, (2) expand on some details, for example the connection between the denoising step and the regression step, and how the method keeps gene-level information after dimension reduction, and (3) describe how DCA can be combined with other pre-processing and normalization methods already available for scRNA-seq data. These details would significantly enhance the manuscript.

I want to thank the authors for sharing their work on *bioRxiv* before it has undergone peer review. I also want to take an opportunity to thank the reviewers who have donated their time to evaluate the novelty, strengths, and limitations of this work. One reviewer has chosen to be named. One reviewer was a graduate student and one was a faculty member. The two reviews are available in full below.

**Reviewer 1** (Matt Ritchie)

biOverlay review of preprint entitled: “Single cell RNA-seq denoising using a deep count autoencoder” by Eraslan et al. bioRxiv (2018)

This manuscript introduces an unsupervised machine learning approach (an autoencoder) for correcting signal in zero-inflated single cell RNA-seq data.

The authors compare their DCA approach to other popular imputation methods (scImpute, SAVER and MAGIC) that have been developed for single cell data and show improved performance as measured in terms of concordance of results from bulk data and speed.

This article is well written, providing an informative introduction to the method in Figure 1 followed by analysis on both simulated and experimental data (Figures 2-9) to demonstrate its superiority over other approaches across a range of tasks. Analyses involving clustering, differential expression, pseudotime and correlation analysis between gene-protein or gene-gene are improved after denoising with DCA. DCA is available as a Python package and is run via a simple command-line interface on the filtered counts matrix.

The questions I have are all minor ones:

Figure 7. Great that DCA can handle such large data sets more efficiently than other methods. Does it also improve inference on this data set (say, if you restrict to the 5,000-cell analysis that all methods complete)? I imagine so, given previous results in the manuscript.

Preprocessing: Can DCA be combined with other normalization methods, or is library size the only option?

The authors acknowledge the issue of overfitting and over-imputation in the Discussion and mention DCA’s regularisation and hyperparameter search options to reduce this problem. I would have liked to see more analysis on this very interesting topic. I guess this is left for future work.

Is there a minimum size data set in terms of cell number that DCA can be applied to?

Figure 5 caption. ‘identify line’ -> ‘identity line’

I couldn’t find the Supplementary figures and tables (they don’t seem to be available on bioRxiv), so these results could not be assessed.

**Reviewer 2** (Anonymous Reviewer)

Eraslan et al. propose a method, DCA, to denoise scRNA-seq datasets based on an autoencoder network. Even though the idea is interesting, I think the authors need to provide more theoretical justification of their modeling (see comments 1 and 2); otherwise it looks like a direct application of an autoencoder without accounting for the characteristics of single-cell RNA-seq data. I also believe the authors need to present their method more clearly and in more stringent mathematical language. Based on the current description, many details of the method are not straightforward for readers.

Major comments:

It is not clear to me why it is necessary to use the autoencoder. It looks like directly fitting a ZINB model could also provide estimates of the mean parameters and dropout rates. What’s the connection between the denoising step and the regression step?
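For readers less familiar with the model referred to here, the zero-inflated negative binomial (ZINB) likelihood can be sketched as below. This is a minimal, standard-library-only illustration using the common mean/dispersion parameterization; the function names are hypothetical and the code is not taken from the authors' implementation.

```python
import math

def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial
    with mean mu > 0 and dispersion theta > 0."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )

def zinb_log_pmf(x, mu, theta, pi):
    """Log-probability under a ZINB: a point mass at zero with weight pi
    (the dropout probability, 0 <= pi < 1) mixed with NB(mu, theta)."""
    nb = nb_log_pmf(x, mu, theta)
    if x == 0:
        # A zero can come from dropout or from the NB component itself.
        return math.log(pi + (1.0 - pi) * math.exp(nb))
    return math.log(1.0 - pi) + nb

def zinb_neg_log_likelihood(counts, mu, theta, pi):
    """Negative log-likelihood of a list of counts sharing one (mu, theta, pi)."""
    return -sum(zinb_log_pmf(x, mu, theta, pi) for x in counts)
```

Under these assumptions, the direct fit described in this comment would amount to minimizing `zinb_neg_log_likelihood` over `(mu, theta, pi)` separately for each gene, with no parameters shared across genes.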

How does DCA account for different cell subtypes in its modeling?

Page 8: “we restricted the autoencoder bottleneck layer to two neurons and visualized the activations of these two neurons.” What’s the reasoning for using two neurons, and how should users determine the number in practice?

Figure 3: The CD4+ and CD8+ cell types were distinguishable by the first two dimensions of tSNE, but they were clustered together after DCA. This result contradicts the authors’ argument that “DCA captures cell population structure in real data”.

Figure 4C: How does the method keep gene-level information after dimension reduction? Where does the cell-to-cell heterogeneity come from if the authors are “replacing the original count values with the mean of the negative binomial component”?

Page 11: “Single-cell specific noise was added in silico by gene-wise subtracting values drawn from the exponential distribution such that 80% of values were zeros”. Why do the authors select the exponential distribution? I doubt that this simulation really leads to data that capture the properties of real single-cell data.
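To make the quoted procedure concrete, one possible reading of it is sketched below. The manuscript excerpt does not say how the exponential noise is calibrated, so the calibration used here (rescaling unit-rate exponential draws until the target fraction of a gene's values reaches zero) is an assumption, and the function name and parameters are hypothetical.

```python
import math
import random

def add_exponential_dropout(values, zero_fraction=0.8, seed=0):
    """Subtract scaled exponential noise from one gene's values and clip at
    zero so that roughly `zero_fraction` of the values become zeros.
    Assumed calibration; not necessarily the authors' procedure."""
    rng = random.Random(seed)
    draws = [rng.expovariate(1.0) for _ in values]
    # value - scale * draw <= 0 exactly when value / draw <= scale.
    ratios = [v / d if d > 0 else math.inf for v, d in zip(values, draws)]
    k = round(zero_fraction * len(values))
    # Choose the scale as the k-th smallest ratio, so k values drop to zero.
    scale = sorted(ratios)[k - 1] if k > 0 else 0.0
    return [
        0.0 if r <= scale else max(0.0, v - scale * d)
        for v, d, r in zip(values, draws, ratios)
    ]
```

In this reading, larger values are more likely to survive the subtraction, so the zeros are concentrated among low values. Whether that resembles dropout in real scRNA-seq data is the point in question.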

Page 14: “When comparing the estimated fold changes across all bootstrap iterations, DCA showed highest correspondence with bulk fold changes (Fig. 5F)”. Since bulk data is not the gold standard for single-cell analysis, the increase in correlation from ~0.75 to ~0.8 does not serve as evidence that DCA is really doing better.

Figure 6D: Since there are only eight pairs, a scatterplot or violin plot may be a better choice for illustration than a boxplot.

Page 28: In formula (1), what is the definition of the W’s? What are the dimensions of each vector or matrix? Is the z-score normalization performed by rows or by columns? I think the authors should write the formulas in a mathematically stricter form.

Page 29: Given the excess of dropout events in single-cell data, why do the authors want to add regularization on pi?

Page 30: “The denoised matrix is generated by replacing the original count values with the mean of the negative binomial component as predicted in the output layer.” This means that every gene would have the same expression across the cells? Then it is not possible to distinguish individual cells.

Minor comments:

Page 6: “We generated two simulation datasets with 200 genes…” Why do the authors use only 200 genes instead of performing a genome-wide study?

Page 27: “The hidden features are then used by the decoder to estimate the mean parameter of a normal distribution for each feature”. Should “normal distribution” be ZINB?