Building a tumor atlas: integrating single-cell RNA-Seq data with spatial transcriptomics in pancreatic ductal adenocarcinoma

Update: This paper has now been published in Nature Biotechnology.

“Building a tumor atlas: integrating single-cell RNA-Seq data with spatial transcriptomics in pancreatic ductal adenocarcinoma” by Reuben Moncada, Marta Chiodin, Joseph C. Devlin, Maayan Baron, Cristina H. Hajdu, Diane Simeone, and Itai Yanai. https://doi.org/10.1101/254375

I selected this article for review for a few reasons:

  • I thought the spatial aspects of the work were intriguing.
  • I have an interest in methods that use single cell data to deconvolve bulk samples.
  • I was curious about heterogeneity in this disease based on some of our previous work with FNA-derived PDX models.

I remain enthusiastic after the reviewer comments, though the issues they note make clear how early-stage this technology is. Certain caveats identified by Reviewer 3 about how representative the cell populations are suggest that 806 single cells may be insufficient if the tissue is not understood well enough to add back key cell types. It is also possible that certain cell types are not amenable to this procedure, in which case no number of profiled cells would be sufficient to capture them. The authors’ downstream analyses require profiles of each cell type in the mixture, so for tissues in which the relevant cell types are not all known, or for which profiles are not available, this approach may have substantial limitations. Though probably less likely, it’s possible that cell type markers taken from other sources for the missing cell populations behave differently in the context of the sample in question, which could confound deconvolution analyses. The limitations section of the discussion currently focuses entirely on the spatial characteristics of the array. Though these limitations are worth mentioning, they seem much less relevant than most of the limitations identified by the reviewers. I would like to see the authors revise this portion to give a more detailed picture of the limitations of the current stage of this technology.
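To make the missing-cell-type concern concrete, here is a minimal sketch of marker-based deconvolution. It is not the authors’ Bseq-SC pipeline: it swaps in plain non-negative least squares solved by projected gradient descent, and the marker matrix, the cell type labels, and the true proportions are all invented for illustration.

```python
def nnls_proportions(signature, mixture, steps=2000):
    """Estimate cell-type proportions for one mixed sample.

    signature: list of rows (one per marker gene); each row holds the
               expression of that gene in every reference cell type.
    mixture:   observed expression of the same genes in the mixed sample.
    Minimizes ||signature @ w - mixture||^2 subject to w >= 0 by projected
    gradient descent, then rescales w so the weights sum to 1.
    """
    n_genes, n_types = len(signature), len(signature[0])
    # 1 / trace(S^T S) is no larger than 1 / (largest eigenvalue of S^T S),
    # so this step size keeps the descent stable.
    step = 1.0 / sum(v * v for row in signature for v in row)
    w = [1.0 / n_types] * n_types
    for _ in range(steps):
        residual = [
            sum(row[k] * w[k] for k in range(n_types)) - mixture[g]
            for g, row in enumerate(signature)
        ]
        for k in range(n_types):
            grad = sum(signature[g][k] * residual[g] for g in range(n_genes))
            w[k] = max(0.0, w[k] - step * grad)  # project onto w >= 0
    total = sum(w)
    if total == 0:
        raise ValueError("all weights collapsed to zero")
    return [v / total for v in w]


# Hypothetical marker matrix: 6 genes x 3 cell types (values are made up).
SIGNATURE = [
    [9.0, 1.0, 0.5],
    [8.0, 0.5, 1.0],
    [1.0, 9.0, 0.5],
    [0.5, 8.0, 1.0],
    [0.5, 1.0, 9.0],
    [1.0, 0.5, 8.0],
]
TRUE_PROPORTIONS = [0.5, 0.3, 0.2]  # assumed ground truth for the toy mixture

# Noise-free mixture built from the assumed proportions.
mixture = [sum(s * p for s, p in zip(row, TRUE_PROPORTIONS)) for row in SIGNATURE]

# Deconvolve with the full reference, then with the third cell type missing.
full = nnls_proportions(SIGNATURE, mixture)
reduced = nnls_proportions([row[:2] for row in SIGNATURE], mixture)

print("full reference:   ", [round(v, 3) for v in full])
print("missing cell type:", [round(v, 3) for v in reduced])
```

In this toy setting the full reference recovers the assumed proportions, while dropping one cell type from the reference forces its signal onto the remaining types and inflates their estimated shares. Real data add noise and collinear markers, so the size of the bias here should not be read as representative.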

I want to thank the authors for sharing their work on bioRxiv before it has undergone peer review. I also want to take an opportunity to thank the reviewers who have donated their time to evaluate the novelty, strengths, and limitations of this work. Two reviewers chose to remain anonymous, and one chose to be named. Two of our reviewers were faculty and one was a graduate student. The named reviewer also posted the review to bioRxiv as a comment. All three reviews are available in full below.


Reviewer 1 (Gregory Way)

Moncada et al. present an interesting and well-written study introducing a method for the co-analysis of single-cell RNAseq (scRNA-Seq) with spatial transcriptome data (ST). Data in many of the figures are beautifully represented. The authors apply their approach to a single pancreatic ductal adenocarcinoma (PDAC) tumor. The integration step uses cell-type markers derived from scRNA-seq data to determine relative cell-type proportions in the ST data. The authors apply this approach and identify different sub-populations of cells - including normal pancreas cells and 3 distinct groups of cancer cells. Interestingly, two of the three cancer groups identified largely overlap with a pathologist’s histological annotation. The third group is more dispersed and does not seem spatially constrained. Furthermore, key transcription factors distinguishing the progenitor PDAC subtype are highly expressed in many of these cells. With this resolution of ST data, the authors present the ability to spatially track the expression of marker genes throughout identified cell-type populations.

While an exciting approach, I am not sure the authors have successfully demonstrated the full benefit of using scRNA-seq in the integration step. The authors currently apply tSNE to scRNA-seq data to identify cell clusters. Next, cell-type markers derived from these clusters are used to infer proportion of cell-types spatially. Many of these markers are well known and it is unclear how much more information the scRNA-seq data provides over these resources. Also, inferring cell-type from tSNE results can be misleading as distances in tSNE space are difficult to interpret, and solutions are dependent on input parameters. Furthermore, since the cells profiled are not exactly the same between ST and scRNA-seq, isn’t it possible that entire populations of cancer cell-types could be missed if only the scRNA-seq profiles are considered for deconvolution?

The validation of the data-type integration, beginning on line 189, is also not clear. The authors are asking if subpopulation substructure found in ST data is also observed in scRNA-Seq data. The expression patterns of REG1A are shown across PCA loadings (Figure 5D) and across ST array spots (Figure 5E) (Are the Figure 5E axes labeled incorrectly?). The mechanism by which the authors claim an integration is by demonstrating that PC1 of scRNA-Seq data also retains differential REG1A expression. A similar pattern is given for APOL1 in Figure S5, and a list of additional similar genes is presented in Table S1. While this is certainly an interesting observation, it is not clear that any additional knowledge is gained. Don’t we expect to see variation in these genes? What are the negative controls?
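Illustrative sketch (not part of the review): one simple form a negative control could take is a permutation test, which asks how often a gene’s association with PC1 scores is matched after the gene’s values are shuffled across cells. The data below are simulated, and the names `pc1_scores`, `marker_like`, and `unrelated` are hypothetical stand-ins rather than anything from the manuscript.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant vector has no defined correlation
    return cov / math.sqrt(vx * vy)


def permutation_p_value(scores, expression, n_perm=2000, seed=0):
    """Return (|r|, p) where p is the share of shuffles with |r| at least as large."""
    rng = random.Random(seed)
    observed = abs(pearson(scores, expression))
    shuffled = list(expression)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(scores, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Simulated stand-ins: 40 cells, one gene tied to PC1 and one unrelated gene.
rng = random.Random(1)
pc1_scores = [rng.gauss(0.0, 1.0) for _ in range(40)]
marker_like = [2.0 * s + rng.gauss(0.0, 0.5) for s in pc1_scores]
unrelated = [rng.gauss(0.0, 1.0) for _ in range(40)]

r_marker, p_marker = permutation_p_value(pc1_scores, marker_like)
r_null, p_null = permutation_p_value(pc1_scores, unrelated)
print(f"marker-like gene: |r| = {r_marker:.2f}, p = {p_marker:.4f}")
print(f"unrelated gene:   |r| = {r_null:.2f}, p = {p_null:.4f}")
```

A shuffle-based null only guards against chance association. It does not remove the circularity that arises when PC1 is computed from data that include the gene being tested, so comparing against randomly chosen genes, or genes held out of the PCA, would be a stronger control.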

Minor Concerns and General Discussion Points:

  1. The paragraph starting on line 44 was hard to understand. The paragraph introduces the problem of spatially resolving transcriptomes, but it is difficult to parse exactly what is meant.
  2. I appreciate the benefits of a simple cell-type explanation in this paper, but there is no discussion on the difficulties of identifying cell-types in scRNAseq data. For example, could Cell-type A in Figure 1 tSNE be two subpopulations?
  3. Line 158 - The sentence starting with “Deconvolving each spot…” is difficult to understand.
  4. Line 160 - Is it possible to confirm the pathologist’s margins with the ST data? How strictly do the margins separate inferred cell-type proportions? Are there places to refine the pathologist’s margins?
  5. Line 163 - Compared to Figure 2, there are very few activated cancer cells marked “A” in Figure 4C. Is this because they exist in low proportion in most spots? Could this be an early cancer progenitor line?
  6. The methods section needs substantial expansion for appropriate reproducibility. For example:
    1. Line 93 - How are the 615 genes determined to be “dynamically expressed”?
    2. Line 137 - How was enrichment determined? What was the background gene list? What was the cutoff of the highest loadings?
    3. Line 149 - There is no discussion on how the 46 cell-type mixtures are simulated.
    4. Are the software and data publicly available? This will help a researcher reproduce the analyses.
  7. There are several typos and incorrect references to figures. For example:
    1. Line 109 - Figure 2C - possibly not a typo, but CLDN1 is not listed in the relevant paragraph.
    2. Line 112 - Figure S2 legend - Is the appropriate citation 42 (Bailey et al.), not 30 (Chen et al.)?
    3. Line 116 - Figure 2 is referenced, but it should list Figure 3.
    4. Line 191, Line 195, and Line 196 - Figure 5F is referenced, but there is no Figure 5F. It looks like the Figure is mislabeled as G? Figure 5I is referenced in Line 196.
  8. Other scRNA-seq papers have shown single-cell specific heterogeneity in subtype assignments. Can all single cells (and spots) be assigned to the progenitor subtype using some sort of single sample gene set enrichment approach? Or is there also substantial intratumor heterogeneity in this PDAC tumor?
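Illustrative sketch (not part of the review): item 6.3 asks how the cell-type mixtures were simulated. The manuscript’s procedure is not described here, so the following only shows one common way such mixtures could be generated: draw random proportions on the simplex and take the weighted sum of per-cell-type reference profiles. The profiles and cell type names are invented; the count of 46 simply mirrors the number mentioned in the review.

```python
import random


def simulate_mixtures(profiles, n_mixtures, seed=0):
    """Simulate mixed expression profiles with known proportions.

    profiles: dict mapping cell type name -> list of expression values
              (one per gene, same gene order for every cell type).
    Returns a list of (proportions, expression) pairs.
    """
    rng = random.Random(seed)
    cell_types = sorted(profiles)
    n_genes = len(profiles[cell_types[0]])
    simulated = []
    for _ in range(n_mixtures):
        # Normalized Gamma(1, 1) draws are uniform over the simplex
        # (a flat Dirichlet), so every composition is equally likely.
        draws = [rng.gammavariate(1.0, 1.0) for _ in cell_types]
        total = sum(draws)
        proportions = {ct: d / total for ct, d in zip(cell_types, draws)}
        expression = [
            sum(proportions[ct] * profiles[ct][g] for ct in cell_types)
            for g in range(n_genes)
        ]
        simulated.append((proportions, expression))
    return simulated


# Hypothetical reference profiles over four genes (values are made up).
PROFILES = {
    "acinar": [12.0, 0.5, 1.0, 3.0],
    "ductal": [0.5, 10.0, 1.5, 3.0],
    "fibroblast": [1.0, 1.0, 9.0, 3.0],
}

mixtures = simulate_mixtures(PROFILES, n_mixtures=46)
first_proportions, first_expression = mixtures[0]
print({ct: round(p, 3) for ct, p in first_proportions.items()})
print([round(v, 2) for v in first_expression])
```

Because the true proportions are stored alongside each simulated profile, a deconvolution method can be scored directly against them. A flat Dirichlet is only one choice; proportions concentrated around a few dominant cell types may be closer to what an ST spot contains.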

Reviewer 2 (Anonymous Reviewer)

  1. This is a very interesting study by Moncada et al. that combines two novel methodologies, scRNA-Seq and spatial transcriptomics, to identify subpopulations of tumor cells in a heterogeneous pancreatic tumor.
  2. It is not stated whether the tumor, which was processed within 2 hours of resection and receipt in the laboratory, was obtained from a primary tumor or a treatment-naïve tumor. This methodology may have substantial clinical promise to identify and characterize subpopulations of clinically relevant tumor cells admixed with cells that represent a dynamic desmoplastic stroma that may persist post treatment. This may allow for identifying actionable targets within this treatment-resistant population.
  3. A potential limitation is that the scRNA-Seq data, given the unintended consequences of tissue dissociation, may not capture a minority cell population that may be important for disease progression.
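Illustrative sketch (not part of the review): a quick binomial calculation gives a feel for how likely a minority population is to appear among the 806 profiled cells. It assumes every cell is sampled independently and without bias, which is exactly what dissociation effects may violate, and the threshold of 10 cells is an arbitrary choice for illustration.

```python
from math import comb


def prob_at_least(n_cells, frequency, min_cells):
    """P(X >= min_cells) for X ~ Binomial(n_cells, frequency)."""
    below = sum(
        comb(n_cells, i) * frequency**i * (1.0 - frequency) ** (n_cells - i)
        for i in range(min_cells)
    )
    return 1.0 - below


N_PROFILED = 806  # single-cell transcriptomes remaining after filtering
MIN_CELLS = 10    # assumed minimum needed to define a cell-type profile

for freq in (0.001, 0.005, 0.01, 0.05):
    p = prob_at_least(N_PROFILED, freq, MIN_CELLS)
    print(f"true frequency {freq:.3f}: P(at least {MIN_CELLS} cells) = {p:.3f}")
```

Under these idealized assumptions the probability falls off quickly as the true frequency decreases, and any protocol-driven loss of a cell type would lower it further.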

Reviewer 3 (Anonymous Reviewer)

The authors proposed an interesting approach to marry single-cell RNA-Seq data with the spatially informative but bulkier (10-20 cells per spot) barcoded microarray derived RNA-Seq data (Spatial Transcriptome, or ST). The idea is to use the computationally de-convoluted cell-type specific scRNA-Seq transcriptomics data to guide categorization of the cell types in each ST spot. With hundreds or thousands of such ST spots sampled over the tissue, one can then obtain the spatial distributions that make up a tumor atlas.

While the idea is very sensible and interesting, the reviewer found significant technical issues in experimentation and computation, which obscure or bias the conclusions made in the report. Specifically:

  1. From the methods section, it appears that the scRNA-Seq data were generated from one of the 1 mm^3 chunks of tumor tissue, and a total of 4000 single cells were encapsulated by inDrop, after which only 806 single-cell transcriptomes remained post filtering. One would worry (1) whether the 806 single-cell transcriptomes are bona fide proportional to what is in the original tumor, and (2) whether all the important cell types are kept in the 806 single cells (especially the rare cell types, which might be the case for the chunk of tissue used for scRNA-Seq but not for the other tissue chunks undergoing ST experiments). As the authors pointed out, two cell types (acinar and ductal) were not retained by the protocol, and they added them back when using the cell type specific marker genes to de-convolute the ST data. Note that this step is very important, as their algorithm Bseq-SC relies on the cell-specific marker matrix to de-convolute the ST data. The authors should have obtained more single cells to make more convincing “cell type specific libraries”.
  2. For the 3 cancer cell types, how confident are the authors in assigning them to other classification systems? More quantitative comparisons would be helpful, besides listing the marker names. Moreover, although these 3 cancer types display spatial distribution preferences, one would wonder what their relationship is and how they end up aggregating where they are (on the tissue slides). In particular, if cancer cells A are a progenitor cell type as the authors speculate, why would they be located evenly everywhere?
  3. The four regions identified by just ONE pathologist do not comprise the whole slide space. What are the big regions in between normal tissue, inflammation, duct and desmoplasia? Are there AI tools that could classify the slides in an alternative (and perhaps less biased) way? This zoning seems to be very important, because the ST spots are scattered across these regions.
  4. The data quality for the further subpopulations among cancer A, B and C types is very poor and not convincing enough. For example, the subtypes identified by the PCA plot (Figure 5B) do not have clear boundaries, and when these subtypes are mapped back to the ST tissue slide, the boundaries are not very clear either. The authors need to use better clustering methods than PCA to identify subpopulations, for example NMF or other more sensitive methods that can distinguish the highly similar subpopulations. Further, using so-called top marker genes (such as REG1A) failed to show good (or even modest) separations on ST PCA, tissue slide, or scRNA-Seq PCA.
  5. Scale-up issue. The tumor tissues studied here are very small in terms of size. How would one scale this method (assuming the data quality is much improved) to much bigger tissues or organs? It is an exaggeration to say “as well as the inference of cell architecture in any tissue”; this claim should be removed.

Other minor points are:

  1. Line 126, do the authors mean “normally distributed” rather than “uniformly distributed”?
  2. In Figure 3, PC1 shows strong scores in two regions (red), but only one is cancerous. What explains this? GO terms also need P-values.
  3. Bseq-sc (line 145) is an essential computational method that is used to do de-convolution of ST data. It seems to be some wrapper around CIBERSORT. I do not see a URL for it, nor do I see a comparison of this method to other de-convolution based methods. More details are needed.
  4. Again, how did the authors determine the optimal number of subpopulations among each of the cancer cell types A, B and C? What methods did they use? How do they know the number of subpopulations they called is the most accurate? PCA is a clustering method but not a classification method.

Casey Greene is an Assistant Professor in the Department of Systems Pharmacology and Translational Therapeutics at the University of Pennsylvania's Perelman School of Medicine and the Director of Alex's Lemonade Stand Foundation's Childhood Cancer Data Lab. His lab aims to develop deep learning methods that integrate distinct large-scale datasets to extract the rich and intrinsic information embedded in such integrated data.