Whole-genome deep learning analysis reveals causal role of noncoding mutations in autism

Update: This paper has now been published in Nature Genetics.

“Whole-genome deep learning analysis reveals causal role of noncoding mutations in autism” by Jian Zhou, Christopher Park, Chandra Theesfeld, Yuan Yuan, Kirsty Sawicka, Jennifer Darnell, Claudia Scheckel, John Fak, Yoko Tajima, Robert Darnell, Olga Troyanskaya

https://doi.org/10.1101/319681

It is well known that Autism has a strong genetic component. In recent years the discovery of genetic factors linked to Autism has skyrocketed, powered by next generation sequencing. Exome sequencing studies have allowed the discovery of hundreds of coding genetic risk variants. Based on those studies, it is now accepted that de novo mutations play an important role in Autism risk. Nonetheless, a good proportion of Autism risk evidently lies in non-coding regions, which has spurred whole-genome sequencing in large cohorts of patients and families.

I selected this article for review for the following reasons:

Detecting the contribution of non-coding mutations to disease is notoriously tricky. This article is the first to use a deep-learning algorithm, DeepSEA, to address that issue in Autism.
The algorithm seems to be successful at predicting a contribution of non-coding mutations in autism patients compared to their siblings, which is a remarkable step forward for Autism Genetics.
The results also suggest that there is a link between non-coding mutations and IQ. Given that IQ is the biggest predictor of severity within Autism, these results seem to indicate that non-coding mutations may have a major role in how broad the Autism spectrum is.

Both reviewers agree that the authors present some intriguing results that could propel the field of Autism Genetics forward considerably. However, some concerns were raised regarding the details of how the algorithm was run, which could potentially affect the predictions. In particular, more details in the methods section are needed to confidently assess whether the conclusions are justified by the current version of the results and methods.

I want to thank the authors for sharing their work on bioRxiv before it has undergone peer review. I also want to take an opportunity to thank the reviewers who have donated their time to evaluate the novelty, strengths, and limitations of this work. One reviewer chose to remain anonymous, and one chose to be named. Both reviewers are faculty. The two reviews are available in full below.

Reviewer 1 (Jeffrey Barrett)

There has been a lot of attention to the role of de novo coding mutations in autism and other neurodevelopmental disorders, and a natural next question is whether it’s possible to identify non-coding mutations that confer disease risk. One of the major challenges to this is how to understand the “regulatory code” and distinguish “synonymous” from “non-synonymous” mutations outside of genes. This paper applies a deep learning (as the authors note six times) algorithm, DeepSEA, to this problem.

The major result is that DeepSEA successfully predicts a systematic difference between de novo mutations in individuals with autism and their unaffected siblings, which is an impressive achievement. One of my major suggestions for this paper is to provide more information about the input data (de novo mutations called from whole genome sequence), as this is notoriously tricky. I’m somewhat less concerned because the dataset provides a natural internal control between affected individuals and their siblings, but it would still be good to see more detail. For example:

How many de novos were called per proband and per sib? This is a basic QC metric, but I couldn’t find it.
More info would be helpful: “Further filtering was then applied to remove variants that were called in more than one SSC families.”
The paper refers to 127,140 de novo SNVs. I assume these are only non-coding (and coding mutations are stripped out), but it’s not totally clear.
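
The QC summary requested above can be sketched in a few lines. This is only an illustration: the records below are hypothetical toy calls, not data from the paper, and the tuple layout (family, role, chromosome, position) is an assumption about how a call set might be stored.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical toy call set: (family_id, role, chrom, pos). Not data from the paper.
calls = [
    ("fam1", "proband", "chr1", 1000),
    ("fam1", "proband", "chr2", 500),
    ("fam1", "proband", "chr3", 42),
    ("fam1", "sibling", "chr1", 2000),
    ("fam1", "sibling", "chr4", 77),
    ("fam2", "proband", "chr5", 10),
    ("fam2", "proband", "chr6", 20),
    ("fam2", "sibling", "chr2", 500),
    ("fam2", "sibling", "chr7", 30),
]

# Number of de novo calls per individual (family, role).
# Individuals with zero calls do not appear here and would need adding separately.
per_individual = Counter((fam, role) for fam, role, _, _ in calls)

# Mean number of calls per proband and per sibling.
by_role = defaultdict(list)
for (fam, role), n in per_individual.items():
    by_role[role].append(n)
mean_by_role = {role: mean(counts) for role, counts in by_role.items()}

# Sites called in more than one family (candidates for the recurrence filter).
families_per_site = defaultdict(set)
for fam, _, chrom, pos in calls:
    families_per_site[(chrom, pos)].add(fam)
recurrent_sites = {site for site, fams in families_per_site.items() if len(fams) > 1}

print(mean_by_role)
print(recurrent_sites)
```

Reporting these per-individual counts for probands and siblings side by side would make it easy to see whether calling sensitivity differs between the two groups.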

DeepSEA predicts biochemical disruption, and these predictions were further trained on curated HGMD disease mutations and variants observed in 1000 Genomes. What happens if the predictions from DeepSEA are used directly in the autism data? The noncoding disease mutations in HGMD might be a problematic training set, as there are not that many known, and some may not be actually pathogenic, even in the curated set.

Further analyses (e.g. of tissue specific expression and enriched biological functions) provide additional support for the main findings.

For a follow-up paper, someone (the authors or another interested party) should run this analysis on the Deciphering Developmental Disorders dataset (which I was involved in), which should have good power to find specific causal mutations: https://www.nature.com/articles/nature25983

Minor comments

The authors suggest that 30% of simplex ASD probands have a de novo coding cause (and their point is that this is not very much), but I think that’s high. I’m not sure where the number comes from, as ref 3 finds diagnostic mutations for 11%, and their Sup Note says 2.4%.

The comparison between versions of DeepSEA is described only fleetingly: “leading to significantly improved performance, p=6.7x10^-123, Wilcoxon rank-sum test”. More generally, one needs to read that paper to really understand what’s going on. This is often the case, but a bit more of a summary of the method would help.

On page 18 there’s a repeated word in “40 SSC families families”.

Reviewer 2 (anonymous)

The authors train a multitask convolutional neural network to predict various DNA and RNA associated regulatory molecular profiles (TF ChIP-seq, Histone ChIP-seq, DNase-seq, RBP CLIP-seq data) in a wide variety of cell lines and tissues from DNA and RNA sequence respectively. The model is a variation of their previous DeepSEA approach. The model can be used to predict the allele specific effects of any variant/mutation on all the prediction tasks (molecular phenotype in each tissue). The authors then train a regularized linear model to discriminate curated human disease regulatory mutations (HGMD) against rare variants from healthy individuals in the 1000 Genomes populations. This metaclassifier is used to obtain a disease impact score for all mutations in patients with simplex autism spectrum disorder (ASD) and matched healthy siblings. The authors find an elevated burden of transcriptional-regulatory disrupting (TRD) and RBP-regulatory disrupting (RRD) proband mutations in ASD with elevated effect sizes observed around loss-of-function intolerant genes. They identify specific pathways and tissues affected by these mutations. They use luciferase assays to experimentally verify the differential regulatory effect of prioritized variants. They suggest some links between the noncoding mutations and IQ in ASD.

Overall, the authors present several intriguing results and it’s great to see some preliminary experimental evaluation of predictions.

However, there are a few issues to be addressed.

The Deep learning model

The authors report the area under the ROC curves (auROCs) in Supp. Fig 1 to claim that their deep learning model is accurate and robust. However, auROC does not provide a realistic evaluation of performance for prediction tasks with significant class imbalance and low prevalence of the class of interest (in this case the positive class of peaks). All (or most) of the tasks exhibit a low prevalence of peaks and hence a significant class imbalance. auROCs can appear to be artificially high when the prevalence of the positive class is low. Please see https://www.nature.com/articles/nmeth.3945 and https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4349800/ for guidelines on reporting performance on classification tasks with significant class imbalance. The authors should report the area under the precision recall curve, the recall at specific precision/false discovery rates (e.g. 10% and 50%) and other associated measures such as F1 or MCC. These measures are particularly relevant to get a sense for how well the model is able to accurately rank the set of peaks as a whole above the background regions.
The authors appear to be making predictions for mutations genome-wide irrespective of whether the deep learning classifier is able to accurately predict the labels for the overlapping genomic region. The authors should use a cutoff on the prediction scores corresponding to a specific FDR and restrict downstream analyses to correctly predicted examples. The mutagenesis scores in a region are only as reliable as the prediction accuracy of the model for that region.
Multi-task models are often beneficial when applied to highly related tasks. However, for regulatory genomics models in particular, multi-tasking often significantly hurts performance when the model is trained on very diverse tasks with widely varying class imbalance across tasks (such as the ones in this model). The authors should report performance of representative single task models for the key molecular phenotypes to show that multi-tasking is not hurting performance. Also, fine-tuning individual tasks one at a time after multi-task training almost always boosts performance. These performance boosts could significantly improve downstream variant scoring.
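
The class-imbalance point about auROC can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' evaluation: the scores are synthetic Gaussians (positives centred at 1.5, negatives at 0), the class sizes are arbitrary, and average precision is used as a stand-in for the area under the precision-recall curve. The same score separation is evaluated once with balanced classes and once at 0.5% prevalence.

```python
import random

def auroc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank shared by tied scores
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def average_precision(labels, scores):
    """Mean of the precision at each positive, ranking by descending score."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

def simulate(n_pos, n_neg, seed=0):
    """Synthetic scores: positives ~ N(1.5, 1), negatives ~ N(0, 1) (assumed)."""
    rng = random.Random(seed)
    labels = [1] * n_pos + [0] * n_neg
    scores = [rng.gauss(1.5, 1.0) for _ in range(n_pos)]
    scores += [rng.gauss(0.0, 1.0) for _ in range(n_neg)]
    return labels, scores

bal_labels, bal_scores = simulate(2000, 2000)
imb_labels, imb_scores = simulate(200, 40000)  # about 0.5% positives

auroc_bal = auroc(bal_labels, bal_scores)
ap_bal = average_precision(bal_labels, bal_scores)
auroc_imb = auroc(imb_labels, imb_scores)
ap_imb = average_precision(imb_labels, imb_scores)

print(f"balanced:   auROC={auroc_bal:.3f}  AP={ap_bal:.3f}")
print(f"imbalanced: auROC={auroc_imb:.3f}  AP={ap_imb:.3f}")
```

In this setup the auROC barely moves between the two settings because it does not depend on prevalence, while the average precision drops sharply once positives are rare, which is why precision-recall measures are the more informative report for peak prediction tasks.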

Disease Impact Score predictor

The authors train disease impact score metaclassifiers for chromatin mediated (TRD) and RBP mediated (RRD) scores to discriminate gold standard HGMD regulatory variants from background rare variants from the 1000 Genomes Project. Are HGMD mutations enriched for neuronal/brain related disorders or other disease phenotypes related to ASD? If not, it’s unclear why a classifier trained on non-coding mutations associated with unrelated disease phenotypes would generalize to predict ASD specific mutations. Unlike coding variants, non-coding regulatory variants affect regulatory elements that typically exhibit highly tissue-specific activity in tissues relevant to the disease. For example, one might expect brain tissues to be causal for ASD mutations but not for rare autoimmune disorders, for which immune cell types are more likely to be causal. The sequence features that are predictive of tissue-specific activity of regulatory elements typically correspond to binding sites of transcription factors with cell-type/tissue specific activity. Hence, it is confusing as to how a classifier trained on an assortment of rare disease phenotypes would be able to pull out regulatory features that are specific to a disease of interest (in this case ASD). Further, how many HGMD curated regulatory variants are affecting post-transcriptional processes? How could these provide a good training set for the RBP associated RRD scores? The authors should clarify the rationale for these choices.
What is the performance of the disease impact score metaclassifier? This doesn’t seem to be reported anywhere.
The authors use rare variants from the 1000 Genomes Project as the background set. How many variants are in the negative set? Are the positive and negative sets artificially balanced (this would not be appropriate)? Also, were other confounding factors taken into account when selecting the background variants, e.g. distance to TSS and di/trinucleotide content? These are very important to account for biases in the HGMD set of regulatory variants.
The authors do not provide any baselines for their disease impact score predictor. E.g. what happens if you simply train a model that uses sequence features, distance to TSS, GC content to distinguish the HGMD regulatory variants from the rare variants? How do we know if their metaclassifier is not simply learning some bias in the HGMD set? Another strong baseline is to train the metaclassifier to discriminate the same set of positive (HGMD) and negative (1000 genome variants) using features that are the observed binary or signal labels for all tasks corresponding to the region overlapping each variant in the positive and negative set.
“When considering all de novo mutations, we observed a significantly higher functional impact in probands compared to unaffected siblings, independently at the transcriptional (p=9.4x10^-3, one-side Wilcoxon rank-sum test for all; FDR=0.033, corrected for all mutation sets tested) and post-transcriptional (p=2.4x10^-4, FDR=0.0049) levels (Fig. 1b, all variants)”.

These results are intriguing but also very confusing. On average, an individual with ASD is likely to have < 100 de-novo mutations. Most of these should be benign barring say one or a handful (probably < 5-10 de novos) that would causally affect ASD. Hence, the expectation is that the distribution of effect scores for mutations from the ASD individuals and their siblings should be very similar for most mutations (the random benign ones) except for a few outliers at the tail of the distribution. In other words, the distributions should primarily differ at the tails. That is not what the box plot like figures look like in Fig 1. It appears as if the whole body of the distribution is shifted indicating that most or a large number of de novos have large predicted effect sizes and a possible causal role (I could not verify this as I was unable to find the supplementary tables on biorxiv). The authors should provide some kind of rationale for why one should expect significantly shifted distributions rather than simply differences in the extreme tails. Also, it would be ideal to plot the actual distribution of scores so that the behavior of the body and the tails of the distributions are clearly visible. Otherwise it is hard to know which parts of the distribution (low z-scores vs. extremes) are driving the differences between the distributions.
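
The distinction between a shift in the body and a difference confined to the tails can be made concrete with a toy simulation. This is a sketch under assumed Gaussian scores, not a reanalysis of the paper's data: in the "tail" scenario 2% of proband mutations are strongly shifted, in the "body" scenario every proband mutation is shifted slightly, and both scenarios have the same expected mean shift (0.06). All parameter values are hypothetical.

```python
import random
import statistics

def quantile(values, q):
    """Simple empirical quantile: the value at position q in the sorted list."""
    s = sorted(values)
    return s[min(len(s) - 1, int(q * len(s)))]

rng = random.Random(1)
n = 50000

# Siblings: all mutations drawn from the benign background (assumed N(0, 1)).
siblings = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Tail scenario: 2% of proband mutations are high-impact (assumed N(3, 1)).
tail_probands = [
    rng.gauss(3.0, 1.0) if rng.random() < 0.02 else rng.gauss(0.0, 1.0)
    for _ in range(n)
]

# Body scenario: every proband mutation is shifted by the same small amount.
body_probands = [rng.gauss(0.06, 1.0) for _ in range(n)]

def shifts(probands):
    """Proband-minus-sibling difference at the median and at the 99th percentile."""
    median_shift = statistics.median(probands) - statistics.median(siblings)
    q99_shift = quantile(probands, 0.99) - quantile(siblings, 0.99)
    return median_shift, q99_shift

tail_median_shift, tail_q99_shift = shifts(tail_probands)
body_median_shift, body_q99_shift = shifts(body_probands)

print(f"tail scenario: median shift={tail_median_shift:.3f}, q99 shift={tail_q99_shift:.3f}")
print(f"body scenario: median shift={body_median_shift:.3f}, q99 shift={body_q99_shift:.3f}")
```

Comparing quantiles in this way, or plotting the full score distributions, shows directly whether the difference between probands and siblings sits in the body or in the extremes, which a box plot or a single rank-sum p-value does not reveal.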

A previous version of the preprint had an entire section on embryonic stem cells and related derived cell types being the top ranked cell type of origin. In the latest version of the preprint, that section seems to have been abandoned and the cell-type/tissue of origin now seems to be brain related tissues based on GTEx data. The Roadmap datasets also contain similar brain-related tissues that were not found in the previous analysis. I found it a bit disturbing that the conclusions changed so dramatically.

Luciferase assays

The luciferase assays are certainly interesting and show that the mutations exhibit allele-specific reporter activity in the BE2-C cell-line. However, the selection criterion is not clear. The methods section states “For experimental testing, we selected variants of high predicted disease impact scores larger than 0.5 and included mutations near genes with evidence for ASD association, including those with LGD mutations (e.g. CACNA2D3) and a proximal structural variant.” Were all variants with scores > 0.5 tested? Were they further filtered to those in proximity to genes already known to be associated with ASD?
Some negative controls and matched controls appear to be missing from the luciferase experiments, such as lower scoring mutations that also lie near known ASD genes or are matched for distance from TSS of genes with similar baseline expression levels to the ones with high z-scores that were tested. These controls are important because they test whether the luciferase effects are simply the result of testing mutations in promoters of genes. Mutations in promoters of genes are in general expected to have large effects on expression.
The authors do not mention any caveats to testing these variants in a single cell line BE2-C. How good a model is BE2-C of the causal cell type for ASD?
Did the direction of effect of the alleles from the model agree with what was observed in the luciferase?
Unfortunately, there is a lack of any analysis of the sequence features or other properties (e.g. distance to TSS) underlying the high-scoring mutations or the luciferase validated mutations. These would be very useful to get a sense for what these mutations are disrupting.
Unfortunately, there don’t seem to be any experimental tests for the RBP associated predictions, which appear to have even stronger effects.
