Human 5' UTR design and variant effect prediction from a massively parallel translation assay

Update: This paper has now been published at Nature Biotechnology.

“Human 5’ UTR design and variant effect prediction from a massively parallel translation assay” by Paul J. Sample, Ban Wang, David W. Reid, Vlad Presnyak, Iain McFadyen, David R. Morris, and Georg Seelig.

I selected this article for review for two reasons:

  • I am excited to learn about applications of massively parallel reporter assays (MPRAs).
  • I have a general interest in methods that use deep learning (in this case convolutional neural networks) with genomic data.

An increasing number of papers apply deep learning to genomics data every year. Furthermore, the advent of MPRAs has revolutionized our ability to understand the functional consequences of noncoding regions in a high-throughput manner. Here, the authors used an MPRA to measure protein expression (translation) from 5’ untranslated regions (UTRs). They also trained convolutional neural networks (CNNs) that relate human 5’ UTR sequence to translation.

Both reviewers felt this work was thoughtful, and they were excited about the performance improvements of the CNN over a linear model. However, they posed some interesting questions about the work and made suggestions that would improve the manuscript. Specifically, it would be great if the authors could (1) substantiate some of the claims with additional analyses, (2) expand the discussion to address potential problems with the CNN, e.g., model overfitting, observational noise, or over-parameterization, (3) include more comparisons to baseline and alternative approaches, (4) include additional references, and (5) provide more intuition and explanations about the CNNs. These details would significantly enhance the manuscript.

I want to thank the authors for sharing their work on bioRxiv before it has undergone peer review. I also want to take an opportunity to thank the reviewers who have donated their time to evaluate the novelty, strengths, and limitations of this work. Both reviewers have chosen to be named. Both reviewers were graduate students. The two reviews are available in full below.

Reviewer 1 (Matthew Ploenzke)

The authors develop a massively parallel reporter assay based on polysome profiling and RNA-seq to measure the degree of translation of over 300,000 random 5’ untranslated regions (UTRs) in order to quantify the translational effects of noncoding variants within 50bp upstream of a coding sequence (CDS). The polysome fractions were collected for each UTR, providing a distribution of ribosome occupancy, from which the mean ribosome load (MRL) was calculated. Two convolutional neural networks (CNNs) were then trained on the 50bp UTR sequences to predict 1) MRL and 2) the full polysome distribution (14 output predictions corresponding to 14 polysome fractions). The authors report strong performance, specifically noting improved predictive accuracy on the test set in comparison to a linear model utilizing 5-mers as features. Feed-forward visualizations were used to interpret the convolutional filters revealing recognizable motifs including stop codons and translation initiation sites.
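The mean ribosome load described above can be illustrated with a minimal sketch. It assumes MRL is the read-weighted average of the number of ribosomes associated with each polysome fraction; the function name, the per-fraction ribosome numbers, and the read counts are hypothetical and are not taken from the paper.

```python
def mean_ribosome_load(fraction_counts, ribosomes_per_fraction=None):
    """Read-weighted mean ribosome number for one UTR across polysome fractions."""
    if ribosomes_per_fraction is None:
        # Assumption: fraction i holds transcripts bound by i ribosomes.
        ribosomes_per_fraction = range(len(fraction_counts))
    total = sum(fraction_counts)
    if total == 0:
        raise ValueError("no reads observed for this UTR")
    weighted = sum(c * n for c, n in zip(fraction_counts, ribosomes_per_fraction))
    return weighted / total


# Hypothetical read counts for fractions carrying 0, 1, 2 and 3 ribosomes.
counts = [10, 30, 40, 20]
mrl = mean_ribosome_load(counts)
```

Dividing each count by the total gives the polysome distribution itself, which is the kind of multi-output target the second CNN is described as predicting.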

The trained CNN was also used to design novel UTRs targeting a specific level of protein expression (MRL) under either of two training procedures, both of which rely upon a genetic algorithm to modify the UTR sequence during calibration. These model-generated sequences were subsequently tested via polysome profiling to assess accuracy. The model performed worst on UTRs targeting high levels of MRL because these UTRs contained long stretches of poly-U sequence, a feature rarely encountered in the original 300,000 UTRs. The original model was in turn retrained on these new sequences and its performance was reassessed, showing improvements. This retrained model was then tested on a collection of 50bp UTRs preceding the start codon of over 35,000 human transcripts. This set also included 3,577 variant sequences, all of which were subsequently tested via polysome profiling to assess accuracy. The model captured 81% of the variation in MRL, indicating that relevant cis-regulatory properties of human 5’ UTRs were learned through the randomized UTR training procedure. Lastly, a mutation analysis of common 5’ UTR single nucleotide variants indicated that the strongest effects were for mutations introducing an upstream start codon.
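To make the design loop above concrete, here is a much-simplified sketch of an evolutionary search toward a target score. It is a single-sequence mutate-and-select loop, not a full genetic algorithm, and not the authors' procedure. The scoring function `toy_predictor` is a hypothetical stand-in for a trained model, and its coefficients, the seed, and the iteration count are arbitrary assumptions.

```python
import random

ALPHABET = "ACGU"


def toy_predictor(utr):
    """Hypothetical stand-in for a trained model (NOT the paper's CNN).

    The score rises with U content and drops for every upstream AUG.
    """
    return 8.0 * utr.count("U") / len(utr) - 1.5 * utr.count("AUG")


def evolve(target, length=50, iterations=800, seed=0, predictor=toy_predictor):
    """Mutate one random sequence, keeping changes that do not worsen the error."""
    rng = random.Random(seed)
    utr = "".join(rng.choice(ALPHABET) for _ in range(length))
    start_error = best_error = abs(predictor(utr) - target)
    for _ in range(iterations):
        pos = rng.randrange(length)      # position chosen uniformly at random
        base = rng.choice(ALPHABET)      # replacement base chosen uniformly
        candidate = utr[:pos] + base + utr[pos + 1:]
        error = abs(predictor(candidate) - target)
        if error <= best_error:          # accept only non-worsening mutations
            utr, best_error = candidate, error
    return utr, start_error, best_error


designed, error_before, error_after = evolve(target=4.0)
```

Because mutations are only accepted when they do not increase the error, the final error can never exceed the starting error, but the search can still stall in a local optimum.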

Major issues:

  • The claim regarding the learning of secondary structures’ repressive effect on ribosomal load is not substantiated: the only figure is provided in the supplementary material and there is no statistical assessment of the agreement (correlation).
  • The claim that filters not matching previously described PWMs indicate potential “previously undescribed regulatory interactions” (page 5) is just one possibility, and the authors fail to mention others, including model overfitting, observational noise, or over-parameterization. In other words, the CNN may simply have learned nuances in the training data that do not pertain to real biology, and the claim should address these other possibilities.
  • How did the k-mer model perform on the polysome distribution prediction? How did it perform on the pseudouridine or 1-methyl-pseudouridine analysis? These analyses should all include an accuracy benchmark. All descriptions regarding model accuracy, such as 93% constituting “exceedingly well”, should be made in reference to a baseline model.
  • There are alternatives to the genetic algorithm for designing sequences and these should be mentioned. One is deep motif (using the trained model) and another is deep generative modeling. These should be mentioned to convey alternatives or limitations of the algorithm. Another limitation may be the uniform randomness with which nucleotide mutations/switches are induced during calibration. Given that location in the 50bp UTR is important, the uniform assignment may not be ideal and may lead to local minima.
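For readers unfamiliar with the k-mer baseline referred to in the points above, the sketch below shows one way 5-mer features could be built for a linear model. The reviews state only that the linear model used 5-mers as features, so the use of raw counts, the RNA alphabet, and the function names are assumptions.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq, k=5):
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def feature_vector(seq, k=5, alphabet="ACGU"):
    """Fixed-length count vector over all 4**k possible k-mers."""
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = kmer_counts(seq, k)
    return [counts.get(kmer, 0) for kmer in vocabulary]


# A 50-nt sequence yields 46 overlapping 5-mers spread over 1,024 features.
example = feature_vector("A" * 50)
```

Count features of this kind discard where in the 50bp window a motif occurs, which is one reason a position-aware model could outperform them.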

Minor issues:

  • The second paragraph on page 5 switches to reporting R instead of R², which is reported in the rest of the paper.
  • The mutation analysis presented on page 7 is similar to the Zhou and Troyanskaya procedure (reference 5) but is not attributed to it.
  • Results for the human transcript analysis (page 6) should also contain the original (not retrained) model. Since these high-U stretches don’t exist in nature, it shouldn’t impact model accuracy. To that end it would be interesting to provide an analysis of what sequences the model misses on. These may provide information about what the model has learned from randomness relative to human UTRs.
  • Why are categories 9 and Max described as not matching the increasing trend in the last paragraph of page 5 when they do in fact seem to match it? Further, the range of the MRL for the realized values in the plot of figure 3B is much larger than that of the CNN predictions. This may be used to infer more about model uncertainty versus biological uncertainty and should be addressed. It may also indicate overfitting, as the model does not capture the true variability in the data.
  • Please provide justification or citations for the selection of 50 base-pair sequences as opposed to, for example, 30bp. The location of the maximal convolutional activations is slightly touched on in the text but does not explicitly justify the selection of 50bp. Note also that not all 50bp random sequences are generated in the experiment whereas a shorter selection could examine all possibilities. With this in mind, were any sequences deliberately left out (like those high-U sequences mentioned later), or was the procedure truly random?
  • How were the representative sequences to be measured via IncuCyte live cell imaging in figure 2E selected? Were they randomly selected or chosen due to specific characteristics? Also, the 15-fold range based on fluorescence is mentioned but no comparison to the prediction ranges is provided. Further, how did the bulk model perform on the library with an mCherry CDS replacing eGFP (only the performance of the polysome distribution model is reported)? The decrease in accuracy is attributed to specific protocol differences but left unaddressed.
  • In figure 2D, the model performance was worst for the polysome fractions near the middle of the distribution. Why might this be and what does this mean for what the CNN has learned? Why is correlation shown in 2F and not discussed?
  • Please describe the biological inspiration, or provide a citation, for the statement that the step-wise evolutionary technique is inspired by the evolution of a UTR. Secondly, the dashed line in figure 3C does not seem to match the 800 iterations described on page 5.
  • Do iterations refer to epochs (the classifier having seen the entirety of the training data once) or a different unit of training?
  • The claim on page 5 that “Prediction accuracy could be further improved by training the models directly on data from the modified RNAs” need not be stated. However, the subsequent claim that this would be due to learning the impact of Ψ and m1Ψ on secondary structure is non-trivial given the earlier lack of evidence of the model learning secondary structure.
  • On page 8 it is noted that the introduction of an upstream codon dramatically affects ribosome loading, and this claim is validly supported. However, why do different upstream-codon-inducing mutations not exhibit such a strong effect (for example, figure 4C: RPL5 position -41 mutation: G → U)? If this is due to the frame, a plot such as figure 1C would suffice.


  • The x-axis labelling and limits in the bottom-right plot of figure 3C need attention.
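The question above about reading frame can be checked mechanically. The sketch below lists upstream AUGs in a UTR and reports whether each is in frame with the main start codon. It assumes the UTR ends immediately before the A of the main AUG and numbers upstream positions -1, -2, ... counting back from that A. The function names and test sequences are hypothetical.

```python
def upstream_starts(utr):
    """Return (offset, in_frame) for every AUG in the UTR.

    offset is the position of the upstream A relative to the main AUG
    (the last UTR nucleotide is -1); in_frame is True when the distance
    to the main start codon is a multiple of three.
    """
    utr = utr.upper().replace("T", "U")
    n = len(utr)
    hits = []
    for i in range(n - 2):
        if utr[i:i + 3] == "AUG":
            hits.append((i - n, (n - i) % 3 == 0))
    return hits


def new_upstream_starts(utr, index, alt):
    """Upstream AUGs present after a substitution but absent before it."""
    mutant = utr[:index] + alt + utr[index + 1:]
    before = set(upstream_starts(utr))
    return [hit for hit in upstream_starts(mutant) if hit not in before]


reference = "C" * 9 + "AGG" + "C" * 38           # 50 nt, no upstream AUG
created = new_upstream_starts(reference, 10, "U")  # G to U creates AUG at index 9
```

Grouping measured effects by the `in_frame` flag would show whether frame alone accounts for the weaker variants.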

Editorial note: While biOverlay is not an academic journal that publishes content, this reviewer noted that, in his opinion, this paper should be accepted with major revisions.

Reviewer 2 (Leslie Myint)

The authors develop a massively parallel reporter assay to measure the translational activity induced by different 5’ UTRs via polysome profiling. From this data, they build a convolutional neural network model to predict the translational activity of an mRNA transcript from its 5’ UTR. They also use similar modeling techniques to design 5’ UTRs that target certain levels of activity. Overall, this is interesting work with thoughtful and meaningful evaluations. Continued evaluation of these models in increasingly realistic in vivo contexts would be a worthwhile line of future work.


  • Figure 1E: Regarding “This analysis recapitulated the importance of a purine (A or G) at position -3 relative to AUG and a G at +4”, it would help to have the x-axis indices labeled. It was unclear if the A in AUG was 0 or 1.
  • Page 2: Define abbreviation CDS = coding sequence
  • Page 3: To allow readers to better assess the validation, it would be helpful to have all results from references 10, 22, and 23 summarized in a table or incorporated within Figure 1E. This helps with interpretation of other positions in the seqlogo.
  • Page 3: Add that the train and test sequences were randomly selected. (This is in the Supplement but a short note in the main is useful.)
  • Figure 2B: It is natural to want to see a y = x line, but by using R-squared as a performance metric, the interest seems to be more in predicting that certain sequences have higher or lower translational activity than others (i.e. relative activity prediction). Can the authors explain the departure from the y = x line?
  • Figure 2B: What are the outlier sequences (top left)? Are there systematic features about why they predict poorly (consistently over predict)?
  • Figure 2D: Why do certain polysome fractions lead to lower or higher accuracy? Is it just random variation or something systematic?
  • Page 6, endogenous transcripts: How similar are the sequences in this new 5’ UTR library to the original completely random library? This experiment is worthwhile, but I don’t think it answers the interesting question of predicting activity of fully endogenous sequences. It would have been interesting to see the predictive ability of the model on fully endogenous sequences despite the confounding factors if the experimental design allowed for some reasoning about how the confounding factors would alter the prediction. This is more of a comment than a criticism because I don’t know how feasible it is to do such an experiment.
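The point about R-squared and the y = x line can be illustrated numerically. The sketch below assumes R-squared means the squared Pearson correlation, which the reviews do not confirm, and uses made-up observed and predicted values. Predictions that are a compressed, shifted copy of the observations give a perfect squared correlation while sitting well off the identity line.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def rmse(x, y):
    """Root-mean-square distance from the y = x line."""
    return mean((a - b) ** 2 for a, b in zip(x, y)) ** 0.5


observed = [2.0, 3.5, 5.0, 6.5, 8.0]               # hypothetical measured values
predicted = [0.5 * v + 2.0 for v in observed]      # same ranking, half the range

r_squared = pearson_r(observed, predicted) ** 2    # approximately 1
distance = rmse(observed, predicted)               # clearly above 0
```

Reporting an absolute error alongside R or R-squared would make a compressed prediction range visible.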

Stephanie Hicks is an Assistant Professor in the Department of Biostatistics at Johns Hopkins Bloomberg School of Public Health. She develops statistical methods, tools and open software for the analysis of (epi)genomics, functional genomics and single-cell genomics data.