Mapping DNA sequence to transcription factor binding energy in vivo

Update: This paper has now been published in PLOS Comp Bio.

“Mapping DNA sequence to transcription factor binding energy in vivo” by Stephanie L Barnes, Nathan M Belliveau, William T Ireland, Justin B Kinney and Robert Phillips

https://doi.org/10.1101/331124

In this article, Barnes and colleagues constructed a libraries of strains in which GFP expression is controlled by a transcription factor, and the transcription factor (TF) binding sites were randomly mutated. Libraries were sorted by GFP fluorescence and sequenced to determine the likelihood that a given binding site corresponds to a range of fluroescence. From this data, the contribution of a single nucleotide in the binding site to the TF binding free energy was determined from a model assuming that each nucleotide contributes independently to TF-DNA binding.

My interest in this article stemmed from its similarity to base-substitution experiments used to determine DNA-binding affinities for the bacteriophage lambda TFs CI and Cro by Sarai, Takeda and Rivera. This data has been invaluable in my work, and the Sort-Seq and analysis methods applied by Barnes et al to LacI, XylR and PurR produce more useful data and can be further applied to additional repressor:operator pairs.

I thank the authors for sharing their work on bioRxiv, including supplementary material in the bioRxiv submission, and making analysis code available.

I solicited review from two scientists with experience in synthetic biology and modeling gene regulation. Reviewer 1 is a postdoctoral scholar who chose to remain anonymous. Reviewer 2 is Dr. Steven Burgess (ORCID 0000-0003-2353-7794), a visiting scholar at the University of Illinois at Urbana-Champaign. I thank the reviewers for taking the time to provide such thorough reviews.

Both reviewers requested more information about data analysis approaches beyond assuming independent contributions from each nucleotide, with specific questions about repression of operators with 2 or more mutations.

Reviewer 1 put the methods used in this work in the context of related methods. Reviewer 2 made many suggestions regarding how the authors can present their work differently and make it more accessible to other researchers. Both reviewers made additional, valuable suggestions, and in my opinion, the reviewers are in agreement that both the data here and the methods introduced will be valuable additions to the field.

Reviewer 1 (Anonymous Reviewer)

In “Mapping DNA sequence to transcription factor binding energy in vivo,” authors show that Sort-Seq (a massively parallel reporter assay) can be used to model transcription factor (TF)-DNA binding energies. Specifically, they show the utility of Sort-Seq to:

Predict the binding energies for a set of both wild-type and mutated lac operators
Model an induction response in a lac-based system using IPTG as effector
Assess the impact of mutations in the DNA-binding domain of the Lac repressor to its binding preferences
Study the influence of binding site context (i.e. flanking regions) to TF binding.

Comments:

The step 3 of the pipeline to obtain energy matrices from Sort-Seq involves FACS-sorting plasmids into four bins based on YFP fluorescence levels (Figure 1). Why authors use only four bins and not more is not specified anywhere in the text, while the separation of plasmids into more bins (e.g. 10) would result in better energy matrices.
Related to the previous question, in the step 4, energy matrices are created from the nucleotide sequences and activities of the tested promoters (the latter of which depends on the bin where each promoter is classified). As noted by the authors, provided of the limitations of matrices (e.g. they assume that each nucleotide position within a binding site contributes independently to the final binding energy), I wonder whether the use of deep/machine learning tools or high-order (and more complex) models, such as hidden-Markov models, would be more appropriate for modelling binding energies.
When predicting the binding energies for a set of Lac operators derived from the operator O1, O2 and O3, authors state that “a library generated with O1 as the reference sequence is unlikely to share any mutant sequences with a library generated with O2 or O3 as the reference sequence”. I wonder if that is the case. I would like the authors to provide: 1) a Venn diagram showing the intersection between the sequences in each library (if any); and 2) to compute what is the probability of inserting different number of mutations to O1, O2, and O3, so that two sequences derived from one of these operators is the same.
In the same section, authors state that TF binding site “positions 4–10 show the greatest sequence preference” (i.e. are the most important). Hence, I wonder how these positions relate to the TF-DNA contacts between the Lac repressor and the operator. For reference, the structure of a Lac repressor bound to a natural operator O1 is provided here (DOI: 10.1006/jmbi.2001.5024).
Moreover, they show that “the majority of mutations available to O1 incur a penalty to binding energy, while many of the mutations available to O3 enhance the binding energy” and that “this is consistent with the observation that O1 has a strong binding energy while O3 has a weak binding energy”. Hence, I wonder whether the mutations that enhance the binding in O3 make the mutated O3 look more similar to the operator O1 (i.e. more canonical).
In Figure 2B, authors state that “while the energy matrices are qualitatively similar for all three operators, the sequence logos indicate clear differences in the information that can be provided by each operator.” In my opinion, O2 is a weaker version of O1, while O3 is a much weaker version of O1 (and hence of O2). Even at the positions indicated with an arrow, the O1 logo allows for a G, which are the preferred nucleotides in the O3 logo.
In Figure 3A, fold-changes in expression obtained for 1, 2, and 3 bp mutants are very similar between them, so are their fitted binding energies, while one would expect 3 bp mutants to be weaker than 2 bp mutants and, in turn, 2 bp mutants to be weaker than 1 bp mutants. Figure 3B shows a similar trend: binding energies for 1, 2 and 3 bp mutants (dots) overlap. Is there any explanation for this result? Moreover, the maximum delta binding energy between any two mutants is 6 kBT. For Figure 3C, delta binding energies of up to 1.5 kBT for sequences with 4 or fewer mutations are considered acceptable. Hence, are these differences of up to 6 kBT meaningful?
In Figure 6A, authors compare the energy matrices for two adjacent XylR binding sites at the xylE promoter and show that “the energy matrices and sequence logos for these binding sites have some significant dis-similarities”, particularly “at positions 6-8, where the left-hand site prefers TTT and the right-hand site prefers AAA”. In the xylE promoter, the left-hand XylR site is adjacent to a CRP site, while the right-hand XylR site is adjacent to the RNAP site. Hence, authors conclude that “the close proximity of these binding sites suggests that there may be direct interactions between proteins, which could alter how each XylR interacts with the DNA, thus altering sequence preferences”. However, because the energy matrices and sequence logos for the two XylR sites are obtained in the context of the natural xylE promoter, it is possible that the observed differences are simply due to the limitation of the method to explore n bp mutants of the xylE promoter.

In summary, Sort-Seq has the potential to replace different high-throughput methods at specific tasks, at least in prokaryotes. For example, like HT-SELEX (DOI: 10.1371/journal.pcbi.1000590), Sort-Seq can be used to predict TF-DNA binding energies. Moreover, like the yeast one-hybrid system (DOI: 10.1126/science.aad2257), Sort-Seq can be used to systematically evaluate the effect of mutations on TFs to their DNA-binding specificities. Furthermore, like MITOMI (DOI: 10.1073/pnas.1715888115), Sort-Seq can be used to assess the effect of regions flanking TF binding sites to the binding energy. Finally, in a recently published paper (DOI: 10.1073/pnas.1722055115), the authors showed that like ATI (DOI: 10.1038/nbt.4138), Sort-Seq combined with mass spectrometry can be used to identify active TFs. However, transcriptional regulation in eukaryotes is far more complex than in prokaryotes and thus, so there is the possibility that Sort-Seq is not a valid tool for the study of eukaryotic systems.

Reviewer 2 (Steven Burgess)

In the paper “Mapping DNA sequence to transcription factor binding energy in vivo” Barnes and co-workers use energy level matrices calculated from Sort-SEQ data to parameterize thermodynamic models of LacI-mediated transcription. These data were then used for rational design of promoter architecture altering leakiness and expression, providing an important demonstration of an approach which can complement large scale screening of promoter libraries in synthetic biology.

This work moves beyond the authors’ previous studies that have developed from using thermodynamic models to predict the effect of LacI repressor copy number on expression (Garcia and Phillips 2011), rationally increasing LacI promoter strength from models (Brewster, Jones, and Phillips 2012) to application of Sort-SEQ to analysis of LacI operator (Belliveau et al. 2018). The major novelties are in:

More sophisticated tuning of induction kinetics than previously described (Brewster, Jones, and Phillips 2012; Rohlhill et al. 2017): producing several distinct promoters with a focus on leakiness and EC50 (Fig 4).
Use of Sort-SEQ and energy matrixes to identify protein-DNA interactions through the analysis of amino acid substitutions (Fig 5)
Use of Sort-SEQ to analyse the effect of genomic location on cis-element activity (Fig 6).

Throughout the study, the authors use energy level matrices to represent nucleotide contributions to TF binding energies with the assumption of independence of nucleotide contribution to binding specificity. There is a good description of alternate methods in Appendix D, and citations indicating that in most instances these models perform sufficiently well. Further they provide evidence that there is little benefit in using alternate methods (Appendices C and D), and that the models were sufficient for promoter tuning (Fig 4). However, it would illuminating to include discussion of in the main text on recent work that finds, in the case of the LacI operator, when the number of mutations is >2, independence is not a fair assumption due to issues such as compensatory mutations (Zuo and Stromo 2014; Zuo and Stromo 2015). A direct comparison would be particularly helpful in context to the data presented in Figure 3, where the effect of >2 mutations on gene expression is investigated, and the authors do not appear to find increased deviation between predicted and observed values as the number of mutations increases.

The authors find their modelling approach works well for strong affinity binding sites such as the O1 LacI operator, but the correlation breaks down for weak sites such as O3 (Fig 2; Appendix E) which is attributed to experimental noise. This is plausible, and the authors mention a “landing pad” approach, whereby constructs are integrated at a particular location as a potential solution. Readers might benefit from some clarification about this, as it differs from the method used in this paper, whereby LacI and promoter constructs were integrated at the ybcN and galK loci, respectively. In addition, mentioning alternatives such as comparison against and internal fluorescence standard would be beneficial.

Another potential cause of poor fit for weak binders may be bias introduced during Sort-SEQ experiments, which have been shown to be sensitive to choice of bin width (Peterman and Levine 2016), particularly as the number of bins used here (n=4) is on the lower side. To assess this, it would be helpful to include a sensitivity analysis looking at accuracy and efficiency of the Sort-SEQ experimental setup as described by Peterman and Levine. In addition, the authors demonstrated that intrinsic sequencing biases and a two-point model could increase the accuracy of predictions when applied to a strong binding site (O1) (Appendix D). Engineering of weak sites may sometimes be desirable, so repeating this analysis for O3 would be useful to see if predictions can be improved, and a means of ruling out a stronger influence of dinucleotide substitutions or sequencing biases in that particular operator.

Finally the authors use Sort-SEQ to look at how genomic context affects DNA binding matrixes for TFs with three different modes of operation, including co-activation (XylR), repression by occlusion (PurR) and both repression by occlusion or preventing promoter release (LacI) (Fig 6). They find some evidence supporting the hypothesis that repression by occlusion is unaffected by genomic context (PurR, LacI) but that conformational changes induced during co-binding may alter binding site preferences (XylR). However, there is a discrepancy for the second LacI site, which the authors suggest works through trapping of RNAP on the promoter, as this site does not (as hypothesized) differ from that of the simple repression site. However, this may be explained by the observations of Becker et al. (2013) which make the case against RNAP trapping by LacI operator sequence, which could be reflected in the text. These findings also highlight the difficulty in drawing conclusions from the data, which would require knowledge of the precise regulatory mechanisms involved - something which is currently lacking for the promoters analysed (e.g. xylE). Accordingly the authors do qualify all statements about Fig 6. Using Sort-SEQ to determine whether TF binding is affected by genomic context is an exciting prospect but is not really conclusively demonstrated here. It might be worth considering whether inclusion of this data in the main body of an otherwise very strong text strengthens the overall manuscript or not.

Minor comments

Might consider including ORCID IDs, and using CREDIT taxonomy for author contributions
Abstract: Mention the species being analysed in the abstract to orientate readers, what the measured sites are. e.g. E. coli, XylR, PurR, LacI
Figure 1 and 2: Clarify how the data presented in the energy matrices is different that results presented in 10.1073/pnas.1722055115 (Fig 2), which also used Sort-Seq to look at the LacI promoter, are the data the same just the results have been expanded to show the effect of different mutations?
The actual bp sequences of the operators should be detailed in the methods section.
Figure 2: include what criteria/cut off the inference that preferred bases differ substantially from the O1 matrix.
Figure 3(A): Clarification about what weak and strong operators that were used (O1 and O3)? It would also be helpful to annotate the figure with this as well as the legend to allow for quick scanning. It is unclear what the numerical values on the plot refer to.
Figure 3: It would be helpful to quantify the quality of binding energy predictions for the different number of mutations using something like Root Mean Squares Error or pearson correlation coefficient as done in appendices. Figure 3C: Clarify whether the data points displayed a mixture of values using O1 and O2 matrices?
Figure 4: Please provide an indication of what fold-change refers to on the figure itself, i.e. fold change repressor binding energy (KbT) Please indicate what error bars correspond to and provide n for measured values.
Figure 4: A: y-axis label it is unclear how parameter value is defined. C–H it might be helpful to label y-axis with YFP fluorescence (A.U.). Text on plots is too small to read. Values for n and definition of what error bars represent would be helpful.
Figure 4: Consider the choice of colors for those suffering from color blindness (see Color Oracle app)
Figure 5: Please include definition of p value. Much of the legend text should perhaps be in results.
Figure 6: Could consider making more concise, several sentences are repeated from the main text and are perhaps more suited to results.
Methods: Consider provide the raw sequences for the promoter library used in the Sort-Seq experiment either as a supplementary table or deposited in a database like FigShare to promote reproducibility.
Methods: Provide a citation for pUA66-operator-GFP
Methods: Provide reaction conditions for electroporation or link to extended protocol on site such as protocols.io or bioprotocol.
Methods: Consider providing centrifugation steps in g rather than rpm or include centrifuge make and rotor number
Methods: Information about sequencing chemistry used i.e. HiSEQ, and stats relating to number of reads etc would be useful
Methods: It is excellent that the code is shared on Github. Might consider turning this into an entry on the Zenodo database to obtain a DOI for referencing purposes. Github repositories can be directly converted.
Table 1: Might consider highlighting the position of the bp mutation for quick scanning, i.e. coloring or underlining mutated bases
Table 1: Please include n values for the measured binding Might want to include conflict of interest statement

Menu

Mapping DNA sequence to transcription factor …

Mapping DNA sequence to transcription factor binding energy in vivo