“The harmonic mean p-value for combining dependent tests” by Daniel J. Wilson. https://doi.org/10.1101/171751

I selected this article for review for the following reasons:

- Multiple testing approaches are extraordinarily important in modern science and I am personally interested in new developments in that area.
- The mathematical solution and the equivalency between the mean maximized likelihood and the harmonic mean p-values (HMP) seemed particularly elegant.
- Genome-wide association studies are well-known for their dependent testing issues, so an application to this area of this kind of approach seemed particularly interesting.

I selected as reviewers two biostatisticians who have expertise in genomic big data. Specifically, the reviewers were selected for their expertise in multiple testing and modeling of genome-wide association studies, respectively. Both reviewers found the article potentially interesting but noted that it could benefit from an improved structure, with certain aspects that would benefit from clarification. Reviewer 1 noted that the utility of the approach is unclear, given that it sets up the multiple testing framework so that the null hypotheses are nested, whereas the Bonferroni and Benjamini-Hochberg frameworks have a set-up where each null is either true or false. Both reviewers also noted that more care is required in the analysis and interpretation of GWAS results.

I want to thank the author for sharing his work on bioRxiv before it has undergone peer review. I also want to take an opportunity to thank the reviewers who have donated their time to evaluate the novelty, strengths, and limitations of this work. One reviewer chose to remain anonymous, and one chose to be named. Both reviewers were faculty. The two reviews are available in full below.

**Reviewer 1** (Wolfgang Huber)

The article proposes the harmonic mean as a summarization statistic for multiple p-values. In summary: while there are some interesting mathematical calculations, the utility of the approach is not made sufficiently evident, and the presentation would benefit from better structure and precision. Some important technical steps to understand claims in the main text are tucked away in the Supplement, but are not properly linked to.

Setup: The stated application scenario is L mutually exclusive alternative hypotheses M_i, i=1,…,L, all with the same nested null hypothesis M_0. Given the p-values p1,…,pL, the method returns a single ‘combined’ p-value (the harmonic mean of the pi, HMP) which can be used for a single decision whether to reject M_0. The main conceptual problem of the manuscript is that the proposed method is contrasted against methods (Bonferroni, Benjamini-Hochberg) that have a rather different setup: L hypotheses H_i, i=1,…,L, either of which can be true or false; and which return of L adjusted p-values that can be used to reject (or not) each H_i individually. Thus, the comparison between HMP on one side and FWER or FDR control by the methods of Bonferroni or Benjamini-Hochberg on the other is a comparison of an apple with oranges. These methods do different things.

Usefulness of the setup: The utility of the second setup described above (FWER, FDR) is evident through its many applications. It is less clear what applications of the HMP should be. Those proposed here, two GWAS studies of which more below, do indeed not fall under the author’s own assumptions—the alternatives are not mutually exclusive—and therefore seem ineligible. Despite some headscratching, I have also not been able to think of relevant applications. An internet search led me to the preprint of Vovk and Wang (2012), which considered yet another setup: L different statistical tests of the same hypothesis. This is practically relevant, and in any case, due to the technical overlap this work should be refered to and discussed.

The neuroticism GWAS application example (Fig.1) is not well presented. The results from the proposed method are not systematically compared to those of Okbay et al. (2016). For instance, Okbay et al. (2016) reported two highly significant loci at the end of Chromosome 8 (Fig.1c in their paper), which—surprisingly—are not found by the HMP method (Fig.1) – but this is not discussed.

I am also unsure what the scientific (biological) utility would be of using the HMP method for finding a region (say, 10 Mb as is done here) if there is no individual SNP with a strong association, but rather, a diffuse subset of neighboring, invididually ‘sub-significant’ and only jointly significant SNPs in that region. Such a scenario would be indicative of complex genetics (perhaps structural rearrangements or genetic interactions) that would anyway require careful manual dissection. Scientists looking to work on such regions do not need a confirmatory, aggregated (HMP) p-value to start with, they can start by eye-balling the Manhattan plot, analyse the raw data and then find confirmation in the dissection of the complex genetics. Some practical reasons for that type of analysis are indeed described in the paper of Okbay et al. (2016).

In the joint human-pathogen GWAS example (Fig.2), different things are shown in panels A and B. Panel A shows individual pairwise association tests, between each pair of human and virus variant. In contrast, Panel B shows tests for association between each human variant somehow jointly across all virus variants together; and vice versa. This conceptual difference makes it difficult to make sense of the different results of the two analyses or to conclude which one is preferable. Despite this, the pattern of small p-values look surprisingly similar. Indeed, I wonder why/how a region of many small p-values on human chromosome 19 in Panel A has been reduced in Panel B.

(A comment on visualization technique: from the caption, it appears that Fig. 2A results from plotting more than 300 million circles on a relatively limited screen area; this seems needlessly clumsy; the vast majority of data shown are hopelessly illegible.)

**References**

Okbay, Aysu, Bart M L Baselmans, Jan-Emmanuel De Neve, Patrick Turley, Michel G Nivard, Mark Alan Fontana, S Fleur W Meddens, et al. 2016. “Genetic Variants Associated with Subjective Well-Being, Depressive Symptoms and Neuroticism Identified Through Genome-Wide Analyses.” Nature Genetics 48 (6): 624–33. doi:10.1038/ng.3552.

Vovk, V., and R. Wang. 2012. “Combining p-values via averaging.” arXiv:1212.4966 [Math.ST].

**Reviewer 2** (Anonymous Reviewer)

In this manuscript, the author introduces the harmonic mean p-value (HMP) which controls the FWER by combining dependent tests using generalized central limit theorem. He applies this method to analyze a human GWAS for neuroticism and a joint human-pathogen GWAS for hepatitis C viral load.

My major comments are as follows:

- The formula of the HMP is based on the weights w of p-values (variants). While the author has conducted sensitivity analyses to look at the robustness of the HMP to the distribution of weights w, this is only in the Supplement and should be cited clearly from the main text. In terms of the application to the GWAS example, note also that some SNPs are synonymous (i.e., not causing a change in the amino acid) or nonsynonymous (when the amino acid is altered), and this difference can impact their weights on disease-genotype associations. The author should also address how to choose reasonable weights of SNPs (> millions) in practice.
- The simulation studies and their results are only presented in the Supplement and are not clearly cited in the main paper. On first reading, it did not appear that any simulations were performed.
- For the GWAS example, it would be helpful if the author focuses on “known susceptibility” SNPs i.e. SNPs that have been already replicated to be associated with the disease outcomes by several independent studies. For example, in describing the result on the most significant pairwise interaction on rs8099917, the author focused on how it would not meet the Bonferroni threshold but would meet the HMP threshold. While one can claim that this is an improvement over Bonferroni if this is a true signal, this claim does not hold because this could be a false positive.
- Can the author address the computation times for analyzing the two sets of GWAS data?