“Detecting and harmonizing scanner differences in the ABCD study - annual release 1.0” Dylan M Nielson, Francisco Pereira, Charles Y Zheng, Nino Migineishvili, John A Lee, Adam G Thomas, Peter A Bandettini bioRxiv 309260; doi: https://doi.org/10.1101/309260
I have selected this manuscript to be featured in biOverlay because it discusses an increasingly important topic: the statistical issues involved in combining data from multiple sources. The authors used a large dataset produced by the ABCD consortium that spans multiple sites and scanners. They employed a machine learning scheme to evaluate how well one can predict the site each scan comes from. Furthermore, they estimated how scanner/site-related variance compares in magnitude to variance associated with sex, handedness, and age. The authors did not stop at evaluating how much of an issue site effects are. They also showed that applying a method called ComBat, developed in bioinformatics to deal with batch effects, can greatly reduce the problem (but not remove it completely). The reviewers proposed a list of enhancements to the manuscript, which are outlined below.
I want to thank the authors for sharing their work on bioRxiv before it has undergone peer review. I also want to take an opportunity to thank the reviewers who have donated their time to evaluate the novelty, strengths, and limitations of this work.
Reviewer 1 (Anonymous)
This is an important study, given the increasing number of large, multi-center, multi-scanner trials that are already happening, and given the move towards pooling of data from different studies into central repositories. The authors state:
The ABCD project has set a new standard for rapid, coordinated data sharing by making both the raw images and pre-processed data available well before the primary investigators had analyzed or published on them.
This statement has one large caveat, however. To date, the ABCD study has released only a summary of key scanning parameters. The authors cite Casey et al. (2018) as a source of scanning parameters. The Casey paper contains only the headline scanning parameters, however, as does the ABCD study website. While the ABCD study claims to have harmonized scanning parameters across scanner vendors, this cannot be taken on faith. Considering just the fMRI data as an example, it is well known that the default phase encoding direction for axial EPI is anterior-posterior on Siemens scanners but posterior-anterior on GE scanners. If this important parameter is not harmonized, the distortion direction and Nyquist ghost locations will be strong indicators of the scanner being used. Similarly, EPI on the Philips scanner required a partial Fourier acquisition in the phase encoding direction to match the TE of 30 ms as used on GE and Siemens. This implies a slower gradient slew rate and longer echo spacing. If so, the distortion level in Philips would be greater than for the other two vendors. Combining these factors alone would permit an experienced physicist to categorize each vendor’s raw data by inspection with high confidence. The potentially differing distortions would propagate as systematic displacements of the fMRI data released by ABCD. Until the complete scanning parameters are available, we are left to speculate on what might be driving the results of the current study. (I note that other large studies, including UK Biobank, ADNI and the Human Connectome Project, all made available full PDF records of their scanning parameters.)
It is a little harsh to criticize the current study for a lack of full parameter transparency by ABCD. However, a large omission in the current study is indeed the lack of a more thorough explanation for their predictions. I would strongly encourage the authors to request all scanning parameters from ABCD and then attempt to discern which factors are most likely responsible for the vendor predictions before they submit their paper to a peer-reviewed journal. This could serve two purposes: (1) it would alert ABCD (and future multi-center studies) to subtleties in parameter harmonization, and (2) it may permit future users of ABCD study data to consider remedial measures to tackle specific limitations in the data. For example, it may be feasible to eliminate some of the vendor-specific EPI distortions with field map corrections, should distortions prove to be a major vendor-specific issue.
Reviewer 2 (Anonymous)
The manuscript by Nielson et al. examines the effect of different scanners on post‑processed statistics of resting and task fMRI data from the ABCD Study. Furthermore, the paper examines the improvements from applying ComBAT, a batch effect correction technique widely used in genomics and more recently in anatomical and diffusion MRI. The paper evaluates data with and without ComBAT correction using two methods. First, they compare classifier performance, specifically the ability to classify manufacturer, make and model. Second, they quantify percent variance explained by manufacturer, make and model above and beyond demographic variables. Using the above criteria, the authors employ a compelling set of empirical experiments to illustrate the existence of scanner effects on resting and task activation statistics at the level of large scale brain regions and networks in the ABCD study and show that ComBAT can reduce scanner‑specific variation in these statistics.
With the growth of large scale multi‑center fMRI studies, addressing scanner‑specific nuisance variation is an extremely important topic and this timely paper effectively highlights this issue. In my comments below I explain some concerns on two issues: (i) the particular methods employed to evaluate scanner effects, especially the usage of balanced permutations, and (ii) the need to clearly delineate the methods employed to illustrate scanner effects and the usage of ComBAT in the analysis of the ABCD dataset from general recommendations on how ComBAT ought to be employed for harmonizing fMRI data.
The figures are compelling and clearly labelled. The overall exposition of the paper is very good and the procedures easy to understand. However, a few procedures critical to performance evaluation are not clear.
- How are permutation tests for classification performed? i.e. What is the test statistic comparing classifier performance before/after ComBAT?
- It is not always clear whether multiple testing correction has been applied to two‑sample tests that compare pre/post ComBAT corrected values or to separate one‑sample tests for whether scanner effects exist before ComBAT and after ComBAT separately. In Fig. 1 and Fig. 4 for instance, the red lines for multiple testing threshold suggest the latter. In this case, it is not clear if the difference between statistical significance and statistical non‑significance will be significant. I don’t have enough information to know whether the analysis has been done appropriately. Regardless of individual features however, it is clear that there is a consistent reduction in overall scanner effects.
2. Comparing classification performance using permutation tests
On pg. 2‑3, the authors discuss their rationale for balancing covariates in their permutations.
we balanced these permutations on sex, handedness, and age so that our null distribution takes into account the imbalance of these factors between scanners. This gives us a more accurate null distribution for testing the percent variance explained by manufacturer, model, and scanner, but it precludes assessing the significance of the percent variance explained by sex, handedness, and age.
The paper does not describe how such balanced permutations have been achieved. It is a common but incorrect intuition that balancing covariates in randomization or permutations is desirable to eliminate statistical bias. Randomization breaks the relationship between covariates and outcome, on average. It has been shown that deliberately picking permutations that have some form of balance is undesirable, as this can make permutation tests more optimistic.* Illustrative examples have been provided by Senn, motivated by randomization in clinical trials, and by Southworth et al. in the context of permutation tests.
Note however that stratifying cross‑validation folds to ensure all scanner/make/models are balanced across folds is important to assess out‑of‑fold prediction error. Thus, the above comment only applies to balancing the randomizations in different permutations.
*The only exception however is that in modest sample sizes, one might employ re‑randomization techniques to achieve covariate balance in treatment assignment to obtain greater precision in treatment effects. The benefits are supposedly negligible in large sample sizes as the average treatment effect remains unchanged. https://projecteuclid.org/euclid.aos/1176344064 & https://projecteuclid.org/euclid.aos/1342625468
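For contrast with balanced permutations, an unrestricted permutation test of the percent variance explained by scanner can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the authors' procedure: the function names, the synthetic data, and the simple one-way variance decomposition (with no adjustment for sex, handedness, or age) are all assumptions made for the example.

```python
import random


def percent_variance_explained(values, groups):
    """Between-group sum of squares as a percentage of the total sum of squares."""
    grand_mean = sum(values) / len(values)
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    by_group = {}
    for v, g in zip(values, groups):
        by_group.setdefault(g, []).append(v)
    between_ss = sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2 for vs in by_group.values()
    )
    return 100.0 * between_ss / total_ss


def permutation_p_value(values, groups, n_permutations=2000, seed=0):
    """Unrestricted permutation test: scanner labels are shuffled uniformly at random."""
    rng = random.Random(seed)
    observed = percent_variance_explained(values, groups)
    shuffled = list(groups)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # no balancing constraint on the shuffled labels
        if percent_variance_explained(values, shuffled) >= observed:
            exceed += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (exceed + 1) / (n_permutations + 1)


# Synthetic data (not ABCD): hypothetical scanner "B" has a shifted mean.
data_rng = random.Random(1)
scanners = ["A"] * 40 + ["B"] * 40
feature = [data_rng.gauss(0.0, 1.0) + (0.8 if s == "B" else 0.0) for s in scanners]
observed, p_value = permutation_p_value(feature, scanners)
print(f"variance explained: {observed:.1f}%, p = {p_value:.4f}")
```

Every relabelling is equally likely under this scheme, which is the property that restricting to balanced permutations gives up.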
3. Addressing the issue of factors confounded with site
To what extent are age, sex and other demographic variables confounded with scanner/make and model in the ABCD data? It would be great if there was a small table or figure showing this, as it provides important information on how to assess ComBAT but also on the overall difficulty of addressing scanner effects. These could be considered variables of interest in other studies, for example.
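A table of this kind could be produced by cross-tabulating scanner against each demographic variable. The following is a minimal sketch in Python using only the standard library; the records, field names, and helper functions are hypothetical placeholders, not ABCD data or ABCD variable names.

```python
from collections import Counter


def cross_tabulate(rows, row_key, col_key):
    """Count participants for every (row level, column level) combination."""
    counts = Counter((r[row_key], r[col_key]) for r in rows)
    row_levels = sorted({r[row_key] for r in rows})
    col_levels = sorted({r[col_key] for r in rows})
    return row_levels, col_levels, counts


def format_table(row_levels, col_levels, counts, row_label="scanner"):
    """Render the counts as a plain-text table, one line per row level."""
    lines = [row_label.ljust(12) + "".join(c.rjust(6) for c in col_levels)]
    for r in row_levels:
        cells = "".join(str(counts[(r, c)]).rjust(6) for c in col_levels)
        lines.append(r.ljust(12) + cells)
    return "\n".join(lines)


# Hypothetical records for illustration only.
participants = [
    {"scanner": "scanner_1", "sex": "F"},
    {"scanner": "scanner_1", "sex": "F"},
    {"scanner": "scanner_1", "sex": "M"},
    {"scanner": "scanner_2", "sex": "M"},
    {"scanner": "scanner_2", "sex": "M"},
    {"scanner": "scanner_2", "sex": "F"},
    {"scanner": "scanner_2", "sex": "M"},
]

row_levels, col_levels, counts = cross_tabulate(participants, "scanner", "sex")
table = format_table(row_levels, col_levels, counts)
print(table)
```

Strongly unequal proportions across the rows of such a table would indicate that the demographic variable is confounded with scanner.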
A recent paper (Jacob et al., 2016) compared many strategies for reducing unwanted variation and batch effects and studied ComBAT under scenarios with no confounding, modest confounding and severe confounding in simulations. They found that ComBAT does effectively handle the scenario where there is confounding between covariates of interest and site.
However, from the perspective of quantifying variance explained or classifier accuracy of scanner effects, it is desirable to eliminate variation due to age/sex/handedness by regressing these variables out. Doing so is reasonable as it evaluates how much scanner effects unexplained by demographic variables have changed pre and post ComBAT. However, readers should not mistake the models used in Methods section, pg. 3, equations (1‑5) to be the recommended way of applying ComBAT or batch effect correction procedures in general. The paper should make the purpose of this approach to evaluating site effects clear and distinguish it from the recommendations about applying ComBAT in general.
Explanation: Addressing demographic variation between centers of data collection requires careful epidemiological modeling of how demographic factors are causally related to MRI measurement error as well as the biological parameters of interest. For instance, an increase in age might genuinely change the task activation of interest but also result in dropout and thus measurement error. Thus, it would not be advisable to simply regress out age from the data as a general procedure for addressing variation between sites.
4. Should ComBAT be applied on post‑processed derivatives?
The ABCD 1.0 release has over 4000 participants and the release provides post‑processed statistics of task‑activation and resting state correlations at the level of brain regions and networks. The use of ComBAT to illustrate the existence and extent of scanner effects on these statistics is thus an important contribution of this paper. Nevertheless, it seems important for future readers to understand the nature and limitations of applying ComBAT on post‑processed derivatives rather than for instance minimally processed voxel intensities in the EPI images before statistical analysis, i.e. before the estimation of task activations or resting state correlations. A full answer to this question is beyond the scope of this paper and I certainly don’t suggest the authors tackle this issue. However, it would be nice for the authors to highlight caveats/limitations on this issue in the discussion, particularly to emphasize that there might be alternative ways to apply ComBAT to fMRI that yield further improvements.
Explanation: Previous papers by Fortin et al. (2016, 2017) on applying ComBAT show substantial reduction in unwanted between‑subject heterogeneity of voxel‑level intensities in structural MRI and DWI. Thus, it is quite likely that statistical power for task/resting signals could improve further if ComBAT were applied to the EPI images directly. The most likely scenario for applying ComBAT to post‑processed data is that it ameliorates scanner variation but may not improve statistical power for biological variables of interest. For complicated post‑processing, it is not clear whether applying ComBAT gives back the measures one would have obtained had there been no scanner variation in the input to the analysis. I expect that the more non‑linear (beyond additivity/scaling operations) the intermediate statistical analysis for post‑processing is, the more important it will be to apply ComBAT to the EPI rather than to post‑processed statistics.
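To make the "additivity/scaling" point concrete, the location/scale model underlying ComBAT can be written as below. This is a simplified sketch in notation commonly used for ComBAT in neuroimaging, and the notation is an assumption for illustration rather than taken from the manuscript under review: v indexes the feature, i the scanner, and j the participant; the standardization and empirical Bayes shrinkage steps are omitted.

```latex
% Location/scale model: additive scanner shift \gamma_{iv}, multiplicative scanner scale \delta_{iv}
y_{ijv} = \alpha_v + X_{ij}\beta_v + \gamma_{iv} + \delta_{iv}\,\varepsilon_{ijv}

% Harmonized value: remove the estimated shift, rescale, and restore the covariate effects
y^{\mathrm{ComBat}}_{ijv}
  = \frac{y_{ijv} - \hat{\alpha}_v - X_{ij}\hat{\beta}_v - \gamma^{*}_{iv}}{\delta^{*}_{iv}}
  + \hat{\alpha}_v + X_{ij}\hat{\beta}_v

% If a derived statistic is affine in the input, f(y) = c + d\,y, a scanner shift and scale
% in y remains a shift and scale in f(y):
f(\gamma + \delta y) = c(1 - \delta) + d\gamma + \delta f(y)
```

For a non‑linear f there is no such guarantee, so a shift and scale correction applied to the derived statistic need not reproduce what would have been obtained by harmonizing the input first.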
Terry L Jernigan, Betty J Casey, Duncan Clark, Ian Colrain, Anders Dale, Thomas Ernst, Raul Gonzalez, Mary Heitzeg, Krista Lisdahl, Monica Luciana, Bonnie Nagel, Elizabeth Sowell, Lindsay Squeglia, Susan Tapert, and Deborah Yurgelun‑Todd. Adolescent Brain Cognitive Development Study (ABCD) ‑ Annual Release 1.0 #500. 2018. doi: 10.15154/1412097.
W. E. Johnson, C. Li, and A. Rabinovic. “Adjusting batch effects in microarray expression data using empirical Bayes methods.” Biostatistics 8.1 (2007): 118–127. doi: 10.1093/biostatistics/kxj037
Stephen Senn. “Seven Myths of Randomization in Clinical Trials.” https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.5713 https://statistics.fas.harvard.edu/files/statistics/files/21_stephen_senn.pdf
Lucinda K. Southworth, Stuart K. Kim, and Art B. Owen. “Properties of Balanced Permutations.” Journal of Computational Biology 16.4 (2009): 625–638. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3148117/
Laurent Jacob, Johann A. Gagnon‑Bartsch, and Terence P. Speed. “Correcting Gene Expression Data When Neither the Unwanted Variation nor the Factor of Interest Are Observed.” Biostatistics (Oxford, England) 17.1 (2016): 16–28.