Enhancing Result Reporting in Neuroimaging

--The Hidden Pitfalls of Multiple Comparison Adjustment

Gang Chen


Result reporting might seem like a mundane aspect of research, but its methodology is deeply institutionalized and often ritualized, making it almost automatic. Due to concerns about multiple comparisons, stringent thresholding has become a routine practice in neuroimaging. Any deviation from this norm is often seen as a violation of traditional standards. Reviewer 2, equipped with eagle eyes and cluster rulers, meticulously scrutinizes manuscripts, eagerly seeking those that fail to meet the stringent criteria.

In this blog post, we reexamine this crucial component of result reporting: the impact of stringent thresholding on statistical evidence. Multiple comparisons in neuroimaging remain a persistent party crasher who overstays! But is it truly a statistical faux pas, or have we simply run out of methodological munchies? We argue that the multiple comparison problem is overblown in neuroimaging, and that the current Band-Aid approach to handling it ironically contributes to reproducibility issues. Our recommendations for maintaining evidence continuity, rooted in causal inference, aim to enhance reproducibility and challenge conventional practices.

Why is stringent thresholding harmful for reproducibility?

The issue of multiplicity, or the multiple comparison problem, arising from simultaneous modeling (such as in massive univariate analysis), is generally regarded as a grave concern. The "dead salmon" episode (Bennett et al., 2010) notably triggered a knee-jerk reaction in the neuroimaging community, prompting statisticians to address the challenge. Consequently, it is considered essential to substantially discount statistical evidence at the analytical unit level (e.g., voxel, region of interest, matrix element). This is usually achieved through a penalizing process that adjusts the statistical evidence at the overall level (e.g., whole brain) by leveraging the spatial relatedness at the local level.
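To make the concern concrete, here is a minimal sketch of how the family-wise error rate grows with the number of tests, and how a Bonferroni-style adjustment shrinks the per-test threshold. It assumes independent tests, which is our simplification for illustration: real voxels are spatially correlated, so the numbers are only indicative. The function names are hypothetical.

```python
def fwer_independent(alpha: float, m: int) -> float:
    """Chance of at least one false positive among m independent tests,
    each run at level alpha (independence is an illustrative assumption)."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_alpha(alpha: float, m: int) -> float:
    """Per-test threshold that keeps the overall FWER at or below alpha."""
    return alpha / m


# The per-unit penalty grows with the number of analytical units.
for m in (1, 10, 100, 100_000):
    print(m, fwer_independent(0.05, m), bonferroni_alpha(0.05, m))
```

The per-test threshold depends directly on how many units are counted, which is one way to see why the amount of discounting applied to each unit is tied to the size of the data domain.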

In common practice, there has been a strong emphasis on the stringency of result reporting, imposing an artificial dichotomization through the combination of voxel-level statistical evidence and local spatial relatedness. Various sophisticated methods have been developed to address the multiplicity issue, including random field theory, Monte Carlo simulations, and permutations. At one point, fierce competitions among these methods unfolded like a drama, vying for supremacy in the name of rigor (Eklund et al., 2016). A portion of previous publications were pinpointed and dichotomously arbitrated as failing to pass the cluster stringency funnel. Since then, an imprimatur of stringency criterion appears to have been established. Nowadays, adjusting for multiple comparisons has devolved into an oversimplified ritual. Reviewer 2 now wields newfound power and a magnifying glass to scrutinize manuscripts for any "cluster failures."

We believe there are three fundamental flaws in popular neuroimaging multiple comparison adjustments for massive univariate analysis:

(1) Faustian Bargain: Sacrificing small regions for spatial leverage.
(2) Artificial Reporting: Ignoring valuable auxiliary information.
(3) Unrealistic Assumption: Assuming a global distribution that contradicts reality.

The Faustian Bargain: Discrimination against small regions

Traditional multiple comparison adjustments inherently penalize and overlook small brain regions based solely on their anatomical size, regardless of whether they exhibit comparable or even larger effect magnitudes. The selection of a specific threshold (e.g., threshold 1 or 2 in the figure below, Chen et al., 2020) is often considered arbitrary and raises other concerns. Clearly, smaller regions (e.g., region C) are disadvantaged relative to larger counterparts (e.g., regions A and B). The arbitrariness of threshold selection prompted the development of an alternative approach aimed at integrating spatial extent with statistical evidence strength, such as threshold-free cluster enhancement (TFCE). However, discrimination against small regions persists, as illustrated by the evaluation of region C using an integral approach (as depicted in the figure below).


Such a Faustian bargain aims to control the overall FWER but at the cost of sacrificing small regions. This raises an issue of neurological justice: Is this type of outright bias/discrimination/sacrifice justifiable?
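The size penalty can be illustrated with a toy example. The sketch below applies a voxel-level threshold followed by a minimum cluster size to a one-dimensional "brain". The map values, region positions, and thresholds are all made up for illustration, and the helper names are hypothetical. It is not any package's cluster procedure.

```python
def find_clusters(stat, voxel_threshold):
    """Return (start, end) pairs (end exclusive) for each contiguous run
    of values strictly above voxel_threshold."""
    clusters, start = [], None
    for i, value in enumerate(stat):
        if value > voxel_threshold and start is None:
            start = i                      # a new run begins
        elif value <= voxel_threshold and start is not None:
            clusters.append((start, i))    # the current run ends
            start = None
    if start is not None:                  # run reaches the end of the map
        clusters.append((start, len(stat)))
    return clusters


def surviving_clusters(stat, voxel_threshold, min_size):
    """Keep only clusters with at least min_size units."""
    return [(s, e) for s, e in find_clusters(stat, voxel_threshold)
            if e - s >= min_size]


# Region A: 12 units with moderate evidence; region C: 3 units with stronger evidence.
stat_map = [0.0] * 5 + [3.5] * 12 + [0.0] * 5 + [5.0] * 3 + [0.0] * 5
region_a, region_c = (5, 17), (22, 25)

survivors = surviving_clusters(stat_map, voxel_threshold=3.0, min_size=10)
print("region A survives:", region_a in survivors)
print("region C survives:", region_c in survivors)
```

In this toy case the narrow region is dropped even though its unit-level evidence is stronger than that of the wide region, purely because of its extent.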

Artificial Reporting: Ignoring valuable auxiliary information

Nowadays, visual representations of results in publications are often required in a discrete and artificial manner, as exemplified below.

[Figure: thresholded result map (cl2a) with color bar]

Two noteworthy issues arise from the figure above. Firstly, the striking cleanliness of the results may provoke skepticism regarding their alignment with real-world evidence. Could such pristine reporting practices undermine open science and transparency? Secondly, should statistical evidence alone dictate result reporting? Is there value in considering auxiliary information, such as anatomical structure or prior study results? For instance, the pronounced lack of bilateral symmetry in the figure raises a crucial question: Did the observed BOLD response truly occur asymmetrically, or does the strict adherence to artificial dichotomization, stemming from the demand by the "cluster failure" episode, create an illusion of asymmetry? Moreover, do these delineated cluster boundaries hold neurological significance?

The enduring impact of the "dead salmon" episode continues to reverberate through the field to this day. It is worth questioning whether the subsequent reactions have been excessive. How "correct" and accurate are the common solutions for multiple comparison adjustment? Is the intense focus on rigor in one particular aspect causing the field to miss the forest for the trees?

An unrealistic assumption in massive univariate analysis

One problematic assumption in massive univariate analysis leads to substantial information loss. This method implicitly assumes that all analytical units (e.g., voxels, regions, correlations) are unrelated because the same model is applied simultaneously across all units. The approach assigns epistemic probabilities based on the principle of indifference (or the principle of insufficient reason), assuming a uniform distribution, where each unit can take any value with equal likelihood. In other words, the analyst assumes no prior knowledge about the distribution across the brain. In reality, the data are more likely to follow a centralized distribution, such as a Gaussian. If you are in doubt, plot a histogram of the response variable values in the brain using your own data. This will show you whether the distribution is more accurately approximated by a uniform distribution (left below) or a centralized distribution (right below).

This hidden but unrealistic assumption of uniform distribution at the global level results in a significant loss of detection sensitivity. Common adjustment methods that control the family-wise error rate (FWER) for multiplicity can only partially recover the information loss at the local level, but not at the global level. Consequently, the excessive penalties imposed by conventional post hoc adjustment methods are like wrapping a Band-Aid around a wound: a superficial fix that may not be as thoroughly "correct" as the popular term "multiple comparison correction" implies (Chen et al., 2020; Chen et al., 2022).
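The histogram check suggested above can be sketched in a few lines. The values here are simulated stand-ins (an assumption, since no data accompany this post), so replace `effects` with your own per-unit estimates. The log-likelihood comparison is our addition as a rough numerical companion to eyeballing the histogram, and the function names are hypothetical.

```python
import math
import random


def text_histogram(values, n_bins=11, width=40):
    """Bin the values and return (counts, printable bar chart)."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0                      # guard against zero range
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / span * n_bins), n_bins - 1)
        counts[idx] += 1
    peak = max(counts)
    rows = [f"{lo + k * span / n_bins:8.3f} | " + "#" * round(width * c / peak)
            for k, c in enumerate(counts)]
    return counts, "\n".join(rows)


def mean_loglik_uniform(values):
    """Average log-density under a uniform fitted to [min, max]."""
    return -math.log(max(values) - min(values))


def mean_loglik_gaussian(values):
    """Average log-density under a Gaussian fitted by maximum likelihood."""
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / n
    return -0.5 * (math.log(2.0 * math.pi * var) + 1.0)


# Simulated stand-in for per-unit effect estimates; swap in real data here.
rng = random.Random(0)
effects = [rng.gauss(0.0, 0.5) for _ in range(5000)]

counts, chart = text_histogram(effects)
print(chart)
print("uniform :", mean_loglik_uniform(effects))
print("gaussian:", mean_loglik_gaussian(effects))
```

A higher average log-likelihood indicates the better-fitting description. With real data, the shape of the histogram is the more informative part.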

Do we really need cluster boundaries as a pacifier? Besides the excessive penalties due to the unrealistic distributional assumption, there are several other issues with the conventional multiplicity adjustment methods commonly used in practice, as elaborated in Chen et al. (2020) and Chen et al. (2022):

  • Artificial and Arbitrary Dichotomy: Does statistical evidence necessarily render positive/negative dichotomy?
  • Disassociation with Neurology: Do cluster boundaries carry anatomical/neurological relevance?
  • Arbitrariness: Adjustment is sensitive to the size of data domain (e.g., whole brain, gray matter, a particular region).
  • Ambiguity: A cluster often straddles multiple regions.
  • Information Waste: A cluster is usually reduced to a single peak in result reporting and common meta-analysis.

As an alternative to spatially-leveraged FWER, the false discovery rate (FDR) is another traditional approach to handling multiplicity. Both methods share two key aspects: (1) they serve as post hoc solutions for multiplicity arising from inefficient modeling through massive univariate analysis under the assumption of uniform distribution, and (2) they discount statistical evidence without adjusting for estimation uncertainty (e.g., standard error). Compared to FWER, FDR does not discriminate against small regions. However, the less frequent use of FDR in voxel-level analysis empirically suggests its lower result survivorship in most practical applications.
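For readers unfamiliar with FDR, the sketch below shows the Benjamini-Hochberg step-up procedure, one standard way to control it. We assume this variant for illustration, since the post does not name one. It works on unit-level p-values only, so spatial extent never enters the decision. The p-values are invented for the example.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans, True where the corresponding test is
    declared a discovery at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices by ascending p
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Track the largest rank k whose p-value satisfies p_(k) <= k * q / m.
        if p_values[idx] <= rank * q / m:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Invented p-values for ten analytical units.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
fdr_hits = benjamini_hochberg(p, q=0.05)
bonferroni_hits = [pv <= 0.05 / len(p) for pv in p]
print(sum(fdr_hits), "discoveries under FDR;", sum(bonferroni_hits), "under Bonferroni")
```

Like the FWER methods, this procedure still discounts evidence after the fact and ends in a yes/no label for each unit.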

Is result reporting a dichotomous decision-making process?

A single study cannot realistically reach a decisive conclusion. Scientific investigation has a strong cumulative element, characterized by the gradual accumulation of knowledge. Due to complex mechanisms and sample size limitations, uncertainty is an intrinsic component of typical data analysis. Therefore, each study should not be treated as an isolated report but rather as one of many collective efforts aimed at achieving a converging conclusion through methods like meta-analysis. It is imperative to maintain the integrity of statistical evidence in result reporting.

However, given the continuous nature of statistical evidence, imposing a threshold, regardless of its stringency and mathematical complexity, essentially involves drawing an arbitrary line in the sand. Clusters identified this way are as unstable as castles built in the desert. From a cognitive science perspective, this type of deterministic or black-and-white thinking through the emphasis on stringent thresholding has been described as "dichotomania" by Sander Greenland (2017) and as the "tyranny of the discontinuous mind" by Richard Dawkins (2011).

The focus on research reproducibility underscores the importance of effect estimation. The ultimate goal of scientific investigations is to understand the underlying mechanisms. From this perspective, statistical analysis should aim to assess the uncertainty of the effect under study, rather than striving for a definitive conclusion. Statistical evidence should not act as a decisive gatekeeper as in common practice, but rather play a suggestive or supporting role, helping to maintain information integrity through the uncertainty of effect estimation. Artificially dichotomizing analytical results is counterproductive and misleading, potentially creating serious reproducibility problems.

A demonstrative point about the detrimental impact of stringent thresholding due to the excessive emphasis on multiple comparison adjustment is highlighted by the NARPS project. Interestingly, the NARPS project attributed the inconsistent results across teams to disparate analytical pipelines and maintained a "reproducibility crisis" narrative. However, contrary to the NARPS project claims and prevailing beliefs on social media, more careful diagnosis indicates that different analytical pipelines were not the primary cause of the dramatically different results. Instead, it was the enforced dichotomization through multiple comparison adjustment that largely led to the so-called "reproducibility crisis" (Taylor et al., 2023).

Result reporting: Can we do better?

A causal inference perspective

From the perspective of causal inference, the problem of stringent thresholding can be conceptualized as selection bias (left figure below). When the full result is replaced with its thresholded counterpart, comparisons with other similar studies become distorted, leading to misleading conclusions in meta-analyses (right figure below). In other words, the thresholding process is akin to conditioning on a descendant of either the response or explanatory variable, resulting in selection bias from a causal inference perspective. This issue mirrors the selection bias problem of "double dipping," which the neuroimaging field was fervently trying to correct around 2010. However, the same type of selection bias continues to dominate result reporting in the field.
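The distortion can be reproduced with a small simulation. The sketch below uses a made-up true effect, standard error, and threshold (all assumptions for illustration). It compares the pooled estimate from the full set of study results with the pooled estimate from only those that survive a threshold.

```python
import random
import statistics


def simulate_studies(true_effect, se, n_studies, seed=0):
    """Draw one noisy effect estimate per study around the true effect."""
    rng = random.Random(seed)
    return [rng.gauss(true_effect, se) for _ in range(n_studies)]


# Hypothetical setting: modest true effect, identical precision in every study.
true_effect, se = 0.2, 0.1
estimates = simulate_studies(true_effect, se, n_studies=2000)

# Keep only results whose statistic (estimate / se) exceeds the threshold.
z_threshold = 3.0
survivors = [b for b in estimates if b / se > z_threshold]

# With equal standard errors, the simple mean equals the inverse-variance pooled estimate.
full = statistics.fmean(estimates)
thresholded = statistics.fmean(survivors)
print("true effect          :", true_effect)
print("pooled, full results :", full)
print("pooled, survivors    :", thresholded)
print("fraction surviving   :", len(survivors) / len(estimates))
```

Conditioning on survival keeps only the estimates that happened to land high, so the pooled value from survivors overstates the effect while the pooled value from full results stays close to the truth.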

Improving reproducibility through nuanced result reporting

Massive univariate modeling offers a conceptually straightforward and computationally feasible approach for neuroimaging data analysis. Ironically, the effort to improve reproducibility by imposing stringent multiple comparison adjustments has backfired, becoming a source of reproducibility issues itself. This pursuit of rigor through rigorous thresholding has proven counterproductive to the field.

To reduce information loss and result distortion, it is crucial to avoid artificial dichotomization through stringent thresholding. In fact, the so-called reproducibility problem, highlighted by inconsistent results across different analytical pipelines in the NARPS project, can be largely mitigated through meta-analysis using full results, as demonstrated in Taylor et al. (2023).

We recommend a nuanced approach to result reporting that preserves the continuum of statistical evidence. For example, apply a voxel-level threshold (e.g., p-value of 0.01) and a minimum cluster size (e.g., 20 voxels). Then, adopt a "highlight, but don't hide" method (also see the associated paper) to visually present the results, as illustrated in the middle panel of the figure below.

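[Figure: left, dichotomized results (cl2a); middle, "highlight, but don't hide" (hh2a); right, highlighted results with cluster contours (hc2a); color bar]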

One may choose a particular stringency (i.e., the combination of a voxel-level threshold and a minimum cluster size) for highlighting and for addressing the multiplicity issue. While this choice might seem arbitrary, the "don't hide" aspect minimizes the arbitrariness by avoiding artificial cluster boundaries. Essentially, this method emphasizes regions with strong evidence while maintaining the continuity of information. The goal is not to make a dichotomous decision (arbitrarily labeled "valid" vs "invalid" results) but to transparently present the gradation of effect patterns.
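One way to picture "highlight, but don't hide" is as an opacity rule: units at or above the chosen threshold are drawn fully opaque, and units below it fade gradually instead of vanishing. The sketch below is a voxel-level illustration only. The quadratic fade and the function name are assumptions, not the rule used by any particular software, and the minimum cluster size is left out for brevity.

```python
def highlight_opacity(stat, threshold, power=2.0):
    """Opacity in [0, 1] for one unit: 1 at or above the threshold (in
    absolute value), fading smoothly toward 0 as the statistic shrinks."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    ratio = min(abs(stat) / threshold, 1.0)
    return ratio ** power


def hard_mask(stat, threshold):
    """Conventional dichotomy for comparison: shown (1.0) or hidden (0.0)."""
    return 1.0 if abs(stat) >= threshold else 0.0


# Invented statistic values on both sides of a threshold of 3.0.
stats = [-4.2, -2.0, 0.5, 1.5, 2.9, 3.1, 5.0]
for s in stats:
    print(s, hard_mask(s, 3.0), round(highlight_opacity(s, 3.0), 3))
```

Under the hard mask, the units at 2.9 and 3.1 end up on opposite sides of an all-or-nothing boundary. Under the opacity rule they look nearly identical, which is closer to what the evidence says.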

For those who prefer explicit cluster boundaries, you can still adopt your desired level of stringency (including common multiple comparison adjustment methods) and clearly mark the cluster contours, as shown in the figure above right. However, it is important to remember that such contours are not only arbitrary but also theoretically flawed, as we have argued (e.g., they can be inaccurate and discriminatory).

The benefits of our recommended result reporting are multifaceted. Compared to the common practice of artificially dichotomized results (figure above left), the "highlight, but don't hide" method offers a balanced approach that emphasizes evidence strength while preserving information integrity. The illusion of asymmetry, seen in the figure above left, immediately disappears when the full spectrum of statistical evidence is presented. Who would prefer to maintain the asymmetry illusion for the sake of adhering to an arbitrary threshold? Wouldn't the symmetry provide reassuring support for the presence of an effect, even if the statistical evidence fails to meet a particular threshold? Who would deny the importance of evidence transparency and open science, allowing everyone to make their own judgments? What harm could arise from maintaining information continuity? For example, some bilaterally symmetrical regions shown in light red or light blue in the middle/right figure above do not have strong statistical support in the current study. However, explicitly presenting these regions is crucial as it offers potential clues for more targeted studies in the future.

Result reporting should be an engaging and thoughtful process. Instead of blindly adhering to mechanical rules (e.g., voxel-level p-value of 0.001), investigators should take a more active role. Rather than allowing statistical evidence to unilaterally dictate reporting criteria, consider the continuum of statistical evidence alongside domain-specific knowledge, including prior research and anatomical structures. This approach leads to more well-informed and robust conclusions, striking a balance between evidence strength, information integrity, and transparency. The field is gradually embracing this perspective, as evidenced by studies like a recent one published in the American Journal of Psychiatry.

Another prevalent issue in neuroimaging result reporting is the lack of effect quantification. The literature often displays only statistical values in text, figures, and tables. Scientific maturity demands a shift towards quantifying effects, and the current reliance on statistical values alone highlights the field's immaturity. Ignoring effect magnitudes can lead to misrepresentation and reproducibility problems. For instance, quantifying BOLD responses using percent signal change is feasible despite its relative nature. Proper scaling enables meaningful comparisons across contexts, individuals, regions, and studies, supporting robust population-level analyses and meta-analyses (Chen et al., 2017).
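As a rough illustration of scaling, the sketch below converts a raw time series to percent of its own mean, one common convention that we assume here. After this step, a regression coefficient fitted to the scaled series can be read as percent signal change, provided the regressor itself is suitably scaled. The signal values and the function name are made up.

```python
import statistics


def scale_to_percent(time_series):
    """Rescale a time series so that its temporal mean equals 100.

    A deviation of 1.0 in the scaled series is then 1% of the mean."""
    baseline = statistics.fmean(time_series)
    if baseline <= 0:
        raise ValueError("mean signal must be positive for percent scaling")
    return [100.0 * y / baseline for y in time_series]


# Invented raw signal in arbitrary scanner units.
raw = [1000.0, 1010.0, 990.0, 1020.0, 980.0]
scaled = scale_to_percent(raw)
print(scaled)
```

Because the scaled values no longer depend on arbitrary scanner units, magnitudes from different voxels, individuals, or studies can be placed on a common footing.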

Alternative approach: Incorporating spatial hierarchy into modeling

Was the lack of stringency in multiple comparison adjustment a "cluster failure" in neuroimaging as advertised, or was it a failure to recognize data hierarchies? As the brain is an integrative organ, it would be optimal to construct a single model for each effect of interest, rather than many simultaneous models as in massive univariate analysis. Under certain circumstances, the issue of multiple comparisons can be attributed to a poor modeling approach. In these cases, the model fails to effectively capture shared information across analytical units (e.g., across the brain), leading to compromised statistical inference. This scenario is particularly relevant in neuroimaging.

Indeed, directly incorporating the centralized distribution across analytical units into a single model holds strong promise. The conventional massive univariate approach treats each analytical unit as an isolated entity, leading to the multiplicity issue due to the presence of as many separate models as there are analytical units. In contrast, the hierarchical modeling approach constructs a single unified model that inherently addresses the issue of multiplicity. This method integrates the handling of multiplicity into the model itself, rather than applying a post hoc Band-Aid adjustment as in the traditional massive univariate analysis. In addition, brain regions are treated on an equal footing based on their respective effect strength, and smaller regions are not penalized or discriminated against due to their anatomical size.
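The idea of sharing information across units can be sketched with a simple normal-normal partial-pooling calculation. This is an empirical-Bayes shortcut chosen for brevity, not the full hierarchical model discussed here. The region estimates and standard errors are invented, and the plug-in values for the global mean and spread ignore their own uncertainty.

```python
import statistics


def partial_pooling(estimates, std_errors):
    """Shrink each unit's estimate toward the global mean.

    Assumes estimate_i ~ Normal(theta_i, se_i^2) and theta_i ~ Normal(mu, tau^2),
    with mu and tau^2 replaced by simple moment-based plug-in values."""
    mu = statistics.fmean(estimates)
    total_var = statistics.variance(estimates)           # between-unit spread plus noise
    noise_var = statistics.fmean([s ** 2 for s in std_errors])
    tau2 = max(total_var - noise_var, 1e-12)             # spread of the true effects
    pooled = []
    for y, s in zip(estimates, std_errors):
        weight = tau2 / (tau2 + s ** 2)                  # 1 = trust the unit, 0 = trust the group
        post_mean = mu + weight * (y - mu)
        post_sd = (weight * s ** 2) ** 0.5
        pooled.append((post_mean, post_sd))
    return mu, tau2, pooled


# Invented estimates for six regions; anatomical size appears nowhere in the model.
estimates = [0.10, 0.35, 0.22, 0.60, 0.18, 0.27]
std_errors = [0.10, 0.12, 0.08, 0.25, 0.09, 0.11]
mu, tau2, pooled = partial_pooling(estimates, std_errors)
for y, s, (pm, psd) in zip(estimates, std_errors, pooled):
    print(f"raw {y:.2f} (se {s:.2f}) -> pooled {pm:.3f} (sd {psd:.3f})")
```

Noisy estimates are pulled more strongly toward the group, precise ones are left mostly alone, and every pooled uncertainty is smaller than the original standard error. Each region is calibrated by its own evidence and precision rather than by its extent.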

In essence, instead of viewing multiplicity as a nuisance in massive univariate analysis that necessitates discounting statistical evidence, a hierarchical perspective transforms it into an information regularization process, enhancing the integration of statistical evidence. This hierarchical modeling approach has been implemented in the RBA program in AFNI. Due to computational constraints, it currently works with a list of brain regions instead of individual voxels. The associated theory is discussed in Chen et al. (2019). Amanda Mejia's group has also developed a similar hierarchical approach on the cortical surface.

Concluding Remarks

Through this blog post, we hope to assert the following points:

  • The common approaches to multiple comparisons in neuroimaging are excessively conservative. Despite significant efforts to address the multiplicity issue in massive univariate analysis over the past three decades, all current post hoc methods, regardless of their stringency, fail to account for the global distribution. This oversight results in unduly harsh penalties. Therefore, there is little justification for fixating on cluster size minutiae and voxel-level statistical evidence stringency (e.g., p-value of 0.001) as is typical in current practice.

  • The requirement for stringent thresholding in result reporting is damaging reproducibility in neuroimaging. The NARPS project clearly demonstrates that the detrimental impact of artificial dichotomization is a major source of the "reproducibility crisis," contrary to the dominant opinion that attributes the crisis to analytical flexibility in the field.

  • Maintaining the continuous spectrum of statistical evidence is vital for achieving reproducibility. Single studies rarely achieve decisiveness in uncovering underlying mechanisms. The "highlight, but don't hide" method offers a way to preserve information integrity. Manuscripts should be evaluated holistically rather than being nitpicked for the stringency of their thresholding.

  • Quantifying effects allows for improved reproducibility. A long-standing issue in neuroimaging is the neglect of effect quantification due to improper implementation in software packages. Without quantifying effects, assessing reproducibility using only statistical values is compromised.

  • There is significant room for improvement in neuroimaging modeling. Massive univariate analysis has been the mainstay for the past three decades due to its straightforwardness, but its inefficiency in information calibration combined with the multiplicity challenge highlights the need for more efficient approaches. Hierarchical modeling is one small step in that direction.

P.S.

Due to the pervasive indoctrination in statistics education, for better or worse, the common practice in statistical analysis remains dominated by an obsession with p-values—a trend that may persist indefinitely. For those willing to break free from the traditional statistics doctrine and resist the alluring sirens of dichotomous thinking, thresholding, and rigid decision-making, this recent discussion by Andrew Gelman offers valuable insights.