Comparing cluster sizes within an ROI

Dear AFNI team,

We are looking for guidance about how to compute something along the lines of a p value in the following situation.

Imagine that we have two conditions (A and B), and that for each we’ve established a 3D cluster map (relative to baseline), using an independent voxel threshold and a spatial extent threshold (following currently advised techniques; Cox et al., 2017). Further imagine that we interrogate each of these two maps within an ROI (e.g., L insula), and establish any cluster(s) for each map lying within it. Imagine, for example, that condition A activates a 70-voxel cluster within L insula above baseline, whereas condition B activates a 20-voxel cluster.

What we would like to establish is something like a p value for the probability of observing clusters of these two sizes within the ROI by chance (i.e., assuming that there is no underlying difference between the two conditions). In the example above, what is the probability of observing a 70-voxel cluster for condition A together with a 20-voxel cluster for condition B in L insula (again, when the two conditions are assumed not to differ)?

A potential wrinkle is that clusters could overlap within the ROI. In the example above, imagine that we assess overlap within L insula using a simple conjunction analysis, and find that there are 15 shared voxels between conditions A and B, along with 55 unique voxels for A and 5 unique voxels for B. Another form of our question could be what is the probability of observing this particular pair of unique voxel counts (55 and 5), again assuming no underlying difference between conditions?
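(For concreteness, here is a minimal sketch in R of how such shared and unique voxel counts can be tallied from two binary masks restricted to the ROI; the masks below are invented toy vectors, not real cluster maps.)

```r
# Toy illustration with invented 1-D "masks"; in practice these would be the
# thresholded cluster maps for conditions A and B, restricted to the ROI.
roi_size <- 200
mask_A <- rep(FALSE, roi_size); mask_A[1:70]  <- TRUE   # 70-voxel cluster for A
mask_B <- rep(FALSE, roi_size); mask_B[56:75] <- TRUE   # 20-voxel cluster for B

shared   <- sum(mask_A & mask_B)    # voxels active for both A and B
unique_A <- sum(mask_A & !mask_B)   # voxels active only for A
unique_B <- sum(!mask_A & mask_B)   # voxels active only for B
c(shared = shared, unique_A = unique_A, unique_B = unique_B)  # 15, 55, 5
```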

Are there any AFNI tools for computing these kinds of probabilities? If not, do you have any suggestions for how to do so? Please let us know if any further information would be helpful.

Much thanks for your assistance! Best regards, Larry

Hi, Larry-

I am afraid I don’t see how that could be accomplished. Perhaps something involving a huge amount of Monte Carlo simulation/bootstrapping could get at it, but it seems hard to obtain an objective answer to that question that wouldn’t be highly assumption- or model-dependent, or one that would generalize.

Is there a way that you can reformulate that question of interest into something that would fit into a statistical test framework more directly? Can you not analyze those groups in a single test, say?

–pt

Larry,

imagine that we assess overlap within L insula using a simple conjunction analysis, and
find that there are 15 shared voxels between conditions A and B, along with 55 unique
voxels for A and 5 unique voxels for B. Another form of our question could be what is the
probability of observing this particular pair of unique voxel counts (55 and 5), again
assuming no underlying difference between conditions?

Why do you consider those 55 (out of the 70 voxels under condition A) and 5 voxels (out of 20 voxels under condition B) “unique”? The two hypothetical clusters are identified per some arbitrary strength of statistical evidence (e.g., overall FWE of 0.05). Suppose there is a voxel within L insula but outside of the 70-voxel (or 20-voxel) cluster: you could say that the statistical evidence for that voxel is weaker than the preset comfort level, but you cannot claim that there is no effect under condition A (or B). Lack of strong evidence for an effect is not necessarily evidence for lack of an effect.

What we would like to establish is something like a p value for the probability of observing
clusters of these two sizes within the ROI by chance (i.e., assuming that there is no underlying
difference between the two conditions).

I don’t know the big picture, but why not directly make inferences about the difference between the two conditions (e.g., condition A minus condition B) across the brain or within L insula?

Much thanks for your comments! What follows is a detailed account of our situation. I don’t know how to present it any more succinctly, so I apologize for the length, and I understand if you don’t have the time or inclination to read it. At the same time, you might find the issue described here interesting. We’re increasingly struck by how much more informative it is to assess breadth of activation instead of activation strength, and what we’re asking about is how to appropriately assess differences in breadth. We think that a solution to this problem could be of interest and use to other researchers (although it wouldn’t be the first time I’ve been wrong about something like this!).

Here’s the situation. This is now the third experiment in which we have observed what I’m going to describe next. We have a task of interest (most recently, food choice) that is typically performed by 20-30 participants, with large numbers of trials (so we have reasonable power). Critically, we use an active baseline that is well matched to the target task in the scanner with respect to low-level visual processing of the stimuli, comparable cognitive processing, and a similar motor response. As a result, an activation map for the target task only contains voxels that become significantly more active than in the closely matched active baseline task. The results are much better controlled and more interpretable than if we had used a resting-state baseline, and we get a broad, relatively complete sense of all the brain areas important for the task.

In the experiment that I’m asking about now, the critical task was a food choice task: participants saw a food image, decided whether they would want to eat it now, and then responded yes or no. We know a lot about the brain areas that this kind of task activates. The active baseline involved viewing scrambled object images, detecting whether a target circle fell on the left or right side of them, and then making a binary response to indicate which side. Thus, significant clusters for the food choice task, relative to the active baseline task, indicated areas that became more active for processing food choice than for judging which side a circle fell on in a scrambled image. We found significant activations above the active baseline all over the brain, in areas typically associated with processing food cues. Indeed, we found areas that other researchers typically haven’t (using signal intensity analyses), along with much larger areas of activation.

In the current experiment, we further included several manipulations of interest, such as whether the pictured foods were tasty or healthy, and whether participants had been asked to adopt a normal viewing strategy (the control condition) or an “observe” strategy (i.e., a simple mindfulness strategy). Of interest was whether these various manipulations affected activations in ROIs associated with eating (e.g., the insula for taste, OFC for predicted reward, etc.).

Take the insula, for example, which is the primary taste area. Often, relevant areas in the insula become more active for tasty foods than for healthy foods. This is now a widely obtained result (that Kyle Simmons, Alex Martin, and I initially reported in 2005). In our recent experiment, though, with 20 participants, linear contrasts found no difference for tasty vs. healthy foods. When, however, we compared the number of voxels significantly active above baseline, there were large differences in the predicted direction, with tasty foods activating the insula more broadly than healthy foods.

In our three most recent experiments, we have repeatedly found what I just described: we often fail to observe differences in overall signal intensity in an ROI, yet observe large differences in the breadth of activation above a well-matched active baseline. The first attached slide illustrates this for two conjunction analyses, each contrasting activations in food-related ROIs for tasty vs. healthy foods. The left two columns show the results for a conjunction analysis in the normal viewing (control) group; the right two columns show the results for a conjunction analysis in the observe (mindfulness) group. As you can see, there are substantial differences in breadth of activation for tasty vs. healthy foods, with tasty foods activating the ROIs much more broadly than the healthy foods. These differences were also much larger for the observe condition than for the normal viewing condition, and the observe condition activated the ROIs much more than the normal viewing condition overall, across both food types. All of these are predicted effects: when we measure breadth of activation, the two manipulations show substantial, predicted differences in how much they activate the ROIs.

Notably, many of these differences in breadth of activation for the ROIs don’t produce significant effects for linear contrasts of signal intensity. Often we fail to find clusters that differ in signal intensity, even when observing large differences in breadth of activation. The second attached slide illustrates what we have found from examining activations in these ROIs. Panel A at the top of this figure illustrates the most important case. The two conditions both activate an area significantly above baseline, each with considerable breadth of activation, but there are no differences in signal intensity between them, perhaps because of how the BOLD response gets squashed as it reaches asymptotic levels. Thus, no significant clusters emerge. What gets lost, however, is that both conditions have activated the ROI above baseline, with one condition activating it much more.

So, again, going back to my earlier message, what we’re looking for is a way to establish the probability of obtaining the total number of voxels that each of two conditions has activated above the active baseline in an ROI (or the number of unique voxels that the two conditions activated). If one were to assume that the two conditions didn’t actually differ, what would be the probability of seeing a given pair of voxel counts for the two conditions in the ROI?

Thanks again for your time, expertise, and patience :) and please let me know if you’d like any further information.

Larry

Larry,

I may be oversimplifying your situation, but let me give it a shot. There are two effects of interest, A and B, plus a baseline effect C. You have strong statistical evidence for both A > C and B > C, but the statistical evidence for A > B is pretty weak, or at least not strong enough to be convincing by the commonly adopted criterion. For example, suppose that the effects for A, B and C are 0.9%, 0.8%, and 0.2% signal change. You managed to gather statistical evidence for both A > C and B > C; furthermore, you do see a bigger cluster for effect A than B, relative to C, when artificially dichotomizing the evidence with a preset threshold. However, it is no surprise that you have difficulty showing A > B, because the difference between A and B is relatively small compared to the differences between A and C and between B and C. Is this a more or less accurate description of the situation?
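To make that concrete, here is a rough back-of-the-envelope power calculation in R with those made-up effect sizes and an assumed SD of 1.0% signal change for the within-subject differences (the SD is invented purely for illustration):

```r
# Illustrative only: effects of 0.9%, 0.8%, and 0.2% signal change for A, B, C,
# 20 subjects, and an assumed SD of 1.0% for the paired differences.
sd_diff <- 1.0
print(power.t.test(n = 20, delta = 0.9 - 0.2, sd = sd_diff, type = "paired"))  # A vs C
print(power.t.test(n = 20, delta = 0.8 - 0.2, sd = sd_diff, type = "paired"))  # B vs C
print(power.t.test(n = 20, delta = 0.9 - 0.8, sd = sd_diff, type = "paired"))  # A vs B
```

With these invented numbers, the power for the A vs. C and B vs. C comparisons comes out far higher than for A vs. B, which is the pattern described above.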

In our recent experiment, though, with 20 participants, linear contrasts found no difference for
tasty vs. healthy foods.

The two conditions both activate an area significantly above baseline, each with considerable breadth of activation,
but there are no differences in signal intensity between them, perhaps because of how the
BOLD response gets squashed as it reaches asymptotic levels. Thus, no significant clusters emerge.

You probably do see some differences, but the crucial issue here is that the statistical evidence for those differences is not strong enough to reach the commonly accepted comfort zone. In conventional statistical terminology, the “statistical power” is relatively low. To put it differently: if you set a voxel-wise two-sided p-threshold of 0.1 (and forget about FWE correction), do you see anything for those differences?

Much thanks for your time and expertise, Gang.

Yes, I agree with your assessment of the situation, in terms of A-C and B-C being larger than A-B. I question, though, whether A-C and B-C differ as little as the .9% vs. .8% that you suggest. Given the substantial differences between A and B in the number of voxels that they activate above the active baseline (C), it might well be the case that the difference between these two effects is quite a bit larger.

Another thing that I might disagree with is your phrase “artificially dichotomizing the evidence” in the sentence, “furthermore, you do see a bigger cluster for effect A than B, relative to C, when artificially dichotomizing the evidence with a preset threshold.” I’m not sure what you mean by this. When we assess A and B relative to C, we create condition maps for A and B within 3dLME in exactly the same way as we would when computing A-B contrasts. We’re not doing anything different at all up to this point; the process is exactly the same for both. The only difference is that we take the maps that 3dLME produces for A-C and B-C and export them to a conjunction analysis in 3dcalc instead of contrasting the A-C and B-C maps with a GLT in 3dLME.

I tried what you suggested for the A-B contrast, lowering the p threshold to .1 and dropping FWE correction. Bits and pieces of various effects emerge in the ROIs, but things are still relatively weak and scattered, especially in key areas such as the insula, OFC, and amygdala. I do agree with your general point, though, that signal change for A-B is much less than for A-C and for B-C.

My main concern is that assessing signal change for A-B may be asking the wrong question. I totally get your point about the conventional comfort zone, but I increasingly wonder whether staying in this comfort zone is causing us to miss all sorts of important information about what’s happening in our experiments. I hasten to add that I’m not trying to defend weak effects. I’m totally on board with powering experiments appropriately, replicating effects, and ensuring that they’re real. I don’t want to be part of the problem of producing unreplicable effects. At the same time, I want to measure things accurately, and I increasingly believe that linear contrasts like A-B here may not be doing so.

One problem that I have with A-B intensity contrasts is that they don’t tell us what becomes active for a task above baseline; they just tell us how two conditions of the same task differ from each other. When one uses a well-matched active baseline, the A-C and B-C contrasts become increasingly informative: relative to a reference set of processes, we establish the brain areas that a task engages throughout the brain. When we look at the areas that emerged from our A-C and B-C contrasts, we see all the areas that the task engaged above a well-designed reference task. I’ve attached two figures here showing the kinds of results we see. There are two more figures that I can’t attach (because only two are allowed) that show still other interesting areas that our task engages. If we looked at A-B signal change contrasts in the same areas, we’d see very little in the way of activations, and we’d have little sense of the brain areas that the task engages. So this is one reason why I have a problem with the conventional comfort zone.

The other is that when conventional methods assess signal strength for A-B, they may be missing important differences that aren’t false positives but that are real effects. I increasingly believe that breadth of activation may be a more sensitive measure of differences between A and B than differences in activation intensity. I increasingly wonder why we believe that intensity of activation is a more informative and more accurate measure of processing than breadth of activation. Perhaps differences in breadth reflect greater differences in what’s being computed cognitively than differences in intensity.

One piece of evidence for this is that when we compute the voxel counts for individual participants (measuring activation breadth) and submit them to mixed-effects modeling in R (using lmer), we find large significant effects across participants (as random effects). So far, we have only done this at the whole brain level, not at the ROI level, but at the whole brain level the effects across participants are large and robust. For example, tasty foods activate more voxels than healthy foods, robustly across the brain. Even though we don’t see evidence for this in A-B intensity contrasts in the group activation map, we see large effects from breadth analyses of the A-C and B-C individual maps. You can see this overall effect in the voxel count table in the first image of my previous post, and also in the two images attached to this post. The effects are even larger when we contrast individual voxel counts for the normal viewing versus observe manipulation, which again doesn’t show up much in the A-B contrast intensity maps. Again, my point is that assessing activation breadth using voxel counts may be a more sensitive measure of how two conditions differ than contrasting their differences in intensity.
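For reference, here is a minimal sketch of that kind of lmer model on simulated per-participant voxel counts; it is not our actual script, and all column names and numbers below are invented:

```r
library(lme4)

# Invented per-participant counts of voxels active above the active baseline,
# one count per participant per food type.
set.seed(1)
n_subj <- 20
counts <- data.frame(
  subject = factor(rep(seq_len(n_subj), times = 2)),
  food    = factor(rep(c("tasty", "healthy"), each = n_subj)),
  n_vox   = c(pmax(0, round(rnorm(n_subj, mean = 70, sd = 20))),   # tasty
              pmax(0, round(rnorm(n_subj, mean = 20, sd = 10))))   # healthy
)

# Breadth (voxel count) as the outcome, with participants as random effects.
fit <- lmer(n_vox ~ food + (1 | subject), data = counts)
summary(fit)

# Counts are bounded at zero and often skewed, so a Poisson or negative-binomial
# GLMM (glmer, glmer.nb) could be compared as a robustness check.
```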

I realize that there may be no good ways to assess the statistical significance of breadth differences in the group level maps (as I asked in my original post). Perhaps the best approach is to pull out voxel counts from individual participants and then test them externally in lmer analyses, as just described above. We could easily do this at the ROI level, as well as for the whole brain.

One other thought is that it might be informative to create simulated data sets that vary systematically in signal strength and activation breadth, and then look at the implications for various kinds of tests, including A-B intensity contrasts and contrasts between A-C and B-C breadth. It would be interesting to see whether breadth is indeed a more sensitive measure for detecting differences between A and B when false discovery rates are well controlled. If so, this might suggest considering an additional way to define our comfort zone.
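Here is a rough scaffold of the kind of simulation I have in mind (every parameter below is invented); it compares a voxel-wise A-B intensity contrast with a per-participant breadth comparison in a toy "ROI":

```r
# Toy simulation scaffold; all values are invented for illustration.
# Each subject has voxel-wise responses for A and B relative to baseline C in a
# 500-voxel "ROI"; A is truly active in more voxels, but with the same amplitude.
set.seed(1)
n_subj <- 20; n_vox <- 500
active_A <- 1:350               # hypothetical: A active in 70% of voxels
active_B <- 1:100               # hypothetical: B active in 20% of voxels
amp <- 0.9; sd_noise <- 0.8

one_subject <- function() {
  a <- rnorm(n_vox, 0, sd_noise); a[active_A] <- a[active_A] + amp
  b <- rnorm(n_vox, 0, sd_noise); b[active_B] <- b[active_B] + amp
  list(a = a, b = b)
}
subs <- replicate(n_subj, one_subject(), simplify = FALSE)
A <- sapply(subs, `[[`, "a")    # n_vox x n_subj matrices
B <- sapply(subs, `[[`, "b")

# (1) Intensity: voxel-wise paired (one-sample) t-tests on A - B, uncorrected.
p_AB <- apply(A - B, 1, function(d) t.test(d)$p.value)
cat("voxels with A-B p < .01:", sum(p_AB < 0.01), "\n")

# (2) Breadth: per-subject counts of voxels above an arbitrary threshold.
thr <- 0.5
breadth_A <- colSums(A > thr)
breadth_B <- colSums(B > thr)
print(t.test(breadth_A, breadth_B, paired = TRUE))
```

Wrapping this in a loop over simulated data sets, amplitudes, and breadths would give false positive and detection rates for each measure.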

Again, thank you so much for your help and expertise. We’re most grateful. We look forward to hearing any further thoughts and suggestions that you have.

Warm regards, Larry

Larry,

I question, though, whether A-C and B-C differ as little as the .9% vs. .8% that you suggest. Given the substantial
differences between A and B in the number of voxels that they activate above the active baseline (C), it might well
be the case that the difference between these two effects is quite a bit larger.

Those numbers were simply made up for the convenience of discussion.

I might disagree with is your phrase “artificially dichotomizing the evidence” in the sentence, “furthermore, you
do see a bigger cluster for effect A than B, relative to C, when artificially dichotomizing the evidence with a
preset threshold.”

The popularly adopted approach of identifying a spatial cluster based on an overall FWE threshold of 0.05 is arbitrary in the following sense: 1) why is 0.05 special, but not 0.04 or 0.06? 2) the current correction methods may be rigorous under one particular framework, but their efficiency is debatable if we take a different perspective. What I’m trying to say is this: you may feel comfortable enough about the strength of statistical evidence for each surviving cluster; however, I would not take seriously the boundary, extent (or breadth), or number of voxels of each cluster, because of the arbitrariness involved in the whole process. In other words, it might well be the case that most or even all of the involved regions in your experiment are pretty much activated under both conditions A and B relative to C; you showed only those colored clusters in the attachment because you have to present the results within that “comfort zone” per the current publication filtering system. Artificially dichotomizing the evidence may be convenient for reporting results, but we should not forget this fact: statistical evidence (e.g., a t-statistic) is in essence a continuum.

I increasingly wonder whether staying in this comfort zone is causing us to miss all sorts of important
information about what’s happening in our experiments.

I do agree with this assessment of yours, because the current correction methods tend to be overly penalizing, in my view.

breadth of activation may be a more sensitive measure of differences between A and B than differences
in activation intensity.

Again, the breadth of those clusters should not be taken seriously, because of 1) the artificial dichotomization and 2) the arbitrariness involved.

when we compute the voxel counts for individual participants (measuring activation breadth) and submit
them to mixed-effects modeling in R (using lmer), we find large significant effects across participants (as
random effects) …
Perhaps the best approach is to pull out voxel counts from individual participants and then test them
externally in lmer analyses, as just described above

The number of voxels within each cluster has the same problem as the breadth of the cluster.

As far as I can tell, A-B is still the robust way to go. Increasing the sample size (number of subjects) is probably too costly for you at the moment. When you set a voxel-wise two-sided p-threshold of 0.1 (and forget about FWE correction) for A-B, do you see at least half of the voxels surviving the thresholding within the anatomical regions (not statistically defined clusters) that you’re interested in?
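(A tiny sketch of that check in R, assuming the voxel-wise two-sided A-B p-values within an anatomically defined region have already been extracted into a vector; the values here are simulated placeholders.)

```r
# Placeholder p-values, one per voxel of the anatomical region; in practice these
# would be extracted from the A-B statistical map using an anatomical mask.
set.seed(1)
p_roi <- runif(300)

frac_surviving <- mean(p_roi < 0.1)   # proportion of region voxels with p < .1
frac_surviving >= 0.5                 # the "at least half" criterion above
```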

Hi Gang,

I agree that the breadth of activation we observe for a specific cluster reflects the statistical choices that we make for detecting clusters. The absolute breadth of a cluster, however, is not the issue of interest. The critical issue is the relative breadth of two clusters, which I assume is likely to remain roughly similar across varying statistical thresholds. I assume that the same issue applies to obtaining clusters via differences in contrast intensity between two conditions, with thresholding choices determining the size and average intensity of the clusters obtained.

Perhaps one direction we’ll explore next is to establish confidence intervals for a cluster’s breadth using bootstrapping methods. If we did this for the control condition, we could then ask how probable the breadth observed for the experimental condition would be under that distribution. A similar approach would be to apply permutation methods to the control and experimental conditions, thereby establishing the probability of the difference in cluster breadth observed between them. If we explore these approaches, we’ll vary the statistical thresholds to see how robust the findings are across them.
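As one concrete, per-participant version of those ideas, here is a sketch that assumes the suprathreshold voxel counts within the ROI have already been extracted for each participant and condition; the counts below are simulated stand-ins, not real data:

```r
# Simulated stand-ins for per-participant breadth (voxel counts within the ROI).
set.seed(1)
n_subj    <- 20
breadth_A <- rpois(n_subj, 70)   # condition A (e.g., experimental)
breadth_B <- rpois(n_subj, 20)   # condition B (e.g., control)
d_obs     <- mean(breadth_A - breadth_B)

# Permutation (sign-flip) test on the paired breadth differences.
n_perm <- 10000
d_null <- replicate(n_perm,
                    mean((breadth_A - breadth_B) * sample(c(-1, 1), n_subj, replace = TRUE)))
p_perm <- mean(abs(d_null) >= abs(d_obs))

# Percentile bootstrap CI for the mean breadth difference.
n_boot <- 10000
d_boot <- replicate(n_boot, mean(sample(breadth_A - breadth_B, n_subj, replace = TRUE)))
ci <- quantile(d_boot, c(0.025, 0.975))

c(observed = d_obs, p_perm = p_perm, ci)
```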

Thanks again for all your helpful comments and suggestions. If we have any further questions, we’ll let you know.

Best regards, Larry