Modeling the BOLD5000 dataset

I am exploring the BOLD5000 dataset (Chang et al., 2019; paper: https://www.nature.com/articles/s41597-019-0052-3). I am considering how to model it to get a beta map for each stimulus, which would be the input for further analysis.

The dataset has four participants, each scanned with a slow event-related design over ~16 sessions of ~10 runs each. Over this time, they viewed ~5000 distinct images. Concatenated, the full dataset has on the order of 30,000 time points. Even compressed and stored as 16-bit integers, the functional data for one subject is about 20 GB as nii.gz; concatenation uncompresses the data and converts it to 32-bit floats.

Of course, since it is a slow event-related design, I might just extract some volumes for analysis without modeling. But I am interested in how to approach the modeling problem at this scale. Since the images are all distinct, it seems that modeling all sessions/runs at once would be ideal.

If memory is a concern, I think it would be largely fine to analyze each session (or even each run) separately. Concatenation vs no concatenation: the difference is whether the residuals are pooled or separate, and the consequence is likely negligible.

I am considering how to model it to get a beta map for each stimulus, which would be the input for further analysis.

Just curious: what kind of further analysis?

Thanks for the perspective that separate residuals would probably be fine in this case. The plan is to do representational similarity analysis, following on from a behavioral study looking at an interesting subset of ~700 images from the full set. So deriving one “representation” per image is the goal.

I would have been comfortable modeling each session separately if each session contained the same conditions; that would feel like a good fit for a mixed-effects analysis. But since the conditions differ across all sessions and runs, I was reluctant to circumvent the memory limitations by splitting the data by session. Thanks again.

If you wish to analyze such a large dataset as a whole, there are some options in AFNI. Most of them will require running on a system with enough RAM to hold the entire dataset easily. However, if you are industrious enough, you can break the dataset into pieces to reduce the memory footprint required at any given point.

No matter what you do, you will have to manage the analysis yourself, as I doubt that you can get afni_proc.py to do the work for you – unless you have access to a computer system with a LOT of memory.

The NIH Biowulf cluster, for example, has several systems with over 1 TB of RAM. That would probably be more than you need, but a few hundred GB would be comforting to have. If you have access to a system with 200+ GB of RAM, that is what I would try first, using the “Big Dataset” recommendations below.

Using an SSD (solid state drive) to store the data and all intermediate files will help with processing, as access to such data is a lot faster than to data stored on rotating physical disks. On the NIH cluster, each node has an SSD that can be used for temporary storage while a job runs. In our usual processing script for running on the cluster, we copy the data over to the SSD at the start of a job, process it there, and then copy the results back to permanent storage at the end of the job. This procedure can speed things up significantly.
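As an illustration, a minimal job-script skeleton for that copy-process-copy-back pattern might look like the sketch below. The paths are placeholders; /lscratch/$SLURM_JOB_ID is the Biowulf convention for node-local SSD space, so adjust for your own system.

```
#!/bin/bash
# Hypothetical paths: adjust to your cluster's node-local SSD and project area.
scratch=/lscratch/$SLURM_JOB_ID        # node-local SSD scratch (Biowulf-style)
perm=/data/myproject/sub-CSI1          # permanent storage (placeholder)

mkdir -p $scratch/work
cp -r $perm/func $scratch/work/        # stage the inputs onto the fast SSD
cd $scratch/work

# ... run the regression / other processing here ...

cp -r results $perm/                   # copy results back to permanent storage
```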


Two pieces of advice for the case where you try to process this collection as one giant dataset:

One Big Dataset Recommendation 1: Create the dataset in AFNI .HEAD/.BRIK uncompressed format and process it while it is stored on an SSD. AFNI programs can load such datasets into memory using Unix memory-mapping (mmap) of the .BRIK file, which will reduce the RAM required. But you will still need a fair amount of memory.
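A minimal sketch of that conversion, assuming the runs have already been concatenated into a single NIfTI file (the file names and SSD path here are placeholders):

```
# Write uncompressed .HEAD/.BRIK output; a gzipped .BRIK cannot be mmap'ed.
unset AFNI_COMPRESSOR

# Convert the (placeholder) concatenated NIfTI to AFNI format on the SSD.
3dcopy all_runs.nii.gz /ssd/work/all_runs
```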

One Big Dataset Recommendation 2: For the time series regression, use 3dREMLfit with a mask to restrict processing to brain voxels, and also use the -usetemp option if the program complains about running out of memory, or if it dies with the cryptic message “Killed”. That message is not from the program but from the Unix system itself, which will kill a program that is causing trouble, and that trouble is almost always caused by demanding way too much memory.
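As a rough sketch, the call might look something like the following. The dataset, mask, and matrix names are placeholders; the design matrix would come from your own setup (e.g. 3dDeconvolve with -x1D_stop).

```
# Sketch only: restrict to brain voxels, spill temporary files to disk if
# needed, and write out per-stimulus betas.
3dREMLfit -input  all_runs+orig        \
          -matrix design.xmat.1D       \
          -mask   brain_mask+orig      \
          -usetemp                     \
          -Rbuck  stats_REML           \
          -Rbeta  betas_REML
```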


Two pieces of advice for how to deal with the collection in pieces. This will require a lot of management from you, as afni_proc.py will not manage the process outlined below. You’ll have to develop your own processing script (perhaps inspired by afni_proc.py’s script) to deal with the Recommendations below in the midst of the processing stream.

Smaller Datasets Recommendation 1: If you want to blur the dataset (spatially), this operation does not mix data along the time axis, so you can create pieces that are full in space (the entire volume) but small in time (segments). Blur these pieces separately in a script; any other spatial-only processing can be done on these pieces as well.
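For instance, one way to cut, blur, and re-join along the time axis is sketched below; the segment boundaries and the 4 mm FWHM are arbitrary example values.

```
# Cut the concatenated dataset into time segments (sub-brick selectors).
3dTcat -prefix seg00 all_runs+orig'[0..9999]'
3dTcat -prefix seg01 all_runs+orig'[10000..19999]'
3dTcat -prefix seg02 all_runs+orig'[20000..$]'

# Blur each segment; blurring is purely spatial, so segments are independent.
for seg in seg00 seg01 seg02 ; do
    3dmerge -1blur_fwhm 4.0 -doall -prefix ${seg}_blur ${seg}+orig
done

# Glue the blurred segments back together in time.
3dTcat -prefix all_runs_blur seg00_blur+orig seg01_blur+orig seg02_blur+orig
```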

Smaller Datasets Recommendation 2: The time series regression is carried out one voxel at a time, so you can break the dataset into individual slices and process the full ~30,000-point time series for each slice dataset separately. The programs 3dZcutup and 3dZcat can be used to manage the slice cutting-up and the re-gluing of the 3dREMLfit outputs.
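A hypothetical loop for that slice-at-a-time approach (slice count, file names, and output choices are placeholders; if you use a mask, it would need to be cut up the same way):

```
nz=70                                   # number of slices (example value)

for (( k=0; k<nz; k++ )); do
    kk=$(printf "%02d" $k)

    # Cut out slice k: full time series, one slice in space.
    3dZcutup -keep $k $k -prefix slice_$kk all_runs+orig

    # Regress that slice on its own.
    3dREMLfit -input slice_${kk}+orig -matrix design.xmat.1D \
              -Rbeta betas_slice_$kk
done

# Re-glue the per-slice outputs into a whole-brain beta dataset.
3dZcat -prefix betas_all betas_slice_??+orig.HEAD
```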


Best of luck. Unless you are adroit at scripting, the “Big Dataset” approach is going to be the way to go. All it requires is a big computer.

Bob and Gang had good replies. But to be sure, do you intend to get a single beta weight for each of the 700 stimulus images? And so each stimulus type would only have a single event across all runs, is that correct? I hope so, since the long subsequent thought process assumes it… :slight_smile:

If that is true, then the betas should be unaffected whether the regression is done per run, per session, or across all sessions at once. That is because there should be no regressor that spans (is non-zero for) more than a single run, and scaling is done per run (in afni_proc.py). The difference would be in the t-stats, as mentioned by others, based on whether noise is pooled.

Assuming you do not care about per-beta t-stats, there is no reason to require a concatenated regression.

The one aspect that would vary across these methods is registration, and that would indeed lead to different results. Choices:

  1. register each EPI volume to a single EPI base (bad here, would be used for concatenated runs)
  2. register each EPI volume to a per-run base, and then align bases across runs/sessions (see -volreg_post_vr_allin)
  3. register each EPI volume to a per-run base, and then either:
    3a. align that base to the anatomy
    or
    3b. align that base to a session base, and then to the anatomy

Choices 1 and 2 are where (if done by afni_proc.py) the regression would be modeled on the concatenated data, which for registration purposes does not seem like a good idea in the first place (since the runs are so long and there are so many of them).

Which is preferable between 3a and 3b is not clear. It depends on the quality of the EPI images, but either method seems reasonable.

But getting back to afni_proc.py and registration: method 3a could be done by running afni_proc.py once per session. Method 3b could be done by running afni_proc.py across all sessions at once, leading to a concatenated regression model and RAM issues.

That means choice 3a (running afni_proc.py once per session) might be both reasonable and feasible.
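If it helps, here is a hedged sketch of what a per-session afni_proc.py command for this route might look like. The subject/session labels, file names, blur size, and basis function are all placeholder choices, not recommendations, and whether to include -volreg_post_vr_allin (choice 2 above) depends on which registration scheme you settle on.

```
afni_proc.py                                                          \
    -subj_id              CSI1_ses01                                  \
    -copy_anat            anat_ses01+orig                             \
    -dsets                ses01_run??+orig.HEAD                       \
    -blocks               tshift align volreg blur mask scale regress \
    -volreg_align_to      MIN_OUTLIER                                 \
    -volreg_post_vr_allin yes                                         \
    -volreg_align_e2a                                                 \
    -blur_size            4.0                                         \
    -regress_stim_times   stim_times_ses01.*.1D                       \
    -regress_basis        'GAM'                                       \
    -regress_reml_exec
```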

Anyway, is this true, that you will only have one event per class?

  • rick

Ha! Thanks for the comic touch at the end, that got a real laugh.

This is very helpful, thank you. I am comfortable with those scripting options, and I may also have access to larger computers without needing to wait too long, so I think my next steps will be contingent on those logistics. It’s very helpful to know about the memory-mapping efficiency with uncompressed BRIK/HEAD.

Thank you again!

Hi Rick,

Yes, you are correct. There is just one event per class. Thank you for spelling this out for me; I slapped my head when reminded that the difference with respect to concatenation in this specific case would be in the t-values, not the betas. This really simplifies and clarifies things, I appreciate it.

And thank you for the run-down on the registration issues. The 3a approach is what I had in mind, just to avoid an additional transformation step.