3dDeconvolve fails when reading in a large amount of data

dillonplunkett · October 17, 2021, 9:11pm

Hi all,

I’m trying to run 3dDeconvolve on a large amount of data obtained from a single participant over many sessions (~250 runs of about 500MB each). When I do, 3dDeconvolve crashes with the following error (followed by a long stack trace and memory map):


*** Error in `3dDeconvolve': free(): invalid next size (normal): 0x0000000002bce690 ***

I can run the same command on the first (or second) 50% of the runs without issue (and the full version has considerably more than twice as much memory available to it when it fails). Based on this thread, my understanding is that there shouldn’t be a limit on the amount of data that AFNI can read in, so long as each volume is a reasonable size and there is adequate RAM available.

Anyone know what might be happening here or how I might work around it?

Thank you!

rickr · October 18, 2021, 2:39pm

Hello,

While 3dDeconvolve may use memory-mapping to read the input data, datasets that it computes (errts, fitts, stats) are allocated in full, and the first two are each the same size as the input.

Are you applying both -errts and -fitts? If so, try omitting the -fitts option and see if that allows for completion. That dataset can be computed after the fact.
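As a rough illustration of the point above, here is a minimal Python sketch of the extra memory implied by requesting errts and/or fitts. It assumes, per the reply, that each of those outputs is the same size as the input. The function name is hypothetical, and the stats bucket and the input itself are not counted.

```python
def estimate_full_allocation_bytes(input_bytes, errts=False, fitts=False):
    """Rough estimate of memory for the computed time-series outputs.

    Assumption (from the reply above): errts and fitts are each the same
    size as the input. The stats bucket and the input are not counted.
    """
    n_series_outputs = int(errts) + int(fitts)
    return n_series_outputs * input_bytes


GB = 10**9
input_bytes = 250 * 500 * 10**6  # ~250 runs of about 500 MB each

for errts, fitts in [(False, False), (True, False), (True, True)]:
    extra = estimate_full_allocation_bytes(input_bytes, errts, fitts)
    print(f"errts={errts} fitts={fitts}: about {extra / GB:.0f} GB extra")
```

With the sizes quoted in the original post, requesting both outputs would add roughly twice the input size on top of whatever is needed to hold the input.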

rick

dillonplunkett · October 18, 2021, 4:05pm

Hi Rick,

Thanks for the quick reply! I’m not applying either -errts or -fitts (at least not intentionally). The only output I’m creating is a -bucket with an fstat, 7 coefficients, and 2 GLTs (plus a 1D file and a jpg with -x1D and -xjpeg).

Additionally (although I may be misunderstanding what you’re saying and this might not be relevant), I’m running the command in our HPC cluster and I request RAM > 3x the size of the input dataset (and do have access to it, as best I can tell).

Dillon

rickr · October 19, 2021, 2:31pm

Hi Dillon,

You might be able to see whether that process ran out of RAM. In any case, it seems like you should ask for more and see whether that works. I would expect 3x to be borderline.

rick

dillonplunkett · October 24, 2021, 7:43pm

Hi Rick,

Thanks again! I gave this a try with RAM of 20x the size of the data set (2.6TB for about 130GB of data) and had the same issue about 20 seconds into the job. I can’t find any indications that the process is running out of RAM (e.g., no complaints from SLURM, the cluster’s scheduler, for exceeding the requested memory). I’m in touch with the HPC team, but I can’t see any indications that the problem is with requesting that much RAM on our cluster. I recently ran a job that used 960GB of RAM on the same node and there’s 3TB of RAM available on that node. Is it possible it could be an AFNI issue?
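One way to check how much memory a finished job actually used is SLURM's `sacct` accounting tool. The Python sketch below is only an illustration. It assumes job accounting is enabled on the cluster, that `MaxRSS` is reported with a K/M/G/T suffix, and that those suffixes are powers of 1024. The helper names and the job ID in the usage note are hypothetical.

```python
import subprocess

# Assumed suffix meanings for SLURM memory strings (powers of 1024).
_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_slurm_mem(text):
    """Convert a memory string such as '960G' or '123456K' to bytes."""
    text = text.strip()
    if not text:
        return 0  # some job steps report an empty MaxRSS
    suffix = text[-1].upper()
    if suffix in _UNITS:
        return int(float(text[:-1]) * _UNITS[suffix])
    return int(float(text))  # no suffix: treat the value as bytes


def peak_rss_bytes(job_id):
    """Return the largest MaxRSS that sacct reports across a job's steps."""
    out = subprocess.run(
        ["sacct", "-j", str(job_id), "--format=JobID,MaxRSS,State",
         "--parsable2", "--noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    peaks = []
    for line in out.splitlines():
        fields = line.split("|")
        if len(fields) >= 2:
            peaks.append(parse_slurm_mem(fields[1]))
    return max(peaks, default=0)
```

Calling `peak_rss_bytes(12345)` (with a real job ID in place of the placeholder) and comparing the result with the requested memory gives a quick indication of whether the job came anywhere near its limit before it crashed.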

Dillon

rickr · October 25, 2021, 12:14am

Hi Dillon,

So maybe it isn’t running out of RAM. Would you mind sending me a direct email with the stack trace, along with the output from “afni -ver”?

Thanks,

rick

TheBob · October 29, 2021, 4:51pm

I will have to try something, since it is weird that it works for “small” datasets and fails for “large” datasets.

Please give me the information about your datasets listed below:
- Dimensions (grid points in each direction, including time)
- Dataset “type” (floats, shorts, ???)
- Largest number of time points where 3dDeconvolve seems to work in your experience
- Number of regression columns used (should be in the 3dDeconvolve stderr output)
With this info, I can make up some fake data and try it. Since the datasets are so big, I’ll have to do this on the NIH’s cluster, using one of the “largemem” nodes. Once I can duplicate the failure, I can try to figure out what is causing it. Clearly it is some issue with memory allocation or misuse, but WHERE in the program it happens is kind of opaque – somewhere at the startup is all I can see now.
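The dimensions and datum type requested above determine the in-memory size of each run. A minimal Python sketch of that arithmetic follows. The grid in the example is hypothetical (it is not the poster's actual data), and the byte widths assume the usual 1-byte, 2-byte short, and 4-byte float storage.

```python
# Bytes per stored value for common datum types (assumed widths).
BYTES_PER_VALUE = {"byte": 1, "short": 2, "float": 4}


def dataset_bytes(nx, ny, nz, nt, datum="short"):
    """Uncompressed size of an nx x ny x nz grid with nt time points."""
    return nx * ny * nz * nt * BYTES_PER_VALUE[datum]


# Hypothetical grid chosen to land near the ~500 MB per run mentioned earlier.
size = dataset_bytes(96, 96, 60, 450, "short")
print(f"{size / 1e6:.0f} MB per run, {250 * size / 1e9:.0f} GB for 250 runs")
```

The same data stored as floats would be twice the size, which is one reason the datum type matters when trying to reproduce the failure.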

TheBob · October 30, 2021, 1:30pm

I made up some large data (120+ GB) in the form of 250 datasets of 500 MB each, and tried to run 3dDeconvolve on them auto-catenated on the command line. The program was able to read all the data in, set up the regression matrix, and start the work, but it was taking so long that I killed the job after a while before going to bed.

However, this convinces me that I need more information from you.
The setup of your datasets, as described in my last post.
But also your other inputs - the command line used.
Did you have censoring? Motion regressors? How many stimuli, and of what type?
All I know is that YOUR run failed in the read_input_data() function, which is where all of this stuff is input and organized, so it could be anything that is causing the problem. So I need to know more about all the inputs.

TheBob · November 12, 2021, 11:24am

I have been unable to reproduce your errors with synthetic datasets.
My only 2 suggestions are
1. Make sure you are using the latest version of AFNI binaries.
2. Use 3dREMLfit instead of 3dDeconvolve if possible.