Hello! I'm running 3dDeconvolve with the -stim_times_IM option on a naturalistic movie dataset, which means the output files are very large and the jobs take a while, so we submit them to an HPC cluster with OpenMP. My latest job produced two coefficient files (the second labeled _AA1) despite my asking for only one. The job took 27 hours, and the two output files were written 20 minutes apart at the end of it.
The weird part is that the coefficient values in the two output files are different, even though they came from the same script. For example, the first two lines of sub-brick info from 3dinfo on the two files are:
Output File 1
Number of values stored at each pixel = 8804
-- At sub-brick #0 'words#0' datum type is float: -1.67084e+09 to 1.37942e+09
Output File 2 (Labeled _AA1)
Number of values stored at each pixel = 8804
-- At sub-brick #0 'words#0' datum type is float: -5.68056e+08 to 6.1419e+08
The rest of the sub-bricks also have different values. Yet when you look at the history, the commands that led to these different values were exactly the same. Any thoughts on what might be causing this? Thank you!
Update: I re-ran the same script, and now I have two more files. None of the four (two from the last post, two new ones) have matching beta value ranges in 3dinfo.
I think others here will have more helpful comments, but two things that might help quickly:
Could you please copy+paste your actual 3dDeconvolve command here?
Those extreme values are quite large, and I would hazard a guess that they come from either unmasked voxels outside the brain or poor-fit locations within the brain (CSF, dropout, etc.?). I wonder if they might be affected by precision considerations. What is the expected range of meaningful data (within the brain) for those volumes, and do those values look similar or different?
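As a quick way to check that, 3dBrickStat can report the min/max within a mask only. A minimal sketch, where mask.nii.gz and stats.sub01+tlrc are hypothetical names standing in for your own mask and bucket datasets:

3dBrickStat -mask mask.nii.gz -min -max stats.sub01+tlrc'[0]'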
Files with those names can result from setting AFNI_DECONFLICT to YES, which allows AFNI programs to write a new dataset when one with the same file name already exists. By default, the variable is set to NO. You can check this in your ~/.afnirc file or in your script. This command shows the current setting:
@AfniEnv -get AFNI_DECONFLICT
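If it turns out to be set to YES and you want the safer default back, the same script can change it (this writes the variable into your ~/.afnirc):

@AfniEnv -set AFNI_DECONFLICT NO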
Depending on how you have set up your parallel computing, you may run into some odd problems. If you are using a framework that splits across subjects with Dask, for example, you have to be careful that each process works in its own directory. Multiple processes that change directories can become confused about which directory is current at a particular step; that will depend on the framework. You can avoid some of this confusion by specifying full input and output paths, as in -prefix /dsetpath/sub01/results/mydset.nii.gz, and as in the sketch below.
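For example, here is a minimal bash sketch of a per-subject loop in which every input and output is an absolute path, so no step depends on the current directory. The /data/myproject layout, file names, and model below are all hypothetical placeholders for your own:

top=/data/myproject                      # hypothetical project root
for subj in sub01 sub02 ; do
    3dDeconvolve                                                      \
        -input         ${top}/${subj}/epi.nii.gz                      \
        -polort        A                                              \
        -num_stimts    1                                              \
        -stim_times_IM 1 ${top}/${subj}/words_times.1D 'BLOCK(2,1)'   \
        -stim_label    1 words                                        \
        -bucket        ${top}/${subj}/results/stats.${subj}
done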
I don't think OpenMP is relevant here, but if you want to increase or reduce the number of threads the 3dDeconvolve process uses, you can set OMP_NUM_THREADS. The minimum value of 1 effectively removes OpenMP.
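For example, in your job script, before the 3dDeconvolve call:

export OMP_NUM_THREADS=1

or, if your cluster happens to use Slurm and you want to match the cores you requested (the variable name depends on your scheduler):

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK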
Just adding a quick question to Daniel's, and looping back to something I hadn't considered above: how did you run 3dDeconvolve here, and did you also use some other program/functionality (for parallelization or something) outside of AFNI?
The default is not to overwrite or deconflict, and I think that is also the easiest setting for debugging this problem. Remove all the output to start, and then see whether you get any doubled-up files showing that a program has run twice in the same directory.
If you do want to overwrite, most AFNI programs accept -overwrite for this. Some programs, like align_epi_anat.py and @animal_warper, allow overwriting or even an "ok_to_exist" mode, where existing output is checked before a command runs and the calling program continues to the next command. The simplest thing is to remove program output before starting, along the lines of the sketch below.
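As a hedged debugging sketch (the directory, output prefix stats., and script name here are all hypothetical; adjust to your own):

cd /dsetpath/sub01/results      # the results directory from the earlier example
rm -f stats.*                   # clear all previous output first
tcsh my_decon_script.tcsh       # re-run your script once
ls -l stats.*                   # one set of files means no doubling; an _AA1 copy means it ran twice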