Using AFNI on Biowulf

Dear Colleagues,

Hi. Let me ask a naive question about using AFNI on Biowulf. I know that things are much more complicated than the way I represent them, but let me oversimplify things so that I can get a rough understanding.

I ran a script that combines
/home/picchionid/afni/CD/AFNI_data6/FT_analysis/Qwarp/s00.warper
and
/home/picchionid/afni/CD/AFNI_data6/FT_analysis/Qwarp/s05.ap.Qwarp
with minimal changes on our raid, with

(base) picchionid@raid ~> afni_system_check.py -check_all | grep -i cpu
number of CPUs: 8
(base) picchionid@raid ~> 3dQwarp -help | grep -i dante
++ OpenMP thread count = 8
++ 3dQwarp: AFNI version=AFNI_21.0.17 (Mar 13 2021) [64-bit]
++ Authored by: Zhark the (Hermite) Cubically Warped

, and it took about 2 hr. Then I ran it on Biowulf with

number of CPUs: 72
++ OpenMP thread count = 72

, and the processing time only decreased by about 25%. So, is the following basically a correct understanding? On p. 33 of
/home/picchionid/afni/CD/afni_handouts/afni_proc.pdf
, the bullet that reads “Takes a long time, so the script should be submitted to a multi-node cluster” should be interpreted as “Takes a long time, so [when processing multiple subjects, each subject] should be submitted to a multi-node cluster [as different subjobs].” In other words, “typically, AFNI on Biowulf is used to simultaneously process a large number of independent datasets via the swarm utility” (https://hpc.nih.gov/apps/afni.html).

Sincerely,

Dante

Hi, Dante-

Well, having a cluster to run on is great, but there are some subtleties. (And I assume your reported times are for the job to start and finish, not including the wait time to get nodes on the system?) Actually, I’m curious how long the jobs themselves took to finish.

Firstly, parallelization on a cluster is different than parallelization on a single computer with multiple cores (--> ask The Bob for more details, surely!). You can get a sense of this from the list of “anecdotal” processing times recorded for FreeSurfer recon-all processing, which I ran with various degrees of parallelization on both Biowulf and on my own computer, reported here:
https://afni.nimh.nih.gov/pub/dist/doc/htmldoc/tutorials/fs/fs_fsprep.html#run-recon-all-faster-parallel
See how the times are pretty different, even though Biowulf is a very impressive computing structure.

Secondly, the read/write steps on the cluster can be pretty slow (say, compared to a desktop) in the main directories. For some programs (like the scripts made by afni_proc.py), we tend to write to a temporary scratch disk, do all the work there, and copy everything back to the main directories when done; this has dramatically improved processing times in the past. (NB: the Biowulf folks are always improving things, and several related operations have gotten faster over time; we still tend to use temporary scratch disks when possible.) Some notes on using the Biowulf scratch disk for a FreeSurfer recon-all run (with script!) are provided here:
https://afni.nimh.nih.gov/pub/dist/doc/htmldoc/tutorials/fs/fs_biowulf.html
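
A bare-bones sketch of that pattern inside a job script (the paths and script names here are placeholders, and the /lscratch/$SLURM_JOB_ID directory only exists if the job was submitted with a --gres=lscratch:... request):

# hypothetical snippet: do the work on the node's local scratch, then copy back
cd /lscratch/$SLURM_JOB_ID
cp -rp /data/$USER/FT_analysis .              # copy inputs to the fast local disk
tcsh -xef run_my_processing.tcsh              # placeholder for the actual processing
cp -rp FT_analysis /data/$USER/results/       # copy everything back when done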

Thirdly, there are different sets of CPUs on Biowulf (different partitions, such as “norm”, “quick”, etc.), and some might be faster or slower; even within a given partition, speed can vary. NB: “quick” here does not mean it processes faster, just that its jobs must be quick, < 4 hours. So, your timing tests might vary with this.
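
If you want to see what partitions exist and what their time limits are, standard slurm can tell you (Biowulf also has its own helper, freen, for seeing free nodes/CPUs per partition; check the hpc.nih.gov docs for its exact usage):

# list partitions with their time limits, availability, and node counts
sinfo -o "%P %l %a %D"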

Another consideration is that the message passing that allows for parallelization also takes work. Doubling the number of processors in a parallel job will not simply double the speed; perhaps it brings the runtime down to 60-70% of what it was before (and even that estimate is totally dependent on the kind of job being done, the method of parallelization, and the systems being used; I really shouldn’t put any numbers here, but just to give a sense…). Also, as you increase the number of processors, that coordination overhead becomes more expensive, so your marginal benefit will likely keep decreasing. At some point, you can actually have too many processors for a task, because so much effort is spent on dividing up and re-gathering the work. Finally, one can even get “process thrashing”, where the act of trying to parallelize causes the computer so much hassle that it slows the work down to a crawl (over-parallelizing becomes nearly paralyzing).
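
Just to put toy numbers on that diminishing-returns idea (this is Amdahl's law with a completely made-up parallel fraction, not a measurement of any AFNI program):

speedup(N) = 1 / ( (1 - p) + p/N ),  where p = fraction of the work that parallelizes
with p = 0.90:   N = 8 CPUs   ->  ~4.7x faster than 1 CPU
                 N = 72 CPUs  ->  ~8.9x faster than 1 CPU

So in that toy case, 9x the CPUs buys less than 2x the additional speedup, even before counting the extra coordination cost.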

I don’t know that 72 processors is causing thrashing, but it seems like overkill. Also, requesting that many processors could lead to a big delay in getting the resources from the slurm job manager, so you are effectively adding more wait time for yourself.
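
(If you are curious how long a pending job is expected to sit in the queue, slurm will give a rough, non-binding estimate:)

# show estimated start times for your pending jobs; the estimate changes as the queue changes
squeue -u $USER --start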

Finally, different programs benefit from different levels of parallelization. The scripts from afni_proc.py have several steps; an important fraction of those (the alignment steps and the cluster-simulation estimation ones) benefit from parallelization, but not all pieces do, so having more cores will only speed up some parts. For @SSwarper, @animal_warper, or 3dQwarp, parallelization will help a lot, because the bulk of each program’s steps benefits.
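
One practical note there: inside a job script, you can let the OpenMP-aware programs follow whatever CPU count slurm actually granted, instead of hard-coding a number (a sketch in bash syntax; the script name is a placeholder):

# SLURM_CPUS_PER_TASK is set by slurm when --cpus-per-task is requested
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
tcsh -xef run_my_processing.tcsh    # OpenMP programs called inside pick up the thread count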

OK, so, if it were me running on Biowulf, I would probably use 8-16 CPUs as a start. Also consider the case of batch processing a lot of subjects: if you use a lot of CPUs per subject, then slurm might make some subjects wait to start processing until others are finished, whereas if you used fewer, you might be able to start them all around the same time. Each subject might take a bit longer, but overall you benefit from having them all running in parallel, and the group finishes earlier.
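
For the many-subjects case, that is exactly what swarm is for: one command per line in a text file, with each line getting a modest CPU/memory allocation (the subject names and the -g/-t values below are purely illustrative; see the Biowulf swarm documentation for the full option list):

# cmd_ap.swarm: one subject's processing per line
tcsh -xef proc.sub-001 2>&1 | tee output.proc.sub-001
tcsh -xef proc.sub-002 2>&1 | tee output.proc.sub-002

# submit: 8 CPUs (-t) and 16 GB of RAM (-g) per line
swarm -f cmd_ap.swarm -t 8 -g 16 --partition=norm,quick --time=3:59:00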

If you want, you could also post the command you are using to submit the swarm/sbatch, and we could see about suggesting any improvements there, too.

-pt


Paul,

Thanks so much for this detailed answer! I appreciate it. Your bottom-line advice is the same as what Hendrik Mandelkow gave me, but I wanted to learn a little more from SSCC.

Yes, I accounted for queue time when comparing Jacco de Zwart’s raid to NIH’s Biowulf. Here are my notes.

RAID:
Sun Jan 3 08:20:12 EST 2021
Sun Jan 3 10:16:51 EST 2021
~2 hr

BIOWULF:
RUN TIME:
Tue Mar 30 06:44:29 EDT 2021
Tue Mar 30 08:19:04 EDT 2021
~1.5 hr (I do not understand why the difference is so small.)
TOTAL TIME:
Mon Mar 29 17:18:59 EDT 2021
Tue Mar 30 08:19:33 EDT 2021
Therefore, QUEUE TIME = ~14 hr. That is better than the last time, which was ~28 hr, so I suppose that is the range I can expect for queue times. Hm.

Sincerely,

Dante

Hi, Dante-

Glad Hendrik and I are seeing eye-to-eye!

The parallelization can work on individual machines (e.g., on my desktop I used different numbers of CPUs for testing FreeSurfer runtimes, setting the OMP_NUM_THREADS shell variable to specify the number). How many CPUs does Jacco’s raid have, or more importantly how many did you use? (That is, what is the output of “afni_check_omp” there?)
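
(For reference, that desktop test was basically just re-running the same job at a few different thread counts, something like this sketch in bash syntax, with a placeholder script name:)

# time the same processing at a few thread counts
for n in 1 2 4 8 ; do
    export OMP_NUM_THREADS=$n
    ( time tcsh -xef run_my_processing.tcsh ) 2> time_${n}cpu.txt
done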

For queue times, yes, that might not be so surprising with 72 CPUs requested, depending on the partition. Which Biowulf partition were you using, by the way? (That is, what is your sbatch or swarm command? I typically specify “--partition=norm,quick” for jobs I am reasonably certain will finish under 4 hours, which includes most cases of NL warping and afni_proc.py runs, unless you have bazillions of EPI time points like some people in your lab do… they might need more walltime.)
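
(If you no longer remember what a past job used, slurm keeps an accounting record; with your actual job number in place of the placeholder below:)

# show partition, CPU count, and elapsed time for a finished job
sacct -j 12345678 --format=JobID,Partition,AllocCPUS,Elapsed,State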

--pt


Paul,

Let’s see, for CPUs,

(base) picchionid@raid ~> afni_check_omp
8

, and for the partition, I must assume the default because I did not specify it in my sbatch command,

sbatch -Wv --cpus-per-task=72 slurm_afni.sh

, but I don’t know what the default is because it is not listed here:

[picchionid@biowulf ~]$ sbatch -help | grep -i -A 10 partition
-p, --partition=partition partition requested
--parsable outputs only the jobid and cluster name (if present),
separated by semicolon, only on successful submission.
--power=flags power management options
--priority=value set the priority of the job to value
--profile=value enable acct_gather_profile for detailed data
value is all or none or any combination of
energy, lustre, network or task
--propagate[=rlimits] propagate all [or specific list of] rlimits
-q, --qos=qos quality of service
-Q, --quiet quiet mode (suppress informational messages)

.

Sincerely,

Dante

Hi, Dante-

OK, both the raid and Biowulf are using multiple CPUs. My guess is that the “not so big difference” is due in part to:

  1. different architectures
  2. different I/O speeds (also related to architecture)
  3. the diminishing returns of more CPUs in parallelization after some point.

The raid probably has faster file I/O (even compared with using the temporary scratch disk on Biowulf, though that should make things more similar).

Note that when running with multiple CPUs on my desktop, I also ran faster than on Biowulf, so the raid’s architecture might be more similar to that.

If you wanted a detailed comparison, you could specify the partition on biowulf in your sbatch command, and run the same thing with 8 CPUs and, say, 24 CPUs (rather than 72, which might make you wait around a lot). I don’t think you can compare 8 on raid vs 72 on biowulf directly because of too many differences.
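
A controlled comparison could be as simple as submitting the same script twice with only the CPU request changed (the --mem and --time values below are just placeholders, and the script should read its thread count from $SLURM_CPUS_PER_TASK rather than hard-coding 72):

sbatch --partition=norm --cpus-per-task=8  --mem=8g --time=8:00:00 slurm_afni.sh
sbatch --partition=norm --cpus-per-task=24 --mem=8g --time=8:00:00 slurm_afni.sh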

Note that when I sbatch/swarm processes that should last <4 hours, I run with “--partition=norm,quick”, so that I get a larger number of nodes to choose from (likely less waiting time). Here is an example of submitting a job with more specifications (you might leave out the “--gres=…” part if you aren’t using a temporary scratch disk to write to, which is described in the running-FS-on-biowulf page, earlier in this thread):


sbatch                                                            \
      --partition=norm                                               \
      --cpus-per-task=4                                              \
      --mem=4g                                                       \
      --time=12:00:00                                                \
      --gres=lscratch:10                                             \
      do_*.tcsh

The real power of Biowulf comes from being able to start, say, 500 processes and just let the computers sort stuff out and run when resources are available, even if it means each individual run is a bit slower. As the Biowulf help notes, sbatch/swarm isn’t reeeally meant for tiny jobs, but for larger jobs and many of them; that is where its largest benefit comes in.

--pt
