Parallel Jobs on HPC
Life is short but jobs are infinite.
Tired of staring at a blank screen, waiting for your simulation results to materialize? Join the club! Learning parallel computing on an HPC is like discovering a secret shortcut to faster, more efficient research. It’s basically like having a personal time machine for your data.
In the following example, I need to conduct 300k simulations in R. However, our HPC allows at most 1000 parallel tasks per job ID and 20 jobs at a time. The general idea is therefore to write Bash scripts that submit jobs for us automatically, so that each job uses all of its available resources to run simulations simultaneously. To achieve this, I will give a template for running such large-scale simulations.
For the .R file
I usually pass two types of arguments into the R code:
- read/write path
- simulation-related hyper-parameters
So I need to parse both numeric and path-like variables. You can safely copy and paste the following code at the beginning of your R file:
new_sim.R
options(echo=FALSE) # set to TRUE if you want to see commands in the .Rout file
args <- commandArgs(TRUE)
print(args)
arguments <- matrix(unlist(strsplit(args, "=")), ncol = 2, byrow = TRUE)
for (args_i in 1:length(args)) {
  if (args_i %in% grep("/", args, value = FALSE, fixed = TRUE)) {
    # path-like arguments (containing "/") are assigned as character strings
    assign(arguments[args_i, 1], arguments[args_i, 2])
  } else {
    # everything else (e.g. numeric hyper-parameters) is evaluated as R code
    eval(parse(text = args[args_i]))
  }
}
Caveat:
If you want to debug your code for a certain simulation, you can set options(echo=TRUE) temporarily. Please remember to change it back to FALSE when you go back to large-scale simulation.
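To see what this parsing does, here is a small standalone illustration (the command line and its argument values are hypothetical):

# hypothetical invocation: Rscript new_sim.R path_read=/home/user/sim/ sim=42
args <- c("path_read=/home/user/sim/", "sim=42")
arguments <- matrix(unlist(strsplit(args, "=")), ncol = 2, byrow = TRUE)
for (args_i in 1:length(args)) {
  if (args_i %in% grep("/", args, value = FALSE, fixed = TRUE)) {
    assign(arguments[args_i, 1], arguments[args_i, 2])
  } else {
    eval(parse(text = args[args_i]))
  }
}
print(path_read) # "/home/user/sim/" -- kept as a character string
print(sim)       # 42 -- evaluated, so it is numeric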
Next, we can specify all scenarios using expand.grid() in R, which enumerates all possible combinations of the parameters. Here is an example:
# parameters of all simulations
params <- expand.grid(
  seed = 1:1000,
  n    = c(25, 50, 100, 200, 400, 800),
  arg1 = c(...),
  arg2 = c(...)
)
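Each task then uses its sim index (computed in the .bash files below) to pick one row of this grid. A minimal sketch, assuming params has been fully specified and that sim and path_save were parsed at the top of the file; run_simulation is a hypothetical stand-in for your own code:

# pick this task's parameter combination by row number
current <- params[sim, ]
set.seed(current$seed)
# result <- run_simulation(n = current$n, arg1 = current$arg1, arg2 = current$arg2)
# saveRDS(result, file = paste0(path_save, "result_", sim, ".rds"))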
For the .bash files
In our case, we have 300k jobs to submit, while the HPC allows at most 1000 parallel computations per "array". We thus need two .bash files:
- new_submit.slurm: performs 1000 parallel simulations using an "array".
- new_sbatch.slurm: successively submits new_submit.slurm with different array base numbers. (I will explain what this means later.)
Here is the example code for both .bash files:
new_sbatch.slurm
#!/bin/bash
# in terminal, submit the following code
# sbatch new_sbatch.slurm
path_read="/..../" # read directory
path_save="/..../" # write directory
# make sure the log directories exist before the first submission,
# since Slurm will not create them for the -o/-e output files
mkdir -p ${path_read}out ${path_read}err
# loop over array base numbers, one submission per base
# (%A = master job ID, %a = array task ID in the log file names)
for iter in {0..299}
do
  sbatch -J sim -o ${path_read}out/o_%A_%a.out -e ${path_read}err/e_%A_%a.err new_submit.slurm ${iter} ${path_read} ${path_save}
done
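After submitting, you can check how many of these jobs the scheduler actually accepted (generic Slurm commands, not specific to this cluster):

squeue -u $USER            # list your queued and running jobs
squeue -u $USER -h | wc -l # count them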
new_submit.slurm
#!/bin/bash
#SBATCH --job-name=sim
#SBATCH --partition=preemptable
#SBATCH --array=1-1000
# in terminal, submit the following code (example for array base number 150):
# sbatch new_submit.slurm 150 <path_read> <path_save>
##################### Change these constants ##############################
path_read="$2" # read directory
path_save="$3" # write directory
# pass in the array base number
let my_array_base="$1"
# calculate the simulation index number: array_base * 1000 + array_task_id
let sim=${my_array_base}*1000+${SLURM_ARRAY_TASK_ID}
# if a directory doesn't exist, make it
[ ! -d ${path_save} ] && mkdir -p ${path_save}
[ ! -d ${path_read}out ] && mkdir -p ${path_read}out
[ ! -d ${path_read}err ] && mkdir -p ${path_read}err
module purge
module load R/4.0.3 # specify R version
chmod +x ${path_read}new_sim.R
srun Rscript ${path_read}new_sim.R path_read=${path_read} path_save=${path_save} sim=${sim}
exit 0
Here’s a breakdown of how the array index works for the simulation:
- Base Value: my_array_base is the starting point for each batch of simulations. In this example, it ranges from 0 to 299.
- Task ID: SLURM_ARRAY_TASK_ID is a unique identifier assigned to each individual simulation job within an array. In this case, it starts at 1 and goes up to 1000.
- Combined Index: To create a unique identifier for each simulation, we combine the base value and task ID using the formula sim = my_array_base * 1000 + SLURM_ARRAY_TASK_ID.
- Example: If my_array_base is 5 and SLURM_ARRAY_TASK_ID is 999, the combined index (simulation number) would be 5 * 1000 + 999 = 5999. This value is then passed to the .R file for further processing.
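If you want to convince yourself that this scheme covers all 300k simulations exactly once, here is a quick sanity check in R (not part of the pipeline itself):

# every combination of base (0-299) and task ID (1-1000) gives a unique index
sim_all <- as.vector(outer(0:299 * 1000, 1:1000, "+"))
all(sort(sim_all) == 1:300000) # TRUE: indices 1..300000, each exactly once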
Caveat:
- Turn off the echo option in the .R file: options(echo=FALSE).
- Change the array base range iter in the new_sbatch.slurm file manually. Due to cluster capacity, usually only about 20 jobs can be submitted successfully at a time, but we can adjust the range to a much larger one (say {160..250}) by submitting this file to a day-long CPU partition so it is not killed.
One piece of good news: if we only need to perform one particular simulation 1000 times for a certain scenario, we can simply revise new_submit.slurm:
- specify the read/write paths directly instead of using the arguments passed from new_sbatch.slurm
- submit it using the following command:
sbatch new_submit.slurm 0
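For concreteness, the revised top of new_submit.slurm might look like this (the "/..../" paths are placeholders for your own directories):

#!/bin/bash
#SBATCH --job-name=sim
#SBATCH --partition=preemptable
#SBATCH --array=1-1000
# hard-code the directories instead of reading them from $2 and $3
path_read="/..../" # read directory
path_save="/..../" # write directory
# the array base number is still passed as $1 (0 here, so sim runs 1..1000)
let my_array_base="$1"
let sim=${my_array_base}*1000+${SLURM_ARRAY_TASK_ID}
# ... the rest of the script is unchanged ...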