LSF hookin' up with the CRIU
With the unpredictable spring weather here in Southern Ontario, weekend projects are the order of the day. Whether it’s fixing my bike for spring, repairing things around the home which I’ve neglected for far too long, or exploring IT topics which have been percolating in my head, I am a textbook busybody.
A few decades back, when I was a support engineer at Platform Computing, I had my first experience working with clients using both kernel-level and user-level checkpoint and restart through the HPC workload scheduler Platform LSF (now IBM Spectrum LSF). I distinctly recall that the user-level library was a bit tricky, as you had to link your homegrown code against it, and it had numerous limitations which I can’t recall off the top of my head. Then, as now, LSF provided a number of ways for administrators to extend its capabilities using plug-ins. Checkpoint and restart is one example where plug-ins can be used. More about this later.
I’ve been keeping an eye on the project known as CRIU for some time. CRIU, which stands for Checkpoint/Restore In Userspace, provides checkpoint and restart functionality on Linux. I thought it might be an interesting weekend project to integrate CRIU with LSF. As it turns out, I was not blazing any trails here, as I found that others are already using CRIU with LSF today. Nevertheless, I decided to give it a try.
My system of choice for this tinkering was a dual-socket POWER9-based system running CentOS Stream 8 and IBM Spectrum LSF Suite for HPC v10.2.0.12. The LSF online documentation contains the specifications of the LSF plugins for checkpoint and restart. The plugins are known as echkpnt and erestart, where the “e” denotes external.
Here is a quick rundown on the steps to integrate CRIU with LSF.
- It turns out that my system already had criu installed. It’s a dependency of runc, which was installed as part of podman. This step really depends on your distro. In my case, dnf provides criu was my friend.
# uname -a
Linux kilenc 4.18.0-373.el8.ppc64le #1 SMP Tue Mar 22 15:28:39 UTC 2022 ppc64le ppc64le ppc64le GNU/Linux
# criu
Usage:
criu dump|pre-dump -t PID [<options>]
criu restore [<options>]
criu check [--feature FEAT]
criu page-server
criu service [<options>]
criu dedup
criu lazy-pages -D DIR [<options>]
Commands:
dump checkpoint a process/tree identified by pid
pre-dump pre-dump task(s) minimizing their frozen time
restore restore a process/tree
check checks whether the kernel support is up-to-date
page-server launch page server
service launch service
dedup remove duplicates in memory dump
cpuinfo dump writes cpu information into image file
cpuinfo check validates cpu information read from image file
Try -h|--help for more info
- The criu command needs to be run as root to be able to checkpoint processes. As we are going to leverage criu directly in the LSF echkpnt and erestart scripts, I chose to enable sudo access for criu. To do this I simply added the following to /etc/sudoers.
gsamu ALL=NOPASSWD:/usr/sbin/criu
- Next, I tested that the basic criu functionality was working. I found this to be a useful blog on how to perform a simple test.
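For readers who want a starting point, here is a minimal sketch of such a smoke test. It is not taken from the blog mentioned above. The function names, the PID and the image directory are hypothetical, and it assumes criu can be run through sudo as configured in the previous step. By default it only prints the commands (dry run) so that nothing is checkpointed by accident.

```shell
#!/bin/sh
# Sketch of a manual CRIU smoke test (hypothetical helper names).
# Assumption: "sudo criu" works for the current user.
# With DRY_RUN=1 (the default here) commands are printed, not executed.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# criu_smoke_test PID IMAGE_DIR
# Dump the process tree rooted at PID into IMAGE_DIR, then restore it.
criu_smoke_test() {
    pid=$1
    imgdir=$2
    run mkdir -p "$imgdir"
    # --shell-job is needed for processes started from an interactive shell.
    run sudo criu dump -t "$pid" -D "$imgdir" --shell-job
    run sudo criu restore -D "$imgdir" --shell-job
}

# Example: 12345 and /tmp/criu_images are placeholders.
criu_smoke_test 12345 /tmp/criu_images
```

Set DRY_RUN=0 and pass the PID of a simple long-running test process to actually perform the dump and restore.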
- With criu installed and working (see the previous step), the next step was to create the echkpnt and erestart scripts which would ultimately call the appropriate criu dump and criu restore commands. These scripts are named echkpnt.criu and erestart.criu. The .criu extension denotes the checkpoint and restart method name in LSF. The checkpoint method is specified at the time of job submission in LSF.
The key for the echkpnt.criu script is to build out the list of PIDs for the job in question. For this I used an inelegant approach - simply scraping the output of the LSF bjobs -l command. This list of PIDs is then used as arguments to the criu dump command. The example echkpnt.criu script is included below.
Example echkpnt.criu script:
#!/bin/csh -f
# Example external checkpointing routine for CRIU (https://criu.org/Main_Page)
# echkpnt [-c] [-f] [-k|-s] -d chkpnt_dir process-group-id
# tasks:
#
# 1) Check parameters
# 2) Get job PIDS
# 3) Invoke appropriate criu command to checkpoint the job PIDS
setenv PATH /usr/bin:/bin:/usr/etc:$PATH
set usage="Usage: $0 [-k] -d chkpnt_dir process-group-id"
# 1) Check parameters
while (x$1 != x)
    switch ($1)
    case -k:
        set killflag=TRUE
        shift
        breaksw
    case -d:
        set chkpntdir=$2
        shift
        shift
        breaksw
    case -c:
        shift
        breaksw
    case -s:
        set killflag=TRUE
        shift
        breaksw
    case -f:
        shift
        breaksw
    case -*:
        echo "Illegal argument $1"
        echo "$usage"
        exit 1
        breaksw
    default:
        break
    endsw
end
if ($#argv != 1) then
    echo "$usage"
    exit 1
endif
set progrpid=$1
if ($?chkpntdir != 1) then
    echo "$usage"
    exit 1
endif
if (! -e $chkpntdir) then
    echo "The check point directory does not exist."
    exit 1
endif
if (! -d $chkpntdir) then
    echo "The check point directory is not a directory."
    exit 1
endif
#
# 2) Get job PIDS
# We scrape the output of bjobs to get the PGID, PIDS for the job.
# Right now this only considers a job with a single PGID.
#
set bjobs=`bjobs -l $LSB_JOBID |grep PGID`
set jobpids=`echo $bjobs | awk '{for(i=6;i<=NF;i++)printf "%s ",$i;printf "\n"}'`
set chkpnt=`echo $chkpntdir|awk '{split($1,dir,".");print dir[1]}'`
#
# 3) Invoke appropriate criu command to checkpoint the job PIDS
# For the case when echkpnt -k is called (to checkpoint and terminate the job).
# Otherwise, checkpoint the job and leave it running.
#
foreach pid ($jobpids)
    if ($?killflag == 1) then
        sudo criu dump -t $pid -j -D $chkpnt --shell-job --file-locks --ext-unix-sk --tcp-established
    else
        sudo criu dump -t $pid -j -D $chkpnt --leave-running --shell-job --file-locks --ext-unix-sk --tcp-established
    endif
end
exit 0
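The PID-scraping step can be tried on its own, away from LSF. The following is a minimal sketch in POSIX sh rather than csh, with a hypothetical function name. It assumes the PGID line has the shape shown in the bjobs -l output later in this post, and it prints every PID after the PIDs: label. The script above instead starts at awk field 6, so the two do not select the same set of PIDs.

```shell
#!/bin/sh
# Hypothetical helper (not part of LSF or CRIU): read `bjobs -l` output on
# stdin and print each PID listed after the "PIDs:" label, one per line.
extract_job_pids() {
    awk '
        /PGID:/ {
            seen = 0
            for (i = 1; i <= NF; i++) {
                if (seen && $i ~ /^[0-9]+$/) print $i
                if ($i == "PIDs:") seen = 1
            }
        }
    '
}

# Example using the PGID line from the sample job in this post.
printf '%s\n' ' PGID: 418130;  PIDs: 418130 418131 418133' | extract_job_pids
```

In a real echkpnt script the input would come from bjobs -l $LSB_JOBID instead of printf.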
I used a simple approach as well for erestart.criu. As per the specification for erestart, the key is to create a new LSF jobfile which contains the appropriate criu restore invocation, pointing to the checkpoint data. The example erestart.criu script is included below.
Example erestart.criu script:
#!/bin/sh
#
# Example external checkpoint restart routine for CRIU (https://criu.org/Main_Page).
# erestart [-c] [-f] chkpnt_dir
# tasks:
# 1) Check parameters
# 2) Check LSF env variables for checkpoint
# 3) Build a new job file that replaces the user command with "criu restore"
# 4) Put the new job file in .restart_cmd.
# 5) exit 0 to tell erestart that erestart.criu succeeded.
#
PATH=/usr/bin:/bin:/usr/etc:$PATH
export PATH
usage="Usage: $0 [-c] [-f] chkpnt_dir"
#
# 1) Check parameters
# "chkpnt_dir" is the new job_id
#
while [ "$1" != "" ]
do
    case $1 in
    -c)
        shift
        ;;
    -f)
        shift
        ;;
    *)
        break
        ;;
    esac
done
#
# Save the chkpnt_dir for future
#
new_jobid="$1"
#
# 2) Check LSF env variables for checkpoint
#
if [ -f "$LSB_CHKFILENAME" ]
then
    :
else
    echo "Can not find $LSB_CHKFILENAME" 1>&2
    exec 2<&-
    exit 1
fi
# if LSB_CHKPNT_DIR is not defined, set it up (for LSF 3.1)
if [ _$LSB_CHKPNT_DIR = '_' ]; then
    LSB_CHKPNT_DIR=`dirname "$LSB_CHKFILENAME"`
fi
if [ -d "$LSB_CHKPNT_DIR" ]
then
    :
else
    echo "Can not find $LSB_CHKPNT_DIR" 1>&2
    exec 2<&-
    exit 1
fi
#
# 3) Build a new job file: copy the original job file up to the
#    "# LSBATCH: User input" marker, then append the criu restore command.
#
new_jobfile=$LSB_CHKFILENAME.criu.restart
if [ -f "$new_jobfile" ]; then
    rm -f "$new_jobfile"
fi
while IFS= read -r line
do
    echo "$line" >> "$new_jobfile"
    if [ "$line" = "# LSBATCH: User input" ]; then
        break
    fi
done < "$LSB_CHKFILENAME"
echo "sudo criu restore -j -D $LSB_CHKPNT_DIR --shell-job" >> "$new_jobfile"
# $? is written literally so that it is evaluated when the job file runs,
# not while this script is generating it.
echo 'ExitStat=$?' >> "$new_jobfile"
echo "wait" >> "$new_jobfile"
echo "# LSBATCH: End user input" >> "$new_jobfile"
echo "true" >> "$new_jobfile"
echo exit \`expr \$\? \"\|\" \$ExitStat\` >> "$new_jobfile"
chmod 700 "$new_jobfile"
#
# 4) Put the new job file in .restart_cmd.
#
echo LSB_RESTART_CMD=$new_jobfile > "$LSB_CHKPNT_DIR/.restart_cmd"
echo LSB_USE_MY_JOBFILE=Y >> "$LSB_CHKPNT_DIR/.restart_cmd"
# 5) exit 0 to tell erestart that erestart.criu succeeded.
exit 0
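The core of erestart.criu is the job file rewrite: keep everything up to the "# LSBATCH: User input" marker and replace the user command with a criu restore call. The sketch below isolates that idea so it can be exercised without LSF. The function name and the sample job file contents are hypothetical, and it deliberately omits the exit-status handling of the full script.

```shell
#!/bin/sh
# Hypothetical helper: build_restart_jobfile SRC DST CHKPNT_DIR
# Copies SRC up to and including the "# LSBATCH: User input" marker into DST,
# then appends a criu restore command and the closing marker.
build_restart_jobfile() {
    src=$1
    dst=$2
    ckpt_dir=$3
    : > "$dst"
    while IFS= read -r line; do
        printf '%s\n' "$line" >> "$dst"
        if [ "$line" = "# LSBATCH: User input" ]; then
            break
        fi
    done < "$src"
    printf 'sudo criu restore -D %s --shell-job\n' "$ckpt_dir" >> "$dst"
    printf '%s\n' '# LSBATCH: End user input' >> "$dst"
}

# Demo with a made-up job file in a temporary directory.
demo_dir=$(mktemp -d)
printf '%s\n' '#!/bin/sh' '# LSBATCH: User input' './criu_test' \
    '# LSBATCH: End user input' > "$demo_dir/jobfile"
build_restart_jobfile "$demo_dir/jobfile" "$demo_dir/jobfile.restart" /path/to/chkpnt
cat "$demo_dir/jobfile.restart"
rm -rf "$demo_dir"
```

The original ./criu_test line does not appear in the generated file, because copying stops at the marker.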
- With the echkpnt.criu and erestart.criu scripts in the $LSF_SERVERDIR directory, the process to perform a checkpoint and restart of LSF jobs is straightforward using the bchkpnt and brestart commands respectively. Here is a simple example.
- Submit a job as checkpointable. The checkpoint method criu is specified, as well as the location where the checkpoint data will be written.
$ bsub -k "/home/gsamu/checkpoint_data method=criu" ./criu_test
Job <12995> is submitted to default queue <normal>.
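When scripting this workflow, the job ID has to be picked out of the bsub message so it can be passed to bchkpnt and brestart. A small sketch, assuming the message format shown above. The function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read bsub output on stdin and print the job ID
# from a line of the form "Job <ID> is submitted ...".
job_id_from_bsub() {
    sed -n 's/^Job <\([0-9][0-9]*\)> is submitted.*/\1/p'
}

# Example with the message from the submission above.
echo 'Job <12995> is submitted to default queue <normal>.' | job_id_from_bsub
```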
- The executable criu_test simply writes a message to standard out every 3 seconds.
$ bpeek 12995
<< output from stdout >>
0: Sleeping for three seconds ...
1: Sleeping for three seconds ...
2: Sleeping for three seconds ...
3: Sleeping for three seconds ...
4: Sleeping for three seconds ...
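The source of criu_test is not shown in this post. A hypothetical shell stand-in that produces the same kind of output could look like the sketch below. The function name and its count and delay arguments are assumptions added so that it can be run briefly for testing.

```shell
#!/bin/sh
# Hypothetical stand-in for criu_test: print a counter message, sleep, repeat.
# Usage: sleeper [count] [delay]
#   count: number of iterations, 0 means run forever (default 0)
#   delay: seconds to sleep between messages (default 3)
sleeper() {
    count=${1:-0}
    delay=${2:-3}
    i=0
    while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
        echo "$i: Sleeping for three seconds ..."
        sleep "$delay"
        i=$((i + 1))
    done
}

# Short demonstration: two iterations with no delay.
sleeper 2 0
```

Calling sleeper with no arguments loops indefinitely with a three second pause, which is the behaviour a checkpoint test needs.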
- Next, we see that LSF has detected the job PIDs. Now we’re ready to perform the checkpoint.
$ bjobs -l 12995

Job <12995>, User <gsamu>, Project <default>, Status <RUN>, Queue <normal>,
                     Command <./criu_test>, Share group charged </gsamu>
Tue Apr 12 08:48:28: Submitted from host <kilenc>, CWD <$HOME>,
                     Checkpoint directory </home/gsamu/checkpoint_data/12995>;
Tue Apr 12 08:48:29: Started 1 Task(s) on Host(s) <kilenc>,
                     Allocated 1 Slot(s) on Host(s) <kilenc>,
                     Execution Home </home/gsamu>, Execution CWD </home/gsamu>;
Tue Apr 12 08:48:38: Resource usage collected.
                     MEM: 12 Mbytes; SWAP: 0 Mbytes; NTHREAD: 4
                     PGID: 418130; PIDs: 418130 418131 418133

MEMORY USAGE:
MAX MEM: 12 Mbytes; AVG MEM: 6 Mbytes

SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut    pg    io   ls   it   tmp   swp   mem
loadSched   -      -    -     -     -     -    -    -    -     -     -
loadStop    -      -    -     -     -     -    -    -    -     -     -

RESOURCE REQUIREMENT DETAILS:
Combined: select[type == local] order[r15s:pg]
Effective: select[type == local] order[r15s:pg]
- Initiate the checkpoint using the LSF bchkpnt command. The -k option is specified, which results in the job being checkpointed and killed.
$ bchkpnt -k 12995
Job <12995> is being checkpointed
- We see in the history of the job using the bhist command that the checkpoint was initiated and succeeded. The job was subsequently killed (TERM_CHKPNT).
$ bhist -l 12995

Job <12995>, User <gsamu>, Project <default>, Command <./criu_test>
Tue Apr 12 08:48:28: Submitted from host <kilenc>, to Queue <normal>, CWD <$HOME>,
                     Checkpoint directory </home/gsamu/checkpoint_data/12995>;
Tue Apr 12 08:48:29: Dispatched 1 Task(s) on Host(s) <kilenc>,
                     Allocated 1 Slot(s) on Host(s) <kilenc>,
                     Effective RES_REQ <select[type == local] order[r15s:pg] >;
Tue Apr 12 08:48:31: Starting (Pid 418130);
Tue Apr 12 08:48:31: Running with execution home </home/gsamu>,
                     Execution CWD </home/gsamu>, Execution Pid <418130>;
Tue Apr 12 08:54:14: Checkpoint initiated (actpid 419029);
Tue Apr 12 08:54:15: Checkpoint succeeded (actpid 419029);
Tue Apr 12 08:54:15: Exited with exit code 137. The CPU time used is 2.1 seconds;
Tue Apr 12 08:54:15: Completed <exit>; TERM_CHKPNT: job killed after checkpointing;

MEMORY USAGE:
MAX MEM: 12 Mbytes; AVG MEM: 11 Mbytes

Summary of time in seconds spent in various states by Tue Apr 12 08:54:15
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  1        0        346      0        0        0        347
- Restart the job from the checkpoint data with the LSF brestart command. A new jobID is assigned.
$ brestart /home/gsamu/checkpoint_data/ 12995
Job <12996> is submitted to queue <normal>.

$ bjobs -l 12996

Job <12996>, User <gsamu>, Project <default>, Status <RUN>, Queue <normal>,
                     Command <./criu_test>, Share group charged </gsamu>
Tue Apr 12 08:55:57: Submitted from host <kilenc>, CWD <$HOME>, Restart,
                     Checkpoint directory </home/gsamu/checkpoint_data//12996>;
Tue Apr 12 08:55:58: Started 1 Task(s) on Host(s) <kilenc>,
                     Allocated 1 Slot(s) on Host(s) <kilenc>,
                     Execution Home </home/gsamu>, Execution CWD </home/gsamu>;
Tue Apr 12 08:56:07: Resource usage collected.
                     MEM: 14 Mbytes; SWAP: 0 Mbytes; NTHREAD: 5
                     PGID: 420069; PIDs: 420069 420070 420073 420074 420076

MEMORY USAGE:
MAX MEM: 14 Mbytes; AVG MEM: 14 Mbytes

SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut    pg    io   ls   it   tmp   swp   mem
loadSched   -      -    -     -     -     -    -    -    -     -     -
loadStop    -      -    -     -     -     -    -    -    -     -     -

RESOURCE REQUIREMENT DETAILS:
Combined: select[type == local] order[r15s:pg]
Effective: select[type == local] order[r15s:pg]
- Viewing the standard output of the job, we see the point where it was killed and that it has picked up from where it left off.
$ bpeek 12996
<< output from stdout >>
0: Sleeping for three seconds ...
1: Sleeping for three seconds ...
2: Sleeping for three seconds ...
3: Sleeping for three seconds ...
4: Sleeping for three seconds ...
….
….
110: Sleeping for three seconds ...
111: Sleeping for three seconds ...
112: Sleeping for three seconds ...
113: Sleeping for three seconds ...
/home/gsamu/.lsbatch/1649767708.12995: line 8: 418133 Killed ./criu_test
114: Sleeping for three seconds ...
115: Sleeping for three seconds ...
116: Sleeping for three seconds ...
117: Sleeping for three seconds ...
118: Sleeping for three seconds ...
119: Sleeping for three seconds ...
120: Sleeping for three seconds ...
....
....
We’ve demonstrated how one can integrate CRIU checkpoint and restart with IBM Spectrum LSF using the echkpnt and erestart interfaces. As highlighted earlier, LSF provides a number of plugin interfaces which provide flexibility to organizations looking to make site-specific customizations.