Gábor Samu
Gábor Samu
Creator of this blog.
Oct 1, 2018 3 min read

GPU usage information for jobs in IBM Spectrum LSF

thumbnail for this post

In my last blog, we ran through an example showing how IBM Spectrum LSF now automatically detects the presence of NVIDIA GPUs on hosts in the cluster and performs the necessary configuration of the scheduler automatically.

In this blog, we take a closer look at the integration between Spectrum LSF and NVIDIA DCGM which provides GPU usage information for jobs submitted to the system.

To enable the integration between Spectrum LSF and NVIDIA DCGM, we need to specify the LSF_DCGM_PORT=<port number> parameter in LSF_ENVDIR/lsf.conf

root@kilenc:/etc/profile.d# cd $LSF_ENVDIR
root@kilenc:/opt/ibm/lsfsuite/lsf/conf# cat lsf.conf |grep -i DCGM
LSF_DCGM_PORT=5555

You can find more details about the variable LSF_DCGM_PORT and what it enables here.

Before continuing, please ensure that the DCGM daemon is up and running. Below we start DCGM on the default port and run a query command to confirm that it’s up and running.

root@kilenc:/opt/ibm/lsfsuite/lsf/conf# nv-hostengine
Started host engine version 1.4.6 using port number: 5555

root@kilenc:/opt/ibm/lsfsuite/lsf/conf# dcgmi discovery -l
1 GPU found.
+--------+-------------------------------------------------------------------+
| GPU ID | Device Information                                                |
+========+===================================================================+
| 0      |  Name: Tesla V100-PCIE-32GB                                       |
|        |  PCI Bus ID: 00000033:01:00.0                                     |
|        |  Device UUID: GPU-3622f703-248a-df97-297e-df1f4bcd325c            |
+--------+-------------------------------------------------------------------+ 

Next, let’s submit a GPU job to IBM Spectrum LSF to demonstrate the collection of GPU accounting. Note that the GPU job must be submitted to Spectrum LSF with the exclusive mode specified in order for the resource usage to be collected. As was the case in my previous blog, we submit the gpu-burn test job (formally known as Multi-GPU CUDA stress test).

test@kilenc:~/gpu-burn$ bsub -gpu "num=1:mode=exclusive_process" ./gpu_burn 120
Job <54086> is submitted to default queue <normal>

Job 54086 runs to successful completion and we use the Spectrum LSF bjobs command with the -gpu option to display the GPU usage information in the output below.

test@kilenc:~/gpu-burn$ bjobs -l -gpu 54086

Job <54086>, User <test>, Project <default>, Status <DONE>, Queue <normal>, Com
                     mand <./gpu_burn 120>, Share group charged </test>
Mon Oct  1 11:14:04: Submitted from host <kilenc>, CWD <$HOME/gpu-burn>, Reques
                     ted GPU <num=1:mode=exclusive_process>;
Mon Oct  1 11:14:05: Started 1 Task(s) on Host(s) <kilenc>, Allocated 1 Slot(s)
                      on Host(s) <kilenc>, Execution Home </home/test>, Executi
                     on CWD </home/test/gpu-burn>;
Mon Oct  1 11:16:08: Done successfully. The CPU time used is 153.0 seconds.
                     HOST: kilenc; CPU_TIME: 153 seconds
                        GPU ID: 0
                            Total Execution Time: 122 seconds
                            Energy Consumed: 25733 Joules
                            SM Utilization (%): Avg 99, Max 100, Min 64
                            Memory Utilization (%): Avg 28, Max 39, Min 9
                            Max GPU Memory Used: 30714888192 bytes


GPU Energy Consumed: 25733.000000 Joules


 MEMORY USAGE:
 MAX MEM: 219 Mbytes;  AVG MEM: 208 Mbytes

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -  
 loadStop    -     -     -     -       -     -    -     -     -      -      -  

 EXTERNAL MESSAGES:
 MSG_ID FROM       POST_TIME      MESSAGE                             ATTACHMENT
 0      test       Oct  1 11:14   kilenc:gpus=0;                          N     

 RESOURCE REQUIREMENT DETAILS:
 Combined: select[(ngpus>0) && (type == local)] order[gpu_maxfactor] rusage[ngp
                     us_physical=1.00]
 Effective: select[((ngpus>0)) && (type == local)] order[gpu_maxfactor] rusage[
                     ngpus_physical=1.00]

 GPU REQUIREMENT DETAILS:
 Combined: num=1:mode=exclusive_process:mps=no:j_exclusive=yes
 Effective: num=1:mode=exclusive_process:mps=no:j_exclusive=yes

 GPU_ALLOCATION:
 HOST             TASK ID  MODEL        MTOTAL  FACTOR MRSV    SOCKET NVLINK                           
 kilenc           0    0   TeslaV100_PC 31.7G   7.0    0M      8      -               

And to close, yours truly spoke at the HPC User Forum in April 2018 (Tucson, AZ) giving a short update in the vendor panel about Spectrum LSF, focusing on GPU support.