Debugging and Profiling

Arm-Forge

On ruche, you can use Arm Forge (Ultimate edition) to debug MPI/OpenMP or CUDA codes with DDT, profile a code with MAP, or analyze it with Arm Performance Reports. User guides and tutorials are available at https://developer.arm.com/tools-and-software/server-and-hpc/debug-and-profile/arm-forge. The recommended way to use Arm Forge on ruche is to connect remotely with the Forge remote client and the reverse connect mechanism, as explained below.

Prerequisite

Install the Forge client on your laptop/workstation: download version 22.1 (the version installed on ruche), launch it, and enter the following configuration for ruche:

  • Connection name: ruche
  • Host name: login@ruche.mesocentre.universite-paris-saclay.fr (where login is your login on ruche)
  • Remote installation directory: /gpfs/softs/debuggers_and_profilers/arm/forge/22.1

Click on Ok and Close.

Reverse connect mechanism

1/ On ruche, load the Arm Forge environment module, arm-forge/22.1/intel-20.0.4.304, and the modules needed to compile your code.

2/ Compile the code with -O0 and -g options.
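Steps 1 and 2 might look like the following on a ruche login node. This is only a sketch: the source file example.c and the use of mpiicc (the Intel MPI C compiler wrapper) are assumptions, and the compiler and MPI modules are simply those loaded in the job script of step 3. Substitute whatever your code actually needs.

```shell
# Arm Forge plus the compiler/MPI stack (taken from the step 3 job script).
modules="intel/19.0.3/gcc-4.8.5 arm-forge/22.1/intel-20.0.4.304 intel-mpi/2019.3.199/intel-19.0.3.199"

# -O0 disables optimization and -g adds debug symbols, so that DDT can map
# the running code back to source lines.
# example.c is a hypothetical source file; mpiicc is an assumed compiler wrapper.
compile_cmd="mpiicc -O0 -g -o example.exe example.c"

if command -v module >/dev/null 2>&1; then
    module purge
    module load $modules
    $compile_cmd
else
    # Outside the cluster environment, only show what would be run.
    echo "module command not found; on ruche, run: module load $modules && $compile_cmd"
fi
```

Loading the same modules at compile time and in the job script avoids mismatches between the libraries used to build the code and those used to run it.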

3/ Write a SLURM script that runs the ddt or map command with the --connect option. Here is an example that debugs an MPI application, example.exe, with ddt:

$ cat ruche_submit.sh
#!/bin/bash

#! SBATCH Directives
#SBATCH --job-name=cpujob
#SBATCH --output=%x.o%j
#SBATCH --time=01:00:00
#SBATCH --ntasks=4
# the number of licenses should be equal to ntasks
#SBATCH --licenses=arm-forge:4
#SBATCH --partition=cpu_short

# Purge the environment, then load the same modules as at compile time
module purge
module load intel/19.0.3/gcc-4.8.5
module load arm-forge/22.1/intel-20.0.4.304
module load intel-mpi/2019.3.199/intel-19.0.3.199

# Application to debug, located in the submission directory
application="$SLURM_SUBMIT_DIR/example.exe"

#! Work directory (i.e. where the job will run)
workdir="$SLURM_SUBMIT_DIR"
cd "$workdir"

# execution with 'ntasks' MPI processes
ddt --connect --mem-debug=balanced srun -n "$SLURM_NTASKS" "$application"

4/ Submit this SLURM script using the sbatch command.
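For instance, the submission and a first status check could be done as follows. This is a sketch to be run on a ruche login node; the script name is the one used in step 3, and the guard only avoids errors on machines without SLURM.

```shell
script="ruche_submit.sh"

if command -v sbatch >/dev/null 2>&1; then
    # --parsable makes sbatch print only the job ID (followed by ";cluster" if any).
    jobid=$(sbatch --parsable "$script" | cut -d';' -f1)
    # The job must be in the running (R) state before the client can connect.
    squeue -j "$jobid"
else
    echo "sbatch not found; on a ruche login node, run: sbatch $script"
fi
```

While the job is pending (state PD), the Forge client has nothing to connect to, so it is worth checking the state before moving on to the next step.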

5/ Once the job is running on ruche, launch the remote client on your laptop. In the "Remote Launch" section, select ruche. A new window appears and reads:

"Connecting to login@ruche.mesocentre.universite-paris-saclay.fr ..."
/Applications/Arm Forge Client 22.1.app/Contents/Resources/libexec/remote-exec -C login@ruche.mesocentre.universite-paris-saclay.fr /gpfs/softs/debuggers_and_profilers/arm/forge/22.1/libexec/ddt-remoted
login@ruche.mesocentre.universite-paris-saclay.fr's password: "

Enter the password of your ruche account. Note that we recommend not using a gateway (configured in your ~/.ssh/config file, for instance) to access ruche, because the gateway may interfere with the reverse connect mechanism of the remote client.

A new window appears and reads:

"A new Reverse Connect request is available from nodeXXX for Arm DDT.
Command Line: --connect --mem-debug=balanced srun -n 4 /gpfs/users/login/example.exe
Do you want to accept this request?"

Accept and click on Run to debug your code.

6/ When you have finished debugging, click on "End Session" in the File menu and quit the Forge client. Type squeue on ruche and check that your job has finished.

Intel Parallel Studio (Cluster Edition)

Intel Parallel Studio includes several profiling tools; Intel® Advisor is described below.

Versions

$ module avail intel-parallel-studio
-------------------- /gpfs/softs/modules/modulefiles/tools ---------------------
   intel-parallel-studio/cluster.2019.3/intel-19.0.3.199
   intel-parallel-studio/cluster.2020.2/intel-20.0.2

Intel® Advisor

This tool provides:

  • Vectorization and Code Insights
  • CPU / Memory Roofline Insights and GPU Roofline Insights
  • Offload Modeling
  • Threading

Intel® Advisor User guide

Help command:

$ module load intel-parallel-studio/cluster.2020.2/intel-20.0.2 
$ advixe-cl --help
Intel(R) Advisor Command Line Tool
Copyright (C) 2009-2020 Intel Corporation. All rights reserved.
 Usage: advixe-cl <--action> [--action-option] [--global-option] [[--]
 <target> [target options]] 
<action> is one of the following:
    collect          Run the specified analysis and collect data. Tip: Specify the search-dir when collecting data.
    command          Issue a command to a running collection.
(...)

Example for SIMD vectorization

See SIMD training course at IDRIS, "Support de cours", part "Outils" and "Travaux pratiques du cours" (hands-on), tp3 with hydro code.

1/ Load the latest version of the intel-parallel-studio module:

module load intel-parallel-studio/cluster.2020.2/intel-20.0.2

2/ Compile your code with the Intel compiler and the -g option
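A possible compile line is sketched below. The source file hydro.f90, the choice of ifort (Intel Fortran), and the -O2 level are assumptions made for illustration; use the compiler and the optimization flags of your real build, adding only -g.

```shell
# hydro.f90 is a hypothetical source file name.
# -g keeps debug symbols so that Advisor can attribute results to source lines;
# -O2 (assumed here) keeps the optimizations, including vectorization, that
# the survey is meant to analyze.
compile_cmd="ifort -g -O2 -o hydro hydro.f90"

if command -v ifort >/dev/null 2>&1; then
    $compile_cmd
else
    echo "ifort not found; after loading the module on ruche, run: $compile_cmd"
fi
```

Unlike a debugging build, a build for vectorization analysis should keep optimization enabled, otherwise the report describes code that differs from the production binary.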

3/ Collect data

Script example (adapted from "SIMD training course" at IDRIS):

$ cat adv.slurm
#!/bin/bash
#SBATCH --job-name=hydroadvice
#SBATCH --ntasks=1
#SBATCH --time=00:30:00
#SBATCH --output=hydroadvice.%j.out
#SBATCH --error=hydroadvice.%j.out
#SBATCH --partition=cpu_short

module load intel-parallel-studio/cluster.2020.2/intel-20.0.2 

advixe-cl -collect survey -interval=10 -data-limit=1024000 -project-dir ./Adv/ -- ./hydro input_sedov_noio_10000x10000.nml
advixe-cl -collect tripcounts -flop -interval=10 -data-limit=1024000 -project-dir ./Adv/ -- ./hydro input_sedov_noio_10000x10000.nml

Submit this SLURM script using the sbatch command.

4/ When the job is finished, use the ruche visualization interface to launch advixe-gui.
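If a graphical session is not convenient, a text summary of the survey can also be printed from the command line. The sketch below assumes the project directory ./Adv/ used in the collection script and that the intel-parallel-studio module is loaded.

```shell
# Project directory written by the "advixe-cl -collect" commands above.
project_dir="./Adv/"

if command -v advixe-cl >/dev/null 2>&1; then
    # Print the survey report (hot loops and their vectorization status) as text.
    advixe-cl -report survey -project-dir "$project_dir"
else
    echo "advixe-cl not found; load the intel-parallel-studio module first"
fi
```

The text report is handy for a quick look on a login node; the GUI remains the better choice for exploring roofline charts.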

Maqao

MAQAO (Modular Assembly Quality Analyzer and Optimizer) is a performance analysis and optimization framework operating at binary level with a focus on core performance. Its main goal is to guide application developers along the optimization process through synthetic reports and hints.

Versions

$ module avail maqao

--------------------- /gpfs/softs/modules/modulefiles/debuggers_and_profilers ----------------------
   maqao/2.14.1/gcc-9.2.0

Maqao documentation

Help command:

$ module load maqao/2.14.1/gcc-9.2.0 
$ maqao --help
Synopsis:
  maqao <command>|<script.lua> [...]

Description:
MAQAO (Modular Assembly Quality Analyzer and Optimizer) is a tool for application performance analysis 
that deals directly with the target code's binary file.
 The current version handles the following architectures:
   - x86_64
(...)

Example for SIMD vectorization

See SIMD training course at IDRIS, "Support de cours", part "Outils" and "Travaux pratiques du cours" (hands-on), tp3 with hydro code.

1/ Load the latest version of the intel-parallel-studio module:

module load intel-parallel-studio/cluster.2020.2/intel-20.0.2

2/ Compile your code with the Intel compiler and the -g option

3/ Write and submit the SLURM script

Script example (adapted from "SIMD training course" at IDRIS):

$ cat maqao.slurm
#!/bin/bash
#SBATCH --job-name=hydromaqao
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10
#SBATCH --time=00:30:00
#SBATCH --output=hydromaqao.%j.out
#SBATCH --error=hydromaqao.%j.out
#SBATCH --partition=cpu_short

module load intel-parallel-studio/cluster.2020.2/intel-20.0.2

module load maqao/2.14.1/gcc-9.2.0

# ./hydro input_sedov_noio_10000x10000.nml
maqao lprof -- ./hydro input_sedov_noio_10000x10000.nml

Submit this SLURM script using the sbatch command.
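Once the job has finished, the collected profile can be displayed in the terminal. The sketch below assumes that the profiling run was given an explicit experiment directory through the xp= parameter; the name hydro_lprof is hypothetical. If xp= was not given, use the directory that lprof created in the submission directory instead.

```shell
# Hypothetical experiment directory, set at collection time with, e.g.:
#   maqao lprof xp=hydro_lprof -- ./hydro input_sedov_noio_10000x10000.nml
xp_dir="hydro_lprof"

if command -v maqao >/dev/null 2>&1; then
    maqao lprof -df xp="$xp_dir"   # display the time spent per function
    maqao lprof -dl xp="$xp_dir"   # display the time spent per loop
else
    echo "maqao not found; load the maqao module first"
fi
```

The per-loop view is usually the starting point for a vectorization study, since it points to the loops that dominate the run time.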