Baltasar

Architecture

Baltasar Sete Sóis is accessible over the internet; connections land on one of three entry nodes, which give access to the cluster system.

Below is a summary of the specs for the computational nodes available.

Node name    CPUs   RAM (MB)
compute-1      48     256000
compute-2      48     256000
compute-3      48     256000
compute-4      48     248000
compute-5      48     120000
compute-6      48     128000
compute-7      48     128000
compute-8      48     128000
compute-9      48     128000
compute-10     48     128000
compute-11     96     256000
compute-12     96     256000

Totals:

  • 672 CPUs
  • 2610 GB RAM
  • 90 TB Storage

Access Baltasar

After your access request has been authorized, you should be able to connect to Baltasar via ssh using your username and ssh key as follows:

username@localmachine:~$ ssh username@baltasar.tecnico.ulisboa.pt

or, if you want X11 forwarding for graphical applications

username@localmachine:~$ ssh -X username@baltasar.tecnico.ulisboa.pt

This should get you connected to one of the entry nodes. To run processes in the computational nodes, Baltasar uses a batch queuing system named Slurm.

Never run your jobs directly on the entry nodes; any jobs caught running there will be killed.

Slurm Workload Manager

There is a small learning curve to using a cluster like Baltasar, since programs run as jobs in a shared Slurm queue.

Whereas on non-queued servers you run your programs as you would on your regular computer, in a Slurm cluster you need to request resources, either interactively or by creating and submitting a batch script.

A batch script consists of the definition of the resources you require to run your program. You can define the number of nodes, CPUs, memory, and running time limit, among others, in a structured comment section before scripting the steps to run your code.

Below follows a sample batch script to run “program”, located at “/home/username/program”, with “parameters”.

Sample Batch Script

To use Slurm you should have a Batch script like the one below.

#!/bin/bash
#SBATCH --job-name=my-job-name
#SBATCH --output=/home/username/my-job-name-%j.out
#SBATCH --error=/home/username/my-job-name-%j.err
#SBATCH --mail-user=mail@example.com
#SBATCH --mail-type=ALL
#SBATCH --time=72:00:00
#SBATCH --mem=4G

RUNPATH=/home/username/
cd $RUNPATH

./program parameters

Batch Script Explained

--job-name=<name>
 Specify a name for the job allocation. The specified name will appear along with the job id number when querying running jobs on the system. The default is the name of the batch script, or just “sbatch” if the script is read on sbatch’s standard input.
--output=<filename pattern>
 Instruct Slurm to connect the batch script’s standard output directly to the file name specified in the “filename pattern”. By default both standard output and standard error are directed to the same file. For job arrays, the default file name is “slurm-%A_%a.out”, “%A” is replaced by the job ID and “%a” with the array index. For other jobs, the default file name is “slurm-%j.out”, where the “%j” is replaced by the job ID. See Filename Specifications for filename specification options.
--error=<filename pattern>
 Instruct Slurm to connect the batch script’s standard error directly to the file name specified in the “filename pattern”. See --output.
--nodes=<minnodes[-maxnodes]>
 Request that a minimum of minnodes nodes be allocated to this job. A maximum node count may also be specified with maxnodes. If only one number is specified, this is used as both the minimum and maximum node count.
--ntasks=<number>
 sbatch does not launch tasks; it requests an allocation of resources and submits a batch script. This option advises the Slurm controller that job steps run within the allocation will launch a maximum of number tasks and to provide for sufficient resources. The default is one task per node, but note that the --cpus-per-task option will change this default (see the combined example after this list).
--cpus-per-task=<ncpus>
 Advise the Slurm controller that ensuing job steps will require ncpus number of processors per task. Without this option, the controller will just try to allocate one processor per task.
--mail-user=<mail@example.com>
 Specifies the email address to which the messages are sent.
--mail-type=<type>
 Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL (equivalent to BEGIN, END, FAIL, REQUEUE, and STAGE_OUT), STAGE_OUT (burst buffer stage out and teardown completed), TIME_LIMIT, TIME_LIMIT_90 (reached 90 percent of time limit), TIME_LIMIT_80 (reached 80 percent of time limit), TIME_LIMIT_50 (reached 50 percent of time limit) and ARRAY_TASKS (send emails for each array task). Multiple type values may be specified in a comma separated list. The user to be notified is indicated with --mail-user. Unless the ARRAY_TASKS option is specified, mail notifications on job BEGIN, END and FAIL apply to a job array as a whole rather than generating individual email messages for each task in the job array.
--time=<time-format>
 Set a limit on the total run time of the job; when the limit is reached, the job is automatically killed. A single number is interpreted as minutes. Acceptable time formats include “minutes”, “minutes:seconds”, “hours:minutes:seconds”, “days-hours”, “days-hours:minutes” and “days-hours:minutes:seconds”.
--mem=<memory-unit>
 Specify the real memory required per node. The default unit is megabytes; different units can be specified using the suffix [K|M|G|T].
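
As an illustration of how these options combine, below is a minimal sketch of a batch script for a multi-threaded program; the job name, paths, resource values and program name are placeholders, not recommendations.

#!/bin/bash
#SBATCH --job-name=threads-example
#SBATCH --output=/home/username/threads-example-%j.out
#SBATCH --error=/home/username/threads-example-%j.err
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=02:00:00
#SBATCH --mem=8G

# Pass the allocated CPU count to the program (assuming it uses OpenMP threading)
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

./my_threaded_program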

Filename Specifications

The filename pattern may contain one or more replacement symbols, which are a percent sign “%” followed by a letter (e.g. %j).

Supported replacement symbols are:

\\
Do not process any of the replacement symbols
%%
The character “%”
%A
Job array’s master job allocation number
%a
Job array ID (index) number
%j
Job allocation number
%N
Node name. Only one file is created, so %N will be replaced by the name of the first node in the job, which is the one that runs the script
%u
User name
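
For example, combining these symbols (the path is illustrative), for job 402 run by user “username” the lines below would produce /home/username/username-402.out and /home/username/username-402.err:

#SBATCH --output=/home/username/%u-%j.out
#SBATCH --error=/home/username/%u-%j.err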

Batch Script Environment Variables

Variable               Description
SLURM_JOB_ID           Job ID number given to this job.
SLURM_JOB_NAME         Name of this job.
SLURM_JOB_NODELIST     List of nodes allocated to the job.
SLURM_CPUS_PER_TASK    Number of CPUs requested per task (set only when --cpus-per-task is specified).
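
These variables can be read directly inside your batch script once the job starts; a minimal sketch (job name, path and resource values are placeholders):

#!/bin/bash
#SBATCH --job-name=env-example
#SBATCH --output=/home/username/env-example-%j.out
#SBATCH --time=00:05:00
#SBATCH --mem=1G

# Print the information Slurm provides about this allocation
echo "Job $SLURM_JOB_ID ($SLURM_JOB_NAME) is running on: $SLURM_JOB_NODELIST"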

Slurm Usage Examples

Job Submission via batch script

username@baltasar-1:~$ sbatch my_job.sbatch
Submitted batch job 402
username@baltasar-1:~$

See the queue status

username@baltasar-1:~$ squeue
         JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           370      base     job9 username  R 1-02:19:36      1 compute-7
           401      base   job123   sysadm  R      14:07      1 compute-6
           402      base   my_job username  R      00:07      1 compute-7
username@baltasar-1:~$

Check nodes status and information (CLI)

username@baltasar-1:~$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
base*        up 3-00:00:00      1    mix compute-7
base*        up 3-00:00:00      1  alloc compute-6
base*        up 3-00:00:00     10   idle compute-[1-5,8-12]
username@baltasar-1:~$

Detailed Node status (GUI)

(Requires X Forwarding within SSH session)

username@baltasar-1:~$ sview &

Detailed Job status

username@baltasar-1:~$ scontrol show job 401
JobId=401 JobName=job123
   UserId=sysadm(1111) GroupId=sysadm(1111)
   Priority=4294901406 Nice=0 Account=(null) QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:17:58 TimeLimit=3-00:00:00 TimeMin=N/A
   SubmitTime=2018-11-28T16:23:39 EligibleTime=2018-11-28T16:23:39
   StartTime=2018-11-28T16:23:39 EndTime=2018-12-01T16:23:39
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=base AllocNode:Sid=baltasar-1:24264
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=compute-6
   BatchHost=compute-6
   NumNodes=1 NumCPUs=48 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=48,mem=126976,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=124G MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=<path to executable declared within the batch script>
   WorkDir=<path to working directory>
   StdErr=<path to stderr redirect file>
   StdIn=/dev/null
   StdOut=<path to stdout redirect file>
   Power= SICP=0
username@baltasar-1:~$

Cancel a Job

username@baltasar-1:~$ scancel 402
username@baltasar-1:~$

Re-queue a Job (job needs to be running)

username@baltasar-1:~$ scontrol requeue JOB_ID
username@baltasar-1:~$

or

username@baltasar-1:~$ scontrol requeue JOB_ID1,JOB_ID2,...,JOB_IDN
username@baltasar-1:~$

Interactive run of a bash shell

username@baltasar-1:~$ salloc
salloc: Granted job allocation 403
salloc: Waiting for resource configuration
salloc: Nodes compute-5 are ready for job
username@baltasar-1:~$ srun --pty /bin/bash
username@compute-5:~$ exit
username@baltasar-1:~$ exit
salloc: Relinquishing job allocation 403
username@baltasar-1:~$
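
salloc accepts the same resource options as sbatch, so you can size the interactive allocation explicitly; a minimal sketch (the resource values are just an example):

username@baltasar-1:~$ salloc --ntasks=1 --cpus-per-task=4 --mem=8G --time=02:00:00
username@baltasar-1:~$ srun --pty /bin/bash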

Interactive run of a GUI program

(Requires X Forwarding within SSH session)

username@baltasar-1:~$ salloc
username@baltasar-1:~$ module load Mathematica
username@baltasar-1:~$ srun --pty --x11=first Mathematica

All other Slurm commands should work as in other computational clusters. Check the user documentation for details.
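
For example, scontrol show job only reports on pending, running and very recently finished jobs; if job accounting is enabled on Baltasar, the standard sacct command can be used to inspect completed jobs (the job ID and field list below are illustrative):

username@baltasar-1:~$ sacct -j 402 --format=JobID,JobName,State,Elapsed,MaxRSS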

If you have any suggestions or questions, please check the FAQ at Baltasar’s home page.

MPI Computation

Baltasar has MPICH installed and configured. You must load the MPICH module whenever MPI is needed.

username@baltasar-1:~$ module load MPICH
username@baltasar-1:~$ mpicc mpi_code.c

Or, even better, you can compile your code on a computational node. Below is an example batch script:

username@baltasar-1:~$ cat program-v1.1-compile.batch
#!/bin/bash
#SBATCH --job-name=my-code-compile
#SBATCH --output=/home/username/program-v1.1/my-code-compile-%j.out
#SBATCH --error=/home/username/program-v1.1/my-code-compile-%j.err
#SBATCH --cpus-per-task=48
#SBATCH --mail-user=mail@example.com
#SBATCH --mail-type=ALL
#SBATCH --time=72:00:00
#SBATCH --mem=64G
module load MPICH
RUNPATH=/home/username/program-v1.1
cd $RUNPATH
mkdir -p /home/username/program-v1.1/bin
./configure --prefix=/home/username/program-v1.1/bin && make -j 48 && make install
username@baltasar-1:~$ module list
No modules loaded
username@baltasar-1:~$ sbatch program-v1.1-compile.batch

When running MPI-enabled programs, you must have the MPICH module loaded beforehand or, as in the previous script, load it explicitly within the batch script. Here is an example:

username@baltasar-1:~$ cat program-v1.1-run.batch
#!/bin/bash
#SBATCH --ntasks=128
#SBATCH --time=00:01:00
#SBATCH --mem=1G
#SBATCH --job-name=my-code-run
#SBATCH --mail-user=mail@example.com
#SBATCH --workdir=/home/username/program-v1.1
#SBATCH --output=/home/username/program-v1.1/my-code-run-%j.out
#SBATCH --error=/home/username/program-v1.1/my-code-run-%j.err

srun bin/program-v1.1
username@baltasar-1:~$ module list
No modules loaded
username@baltasar-1:~$ module load MPICH
username@baltasar-1:~$ sbatch program-v1.1-run.batch

NOTE the absence of the --cpus-per-task option and the use of --ntasks when running MPI jobs: each of the requested tasks will be assigned to a CPU as resources become available. Using --cpus-per-task in MPI jobs can lead to unexpected results.

Compiler Optimization

Baltasar entry/storage nodes have a different architecture from the computational nodes. If you were to use “native” detection of the architecture at compile time, the code would be optimized for the wrong architecture: it would run slower on the computational nodes, and could even contain CPU instructions that the compute nodes do not support, resulting in Illegal Instruction errors.

Do not use -march=native or -mtune=native while compiling as it will make your program run slower or not run at all.

This happens because the entry/storage nodes have Intel Xeon CPUs whereas the computational nodes have AMD CPUs, of different flavours.

Computational nodes can be thought of in three sets:

Compute 1-5:
AMD Opteron(tm) Processor 6180 SE
Compute 6-10:
AMD Opteron(tm) Processor 6344
Compute 11-12:
AMD EPYC Processor 7401

You should use the following compiler flags for code that will be run on all or specific computational nodes.

all nodes (generic)
-O3 -mtune=generic -mmmx -msse -msse2 -msse4a -mabm -mpopcnt
Compute 1-5
-O3 -march=amdfam10
Compute 6-10
-O3 -march=bdver2
Compute 11-12
-O3 -march=znver1
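
As an illustration, compiling a single C source file so the binary runs on all computational nodes could look like this (file and program names are placeholders):

username@baltasar-1:~$ gcc -O3 -mtune=generic -mmmx -msse -msse2 -msse4a -mabm -mpopcnt -o program program.c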

Unfortunately, specifying exactly which nodes you want to run on is not practical: if you pass them through the Slurm option --nodelist, you would have to wait until all of the specified nodes are available, which is probably not what you want if your job only needs a single node.

Instead, you can specify which nodes you do not want to run on, via the --exclude option. For example, if you want to run a program compiled specifically for the bdver2 architecture, you could use the following batch script:

username@baltasar-1:~$ cat program-v1.1-run-bdver2.batch
#!/bin/bash
#SBATCH --ntasks=128
#SBATCH --time=00:01:00
#SBATCH --exclude=compute-1,compute-2,compute-3,compute-4,compute-5,compute-11,compute-12
#SBATCH --mem=1G
#SBATCH --job-name=my-code-run
#SBATCH --mail-user=mail@example.com
#SBATCH --workdir=/home/username/program-v1.1
#SBATCH --output=/home/username/program-v1.1/my-code-run-%j.out
#SBATCH --error=/home/username/program-v1.1/my-code-run-%j.err

bin/program-v1.1
username@baltasar-1:~$

This way, when the resources are being allocated, the nodes in the --exclude list are not considered, and what is left are exactly the nodes we want to target.

Modules

By now you should have noticed the module command. This command is available in all bash shells at the start of each session, or by reloading the profile configuration located at /etc/profile.d/lmod.sh

(For other shells or environments, take a look at the folder /home/share/lmod/lmod/init/ to find the corresponding init file.)

The use of modules allows us to provide sets of compatible software, compiled and linked against each other, in different versions and/or with different capabilities.

Listing loaded modules

username@baltasar-1:~$ module list

Listing available modules

username@baltasar-1:~$ module av

Loading a specific module/toolchain

username@baltasar-1:~$ module load <Module Name>

Unloading a specific module/toolchain

username@baltasar-1:~$ module unload <Module Name>

Unload all loaded modules

username@baltasar-1:~$ module purge

Default loaded modules

At the beginning of each bash session, a toolchain named setesois is loaded for you. It loads the following libraries/tools:

  1. GCCcore/7.3.0
  2. binutils/2.30-GCCcore-7.3.0
  3. GCC/7.3.0-2.30
  4. MPICH/3.2.1-GCC-7.3.0-2.30-large
  5. BLIS/0.4.1-GCC-7.3.0-2.30-generic
  6. FLAME/5.0.0-GCC-7.3.0-2.30-generic
  7. FFTW/3.3.8-GCC-7.3.0-2.30-generic
  8. HDF5/1.8.20-GCC-7.3.0-2.30-generic
  9. amdlibm/3.2.2
  10. setesois/2018.10-generic

The generic suffix means the libraries/tools were compiled to run on all computational nodes. If you wish, you can load a specific version of the toolchain built for one of the architectures described in Compiler Optimization (see the example after this list). The available versions are:

  • setesois/2018.10-amdfam10
  • setesois/2018.10-bdver2
  • setesois/2018.10-znver1
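
For example, to switch from the default generic toolchain to the one built for the EPYC nodes (compute-11 and compute-12), a sketch would be:

username@baltasar-1:~$ module purge
username@baltasar-1:~$ module load setesois/2018.10-znver1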

If, however, you do not want any of these and want to manage your own set, you can execute module purge and load them one by one.

AMD Core Math Library

You may be familiar with ACML, which implements optimized BLAS and LAPACK routines for AMD CPUs. ACML’s proprietary development was discontinued and it has been split into three separate, open-source libraries:

  • AMDLibM
  • BLIS
  • LibFlame

AMDLibM

AMD LibM is a software library containing a collection of basic math functions optimized for x86-64 processor based machines. It provides many routines from the list of standard C99 math functions. AMD LibM is a C library, which users can link into their applications to replace compiler-provided math functions. Generally, programmers access basic math functions through their compiler, but those who want better accuracy or performance than their compiler’s math functions can use this library to help improve their applications.

After loading the appropriate module via module load amdlibm, this library can be linked against with the following compiler options:

-I$EBROOTAMDLIBM/include -L$EBROOTAMDLIBM/lib/dynamic -lamdlibm -lm

for the dynamic version of the library, or

-I$EBROOTAMDLIBM/include -L$EBROOTAMDLIBM/lib/static -lamdlibm -lm

for the static version of the library.
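
As an illustration, a full compile line for a hypothetical source file my_math_code.c using the dynamic version could look like this:

username@baltasar-1:~$ module load amdlibm
username@baltasar-1:~$ gcc -O3 my_math_code.c -I$EBROOTAMDLIBM/include -L$EBROOTAMDLIBM/lib/dynamic -lamdlibm -lm -o my_math_code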

For more information see the official website and the example files located in the cluster at $EBROOTAMDLIBM/examples.

BLIS

BLIS is a portable software framework for instantiating high-performance BLAS-like dense linear algebra libraries. The framework was designed to isolate essential kernels of computation that, when optimized, enable optimized implementations of most of its commonly used and computationally intensive operations. Select kernels have been optimized for the AMD EPYC™ processor family. The optimizations are done for single and double precision routines.

If you are using -lblas when compiling your code, be it in a script or a Makefile, you should replace it with:

-L$EBROOTBLIS/lib -lblis

and you should now have BLIS functions available, as well as its BLAS compatibility layer.
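
As an illustration, with the BLIS module loaded (it is part of the default setesois toolchain), a compile line for a hypothetical my_blas_code.c that previously used -lblas would become the following (depending on how the library was built, -lpthread and -lm may also be needed):

username@baltasar-1:~$ gcc -O3 my_blas_code.c -L$EBROOTBLIS/lib -lblis -o my_blas_code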

LibFlame

libFLAME is a portable library for dense matrix computations, providing much of the functionality present in LAPACK. It includes a compatibility layer, FLAPACK, which provides a complete LAPACK implementation. The library provides scientific and numerical computing communities with a modern, high-performance dense linear algebra library that is extensible, easy to use, and available under an open source license. libFLAME is a C-only implementation and does not depend on any external FORTRAN libraries, including LAPACK. There is a backward compatibility layer, lapack2flame, that maps LAPACK routine invocations to their corresponding native C implementations in libFLAME. This allows legacy applications to start taking advantage of libFLAME with virtually no changes to their source code.

In combination with the AMD optimized BLIS library, libFLAME enables running high performing LAPACK functionalities on AMD platforms. The performance of libFLAME on AMD platforms can be improved by just linking with the AMD optimized BLIS.

If you are using -llapack when compiling your code, be it in a script or a Makefile, you should replace it with:

-L$EBROOTFLAME/lib -lflame

and you should now have LAPACK functions available.
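
As an illustration, replacing -llapack (and -lblas) for a hypothetical my_lapack_code.c, linking libFLAME together with BLIS as suggested above, could look like this:

username@baltasar-1:~$ gcc -O3 my_lapack_code.c -L$EBROOTFLAME/lib -lflame -L$EBROOTBLIS/lib -lblis -o my_lapack_code

Note that -lflame comes before -lblis, since libFLAME relies on the underlying BLAS implementation.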

Support Request

Feel free to contact us if you run into trouble while using Baltasar. We’ll be glad to help.

Note that if you are contacting us on behalf of someone else, please state in the e-mail message who that person is and include a contact for that person.

E-mail Structure

This is an e-mail request template; copy it into your e-mail as is in order to get feedback as quickly as possible. Thank you, the IT crowd.

Subject

Baltasar Help

Include the following information

  • Username
  • Any relevant code/script folder location
  • Any relevant log files/error messages

You can add extra information you may find relevant concerning your request.

As an alternative to manually copying this template, you can click the Baltasar Help Template link to use your local e-mail client’s mailto: capability.