Running SAS – DataCrunch | a QAC student blog

Taken from Wesleyan Cluster Wiki: https://dokuwiki.wesleyan.edu/doku.php?id=cluster:103

Some general information for SAS users.

SAS

SAS (http://sas.com) is statistical analysis software, and much more, that is frequently used in the social sciences. It is available on the High Performance Academic Computing Cluster. It is not a parallel version of SAS, but we do offer an unlimited Linux license for Teaching and Research.

SAS is typically invoked in batch mode by submitting a script (*.sas text file). SAS will generate a log file (*.log) and a listing file (*.lst). The former shows you what is going on, the latter contains the output of invoked procedures.
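Since the log file is where problems show up, a quick scan for lines that start with ERROR or WARNING can save some reading. What follows is only a minimal shell sketch: the function name sas_log_problems is made up for illustration, and it assumes SAS flags problems at the start of a log line, as it normally does.

```shell
#!/bin/bash
# Print ERROR and WARNING lines from a SAS log, with their line numbers.
# Returns 0 when none are found, 1 when at least one is found.
# (sas_log_problems is an illustrative name, not a SAS or cluster tool.)
sas_log_problems() {
    local log="$1"
    if grep -nE '^(ERROR|WARNING)' "$log"; then
        return 1
    fi
    return 0
}

# Usage: only runs when a test.log exists in the current directory.
if [ -f test.log ]; then
    sas_log_problems test.log && echo "test.log: no ERROR or WARNING lines"
fi
```

NOTE lines are left out on purpose; they are informational and appear in every log.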

SAS can be invoked in interactive mode on the head node for debugging and code development, if needed. However, interactive mode is not supported on compute nodes. Hence, if you need to generate graphical output in a batch job, you will have to use SAS/Graphics or the Output Delivery System (SAS/ODS). Examples of code can be found at a variety of locations:

at sas.com http://support.sas.com/kb/?ct=51000&col=suppprd
SAS/ODS examples http://sas.wesleyan.edu/statgraphs/
SAS/Graphics examples Data Driven Images, Code

A tutor application is available at http://sas.wesleyan.edu/SASOnlineTutor/sot91/index.htm

Program

So let’s generate a little SAS program using a Unix editor like vi/vim, emacs, or pico.

First we generate the input data file test.dat

1234567890
0987654321
2468097531
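If you prefer not to open an editor for three lines, the same file can be written from the shell with a here-document. This is just a convenience sketch; it writes test.dat into whatever directory you are in.

```shell
#!/bin/bash
# Write the three sample records to test.dat in the current directory.
cat > test.dat <<'EOF'
1234567890
0987654321
2468097531
EOF

# Three records of 10 digits plus a newline each, so 3 lines and 33 bytes.
wc -l -c test.dat
```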

Next a simple SAS file test.sas which does the obvious

options nocenter;
filename test './test.dat';

data one;
  infile test;
  input @2 x 3.1 @6 y 3.1;
  total = x * y;
run;

proc print; run;

Let’s test it by running it on the head node

[root@greentail sas]# ll
total 8
-rw-r--r-- 1 root root  33 Dec 21 10:16 test.dat
-rw-r--r-- 1 root root 140 Dec 21 10:22 test.sas
[root@greentail sas]# sas test
[root@greentail sas]# cat test.lst
The SAS System   10:24 Wednesday, December 21, 2011   1

Obs      x       y      total

 1     23.4    67.8    1586.52
 2     98.7    54.3    5359.41
 3     46.8    97.5    4563.00
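The numbers in the listing can be cross-checked without SAS. The sketch below is a rough awk stand-in for the input statement, not SAS itself: @2 and @6 are the starting columns, each field is three characters wide, and because the data contain no decimal point the 3.1 informat treats the last digit as one decimal place. The function name sas_like_totals is made up for illustration.

```shell
#!/bin/bash
# Rough awk stand-in for:  input @2 x 3.1 @6 y 3.1;  total = x * y;
# Columns 2-4 feed x and columns 6-8 feed y; dividing by 10 mimics the
# implied single decimal place of the 3.1 informat on digit-only data.
sas_like_totals() {
    awk '{
        x = substr($0, 2, 3) / 10
        y = substr($0, 6, 3) / 10
        printf "%.1f %.1f %.2f\n", x, y, x * y
    }' "$@"
}

printf '1234567890\n0987654321\n2468097531\n' | sas_like_totals
# prints:
# 23.4 67.8 1586.52
# 98.7 54.3 5359.41
# 46.8 97.5 4563.00
```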

Submit

OK, so now we have a program that works. Now we want to submit it maybe a dozen times (with different input data or different calculations, whatever). To do that, we write a shell script that invokes this SAS program and hand it off to the scheduler (Lava). The scheduler figures out which compute nodes are idle and dispatches your programs on your behalf.

Create a shell script for submission, for example with the name run
Set execute permissions chmod u+x run
Submit (see below)

#!/bin/bash
# submit via 'bsub < run'

#BSUB -q hp12
#BSUB -J test
#BSUB -o stdout
#BSUB -e stderr

time sas test

To the shell, a leading ‘#’ starts a comment, but the scheduler specifically looks for lines that start with ‘#BSUB’ and interprets the rest of the line as an option: -q (define queue), -J (set job name), -o (save STDOUT to a file), -e (same for STDERR). See man bsub for more information. After the options, the script states what to run; here we prefix the command with the Unix utility time, which reports run time to STDERR.
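The create, chmod and submit steps can also be done in one go from the command line. The sketch below writes the same script to a file named run, checks its syntax without executing it, and lists the lines the scheduler will treat as options. The bsub call is guarded so that nothing is submitted on a machine without the scheduler.

```shell
#!/bin/bash
# Write the submission script to a file named "run" in the current directory.
cat > run <<'EOF'
#!/bin/bash
# submit via 'bsub < run'

#BSUB -q hp12
#BSUB -J test
#BSUB -o stdout
#BSUB -e stderr

time sas test
EOF

chmod u+x run    # set execute permission for the owner
bash -n run      # syntax check only; nothing is executed

# Show the lines the scheduler reads as options (plain comments to bash).
grep '^#BSUB' run

# Submit only where the scheduler's bsub command is available.
if command -v bsub >/dev/null 2>&1; then
    bsub < run
fi
```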

[hmeij@greentail sas]$ bsub < run
Job <492637> is submitted to queue <hp12>.

[hmeij@greentail sas]$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
492637  hmeij   RUN   hp12       greentail   n10         test       Dec 21 10:49

[hmeij@greentail sas]$ bqueues
QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP 
hp12             50   Open:Active    256    -    -    -   219     0   219     0
matlab           50   Open:Active      8    8    -    8     0     0     0     0
stata            50   Open:Active      6    6    -    6     0     0     0     0
elw              50   Open:Active     60    -    -    -     0     0     0     0
emw              50   Open:Active     32    -    -    -     8     0     8     0
ehw              50   Open:Active     32    -    -    -     8     0     8     0
ehwfd            50   Open:Active     32    -    -    -     8     0     8     0
imw              50   Open:Active    128    -    -    -    32     0    32     0
bss24            50   Open:Active     90    -    -    -     0     0     0     0

[hmeij@greentail sas]$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
492637  hmeij   RUN   hp12       greentail   n10         test       Dec 21 10:49
[hmeij@greentail sas]$ bjobs
No unfinished job found

[hmeij@greentail sas]$ ll
total 28
-rwxr--r-- 1 hmeij its  115 Dec 21 10:48 run
-rw-r--r-- 1 hmeij its   42 Dec 21 10:49 stderr
-rw-r--r-- 1 hmeij its  838 Dec 21 10:49 stdout
-rw-r--r-- 1 hmeij its   33 Dec 21 10:16 test.dat
-rw-r--r-- 1 hmeij its 2565 Dec 21 10:49 test.log
-rw-r--r-- 1 hmeij its  258 Dec 21 10:49 test.lst
-rw-r--r-- 1 hmeij its  140 Dec 21 10:22 test.sas

And so the job was dispatched to host n10 for execution. The results are in my home directory; in fact, the entire job ran in my home directory while executing on the remote compute node. I may not want that if I process or generate a lot of data, so we’re going to add some statements to the script next. I may also want to reserve some memory, so that the scheduler does not send the job to a host with insufficient memory available and does not later dispatch another job there that causes memory conflicts with mine.

The hp12 queue is the default queue of the cluster greentail, and each of its compute nodes has a 12 GB memory footprint. Memory footprints of hosts in the other queues differ; please consult http://petaltail.wesleyan.edu/cgi-bin/bqueues_web.cgi (there is some old data…) for information about other queues.
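The idea of submitting the program a number of times, mentioned at the start of this section, could be sketched as follows. The layout is an assumption made for illustration, not something the cluster prescribes: one directory per run (job_1, job_2, ...), each of which you fill with its own test.dat and a copy of test.sas, and one generated run script per directory with its own job name. The function name make_jobs is hypothetical too.

```shell
#!/bin/bash
# Generate one submission script per run directory (hypothetical layout).
# Each job_N directory is expected to hold its own test.dat and test.sas.
make_jobs() {
    local count="$1" i
    for ((i = 1; i <= count; i++)); do
        mkdir -p "job_$i"
        # Unquoted here-document so that $i is filled in for each script.
        cat > "job_$i/run" <<EOF
#!/bin/bash
#BSUB -q hp12
#BSUB -J test_$i
#BSUB -o stdout
#BSUB -e stderr

time sas test
EOF
        chmod u+x "job_$i/run"
    done
}

make_jobs 3

# Submit each script from inside its own directory, so stdout, stderr and
# the SAS output land there. Skipped where bsub is not installed.
if command -v bsub >/dev/null 2>&1; then
    for d in job_*/; do
        (cd "$d" && bsub < run)
    done
fi
```

Keep the count modest so that enough job slots remain for other users.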

Submit 2

On the back end compute nodes, unless specified otherwise, the job runs inside your home directory. That job competes with other activities inside /home. Compute nodes have two other areas where jobs can run: /localscratch and /sanscratch. The former is a local filesystem on each node and should be used if file locking is essential. The latter is a filesystem on greentail’s disk array (5 TB) served via IPoIB (that is, NFS traffic over fast interconnect switches; performance should be much better than over gigabit ethernet switches). It consists of disks and spindles that are not impacted by what happens on /home. So we’re going to use that.

In the SAS program we add the following lines

%let jobpid = %sysget(LSB_JOBID);
libname here "/sanscratch/&jobpid";

And change this line so the data set is written to that scratch directory

data here.one;

In the submission script we change the following

new submission file with edits
-n reserves job slots (cpu cores) for the job (not necessary, SAS jobs always use only one)
-R reserves resources, for example, rusage[mem=200] reserves 200 MB of memory on the target compute node
the scheduler creates unique dirs in scratch, named by job ID, for you, so we’ll stage the job there
but now we must copy the relevant files to the scratch dir and the results back to the home dir

#!/bin/bash
# submit via 'bsub < run'

#BSUB -q hp12
#BSUB -J test
#BSUB -o stdout
#BSUB -e stderr
#BSUB -n 1
#BSUB -R "rusage[mem=200]"

# unique job dir in scratch
export MYSANSCRATCH=/sanscratch/$LSB_JOBID
cd "$MYSANSCRATCH" || exit 1

# stage input and program, run, then copy results back to home
cp ~/sas/test.dat ~/sas/test.sas .
time sas test
cp test.log test.lst ~/sas
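The stage in, run, stage out pattern of this script can be wrapped in a small function that also copes with a missing scratch directory or a failing run. This is a sketch under assumptions: the function name stage_and_run and its argument order are made up for illustration, and the file names are the ones used in this example.

```shell
#!/bin/bash
# Stage inputs into a scratch directory, run a command there, copy results back.
# (stage_and_run and its argument order are illustrative, not a cluster tool.)
#   $1    scratch directory (on the cluster: /sanscratch/$LSB_JOBID)
#   $2    directory that holds the inputs and receives the results
#   $3... command to run inside the scratch directory
stage_and_run() {
    local scratch="$1" home_dir="$2" rc f
    shift 2
    [ -d "$scratch" ] || { echo "no scratch dir: $scratch" >&2; return 1; }
    cp "$home_dir"/test.dat "$home_dir"/test.sas "$scratch"/ || return 1

    # Run in a subshell so the caller's working directory is left alone.
    ( cd "$scratch" && "$@" )
    rc=$?

    # Copy back whatever log and listing exist, even after a failed run,
    # since the log is what explains the failure.
    for f in test.log test.lst; do
        if [ -f "$scratch/$f" ]; then
            cp "$scratch/$f" "$home_dir"/
        fi
    done
    return "$rc"
}

# Inside a job script this might be called as:
#   stage_and_run "/sanscratch/$LSB_JOBID" ~/sas sas test
```

Returning the exit status of the command lets the job script, and therefore the scheduler, see whether the run itself succeeded.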

You can monitor the progress of your job from greentail while it runs

[hmeij@greentail sas]$ ll /sanscratch/492667/
total 16
-rw-r--r-- 1 hmeij its   33 Dec 21 14:31 test.dat
-rw-r--r-- 1 hmeij its 2568 Dec 21 14:31 test.log
-rw-r--r-- 1 hmeij its  258 Dec 21 14:31 test.lst
-rw-r--r-- 1 hmeij its  140 Dec 21 14:31 test.sas

Best Practices

You may submit as many SAS jobs as you like, just leave enough resources available for others to also get work done
Because SAS submissions are serial, non-parallel jobs, your -n flag is always 1
Reserve resources if you know what you need, especially memory
Use /sanscratch for large data jobs with heavy read/write operations
Queue ehwfd is preferentially for Gaussian users, and please stay off the stata and matlab queues
Write smart SAS code, for example, use data set indexes and PROC SQL (this can be your best friend)
… suggestions will be added to this page