Understanding SAS Data Steps

To go further with SAS DATA steps, it helps to understand three features of DATA step programming.

  • the compile phase versus the execution phase
  • the central role of the program data vector (PDV)
  • the way DATA steps loop through observations

The typical DATA step completes these tasks:

  • names an output data set
  • reads an observation
  • manipulates the observation
  • outputs the observation
  • returns to the top for the next observation

In outline, such a step looks like this:

data new_data;
  set old_data;
  /* manipulations */
  * output ;
  * return ;
run;

If a DATA step does not include explicitly coded OUTPUT or RETURN statements, these are assumed to be the final two statements. So the above example would typically be written:

data new_data;
  set old_data;
  /* manipulations */
run;

For a more detailed flow chart of DATA step processing, see the SAS Programmer’s Guide.

Compile Phase and Execution Phase

When SAS receives a DATA step, it first goes through a compile phase in which it checks your code for syntax errors (typos, missing semicolons, etc.) and prepares space in memory to process your data. Not all errors are syntax errors, so a step that compiles cleanly can still run into problems during execution. If SAS finds a syntax error, it skips execution of the step but continues to check the rest of the step for syntax errors. No output data set is created.

Warning

In interactive mode, a DATA step error means that single step is not executed, but SAS will attempt to process subsequent steps. In batch mode, a DATA step error means that no subsequent steps are processed.

If no errors are found, SAS then begins to execute the DATA step.

Compile Phase

In addition to checking for syntax errors, the main task accomplished in the compile phase is to define variables for processing, creating the program data vector (PDV). SAS scans through your code, including taking a peek at the variables already defined in input data sets. SAS assembles the PDV, with the variables defined in the order in which they are encountered in your code. The PDV also includes several automatic variables. Two variables that are always present in the PDV (but not saved to your output data set) are _N_ (the number of the current iteration of the DATA step, which in simple steps matches the observation number) and _ERROR_ (set to 1 when an error is encountered during execution).
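
As a small sketch of the two automatic variables, the step below copies _N_ into an ordinary variable and provokes a division by zero on the second iteration. The data set name auto_vars and the variables iteration, zero, and ratio are made up for illustration.

```sas
data auto_vars;
  set sashelp.class(obs=2);
  iteration = _n_;          /* copy _N_ into a regular variable so it is saved */
  if _n_ eq 2 then do;
    zero = 0;
    ratio = age / zero;     /* division by zero: SAS sets _ERROR_ to 1 */
  end;
  put _n_= _error_=;        /* write both automatic variables to the log */
run;
```

The log should show _ERROR_=0 on the first iteration and _ERROR_=1 on the second, along with a note about the division by zero. Neither _N_ nor _ERROR_ appears in auto_vars, but iteration does.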

The attributes of each variable (type, storage length, etc.) are defined in the compile phase as well. A number of DATA step statements are only used during the compile phase: LENGTH, FORMAT, LABEL, ATTRIB. While the order of the variables in the PDV depends on the order in which they are first encountered in the code, attributes such as labels, formats, and the lengths of numeric variables depend on the value last encountered in the code. (The type of a variable and the length of a character variable are instead fixed where the variable is first encountered.)
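
One caveat is worth a quick sketch: the length of a character variable is fixed at the point where the variable is first encountered, so a LENGTH statement for it has to come before its first use. The variable names below are made up for illustration.

```sas
data _null_;
  length long_name $ 20;      /* declared before first use: length 20 */
  long_name  = "Bartholomew";
  short_name = "Al";          /* first encounter fixes the length at 2 */
  short_name = "Alexander";   /* later, longer values are truncated to "Al" */
  put long_name= short_name=;
run;
```

The log line should read long_name=Bartholomew short_name=Al, because short_name was created with room for only two characters.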

Initial PDV

You can see the current state of the PDV during the execution phase by adding PUT _all_ statements to your DATA step, which write text to your log or to a file named in a FILE statement. To see the PDV, add the statement

put _all_;

(You could also name just a few variables to PUT, rather than writing out the whole PDV.)
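
A minimal sketch of naming just a few variables follows. Putting an equals sign after each name asks PUT to write the variable name along with its value, and DATA _NULL_ runs the step without creating an output data set.

```sas
data _null_;
  set sashelp.class(obs=2);
  put name= age=;   /* writes only these two variables to the log */
run;
```

For the first two observations of sashelp.class this should write lines such as Name=Alfred Age=14 and Name=Alice Age=13.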

For example, we could see which variables are in the PDV of this DATA step:

data new;
  put _all_;
  set sashelp.class(obs=0);
run;

Name=  Sex=  Age=. Height=. Weight=. _ERROR_=0 _N_=1
NOTE: There were 0 observations read from the data set SASHELP.CLASS.
NOTE: The data set WORK.NEW has 0 observations and 5 variables.

By placing the PUT statement first, we can see what values the PDV holds before any data are read from the input data set. The very first time SAS executes this statement, it effectively tells us how the PDV is defined at the end of the compile phase. (And the obs=0 data set option prevents SAS from reading any data.)

(SAS also has a DATA step debugger for interactive interfaces, which is a little more involved to use.)

Variable Order and Attributes

In the next example, we can see how both the order of the variables and their attributes are changed during the compile phase.

data new;
  if _n_ eq 1 then put _all_;
  label weight = "Weight (lbs.)";
  set sashelp.class;
  length age 3;
run;

weight=. Name=  Sex=  Age=. Height=. _ERROR_=0 _N_=1

The variable weight is encountered first, in the LABEL statement. The rest of the class variables are encountered next (including weight again), implicit in the SET statement. And finally, the length attribute of age - already set to length 8 from the SET statement - is reset to length 3.

Here the executable PUT statement is limited to the first observation, so we only see the PDV at the end of the compile phase.

Part of the output from PROC CONTENTS shows us the variable attributes.

ods select position;
proc contents data=new position;
run;

                          The CONTENTS Procedure

                        Variables in Creation Order
 
               #    Variable    Type    Len    Label

               1    weight      Num       8    Weight (lbs.)
               2    Name        Char      8                 
               3    Sex         Char      1                 
               4    Age         Num       3                 
               5    Height      Num       8                 

Execution Phase

In the execution phase, your data is loaded one observation at a time, the actual data manipulation is done, and the output is written to disk. The execution phase very much depends on the actual data, and can include conditional execution (IFs) and within-observation loops (DOs).
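
A small sketch of both kinds of execution-time logic is shown here. The data set name projected, the variables group, projected_height, and year, and the 2% growth rate are all hypothetical and chosen only for illustration.

```sas
data projected;
  set sashelp.class(obs=3);

  /* conditional execution: the branch taken depends on this observation */
  length group $ 7;
  if age ge 14 then group = "older";
  else group = "younger";

  /* within-observation loop: runs three times for every observation */
  projected_height = height;
  do year = 1 to 3;
    projected_height = projected_height * 1.02;  /* hypothetical 2% growth per year */
  end;
  drop year;   /* the loop counter is not needed in the output */
run;
```

The DO loop here finishes completely for one observation before SAS moves on to the next one.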

Order of Execution / PDV

Once SAS reaches the execution phase, it loads one observation at a time into the PDV and executes the DATA step statements in order before loading the next observation.

To see this in action, we can add several PUT _ALL_ statements, and look at the log output.

In this example we can examine

  • the values in the PDV at the beginning (“top”) of the DATA step
  • the values in the PDV after a new observation is read (SET)
  • the values in the PDV after the assignment statement (bmi)

This final set of values (at the “bottom” of the DATA step) is what is OUTPUT to the new data set. SAS then RETURNs to the top and tries to read another observation.

(We’ll limit this to just three observations.)

data new;
  put "Observation " _n_;
  put _all_;

  set sashelp.class(obs=3);
  put _all_;

  bmi = (weight/height**2)*703;
  put _all_;
  put; *add a blank line;
run;

Observation 1
Name=  Sex=  Age=. Height=. Weight=. bmi=. _ERROR_=0 _N_=1
Name=Alfred Sex=M Age=14 Height=69 Weight=112.5 bmi=. _ERROR_=0 _N_=1
Name=Alfred Sex=M Age=14 Height=69 Weight=112.5 bmi=16.611531191 _ERROR_=0
_N_=1

Observation 2
Name=Alfred Sex=M Age=14 Height=69 Weight=112.5 bmi=. _ERROR_=0 _N_=2
Name=Alice Sex=F Age=13 Height=56.5 Weight=84 bmi=. _ERROR_=0 _N_=2
Name=Alice Sex=F Age=13 Height=56.5 Weight=84 bmi=18.498551179 _ERROR_=0
_N_=2

Observation 3
Name=Alice Sex=F Age=13 Height=56.5 Weight=84 bmi=. _ERROR_=0 _N_=3
Name=Barbara Sex=F Age=13 Height=65.3 Weight=98 bmi=. _ERROR_=0 _N_=3
Name=Barbara Sex=F Age=13 Height=65.3 Weight=98 bmi=16.156788436 _ERROR_=0
_N_=3

Observation 4
Name=Barbara Sex=F Age=13 Height=65.3 Weight=98 bmi=. _ERROR_=0 _N_=4
NOTE: There were 3 observations read from the data set SASHELP.CLASS.
NOTE: The data set WORK.NEW has 3 observations and 6 variables.

There are a few details worth observing here.

  • At the top of the first observation, all the data values are missing. This is the PDV at the end of the compile phase.

  • In the next two steps, one observation is read and bmi is calculated.

  • At the “bottom” (after the final PUT), the observation is OUTPUT and the loop RETURNs.

  • At the top of the second observation, some values are carried over from the previous observation, while bmi is reset to missing. We’ll come back to this, below.

  • Data values from the next observation (SET) replace those from the previous observation. Then bmi is calculated, etc.

  • SAS goes back to the top for a fourth observation, but finding no next observation it exits the DATA step at the SET statement (without executing further statements or writing to the output data set).

The Implicit Loop

As seen in the previous example, a SAS DATA step is an implicit loop-over-observations, where observations are read or constructed one at a time, output, and the loop begins again. The loop ends when SAS no longer finds a new observation to read.
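
Two short sketches may make the loop easier to see. The first step reads no data at all, so it runs through its statements exactly once. The second counts iterations using the END= option of the SET statement; the flag name last is an arbitrary choice.

```sas
/* No input data: the step executes a single time */
data _null_;
  put "Iteration " _n_;
run;

/* With input data: one iteration per observation read */
data _null_;
  set sashelp.class end=last;   /* last is 1 only on the final observation */
  if last then put "Iterations with data: " _n_;
run;
```

The first step should write Iteration 1 to the log. The second should report 19, the number of observations in sashelp.class.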

OUTPUT and RETURN

Implicit in most DATA steps are an OUTPUT statement and a RETURN statement. Making these explicit gives us more control over these features of the DATA step.

Suppose you write a DATA step that says:

data copy;
  set sashelp.class;
  bmi = (weight/height**2)*703;
run;

This creates a data set named copy from the data set class located in the sashelp library, adding the calculated variable bmi. We can rewrite this as

data copy;
  set sashelp.class;
  bmi = (weight/height**2)*703;
  output;
  return;
run;

This is just the same as the preceding code.

It is possible to have more than one OUTPUT statement per DATA step. We use this when we want to create more than one data set with one DATA step (an efficient way to create disjoint subsets), when we want to turn one input observation into multiple output observations (“reshaping” data), and when we are creating purely simulated data.

For example, to create two subsets we would use two conditional OUTPUT statements:

data girls boys;
  set sashelp.class;
  bmi = (weight/height**2)*703;
  if sex eq "F" then output girls;
    else if sex eq "M" then output boys;
run;

proc print data=girls(obs=5);
  var name sex weight height bmi;
run;

           Obs    Name       Sex    Weight    Height      bmi

            1     Alice       F       84.0     56.5     18.4986
            2     Barbara     F       98.0     65.3     16.1568
            3     Carol       F      102.5     62.8     18.2709
            4     Jane        F       84.5     59.8     16.6115
            5     Janet       F      112.5     62.5     20.2464

proc print data=boys(obs=5);
  var name sex weight height bmi;
run;

           Obs    Name       Sex    Weight    Height      bmi

             1    Alfred      M      112.5     69.0     16.6115
             2    Henry       M      102.5     63.5     17.8703
             3    James       M       83.0     57.3     17.7715
             4    Jeffrey     M       84.0     62.5     15.1173
             5    John        M       99.5     59.0     20.0944
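
The other two uses of multiple OUTPUT statements mentioned earlier can be sketched in the same way. The data set names long and squares, and the variables measure, value, i, and square, are made up for illustration.

```sas
/* Reshaping: each input observation becomes two output observations */
data long;
  set sashelp.class(obs=2);
  length measure $ 6;
  measure = "height"; value = height; output;
  measure = "weight"; value = weight; output;
  keep name measure value;
run;

/* Constructed data: no input data set, one OUTPUT per pass through the DO loop */
data squares;
  do i = 1 to 5;
    square = i**2;
    output;
  end;
run;
```

Here long should end up with four observations (two per student), and squares with five.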

Notice that where we place the OUTPUT statement matters. In the next example, we output the data and then calculate the bmi!

data copy;
  set sashelp.class;
  output;
  bmi = (weight/height**2)*703;
  return;
run;

proc print data=copy(obs=5);
  var name weight height bmi;
run;

                 Obs     Name      Weight    Height    bmi

                   1    Alfred      112.5     69.0      . 
                   2    Alice        84.0     56.5      . 
                   3    Barbara      98.0     65.3      . 
                   4    Carol       102.5     62.8      . 
                   5    Henry       102.5     63.5      . 

This has written the data to copy before each bmi value has been calculated. Not very useful! (But also, SAS produces no error message. If you add a PUT _ALL_ at the bottom of the DATA step, you will see that SAS indeed calculates bmi … it just never gets into the output data set!)

We could also move the RETURN statement (but we seldom see this in practice).
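
As a sketch of what moving RETURN can look like, the step below returns early for boys. Because the step has no explicit OUTPUT statement, the early RETURN should still write the observation before going back to the top, so every student is kept but bmi is calculated only for girls.

```sas
data copy;
  set sashelp.class;
  if sex eq "M" then return;      /* back to the top; the observation is still output */
  bmi = (weight/height**2)*703;   /* reached only when sex is not "M" */
run;
```

If the step also contained an explicit OUTPUT statement, the early RETURN would skip the output as well, so it is worth checking the result with PROC PRINT.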

The Retain Flag

As seen in the previous examples, some values are retained from one observation to the next, while others are not. By default, variables that are input from other data sets (SET) are retained, while newly created variables are reset to missing at the top of each iteration of the DATA step.

It is sometimes useful to carry over the values of a new variable from a previous observation - this is useful in calculating cumulative values, change values, rank orders, etc. SAS has a variety of methods for these sorts of calculations, and retaining values across iterations of the DATA step is one such method.

An attribute of each variable in the PDV is its retain flag. The retain flag is automatically set to yes for all variables that come from the input data set. For new variables, you can set it using the RETAIN statement. (To “un-retain” a variable, you must explicitly reset it to missing in the DATA step.)

retain x;

will set the Retain flag to yes for the variable x. You can also set an initial value:

retain x 5;

For example, to calculate the “cumulative years of experience” (sum of the ages) in the class data we might write

data class;
  set sashelp.class;
  retain total_years 0; /*initialize at 0*/
  total_years = total_years + age;
run;

proc means data=class max;
  var total_years;
run;

                            The MEANS Procedure

                     Analysis Variable : total_years 
 
                                    Maximum
                               ------------
                                253.0000000
                               ------------
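
Retained variables also support the "change values" mentioned above. A minimal sketch, assuming we want each student's height difference from the previous student; the names change, prev_height, and height_change are made up for illustration.

```sas
data change;
  set sashelp.class(obs=3);
  retain prev_height;                      /* keeps its value across iterations */
  height_change = height - prev_height;    /* missing for the first observation */
  prev_height = height;                    /* remember this height for the next iteration */
run;
```

With the first three observations (heights 69, 56.5, and 65.3), height_change should be missing, then -12.5, then 8.8. The order of the last two statements matters: the difference has to be taken before prev_height is overwritten.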

The Sum Statement

SAS gives you an easy shortcut for cumulative sums: the sum statement. The syntax is simply:

var + expression;

Note that there is no equals sign. This is similar to our RETAIN example

retain var;
var = var + expression;

But if you use the sum statement, SAS will do several things for you automatically. First, it will set the Retain flag for the variable to yes, and give it an initial value of zero.

It will also set the Missing Protect flag to yes. Normally if you add a missing value to anything the result is a missing value. But if the Missing Protect flag is set to yes, missing values are treated like zeroes, just as they are by the SUM function. You'll have to decide if this is appropriate for your analysis or not. But without this protection, a single missing value will make the sum missing for all subsequent observations.
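
The difference is easiest to see side by side. In this sketch, which uses three made-up data values, plain is accumulated with RETAIN and an assignment, while protected is accumulated without an equals sign.

```sas
data sums;
  input x;
  retain plain 0;
  plain = plain + x;   /* becomes missing once a missing x is added */
  protected + x;       /* no equals sign: a missing x is treated as zero */
  datalines;
1
.
3
;
run;

proc print data=sums;
run;
```

Here plain should be 1 and then missing for the remaining observations, while protected should be 1, 1, and 4.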

Last Revised: 9/03/2024