Reading Text Data

SAS provides a great deal of flexibility for reading text data. This is accomplished through a DATA step. The basic elements are

  • an INFILE statement, usually specifying a file and some details of how to parse it, and
  • an INPUT statement, naming the variables to create and how to find specific data values.

DATA mydata;
    INFILE file-specification;
    INPUT variables-specifications;
    /* any other DATA step statements */
    RUN;

Text Data Sources

In practice, text data is most often read from an external file, either on a drive attached to your computer, or on the internet. In the SAS documentation, the data is often included in the SAS program (in-line data).

External files

Suppose I have a text file on my Z drive named “class.txt”. It has one observation per line (record), and data values within each line are separated by a space. Each record/observation includes 5 data values.

----- class.txt -----
Alfred M 14 69 112.5
Alice F 13 56.5 84
Barbara F 13 65.3 98
---------------------

The code to read these data values and calculate BMI from them would look like this:

data class;
   infile "Z:/class.txt";
   input Name $ Sex $ Age Height Weight;
   bmi=(weight/height**2)*703;
   run;

This code makes many assumptions about the data, which we will begin to discuss below. The key elements here are the INFILE statement, which simply names a file to read, and the INPUT statement. The INPUT statement names the variables in the output data set, class, and declares which of them are character variables. A variable is designated as character by adding a $ symbol after its name.
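
One of those assumptions can be sketched briefly. With this simple form of INPUT, a character variable gets a default length of 8, so a longer value is truncated. The example below uses in-line data with a made-up record (not part of class.txt) and adds a LENGTH statement before INPUT to make room. The data set name longnames and the length of 20 are arbitrary choices for illustration.

data longnames;
   length Name $ 20;    /* without this, Name would hold only 8 characters */
   input Name $ Sex $ Age Height Weight;
   datalines;
Christopher M 14 69 112.5
;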

To read a text file from the internet, you first define a FILENAME reference to the URL. This takes the form

FILENAME filealias URL "url-specification";

A SAS filename reference (fileref), much like a library name, is an alias (shortcut). Filerefs serve many purposes beyond pointing to URLs: an unquoted fileref may be used wherever you can use a quoted file name.

filename w3 url "https://www.ssc.wisc.edu/sscc/pubs/data/dwr-read/class_noheader.csv";

data class;
   infile w3 dlm=',';
   input Name $ Sex $ Age Height Weight;
   bmi=(weight/height**2)*703;
   run;

proc means data=class n mean lclm uclm;
run;

                            The MEANS Procedure

                                           Lower 95%       Upper 95%
      Variable     N            Mean     CL for Mean     CL for Mean
      --------------------------------------------------------------
      Age         19      13.3157895      12.5963445      14.0352344
      Height      19      62.3368421      59.8656709      64.8080133
      Weight      19     100.0263158      89.0496312     111.0030004
      bmi         19      17.8632519      16.8546417      18.8718621
      --------------------------------------------------------------
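
A fileref can stand in for a local file as well. As a minimal sketch, assuming the Z:/class.txt file from the earlier example and a hypothetical alias named cls (filerefs are limited to 8 characters):

filename cls "Z:/class.txt";

data class;
   infile cls;    /* unquoted fileref in place of the quoted path */
   input Name $ Sex $ Age Height Weight;
   bmi=(weight/height**2)*703;
   run;

Defining the path once in a FILENAME statement can be convenient when the same file is read in several steps, since only one line changes if the file moves.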

In-line Data

In the SAS documentation it is very common to see example data sets included in-line with a DATA step. You can omit the INFILE statement or use INFILE DATALINES. The DATALINES statement must be the last statement in the DATA step (before any RUN statement). Data follow on subsequent lines until a line containing a semicolon is found.

data class;
   input Name $ Sex $ Age Height Weight;
   bmi=(weight/height**2)*703;
   datalines;
Alfred M 14 69 112.5
Alice F 13 56.5 84
Barbara F 13 65.3 98
;

proc means data=class n mean stddev;
run;

                            The MEANS Procedure

               Variable    N            Mean         Std Dev
               ---------------------------------------------
               Age         3      13.3333333       0.5773503
               Height      3      63.6000000       6.4210591
               Weight      3      98.1666667      14.2507310
               bmi         3      17.0889569       1.2417386
               ---------------------------------------------
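
When in-line data needs parsing options, INFILE DATALINES is the place to put them. A minimal sketch, assuming the same three records rewritten with commas between the values:

data class;
   infile datalines dlm=',';    /* DATALINES stands in for a file name */
   input Name $ Sex $ Age Height Weight;
   bmi=(weight/height**2)*703;
   datalines;
Alfred,M,14,69,112.5
Alice,F,13,56.5,84
Barbara,F,13,65.3,98
;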

Text Data Layout

Text data comes in many forms. It is always a good idea to look at any documentation you have first. Then it can be informative to look at the text file itself, preferably in a dedicated text editor (on SSCC computers, use Notepad++ or VS Code).

You are looking for a few things when you examine the file.

  • Distinguish data, metadata, and extra text (notes)

    The file includes data values. Does it also include variable names or other information that helps define the data? Is there a header or a footer with explanatory text about the file contents?

  • Observation delimiter

    What separates one observation from the next? Commonly, each observation has a separate line in the text file, but it is possible to have multiple observations per line, or multiple lines per observation.

  • Data value delimiter

    Within an observation, what separates one data value from the next? Most commonly the data value delimiter will be a space or a comma. Tabs used to be common, and are hard to distinguish visually from spaces.

    Especially in older data sets, data values commonly appear in specified columns (e.g. state in columns 3-4 and county in columns 5-7) with no character separating them.

  • Character value quote

    Where data value delimiters are used, how are the same characters included in character data values? For instance, if the data values are separated by spaces, how do you include a space within a data value? The typical answer is that character data values are enclosed in quotes, either double (") or single (').

  • Missing value string

    How are missing values indicated? This might be by having two data value delimiters with no data value between them. Or there might be a special character string that denotes missing data, such as a period (.), -99, or BBBBBBB. There may also be more than one missing value indicator, such as -98 and -99.
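
Several of these layout features map onto INFILE and INPUT options. The sketch below is illustrative only: the data set name, variable names, and records are made up. It assumes a comma-delimited layout with a header line, quoted character values, an empty field for one missing value, and -99 as a second missing value code.

data survey;
   /* DSD: commas delimit values, quotes protect embedded commas,  */
   /* and two delimiters in a row mean a missing value.            */
   /* FIRSTOBS=2 skips the header line.                            */
   infile datalines dsd firstobs=2;
   length Name $ 20;
   input Name $ Age Income;
   if Income = -99 then Income = .;    /* recode the missing value code */
   datalines;
name,age,income
"Smith, Ann",34,52000
"Lee, Bo",,-99
;

For data laid out in fixed columns, column input gives the column range for each variable. This is again a sketch with made-up records, where columns 1-2 hold a field that is not read. Both variables are read as character so that leading zeros in the codes are kept.

data counties;
   input state $ 3-4 county $ 5-7;    /* state in columns 3-4, county in 5-7 */
   datalines;
0155025
0155079
;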