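SAS provides a great deal of flexibility for reading text data. This is accomplished through a DATA step. The basic elements are a DATA statement, which names the output data set, an INFILE statement, which identifies the source of the text, and an INPUT statement, which names the variables to read.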
Text Data Sources
In practice, text data is most often read from an external file, either on a drive attached to your computer, or on the internet. In the SAS documentation, the data is often included in the SAS program (in-line data).
External files
Suppose I had a text file on my Z drive named "class.txt". It has one observation per line (record), and data values are separated within each line by a space. Each record/observation includes 5 data values.
----- class.txt -----
Alfred M 14 69 112.5
Alice F 13 56.5 84
Barbara F 13 65.3 98
---------------------
The code to read in these data values, and to calculate BMI from them, would look like this:
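data class;
    infile "Z:/class.txt";                  /* external file to read */
    input Name $ Sex $ Age Height Weight;   /* $ marks a character variable */
    bmi = (weight / height**2) * 703;       /* weight in pounds, height in inches */
run;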
This code relies on many assumptions about the data, which we will begin to discuss below. The key elements here are the INFILE statement, which simply names a file to read, and the INPUT statement. The INPUT statement gives names to the variables in the output data set, class, and declares which variables are character variables. The type designation is made by adding a $ symbol after the names of character variables.
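One way to check how the INPUT statement typed each variable is to list the contents of the resulting data set. A minimal sketch, assuming the class data set has been created as described:

```sas
/* List the variables in the class data set with their types and lengths */
proc contents data=class;
run;
```

In the listing, Name and Sex should appear with type Char and the remaining variables with type Num.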
To read a text file from the internet, you first define a FILENAME reference to the URL. This takes the form
FILENAME filealias URL "url-specification";
A SAS filename reference (fileref), much like a library name, is used as an alias (shortcut). Filerefs serve many purposes beyond defining URLs: an unquoted fileref may be used wherever you can use a quoted file name.
filename w3 url "https://www.ssc.wisc.edu/sscc/pubs/data/dwr-read/class_noheader.csv";

data class;
    infile w3 dlm=',';
    input Name $ Sex $ Age Height Weight;
    bmi = (weight / height**2) * 703;
run;

proc means data=class n mean lclm uclm;
run;
                     The MEANS Procedure

                                     Lower 95%       Upper 95%
Variable     N            Mean     CL for Mean     CL for Mean
--------------------------------------------------------------
Age         19      13.3157895      12.5963445      14.0352344
Height      19      62.3368421      59.8656709      64.8080133
Weight      19     100.0263158      89.0496312     111.0030004
bmi         19      17.8632519      16.8546417      18.8718621
--------------------------------------------------------------
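A fileref can point at a local file as well as a URL. The following is a minimal sketch, assuming the class.txt file from the earlier example is still located at Z:/class.txt; the fileref name classtxt is an arbitrary choice made for this illustration.

```sas
/* Define an alias for the local file; classtxt is an arbitrary name */
filename classtxt "Z:/class.txt";

data class;
    infile classtxt;   /* unquoted fileref in place of the quoted path */
    input Name $ Sex $ Age Height Weight;
run;
```

Defining the path once in a FILENAME statement can be convenient when the same file is referenced in several steps.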
In-line Data
In the SAS documentation it is very common to see example data sets included in-line with a DATA step. You can omit the INFILE statement or use INFILE DATALINES. The DATALINES statement is the last statement in a DATA step (before any RUN statement). The data follow on subsequent lines and end at the first line that contains a semicolon.
data class;
    input Name $ Sex $ Age Height Weight;
    bmi = (weight / height**2) * 703;
    datalines;
Alfred M 14 69 112.5
Alice F 13 56.5 84
Barbara F 13 65.3 98
;

proc means data=class n mean stddev;
run;
             The MEANS Procedure

Variable     N            Mean        Std Dev
---------------------------------------------
Age          3      13.3333333      0.5773503
Height       3      63.6000000      6.4210591
Weight       3      98.1666667     14.2507310
bmi          3      17.0889569      1.2417386
---------------------------------------------
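INFILE DATALINES becomes useful when in-line data need INFILE options. As a sketch, here are the same three records rewritten with commas between values (an assumption made for this illustration), read with the DLM= option:

```sas
data class;
    infile datalines dlm=',';   /* apply INFILE options to in-line data */
    input Name $ Sex $ Age Height Weight;
    datalines;
Alfred,M,14,69,112.5
Alice,F,13,56.5,84
Barbara,F,13,65.3,98
;
```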
Text Data Layout
Text data comes in many forms. It is always a good idea to look at any documentation you have first. Then it can be informative to look at the text file itself, preferably in a dedicated text editor (on SSCC computers, use Notepad++ or VS Code).
You are looking for a few things when you examine the file.
Distinguish data, metadata, and extra text (notes)
The file includes data values. Does it also include variable names or other information that helps define the data? Is there a header or a footer with explanatory text about the file contents?
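When the first line of a file holds variable names rather than data, one common approach is to skip it with the FIRSTOBS= option on the INFILE statement. A minimal sketch, using in-line data with a header line added for illustration:

```sas
data class;
    infile datalines firstobs=2;   /* start reading at the second record */
    input Name $ Sex $ Age Height Weight;
    datalines;
Name Sex Age Height Weight
Alfred M 14 69 112.5
Alice F 13 56.5 84
Barbara F 13 65.3 98
;
```

A footer can be excluded in a similar way with the OBS= option, which gives the number of the last record to read.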
Observation delimiter
What separates one observation from the next? Commonly, each observation has a separate line in the text file, but it is possible to have multiple observations per line, or multiple lines per observation.
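Both of the less common layouts can be handled in the INPUT statement. The sketch below uses made-up arrangements of the class data: a trailing @@ for several observations on one line, and the / line pointer for one observation spread over two lines.

```sas
/* Several observations per line: the trailing @@ holds the current line
   so the next iteration of the DATA step keeps reading from it */
data ages;
    input Name $ Age @@;
    datalines;
Alfred 14 Alice 13 Barbara 13
;

/* Two lines per observation: / moves the pointer to the next line */
data class;
    input Name $ Sex $ / Age Height Weight;
    datalines;
Alfred M
14 69 112.5
Alice F
13 56.5 84
;
```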
Data value delimiter
Within an observation, what separates one data value from the next? Most commonly the data value delimiter will be a space or a comma. Tabs used to be common, and are hard to distinguish visually from spaces.
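A tab delimiter is usually given to SAS as a hexadecimal constant. A minimal sketch, assuming a hypothetical tab-separated copy of the class data saved as Z:/class_tab.txt:

```sas
data class;
    /* Z:/class_tab.txt is a hypothetical file name; '09'x is the tab character */
    infile "Z:/class_tab.txt" dlm='09'x;
    input Name $ Sex $ Age Height Weight;
run;
```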
Especially in older data sets, it used to be common for data values to appear in specified columns - e.g. state in columns 3-4 and county in columns 5-7 - with no character separating data values.
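Fixed-column layouts like this can be read with column input, where each variable name is followed by its column range. A minimal sketch with illustrative records, assuming columns 1-2 hold a postal abbreviation that is not read:

```sas
data counties;
    /* Read as character so that leading zeros are preserved */
    input state $ 3-4 county $ 5-7;
    datalines;
WI55025
WI55079
;
```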
Character value quote
Where data value delimiters are used, how are the same characters included in character data values? For instance, if the data values are separated by spaces, how do you include a space within a data value? The typical answer is that character data values are enclosed in quotes, either double (") or single (') quotes.
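In SAS, the DSD option on the INFILE statement is one way to read quoted values: a delimiter inside quotes is treated as part of the value, and the quotes are removed. A minimal sketch with made-up names; the LENGTH statement is added here because list input otherwise gives character variables a length of 8.

```sas
data people;
    length Name $ 20;                /* room for names longer than 8 characters */
    infile datalines dsd dlm=' ';    /* space-delimited, quoted values allowed */
    input Name $ Age;
    datalines;
"Mary Ann" 13
"John Paul" 14
;
```

Note that with DSD, two consecutive delimiters are read as a missing value, so values should be separated by a single space.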
Missing value string
How are missing values indicated? This might be by having two data value delimiters with no data value between them. Or there might be a special character string that denotes missing data, such as a period (.), -99, or BBBBBBB. There may be more than one missing value indicator as well, such as -98 and -99.
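Two of these cases can be sketched in SAS as follows, using made-up data and variable names. The DSD option reads consecutive delimiters as a missing value, and numeric codes such as -98 and -99 can be recoded to the SAS missing value (a period) after they are read.

```sas
/* Consecutive delimiters: with DSD, ",," is read as a missing value */
data scores;
    infile datalines dsd;            /* DSD makes the comma the default delimiter */
    input id test1 test2;
    datalines;
1,85,90
2,,78
;

/* Numeric codes: recode -98 and -99 to the SAS missing value (.) */
data survey;
    input id income age;
    array nums {*} income age;
    do i = 1 to dim(nums);
        if nums{i} in (-98, -99) then nums{i} = .;
    end;
    drop i;
    datalines;
1 52000 34
2 -99 41
3 . -98
;
```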