6Variable Transformations with Text Data (Strings)
Some researchers never have to work with text data, known as strings in programming jargon, but for others it’s a regular part of their research. Feel free to skip this chapter if you are in the former group. While string transformations can get very complicated, it’s possible to do a lot with just a few simple functions. In this chapter we’ll carry out some common tasks using those functions.
Note: when we talk about “text data” in this chapter we mean data made up of letters and words rather than numbers. (There is a section on removing non-numeric text from numeric data.) That’s different from text files, like CSVs, that can contain both text and numeric data.
6.1 Setting Up
Start a do file called text_data that loads the auto_strings data set. The make variable is a text variable giving the make of the car, and price is the price of the car–but currently stored as a string.
-------------------------------------------------------------------------------
name: <unnamed>
log: /home/r/rdimond/kb/dws/text_data.log
log type: text
opened on: 23 Feb 2024, 11:37:39
(1978 automobile data)
+------------------------+
| make price |
|------------------------|
1. | AMC Concord $4,099 |
2. | AMC Pacer $4,749 |
3. | AMC Spirit $3,799 |
4. | Buick Century $4,816 |
5. | Buick Electra $7,827 |
+------------------------+
6.2 Standardizing Case
Stata’s string functions are all case sensitive, but in many data sets case is not important. For example “AMC Concord”, “amc concord” and “AMC CONCORD” would presumably all refer to the same car. A valuable step in preparing many text data sets for analysis is to standardize the case, usually by converting everything to lower case. This is especially true if your data contains answers written by subjects themselves.
6.2.1 strlower()
The strlower() function will standardize case for you: you pass in a string and it returns the string converted to lower case. Like almost all string functions, the argument can be either a variable containing strings or a string literal, i.e. just some text in quotes:
While standardizing to lower case is the most common, the strupper() function converts strings to upper case, while strproper capitalizes the first letter of each word:
displaystrupper("AMC Concord")
AMC CONCORD
displaystrproper("AMC Concord")
Amc Concord
6.3 String Cleaning
Often strings need to be cleaned up before they are used, such as standardizing abbreviations or correcting misspellings.
6.3.1 subinstr()
subinstr() takes four arguments: a string to act on, a string to replace, a string to replace it with, and the number of replacements to make. (The fourth argument is very easy to forget.) For example:
Note that only the first instance of “cat” was replaced by “fish” because the fourth argument said to only replace one instance of “cat”. If the fourth argument is a missing value (.) then all instances will be replaced:
If you want to use the full company name instead, you can do that with subinstr():
replace make = subinstr(make, "vw", "volkswagen", 1)list make in 70/73
variable make was str18 now str19
(4 real changes made)
+---------------------+
| make |
|---------------------|
70. | volkswagen dasher |
71. | volkswagen diesel |
72. | volkswagen rabbit |
73. | volkswagen scirocco |
+---------------------+
Six of the cars in this data set were built by Chevrolet, but there are several different abbreviations used:
list make in 14/19
+-------------------+
| make |
|-------------------|
14. | chev chevette |
15. | chev impala |
16. | chev. malibu |
17. | chev. monte carlo |
18. | chevy monza |
|-------------------|
19. | chevy nova |
+-------------------+
This is common in real-world data, especially data collected directly from subjects. If you’re interested in talking about the companies in this data set, an important step in cleaning it will be to standardize the company names. That means each company should only appear in one form: either the full name or a standard abbreviation.
In this case, let’s replace all the various abbreviations with the full name (Chevrolet). Here’s how not do do that:
gen make_oops = subinstr(make, "chev", "chevrolet", .)list make_oops in 14/19
+-------------------------+
| make_oops |
|-------------------------|
14. | chevrolet chevroletette |
15. | chevrolet impala |
16. | chevrolet. malibu |
17. | chevrolet. monte carlo |
18. | chevrolety monza |
|-------------------------|
19. | chevrolety nova |
+-------------------------+
Oops.
There are actually two problems here. The first is using a missing value for the last subinstr() argument, meaning that all instances of “chev” will be replaced. That’s what turned the car model “chevette” into “chevroletette”. The company name only appears once in make, so you only want to replace one instance of its abbreviation. Get in the habit of thinking how many times a replace should happen and use that number for the last argument of subinstr() rather than lazily always passing in ..
The second is that “chev” was replaced by “chevrolet” wherever it appeared, so “chev.” became “chevrolet.” and “chevy” became “chevrolety”.
6.3.2 subinword()
We want to replace “chev” with “chevrolet” only when it appears as a complete word, not when it’s part of a word. The subinword() function can do that for us. It takes the exact same arguments as subinstr() and does the same thing, except that it will only replace complete words:
replace make = subinword(make, "chev", "chevrolet", 1)list make in 14/19
(2 real changes made)
+--------------------+
| make |
|--------------------|
14. | chevrolet chevette |
15. | chevrolet impala |
16. | chev. malibu |
17. | chev. monte carlo |
18. | chevy monza |
|--------------------|
19. | chevy nova |
+--------------------+
Now you can finish the job:
replace make = subinword(make, "chev.", "chevrolet", 1)replace make = subinword(make, "chevy", "chevrolet", 1)list make in 14/19
variable make was str19 now str21
(2 real changes made)
(2 real changes made)
+-----------------------+
| make |
|-----------------------|
14. | chevrolet chevette |
15. | chevrolet impala |
16. | chevrolet malibu |
17. | chevrolet monte carlo |
18. | chevrolet monza |
|-----------------------|
19. | chevrolet nova |
+-----------------------+
You can also use subinstr() or subinword() to correct misspellings. Unfortunately, there’s no Stata function that can identify all the misspelled words in a data set and tell you what they are supposed to be. Nor is there a function that can identify all the words that are abbreviations for “Chevrolet.” AI language models may be relevant here, but they are well beyond the scope of this chapter. For now, you’ll have to read your data to find all the things that need to be changed.
6.4 Cleaning Numeric Data
Another common task is cleaning up numeric data that’s been stored as text so it can be converted into numbers. The price variable in this data set is a typical example
price should be a numeric variable. But if you try to use destring on it, it fails because of the $ and , characters:
destring price, replace
price: contains nonnumeric characters; no replace
The solution is to remove those characters. You can do this with subinstr(): you just need to replace $ and , with nothing. If the third argument for subinstr() is "", an opening quote immediately followed by a closing quote with not even a space in between, then the second argument will simply be removed.
replace price = subinstr(price, "$", "", 1)list make price in 1/5
All of the prices in this data set only have one comma in them, but if a car’s price had two commas (i.e. it cost a million dollars or more) you’d want to remove both of them. In fact you want to remove all commas no matter how many there are. That makes this a time when you legitimately should use a missing value for the last subinstr() argument.
replace price = subinstr(price, ",", "", .)list make price in 1/5
You can now successfully destring price as it now only contains numbers:
destring price, replace
price: all characters numeric; replaced as int
6.5 Is String X in String Y?
A common task with text data is identifying whether one string contains another. For example, if make contains the word amc (after standardizing case to lower) then the car was built by AMC.
6.5.1 strpos()
The strpos() (string position) function takes two strings as arguments. The result will be the position of the second string within the first string, or zero if the first string does not contain the second string. This lets you use the result of a strpos() to say if one string contains another. Some examples:
displaystrpos("amc concord", "amc")
displaystrpos("amc concord", "buick")
0
displaystrpos("amc concord", "concord")
5
Note that the position of “concord” within “amc concord” is defined as the position of the first character of “concord” (i.e. the first “c”).
If strpos(string1, string2) is greater than 0, then string2 is in string1. This lets you easily create indicator variables:
gen amc = (strpos(make, "amc") > 0)list make amc in 1/5
This gives you an indicator variable for “This car was made by AMC.”
6.6 Parsing
Frequently strings contain multiple pieces of information. Parsing is the task of identifying the meaning of each piece of information, and often splitting the string up into separate variables. For example, make contains first the name of the company that built the car and then the model of the car.
6.6.1 word()
The word() function takes a string, and the number of the word you want out of it. It then returns that word:
displayword("amc concord", 1)
amc
displayword("amc concord", 2)
concord
This makes it very easy to parse strings by word. For example, you can create a variable containing the manufacturer of each car with:
gen manufacturer = word(make, 1)list make manufacturer in 1/5, ab(15)
If all the model names were just one word, you could easily use word() to identify model as well. Unfortunately they are not:
list make in 17
+-----------------------+
| make |
|-----------------------|
17. | chevrolet monte carlo |
+-----------------------+
displayword(make[17], 2)
monte
You can easily create a model variable if you recognize that the car model is everything in make except the manufacturer, so all you need to do is remove the manufacturer.
(1 missing value generated)
+-----------------------------------------------------+
| make manufacturer model |
|-----------------------------------------------------|
17. | chevrolet monte carlo chevrolet monte carlo |
+-----------------------------------------------------+
6.6.2 String Concatenation (Addition)
All done? Not quite. There’s a problem with model, one that’s common but very hard to see. One way to make it visible is to put a star at the beginning and end of model. Putting one string after another is called concatenation, but in Stata it’s done using the plus sign:
gen model_stars = "*" + model + "*"list model_stars in 1/5, ab(15)
Note how there’s a space after the first star, but not before the last star. That’s because model actually starts with a space. The original make variable contains the name of the manufacturer followed by a space followed by the name of the model, so removing manufacturer left the space. This can sometimes cause major headaches, as things that seem like they ought to match just don’t. For example, you might think you could identify the AMC Concord with if model=="concord" but it doesn’t work:
list make ifmodel=="concord"
This gives no results because no observation has “concord” for model. You actually need:
Don’t leave your data like this, just waiting to trip up an unwary programmer (like yourself six months from now). The strtrim() function will remove any spaces at the beginning or end of a string:
replacemodel = strtrim(model)replace model_stars = "*" + model + "*"list model_stars in 1/5, ab(15)
After using strtrim(), recreating model_stars shows us there are no extraneous spaces around model.
Close relatives of strtrim() include strltrim(), which only removes spaces before the string (i.e. on the left), strrtrim() which only removes spaces after the string (i.e. on the right) and stritrim() which replaces multiple spaces inside a string with a single space:
display stritrim("Too many spaces")
Too many spaces
What makes parsing make into manufacturer and model relatively easy is that manufacturer is the first word of make, so we can rely on the word() function to identify it. But what if the string you’re parsing isn’t made up of words separated by spaces?
6.6.4 subinstr() Combined With word()
Consider the problem of trying to parse out the area code from a telephone number. Start by creating a variable phone containing the phone number of the SSCC Help Desk:
gen phone = "608-262-9917"
The area code is separated from the rest of the phone number by a dash. But if you replace all dashes with spaces, you can use word() again:
gen temp = subinstr(phone, "-", " ", .)gen area_code = word(temp, 1)list phone temp area_code in 1, ab(15)
This method, replacing the actual separator with spaces and using word(), is easy to use but will cause trouble if the bits of information themselves contain spaces. So we’ll look at a couple of alternatives that don’t rely on special characters later.
6.6.5 split
The split command breaks up a string into all its component words, creating a new variable for each:
We could now rename temp1 to area_code if we wanted to. But split can also break up a string based on a character other than spaces using the parse() option, allowing us to split phone directly:
Since the resulting strings contain numbers, we could add the destring option and split would create the new variables as numeric.
6.6.6 substr()
Area codes are exactly three digits long, so another easy method is to just extract the first three characters of the phone number. This can be done with substr() (substring). The substr() function takes three aarguments: the string to act on, the starting point of the substring to extract, and the number of characters to extract. Thus:
gen area_code2 = substr(phone, 1, 3)list phone area_code2 in 1, ab(15)
The combination of substr() and strpos() is very powerful because you can select a substring based on the content of the original string. That allows you parse just about anything.
Consider the famous opening of Charles Dickens’ A Tale of Two Cities:
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair…
That’s a lot of things that “it was”! If you’re reading this web book you are presumably a data geek who would much prefer to have all those things in a nice neat data set. So let’s do that.
The easy way would be to do a bit of clean up and then use split with the parse string , it was. But you already know how to use split so we’ll use a more general method.
[Load the data set]
clearallsetobs 1gen tale = "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair"
Number of observations (_N) was 0, now 1.
The first step is to make a copy of the tale variable called temp. As you’ll see, it will be highly convenient to parse out one item from the list and then delete it before moving on to the next. Consuming each item in turn makes for easy repetition, eventually in the form of a loop. But you want to be able to check your work, so keep tale around in its original form. Making a copy of the original string is also a convenient time to standardize the case to lower.
gen temp = lower(tale)
Now take a look at the text:
list temp
+------------------------------------------------------------------------+
| temp |
|------------------------------------------------------------------------|
1. | it was the best of times, it was the worst of times, it was the age .. |
+------------------------------------------------------------------------+
We have here a list of items separated by the phrase “it was.” But there’s no need for the string to start with a separator and it will cause problems for what we do next, so remove the first instance of “it was” by using subinstr() to replace it with nothing. Be sure to include the space after “was” in the text to remove.
replace temp = subinstr(temp, "it was ", "", 1)list temp
Now the first item in the list goes from the start of the string to the first instance of “it was”. Put it in the variable it_was1 with:`
gen it_was1 = substr(temp, 1, strpos(temp, "it was")-3)list it_was1
The first two arguments of substr() specify that the result will be part of the string temp, starting from character 1. The third argument is then the number of characters to include, which is calculated by strpos(). strpos(temp, "it was") gives the location of the start of the first instance of “it was” within temp. We subtract three from that because we don’t want our substring to include the “i” from “it was” or the comma and space that come before it. The result is the first item, it_was1.
Now remove it_was1 plus the following comma and space from temp:
Now temp looks just like it did when we started, other than the second item now being first. So we can repeat the exact same commands to get it_was2: remove the initial “it was”, select out it_was2, and then remove it from temp.
We could continue repeating this code until we ran out of items. Note that the last item does not end with “it was” so you need to just store it as-is.
6.6.8 Parsing With a Loop
This task is practically begging to be done using a loop. If you haven’t learned about loops and macros, come back to this section once you have.
Since we don’t know ahead of time how many items there are to parse, this is a job for a while loop. Have it keep going until temp is empty with while (temp!="").
But we also need to number the items, so we’ll create a macro i and increment it (increase it by 1) each time through the loop. It is then added to the it_was variable name.
If temp still contains “it was”, then we have not yet reached the last item and the next item can be parsed out just like we’ve done it before. If temp does not contain “it was”, then the final item can be set to the remaining content of temp.
clearallsetobs 1gen tale = "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair"gen temp = lower(tale)local i 1while (temp!="") {replace temp = subinstr(temp, "it was ", "", 1)if (strpos(temp, "it was")>0) {gen it_was`i' = substr(temp, 1, strpos(temp, "it was")-3) }else {gen it_was`i' = temp }replace temp = substr(temp, strpos(temp, "it was"), .)local i = `i'+1}list
Wide form data sets are always ugly, so let’s reshape this to long form (if you haven’t read about reshape yet, come back when you have). Reshape needs an level one identifier even though the data set has just one observation, so we’ll need to create one and just set it to 1. Then we can reshape and drop the variables that are no longer needed to get the nice, neat data set promised.
gen i = 1reshapelong it_was, i(i) j(j)drop i j tale templist
The advantage of this method of looping over the string and consuming one item at a time is that the conditions for identifying an item can be complex. If they’re not complex, then split can do it much more easily.
Exercise
The names of Santa’s reindeer are given in Clement C. Moore’s Account of a Visit from St. Nicholas (better known as The Night Before Christmas) when Santa calls:
Now Dasher! now, Dancer! now, Prancer and Vixen!
On, Comet! On, Cupid! on, Donner and Blitzen!
Turn this into a data set of reindeer with eight observations. The easy way is to remove all punctuation and words other than reindeer names (you can use loops to make this easier) and then use split.
Make sure the reindeer names all end up with the proper case!
Solution
Start by loading the data, standardizing the case, and removing everything from the text that isn’t a reindeer name. That last step can be done using two loops: one that loops over all the punctuation characters and removes them with subinstr(), and one that loops over all the extraneous words and removes them with subinword().
clearsetobs 1gen reindeer = "Now Dasher! now, Dancer! now, Prancer and Vixen! On, Comet! On, Cupid! on, Donner and Blitzen!"replace reindeer = lower(reindeer)foreach remove in ! , {replace reindeer = subinstr(reindeer, "`remove'", "", .)}foreach remove in now and on {replace reindeer = subinword(reindeer, "`remove'", "", .)}list
In removing the word “on” you need to be careful not to turn “donner” into “dner”. That’s why you need to use subinword() to remove only the complete word “on” and not “on” within other words. Remove the punctuation first so “on” (and “now”) always appears by itself.
Now you’re ready to split what’s left into eight reindeer variables, and then reshape from wide to long.
split reindeerdrop reindeergen i = 1reshapelong reindeer, i(i) j(j)drop i jreplace reindeer = proper(reindeer)list
If you don’t standardize the case with lower() at the beginning you don’t need to fix it with proper() at the end–but then you have to remove “Now”, “now”, “On”, and “on” separately.