6 Variable Transformations with Text Data (Strings)

Some researchers never have to work with text data, known as strings in programming jargon, but for others it’s a regular part of their research. Feel free to skip this chapter if you are in the former group. While string transformations can get very complicated, it’s possible to do a lot with just a few simple functions. In this chapter we’ll carry out some common tasks using those functions.

Note: when we talk about “text data” in this chapter we mean data made up of letters and words rather than numbers. (There is a section on removing non-numeric text from numeric data.) That’s different from text files, like CSVs, that can contain both text and numeric data.

6.1 Setting Up

Start a do file called text_data that loads the auto_strings data set. The make variable is a text variable giving the make of the car, and price is the price of the car–but currently stored as a string.

capture log close
log using text_data.log, replace

clear all
use auto_strings

list in 1/5

-------------------------------------------------------------------------------
      name:  <unnamed>
       log:  /home/r/rdimond/kb/dws/text_data.log
  log type:  text
 opened on:  23 Feb 2024, 11:37:39
(1978 automobile data)

     +------------------------+
     | make             price |
     |------------------------|
  1. | AMC Concord     $4,099 |
  2. | AMC Pacer       $4,749 |
  3. | AMC Spirit      $3,799 |
  4. | Buick Century   $4,816 |
  5. | Buick Electra   $7,827 |
     +------------------------+

6.2 Standardizing Case

Stata’s string functions are all case sensitive, but in many data sets case is not important. For example “AMC Concord”, “amc concord” and “AMC CONCORD” would presumably all refer to the same car. A valuable step in preparing many text data sets for analysis is to standardize the case, usually by converting everything to lower case. This is especially true if your data contains answers written by subjects themselves.

6.2.1 strlower()

The strlower() function will standardize case for you: you pass in a string and it returns the string converted to lower case. Like almost all string functions, the argument can be either a variable containing strings or a string literal, i.e. just some text in quotes:

display strlower("AMC Concord")

amc concord

replace make = lower(make)
list make in 1/5

(74 real changes made)

     +---------------+
     | make          |
     |---------------|
  1. | amc concord   |
  2. | amc pacer     |
  3. | amc spirit    |
  4. | buick century |
  5. | buick electra |
     +---------------+

While standardizing to lower case is the most common, the strupper() function converts strings to upper case, while strproper capitalizes the first letter of each word:

display strupper("AMC Concord")

AMC CONCORD

display strproper("AMC Concord")

Amc Concord

6.3 String Cleaning

Often strings need to be cleaned up before they are used, such as standardizing abbreviations or correcting misspellings.

6.3.1 subinstr()

subinstr() takes four arguments: a string to act on, a string to replace, a string to replace it with, and the number of replacements to make. (The fourth argument is very easy to forget.) For example:

display subinstr("dog cat cat bird", "cat", "fish", 1)

dog fish cat bird

Note that only the first instance of “cat” was replaced by “fish” because the fourth argument said to only replace one instance of “cat”. If the fourth argument is a missing value (.) then all instances will be replaced:

display subinstr("dog cat cat bird", "cat", "fish", .)

dog fish fish bird

Four of the cars in this data set were built by Volkswagen, but the company name is abbreviated:

list make in 70/73


     +-------------+
     | make        |
     |-------------|
 70. | vw dasher   |
 71. | vw diesel   |
 72. | vw rabbit   |
 73. | vw scirocco |
     +-------------+

If you want to use the full company name instead, you can do that with subinstr():

replace make = subinstr(make, "vw", "volkswagen", 1)
list make in 70/73

variable make was str18 now str19
(4 real changes made)

     +---------------------+
     | make                |
     |---------------------|
 70. | volkswagen dasher   |
 71. | volkswagen diesel   |
 72. | volkswagen rabbit   |
 73. | volkswagen scirocco |
     +---------------------+

Six of the cars in this data set were built by Chevrolet, but there are several different abbreviations used:

list make in 14/19


     +-------------------+
     | make              |
     |-------------------|
 14. | chev chevette     |
 15. | chev impala       |
 16. | chev. malibu      |
 17. | chev. monte carlo |
 18. | chevy monza       |
     |-------------------|
 19. | chevy nova        |
     +-------------------+

This is common in real-world data, especially data collected directly from subjects. If you’re interested in talking about the companies in this data set, an important step in cleaning it will be to standardize the company names. That means each company should only appear in one form: either the full name or a standard abbreviation.

In this case, let’s replace all the various abbreviations with the full name (Chevrolet). Here’s how not do do that:

gen make_oops = subinstr(make, "chev", "chevrolet", .)
list make_oops in 14/19


     +-------------------------+
     |               make_oops |
     |-------------------------|
 14. | chevrolet chevroletette |
 15. |        chevrolet impala |
 16. |       chevrolet. malibu |
 17. |  chevrolet. monte carlo |
 18. |        chevrolety monza |
     |-------------------------|
 19. |         chevrolety nova |
     +-------------------------+

Oops.

There are actually two problems here. The first is using a missing value for the last subinstr() argument, meaning that all instances of “chev” will be replaced. That’s what turned the car model “chevette” into “chevroletette”. The company name only appears once in make, so you only want to replace one instance of its abbreviation. Get in the habit of thinking how many times a replace should happen and use that number for the last argument of subinstr() rather than lazily always passing in ..

The second is that “chev” was replaced by “chevrolet” wherever it appeared, so “chev.” became “chevrolet.” and “chevy” became “chevrolety”.

6.3.2 subinword()

We want to replace “chev” with “chevrolet” only when it appears as a complete word, not when it’s part of a word. The subinword() function can do that for us. It takes the exact same arguments as subinstr() and does the same thing, except that it will only replace complete words:

replace make = subinword(make, "chev", "chevrolet", 1)
list make in 14/19

(2 real changes made)

     +--------------------+
     | make               |
     |--------------------|
 14. | chevrolet chevette |
 15. | chevrolet impala   |
 16. | chev. malibu       |
 17. | chev. monte carlo  |
 18. | chevy monza        |
     |--------------------|
 19. | chevy nova         |
     +--------------------+

Now you can finish the job:

replace make = subinword(make, "chev.", "chevrolet", 1)
replace make = subinword(make, "chevy", "chevrolet", 1)
list make in 14/19

variable make was str19 now str21
(2 real changes made)
(2 real changes made)

     +-----------------------+
     | make                  |
     |-----------------------|
 14. | chevrolet chevette    |
 15. | chevrolet impala      |
 16. | chevrolet malibu      |
 17. | chevrolet monte carlo |
 18. | chevrolet monza       |
     |-----------------------|
 19. | chevrolet nova        |
     +-----------------------+

You can also use subinstr() or subinword() to correct misspellings. Unfortunately, there’s no Stata function that can identify all the misspelled words in a data set and tell you what they are supposed to be. Nor is there a function that can identify all the words that are abbreviations for “Chevrolet.” AI language models may be relevant here, but they are well beyond the scope of this chapter. For now, you’ll have to read your data to find all the things that need to be changed.

6.4 Cleaning Numeric Data

Another common task is cleaning up numeric data that’s been stored as text so it can be converted into numbers. The price variable in this data set is a typical example

list make price in 1/5


     +------------------------+
     | make             price |
     |------------------------|
  1. | amc concord     $4,099 |
  2. | amc pacer       $4,749 |
  3. | amc spirit      $3,799 |
  4. | buick century   $4,816 |
  5. | buick electra   $7,827 |
     +------------------------+

price should be a numeric variable. But if you try to use destring on it, it fails because of the $ and , characters:

destring price, replace

price: contains nonnumeric characters; no replace

The solution is to remove those characters. You can do this with subinstr(): you just need to replace $ and , with nothing. If the third argument for subinstr() is "", an opening quote immediately followed by a closing quote with not even a space in between, then the second argument will simply be removed.

replace price = subinstr(price, "$", "", 1)
list make price in 1/5

(74 real changes made)

     +-----------------------+
     | make            price |
     |-----------------------|
  1. | amc concord     4,099 |
  2. | amc pacer       4,749 |
  3. | amc spirit      3,799 |
  4. | buick century   4,816 |
  5. | buick electra   7,827 |
     +-----------------------+

All of the prices in this data set only have one comma in them, but if a car’s price had two commas (i.e. it cost a million dollars or more) you’d want to remove both of them. In fact you want to remove all commas no matter how many there are. That makes this a time when you legitimately should use a missing value for the last subinstr() argument.

replace price = subinstr(price, ",", "", .)
list make price in 1/5

(74 real changes made)

     +-----------------------+
     | make            price |
     |-----------------------|
  1. | amc concord      4099 |
  2. | amc pacer        4749 |
  3. | amc spirit       3799 |
  4. | buick century    4816 |
  5. | buick electra    7827 |
     +-----------------------+

You can now successfully destring price as it now only contains numbers:

destring price, replace

price: all characters numeric; replaced as int

6.5 Is String X in String Y?

A common task with text data is identifying whether one string contains another. For example, if make contains the word amc (after standardizing case to lower) then the car was built by AMC.

6.5.1 strpos()

The strpos() (string position) function takes two strings as arguments. The result will be the position of the second string within the first string, or zero if the first string does not contain the second string. This lets you use the result of a strpos() to say if one string contains another. Some examples:

display strpos("amc concord", "amc")

display strpos("amc concord", "buick")

display strpos("amc concord", "concord")

Note that the position of “concord” within “amc concord” is defined as the position of the first character of “concord” (i.e. the first “c”).

If strpos(string1, string2) is greater than 0, then string2 is in string1. This lets you easily create indicator variables:

gen amc = (strpos(make, "amc") > 0)
list make amc in 1/5


     +---------------------+
     | make            amc |
     |---------------------|
  1. | amc concord       1 |
  2. | amc pacer         1 |
  3. | amc spirit        1 |
  4. | buick century     0 |
  5. | buick electra     0 |
     +---------------------+

This gives you an indicator variable for “This car was made by AMC.”

6.6 Parsing

Frequently strings contain multiple pieces of information. Parsing is the task of identifying the meaning of each piece of information, and often splitting the string up into separate variables. For example, make contains first the name of the company that built the car and then the model of the car.

6.6.1 word()

The word() function takes a string, and the number of the word you want out of it. It then returns that word:

display word("amc concord", 1)

amc

display word("amc concord", 2)

concord

This makes it very easy to parse strings by word. For example, you can create a variable containing the manufacturer of each car with:

gen manufacturer = word(make, 1)
list make manufacturer in 1/5, ab(15)


     +------------------------------+
     | make            manufacturer |
     |------------------------------|
  1. | amc concord              amc |
  2. | amc pacer                amc |
  3. | amc spirit               amc |
  4. | buick century          buick |
  5. | buick electra          buick |
     +------------------------------+

If all the model names were just one word, you could easily use word() to identify model as well. Unfortunately they are not:

list make in 17


     +-----------------------+
     | make                  |
     |-----------------------|
 17. | chevrolet monte carlo |
     +-----------------------+

display word(make[17], 2)

monte

You can easily create a model variable if you recognize that the car model is everything in make except the manufacturer, so all you need to do is remove the manufacturer.

gen model = subinstr(make, manufacturer, "", 1)
list make manufacturer model in 17, ab(15)

(1 missing value generated)

     +-----------------------------------------------------+
     | make                    manufacturer          model |
     |-----------------------------------------------------|
 17. | chevrolet monte carlo      chevrolet    monte carlo |
     +-----------------------------------------------------+

6.6.2 String Concatenation (Addition)

All done? Not quite. There’s a problem with model, one that’s common but very hard to see. One way to make it visible is to put a star at the beginning and end of model. Putting one string after another is called concatenation, but in Stata it’s done using the plus sign:

gen model_stars = "*" + model + "*"
list model_stars in 1/5, ab(15)


     +-------------+
     | model_stars |
     |-------------|
  1. |  * concord* |
  2. |    * pacer* |
  3. |   * spirit* |
  4. |  * century* |
  5. |  * electra* |
     +-------------+

Note how there’s a space after the first star, but not before the last star. That’s because model actually starts with a space. The original make variable contains the name of the manufacturer followed by a space followed by the name of the model, so removing manufacturer left the space. This can sometimes cause major headaches, as things that seem like they ought to match just don’t. For example, you might think you could identify the AMC Concord with if model=="concord" but it doesn’t work:

list make if model=="concord"

This gives no results because no observation has “concord” for model. You actually need:

list make if model==" concord"


     +-------------+
     | make        |
     |-------------|
  1. | amc concord |
     +-------------+

6.6.3 strtrim()

Don’t leave your data like this, just waiting to trip up an unwary programmer (like yourself six months from now). The strtrim() function will remove any spaces at the beginning or end of a string:

replace model = strtrim(model)
replace model_stars = "*" + model + "*"
list model_stars in 1/5, ab(15)

(73 real changes made)
(73 real changes made)

     +-------------+
     | model_stars |
     |-------------|
  1. |   *concord* |
  2. |     *pacer* |
  3. |    *spirit* |
  4. |   *century* |
  5. |   *electra* |
     +-------------+

After using strtrim(), recreating model_stars shows us there are no extraneous spaces around model.

Close relatives of strtrim() include strltrim(), which only removes spaces before the string (i.e. on the left), strrtrim() which only removes spaces after the string (i.e. on the right) and stritrim() which replaces multiple spaces inside a string with a single space:

display stritrim("Too     many     spaces")

Too many spaces

What makes parsing make into manufacturer and model relatively easy is that manufacturer is the first word of make, so we can rely on the word() function to identify it. But what if the string you’re parsing isn’t made up of words separated by spaces?

6.6.4 subinstr() Combined With word()

Consider the problem of trying to parse out the area code from a telephone number. Start by creating a variable phone containing the phone number of the SSCC Help Desk:

gen phone = "608-262-9917"

The area code is separated from the rest of the phone number by a dash. But if you replace all dashes with spaces, you can use word() again:

gen temp = subinstr(phone, "-", " ", .)
gen area_code = word(temp, 1)
list phone temp area_code in 1, ab(15)


     +-----------------------------------------+
     |        phone           temp   area_code |
     |-----------------------------------------|
  1. | 608-262-9917   608 262 9917         608 |
     +-----------------------------------------+

This method, replacing the actual separator with spaces and using word(), is easy to use but will cause trouble if the bits of information themselves contain spaces. So we’ll look at a couple of alternatives that don’t rely on special characters later.

6.6.5 split

The split command breaks up a string into all its component words, creating a new variable for each:

split temp
list temp* in 1

variables created as string: 
temp1  temp2  temp3

     +--------------------------------------+
     |         temp   temp1   temp2   temp3 |
     |--------------------------------------|
  1. | 608 262 9917     608     262    9917 |
     +--------------------------------------+

We could now rename temp1 to area_code if we wanted to. But split can also break up a string based on a character other than spaces using the parse() option, allowing us to split phone directly:

split phone, parse("-")
list phone* in 1

variables created as string: 
phone1  phone2  phone3

     +-----------------------------------------+
     |        phone   phone1   phone2   phone3 |
     |-----------------------------------------|
  1. | 608-262-9917      608      262     9917 |
     +-----------------------------------------+

Since the resulting strings contain numbers, we could add the destring option and split would create the new variables as numeric.

6.6.6 substr()

Area codes are exactly three digits long, so another easy method is to just extract the first three characters of the phone number. This can be done with substr() (substring). The substr() function takes three aarguments: the string to act on, the starting point of the substring to extract, and the number of characters to extract. Thus:

gen area_code2 = substr(phone, 1, 3)
list phone area_code2 in 1, ab(15)


     +---------------------------+
     |        phone   area_code2 |
     |---------------------------|
  1. | 608-262-9917          608 |
     +---------------------------+

If you pass in a missing value for the number of characters to extract, you’ll get the rest of the string:

gen local = substr(phone, 5, .)
list phone local in 1


     +-------------------------+
     |        phone      local |
     |-------------------------|
  1. | 608-262-9917   262-9917 |
     +-------------------------+

If the starting point is negative, it will is interpreted as that many characters from the end of the string:

gen last_four = substr(phone, -4, .)
list phone last_four in 1, ab(15)


     +--------------------------+
     |        phone   last_four |
     |--------------------------|
  1. | 608-262-9917        9917 |
     +--------------------------+

6.6.7 substr() Combined With strpos()

The combination of substr() and strpos() is very powerful because you can select a substring based on the content of the original string. That allows you parse just about anything.

Consider the famous opening of Charles Dickens’ A Tale of Two Cities:

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair…

That’s a lot of things that “it was”! If you’re reading this web book you are presumably a data geek who would much prefer to have all those things in a nice neat data set. So let’s do that.

The easy way would be to do a bit of clean up and then use split with the parse string , it was. But you already know how to use split so we’ll use a more general method.

[Load the data set]

clear all
set obs 1
gen tale = "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair"

Number of observations (_N) was 0, now 1.

The first step is to make a copy of the tale variable called temp. As you’ll see, it will be highly convenient to parse out one item from the list and then delete it before moving on to the next. Consuming each item in turn makes for easy repetition, eventually in the form of a loop. But you want to be able to check your work, so keep tale around in its original form. Making a copy of the original string is also a convenient time to standardize the case to lower.

gen temp = lower(tale)

Now take a look at the text:

list temp


     +------------------------------------------------------------------------+
     | temp                                                                   |
     |------------------------------------------------------------------------|
  1. | it was the best of times, it was the worst of times, it was the age .. |
     +------------------------------------------------------------------------+

We have here a list of items separated by the phrase “it was.” But there’s no need for the string to start with a separator and it will cause problems for what we do next, so remove the first instance of “it was” by using subinstr() to replace it with nothing. Be sure to include the space after “was” in the text to remove.

replace temp = subinstr(temp, "it was ", "", 1)
list temp

Now the first item in the list goes from the start of the string to the first instance of “it was”. Put it in the variable it_was1 with:`

gen it_was1 = substr(temp, 1, strpos(temp, "it was")-3)
list it_was1

The first two arguments of substr() specify that the result will be part of the string temp, starting from character 1. The third argument is then the number of characters to include, which is calculated by strpos(). strpos(temp, "it was") gives the location of the start of the first instance of “it was” within temp. We subtract three from that because we don’t want our substring to include the “i” from “it was” or the comma and space that come before it. The result is the first item, it_was1.

Now remove it_was1 plus the following comma and space from temp:

replace temp = subinstr(temp, it_was1+", ", "", 1)
list temp

Now temp looks just like it did when we started, other than the second item now being first. So we can repeat the exact same commands to get it_was2: remove the initial “it was”, select out it_was2, and then remove it from temp.

replace temp = subinstr(temp, "it was ", "", 1)
gen it_was2 = substr(temp, 1, strpos(temp, "it was")-3)
replace temp = subinstr(temp, it_was2+", ", "", 1)
list it_was2
list temp

And we can do it again for it_was3:

replace temp = subinstr(temp, "it was ", "", 1)
gen it_was3 = substr(temp, 1, strpos(temp, "it was")-3)
replace temp = subinstr(temp, it_was3+", ", "", 1)
list it_was3
list temp

We could continue repeating this code until we ran out of items. Note that the last item does not end with “it was” so you need to just store it as-is.

6.6.8 Parsing With a Loop

This task is practically begging to be done using a loop. If you haven’t learned about loops and macros, come back to this section once you have.

Since we don’t know ahead of time how many items there are to parse, this is a job for a while loop. Have it keep going until temp is empty with while (temp!="").

But we also need to number the items, so we’ll create a macro i and increment it (increase it by 1) each time through the loop. It is then added to the it_was variable name.

If temp still contains “it was”, then we have not yet reached the last item and the next item can be parsed out just like we’ve done it before. If temp does not contain “it was”, then the final item can be set to the remaining content of temp.

clear all
set obs 1
gen tale = "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair"
gen temp = lower(tale)

local i 1
while (temp!="") {

    replace temp = subinstr(temp, "it was ", "", 1)

    if (strpos(temp, "it was")>0) {
        gen it_was`i' = substr(temp, 1, strpos(temp, "it was")-3)
    }
    else {
        gen it_was`i' = temp
    }
    
    replace temp = substr(temp, strpos(temp, "it was"), .)
    local i = `i'+1
}

list

Wide form data sets are always ugly, so let’s reshape this to long form (if you haven’t read about reshape yet, come back when you have). Reshape needs an level one identifier even though the data set has just one observation, so we’ll need to create one and just set it to 1. Then we can reshape and drop the variables that are no longer needed to get the nice, neat data set promised.

gen i = 1
reshape long it_was, i(i) j(j)
drop i j tale temp
list

The advantage of this method of looping over the string and consuming one item at a time is that the conditions for identifying an item can be complex. If they’re not complex, then split can do it much more easily.

Exercise

The names of Santa’s reindeer are given in Clement C. Moore’s Account of a Visit from St. Nicholas (better known as The Night Before Christmas) when Santa calls:

Now Dasher! now, Dancer! now, Prancer and Vixen!

On, Comet! On, Cupid! on, Donner and Blitzen!

Turn this into a data set of reindeer with eight observations. The easy way is to remove all punctuation and words other than reindeer names (you can use loops to make this easier) and then use split.

Make sure the reindeer names all end up with the proper case!

Solution

Start by loading the data, standardizing the case, and removing everything from the text that isn’t a reindeer name. That last step can be done using two loops: one that loops over all the punctuation characters and removes them with subinstr(), and one that loops over all the extraneous words and removes them with subinword().

clear
set obs 1
gen reindeer = "Now Dasher! now, Dancer! now, Prancer and Vixen! On, Comet! On, Cupid! on, Donner and Blitzen!"

replace reindeer = lower(reindeer)

foreach remove in ! , {
    replace reindeer = subinstr(reindeer, "`remove'", "", .)
}

foreach remove in now and on {
    replace reindeer = subinword(reindeer, "`remove'", "", .)
}

list

In removing the word “on” you need to be careful not to turn “donner” into “dner”. That’s why you need to use subinword() to remove only the complete word “on” and not “on” within other words. Remove the punctuation first so “on” (and “now”) always appears by itself.

Now you’re ready to split what’s left into eight reindeer variables, and then reshape from wide to long.

split reindeer

drop reindeer
gen i = 1
reshape long reindeer, i(i) j(j)
drop i j

replace reindeer = proper(reindeer)

list

If you don’t standardize the case with lower() at the beginning you don’t need to fix it with proper() at the end–but then you have to remove “Now”, “now”, “On”, and “on” separately.