Python Examples for Soc 693

Introducing Objects

While R and Stata were designed specifically for data wrangling and statistical analysis, Python is a general-purpose programming language used for a wide variety of tasks. Python is an object-oriented language, meaning you'll spend most of your time working with objects. (R is an object-oriented language too, but in R you can kind of ignore that for a while. Not in Python.) In the programming world, an object is a collection of data and/or functions. Each object is an instance of a class, with the class defining what data and functions the instance will contain.

Objects can contain other objects. For example, a DataFrame object, which stores a classic data set with observations as rows and variables as columns, has one or more Series objects, each storing a column. If you extract a subset of a DataFrame that contains two columns, the result will be another DataFrame. However, if you extract a subset that contains one column, the result will be a Series.

Similar objects will often have the same functions. Both DataFrame and Series have sort_values() functions, for example. That means you can call the sort_values() function of an object without knowing or caring if the object is a DataFrame or Series. That’s a good thing, because sometimes when you extract a subset from a DataFrame (say, all the columns whose names match a certain pattern) you don’t know which one the result will be.
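
To see the distinction in action, here is a minimal sketch using a small made-up DataFrame (the names df, x, and y are invented for illustration):

import pandas as pd

# A tiny example DataFrame with two columns
df = pd.DataFrame({'x': [3, 1, 2], 'y': [10, 30, 20]})

print(type(df[['x', 'y']]))   # a list of columns gives a DataFrame
print(type(df['x']))          # a single column gives a Series

print(df['x'].sort_values())  # Series.sort_values() needs no arguments
print(df.sort_values('x'))    # DataFrame.sort_values() needs to know which column to sort by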

The Pandas package (short for Panel Data, but “cute” names are par for the course in Python) contains functions and class definitions related to working with data, including DataFrame and Series. So the first step is to import it, but since we’ll have to type its name a lot, we’ll give it the nickname pd:

import pandas as pd

Reading Data

You can read Stata data sets into Python with the read_stata() function. This is a Pandas function, so you need to refer to it as pd.read_stata() (i.e. package.function()). It can read from a file or directly from a URL, which we pass in as an argument to the function by putting it in the parentheses.

If you refer to an object without doing anything with it, JupyterLab will print it for you (“implicit print”). That lets us see some of the data and get a sense of what it looks like.

gss = pd.read_stata('https://www.ssc.wisc.edu/~hemken/soc693/dta/soc693_GSSex_step1v2.dta')
gss
year age agekdbrn degree sex race RES16 FAMILY16 RELIG16 PASEI10 cohort wtssall
0 1972 23.0 NaN bachelor female white BIG-CITY SUBURB father NaN 62.0 1949.0 0.444600
1 1972 70.0 NaN LT HIGH SCHOOL male white CITY GT 250000 M AND F RELATIVES NaN 67.7 1902.0 0.889300
2 1972 48.0 NaN HIGH SCHOOL female white CITY GT 250000 MOTHER & FATHER NaN 31.6 1924.0 0.889300
3 1972 27.0 NaN bachelor female white CITY GT 250000 MOTHER & FATHER NaN 69.3 1945.0 0.889300
4 1972 61.0 NaN HIGH SCHOOL female white TOWN LT 50000 MOTHER & FATHER NaN 38.3 1911.0 0.889300
... ... ... ... ... ... ... ... ... ... ... ... ...
64809 2018 37.0 29.0 HIGH SCHOOL female white TOWN LT 50000 MOTHER & STPFATHER catholic 77.4 1981.0 0.471499
64810 2018 75.0 19.0 HIGH SCHOOL female white farm MOTHER & FATHER none 14.0 1943.0 0.942997
64811 2018 67.0 22.0 HIGH SCHOOL female white COUNTRY,NONFARM MOTHER & FATHER protestant 59.1 1951.0 0.942997
64812 2018 72.0 31.0 HIGH SCHOOL male white TOWN LT 50000 MOTHER & FATHER protestant 68.6 1946.0 0.942997
64813 2018 79.0 21.0 HIGH SCHOOL female white farm MOTHER & FATHER catholic 39.9 1939.0 0.471499

64814 rows × 12 columns

Note the column on the far left without a name. This is the index of the DataFrame. Every data set in every language should have an identifier like this, but in pandas an index is mandatory, and pandas will make one for you if needed.
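
If you want to look at the index by itself, it is stored in the index attribute (a quick check, not shown in the original notes):

gss.index
# for this data set, a RangeIndex running from 0 to 64813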

Problem: Categorical Variables

We can get a list of the variables in the data set and their data types by printing the dtypes attribute of the gss DataFrame:

gss.dtypes
year           int16
age         category
agekdbrn    category
degree      category
sex         category
race        category
RES16       category
FAMILY16    category
RELIG16     category
PASEI10     category
cohort      category
wtssall     category
dtype: object

You can tell dtypes is an attribute and not a function (it stores information rather than carrying out an action) because there are no parentheses.

The result is not just a printout: it’s an actual Series object you could do things with. The first column, with the names of all the variables, is the index of the Series. Attributes can also contain functions though.
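
For example, because the variable names are the index of that Series, you can look up one variable's type by name (a small illustration, not in the original code):

gss.dtypes['age']
# at this point age is still a categorical type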

But why are most of the variables categories? Some of them should be, like sex, race, and RELIG16, but age and agekdbrn should be numbers. The problem is, pd.read_stata() assumes anything with value labels must be categorical. The GSS data applied value labels to the missing values of the numeric variables, so now Python thinks they're categorical. Fortunately that's easy to fix, with one minor wrinkle we can see if we look at the values of age.

gss['age'] selects just the age column from gss. That's one of several ways of subsetting a DataFrame. The value_counts() function gives you frequencies, and lists all the values. By default it will be sorted so that the most common category comes first, but if you pass in the keyword argument sort=False it will leave them in their original order.

When we passed a URL to pd.read_stata() the argument was understood by its position as the first (and only) argument. For many functions the first one or two arguments are "obvious" (like the thing to read for pd.read_stata()), but using keyword arguments makes it clear what each argument means.

gss['age'].value_counts(sort=False)
18.0            241
19.0            861
20.0            885
21.0           1014
22.0           1082
               ... 
85.0            197
86.0            186
87.0            148
88.0            121
89 OR OLDER     364
Name: age, Length: 72, dtype: int64

Python isn't going to be able to convert 89 OR OLDER to a number. We'll replace it with just 89.0. The first step is to add it as an allowable category for the variable. The cat attribute of a categorical Series stores information and functions related to the categories, including the add_categories() function. Pass in ['89.0']. This is technically a one-item list (more on lists later).

Now we need to change all the rows where age is 89 OR OLDER to 89.0. To do this we will select the rows and column that need to be changed and set them to 89.0. This is done with the loc[] function, short for location, another method of subsetting. If you put two things in loc[]'s brackets, the first specifies rows and the second columns. Here we'll specify the rows with a condition, gss['age']=='89 OR OLDER', meaning rows where that condition is true. This is just like saying if age=="89 OR OLDER" in Stata. In Python you need to specify the DataFrame, gss, because it's quite possible to write conditions based on a completely separate DataFrame or Series. Then select the age column. All the rows and columns selected will be set to '89.0'.

Check your work with value_counts().

gss['age'] = gss['age'].cat.add_categories(['89.0'])

gss.loc[gss['age']=='89 OR OLDER', 'age'] = '89.0'

gss['age'].value_counts(sort=False)
18.0            241
19.0            861
20.0            885
21.0           1014
22.0           1082
               ... 
86.0            186
87.0            148
88.0            121
89 OR OLDER       0
89.0            364
Name: age, Length: 73, dtype: int64

Now all the numeric variables are ready to be converted. You can do that with the pd.to_numeric() function, but it only works on one variable (Series) at a time. So we need to loop over the numeric variables. Fortunately, looping is easy in Python (especially compared to Stata; see Stata Programming Essentials).

A for loop executes some code once for each item in a list. A Python list is one or more items in square brackets (yes, Python uses square brackets for too many things) separated by commas. A placeholder variable (we'll call it var) stores each item in turn. The first line of the loop ends with a colon (:). The code that is part of the loop is indented, and the loop ends when the indentation ends. In Python indentation doesn't just make code easier to read; it's essential to the syntax!

for var in ['age', 'agekdbrn', 'PASEI10', 'cohort', 'wtssall']:
    gss[var] = pd.to_numeric(gss[var])
gss.dtypes
year           int16
age          float64
agekdbrn     float64
degree      category
sex         category
race        category
RES16       category
FAMILY16    category
RELIG16     category
PASEI10      float64
cohort       float64
wtssall      float64
dtype: object

Summary Statistics

Now that the numeric variables are actually numbers, you can call the DataFrame describe() function to get summary statistics:

gss.describe()
year age agekdbrn PASEI10 cohort wtssall
count 64814.000000 64586.000000 23658.000000 51104.000000 64586.000000 64814.000000
mean 1994.939180 46.099356 23.911024 44.105606 1948.846128 1.000015
std 13.465368 17.534703 5.560892 19.370755 21.262659 0.468172
min 1972.000000 18.000000 9.000000 9.000000 1883.000000 0.391825
25% 1984.000000 31.000000 20.000000 28.700000 1934.000000 0.550100
50% 1996.000000 44.000000 23.000000 39.900000 1951.000000 0.970900
75% 2006.000000 59.000000 27.000000 57.500000 1964.000000 1.098500
max 2018.000000 89.000000 65.000000 93.700000 2000.000000 8.739876

It automatically skips categorical variables. For them you want frequencies, so loop over a list of all the categorical variables and run value_counts() on each. Implicit print will only print one thing per Jupyter cell, so have the loop do an explicit print with the print() function. Pass in a second argument, '\n' (new line), to put a blank line between each variable's frequencies:

for var in ['degree', 'sex', 'race', 'RES16', 'FAMILY16', 'RELIG16']:
    print(gss[var].value_counts(dropna=False), '\n')
HIGH SCHOOL       33195
LT HIGH SCHOOL    13587
bachelor           9475
graduate           4716
JUNIOR COLLEGE     3668
NaN                 173
Name: degree, dtype: int64 

female    36200
male      28614
Name: sex, dtype: int64 

white    52033
black     9187
other     3594
Name: race, dtype: int64 

TOWN LT 50000      20076
CITY GT 250000      9851
50000 TO 250000     9805
farm                9440
BIG-CITY SUBURB     7078
COUNTRY,NONFARM     6935
NaN                 1629
Name: RES16, dtype: int64 

MOTHER & FATHER       44838
mother                 8264
MOTHER & STPFATHER     3202
other                  1606
NaN                    1547
father                 1533
M AND F RELATIVES      1434
FATHER & STPMOTHER     1199
FEMALE RELATIVE         980
MALE RELATIVE           211
Name: FAMILY16, dtype: int64 

protestant                 37042
catholic                   17974
NaN                         3438
none                        3421
jewish                      1266
other                        648
christian                    417
MOSLEM/ISLAM                 161
ORTHODOX-CHRISTIAN           138
buddhism                     128
hinduism                     114
INTER-NONDENOMINATIONAL       31
NATIVE AMERICAN               24
OTHER EASTERN                 12
Name: RELIG16, dtype: int64 

Example: Frequencies and Proportions

Following Doug’s example R code, our next task is to make a table with the frequencies and proportions for agekdbrn. You can get proportions by passing normalize=True to value_counts():

gss['agekdbrn'].value_counts(normalize=True)
21.0    0.090963
20.0    0.084792
19.0    0.077564
22.0    0.070758
18.0    0.066109
25.0    0.065263
23.0    0.064502
24.0    0.060656
26.0    0.050131
27.0    0.047426
17.0    0.043114
28.0    0.042058
30.0    0.038887
29.0    0.033477
16.0    0.026756
32.0    0.023037
31.0    0.021726
33.0    0.016443
35.0    0.013653
34.0    0.011962
15.0    0.009215
36.0    0.008412
37.0    0.006340
38.0    0.005241
39.0    0.004016
40.0    0.003762
14.0    0.003635
41.0    0.002071
42.0    0.001944
13.0    0.001522
43.0    0.000761
45.0    0.000634
44.0    0.000549
47.0    0.000507
46.0    0.000465
12.0    0.000338
50.0    0.000296
48.0    0.000169
49.0    0.000169
51.0    0.000169
52.0    0.000127
9.0     0.000085
65.0    0.000042
54.0    0.000042
11.0    0.000042
53.0    0.000042
56.0    0.000042
55.0    0.000042
57.0    0.000042
Name: agekdbrn, dtype: float64

To combine the proportions with the frequencies, use pd.concat(). It takes a list of Series and puts them together in a DataFrame. We could create the Series ahead of time, but we can also just pass in the function calls that create them.

By default, pd.concat() will add the new Series as new rows. We want to add a new column, so pass in the keyword argument axis=1. Python starts counting from 0, so rows are axis 0 and columns are axis 1.

The result is a DataFrame, with the values of agekdbrn as an index. We can call functions of the DataFrame just like we call functions of gss. To put the table in order by age, add the sort_index() function. This is an example of function chaining: we're calling a function of the result of a previous function. (This is what R does with its pipe.)

pd.concat([gss['agekdbrn'].value_counts(), gss['agekdbrn'].value_counts(normalize=True)], axis=1).sort_index()
agekdbrn agekdbrn
9.0 2 0.000085
11.0 1 0.000042
12.0 8 0.000338
13.0 36 0.001522
14.0 86 0.003635
15.0 218 0.009215
16.0 633 0.026756
17.0 1020 0.043114
18.0 1564 0.066109
19.0 1835 0.077564
20.0 2006 0.084792
21.0 2152 0.090963
22.0 1674 0.070758
23.0 1526 0.064502
24.0 1435 0.060656
25.0 1544 0.065263
26.0 1186 0.050131
27.0 1122 0.047426
28.0 995 0.042058
29.0 792 0.033477
30.0 920 0.038887
31.0 514 0.021726
32.0 545 0.023037
33.0 389 0.016443
34.0 283 0.011962
35.0 323 0.013653
36.0 199 0.008412
37.0 150 0.006340
38.0 124 0.005241
39.0 95 0.004016
40.0 89 0.003762
41.0 49 0.002071
42.0 46 0.001944
43.0 18 0.000761
44.0 13 0.000549
45.0 15 0.000634
46.0 11 0.000465
47.0 12 0.000507
48.0 4 0.000169
49.0 4 0.000169
50.0 7 0.000296
51.0 4 0.000169
52.0 3 0.000127
53.0 1 0.000042
54.0 1 0.000042
55.0 1 0.000042
56.0 1 0.000042
57.0 1 0.000042
65.0 1 0.000042

Calculating a Summary Statistic

The mean() function calculates means. Similar functions are available for other summary statistics.

gss['agekdbrn'].mean()
23.911023755177954
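
For example, median(), std(), and quantile() work the same way (a quick sketch, not part of the original notes):

print(gss['agekdbrn'].median())        # median
print(gss['agekdbrn'].std())           # standard deviation
print(gss['agekdbrn'].quantile(0.75))  # 75th percentile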

Doug’s example does a t-test next, but in Python that requires a whole other library (SciPy) so we’ll skip it. Python’s not that great for statistics anyway.

Data Visualization

Doug cheated and used R’s quick-and-ugly plot() function rather than teaching you ggplot(). Python doesn’t have a quick-and-ugly way to make graphs, but it does have a port of ggplot called plotnine. So, ironically enough, I’m going to teach you how to use R’s standard graphics tool in a Python lecture.

ggplot() is based on the Grammar of Graphics (hence the name). In the Grammar of Graphics, an aesthetic defines the relationship between a variable and a feature of the graph, such as the x-coordinate or the color of the points. A geometry defines what the relationship will look like.

Histograms

First, import plotnine as p9. To create a plot, call the p9.ggplot() function, passing in the DataFrame to be plotted. Then add to it the result of a p9.aes() function that defines the aesthetic, then one of the geometry functions with the desired geometry. To make a histogram of agekdbrn, make x='agekdbrn' the aesthetic and p9.geom_histogram() the geometry. Pass in binwidth=1 to get one bin for each value.

import plotnine as p9

p9.ggplot(gss) + p9.aes(x='agekdbrn') + p9.geom_histogram(binwidth=1)
/home/r/rdimond/.local/lib/python3.9/site-packages/plotnine/layer.py:333: PlotnineWarning: stat_bin : Removed 41156 rows containing non-finite values.

<ggplot: (8759070341973)>

The warning just means it’s ignoring the missing values, which is good. Now compare that with age:

p9.ggplot(gss) + p9.aes(x='age') + p9.geom_histogram(binwidth=1)
/home/r/rdimond/.local/lib/python3.9/site-packages/plotnine/layer.py:333: PlotnineWarning: stat_bin : Removed 228 rows containing non-finite values.

<ggplot: (8759060933759)>

Scatterplots

To create a scatterplot, set both an x and a y coordinate in the aesthetic and use p9.geom_point().

p9.ggplot(gss) + p9.aes(x='age', y='agekdbrn') + p9.geom_point()
/home/r/rdimond/.local/lib/python3.9/site-packages/plotnine/layer.py:411: PlotnineWarning: geom_point : Removed 41209 rows containing missing values.

<ggplot: (8759060321470)>

You could almost copy this code and paste it into R and have it run. The only difference is that in R you don’t have to specify which package a function comes from (which is great until two packages try to define the same function) so you’d need to remove p9. from all the function calls. By the same token, you can read our guide to Data Visualization in R with ggplot2, or take the class, and apply it in Python.

Recoding

Next we’ll recode degree and create ed3 with just three values. Start by taking a look at the values of degree:

gss['degree'].value_counts(sort=False)
LT HIGH SCHOOL    13587
HIGH SCHOOL       33195
JUNIOR COLLEGE     3668
bachelor           9475
graduate           4716
Name: degree, dtype: int64

The categories we want for ed3 are ‘Less Than High School’, ‘High School or Junior College’, and ‘Bachelor Degree or Higher’. To create ed3 you just use it in a subset as if it already existed, assign it a value, and it will be created automatically. If the subset does not contain all rows, the other rows will get missing (NaN).

Thus the recoding process works just like changing 89 OR OLDER to 89.0: select the proper rows and the new ed3 column using loc[] and set them to the appropriate value. Use the pipe character (|) for logical or, just like in Stata, but Python is really picky about having each part of the condition in parentheses.

gss.loc[gss['degree']=='LT HIGH SCHOOL', 'ed3'] = 'Less Than High School'
gss.loc[(gss['degree']=='HIGH SCHOOL') |  (gss['degree']=='JUNIOR COLLEGE'), 'ed3'] = 'High School or Junior College'
gss.loc[(gss['degree']=='bachelor') | (gss['degree']=='graduate'), 'ed3'] = 'Bachelor Degree or Higher'
gss
year age agekdbrn degree sex race RES16 FAMILY16 RELIG16 PASEI10 cohort wtssall ed3
0 1972 23.0 NaN bachelor female white BIG-CITY SUBURB father NaN 62.0 1949.0 0.444600 Bachelor Degree or Higher
1 1972 70.0 NaN LT HIGH SCHOOL male white CITY GT 250000 M AND F RELATIVES NaN 67.7 1902.0 0.889300 Less Than High School
2 1972 48.0 NaN HIGH SCHOOL female white CITY GT 250000 MOTHER & FATHER NaN 31.6 1924.0 0.889300 High School or Junior College
3 1972 27.0 NaN bachelor female white CITY GT 250000 MOTHER & FATHER NaN 69.3 1945.0 0.889300 Bachelor Degree or Higher
4 1972 61.0 NaN HIGH SCHOOL female white TOWN LT 50000 MOTHER & FATHER NaN 38.3 1911.0 0.889300 High School or Junior College
... ... ... ... ... ... ... ... ... ... ... ... ... ...
64809 2018 37.0 29.0 HIGH SCHOOL female white TOWN LT 50000 MOTHER & STPFATHER catholic 77.4 1981.0 0.471499 High School or Junior College
64810 2018 75.0 19.0 HIGH SCHOOL female white farm MOTHER & FATHER none 14.0 1943.0 0.942997 High School or Junior College
64811 2018 67.0 22.0 HIGH SCHOOL female white COUNTRY,NONFARM MOTHER & FATHER protestant 59.1 1951.0 0.942997 High School or Junior College
64812 2018 72.0 31.0 HIGH SCHOOL male white TOWN LT 50000 MOTHER & FATHER protestant 68.6 1946.0 0.942997 High School or Junior College
64813 2018 79.0 21.0 HIGH SCHOOL female white farm MOTHER & FATHER catholic 39.9 1939.0 0.471499 High School or Junior College

64814 rows × 13 columns

Creating Categorical Variables

If you look at the dtypes of gss, you’ll see ed3 is an object. While that could mean just about anything, in practice if a variable shows up as object it’s almost always a string. We want it to be a categorical variable.

gss.dtypes
year           int16
age          float64
agekdbrn     float64
degree      category
sex         category
race        category
RES16       category
FAMILY16    category
RELIG16     category
PASEI10      float64
cohort       float64
wtssall      float64
ed3           object
dtype: object

Like in Stata, a Python categorical variable has a string description of the category and an underlying number. But in Python you never need to know or care what the number is. You just write code using the string descriptions, like we've done so far. To make a column a categorical variable, call its astype() function and pass in 'category'. Python will assign numeric values for each string, without any further intervention by you.

Why not just leave them as strings? Some tasks need them to be categories, and storing one number per observation takes a lot less memory than storing one string per observation.

gss['ed3'] = gss['ed3'].astype('category')
gss.dtypes
year           int16
age          float64
agekdbrn     float64
degree      category
sex         category
race        category
RES16       category
FAMILY16    category
RELIG16     category
PASEI10      float64
cohort       float64
wtssall      float64
ed3         category
dtype: object
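
If you're curious about the memory savings mentioned above, one rough way to check (not part of the original code) is to compare the same column stored as plain strings and as a categorical; memory_usage(deep=True) reports the size in bytes:

print(gss['degree'].astype(str).memory_usage(deep=True))  # one string per observation
print(gss['degree'].memory_usage(deep=True))              # categorical: codes plus the category labels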

ed3 is really an ordered categorical variable, but Python doesn’t know that yet. To tell it, call the column’s cat.reorder_categories() function, passing in a list of the categories in order and the argument ordered=True.

gss['ed3'] = gss['ed3'].cat.reorder_categories(
    [
        'Less Than High School',
        'High School or Junior College',
        'Bachelor Degree or Higher'
    ],
    ordered=True
)

Once a categorical variable is ordered, you can use things like greater than, less than, max() and min() even though the values are all strings (as far as you’re concerned):

gss['ed3'].max()
'Bachelor Degree or Higher'
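
Comparisons work the same way. For example, a condition like the one below (a small sketch, not in the original code) returns a Boolean Series you could use with loc[] to select rows:

gss['ed3'] >= 'High School or Junior College'
# True for 'High School or Junior College' and 'Bachelor Degree or Higher', False otherwise (including missing)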

Boxplots

Now make a boxplot showing the distribution of agekdbrn by level of ed3. The aesthetic for this is that ed3 is the x-coordinate and agekdbrn the y-coordinate. The geometry is p9.geom_boxplot().

p9.ggplot(gss) + p9.aes(x='ed3', y='agekdbrn') + p9.geom_boxplot()
/home/r/rdimond/.local/lib/python3.9/site-packages/plotnine/layer.py:333: PlotnineWarning: stat_boxplot : Removed 41156 rows containing non-finite values.

<ggplot: (8759060258036)>

That’s ugly because the x-axis labels overlap. Fortunately there’s a simple solution: turn the graph sideways. You can’t do that by just swapping the x and y coordinate because p9.geom_boxplot() always plots the distribution of y across the levels of x. Instead, add p9.coord_flip() to the chain. This makes the logical x axis vertical and the logical y axis horizontal:

p9.ggplot(gss) + p9.aes(x='ed3', y='agekdbrn') + p9.geom_boxplot() + p9.coord_flip()
/home/r/rdimond/.local/lib/python3.9/site-packages/plotnine/layer.py:333: PlotnineWarning: stat_boxplot : Removed 41156 rows containing non-finite values.

<ggplot: (8759060242270)>

Much better! Most of the time bar graphs, boxplots, or anything else with categorical labels will work better if the categorical variable is vertical and the bars/boxes/whatever are horizontal.

It may be very important to your analysis to recognize that the people with NaN for ed3 are most like the people with less than a high school education when it comes to agekdbrn. It looks like the data are not missing completely at random and that could bias your analysis. But for a paper you probably don’t want to include that category in your graph. That’s okay: just pass a subset of gss to p9.ggplot().

You just can’t select that subset the way you’re used to in Stata. In Python, any condition involving NaN (missing) returns False. So if you wrote gss['ed3']==NaN the result would be False whether ed3 was missing or not. (Actually, that code wouldn’t work. We won’t go into how to write it in a valid way because it still wouldn’t give the result you need.) Instead, use the functions notna() or isna() to detect whether a value is missing or not. In this case we want notna().
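
If you want to know how many rows that will drop, isna() combined with sum() counts the missing values (a quick check, not part of the original code):

gss['ed3'].isna().sum()
# the number of rows where ed3 is missing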

p9.ggplot(gss[gss['ed3'].notna()]) + p9.aes(x='ed3', y='agekdbrn') + p9.geom_boxplot() + p9.coord_flip()
/home/r/rdimond/.local/lib/python3.9/site-packages/plotnine/layer.py:333: PlotnineWarning: stat_boxplot : Removed 41015 rows containing non-finite values.

<ggplot: (8759058393338)>