import pandas as pd
Python Examples for Soc 693
Introducing Objects
While R or Stata were designed specifically for data wrangling and statistical analysis, Python is a general-purpose programming language used for a wide variety of tasks. Python is an Object Oriented Language, meaning you’ll spend most of your time working with objects. (R is an object oriented language too, but in R you can kind of ignore that for a while. Not in Python.) In the programming world, an object is a collection of data and/or functions. Each object is an instance of a class, with the class defining what data and functions the instance will contain.
Objects can contain other objects. For example, a DataFrame object, which stores a classic data set with observations as rows and variables as columns, has one or more Series objects, each storing a column. If you extract a subset of a DataFrame that contains two columns, the result will be another DataFrame. However, if you extract a subset that contains one column, the result will be a Series.
Similar objects will often have the same functions. Both DataFrame and Series have sort_values()
functions, for example. That means you can call the sort_values()
function of an object without knowing or caring if the object is a DataFrame or Series. That’s a good thing, because sometimes when you extract a subset from a DataFrame (say, all the columns whose names match a certain pattern) you don’t know which one the result will be.
The Pandas package (short for Panel Data, but “cute” names are par for the course in Python) contains functions and class definitions related to working with data, including DataFrame and Series. So the first step is to import it, but since we’ll have to type its name a lot, we’ll give it the nickname pd
:
Reading Data
You can read Stata data sets into Python with the read_stata()
function. This is a Pandas function, so you need to refer to it as pd.read_stata()
(i.e. package.function()
). It can read from a file or directly from a URL, which we pass in as an argument to the function by putting it in the parentheses.
If you refer to an object without doing anything with it, JupyterLab will print it for you (“implicit print”). That lets us see some of the data and get a sense of what it looks like.
= pd.read_stata('https://www.ssc.wisc.edu/~hemken/soc693/dta/soc693_GSSex_step1v2.dta')
gss gss
year | age | agekdbrn | degree | sex | race | RES16 | FAMILY16 | RELIG16 | PASEI10 | cohort | wtssall | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1972 | 23.0 | NaN | bachelor | female | white | BIG-CITY SUBURB | father | NaN | 62.0 | 1949.0 | 0.444600 |
1 | 1972 | 70.0 | NaN | LT HIGH SCHOOL | male | white | CITY GT 250000 | M AND F RELATIVES | NaN | 67.7 | 1902.0 | 0.889300 |
2 | 1972 | 48.0 | NaN | HIGH SCHOOL | female | white | CITY GT 250000 | MOTHER & FATHER | NaN | 31.6 | 1924.0 | 0.889300 |
3 | 1972 | 27.0 | NaN | bachelor | female | white | CITY GT 250000 | MOTHER & FATHER | NaN | 69.3 | 1945.0 | 0.889300 |
4 | 1972 | 61.0 | NaN | HIGH SCHOOL | female | white | TOWN LT 50000 | MOTHER & FATHER | NaN | 38.3 | 1911.0 | 0.889300 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
64809 | 2018 | 37.0 | 29.0 | HIGH SCHOOL | female | white | TOWN LT 50000 | MOTHER & STPFATHER | catholic | 77.4 | 1981.0 | 0.471499 |
64810 | 2018 | 75.0 | 19.0 | HIGH SCHOOL | female | white | farm | MOTHER & FATHER | none | 14.0 | 1943.0 | 0.942997 |
64811 | 2018 | 67.0 | 22.0 | HIGH SCHOOL | female | white | COUNTRY,NONFARM | MOTHER & FATHER | protestant | 59.1 | 1951.0 | 0.942997 |
64812 | 2018 | 72.0 | 31.0 | HIGH SCHOOL | male | white | TOWN LT 50000 | MOTHER & FATHER | protestant | 68.6 | 1946.0 | 0.942997 |
64813 | 2018 | 79.0 | 21.0 | HIGH SCHOOL | female | white | farm | MOTHER & FATHER | catholic | 39.9 | 1939.0 | 0.471499 |
64814 rows × 12 columns
Note the column on the far left without a name. This is the index of the DataFrame. Every data set in every language should have one, but in Python it’s mandatory and it will make one for you if needed.
Problem: Categorical Variables
We can get a list of the variables in the data set and their data types by printing the dtypes
attribute of the gss
DataFrame:
gss.dtypes
year int16
age category
agekdbrn category
degree category
sex category
race category
RES16 category
FAMILY16 category
RELIG16 category
PASEI10 category
cohort category
wtssall category
dtype: object
You can tell dtypes
is an attribute and not a function (it doesn’t do anything) because there are no parentheses.
The result is not just a printout: it’s an actual Series object you could do things with. The first column, with the names of all the variables, is the index of the Series. Attributes can also contain functions though.
But why are most of the variables categories? Some of them should be, like sex
, race
, and RELIG16
, but age
and agekdbrn
should be numbers. The problem is, pd.read_stata()
assumes anything with value labels must be categorical. The GSS data applied value labels to the missing values of the numeric values, so now Python thinks they’re categorical. Fortunately that’s easy to fix, with one minor wrinkle we can see if we look at the values of age
.
gss['age']
selects just the age
column from gss
. That’s one of several ways of subsetting a DataFrame. The value_counts()
function gives you frequencies–and lists all the values. By default it will be sorted such that the most common category comes first. But if you pass in the key word argument sort=False
it will leave them in their original order.
When we passed a URL to pd.read_stata()
the argument was understood by its position as the first (and only) argument. For many functions the first one or two arguments are “obvious” (like the thing to read for pd.read_stata()
). But using key word arguments makes it clear what each argument means.
'age'].value_counts(sort=False) gss[
18.0 241
19.0 861
20.0 885
21.0 1014
22.0 1082
...
85.0 197
86.0 186
87.0 148
88.0 121
89 OR OLDER 364
Name: age, Length: 72, dtype: int64
Python isn’t going to be able to convert 89 OR OLDER
to a number. We’ll replace it with just 89.0
. The first step is to add it as an allowable category for the variable. The cat
attribute of a categorical Series stores information and functions related to the categories, including the add_categories()
function. Pass in ['89.0']
. This is technically a one-item list–more on lists later.
Now we need to change all the rows where age
is 89 OR OLDER
to 89.0
. To do this we will select the rows and column that need to be changed and set them to 89.0
. This is done with the loc[]
function, short for location, another method of subsetting. If you put two things in loc[]
’s brackets, the first specifies rows and the second columns. Here we’ll specify the rows with a condition, gss['age']=='89 OR OLDER'
, meaning rows where that condition is true. This is just like saying if 'age'=='89 OR OLDER'
in Stata. In Python you need to specify the DataFrame, gss
, because it’s quite possible to write conditions based on a completely separate DataFrame or Series. Then select the age
column. All the rows and columns selected will be set to '89.0'
.
Check you work with value_counts()
.
'age'] = gss['age'].cat.add_categories(['89.0'])
gss[
'age']=='89 OR OLDER', 'age'] = '89.0'
gss.loc[gss[
'age'].value_counts(sort=False) gss[
18.0 241
19.0 861
20.0 885
21.0 1014
22.0 1082
...
86.0 186
87.0 148
88.0 121
89 OR OLDER 0
89.0 364
Name: age, Length: 73, dtype: int64
Now all the numeric variables are ready to be converted. You can do that with the pd.to_numeric()
function, but it only works on one variable (Series) at a time. So we need to loop over the numeric variables. Fortunately, looping is easy in Python (especially compared to Stata–see Stata Programming Essentials).
A for loop executes some code once for each item in a list. A Python list is one or more items in square brackets (yes, Python uses square brackets for too many things) separated by columns. A placeholder variable (we’ll call it var
) stores each item in turn. The loop starts with a colon (:
). The code that is part of the loop is indented, and the loop ends when the indentation ends. In Python indentation doesn’t just make code easier to read–it’s essential to the syntax!
for var in ['age', 'agekdbrn', 'PASEI10', 'cohort', 'wtssall']:
= pd.to_numeric(gss[var])
gss[var] gss.dtypes
year int16
age float64
agekdbrn float64
degree category
sex category
race category
RES16 category
FAMILY16 category
RELIG16 category
PASEI10 float64
cohort float64
wtssall float64
dtype: object
Summary Statistics
Now that the numeric variables are actually numbers, you can call the DataFrame describe()
function to get summary statistics:
gss.describe()
year | age | agekdbrn | PASEI10 | cohort | wtssall | |
---|---|---|---|---|---|---|
count | 64814.000000 | 64586.000000 | 23658.000000 | 51104.000000 | 64586.000000 | 64814.000000 |
mean | 1994.939180 | 46.099356 | 23.911024 | 44.105606 | 1948.846128 | 1.000015 |
std | 13.465368 | 17.534703 | 5.560892 | 19.370755 | 21.262659 | 0.468172 |
min | 1972.000000 | 18.000000 | 9.000000 | 9.000000 | 1883.000000 | 0.391825 |
25% | 1984.000000 | 31.000000 | 20.000000 | 28.700000 | 1934.000000 | 0.550100 |
50% | 1996.000000 | 44.000000 | 23.000000 | 39.900000 | 1951.000000 | 0.970900 |
75% | 2006.000000 | 59.000000 | 27.000000 | 57.500000 | 1964.000000 | 1.098500 |
max | 2018.000000 | 89.000000 | 65.000000 | 93.700000 | 2000.000000 | 8.739876 |
It automatically skips categorical variables. For them you want frequencies, so loop over a list of all the categorical variables and run value_counts()
on each. Implicit print will only print one thing per Jupyter cell, so have the loop do an explicit print with the print()
function. Pass in a second argument, '\n'
(new line) to put a blank line between each variable’s frequencies:
for var in ['degree', 'sex', 'race', 'RES16', 'FAMILY16', 'RELIG16']:
print(gss[var].value_counts(dropna=False), '\n')
HIGH SCHOOL 33195
LT HIGH SCHOOL 13587
bachelor 9475
graduate 4716
JUNIOR COLLEGE 3668
NaN 173
Name: degree, dtype: int64
female 36200
male 28614
Name: sex, dtype: int64
white 52033
black 9187
other 3594
Name: race, dtype: int64
TOWN LT 50000 20076
CITY GT 250000 9851
50000 TO 250000 9805
farm 9440
BIG-CITY SUBURB 7078
COUNTRY,NONFARM 6935
NaN 1629
Name: RES16, dtype: int64
MOTHER & FATHER 44838
mother 8264
MOTHER & STPFATHER 3202
other 1606
NaN 1547
father 1533
M AND F RELATIVES 1434
FATHER & STPMOTHER 1199
FEMALE RELATIVE 980
MALE RELATIVE 211
Name: FAMILY16, dtype: int64
protestant 37042
catholic 17974
NaN 3438
none 3421
jewish 1266
other 648
christian 417
MOSLEM/ISLAM 161
ORTHODOX-CHRISTIAN 138
buddhism 128
hinduism 114
INTER-NONDENOMINATIONAL 31
NATIVE AMERICAN 24
OTHER EASTERN 12
Name: RELIG16, dtype: int64
Example: Frequencies and Proportions
Following Doug’s example R code, our next task is to make a table with the frequencies and proportions for agekdbrn
. You can get proportions by passing normalize=True
to value_counts()
:
'agekdbrn'].value_counts(normalize=True) gss[
21.0 0.090963
20.0 0.084792
19.0 0.077564
22.0 0.070758
18.0 0.066109
25.0 0.065263
23.0 0.064502
24.0 0.060656
26.0 0.050131
27.0 0.047426
17.0 0.043114
28.0 0.042058
30.0 0.038887
29.0 0.033477
16.0 0.026756
32.0 0.023037
31.0 0.021726
33.0 0.016443
35.0 0.013653
34.0 0.011962
15.0 0.009215
36.0 0.008412
37.0 0.006340
38.0 0.005241
39.0 0.004016
40.0 0.003762
14.0 0.003635
41.0 0.002071
42.0 0.001944
13.0 0.001522
43.0 0.000761
45.0 0.000634
44.0 0.000549
47.0 0.000507
46.0 0.000465
12.0 0.000338
50.0 0.000296
48.0 0.000169
49.0 0.000169
51.0 0.000169
52.0 0.000127
9.0 0.000085
65.0 0.000042
54.0 0.000042
11.0 0.000042
53.0 0.000042
56.0 0.000042
55.0 0.000042
57.0 0.000042
Name: agekdbrn, dtype: float64
To combine the proportions with the frequencies, use pd.concat()
. It takes a list of Series and puts them together in a DataFrame. We could create the Series ahead of time, but we can also just pass in the function calls that create them.
By default, pd.concat()
will add the new Series as new rows. We want to add a new column, so pass in the key word argument axis=1
. Python starts counting from 0, so rows are axis 0 and columns are axis 1.
The result is DataFrame, with the values of agekdbrn
as an index. We can call functions of the DataFrame just like we call functions of gss
. To put the table in order by age, add the sort_index()
function. This is an example of function chaining: we’re calling a function of the result of a previous function. (This is what R does with its pipe.)
'agekdbrn'].value_counts(), gss['agekdbrn'].value_counts(normalize=True)], axis=1).sort_index() pd.concat([gss[
agekdbrn | agekdbrn | |
---|---|---|
9.0 | 2 | 0.000085 |
11.0 | 1 | 0.000042 |
12.0 | 8 | 0.000338 |
13.0 | 36 | 0.001522 |
14.0 | 86 | 0.003635 |
15.0 | 218 | 0.009215 |
16.0 | 633 | 0.026756 |
17.0 | 1020 | 0.043114 |
18.0 | 1564 | 0.066109 |
19.0 | 1835 | 0.077564 |
20.0 | 2006 | 0.084792 |
21.0 | 2152 | 0.090963 |
22.0 | 1674 | 0.070758 |
23.0 | 1526 | 0.064502 |
24.0 | 1435 | 0.060656 |
25.0 | 1544 | 0.065263 |
26.0 | 1186 | 0.050131 |
27.0 | 1122 | 0.047426 |
28.0 | 995 | 0.042058 |
29.0 | 792 | 0.033477 |
30.0 | 920 | 0.038887 |
31.0 | 514 | 0.021726 |
32.0 | 545 | 0.023037 |
33.0 | 389 | 0.016443 |
34.0 | 283 | 0.011962 |
35.0 | 323 | 0.013653 |
36.0 | 199 | 0.008412 |
37.0 | 150 | 0.006340 |
38.0 | 124 | 0.005241 |
39.0 | 95 | 0.004016 |
40.0 | 89 | 0.003762 |
41.0 | 49 | 0.002071 |
42.0 | 46 | 0.001944 |
43.0 | 18 | 0.000761 |
44.0 | 13 | 0.000549 |
45.0 | 15 | 0.000634 |
46.0 | 11 | 0.000465 |
47.0 | 12 | 0.000507 |
48.0 | 4 | 0.000169 |
49.0 | 4 | 0.000169 |
50.0 | 7 | 0.000296 |
51.0 | 4 | 0.000169 |
52.0 | 3 | 0.000127 |
53.0 | 1 | 0.000042 |
54.0 | 1 | 0.000042 |
55.0 | 1 | 0.000042 |
56.0 | 1 | 0.000042 |
57.0 | 1 | 0.000042 |
65.0 | 1 | 0.000042 |
Calculating a Summary Statistic
The mean()
function calculates means. Similar functions are available for other summary statistics.
'agekdbrn'].mean() gss[
23.911023755177954
Doug’s example does a t-test next, but in Python that requires a whole other library (SciPy) so we’ll skip it. Python’s not that great for statistics anyway.
Data Visualization
Doug cheated and used R’s quick-and-ugly plot()
function rather than teaching you ggplot()
. Python doesn’t have a quick-and-ugly way to make graphs, but it does have a port of ggplot
called plotnine
. So, ironically enough, I’m going to teach you how to use R’s standard graphics tool in a Python lecture.
ggplot()
is based on the Grammar of Graphics (hence the name). In the Grammar of Graphics, and aesthetic defines the relationship between a variable and a feature of the graph, such as the x-coordinate or the color of the points. A geometry defines what the relationship will look like.
Histograms
First, import plotnine as p9
. To create a plot, call the p9.ggplot()
function, passing in the DataFrame to be plotted. Then add to it the result of a p9.aes()
function that defines the aesthetic, then one of the p9.geom()
functions with the desired geometry. To make a histogram of agekdbrn
, make x='agekdbrn'
the aesthetic and the geometry geom_histogram()
. Pass in binwidth=1
to get one bin for each value.
import plotnine as p9
+ p9.aes(x='agekdbrn') + p9.geom_histogram(binwidth=1) p9.ggplot(gss)
/home/r/rdimond/.local/lib/python3.9/site-packages/plotnine/layer.py:333: PlotnineWarning: stat_bin : Removed 41156 rows containing non-finite values.
<ggplot: (8759070341973)>
The warning just means it’s ignoring the missing values, which is good. Now compare that with age
:
+ p9.aes(x='age') + p9.geom_histogram(binwidth=1) p9.ggplot(gss)
/home/r/rdimond/.local/lib/python3.9/site-packages/plotnine/layer.py:333: PlotnineWarning: stat_bin : Removed 228 rows containing non-finite values.
<ggplot: (8759060933759)>
Scatterplots
To crate a scatterplot, set both an x and a y coordinate in the aesthetic and use p9.geom_point()
.
+ p9.aes(x='age', y='agekdbrn') + p9.geom_point() p9.ggplot(gss)
/home/r/rdimond/.local/lib/python3.9/site-packages/plotnine/layer.py:411: PlotnineWarning: geom_point : Removed 41209 rows containing missing values.
<ggplot: (8759060321470)>
You could almost copy this code and paste it into R and have it run. The only difference is that in R you don’t have to specify which package a function comes from (which is great until two packages try to define the same function) so you’d need to remove p9.
from all the function calls. By the same token, you can read our guide to Data Visualization in R with ggplot2, or take the class, and apply it in Python.
Recoding
Next we’ll recode degree
and create ed3
with just three values. Start by taking a look at the values of degree
:
'degree'].value_counts(sort=False) gss[
LT HIGH SCHOOL 13587
HIGH SCHOOL 33195
JUNIOR COLLEGE 3668
bachelor 9475
graduate 4716
Name: degree, dtype: int64
The categories we want for ed3
are ‘Less Than High School’, ‘High School or Junior College’, and ‘Bachelor Degree or Higher’. To create ed3
you just use it in a subset as if it already existed, assign it a value, and it will be created automatically. If the subset does not contain all rows, the other rows will get missing (NaN
).
Thus the recoding process works just like changing 89 OR OLDER
to 89.0
: select the proper rows and the new ed3
column using loc[]
and set them to the appropriate value. Use the pipe character (|
) for logical or, just like in Stata, but Python is really picky about having each part of the condition in parentheses.
'degree']=='LT HIGH SCHOOL', 'ed3'] = 'Less Than High School'
gss.loc[gss['degree']=='HIGH SCHOOL') | (gss['degree']=='JUNIOR COLLEGE'), 'ed3'] = 'High School or Junior College'
gss.loc[(gss['degree']=='bachelor') | (gss['degree']=='graduate'), 'ed3'] = 'Bachelor Degree or Higher'
gss.loc[(gss[ gss
year | age | agekdbrn | degree | sex | race | RES16 | FAMILY16 | RELIG16 | PASEI10 | cohort | wtssall | ed3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1972 | 23.0 | NaN | bachelor | female | white | BIG-CITY SUBURB | father | NaN | 62.0 | 1949.0 | 0.444600 | Bachelor Degree or Higher |
1 | 1972 | 70.0 | NaN | LT HIGH SCHOOL | male | white | CITY GT 250000 | M AND F RELATIVES | NaN | 67.7 | 1902.0 | 0.889300 | Less Than High School |
2 | 1972 | 48.0 | NaN | HIGH SCHOOL | female | white | CITY GT 250000 | MOTHER & FATHER | NaN | 31.6 | 1924.0 | 0.889300 | High School or Junior College |
3 | 1972 | 27.0 | NaN | bachelor | female | white | CITY GT 250000 | MOTHER & FATHER | NaN | 69.3 | 1945.0 | 0.889300 | Bachelor Degree or Higher |
4 | 1972 | 61.0 | NaN | HIGH SCHOOL | female | white | TOWN LT 50000 | MOTHER & FATHER | NaN | 38.3 | 1911.0 | 0.889300 | High School or Junior College |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
64809 | 2018 | 37.0 | 29.0 | HIGH SCHOOL | female | white | TOWN LT 50000 | MOTHER & STPFATHER | catholic | 77.4 | 1981.0 | 0.471499 | High School or Junior College |
64810 | 2018 | 75.0 | 19.0 | HIGH SCHOOL | female | white | farm | MOTHER & FATHER | none | 14.0 | 1943.0 | 0.942997 | High School or Junior College |
64811 | 2018 | 67.0 | 22.0 | HIGH SCHOOL | female | white | COUNTRY,NONFARM | MOTHER & FATHER | protestant | 59.1 | 1951.0 | 0.942997 | High School or Junior College |
64812 | 2018 | 72.0 | 31.0 | HIGH SCHOOL | male | white | TOWN LT 50000 | MOTHER & FATHER | protestant | 68.6 | 1946.0 | 0.942997 | High School or Junior College |
64813 | 2018 | 79.0 | 21.0 | HIGH SCHOOL | female | white | farm | MOTHER & FATHER | catholic | 39.9 | 1939.0 | 0.471499 | High School or Junior College |
64814 rows × 13 columns
Creating Categorical Variables
If you look at the dtypes
of gss
, you’ll see ed3
is an object
. While that could mean just about anything, in practice if a variable shows up as object
it’s almost always a string. We want it to be a categorical variable.
gss.dtypes
year int16
age float64
agekdbrn float64
degree category
sex category
race category
RES16 category
FAMILY16 category
RELIG16 category
PASEI10 float64
cohort float64
wtssall float64
ed3 object
dtype: object
Like in Stata, a Python categorical variable has a string description of the category and an underlying number. But in Python you never need to know or care what the number is. You just write code using the string descriptions, like we’ve done so far. To make a column a categorical variable, call its astype()
function and pass in category
. Python will assign numeric values for each string, without any further intervention by you.
Why not just leave them as string? Some tasks need them to be categories, but also storing one number per observation takes a lot less memory than one string per observation.
'ed3'] = gss['ed3'].astype('category')
gss[ gss.dtypes
year int16
age float64
agekdbrn float64
degree category
sex category
race category
RES16 category
FAMILY16 category
RELIG16 category
PASEI10 float64
cohort float64
wtssall float64
ed3 category
dtype: object
ed3
is really an ordered categorical variable, but Python doesn’t know that yet. To tell it, call the column’s cat.reorder_categories()
function, passing in a list of the categories in order and the argument ordered=True
.
'ed3'] = gss['ed3'].cat.reorder_categories(
gss[
['Less Than High School',
'High School or Junior College',
'Bachelor Degree or Higher'
],=True
ordered )
Once a categorical variable is ordered, you can use things like greater than, less than, max()
and min()
even though the values are all strings (as far as you’re concerned):
'ed3'].max() gss[
'Bachelor Degree or Higher'
Boxplots
Now make a boxplot showing the distribution of agekdbrn
by level of ed3
. The aesthetic for this is that ed3
is the x-coordinate and agekdbrn
the y-coordinate. The geometry is p9.geom_boxplot()
.
+ p9.aes(x='ed3', y='agekdbrn') + p9.geom_boxplot() p9.ggplot(gss)
/home/r/rdimond/.local/lib/python3.9/site-packages/plotnine/layer.py:333: PlotnineWarning: stat_boxplot : Removed 41156 rows containing non-finite values.
<ggplot: (8759060258036)>
That’s ugly because the x-axis labels overlap. Fortunately there’s a simple solution: turn the graph sideways. You can’t do that by just swapping the x and y coordinate because p9.geom_boxplot()
always plots the distribution of y across the levels of x. Instead, add p9.coord_flip()
to the chain. This makes the logical x axis vertical and the logical y axis horizontal:
+ p9.aes(x='ed3', y='agekdbrn') + p9.geom_boxplot() + p9.coord_flip() p9.ggplot(gss)
/home/r/rdimond/.local/lib/python3.9/site-packages/plotnine/layer.py:333: PlotnineWarning: stat_boxplot : Removed 41156 rows containing non-finite values.
<ggplot: (8759060242270)>
Much better! Most of the time bar graphs, boxplots, or anything else with categorical labels will work better if the categorical variable is vertical and the bars/boxes/whatever are horizontal.
It may be very important to your analysis to recognize that the people with NaN for ed3
are most like the people with less than a high school education when it comes to agekdbrn
. It looks like the data are not missing completely at random and that could bias your analysis. But for a paper you probably don’t want to include that category in your graph. That’s okay: just pass a subset of gss
to p9.ggplot()
.
You just can’t select that subset the way you’re used to in Stata. In Python, any condition involving NaN (missing) returns False. So if you wrote gss['ed3']==NaN
the result would be False whether ed3
was missing or not. (Actually, that code wouldn’t work. We won’t go into how to write it in a valid way because it still wouldn’t give the result you need.) Instead, use the functions notna()
or isna()
to detect whether a value is missing or not. In this case we want notna()
.
'ed3'].notna()]) + p9.aes(x='ed3', y='agekdbrn') + p9.geom_boxplot() + p9.coord_flip() p9.ggplot(gss[gss[
/home/r/rdimond/.local/lib/python3.9/site-packages/plotnine/layer.py:333: PlotnineWarning: stat_boxplot : Removed 41015 rows containing non-finite values.
<ggplot: (8759058393338)>