Analysis of Variance (ANOVA) (Coursera – Data Analysis Tools)

In the past I have participated in a few MOOCs with Udacity and Coursera. Here I am again, taking another Coursera course, this time Data Analysis Tools. The topic interests me, and it brings a small benefit in fulfilling a training requirement at my job.

This course explores hypothesis testing using a number of different tools, here in Python. I have mostly been working in R for the past year or two, but prior to that I learned Python through the very first iteration of Udacity’s Intro to Computer Science (a very good course, I might add). The assignments in this Coursera course are to be posted to a blog, and luckily I happen to have this one that I have been occasionally posting to. So without further digression...

My submission for the assignment “Running an analysis of variance”

To begin, I am using the NESARC dataset for this assignment.

Hypothesis Test Question –
Is the average daily volume of ethanol consumed by respondents who report alcohol consumption equal across reported ethnicities (H0), or does it differ for at least one ethnicity (Ha)?

When examining the association between ethnicity (categorical) and ethanol consumption (quantitative), an analysis of variance (ANOVA) reveals that the null hypothesis can be rejected.


import numpy
import pandas
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

data = pandas.read_csv('nesarc_pds.csv', low_memory=False)

# set the variables you will be working with to numeric
# (convert_objects has been removed from pandas; to_numeric with errors='coerce' turns blanks into NaN)
data['ETHRACE2A'] = pandas.to_numeric(data['ETHRACE2A'], errors='coerce')
data['ETOTLCA2'] = pandas.to_numeric(data['ETOTLCA2'], errors='coerce')


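# NOTE: the definition of sub1 did not survive in this post. As an assumption,
# work on a copy of the full data set; rows with a missing ETOTLCA2 are dropped by ols.
sub1 = data.copy()

ct1 = sub1.groupby('ETOTLCA2').size()
print(ct1)

# using ols function for calculating the F-statistic and associated p value
model1 = smf.ols(formula='ETOTLCA2 ~ C(ETHRACE2A)', data=sub1)
results1 = model1.fit()
print(results1.summary())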
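For intuition about what the F-statistic in that summary measures, here is a minimal sketch that computes a one-way ANOVA F by hand. The data are made-up toy numbers, not NESARC values, and the function name `one_way_anova_f` is my own for illustration. F is the ratio of the variation between group means to the variation within groups.

```python
from statistics import mean

# Hypothetical toy data: three groups of made-up measurements (not NESARC values)
toy_groups = {
    "a": [1.0, 2.0, 3.0],
    "b": [2.0, 3.0, 4.0],
    "c": [5.0, 6.0, 7.0],
}

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a dict mapping group -> list of values."""
    all_values = [v for vals in groups.values() for v in vals]
    grand_mean = mean(all_values)
    k = len(groups)        # number of groups
    n = len(all_values)    # total number of observations

    # Variation of the group means around the grand mean, weighted by group size
    ss_between = sum(len(vals) * (mean(vals) - grand_mean) ** 2
                     for vals in groups.values())
    # Variation of each observation around its own group mean
    ss_within = sum((v - mean(vals)) ** 2
                    for vals in groups.values() for v in vals)

    df_between = k - 1
    df_within = n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

f_stat, df_b, df_w = one_way_anova_f(toy_groups)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")  # F(2, 6) = 13.00
```

A large F means the group means are spread out more than the scatter inside each group would explain. Turning F into a p value needs the F distribution, which is the part statsmodels handles in the summary output.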




The ethnicity codes (ETHRACE2A) are:

1 White, Not Hispanic or Latino
2 Black, Not Hispanic or Latino
3 American Indian/Alaska Native, Not Hispanic or Latino
4 Asian/Native Hawaiian/Pacific Islander, Not Hispanic or Latino
5 Hispanic or Latino


sub2 = sub1[['ETOTLCA2', 'ETHRACE2A']].dropna()

print('means for ETOTLCA2 by ethnicity')
m1 = sub2.groupby('ETHRACE2A').mean()
print(m1)

print('median for ETOTLCA2 by ethnicity')
md1 = sub2.groupby('ETHRACE2A').median()
print(md1)

print('standard deviations for ETOTLCA2 by ethnicity')
sd1 = sub2.groupby('ETHRACE2A').std()
print(sd1)
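The three summaries above can also be collected into one table. This is a minimal sketch on a made-up toy frame: the column names mirror the ones in this post, but the values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy frame; column names mirror the post, values are made up
toy = pd.DataFrame({
    "ETHRACE2A": [1, 1, 1, 2, 2, 2],
    "ETOTLCA2": [0.2, 0.4, 0.6, 1.0, 2.0, 3.0],
})

# One row per ethnicity code, one column per statistic
summary = toy.groupby("ETHRACE2A")["ETOTLCA2"].agg(["count", "mean", "median", "std"])
print(summary)
```

Including the count alongside the mean and standard deviation makes it easier to see which groups are small, which matters when reading the pairwise comparisons later.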






The OLS regression across all five ethnicity groups gave F=3.997, p=0.00304. As an example of the group differences, respondents reporting White, non-Hispanic ethnicity had a mean of 0.55 (s.d. 1.33), while those reporting American Indian/Alaska Native, non-Hispanic ethnicity had a mean of 0.86 (s.d. 2.48). With p well below the conventional 0.05 level, the null hypothesis of equal means is rejected: ethnicity is significantly associated with ethanol consumption.

The ANOVA only says that at least one group differs, not which ones. With this encouraging result, a post hoc pairwise comparison was conducted using Tukey’s HSD test.


sub3 = sub1[['ETOTLCA2', 'ETHRACE2A']].dropna()

model2 = smf.ols(formula='ETOTLCA2 ~ C(ETHRACE2A)', data=sub3).fit()
print(model2.summary())

print('means for ETOTLCA2 by ethnicity')
m2 = sub3.groupby('ETHRACE2A').mean()
print(m2)

print('standard deviations for ETOTLCA2 by ethnicity')
sd2 = sub3.groupby('ETHRACE2A').std()
print(sd2)

# Tukey's HSD compares every pair of ethnicity groups
mc1 = multi.MultiComparison(sub3['ETOTLCA2'], sub3['ETHRACE2A'])
res1 = mc1.tukeyhsd()
print(res1.summary())





In this post hoc test we can see that there is a significant difference between groups 1 (White) and 3 (American Indian), as well as between groups 3 and 4 (Asian). Ethanol consumption is highest among the American Indian respondents and lowest among the Asian respondents.
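As a side note on why a post hoc test such as Tukey's HSD is used instead of simply running a separate test for every pair: with five groups there are ten pairs, and the chance of at least one false positive grows quickly. The sketch below is a rough back-of-the-envelope calculation. It assumes a per-test level of 0.05 and treats the ten tests as independent, which they are not exactly, so read the result as an approximation.

```python
from itertools import combinations

ethnicity_codes = [1, 2, 3, 4, 5]  # the five ETHRACE2A codes
pairs = list(combinations(ethnicity_codes, 2))

alpha = 0.05  # assumed per-test significance level
# Chance of at least one false positive across all pairs,
# under the simplifying assumption that the tests are independent
familywise_error = 1 - (1 - alpha) ** len(pairs)

print(len(pairs))                  # 10
print(round(familywise_error, 3))  # 0.401
```

So uncorrected pairwise testing would flag a spurious difference roughly 40% of the time under these assumptions, which is the problem Tukey's HSD is designed to control.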