In the past I have taken a few MOOCs with Udacity and Coursera. Here I am again, enrolled in another Coursera course, this time Data Analysis Tools. The topic interests me, and it brings a small benefit by fulfilling a training requirement at my job.

This course explores hypothesis testing with a number of different tools, here in Python. I have mostly been working in R for the past year or two, but before that I learned Python through the very first iteration of Udacity’s Intro to Computer Science (a very good course, I might add). The assignments in this Coursera course are to be posted to a blog, and luckily I happen to have this one that I have been occasionally posting to. So without further digression...

My submission for the assignment “Running an analysis of variance”

To begin, I am using the NESARC dataset for this assignment.

Hypothesis Test Question –

Is the average daily volume of ethanol consumed by respondents who report alcohol consumption equal (H0) or not equal (Ha) across reported ethnicities?

When examining the association between ethnicity (categorical) and ethanol consumption (quantitative), an Analysis of Variance (ANOVA) reveals that the null hypothesis can be rejected.
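For readers new to ANOVA, the F-statistic is the ratio of the variance between group means to the variance within groups. The following is a minimal sketch of that arithmetic in plain Python. The function name `one_way_anova_f` and the toy numbers are made up for illustration and have nothing to do with the NESARC data.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric samples."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k = len(groups)        # number of groups
    n = len(all_values)    # total number of observations

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical toy data: three groups of three observations each
toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_b, df_w = one_way_anova_f(toy_groups)
print(f_stat, df_b, df_w)
```

A large F means the group means are spread further apart than the scatter inside each group would suggest, which is what leads to rejecting H0.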

Code:

import numpy

import pandas

import statsmodels.formula.api as smf

import statsmodels.stats.multicomp as multi

data = pandas.read_csv('nesarc_pds.csv', low_memory=False)

# set the variables you will be working with to numeric
# (convert_objects was removed from pandas; to_numeric replaces it)

data['ETHRACE2A'] = pandas.to_numeric(data['ETHRACE2A'], errors='coerce')

data['ETOTLCA2'] = pandas.to_numeric(data['ETOTLCA2'], errors='coerce')

# subset data to those reporting alcohol consumption

sub1 = data[data['ETOTLCA2'] > 0]

ct1 = sub1.groupby('ETOTLCA2').size()

print(ct1)

# use the ols function to calculate the F-statistic and associated p value

model1 = smf.ols(formula='ETOTLCA2 ~ C(ETHRACE2A)', data=sub1)

results1 = model1.fit()

print(results1.summary())

/Code

Result:

/Result
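The summary table is long, and the two numbers that matter for the ANOVA can also be read straight off the fitted model. Here is a small sketch on made-up data (the column names `y` and `g` are placeholders, not NESARC variables), using the `fvalue` and `f_pvalue` attributes of the statsmodels OLS results object.

```python
import pandas
import statsmodels.formula.api as smf

# Hypothetical toy data: three groups coded 1, 2, 3
toy = pandas.DataFrame({
    'y': [1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    'g': [1, 1, 1, 2, 2, 2, 3, 3, 3],
})

# C(g) tells the formula to treat g as categorical, as in the model above
toy_results = smf.ols(formula='y ~ C(g)', data=toy).fit()

f_value = toy_results.fvalue      # overall F-statistic
p_value = toy_results.f_pvalue    # p value for that F-statistic
print(f_value, p_value)
```

On the real data the same two attributes of `results1` should hold the F and p shown in the summary.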

Ethnicity being:

1 White, Not Hispanic or Latino

2 Black, Not Hispanic or Latino

3 American Indian/Alaska Native, Not Hispanic or Latino

4 Asian/Native Hawaiian/Pacific Islander, Not Hispanic or Latino

5 Hispanic or Latino

Code:

sub2 = sub1[['ETOTLCA2', 'ETHRACE2A']].dropna()

print('means for ETOTLCA2 by ethnicity')

m1 = sub2.groupby('ETHRACE2A').mean()

print(m1)

print('median for ETOTLCA2 by ethnicity')

md1 = sub2.groupby('ETHRACE2A').median()

print(md1)

print('standard deviations for ETOTLCA2 by ethnicity')

sd1 = sub2.groupby('ETHRACE2A').std()

print(sd1)

/Code

Result:

/Result

The OLS regression results give F=3.997, p=0.00304. Among the survey sample, for example, those reporting White, non-Hispanic ethnicity (mean=0.55, s.d.=1.33) consumed less on average than those reporting American Indian/Alaska Native, non-Hispanic ethnicity (mean=0.86, s.d.=2.48). With p well below the conventional 0.05 threshold, ethnicity has a statistically significant association with ethanol consumption.

The ANOVA alone does not say which groups differ, so with this encouraging result a post hoc pairwise comparison was conducted using Tukey’s HSD test.

Code:

sub3 = sub1[['ETOTLCA2', 'ETHRACE2A']].dropna()

model2 = smf.ols(formula='ETOTLCA2 ~ C(ETHRACE2A)', data=sub3).fit()

print(model2.summary())

print('means for ETOTLCA2 by ethnicity')

m2 = sub3.groupby('ETHRACE2A').mean()

print(m2)

print('standard deviations for ETOTLCA2 by ethnicity')

sd2 = sub3.groupby('ETHRACE2A').std()

print(sd2)

mc1 = multi.MultiComparison(sub3['ETOTLCA2'], sub3['ETHRACE2A'])

res1 = mc1.tukeyhsd()

print(res1.summary())

/Code

Result:

/Result

In this post hoc test we can see a significant difference between groups 1 (White) and 3 (American Indian), as well as between groups 3 and 4 (Asian). Mean ethanol consumption is highest among the American Indian respondents and lowest among the Asian respondents.
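To show how the Tukey output is read, here is a small sketch on invented data where two groups are nearly identical and a third is clearly higher. The group labels and values are placeholders, not NESARC results. The `reject` attribute of the result holds one True/False per pair, in the same order as the rows of the printed summary.

```python
import numpy
import statsmodels.stats.multicomp as multi

# Hypothetical toy data: g1 and g2 are alike, g3 is clearly higher
values = numpy.array([
    1.0, 1.2, 0.8, 1.1, 0.9,      # g1
    1.05, 0.95, 1.15, 0.85, 1.0,  # g2
    5.0, 5.2, 4.8, 5.1, 4.9,      # g3
])
labels = numpy.array(['g1'] * 5 + ['g2'] * 5 + ['g3'] * 5)

toy_mc = multi.MultiComparison(values, labels)
tukey_result = toy_mc.tukeyhsd()

print(tukey_result.summary())

# One entry per pair: (g1, g2), (g1, g3), (g2, g3)
reject_flags = tukey_result.reject.tolist()
print(reject_flags)
```

A True in `reject` corresponds to a pair whose means differ significantly, which is how the group 1 vs 3 and group 3 vs 4 differences were identified above.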