Pearson Correlation (Coursera – Data Analysis Tools)

This week’s tool is Pearson Correlation.

I am continuing with the NESARC dataset for this assignment, since I used it for the previous two and have gained some familiarity with it.  This is an important point: it is very difficult to come to a data set that is brand new to you and make any sense of it.  Domain knowledge, or at least a workable understanding of the data, is essential to working with it.  Having spent a few weeks with NESARC, I am now much more comfortable with it.

Pearson correlation works with two quantitative variables.  The result is an r value, also called the correlation coefficient, which ranges from -1 to +1 and measures the strength and direction of the linear relationship between the two variables.  This time I have chosen to compare the number of cigarettes smoked with ethanol consumption.
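
As a rough illustration of what the r value measures, here is a minimal sketch that computes it by hand in plain Python. The function name `pearson_r` and the toy numbers are made up for this example and are not NESARC data; in practice a library routine such as `scipy.stats.pearsonr` does the same job and also returns a p value.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of the deviations from each mean
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up toy values for illustration only
cigs = [5, 10, 20, 20, 40]
ethanol = [0.2, 0.1, 0.5, 0.3, 0.6]
r = pearson_r(cigs, ethanol)
print(round(r, 3))
```

A perfectly linear increasing relationship gives r = 1, a perfectly linear decreasing one gives r = -1, and values near 0 indicate little linear association.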

Hypothesis Test Question –
Is the number of cigarettes smoked per day associated with the volume of ethanol consumed per day?


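import pandas
import numpy
import seaborn
import scipy.stats
import matplotlib.pyplot as plt

data = pandas.read_csv('nesarc_sub.csv', low_memory=False)

# set the variables we will be working with to numeric
# (pandas.to_numeric replaces the removed convert_objects method)
data['ETOTLCA2'] = pandas.to_numeric(data['ETOTLCA2'], errors='coerce')
data['S3AQ3C1'] = pandas.to_numeric(data['S3AQ3C1'], errors='coerce')

# scatterplot with a fitted regression line
scat1 = seaborn.regplot(x='S3AQ3C1', y='ETOTLCA2', fit_reg=True, data=data)
plt.xlabel('Cigarettes Smoked Per Day')
plt.ylabel('Volume of Ethanol Consumed Per Day')
plt.title('Scatterplot for the Association Between Cigarettes Smoked and Ethanol Consumed')
plt.show()

# keep only the rows within +3 to -3 standard deviations of the mean ethanol volume
# NOTE: the filtering lines were missing from the post; this is a reconstruction
# based on the comment above and the note below, and it also drops missing values
data_clean = data[['S3AQ3C1', 'ETOTLCA2']].dropna()
eth = data_clean['ETOTLCA2']
data_clean = data_clean[numpy.abs(eth - eth.mean()) <= 3 * eth.std()]

print('association between cig smoked and eth consumed')
print(scipy.stats.pearsonr(data_clean['S3AQ3C1'], data_clean['ETOTLCA2']))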


Note that I have removed the outliers (more than 3 standard deviations from the mean) in the ethanol consumed variable.

We see in the graph that there is a slight positive correlation between cigarettes smoked and volume of ethanol consumed.  With an r value of 0.063 we know that this correlation is quite weak.  Squaring it, 0.063 * 0.063 = 0.00397, tells us that less than 1% of the variability in ethanol volume consumed can be explained by cigarettes smoked per day, so cigarettes smoked would be a very poor predictor.  However, the extremely low p value (7.54e-13) shows that the result is statistically significant.
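
A weak correlation can still be highly significant when the sample is large. The sketch below squares the reported r and then shows how the usual t statistic for a correlation, t = r * sqrt((n - 2) / (1 - r^2)), grows with the number of observations. The sample sizes are hypothetical, since the post does not state n, and `t_statistic` is a helper name invented for this example.

```python
import math

r = 0.063                 # correlation reported in this post
r_squared = r ** 2        # fraction of variance explained
print(f"r^2 = {r_squared:.6f} ({r_squared:.2%} of the variability)")

def t_statistic(r, n):
    """t statistic for testing r against zero with n paired observations."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# The sample sizes below are hypothetical; the post does not state n
for n in (100, 1000, 10000):
    print(n, round(t_statistic(r, n), 2))
```

The same r produces a larger t, and therefore a smaller p value, as n increases, which is why statistical significance alone says little about the strength of the relationship.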







Chi Square Test of Independence (Coursera – Data Analysis Tools)

Finally, on to the second assignment in this course.

As in the first assignment, I am using the NESARC dataset. In the chi-square test of independence, rather than a categorical explanatory variable and a quantitative response variable, we have two categorical variables. This time I have chosen alcohol dependence and the category of how often a respondent drank alcohol.

Hypothesis Test Question –
Is the number of days per month on which alcohol is consumed associated with alcohol dependence?

When examining this association, the chi-square test of independence reveals that the null hypothesis can be rejected.


import pandas
import numpy
import scipy.stats
import seaborn
import matplotlib.pyplot as plt

data = pandas.read_csv('nesarc_pds.csv', low_memory=False)

# set the variables we will be working with to numeric
data['ALCABDEP12DX'] = pandas.to_numeric(data['ALCABDEP12DX'], errors='coerce')
data['CONSUMER'] = pandas.to_numeric(data['CONSUMER'], errors='coerce')
data['S2AQ8A'] = pandas.to_numeric(data['S2AQ8A'], errors='coerce')
data['S2AQ8B'] = pandas.to_numeric(data['S2AQ8B'], errors='coerce')
data['AGE'] = pandas.to_numeric(data['AGE'], errors='coerce')

# subset data to young adults age 18 to 25 who have consumed alcohol in the past 12 months
sub1 = data[(data['AGE'] >= 18) & (data['AGE'] <= 25) & (data['CONSUMER'] == 1)]

# make a copy of the subsetted data
sub2 = sub1.copy()

# recode missing values to python missing (NaN)
# note: code 9 of S2AQ8A is set to NaN here, so the 9 entry in recode1 below is never applied
sub2['S2AQ8A'] = sub2['S2AQ8A'].replace(9, numpy.nan)
sub2['S2AQ8B'] = sub2['S2AQ8B'].replace(99, numpy.nan)

# recode S2AQ8A (drinking frequency category) into a new variable, USFREQMO (days per month)
recode1 = {1: 30, 2: 30, 3: 14, 4: 6, 5: 6, 6: 2.5, 7: 1, 8: 0.5, 9: 0.5, 10: 0.5}
sub2['USFREQMO'] = sub2['S2AQ8A'].map(recode1)
# recode ALCABDEP12DX into a new binary variable, ALCDEP
recode2 = {0: 0, 1: 0, 2: 1, 3: 1}
sub2['ALCDEP'] = sub2['ALCABDEP12DX'].map(recode2)

# contingency table of observed counts
ct1 = pandas.crosstab(sub2['ALCDEP'], sub2['USFREQMO'])
print(ct1)

# column percentages
# NOTE: these lines were missing from the post; reconstructed from the comment above
colsum = ct1.sum(axis=0)
colpct = ct1 / colsum
print(colpct)

# chi-square
print('chi-square value, p value, degrees of freedom, expected counts')
cs1 = scipy.stats.chi2_contingency(ct1)
print(cs1)
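
To make the printed chi-square output less of a black box, here is a minimal plain-Python sketch of how the statistic, the degrees of freedom and the expected counts are obtained from a table of observed counts. The function name `chi_square` and the toy table are invented for illustration and are not the NESARC counts.

```python
def chi_square(observed):
    """Return (statistic, degrees of freedom, expected counts) for a
    contingency table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # Expected count under independence: row total * column total / grand total
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected

# Made-up 2 x 3 table for illustration only
toy_table = [[90, 60, 30],
             [10, 40, 70]]
stat, dof, expected = chi_square(toy_table)
print(stat, dof)
print(expected)
```

With 2 rows and 6 columns, as in the dependence-by-frequency table, the same formula gives (2 - 1) * (6 - 1) = 5 degrees of freedom.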





Alcohol Dependence being:

0 No alcohol diagnosis, or alcohol abuse only
1 Alcohol dependence only or alcohol abuse and dependence

Frequency of use being number of days per month.
A chi-square test of independence revealed that among young adults (age 18 to 25) who drank alcohol in the past 12 months, the number of days per month on which alcohol is consumed (collapsed into 6 ordered categories) and past-year alcohol dependence (a binary categorical variable) were significantly associated, X2 = 466.55, 5 df, p = 1.32e-98.
Post hoc comparisons of rates of alcohol dependence by pairs of drinking-frequency categories revealed that higher rates of alcohol dependence were seen among those drinking on more days, up to 14 to 13 days per month. In comparison, prevalence of alcohol dependence was statistically similar among the groups drinking on 6 or fewer days per month.
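
The post does not show the post hoc code or name the adjustment used. One common approach, assumed here, is to run a separate 2 x 2 chi-square test for every pair of categories and compare each p value against a Bonferroni-adjusted threshold. The sketch below does this in plain Python with invented counts (not the NESARC results). Unlike `scipy.stats.chi2_contingency`, which applies a continuity correction to 2 x 2 tables by default, this version applies none.

```python
import math
from itertools import combinations

# Hypothetical counts per frequency category: (not dependent, dependent).
# Invented for illustration; these are NOT the NESARC results.
toy_counts = {
    0.5: (400, 10),
    1: (300, 12),
    2.5: (350, 30),
    6: (500, 80),
    14: (200, 70),
    30: (60, 40),
}

def chi2_2x2(a, b):
    """Pearson chi-square (no continuity correction) and p value for the
    2 x 2 table whose columns are the count pairs a and b."""
    table = [[a[0], b[0]], [a[1], b[1]]]
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    stat = sum(
        (table[i][j] - rows[i] * cols[j] / total) ** 2 / (rows[i] * cols[j] / total)
        for i in range(2) for j in range(2)
    )
    # With 1 degree of freedom, P(chi2 > stat) = erfc(sqrt(stat / 2))
    return stat, math.erfc(math.sqrt(stat / 2))

pairs = list(combinations(toy_counts, 2))
adjusted_alpha = 0.05 / len(pairs)  # Bonferroni: 0.05 divided by 15 comparisons

for left, right in pairs:
    stat, p = chi2_2x2(toy_counts[left], toy_counts[right])
    verdict = "significant" if p < adjusted_alpha else "not significant"
    print(f"{left} vs {right}: chi2={stat:.2f}, p={p:.3g} ({verdict})")
```

Six categories give 15 pairwise comparisons, so each one is judged against 0.05 / 15, roughly 0.0033, rather than 0.05.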

Analysis of Variance (ANOVA) (Coursera – Data Analysis Tools)

In the past I have participated in a few MOOCs with Udacity and Coursera.   Here I am again, taking another Coursera course, this time Data Analysis Tools.  This is a topic of interest to me, and it brings a small benefit in terms of fulfilling a training requirement at work.

This course explores hypothesis testing using a number of different tools, in Python.  I have mostly been working in R for the past year or two, but before that I learned Python through the very first iteration of Udacity’s Intro to Computer Science (a very good course, I might add).  The assignments in this Coursera course are to be posted to a blog, and luckily I happen to have this one that I have been occasionally posting to.  So, without further digression...

My submission for the assignment “Running an analysis of variance”

To begin, I am using the NESARC dataset for this assignment.

Hypothesis Test Question –
Is the average daily volume of ethanol consumed by respondents reporting alcohol consumption equal (H0) or not equal (Ha) across reported ethnicities?

When examining the association between ethnicity (categorical) and ethanol consumption (quantitative), an Analysis of Variance (ANOVA) reveals that the null hypothesis can be rejected.


import numpy
import pandas
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

data = pandas.read_csv('nesarc_pds.csv', low_memory=False)

# set the variables we will be working with to numeric
# (pandas.to_numeric replaces the removed convert_objects method)
data['ETHRACE2A'] = pandas.to_numeric(data['ETHRACE2A'], errors='coerce')
data['ETOTLCA2'] = pandas.to_numeric(data['ETOTLCA2'], errors='coerce')


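# NOTE: the line that defines sub1 is missing from the post. As an assumption,
# sub1 is taken here to be the respondents with a recorded ethanol volume.
sub1 = data[data['ETOTLCA2'].notnull()].copy()

ct1 = sub1.groupby('ETOTLCA2').size()
print(ct1)

# use the ols function to calculate the F-statistic and associated p value
model1 = smf.ols(formula='ETOTLCA2 ~ C(ETHRACE2A)', data=sub1)
results1 = model1.fit()
print(results1.summary())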
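
The F-statistic reported in the OLS summary is the ratio of between-group variance to within-group variance. The following plain-Python sketch computes it by hand for a few made-up groups. The function name `one_way_f` and the numbers are illustrative only and have nothing to do with the NESARC values.

```python
def one_way_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA over a list
    of numeric groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares: group size times the squared distance
    # of each group mean from the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within-group sum of squares: squared distance of each value from
    # its own group mean
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within

# Made-up groups for illustration only
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_value, df_b, df_w = one_way_f(toy_groups)
print(f_value, df_b, df_w)
```

A large F means the group means differ by more than the spread within the groups would suggest; the p value then comes from the F distribution with those two degrees of freedom.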




Ethnicity being:

1 White, Not Hispanic or Latino
2 Black, Not Hispanic or Latino
3 American Indian/Alaska Native, Not Hispanic or Latino
4 Asian/Native Hawaiian/Pacific Islander, Not Hispanic or Latino
5 Hispanic or Latino


sub2 = sub1[['ETOTLCA2', 'ETHRACE2A']].dropna()

print('means for ETOTLCA2 by ethnicity')
m1 = sub2.groupby('ETHRACE2A').mean()
print(m1)

print('median for ETOTLCA2 by ethnicity')
md1 = sub2.groupby('ETHRACE2A').median()
print(md1)

print('standard deviations for ETOTLCA2 by ethnicity')
sd1 = sub2.groupby('ETHRACE2A').std()
print(sd1)






Among the survey sample, an ANOVA (run as an OLS regression) of ethanol consumption by ethnicity gave F=3.997, p=0.00304. For example, respondents reporting White, non-Hispanic ethnicity (mean=0.55, s.d. 1.33) drank less on average than those reporting American Indian/Alaska Native, non-Hispanic ethnicity (mean=0.86, s.d. 2.48). This tells us that ethnicity has a statistically significant association with ethanol consumption.

With this encouraging result, pairwise post hoc comparisons were conducted using Tukey’s HSD test.


sub3 = sub1[['ETOTLCA2', 'ETHRACE2A']].dropna()

model2 = smf.ols(formula='ETOTLCA2 ~ C(ETHRACE2A)', data=sub3).fit()
print(model2.summary())

print('means for ETOTLCA2 by ethnicity')
m2 = sub3.groupby('ETHRACE2A').mean()
print(m2)

print('standard deviations for ETOTLCA2 by ethnicity')
sd2 = sub3.groupby('ETHRACE2A').std()
print(sd2)

# Tukey's HSD post hoc comparison of every pair of ethnicity groups
mc1 = multi.MultiComparison(sub3['ETOTLCA2'], sub3['ETHRACE2A'])
res1 = mc1.tukeyhsd()
print(res1.summary())





In this post hoc test we can see that there is a significant difference between groups 1 (White) and 3 (American Indian), as well as between groups 3 and 4 (Asian). Ethanol consumption is greatest among the American Indian respondents and lowest among the Asian respondents.