Chi Square Test of Independence (Coursera – Data Analysis Tools)

Finally, on to the second assignment in this course.

As in the first assignment, I am using the NESARC dataset. In the chi-square test of independence, rather than a categorical explanatory variable and a quantitative response variable, we have two categorical variables. So this time I have chosen alcohol dependence and the category of how often a respondent drank alcohol.

Hypothesis Test Question –
Is the number of days per month on which alcohol is consumed associated with alcohol dependence?

When examining this association, the chi-square test of independence shows that the null hypothesis of no association can be rejected.


import pandas
import numpy
import scipy.stats
import seaborn
import matplotlib.pyplot as plt

data = pandas.read_csv('nesarc_pds.csv', low_memory=False)

# set the variables you will be working with to numeric
# (pandas.to_numeric replaces convert_objects, which newer pandas versions no longer provide)
data['ALCABDEP12DX'] = pandas.to_numeric(data['ALCABDEP12DX'], errors='coerce')
data['CONSUMER'] = pandas.to_numeric(data['CONSUMER'], errors='coerce')
data['S2AQ8A'] = pandas.to_numeric(data['S2AQ8A'], errors='coerce')
data['S2AQ8B'] = pandas.to_numeric(data['S2AQ8B'], errors='coerce')
data['AGE'] = pandas.to_numeric(data['AGE'], errors='coerce')

# subset data to young adults age 18 to 25 who have consumed alcohol in the past 12 months
sub1 = data[(data['AGE'] >= 18) & (data['AGE'] <= 25) & (data['CONSUMER'] == 1)]

#make a copy of my new subsetted data
sub2 = sub1.copy()

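# recode missing values to python missing (NaN)
# note: code 9 of S2AQ8A becomes NaN here, so the entry for 9 in recode1 below is never applied
sub2['S2AQ8A'] = sub2['S2AQ8A'].replace(9, numpy.nan)
sub2['S2AQ8B'] = sub2['S2AQ8B'].replace(99, numpy.nan)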

# recoding values for S2AQ8A into a new variable, USFREQMO (approximate drinking days per month)
recode1 = {1: 30, 2: 30, 3: 14, 4: 6, 5: 6, 6: 2.5, 7: 1, 8: 0.5, 9: 0.5, 10: 0.5}
sub2['USFREQMO'] = sub2['S2AQ8A'].map(recode1)

# recoding values for ALCABDEP12DX into a new binary variable, ALCDEP
recode2 = {0: 0, 1: 0, 2: 1, 3: 1}
sub2['ALCDEP'] = sub2['ALCABDEP12DX'].map(recode2)

# contingency table of observed counts
ct1 = pandas.crosstab(sub2['ALCDEP'], sub2['USFREQMO'])
print (ct1)

# column percentages
colsum = ct1.sum(axis=0)
colpct = ct1/colsum
print (colpct)

# chi-square
print ('chi-square value, p value, degrees of freedom, expected counts')
cs1 = scipy.stats.chi2_contingency(ct1)
print (cs1)
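The value printed by chi2_contingency is easier to read when it is unpacked into named pieces. Here is a minimal sketch of that, assuming a made-up 2 x 6 table with the same layout as ct1. The counts and the names toy_ct and expected_df are hypothetical and are not NESARC results.

```python
import pandas
import scipy.stats

# Hypothetical counts for illustration only (rows: ALCDEP, columns: USFREQMO).
toy_ct = pandas.DataFrame(
    [[200, 150, 120, 90, 40, 20],
     [10, 12, 20, 30, 25, 30]],
    index=pandas.Index([0, 1], name='ALCDEP'),
    columns=pandas.Index([0.5, 1.0, 2.5, 6.0, 14.0, 30.0], name='USFREQMO'),
)

# chi2_contingency returns four values: statistic, p value,
# degrees of freedom, and the array of expected counts.
chi2, p, dof, expected = scipy.stats.chi2_contingency(toy_ct)

# Put the expected counts back into a labelled table for easier reading.
expected_df = pandas.DataFrame(expected, index=toy_ct.index, columns=toy_ct.columns)

print('chi-square value:', chi2)
print('p value:', p)
print('degrees of freedom:', dof)
print('expected counts:')
print(expected_df)
```

With the real data, ct1 would take the place of toy_ct.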





Alcohol dependence (ALCDEP) is coded as:

0 No alcohol diagnosis, or alcohol abuse only
1 Alcohol dependence only, or alcohol abuse and dependence

Frequency of use (USFREQMO) is the approximate number of days per month on which alcohol was consumed.
A chi-square test of independence revealed that among young adults (age 18 to 25) who drank alcohol in the past 12 months, the number of days per month on which alcohol is consumed (collapsed into 6 ordered categories) and past-year alcohol dependence (binary categorical variable) were significantly associated, X2 = 466.55, 5 df, p = 1.32e-98.
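To show where a statistic and its degrees of freedom come from, the sketch below computes them by hand and compares them with scipy. Each expected count is the row total times the column total divided by the grand total, and the degrees of freedom are (rows - 1) x (columns - 1), which gives 5 for a 2 x 6 table. The counts in observed are hypothetical and only illustrate the arithmetic.

```python
import numpy as np
import scipy.stats

# Hypothetical observed counts (rows: ALCDEP 0/1, columns: three frequency groups).
observed = np.array([[90, 60, 30],
                     [10, 20, 30]])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 3)
n = observed.sum()

# Expected count under independence: row total * column total / grand total.
expected = row_totals * col_totals / n

# Chi-square statistic: sum over cells of (observed - expected)^2 / expected.
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (number of rows - 1) * (number of columns - 1).
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# p value from the upper tail of the chi-square distribution.
p_manual = scipy.stats.chi2.sf(chi2_manual, dof_manual)

# Cross-check against scipy's implementation.
chi2_scipy, p_scipy, dof_scipy, expected_scipy = scipy.stats.chi2_contingency(observed)

print('manual:', chi2_manual, dof_manual, p_manual)
print('scipy: ', chi2_scipy, dof_scipy, p_scipy)
```

The two lines should agree here because scipy only applies its default continuity correction when there is 1 degree of freedom (a 2 x 2 table).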
Post hoc comparisons of rates of alcohol dependence by pairs of drinking-frequency categories revealed that higher rates of alcohol dependence were seen among those drinking on more days, up to 14 to 30 days per month. In comparison, the prevalence of alcohol dependence was statistically similar among the groups drinking on 6 or fewer days per month.
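The code above does not include the post hoc step. One common approach is to run a chi-square test for every pair of frequency categories and compare each p value against a Bonferroni-adjusted threshold of 0.05 divided by the number of comparisons. With 6 categories there are 15 pairs, so that threshold would be about 0.0033. This is a rough sketch and may not match how the comparisons above were run. The data frame toy and the helper pairwise_chi2 are hypothetical, and the counts are made up.

```python
import itertools
import pandas
import scipy.stats

# Hypothetical data standing in for sub2; these are not NESARC values.
toy = pandas.DataFrame({
    'USFREQMO': [0.5] * 40 + [6] * 40 + [30] * 40,
    'ALCDEP':   [0] * 36 + [1] * 4 + [0] * 30 + [1] * 10 + [0] * 15 + [1] * 25,
})

def pairwise_chi2(df, group_col, outcome_col, alpha=0.05):
    """Chi-square test for every pair of groups, with a Bonferroni-adjusted alpha."""
    groups = sorted(df[group_col].dropna().unique())
    pairs = list(itertools.combinations(groups, 2))
    adjusted_alpha = alpha / len(pairs)
    results = []
    for a, b in pairs:
        # Keep only the rows that belong to the two groups being compared.
        pair_df = df[df[group_col].isin([a, b])]
        table = pandas.crosstab(pair_df[outcome_col], pair_df[group_col])
        # For 2 x 2 tables scipy applies a continuity correction by default.
        chi2, p, dof, expected = scipy.stats.chi2_contingency(table)
        results.append((a, b, chi2, p, p < adjusted_alpha))
    return adjusted_alpha, results

adjusted_alpha, results = pairwise_chi2(toy, 'USFREQMO', 'ALCDEP')
print('Bonferroni-adjusted alpha:', adjusted_alpha)
for a, b, chi2, p, significant in results:
    print(a, 'vs', b, 'chi-square:', chi2, 'p:', p, 'significant:', significant)
```

With the real data, sub2 would take the place of toy.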

