This week’s tool is Pearson Correlation.
I am continuing with the NESARC dataset for this assignment, since I worked with it on the earlier two and have gained some familiarity with it. This is an important point: it is very difficult to come to a brand new (to you) set of data and make any sense of it. Domain knowledge, or at least a workable level of understanding of the data, is essential to working with it. Having spent a few weeks with NESARC now, I am much more comfortable with it.
In the Pearson Correlation we are working with two quantitative variables. The result of the process is an r value, sometimes referred to as the correlation coefficient, which ranges from -1 to +1 and describes the direction and strength of the linear relationship between the two variables. I have chosen this time to compare number of cigarettes smoked with ethanol consumption.
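For reference, the coefficient is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it by hand and checks the result against scipy.stats.pearsonr. The function name pearson_r and the sample values are made up for illustration and are not taken from NESARC.

```python
import numpy as np
import scipy.stats


def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y scaled by their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


# Hypothetical values, for illustration only
cigs = [0, 5, 10, 20, 40]
ethanol = [0.1, 0.4, 0.3, 0.9, 1.2]

r_manual = pearson_r(cigs, ethanol)
r_scipy, p_value = scipy.stats.pearsonr(cigs, ethanol)
print(r_manual, r_scipy, p_value)
```

The two r values should agree to within floating point error; scipy also returns the p value used for the hypothesis test.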
Hypothesis Test Question –
Is the number of cigarettes smoked per day associated with the volume of ethanol consumed per day?
import matplotlib.pyplot as plt
import pandas
import scipy.stats
import seaborn
data = pandas.read_csv('nesarc_sub.csv', low_memory=False)
#setting variables you will be working with to numeric (non-numeric entries become NaN)
data['ETOTLCA2'] = pandas.to_numeric(data['ETOTLCA2'], errors='coerce')
data['S3AQ3C1'] = pandas.to_numeric(data['S3AQ3C1'], errors='coerce')
scat1 = seaborn.regplot(x="S3AQ3C1", y="ETOTLCA2", fit_reg=True, data=data)
plt.xlabel('Cigarettes Smoked Per Day')
plt.ylabel('Volume of Ethanol Consumed Per Day')
plt.title('Scatterplot for the Association Between Cigarettes Smoked and Ethanol Consumed')
plt.show()
#assumed cleaning step: drop rows missing either variable (pearsonr cannot handle NaN), then
#keep only the ones that are within +3 to -3 standard deviations in the column 'ETOTLCA2'.
data_clean = data.dropna(subset=['S3AQ3C1', 'ETOTLCA2'])
eth = data_clean['ETOTLCA2']
data_clean = data_clean[(eth - eth.mean()).abs() <= 3 * eth.std()]
print('association between cig smoked and eth consumed')
print(scipy.stats.pearsonr(data_clean['S3AQ3C1'], data_clean['ETOTLCA2']))
Note that I have removed the outliers (+/- 3 sd) in the ethanol consumed variable.
We see in the graph that there is a slight positive correlation between cigarettes smoked and volume of ethanol consumed. With an r value of 0.063 we know that this correlation is quite weak, and squaring it, 0.063 * 0.063 = 0.003969, tells us that cigarettes smoked per day accounts for well under 1% (about 0.4%) of the variability in ethanol volume consumed, so it would be of little use for prediction. However, the extremely low p value (7.54e-13) shows that the result is statistically significant: the association is weak, but it is unlikely to be due to chance alone.
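As a quick check of the arithmetic, here is a minimal sketch that squares the reported r value to get the fraction of variance in one variable that a linear fit on the other would account for. The variable names are my own.

```python
r = 0.063  # correlation coefficient reported above

# r squared is the fraction of variability explained by the linear relationship
r_squared = r ** 2

print(f"r^2 = {r_squared:.6f}")
print(f"share of variance explained = {r_squared:.2%}")
```

With r this small, the explained share stays well below 1%, which is why a statistically significant result can still be of little practical use.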