Testing Moderation in the Context of Correlation (Coursera – Data Analysis Tools)

The material this week focused again in part on the relationship between smoking and nicotine dependence, and further on whether it is moderated by lifetime depression status. This led me to wonder whether other mental health issues might be moderators. For this example I chose to consider Lifetime Social Phobia.

First we need to establish the relationship between smoking quantity and nicotine dependence.

Code

import pandas
import numpy
import scipy.stats
import seaborn
import matplotlib.pyplot as plt
import statsmodels.stats.proportion as sm

data = pandas.read_csv('nesarc_pds.csv', low_memory=False)

#setting variables you will be working with to numeric
data['TAB12MDX'] = pandas.to_numeric(data['TAB12MDX'], errors='coerce')
data['CHECK321'] = pandas.to_numeric(data['CHECK321'], errors='coerce')
data['S3AQ3B1'] = pandas.to_numeric(data['S3AQ3B1'], errors='coerce')
data['S3AQ3C1'] = pandas.to_numeric(data['S3AQ3C1'], errors='coerce')
data['AGE'] = pandas.to_numeric(data['AGE'], errors='coerce')

#subset data to young adults age 18 to 25 who have smoked in the past 12 months
sub1=data[(data['AGE']>=18) & (data['AGE']<=25) & (data['CHECK321']==1)]

#make a copy of my new subsetted data
sub2 = sub1.copy()

# recode missing values to python missing (NaN)
sub2['S3AQ3B1']=sub2['S3AQ3B1'].replace(9, numpy.nan)
sub2['S3AQ3C1']=sub2['S3AQ3C1'].replace(99, numpy.nan)

#recoding values for S3AQ3B1 (usual smoking frequency) into a new variable, USFREQMO (days smoked per month)
recode1 = {1: 30, 2: 22, 3: 14, 4: 5, 5: 2.5, 6: 1}
sub2['USFREQMO']= sub2['S3AQ3B1'].map(recode1)

#collapse usual daily cigarette counts (S3AQ3C1) into bin midpoints, USQUAN, for daily smokers
def USQUAN(row):
    if row['S3AQ3B1'] != 1:
        return 0
    elif row['S3AQ3C1'] <= 5:
        return 3
    elif row['S3AQ3C1'] <= 10:
        return 8
    elif row['S3AQ3C1'] <= 15:
        return 13
    elif row['S3AQ3C1'] <= 20:
        return 18
    elif row['S3AQ3C1'] > 20:
        return 37

sub2['USQUAN'] = sub2.apply(USQUAN, axis=1)

# contingency table of observed counts
ct1=pandas.crosstab(sub2['TAB12MDX'], sub2['USQUAN'])
print (ct1)

# column percentages
colsum=ct1.sum(axis=0)
colpct=ct1/colsum
print(colpct)

# chi-square
print ('chi-square value, p value, expected counts')
cs1= scipy.stats.chi2_contingency(ct1)
print (cs1)

/Code

Result

[Image: result4-1]

/Result

With a large chi-square value (194.42) and a significant p value (4.22e-40), we see that smoking quantity and nicotine dependence are significantly associated.
Next we include the third variable, Lifetime Social Phobia, and split the data into two sets: 0 = no social phobia, 1 = social phobia.

Code

sub3=sub2[(sub2['SOCPDLIFE']==0)]
sub4=sub2[(sub2['SOCPDLIFE']==1)]

print ('association between smoking quantity and nicotine dependence for those W/O social phobia')
# contingency table of observed counts
ct2=pandas.crosstab(sub3['TAB12MDX'], sub3['USQUAN'])
print (ct2)

# column percentages
colsum=ct2.sum(axis=0)
colpct=ct2/colsum
print(colpct)

# chi-square
print ('chi-square value, p value, expected counts')
cs2= scipy.stats.chi2_contingency(ct2)
print (cs2)

print ('association between smoking quantity and nicotine dependence for those WITH social phobia')
# contingency table of observed counts
ct3=pandas.crosstab(sub4['TAB12MDX'], sub4['USQUAN'])
print (ct3)

# column percentages
colsum=ct3.sum(axis=0)
colpct=ct3/colsum
print(colpct)

# chi-square
print ('chi-square value, p value, expected counts')
cs3= scipy.stats.chi2_contingency(ct3)
print (cs3)

/Code

Result

[Image: result4-2]

/Result

Here we find:

Those without social phobia have a large chi-square value (182.56) and a significant p value (1.51e-37).

Those with social phobia also have a significant result, with a chi-square value of 15.59 and a p value of 0.008.

Because the association between smoking quantity and nicotine dependence is significant in both groups, we would say that social phobia does not moderate that relationship.

[Image: result4-3]
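Since the two per-group blocks above differ only in the subset used, the same check can also be written as a loop over the moderator's levels. A minimal sketch, assuming sub2 and the imports from the code above:

Code

# run the identical chi-square test within each level of the moderator
for level, label in [(0, 'W/O social phobia'), (1, 'WITH social phobia')]:
    grp = sub2[sub2['SOCPDLIFE'] == level]
    ct = pandas.crosstab(grp['TAB12MDX'], grp['USQUAN'])
    chi2, p, dof, expected = scipy.stats.chi2_contingency(ct)
    print(label, '- chi-square:', chi2, 'p:', p)

/Code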


Pearson Correlation (Coursera – Data Analysis Tools)

This week’s tool is Pearson Correlation.

I am continuing with the NESARC dataset for this assignment, since I worked with it on the earlier two and have gained a bit of familiarity with it.  This is an important point: it is very difficult to come to a dataset that is brand new to you and be able to make any sense of it.  Domain knowledge, or at least some workable level of understanding of the data, is essential to being able to work with it.  Having spent a few weeks with NESARC now, I am much more comfortable with it.

In Pearson correlation we are working with two quantitative variables.  The result of the procedure is an r value, sometimes referred to as the correlation coefficient.  This time I have chosen to compare the number of cigarettes smoked per day with ethanol consumption.
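As a quick illustration of what the procedure returns, scipy.stats.pearsonr gives back a pair, the r value and its p value; a toy example with made-up numbers:

Code

import scipy.stats

# pearsonr returns (correlation coefficient r, two-tailed p value)
r, p = scipy.stats.pearsonr([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(r, p)

/Code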

Hypothesis Test Question –
Is the number of cigarettes smoked per day associated with the volume of ethanol consumed per day?

Code:

import pandas
import numpy
import seaborn
import scipy.stats
import matplotlib.pyplot as plt

data = pandas.read_csv('nesarc_sub.csv', low_memory=False)

#setting variables you will be working with to numeric
data['ETOTLCA2'] = pandas.to_numeric(data['ETOTLCA2'], errors='coerce')
data['S3AQ3C1'] = pandas.to_numeric(data['S3AQ3C1'], errors='coerce')

scat1 = seaborn.regplot(x="S3AQ3C1", y="ETOTLCA2", fit_reg=True, data=data)
plt.xlabel('Cigarettes Smoked Per Day')
plt.ylabel('Volume of Ethanol Consumed Per Day')
plt.title('Scatterplot for the Association Between Cigarettes Smoked and Ethanol Consumed')

data_clean=data.dropna()

#keep only the rows within +3 to -3 standard deviations of the mean of ETOTLCA2
data_clean=data_clean[numpy.abs(data_clean.ETOTLCA2-data_clean.ETOTLCA2.mean())<=(3*data_clean.ETOTLCA2.std())]

print ('association between cig smoked and eth consumed')
print (scipy.stats.pearsonr(data_clean['S3AQ3C1'], data_clean['ETOTLCA2']))

/Code

Note that I have removed the outliers (+/- 3 sd) in the ethanol consumed variable.

Graph

[Image: result3-1]

/Graph

Result

[Image: result3-2]

/Result


Discussion

We see in the graph that there is a slight positive correlation between cigarettes smoked and volume of ethanol consumed.  With an r value of 0.063 we know that this correlation is quite weak, and squaring it (0.063 * 0.063 ≈ 0.004) tells us that cigarettes smoked per day accounts for less than 1% of the variability in the volume of ethanol consumed.  However, with the extremely low p value (7.54e-13), we can see that this result is statistically significant.

/Discussion
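The r-squared arithmetic from the discussion, as a quick check:

Code

r = 0.063
print(r ** 2)  # 0.003969, i.e. roughly 0.4% of variability explained

/Code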


Chi Square Test of Independence (Coursera – Data Analysis Tools)

Finally, on to the second assignment in this course.

As in the first assignment, I am using the NESARC dataset. In the chi-square test, rather than an explanatory variable and a quantitative response variable, we have two categorical variables. This time I have chosen to use alcohol dependence and the category of how often a respondent drank alcohol.

Hypothesis Test Question –
Is the number of days per month on which alcohol is consumed associated with alcohol dependence?

When examining this association, the chi-square test reveals that the null hypothesis can be rejected.

Code:

import pandas
import numpy
import scipy.stats
import seaborn
import matplotlib.pyplot as plt

data = pandas.read_csv('nesarc_pds.csv', low_memory=False)

#setting variables you will be working with to numeric
data['ALCABDEP12DX'] = pandas.to_numeric(data['ALCABDEP12DX'], errors='coerce')
data['CONSUMER'] = pandas.to_numeric(data['CONSUMER'], errors='coerce')
data['S2AQ8A'] = pandas.to_numeric(data['S2AQ8A'], errors='coerce')
data['S2AQ8B'] = pandas.to_numeric(data['S2AQ8B'], errors='coerce')
data['AGE'] = pandas.to_numeric(data['AGE'], errors='coerce')

#subset data to young adults age 18 to 25 who have CONSUMED ALCOHOL in the past 12 months
sub1=data[(data['AGE']>=18) & (data['AGE']<=25) & (data['CONSUMER']==1)]

#make a copy of my new subsetted data
sub2 = sub1.copy()

# recode missing values to python missing (NaN)
sub2['S2AQ8A']=sub2['S2AQ8A'].replace(9, numpy.nan)
sub2['S2AQ8B']=sub2['S2AQ8B'].replace(99, numpy.nan)

#recoding values for S2AQ8A (drinking frequency) into a new variable, USFREQMO (days per month)
recode1 = {1: 30, 2: 30, 3: 14, 4: 6, 5: 6, 6: 2.5, 7: 1, 8: 0.5, 9: 0.5, 10: 0.5}
sub2['USFREQMO']= sub2['S2AQ8A'].map(recode1)
#recoding values for ALCABDEP12DX into a new binary variable, ALCDEP
recode2 = {0: 0, 1: 0, 2: 1, 3: 1}
sub2['ALCDEP']= sub2['ALCABDEP12DX'].map(recode2)

# contingency table of observed counts
ct1=pandas.crosstab(sub2['ALCDEP'], sub2['USFREQMO'])
print (ct1)

# column percentages
colsum=ct1.sum(axis=0)
colpct=ct1/colsum
print(colpct)

# chi-square
print ('chi-square value, p value, expected counts')
cs1= scipy.stats.chi2_contingency(ct1)
print (cs1)

/Code

Result:

[Image: result2-1]

/Result

Alcohol Dependence being:

0 No alcohol diagnosis, or alcohol abuse only
1 Alcohol dependence only or alcohol abuse and dependence

Frequency of use being number of days per month.
A chi-square test of independence revealed that among young adult drinkers, the number of days per month on which alcohol is consumed (collapsed into 6 ordered categories) and past-year alcohol dependence (a binary categorical variable) were significantly associated: X2 = 466.55, 5 df, p = 1.32e-98.
Post hoc comparisons of rates of alcohol dependence by pairs of drinking-frequency categories revealed that higher rates of alcohol dependence were seen among those drinking on more days, up to 13 to 14 days per month. In comparison, the prevalence of alcohol dependence was statistically similar among the groups drinking 6 or fewer days per month.
[Image: result2-2]
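The post hoc step itself isn't shown in the code block above; here is a sketch of how those pairwise comparisons can be run, assuming sub2 from the code above and a Bonferroni-adjusted alpha of 0.05/15 (the 6 frequency categories give 15 possible pairs):

Code

import itertools

# pairwise chi-square tests between each pair of drinking-frequency levels
levels = [0.5, 1, 2.5, 6, 14, 30]
for a, b in itertools.combinations(levels, 2):
    pair = sub2[sub2['USFREQMO'].isin([a, b])]
    ct = pandas.crosstab(pair['ALCDEP'], pair['USFREQMO'])
    chi2, p, dof, expected = scipy.stats.chi2_contingency(ct)
    print(a, 'vs', b, ': p =', p, '(significant)' if p < 0.05/15 else '')

/Code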

Radio Externals Assembly

The brackets, front and rear case panels, and the case itself go together pretty easily.  There is a bit of juggling to get the little plastic spacers mounted onto the bottom of the brackets, but it is doable.  I have to say that the front does look really nice.  The main part of the front is a piece of wood, nicely stained and oiled, and the bottom portion is an aluminum panel with the old-style logo screened onto it.

Here is how it looks with the front panel

All of the screws, both for the components and the case, are supposed to have star lock washers.  I had plenty for the components with a few left over, even after I dropped a couple.  It's a different story with the case though.  I'm short 6 and assembled without them for now; maybe I'll be at the hardware store and remember to get a few extras.

The back panel is a mirrored piece of 1/4″ plexi, I guess so that the red LED which is pointed to the back can be reflected back toward the front from inside.  Kind of a neat touch.

A few shots of it completely assembled

I think it does look pretty classy.

Now for the bad news.  I plugged in a speaker and turned it on, and although I'm getting power and the click of power in the speaker, I'm getting no radio.  It did work the other evening, prior to the case assembly.  Not sure what it is yet; I did find one capacitor lead that had broken off and fixed that, but still nothing.  The antenna looks like it is still connected, and I see nothing else obvious.  I'm sure that with the fiddling around getting the case pieces on I must have broken some connection somewhere.  I'm also thinking something could be shorting out; there are some nuts on the bottom of the PC board which are very, very close together.  I watched for that when assembling, but I am going to have to check all of that again too.

I am hoping to get it sorted out soon and am pretty sure it is going to be some broken connection or lead.  I think the mechanical connections are the least successful part of this kit; they are difficult and touchy.  If this is intended for folks who haven't worked with anything like this before, I'm afraid they may be disappointed or frustrated by them.  Had I known this before starting, I might have opted to solder the connections.  As it is, if I can't find the problem I'm thinking of removing the components one by one and soldering them instead (to the bottom of the board, though, which is where the solder pads are exposed).  Ultimately I feel that this would be much more successful as a solder kit, or if it must be no-solder then it needs an easier and more resilient connection method.


More details, more avenues

Having run across an interesting blog post today, I started to get a few more ideas for expanding the little R script I have been building, and that opens up a few avenues I can start to work down in more detail.  To start with, I'm thinking about users who have tweeted.  Using my current searchTwitter('berniesanders') call, I am putting the results into a data frame, which is very nice as it columnizes a number of variables, one of which is the screen name.  With that, it is not difficult to make a frequency table of screen names in the pulled data to see the most frequent tweeters (see the sketch below).  Presently, for development of the program, I am just pulling a small set of tweets, 250, but the plan is to expand this quite a bit so it will reach back some number of days.  I have read somewhere that the Twitter API only exposes about 7 days of tweets through searchTwitter, but I have also found elsewhere a method of collecting tweets from a stream and storing them, which seems like the thing to do ultimately.  With this, and keeping with the US 2016 election theme, I am thinking it will be an interesting project to look for users tweeting in some way about multiple candidates.  That seems like a good direction to go.
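The frequency-table idea, sketched for illustration in Python/pandas since that is what the rest of this blog's code uses (the actual script is R; 'screenName' mirrors the column name in the twitteR data frame, and the tweet data here is made up):

Code

import pandas

# stand-in for the data frame built from the pulled tweets
tweets = pandas.DataFrame({'screenName': ['alice', 'bob', 'alice', 'carol', 'alice']})

# count tweets per screen name to find the most frequent tweeters
print(tweets['screenName'].value_counts().head(10))

/Code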

Refinement

[Image: Rplot1013]

I spent a little time today refining my code to clean up the resulting data.  For one, I found I wasn't actually cleaning anything: I had been doing all of the cleaning before making the corpus of words, so it had no effect.  It still looked cool though.  I've also worked out removing standard and custom stop words, and cleaned out the URLs using a little function with a gsub.  Above is the current result using the #BernieSanders hashtag.  I'm not necessarily all that into one candidate or the other; it's just that this one is in the news lately and seems to have enough interesting activity.  The tweet count pulled was increased to 250.  Right now I can't get stemCompletion to work, so many words here are truncated a little.  It is interesting to study though; it seems like quite a cross section.  Something I find odd is that the word "amp" is a pretty common one and I'm not sure why that is.  It is a point for further research.
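One likely culprit for the stray "amp" is HTML escaping: tweets containing "&" arrive as "&amp;", and once punctuation is stripped, "amp" survives as a word.  Here is a rough sketch of the cleaning pass in Python for illustration (the actual code uses R's tm package and a gsub-based function; the stop-word list here is a made-up example):

Code

import re

custom_stops = {'the', 'and', 'for', 'amp', 'berniesanders'}  # example list

def clean_tweet(text):
    text = re.sub(r'http\S+', '', text)          # strip URLs, like the gsub step
    words = re.findall(r'[a-z]+', text.lower())  # lowercase and tokenize
    return [w for w in words if w not in custom_stops]

print(clean_tweet('Check this &amp; that http://t.co/abc #BernieSanders'))

/Code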

First output

[Image: cloud101015]

It was a few days ago that I started this process.  I had a Twitter account and an app set up from a while ago, so I had all the keys, codes, and such.  One evening earlier this week I was able to connect and pull some data, so I thought the first thing would be to try to make a word cloud.  Through some trial and error I have managed it.  One of the main challenges was that some of the tweets had oddball characters in them, which was causing things to choke and throw an error (error in simple_triplet_matrix…).  As usual the internet knows all the answers, and I found that I could substitute those bad characters with nothing using iconv(tweets, to = "utf-8-mac", sub="").  At first I didn't have the -mac; I didn't know about it.  That was necessary for my program to succeed since I'm running it on a MacBook.
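A Python stand-in for that iconv fix, for comparison (same idea: drop whatever won't convert, as sub="" does in R):

Code

# drop any character that won't map to ASCII, mirroring iconv(..., sub='')
raw = 'oddball \u2028 characters \U0001F600 in tweets'
clean = raw.encode('ascii', errors='ignore').decode('ascii')
print(clean)

/Code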

Also, this cloud is cleaned a bit to remove some punctuation and some stop words as well as lowercasing everything, just basic functions available in the tm package.

Well, it’s a start of some sort.   I have managed to work out some mechanics and now can get down to the business of investigation.