Testing Moderation in the Context of Correlation (Coursera – Data Analysis Tools)

The material this week again focused in part on the relationship between smoking and nicotine dependence, and on whether it is moderated by lifetime depression status. This led me to wonder whether other mental health issues might also be moderators. For this example I chose to consider Lifetime Social Phobia.

Initially we need to establish the relationship between smoking quantity and nicotine dependence.

Code

import pandas
import numpy
import scipy.stats
import seaborn
import matplotlib.pyplot as plt
import statsmodels.stats.proportion as sm

data = pandas.read_csv('nesarc_pds.csv', low_memory=False)

#setting variables you will be working with to numeric
#(convert_objects is deprecated; pandas.to_numeric is the current equivalent)
data['TAB12MDX'] = pandas.to_numeric(data['TAB12MDX'], errors='coerce')
data['CHECK321'] = pandas.to_numeric(data['CHECK321'], errors='coerce')
data['S3AQ3B1'] = pandas.to_numeric(data['S3AQ3B1'], errors='coerce')
data['S3AQ3C1'] = pandas.to_numeric(data['S3AQ3C1'], errors='coerce')
data['AGE'] = pandas.to_numeric(data['AGE'], errors='coerce')

#subset data to young adults age 18 to 25 who have smoked in the past 12 months
sub1=data[(data['AGE']>=18) & (data['AGE']<=25) & (data['CHECK321']==1)]

#make a copy of my new subsetted data
sub2 = sub1.copy()

# recode missing values to python missing (NaN)
sub2['S3AQ3B1']=sub2['S3AQ3B1'].replace(9, numpy.nan)
sub2['S3AQ3C1']=sub2['S3AQ3C1'].replace(99, numpy.nan)
#recoding values for S3AQ3B1 into a new variable, USFREQMO (days smoked per month)
recode1 = {1: 30, 2: 22, 3: 14, 4: 5, 5: 2.5, 6: 1}
sub2['USFREQMO']= sub2['S3AQ3B1'].map(recode1)

#recoding cigarettes per day (S3AQ3C1) into a new variable, USQUAN (category midpoints)
def USQUAN(row):
    if row['S3AQ3B1'] != 1:
        return 0
    elif row['S3AQ3C1'] <= 5:
        return 3
    elif row['S3AQ3C1'] <= 10:
        return 8
    elif row['S3AQ3C1'] <= 15:
        return 13
    elif row['S3AQ3C1'] <= 20:
        return 18
    elif row['S3AQ3C1'] > 20:
        return 37

sub2['USQUAN'] = sub2.apply(lambda row: USQUAN(row), axis=1)

# contingency table of observed counts
ct1=pandas.crosstab(sub2['TAB12MDX'], sub2['USQUAN'])
print (ct1)

# column percentages
colsum=ct1.sum(axis=0)
colpct=ct1/colsum
print(colpct)

# chi-square
print ('chi-square value, p value, expected counts')
cs1= scipy.stats.chi2_contingency(ct1)
print (cs1)

/Code

Result

result4-1

/Result

With a large chi-square value (194.42) and a significant p value (4.22e-40), we see that smoking quantity and nicotine dependence are significantly associated.
Then we include the third variable, Lifetime Social Phobia, and split the data into two sets: 0 = no social phobia, 1 = social phobia.

Code

sub3=sub2[(sub2['SOCPDLIFE']== 0)]
sub4=sub2[(sub2['SOCPDLIFE']== 1)]

print ('association between smoking quantity and nicotine dependence for those W/O social phobia')
# contingency table of observed counts
ct2=pandas.crosstab(sub3['TAB12MDX'], sub3['USQUAN'])
print (ct2)

# column percentages (note: computed from ct2, not ct1)
colsum=ct2.sum(axis=0)
colpct=ct2/colsum
print(colpct)

# chi-square
print ('chi-square value, p value, expected counts')
cs2= scipy.stats.chi2_contingency(ct2)
print (cs2)

print ('association between smoking quantity and nicotine dependence for those WITH social phobia')
# contingency table of observed counts
ct3=pandas.crosstab(sub4['TAB12MDX'], sub4['USQUAN'])
print (ct3)

# column percentages (note: computed from ct3, not ct1)
colsum=ct3.sum(axis=0)
colpct=ct3/colsum
print(colpct)

# chi-square
print ('chi-square value, p value, expected counts')
cs3= scipy.stats.chi2_contingency(ct3)
print (cs3)

/Code

Result

result4-2

/Result

From this we find:

Those without social phobia have a large chi-square value (182.56) and a significant p value (1.51e-37).

Those with social phobia also have a large chi-square value (15.59) and a significant p value (0.008).

So in this situation we would say that social phobia does not moderate the relationship between smoking quantity and nicotine dependence: the association is significant at both levels of the potential moderator.

result4-3
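The moderation logic above generalizes to any third variable: run the identical chi-square test within each level of the potential moderator and compare. A minimal sketch of that pattern, using randomly generated stand-in data rather than the actual NESARC variables:

```python
import numpy
import pandas
import scipy.stats

# Stand-in data (hypothetical, not NESARC): binary outcome, exposure
# category, and a binary potential moderator
rng = numpy.random.default_rng(0)
df = pandas.DataFrame({
    'TAB12MDX': rng.integers(0, 2, 400),            # nicotine dependence (0/1)
    'USQUAN': rng.choice([3, 8, 13, 18, 37], 400),  # cigarettes/day category
    'SOCPDLIFE': rng.integers(0, 2, 400),           # lifetime social phobia (0/1)
})

# Same chi-square test, run separately within each moderator level
for level, grp in df.groupby('SOCPDLIFE'):
    ct = pandas.crosstab(grp['TAB12MDX'], grp['USQUAN'])
    chi2, p, dof, expected = scipy.stats.chi2_contingency(ct)
    print('SOCPDLIFE=%d: chi-square=%.2f, p=%.4f' % (level, chi2, p))
```

If the association were significant at one level of the moderator but not the other (or clearly different in strength), that would be evidence of moderation; this mirrors the sub3/sub4 split used above.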


Pearson Correlation (Coursera – Data Analysis Tools)

This week’s tool is Pearson Correlation.

I am continuing with the NESARC dataset for this assignment, since I worked with it on the earlier two and have gained a bit of familiarity with it.  This is an important point: it is very difficult to come to a brand-new (to you) set of data and be able to make any sense of it.  Domain knowledge, or at least some workable level of understanding of the data, is essential to being able to work with it.  As I have now spent a few weeks with NESARC, I am much more comfortable with it.

In the Pearson Correlation we are working with two quantitative variables.  The result of the process is an r value, sometimes referred to as the correlation coefficient.  This time I have chosen to compare the number of cigarettes smoked with ethanol consumption.

Hypothesis Test Question –
Is the number of cigarettes smoked per day associated with the volume of ethanol consumed per day?

Code:

import pandas
import numpy
import seaborn
import scipy
import matplotlib.pyplot as plt

data = pandas.read_csv('nesarc_sub.csv', low_memory=False)

#setting variables you will be working with to numeric
data['ETOTLCA2'] = pandas.to_numeric(data['ETOTLCA2'], errors='coerce')
data['S3AQ3C1'] = pandas.to_numeric(data['S3AQ3C1'], errors='coerce')

scat1 = seaborn.regplot(x='S3AQ3C1', y='ETOTLCA2', fit_reg=True, data=data)
plt.xlabel('Cigarettes Smoked Per Day')
plt.ylabel('Volume of Ethanol Consumed Per Day')
plt.title(‘Scatterplot for the Association Between Cigarettes Smoked and Ethanol Consumed’)

data_clean=data.dropna()

#keep only the rows within +3 to -3 standard deviations of the mean of ETOTLCA2
#(the filtered result must be assigned back for the filter to take effect)
data_clean = data_clean[numpy.abs(data_clean['ETOTLCA2']-data_clean['ETOTLCA2'].mean())<=(3*data_clean['ETOTLCA2'].std())]

print ('association between cig smoked and eth consumed')
print (scipy.stats.pearsonr(data_clean['S3AQ3C1'], data_clean['ETOTLCA2']))

/Code

Note that I have removed the outliers (+/- 3 sd) in the ethanol consumed variable.
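One pitfall worth flagging: the boolean-mask filter only takes effect if its result is assigned back to a variable. A minimal sketch with made-up numbers:

```python
import numpy
import pandas

# Made-up data: twenty typical values plus one extreme outlier
df = pandas.DataFrame({'ETOTLCA2': [2.0] * 20 + [100.0]})

# Rows within +/- 3 standard deviations of the mean
mask = numpy.abs(df['ETOTLCA2'] - df['ETOTLCA2'].mean()) <= 3 * df['ETOTLCA2'].std()

# Assign the filtered frame back; a bare expression would be discarded
df_clean = df[mask]
print(len(df), len(df_clean))  # 21 20
```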

Graph

result3-1

/Graph

Result

result3-2

/Result

 

Discussion

We see in the graph that there is a slight positive correlation between cigarettes smoked and volume of ethanol consumed.  With an r value of 0.063 we know that this correlation is quite weak, and squaring it, 0.063 * 0.063 = 0.003969, tells us that less than 1% of the variability in ethanol volume consumed can be predicted from cigarettes smoked per day.  However, the extremely low p value (7.54e-13) shows that this result is statistically significant.

/Discussion
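The r-to-r² arithmetic in the discussion can be checked directly; with r = 0.063, the coefficient of determination is well under 1%:

```python
# r is the Pearson correlation coefficient reported above
r = 0.063

# r**2, the coefficient of determination, is the share of variability in one
# variable accounted for by its linear relationship with the other
r_squared = r ** 2
print(round(r_squared, 6))  # 0.003969
```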


Chi Square Test of Independence (Coursera – Data Analysis Tools)

Finally, on to the second assignment in this course.

As in the first assignment, I am using the NESARC dataset. In the chi-square test, rather than having a categorical explanatory variable and a quantitative response variable, we have two categorical variables. So this time I have chosen to use alcohol dependence and the category of how often a respondent drank alcohol.

Hypothesis Test Question –
Is the number of days per month when alcohol is consumed associated with alcohol dependence?

When examining this association, the chi square test of analysis reveals that the null hypothesis can be rejected.

Code:

import pandas
import numpy
import scipy.stats
import seaborn
import matplotlib.pyplot as plt

data = pandas.read_csv('nesarc_pds.csv', low_memory=False)

# setting variables you will be working with to numeric
data['ALCABDEP12DX'] = pandas.to_numeric(data['ALCABDEP12DX'], errors='coerce')
data['CONSUMER'] = pandas.to_numeric(data['CONSUMER'], errors='coerce')
data['S2AQ8A'] = pandas.to_numeric(data['S2AQ8A'], errors='coerce')
data['S2AQ8B'] = pandas.to_numeric(data['S2AQ8B'], errors='coerce')
data['AGE'] = pandas.to_numeric(data['AGE'], errors='coerce')

#subset data to young adults age 18 to 25 who have CONSUMED ALCOHOL in the past 12 months
sub1=data[(data['AGE']>=18) & (data['AGE']<=25) & (data['CONSUMER']==1)]

#make a copy of my new subsetted data
sub2 = sub1.copy()

# recode missing values to python missing (NaN)
sub2['S2AQ8A']=sub2['S2AQ8A'].replace(9, numpy.nan)
sub2['S2AQ8B']=sub2['S2AQ8B'].replace(99, numpy.nan)

#recoding values for S2AQ8A into a new variable, USFREQMO (days per month)
recode1 = {1: 30, 2: 30, 3: 14, 4: 6, 5: 6, 6: 2.5, 7: 1, 8: 0.5, 9: 0.5, 10: 0.5}
sub2['USFREQMO']= sub2['S2AQ8A'].map(recode1)
#recoding values for ALCABDEP12DX into a new binary variable, ALCDEP
recode2 = {0: 0, 1: 0, 2: 1, 3: 1}
sub2['ALCDEP']= sub2['ALCABDEP12DX'].map(recode2)

# contingency table of observed counts
ct1=pandas.crosstab(sub2['ALCDEP'], sub2['USFREQMO'])
print (ct1)

# column percentages
colsum=ct1.sum(axis=0)
colpct=ct1/colsum
print(colpct)

# chi-square
print ('chi-square value, p value, expected counts')
cs1= scipy.stats.chi2_contingency(ct1)
print (cs1)

/Code

Result:

result2-1

/Result

Alcohol Dependence being:

0 No alcohol diagnosis, or alcohol abuse only
1 Alcohol dependence only or alcohol abuse and dependence

Frequency of use being number of days per month.
A chi-square test of independence revealed that among young adult drinkers, the number of days per month when alcohol is consumed (collapsed into 6 ordered categories) and past-year alcohol dependence (binary categorical variable) were significantly associated, χ² = 466.55, 5 df, p = 1.32e-98.
Post hoc comparisons of rates of alcohol dependence between pairs of drinking-frequency categories revealed that higher rates of alcohol dependence were seen among those drinking on more days per month, up through the 14-days-per-month category. In comparison, prevalence of alcohol dependence was statistically similar among the groups drinking 6 or fewer days per month.
result2-2
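The post hoc comparisons described above are not shown in the code; they amount to pairwise chi-square tests with a Bonferroni-adjusted significance level. A sketch of that procedure on randomly generated stand-in data (the category values mirror the USFREQMO recode, but the counts are fabricated):

```python
import itertools
import numpy
import pandas
import scipy.stats

# Stand-in data: binary dependence flag and six drinking-frequency categories
rng = numpy.random.default_rng(1)
df = pandas.DataFrame({
    'ALCDEP': rng.integers(0, 2, 600),
    'USFREQMO': rng.choice([0.5, 1, 2.5, 6, 14, 30], 600),
})

levels = sorted(df['USFREQMO'].unique())
pairs = list(itertools.combinations(levels, 2))  # 15 pairs from 6 categories
alpha = 0.05 / len(pairs)                        # Bonferroni adjustment

for a, b in pairs:
    sub = df[df['USFREQMO'].isin([a, b])]
    ct = pandas.crosstab(sub['ALCDEP'], sub['USFREQMO'])
    chi2, p, dof, expected = scipy.stats.chi2_contingency(ct)
    verdict = 'significant' if p < alpha else 'n.s.'
    print('%g vs %g: chi-square=%.2f, p=%.4f (%s)' % (a, b, chi2, p, verdict))
```

Adjusting alpha this way controls the family-wise error rate across the 15 comparisons, which is why a pair must reach p < 0.0033 (not 0.05) to be called significant.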

Analysis of Variance (ANOVA) (Coursera – Data Analysis Tools)

In the past I have participated in a few online MOOC courses with Udacity and Coursera.   Here I am again engaging in another Coursera course, this time Data Analysis Tools.  This is a topic of interest to me and it brings a small benefit in terms of fulfilling a training requirement at my employment.

This course explores hypothesis testing using a number of different tools, in Python.  I have mostly been working in R for the past year or two, but prior to that I learned Python through the very first iteration of Udacity’s Intro to Computer Science (a very good course, I might add).  The assignments in this Coursera course are to be posted to a blog, and luckily I happen to have this one that I have been occasionally posting to.  So without further digression….

My submission for the assignment “Running an analysis of variance”

To begin, I am using the NESARC dataset for this assignment.

Hypothesis Test Question –
Is the average daily volume of ethanol consumed by those respondents reporting alcohol consumption equal (H0) or not equal (Ha) across reported ethnicities?

When examining the association between ethnicity (categorical) and ethanol consumption (quantitative), an Analysis of Variance (ANOVA) reveals that the null hypothesis can be rejected.

Code:

import numpy
import pandas
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

data = pandas.read_csv('nesarc_pds.csv', low_memory=False)

#setting variables you will be working with to numeric
data['ETHRACE2A'] = pandas.to_numeric(data['ETHRACE2A'], errors='coerce')
data['ETOTLCA2'] = pandas.to_numeric(data['ETOTLCA2'], errors='coerce')

#subset data to THOSE REPORTING ALCOHOL CONSUMPTION
sub1=data[(data['ETOTLCA2']>0)]

# group sizes by ethnicity (grouping by the quantitative ETOTLCA2 itself
# would just count distinct consumption values, so group by the categorical variable)
ct1 = sub1.groupby('ETHRACE2A').size()
print (ct1)

# using ols function for calculating the F-statistic and associated p value
model1 = smf.ols(formula='ETOTLCA2 ~ C(ETHRACE2A)', data=sub1)
results1 = model1.fit()
print (results1.summary())

/Code

Result:

result1

/Result
Ethnicity being:

1 White, Not Hispanic or Latino
2 Black, Not Hispanic or Latino
3 American Indian/Alaska Native, Not Hispanic or Latino
4 Asian/Native Hawaiian/Pacific Islander, Not Hispanic or Latino
5 Hispanic or Latino

Code:

sub2 = sub1[['ETOTLCA2', 'ETHRACE2A']].dropna()

print ('means for ETOTLCA2 by ethnicity')
m1= sub2.groupby('ETHRACE2A').mean()
print (m1)

print ('median for ETOTLCA2 by ethnicity')
md1 = sub2.groupby('ETHRACE2A').median()
print (md1)

print ('standard deviations for ETOTLCA2 by ethnicity')
sd1 = sub2.groupby('ETHRACE2A').std()
print (sd1)

/Code

Result:

Result2

 

/Result

Among the survey sample, those reporting White, non-Hispanic ethnicity (mean = 0.55, s.d. = 1.33) compared to those reporting American Indian/Alaska Native, non-Hispanic ethnicity (mean = 0.86, s.d. = 2.48) provided OLS regression results of F = 3.997, p = 0.00304. This tells us that ethnicity is significantly associated with ethanol consumption.

With this encouraging result, pairwise comparisons were conducted using Tukey’s HSD test.

Code:

sub3 = sub1[['ETOTLCA2', 'ETHRACE2A']].dropna()

model2 = smf.ols(formula='ETOTLCA2 ~ C(ETHRACE2A)', data=sub3).fit()
print (model2.summary())

print ('means for ETOTLCA2 by ethnicity')
m2= sub3.groupby('ETHRACE2A').mean()
print (m2)

print ('standard deviations for ETOTLCA2 by ethnicity')
sd2 = sub3.groupby('ETHRACE2A').std()
print (sd2)

mc1 = multi.MultiComparison(sub3['ETOTLCA2'], sub3['ETHRACE2A'])
res1 = mc1.tukeyhsd()
print(res1.summary())

/Code

Result:

result3

/Result

In this post hoc test we can see that there is a significant difference between groups 1 (White) and 3 (American Indian), as well as between 3 and 4 (Asian). Ethanol consumption is greatest in the American Indian respondents and least in the Asian respondents.

Radio Externals Assembly

The brackets, front and rear case panels, and the case itself go together pretty easily.  There is a bit of juggling to get the little plastic spacers mounted onto the bottom of the brackets, but it is doable.  I have to say that the front does look really nice.  The main part of the front is a piece of wood, nicely stained and oiled, and the bottom portion is an aluminum panel with the old-style logo screened onto it.

Here is how it looks with the front panel

All of the screws, both for the components and the case, are supposed to have star lock washers.  I had plenty for the components with a few left over, even after I dropped a couple.  It’s a different story for me with the case though.  I’m short 6 and assembled without them for now; maybe I’ll be at the hardware store and remember to get a few extras.

The back panel is a mirrored piece of 1/4″ plexi, I guess so that the red LED which is pointed to the back can be reflected back toward the front from inside.  Kind of a neat touch.

Few shots completely assembled

I think it does look pretty classy.

Now for the bad news.  I plugged in a speaker, turned it on, and although I’m getting power and the click of power in the speaker, I’m getting no radio.  It did work the other evening, prior to the case assembly.  Not sure what it is yet; I did find one capacitor lead that had broken off and I fixed that, but still nothing.  The antenna looks like it is still connected, and I see nothing else obvious.  Though I’m sure that with the fiddling around getting the case pieces on, I must have broken some connection somewhere.  I’m also thinking something could be shorting out; there are some nuts on the bottom of the pc board which are very, very close together.  I watched for that when assembling, but I am going to have to check all of that again too.

I am hoping to get it sorted out soon and am pretty sure it is going to be some broken connection or lead.  I think that is the least successful part of this kit: the mechanical connections.  They are difficult and touchy.  If this is intended for folks who haven’t worked with anything like this before, I’m afraid they may be disappointed or frustrated with these.  Had I known this before starting, I may have opted to solder the connections.  As it is, if I can’t find the problem, I’m thinking to remove the components one by one and solder them instead (to the bottom of the board, though, which is where the solder pads are exposed).  Ultimately I feel that this would be much more successful as a solder kit, or, if it must be no-solder, then it needs an easier and more resilient connection method.

 

Radio Internals Assembly

I have spent a couple hours over the past few days putting parts on the board.  Tiny, tiny parts.  Well, the components aren’t all that bad; it’s the little screws, washers, and nuts that are hard.  The first test run worked out super: the LED lit up and I was receiving local stations perfectly.

Here are a few photos of the board with the air cap mounted to it.  This one wasn’t all that difficult.

Next some shots of the fully populated board.

There are a few problems beside the tiny screws and nuts.  Easy enough to overcome but worth mentioning.

  1.  Pg 15 of the manual has a couple of errors: an omission and an incorrect resistor marking.  There is supposed to be a 1k resistor to the right of the 10n cap, but there is no instruction to install it.  Also, the bottom half of the page lists a 100 ohm resistor as Red-Black-Brown, which should say Brown-Black-Brown.  Easy enough to figure out.
  2. Most of the components went on just fine with the screws.  On one occasion I had a lead break off of the transistor while it was being tightened.  Luckily there was enough lead left to redo it and enough of the other two leads to reach.  This is a sort of recurring problem theme, I think, because…
  3. On attaching the wires to the battery holder, the antenna, and the audio jack, I had the wires break off after having screwed them on.  Multiple times.  At the moment they are all intact, but if it happens again I’m just going to solder them.  I think the screws sort of cut the conductors right off or something.
  4. Wrapping the antenna wasn’t too difficult; the leads, though, are a bit touchy and I’m hoping they won’t break off.

All in all, so far it’s still pretty sweet.  Next up is the mechanical assembly, the case and knob and all.

Heathkit Returns

A couple of years ago I found that Heathkit was staging some sort of comeback.  They had put up a web page, basically just a “keep checking here” sort of thing.  Inside the source was a link to a survey where they asked a bunch of questions about what people were interested in, etc.  Info about it can be found elsewhere on the web.  One of the things they did was collect email addresses for a sort of “insiders” mailing list.  Not very much came out of those emails, but after a couple of years the insiders were finally invited to visit the site to check out their very first new kit of the 21st century.  It is just a simple, no-solder AM radio kit, no speaker, etc.  The price was a little hefty at $150, but I bit, and the other day it finally arrived.

Boy is it a beauty.  I am not old enough to have had a Heathkit kit before (though I’m not young either), but I am aware of them and saw a few go together as a kid.

IMG_20151130_162510

So far it seems to be just as would be expected from what we hear and read.  Everything is really nicely done, and the assembly/operation manual is thoroughly complete, with the needed instructions as well as a bunch of theory and such.

So hopefully over the next several days or a couple of weeks I’ll be assembling and testing and making notes here as to my progress.

Here are a few more pics.

IMG_20151130_162531