Open Data Science Conference – San Francisco 2015

A while ago a friend of mine raised the idea of attending ODSCWest.  At first I thought that conferences and such wouldn’t be that interesting or useful for me, but after reviewing the info about the workshops it turned out I might be interested after all.  So here I am, sitting in NCal a day early, getting things sorted out and ready to go.

There will be many somewhat interesting talks, lots of data professionals and cool things to learn.  Mostly I’m interested in the workshops – two days full of hands-on experience with data science tools and programs.  The two that I am most excited about are Intro to Text Analytics and Intro to Text Mining.  Both of these are very much along the lines of what I am working on personally, and I hope to learn some techniques and gain a bit more insight into how the industry is moving.

Of the talks, and besides the keynotes, there are some great topics, including crowdsourcing data collection, data analytics for the common good, and collecting and analyzing open local government data.  I’ll be attending some of these if I can.  It feels to me that local government data analytics would be a great niche to work in.


More details, more avenues

Having run across an interesting blog post somewhere else today, I started to get a few more ideas for expanding the little R script that I have been building, and that opens up a few avenues that I can start to work down in more detail.  To start with, I’m thinking about users who have tweeted.  Using my current searchTwitter call (for berniesanders), I am putting the results into a data frame, and that is real nice as it puts a number of variables into columns, one of which is the screen name.  So with that it is not difficult to make a table of screen names with their frequency of tweets in the pulled data to see the most frequent tweeters.  Presently, for development of the program, I am just pulling a small set of tweets, 250, but the plan is to expand this quite a bit so it will reach back some number of days.  I have read somewhere that the Twitter API only exposes about 7 days of tweets through searchTwitter, but I have also found elsewhere a method of collecting tweets from a stream and storing them, which seems like the thing to do ultimately.  With this, and keeping with the US 2016 Election theme, I am thinking it will be an interesting project to look for users tweeting in some way about multiple candidates.  That seems like a good direction to go.
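As a rough sketch of the tweeter counts and the multiple-candidate idea, here is some base R. The data frame is made up by hand to stand in for the pulled tweets; the column names screenName and text, the sample rows, and the candidate search terms are all assumptions for illustration, not my actual script or data.

```r
# Hand-made stand-in for the data frame of pulled tweets (hypothetical rows).
tweets_df <- data.frame(
  screenName = c("alice", "bob", "alice", "carol", "bob", "alice"),
  text = c(
    "Feeling the Bern #BernieSanders",
    "Clinton and Sanders debate tonight",
    "Sanders rally was packed",
    "Trump on the news again",
    "Watching the Trump speech",
    "More Sanders coverage please"
  ),
  stringsAsFactors = FALSE
)

# Frequency of tweets per user, most frequent first.
tweet_counts <- sort(table(tweets_df$screenName), decreasing = TRUE)
print(head(tweet_counts, 10))

# Hypothetical candidate search terms, matched case-insensitively.
candidates <- c("sanders", "clinton", "trump")

# One logical column per candidate: does this tweet mention the candidate?
mentions <- sapply(candidates, function(term) {
  grepl(term, tweets_df$text, ignore.case = TRUE)
})

# For each user, did any of their tweets mention each candidate?
by_user <- aggregate(mentions,
                     by = list(screenName = tweets_df$screenName),
                     FUN = any)

# Users who mentioned more than one candidate.
n_candidates <- rowSums(by_user[, candidates, drop = FALSE])
multi_users <- as.character(by_user$screenName[n_candidates > 1])
print(multi_users)
```

Simple substring matching like this will miss nicknames and hashtags that do not contain the search term, so the term list would need some tuning against real data.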



I spent a little time today refining my code to clean up the resultant data.  For one, I found I wasn’t actually cleaning anything.  I had been doing all of the cleaning prior to making a corpus of the words, thus no effect.  It still looked cool though.  I’ve also worked out removing standard and custom stop words, and cleaned out the URLs using a little function with a gsub.  Above is the current result using the #BernieSanders hashtag.  I’m not necessarily all that into one candidate or the other; it’s just that this one is in the news lately and seems to have enough interesting activity.  The tweet count pulled was increased to 250.  Right now I can’t get stemCompletion to work, so many words here are truncated a little.  It is interesting to study though.  It seems like quite a cross section.  Something I find odd is that the word “amp” is a pretty common one and I’m not sure why that is.  It is a point for further research.
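For reference, here is a minimal sketch of the URL cleanup in base R. The function names, the URL pattern, and the sample tweets are assumptions for illustration rather than my exact code. It also covers one likely cause of the stray “amp”, which I have not confirmed against this data: tweet text can arrive with the ampersand encoded as the HTML entity &amp;, and stripping punctuation from that leaves the bare word behind.

```r
# Remove URLs from a character vector of tweet text.
# The pattern is an assumption: "http" or "https" followed by non-space characters.
remove_urls <- function(x) {
  gsub("https?://\\S+", "", x, perl = TRUE)
}

# Turn the HTML entity "&amp;" back into "&" before punctuation is stripped,
# so it cannot survive as the bare word "amp".
decode_amp <- function(x) {
  gsub("&amp;", "&", x, fixed = TRUE)
}

# Hypothetical sample tweets.
sample_tweets <- c(
  "Bernie &amp; the crowd in Iowa http://t.co/abc123",
  "Read this: https://t.co/xyz789 #BernieSanders"
)

cleaned <- remove_urls(decode_amp(sample_tweets))
cleaned <- tolower(cleaned)
cleaned <- gsub("[[:punct:]]", " ", cleaned)    # punctuation to spaces
cleaned <- gsub("\\s+", " ", trimws(cleaned))   # collapse extra whitespace
print(cleaned)
```

The order matters here: the URLs and entities have to go before the punctuation does, otherwise the pieces of a URL are left behind as separate words.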

First output


It was a few days ago that I started this process.  I had a Twitter account and an app set up from a while ago, so I had all the keys, codes and such.  One evening earlier this week I was able to connect and pull some data.  So I thought the first thing would be to try to make a word cloud.  Through some trial and error I have managed such.  One of the main challenges was that some of the tweets at the time had oddball characters in them, and that was causing stuff to choke and throw an error (error in simple_triplet_matrix….).  As usual the internet knows all the answers, and I found somewhere that I could substitute those bad characters with nothing using   iconv(tweets, to = "utf-8-mac", sub = "").   At first I didn’t have the -mac, didn’t know about it.   That was necessary for my program to succeed since I’m running it on a MacBook.
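Since the "utf-8-mac" encoding name is specific to the Mac, here is a sketch of a more portable variant of the same idea. This is a swapped-in approach, not the call I actually used: it assumes the input is UTF-8 and throws away everything that is not plain ASCII, which is more aggressive. The sample strings are made up.

```r
# Sample text with an accented letter and an emoji, written with escapes
# so the source file itself stays plain ASCII.
tweets <- c("caf\u00e9 rally tonight", "go vote \U0001F600 now")

# Drop anything that cannot be represented in ASCII.
# Note: how `sub` is applied can vary between platforms' iconv implementations.
clean <- iconv(tweets, from = "UTF-8", to = "ASCII", sub = "")
print(clean)
```

The trade-off is that accented letters are lost along with the emoji, so this only makes sense when the analysis is on English words anyway.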

Also, this cloud is cleaned a bit to remove some punctuation and some stop words as well as lowercasing everything, just basic functions available in the tm package.
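The same basic steps can be sketched in base R without tm, which makes it easier to see what each one does. This is an illustration under assumptions, not my script: the sample tweets and the tiny stop word list are made up, and in tm the equivalent pieces are functions like removePunctuation, removeWords and stopwords.

```r
# Hypothetical sample tweets.
tweets <- c(
  "Bernie Sanders speaks in Iowa!",
  "The crowd for Sanders was huge, really huge.",
  "Is the Iowa poll out yet?"
)

# A tiny, hand-picked stop word list (hypothetical; tm ships a much longer one).
stop_words <- c("the", "in", "for", "was", "is", "out", "yet")

words <- tolower(tweets)                   # lowercase everything
words <- gsub("[[:punct:]]", "", words)    # strip punctuation
words <- unlist(strsplit(words, "\\s+"))   # split into single words
words <- words[nchar(words) > 0]           # drop empty strings
words <- words[!words %in% stop_words]     # drop stop words

# Word counts, most frequent first; these are what a word cloud sizes by.
word_freq <- sort(table(words), decreasing = TRUE)
print(word_freq)
```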

Well, it’s a start of some sort.   I have managed to work out some mechanics and now can get down to the business of investigation.

The start of something small.

I’m keeping this blog as a sort of journal for a project involving data analysis.   My starting point is Twitter data, but I don’t really know where it might end up.

I have worked with databases and reporting for a few decades, and more recently, I guess in conjunction with the rise of data science, I have become more interested in getting deeper.  I have taken a few online classes (Udacity, Coursera) and have messed with R and Python a little, both personally and for work, and so I have come to a point where I feel like embarking on a project.  Now, usually a data project should start with a question.   That’s what all the texts say.  But I don’t have a specific question in mind at the moment.  Oh, sure, there are millions of questions that one could choose from.  I want the data to show me what to ask.   This is an area where I have specific experience and skill: looking at a set of something, studying it, getting familiar with it, and noticing things that are odd or different or interesting.  I think I actually learned this initially from Sesame Street (“one of these things is not like the others…”).  Seems silly, yes?  I think not; I think it is a very good skill to have, at the 4 items level and at the millions of items level.  Being able to notice subtle differences, subtle similarities, stuff that doesn’t seem right or is out of place, the absence of something.   These are the triggers and indicators for action, insight, deeper knowledge.

So, here I go.  I’ll be journaling things that I find, sections of code, learnings, problems, whatever.   Mostly for my use and reference, but maybe there will be a reader who benefits.