Recently I was trying to extract structure from a large corpus of documents. Nearly all the documents were short, and many were just notes of one or two lines. Regular approaches to clustering do not work so well here. Nonetheless, after doing some research I found a suitable method that I was able to apply to the data using the statistical programming language R.
The recent Surviving Death documentary on Netflix is an interesting look at some of the evidence for the survival hypothesis in parapsychology. The first program in the series deals with Near-Death Experiences (NDEs). Accounts of NDEs are of interest to scientists, philosophers and others because of the possible insights they can provide into the nature of the mind-brain relationship. NDEs are also of interest because of their effects. NDEs are often profoundly transformative, having major and long-lasting effects on a person's attitudes and values. There is some research that shows that just learning about NDEs can bring psycho-spiritual benefits.
I have just had a paper published in the Journal of Near-Death Studies in which I used a computational technique known as sentiment analysis to measure the sentiment polarity of the words with which people described their NDEs.
As a fan of lifelong learning, one thing I like about data science and data analytics is that there are always new things to learn and most of them have useful practical applications. The explosion of interest in the field in the last few years means there is a bewildering number of online resources, tutorials and courses for learning data science. For someone taking the first steps on their data science journey it can be difficult to know where to start. The humble book still has its place as a source of knowledge and new ideas though. Here are 5 books that everybody interested in data science should read.
When I was asked to do a workshop for the recent whyR mini conference running alongside Career Zoo at Thomond Park, I had to give some thought as to what to present on. I wanted to demo something that showed the power of R while at the same time being easy to use and something that might be at least somewhat interesting and fun for participants!
A recent longitudinal study by the HEA that tracked the progress of more than 34,000 students enrolled in third level education in Ireland in 2007/2008 found that 76% graduated over the following ten years. Completion rates varied somewhat by type of college, subject and gender. Overall 58% of students graduated on time. Although these figures apparently compare well internationally, one can see that more than 40% of third level students in this cohort didn’t graduate on time and nearly a quarter hadn’t graduated in the following ten-year period.
Gathering Twitter Data
Analysis of Twitter data has become a popular topic in recent years. Data mining of tweets is used by companies for things like brand monitoring and trends research, by academics for research purposes and of course by data science students doing projects for coursework! But what’s the best way of building a corpus of tweets for your data work? Well, it depends. Below are some of the ways of putting together a collection of tweets.
Classification is a common machine learning task. This is where we have a data set of labelled examples with which we build a model that can then be used to (hopefully accurately!) assign a class to new unlabelled examples. There are various points at which we might want to test the performance of the model. Initially we might tune parameters or hyperparameters using cross-validation, then check the best performing models on the test set. If putting the model into production we may also want to test it on live data. We might even use different evaluation measures at different stages of this process. This article discusses some frequently used measures for evaluating the performance of classification models.
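As a rough illustration of what such measures look like in practice, here is a minimal Python sketch for the binary case. The excerpt does not name the measures the article covers, so the choice of accuracy, precision, recall and F1 is an assumption, and the names `BinaryMetrics` and `evaluate_binary` are hypothetical helpers made up for this example.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    accuracy: float
    precision: float
    recall: float
    f1: float


def evaluate_binary(y_true, y_pred, positive=1):
    """Compute common binary classification measures from two label lists.

    Hypothetical helper for illustration; the measures chosen here are an
    assumption, not necessarily the ones the article discusses.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count the four cells of the confusion matrix.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    total = tp + fp + fn + tn
    # Fall back to 0.0 whenever a denominator would be zero.
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return BinaryMetrics(accuracy, precision, recall, f1)


if __name__ == "__main__":
    # Made-up labels, purely for demonstration.
    y_true = [1, 0, 1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
    print(evaluate_binary(y_true, y_pred))
```

Accuracy alone can be misleading when one class is much rarer than the other, which is one reason different measures may suit different stages of testing.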
I had been meaning to read this book for a while. It features on many recommended reading lists for data science and its author, Cathy O’Neil, was a proponent of data science who co-authored “Doing Data Science”, an excellent practical introduction to the subject. So I was interested to read what might be the antidote to some of the current big data hubris. Having started to read it a while back but put it aside, a recent holiday to Poland gave me a chance to revisit it.
There are several sources of error that can affect the accuracy of machine learning models including bias and variance. A fundamental machine learning concept is what’s known as the bias-variance tradeoff. This article discusses what’s meant by bias and variance and how trading them off against one another can affect model accuracy.
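To make the tradeoff concrete, here is a small Python sketch using a toy problem that is my own choice rather than anything taken from the article: estimating a mean `mu` from `n` noisy observations with the estimator `c * sample_mean`. Shrinking with `c < 1` introduces bias but reduces variance, and the mean squared error is the sum of squared bias and variance. The function name `shrunken_mean_error` and all parameter values are assumptions for illustration.

```python
def shrunken_mean_error(mu, sigma, n, c):
    """Analytic error terms for the estimator c * sample_mean.

    Toy setting (an assumption for illustration): n independent
    observations with true mean mu and standard deviation sigma.
    Returns (squared bias, variance, mean squared error).
    """
    bias = (c - 1.0) * mu               # E[c * mean] - mu
    variance = (c ** 2) * (sigma ** 2) / n  # Var(c * mean)
    mse = bias ** 2 + variance          # MSE = bias^2 + variance
    return bias ** 2, variance, mse


if __name__ == "__main__":
    # Made-up values: true mean 1, standard deviation 2, four observations.
    for c in (0.0, 0.5, 1.0):
        bias_sq, var, mse = shrunken_mean_error(mu=1.0, sigma=2.0, n=4, c=c)
        print(f"c={c}: bias^2={bias_sq}, variance={var}, mse={mse}")
```

With these made-up values the unbiased estimator (`c = 1`) has zero bias but a variance of 1, while `c = 0.5` accepts a squared bias of 0.25 in exchange for a variance of 0.25, giving a lower overall error of 0.5.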