Category Archives: Data Mining

Sentiment Analysis of Web-Scraped Near-Death Experience Narratives

The recent Surviving Death documentary on Netflix is an interesting look at some of the evidence for the survival hypothesis in parapsychology. The first program in the series deals with Near-Death Experiences (NDEs). Accounts of NDEs are of interest to scientists, philosophers and others because of the possible insights they can provide into the nature of the mind-brain relationship. NDEs are also of interest because of their effects. They are often profoundly transformative, having major and long-lasting effects on a person's attitudes and values. Some research suggests that just learning about NDEs can bring psycho-spiritual benefits.

I have just had a paper published in the Journal of Near-Death Studies in which I used a computational technique known as sentiment analysis to measure the sentiment polarity of the words with which people described their NDEs.

Continue reading

Learning Analytics – Predicting Academic Performance

A recent longitudinal study by the HEA, which tracked the progress of more than 34,000 students who enrolled in third-level education in Ireland in 2007/2008, found that 76% graduated over the following ten years. Completion rates varied somewhat by type of college, subject and gender. Overall, 58% of students graduated on time. Although these figures apparently compare well internationally, more than 40% of third-level students in this cohort didn’t graduate on time and nearly a quarter hadn’t graduated within the following ten-year period.

Continue reading

#RugbyWorldCup on Twitter – Part 2

This is the second part of my analysis of tweets containing #RugbyWorldCup based on a dataset collected from Twitter using rtweet between 24th September and 12th October. In this part I will be analysing the tweets themselves, doing some data exploration, sentiment analysis and topic modelling.

Continue reading

The #RugbyWorldCup so far on the Twitterverse

We’re about halfway through the 2019 Rugby World Cup: the pool stages finished yesterday, and I thought it might be interesting to look at what the Twitterverse has made of the tournament so far.

Since just after the beginning of the tournament (24th September, to be precise) up until 12th October, I have been connecting to the Twitter API every few days using rtweet.

Continue reading

Gathering Twitter Data

Analysis of Twitter data has become a popular topic in recent years. Data mining of tweets is used by companies for things like brand monitoring and trends research, by academics for research purposes and of course by data science students doing projects for coursework! But what’s the best way of building a corpus of tweets for your data work? Well, it depends. Below are some of the ways of putting together a collection of tweets.

Continue reading

The Bias-Variance Tradeoff

There are several sources of error that can affect the accuracy of machine learning models, including bias and variance. A fundamental machine learning concept is what’s known as the bias-variance tradeoff. This article discusses what’s meant by bias and variance and how trading them off against one another can affect model accuracy.
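
As a rough illustration of the tradeoff, the standard decomposition for squared-error loss is sketched below. The notation is an assumption made for this sketch and is not taken from the article: the data are assumed to follow y = f(x) + ε with zero-mean noise of variance σ², and f-hat is the fitted model, with expectations taken over training sets and noise.

```latex
\mathbb{E}\!\left[\big(y - \hat{f}(x)\big)^2\right]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\!\left[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Under these assumptions, making a model more flexible tends to shrink the bias term while inflating the variance term, which is why the two are traded off against one another.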

Continue reading

A Quick Introduction to Clustering

Cluster analysis, more usually referred to as clustering, is a common data mining task. In clustering, the goal is to divide the data set into groups so that objects in the same group are similar to one another while objects in different groups are dissimilar. In other words, the goal is to minimize the intra-cluster distance while maximizing the inter-cluster distance.
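
The intra-cluster versus inter-cluster idea can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `mean_intra_inter`, the use of Euclidean distance and the four toy points are all assumptions made here, not something taken from the article.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available from Python 3.8


def mean_intra_inter(points, labels):
    """Return (mean intra-cluster distance, mean inter-cluster distance).

    Hypothetical helper for illustration. It assumes at least one pair of
    points shares a label and at least one pair does not.
    """
    intra, inter = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        # Pairs with the same label count as intra-cluster, others as inter-cluster.
        (intra if lp == lq else inter).append(dist(p, q))
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Toy data (made up): two tight pairs of points that lie far apart.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
good = [0, 0, 1, 1]  # groups nearby points together
bad = [0, 1, 0, 1]   # splits each nearby pair across clusters

good_intra, good_inter = mean_intra_inter(points, good)
bad_intra, bad_inter = mean_intra_inter(points, bad)
print(good_intra, good_inter)
print(bad_intra, bad_inter)
```

On this toy data the sensible grouping gives a smaller mean intra-cluster distance and a larger mean inter-cluster distance than the poor one, which is the pattern a clustering algorithm tries to achieve.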

Continue reading

Why “computer says no” may no longer be an option.

The General Data Protection Regulation (GDPR) is a new data protection regulation that will be effective across the EU from 25th May 2018. The GDPR applies to all companies that process the personal data of individuals in the EU, regardless of where the companies are based. It replaces Directive 95/46/EC, normally referred to as the Data Protection Directive, which dates back to the 1990s.

Continue reading

Proximity Measures for Data Mining

Data mining is the process of finding interesting patterns in large quantities of data. This process of knowledge discovery involves various steps, the most obvious being the application of algorithms to the data set to discover patterns, as in clustering, for example. The goal of clustering is to find natural groups in the data so that objects in a group are similar to other objects in the same group and dissimilar to objects in other groups. When implementing clustering (and also some classification algorithms, such as k-nearest neighbours), it is important to be able to quantify the proximity of objects to one another.

Continue reading
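
To make "quantifying proximity" concrete, here is a minimal Python sketch of three commonly used measures. The choice of Euclidean distance, Manhattan distance and cosine similarity is an assumption for illustration, and the full article may cover different or additional measures. The function names are hypothetical.

```python
from math import sqrt


def euclidean(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; undefined for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


a, b = (0.0, 0.0), (3.0, 4.0)
print(euclidean(a, b))  # 5.0
print(manhattan(a, b))  # 7.0
print(cosine_similarity((1.0, 0.0), (0.0, 1.0)))  # 0.0 (orthogonal vectors)
```

Note that the first two are distances (smaller means closer) while cosine similarity is a similarity (larger means closer), so they cannot be swapped into an algorithm without adjusting for that.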