A recent longitudinal study by the HEA, which tracked the progress of more than 34,000 students who enrolled in third-level education in Ireland in 2007/2008, found that 76% graduated over the following ten years. Completion rates varied somewhat by type of college, subject and gender. Overall, 58% of students graduated on time. Although these figures apparently compare well internationally, more than 40% of third-level students in this cohort didn’t graduate on time, and nearly a quarter hadn’t graduated within the following ten-year period.
This is the second part of my analysis of tweets containing #RugbyWorldCup, based on a dataset collected from Twitter using rtweet between 24th September and 12th October. In this part I will analyse the tweets themselves, with some data exploration, sentiment analysis and topic modelling.
We’re about halfway through the 2019 Rugby World Cup. The pool stages finished yesterday, and I thought it might be interesting to look at what the Twitterverse has made of the tournament so far.
Since just after the beginning of the tournament (24th September, to be precise) up until 12th October, I have been connecting to the Twitter API every few days using rtweet.
Gathering Twitter Data
Analysis of Twitter data has become a popular topic in recent years. Data mining of tweets is used by companies for things like brand monitoring and trends research, by academics for research purposes and, of course, by data science students doing projects for coursework! But what’s the best way of building a corpus of tweets for your data work? Well, it depends. Below are some of the ways of putting together a collection of tweets.
Classification is a common machine learning task. Here we have a data set of labelled examples with which we build a model that can then be used to (hopefully accurately!) assign a class to new unlabelled examples. There are various points at which we might want to test the performance of the model. Initially we might tune parameters or hyperparameters using cross-validation, then check the best-performing models on the test set. If putting the model into production, we may also want to test it on live data. We might even use different evaluation measures at different stages of this process. This article discusses some frequently used measures for evaluating the performance of classification models.
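As a rough illustration of what such measures look like in practice, here is a minimal Python sketch that computes accuracy, precision, recall and F1 for a binary classifier. The choice of these four measures, the function names and the toy labels are all my own assumptions for the example, not taken from any particular library or data set.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for one positive class."""
    counts = Counter()
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["tp" if actual == positive else "fp"] += 1
        else:
            counts["fn" if actual == positive else "tn"] += 1
    return counts


def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 as a dict of floats."""
    c = confusion_counts(y_true, y_pred, positive)
    tp, fp, fn, tn = c["tp"], c["fp"], c["fn"], c["tn"]
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels, purely for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

On balanced toy data like this the four numbers happen to agree; on imbalanced data accuracy can look healthy while precision or recall is poor, which is one reason to look at more than one measure.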
I had been meaning to read this book for a while. It features on many recommended reading lists for data science and its author, Cathy O’Neil, was a proponent of data science who co-authored “Doing Data Science”, an excellent practical introduction to the subject. So I was interested to read what might be the antidote to some of the current big data hubris. Having started to read it a while back but put it aside, a recent holiday to Poland gave me a chance to revisit it.
There are several sources of error that can affect the accuracy of machine learning models, including bias and variance. A fundamental machine learning concept is what’s known as the bias-variance tradeoff. This article discusses what’s meant by bias and variance, and how trading them off against one another can affect model accuracy.
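For readers who like to see the tradeoff written down, the usual textbook decomposition of expected squared error may help. This is a sketch under stated assumptions rather than anything specific to the article: I assume squared-error loss and data generated as y = f(x) + ε, where the noise ε has mean zero and variance σ², and the expectation is taken over training sets and noise. The symbols are my own choice of notation.

```latex
\mathbb{E}\!\left[\bigl(y - \hat{f}(x)\bigr)^2\right]
  = \underbrace{\bigl(\mathbb{E}[\hat{f}(x)] - f(x)\bigr)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\!\left[\bigl(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\bigr)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Under these assumptions, a more flexible model tends to shrink the first term while inflating the second, and no choice of model can reduce the third.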
The last article provided a brief introduction to clustering. This one demonstrates how to conduct a basic clustering analysis in the statistical computing environment R (I have actually split it into two parts as it got rather long!). For demos like this it is easiest to use a small data set, ideally with few features relative to instances. The one used in this example is the Acidosis Patients data set, available from this collection of clustering data sets. This data set has 40 instances, each corresponding to a patient, and 6 features, each corresponding to a measurement of blood or cerebrospinal fluid.
Cluster analysis, more usually referred to as clustering, is a common data mining task. In clustering the goal is to divide the data set into groups so that objects in the same group are similar to one another while objects in different groups are dissimilar. In other words, the goal is to minimize the intra-cluster distance while maximizing the inter-cluster distance.
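To make the intra-cluster versus inter-cluster idea concrete, here is a small Python sketch that compares the two quantities for a sensible and a poor grouping of the same points. The points, the labels, the function name and the use of mean pairwise Euclidean distance are all assumptions made for illustration; real clustering work would normally use an established quality measure from a statistics package.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available from Python 3.8


def mean_pairwise_distances(points, labels):
    """Return (mean intra-cluster distance, mean inter-cluster distance)."""
    intra, inter = [], []
    for (p, label_p), (q, label_q) in combinations(zip(points, labels), 2):
        # Pairs with the same label are within a cluster; others are between clusters.
        (intra if label_p == label_q else inter).append(dist(p, q))

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return mean(intra), mean(inter)


# Two visibly separate groups of hypothetical 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

good_labels = [0, 0, 0, 1, 1, 1]  # matches the visible grouping
poor_labels = [0, 1, 0, 1, 0, 1]  # mixes the two groups together

good_intra, good_inter = mean_pairwise_distances(points, good_labels)
poor_intra, poor_inter = mean_pairwise_distances(points, poor_labels)

print(f"good grouping: intra={good_intra:.2f}, inter={good_inter:.2f}")
print(f"poor grouping: intra={poor_intra:.2f}, inter={poor_inter:.2f}")
```

The sensible grouping gives a small intra-cluster distance and a large inter-cluster distance, while the mixed-up grouping pushes the intra-cluster distance up, which is exactly what a clustering algorithm is trying to avoid.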