Recently I was trying to extract structure from a large corpus of documents. Nearly all the documents were short; many were just notes of one or two lines. Regular approaches to clustering do not work so well here. Nonetheless, after doing some research, I found a suitable method that I was able to apply to the data using the statistical programming language R.
When I was asked to do a workshop for the recent whyR mini conference running alongside Career Zoo at Thomond Park, I had to give some thought as to what to present on. I wanted to demo something that showed the power of R while at the same time being easy to use and something that might be at least somewhat interesting and fun for participants!
Classification is a common machine learning task. This is where we have a data set of labelled examples with which we build a model that can then be used to (hopefully accurately!) assign a class to new unlabelled examples. There are various points at which we might want to test the performance of the model. Initially we might tune parameters or hyperparameters using cross validation, then check the best performing models on the test set. If putting the model into production, we may also want to test it on live data. We might even use different evaluation measures at different stages of this process. This article discusses some frequently used measures for evaluating the performance of classification models.
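As a rough illustration of what such evaluation measures look like, here is a minimal Python sketch. The excerpt does not say which measures the article covers, so accuracy, precision, recall and F1 are assumed here as common choices for a binary classifier, and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def evaluate(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 as a dict.

    Ratios with a zero denominator are reported as 0.0 (an assumption;
    other conventions exist).
    """
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Calling `evaluate(labels, predictions)` on a held-out test set gives a quick side-by-side view of the measures, which is useful because accuracy alone can mislead when the classes are imbalanced.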
I had been meaning to read this book for a while. It features on many recommended reading lists for data science and its author, Cathy O’Neil, was a proponent of data science who co-authored “Doing Data Science”, an excellent practical introduction to the subject. So I was interested to read what might be the antidote to some of the current big data hubris. Having started to read it a while back but put it aside, a recent holiday to Poland gave me a chance to revisit it.
The last article provided a brief introduction to clustering. This one demonstrates how to conduct a basic clustering analysis in the statistical computing environment R (I have actually split it into 2 parts as it got rather long!). For demos like this it is easiest to use a small data set, ideally with few features relative to instances. The one used in this example is the Acidosis Patients data set available from this collection of clustering data sets. This data set has 40 instances, each corresponding to a patient, and 6 features, each corresponding to a measurement of blood or cerebrospinal fluid.
The General Data Protection Regulation (GDPR) is a new data protection regulation that will be effective across the EU from 25th May 2018. The GDPR applies to all companies that process data of EU citizens regardless of where the companies are based. It replaces Directive 95/46/EC, normally referred to as the Data Protection Directive, which dates back to the 1990s.
Data mining is the process of finding interesting patterns in large quantities of data. This process of knowledge discovery involves various steps, the most obvious of these being the application of algorithms to the data set to discover patterns as in, for example, clustering. The goal of clustering is to find natural groups in the data so that objects in a group are similar to other objects in the same group and dissimilar to objects in other groups. When implementing clustering (and also some classification algorithms such as k nearest neighbours), it is important to be able to quantify the proximity of objects to one another.
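To make "quantifying proximity" concrete, here is a small Python sketch of two widely used distance measures for numerical features. The excerpt does not name particular measures, so Euclidean and Manhattan distance are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (city-block distance)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))
```

Both functions return 0 for identical objects and grow as objects become less alike. Because features measured on large scales dominate the sum, features are often rescaled before distances are computed.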
Part 2 of this intro to k-NN demonstrates an implementation of the algorithm in R. Part 1 discussed the algorithm itself. I have chosen a data set from the UCI Machine Learning Repository to work with. I am using the Banknote Authentication data set. This data set consists of measurements of 400 x 400 pixel pictures of forged and genuine bank notes. Pictures were grey scale with a resolution of about 660 dpi. A wavelet transform tool was used to extract features from the pictures. The features used are variance, skewness and kurtosis of the wavelet transformed image and the entropy of the image. The class label is whether the bank note is genuine or not (0=no, 1=yes). k-NN should be a reasonable choice of algorithm for this data set as the features are numerical and there are not too many of them in relation to the number of instances, though obviously other factors (e.g. the amount of noise in the data set) are also important.
One of the oldest and most popular classification algorithms is the nearest neighbors algorithm. It’s also one of the easiest algorithms to understand, so it is a good place to start when learning about data mining algorithms. Part one of this article provides a brief introduction to, and overview of, k-NN. Part two will demonstrate an implementation of it in R.
Essentially, the nearest neighbors algorithm is based on the premise that the more features objects have in common, the more likely they are to belong to the same class. Nearest neighbors is a non-parametric method, so it is not reliant on assumptions about the underlying distribution of the data set. It is called a lazy learning method because, unlike most classification algorithms, it does not attempt to model the data set. Instead, test cases are compared to other cases in the data set to determine their class.
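The idea of comparing a test case to stored cases can be sketched in a few lines of Python. This is a minimal illustration, not the R implementation from part two: it assumes numerical features, Euclidean distance and a simple majority vote among the k closest cases, and the function name `knn_predict` is hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict a class for `query` by majority vote of its k nearest cases.

    No model is built in advance: every prediction scans the stored
    training cases, which is what makes the method "lazy".
    """
    if k < 1 or k > len(train_X):
        raise ValueError("k must be between 1 and the number of cases")
    # Pair each training case's distance to the query with its label.
    neighbours = [(math.dist(x, query), label)
                  for x, label in zip(train_X, train_y)]
    # Keep the k closest cases (sort on distance only).
    neighbours.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in neighbours[:k])
    # Ties are resolved by Counter ordering, an arbitrary choice here.
    return votes.most_common(1)[0][0]
```

An odd value of k is a common choice for two-class problems because it avoids tied votes.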