Classification is a common machine learning task. This is where we have a data set of labelled examples with which we build a model that can then be used to (hopefully accurately!) assign a class to new unlabelled examples. There are various points at which we might want to test the performance of the model. Initially we might tune parameters or hyperparameters using cross validation, then check the best performing models on the test set. If putting the model into production we may also want to test it on live data, we might even use different evaluation measures at different stages of this process. This article discusses some frequently used measures for evaluating the performance of classification models.

## Weapons of Math Destruction – A Review

I had been meaning to read this book for a while. It features on many recommended reading lists for data science and its author, Cathy O’Neil, was a proponent of data science who co-authored “Doing Data Science”, an excellent practical introduction to the subject. So I was interested to read what might be the antidote to some of the current big data hubris. Having started to read it a while back but put it aside, a recent holiday to Poland gave me a chance to revisit it.

## The Bias-Variance Tradeoff

There are several sources of error that can affect the accuracy of machine learning models including bias and variance. A fundamental machine learning concept is what’s known as the bias-variance tradeoff. This article discusses what’s meant by bias and variance and how trading them off against one another can affect model accuracy.

## Clustering Analysis in R – part 2

This is part 2 of a clustering demo in R. You can read Part 1 here which deals with assessing clustering tendency of the data and deciding on cluster number. This part looks at performing clustering using Partitioning Around Medoids algorithm and validating the results. Continue reading

## Clustering Analysis in R – Part 1

The last article provided a brief introduction to clustering. This one demonstrates how to conduct a basic clustering analysis in the statistical computing environment R (I have actually split it into 2 parts as it got rather long!). For demos like this it is easiest to use a small data set, ideally with few features relative to instances. The one used in this example is the Acidosis Patients data set available from this collection of clustering data sets. This data set has 40 instances, each corresponding to a patient and 6 features each corresponding to a measurement of blood or cerebrospinal fluid. Continue reading

## A Quick Introduction to Clustering

Cluster analysis more usually referred to as clustering, is a common data mining task. In clustering the goal is to divide the data set into groups so that objects in the same group are similar to one another while objects in different groups are different to one another. In other words the goal is to minimize the intra-cluster distance while maximizing the inter-cluster distance.

## Why “computer says no” may no longer be an option.

The General Data Protection Regulation (GDPR) is a new data protection regulation that will be effective across the EU from 25^{th} May 2018. The GDPR applies to all companies that process data of EU citizens regardless of where the companies are based. It replaces the Directive 95/46/EC normally referred to as the Data Protection Directive which dates back to the 1990’s.

## How to build an effective data science capability?

Data Science, Data Mining, Machine Learning, Artificial Intelligence, Big Data … the list goes on. All terms that eager cheerleaders of the data revolution are highlighting that organisations need to embrace. With all the hype and attention it’s not surprising that businesses feel they need to be become more data-driven or risk losing competitive advantage. And definitely there is substance to the hype otherwise companies like IBM wouldn’t be pouring literally billions of dollars in investment into their big data capabilities.

Continue reading

## Statistical Paradigms – Bayesian and Frequentist

(Note: This article discusses Bayesian and Frequentist statistics and follows from this previous one). Parapsychology has played an important role in ensuring that psychology retains at least some focus on anomalous human experiences. These experiences are very common and if psychology is truly to be the science of behavior and mental processes then it needs to take account of them. In addition to posing legitimate questions to materialist reductionist orthodoxy, parapsychology has also made contributions to scientific methodology in areas like study design, statistical inference and meta-analysis.

## What do parapsychology and data science have in common?

I have been interested in parapsychology ever since picking up a copy of the excellent Eysenck and Sargent book – Explaining the Unexplained many years ago. It’s a bit dated now but still a great introduction for anyone interested in learning more about the topic.

Just like with data science there is sometimes confusion about what parapsychology is. Perhaps it’s easier to start with what it’s not. It’s not astrology, ghost busting, monster hunting, fortune telling or investigating UFO sightings though these are things that often come to mind when one thinks of the paranormal mainly because of the influence of television shows on the ‘paranormal’.