Cluster analysis, more usually referred to as clustering, is a common data mining task. In clustering, the goal is to divide the data set into groups so that objects in the same group are similar to one another while objects in different groups are dissimilar. In other words, the goal is to minimize the intra-cluster distance while maximizing the inter-cluster distance.
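To make the intra-cluster versus inter-cluster idea concrete, here is a minimal Python sketch. The two toy clusters and the helper names `mean_intra_cluster_distance` and `mean_inter_cluster_distance` are invented for illustration, and averaging pairwise Euclidean distances is only one of several ways these quantities could be measured.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def mean_intra_cluster_distance(clusters):
    """Average distance between pairs of points that share a cluster."""
    pairs = [dist(a, b) for points in clusters for a, b in combinations(points, 2)]
    return sum(pairs) / len(pairs)


def mean_inter_cluster_distance(clusters):
    """Average distance between pairs of points from different clusters."""
    pairs = [
        dist(a, b)
        for c1, c2 in combinations(clusters, 2)
        for a in c1
        for b in c2
    ]
    return sum(pairs) / len(pairs)


# Toy data (made up): two well-separated groups of 2-D points.
clusters = [
    [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    [(5.0, 5.0), (5.0, 6.0), (6.0, 5.0)],
]

intra = mean_intra_cluster_distance(clusters)
inter = mean_inter_cluster_distance(clusters)
print(f"intra-cluster: {intra:.2f}, inter-cluster: {inter:.2f}")
```

For a good grouping, the first number should be small relative to the second.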
The General Data Protection Regulation (GDPR) is a new data protection regulation that will be effective across the EU from 25th May 2018. The GDPR applies to all companies that process the personal data of individuals in the EU, regardless of where the companies are based. It replaces Directive 95/46/EC, normally referred to as the Data Protection Directive, which dates back to the 1990s.
Data Science, Data Mining, Machine Learning, Artificial Intelligence, Big Data … the list goes on. These are all terms that eager cheerleaders of the data revolution insist organisations need to embrace. With all the hype and attention, it’s not surprising that businesses feel they need to become more data-driven or risk losing competitive advantage. And there is definitely substance to the hype; otherwise companies like IBM wouldn’t be pouring literally billions of dollars into their big data capabilities.
(Note: This article discusses Bayesian and Frequentist statistics and follows from this previous one). Parapsychology has played an important role in ensuring that psychology retains at least some focus on anomalous human experiences. These experiences are very common and if psychology is truly to be the science of behavior and mental processes then it needs to take account of them. In addition to posing legitimate questions to materialist reductionist orthodoxy, parapsychology has also made contributions to scientific methodology in areas like study design, statistical inference and meta-analysis.
I have been interested in parapsychology ever since picking up a copy of the excellent Eysenck and Sargent book, Explaining the Unexplained, many years ago. It’s a bit dated now but still a great introduction for anyone interested in learning more about the topic.
As with data science, there is sometimes confusion about what parapsychology is. Perhaps it’s easier to start with what it’s not. It’s not astrology, ghost busting, monster hunting, fortune telling or investigating UFO sightings, though these are the things that often come to mind when one thinks of the paranormal, mainly because of the influence of television shows on the ‘paranormal’.
There is huge hype at the moment about Data Science and it seems like everybody is trying to get in on the game. While hype might help bring the topic to popular attention, it can also serve to obscure and confuse. What do people mean when they talk about data science? Is it all just hype?
Data mining is the process of finding interesting patterns in large quantities of data. This process of knowledge discovery involves various steps, the most obvious being the application of algorithms to the data set to discover patterns, as in clustering, for example. The goal of clustering is to find natural groups in the data so that objects in a group are similar to other objects in the same group and dissimilar to objects in other groups. When implementing clustering (and also some classification algorithms such as k-nearest neighbours), it is important to be able to quantify the proximity of objects to one another.
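As a rough illustration of what quantifying proximity can look like, the Python sketch below computes two widely used measures, Euclidean and Manhattan distance, for a pair of made-up points. The function names and the sample points are illustrative assumptions; which measure is appropriate depends on the data.

```python
from math import sqrt


def euclidean(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Two made-up objects described by two numeric features each.
p, q = (1.0, 2.0), (4.0, 6.0)

print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
```

The smaller the distance, the more similar the two objects are taken to be.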
It’s well known that data preparation is often the most time-consuming part of any data science project. Before you can start doing much with the data, though, it first needs to be imported into your analysis software, something which can sometimes prove trickier than anticipated.
Sometimes people assume that data science and related areas are all about consumer facing businesses trying to learn ways to sell more product. I’m not sure whether this viewpoint is more naive or more cynical. Data science is about translating raw data into knowledge and insight to enable better decision making. Therefore it has wide and varied application. Data science and related fields like data mining, machine learning and big data have huge potential to drive innovation in social enterprise and business for social good.
Education too is being impacted by the data revolution. This isn’t something that may happen in the future. It has already begun. Schools in America are already using systems that combine data points like attendance and grades to predict, years in advance, which students are at risk of dropping out.
Part 2 of this intro to k-NN demonstrates an implementation of the algorithm in R. Part 1 discussed the algorithm itself. I have chosen a data set from the UCI Machine Learning Repository to work with: the Banknote Authentication data set. This data set consists of measurements taken from 400 x 400 pixel pictures of forged and genuine bank notes. The pictures were grey scale with a resolution of about 660 dpi. A wavelet transform tool was used to extract features from the pictures. The features used are the variance, skewness and kurtosis of the wavelet transformed image and the entropy of the image. The class label is whether the bank note is genuine or not (0=no, 1=yes). k-NN should be a reasonable choice of algorithm for this data set, as the features are numerical and there are not too many of them in relation to the number of instances, though obviously other factors (e.g. the amount of noise in the data set) are also important.
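The core k-NN idea can be sketched in a few lines. The sketch below is in Python rather than R and is not the implementation from the post. The function name `knn_predict`, the choice of k=3 and the tiny training set are all assumptions made for illustration. The rows are invented values with four numeric features and a 0/1 label, not actual rows from the Banknote Authentication data set.

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+


def knn_predict(train, query, k=3):
    """Return the majority label among the k training rows nearest to query.

    train is a list of (features, label) pairs; features are numeric tuples.
    """
    neighbours = sorted(train, key=lambda row: dist(row[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Invented toy rows (NOT real banknote measurements): four features, label 0 or 1.
train = [
    ((0.1, 0.2, 0.0, 0.1), 0),
    ((0.3, 0.1, 0.2, 0.0), 0),
    ((0.2, 0.4, 0.1, 0.3), 0),
    ((5.1, 4.9, 5.2, 5.0), 1),
    ((4.8, 5.3, 5.1, 4.9), 1),
    ((5.0, 5.0, 4.7, 5.2), 1),
]

print(knn_predict(train, (0.2, 0.2, 0.1, 0.1)))  # 0
print(knn_predict(train, (5.0, 5.0, 5.0, 5.0)))  # 1
```

In practice the features would usually be scaled first so that no single measurement dominates the distance, and k would be chosen by validation rather than fixed in advance.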