## Gathering Twitter Data

Analysis of Twitter data has become a popular topic in recent years. Data mining of tweets is used by companies for things like brand monitoring and trends research, by academics for research purposes and of course by data science students doing projects for coursework! But what’s the best way of building a corpus of tweets for your data work? Well, it depends. Below are some of the ways of putting together a collection of tweets.

## Common Evaluation Measures for Classification Models

Classification is a common machine learning task. This is where we have a data set of labelled examples with which we build a model that can then be used to (hopefully accurately!) assign a class to new unlabelled examples. There are various points at which we might want to test the performance of the model. Initially we might tune parameters or hyperparameters using cross validation, then check the best performing models on the test set. If putting the model into production we may also want to test it on live data, we might even use different evaluation measures at different stages of this process. This article discusses some frequently used measures for evaluating the performance of classification models.

## Weapons of Math Destruction – A Review

I had been meaning to read this book for a while. It features on many recommended reading lists for data science and its author, Cathy O’Neil, was a proponent of data science who co-authored “Doing Data Science”, an excellent practical introduction to the subject. So I was interested to read what might be the antidote to some of the current big data hubris. Having started to read it a while back but put it aside, a recent holiday to Poland gave me a chance to revisit it.

## The Bias-Variance Tradeoff

There are several sources of error that can affect the accuracy of machine learning models including bias and variance. A fundamental machine learning concept is what’s known as the bias-variance tradeoff. This article discusses what’s meant by bias and variance and how trading them off against one another can affect model accuracy.

## Clustering Analysis in R – Part 1

The last article provided a brief introduction to clustering. This one demonstrates how to conduct a basic clustering analysis in the statistical computing environment R (I have actually split it into 2 parts as it got rather long!). For demos like this it is easiest to use a small data set, ideally with few features relative to instances. The one used in this example is the Acidosis Patients data set available from this collection of clustering data sets. This data set has 40 instances, each corresponding to a patient and 6 features each corresponding to a measurement of blood or cerebrospinal fluid. Continue reading

## A Quick Introduction to Clustering

Cluster analysis more usually referred to as clustering, is a common data mining task. In clustering the goal is to divide the data set into groups so that objects in the same group are similar to one another while objects in different groups are different to one another. In other words the goal is to minimize the intra-cluster distance while maximizing the inter-cluster distance.

## Why “computer says no” may no longer be an option.

The General Data Protection Regulation (GDPR)  is a new data protection regulation that will be effective across the EU from 25th May 2018. The GDPR applies to all companies that process data of EU citizens regardless of where the companies are based. It replaces the Directive 95/46/EC normally referred to as the Data Protection Directive which dates back to the 1990’s.

## How to build an effective data science capability?

Data Science, Data Mining, Machine Learning, Artificial Intelligence, Big Data … the list goes on. All terms that eager cheerleaders of the data revolution are highlighting that organisations need to embrace. With all the hype and attention it’s not surprising that businesses feel they need to be become more data-driven or risk losing competitive advantage. And definitely there is substance to the hype otherwise companies like IBM wouldn’t be pouring literally billions of dollars in investment into their big data capabilities.