A recent longitudinal study by the HEA tracked the progress of more than 34,000 students who enrolled in third-level education in Ireland in 2007/2008 and found that 76% graduated over the following ten years. Completion rates varied somewhat by type of college, subject and gender. Overall, 58% of students graduated on time. Although these figures apparently compare well internationally, more than 40% of third-level students in this cohort didn’t graduate on time, and nearly a quarter hadn’t graduated within the following ten years.
The last article provided a brief introduction to clustering. This one demonstrates how to conduct a basic clustering analysis in the statistical computing environment R (I have actually split it into 2 parts as it got rather long!). For demos like this it is easiest to use a small data set, ideally with few features relative to instances. The one used in this example is the Acidosis Patients data set, available from this collection of clustering data sets. It has 40 instances, each corresponding to a patient, and 6 features, each corresponding to a measurement of blood or cerebrospinal fluid.
It’s well known that data preparation is often the most time-consuming part of any data science project. Before you can do much with the data, though, it first needs to be imported into your analysis software, which can sometimes prove trickier than anticipated.
Part 1 of this intro to k-NN discussed the algorithm itself. Part 2 demonstrates an implementation of the algorithm in R. I have chosen to work with the Banknote Authentication data set from the UCI Machine Learning Repository. This data set consists of measurements taken from 400 x 400 pixel pictures of forged and genuine bank notes. The pictures were grey scale with a resolution of about 660 dpi. A wavelet transform tool was used to extract features from the pictures. The features used are the variance, skewness and kurtosis of the wavelet-transformed image and the entropy of the image. The class label is whether the bank note is genuine or not (0=no, 1=yes). k-NN should be a reasonable choice of algorithm for this data set, as the features are numerical and there are not too many of them in relation to the number of instances, though obviously other factors (e.g. the amount of noise in the data set) are also important.
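To give a rough feel for the idea before reading the full post, here is a minimal sketch of k-NN classification on four numerical features. It is written in Python rather than R and is not the implementation from the post. The function name `knn_predict` is hypothetical, and the rows are made-up illustrative values in the order (variance, skewness, kurtosis, entropy), not records from the Banknote Authentication data set.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training instance
    distances = [
        (math.dist(x, query), label)
        for x, label in zip(train_X, train_y)
    ]
    # Keep the k closest instances (sorted by distance only)
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the labels of those neighbours
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up rows: (variance, skewness, kurtosis, entropy).
# These are NOT values from the Banknote Authentication data set.
train_X = [
    (3.6, 8.7, -2.8, -0.4),
    (4.5, 8.2, -2.5, -1.5),
    (3.9, -2.6, 1.9, 0.1),
    (-1.4, 3.3, -1.4, -2.0),
    (-2.5, -2.9, 2.4, 0.2),
    (-1.9, -1.0, 1.5, 0.3),
]
train_y = [1, 1, 1, 0, 0, 0]  # class label as described above (0=no, 1=yes)

prediction = knn_predict(train_X, train_y, (4.0, 8.0, -2.6, -1.0), k=3)
print(prediction)
```

Because the vote is driven entirely by distances, features on very different scales would normally be rescaled first. That step is left out here to keep the sketch short.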
People sometimes ask what the best tool or software package for doing data science is. The answer is the tool or tools that get the job done! Data science is more about the process and results than any particular piece of software or technology. It is definitely useful for the analyst to be familiar with a range of tools so that he or she can choose the most appropriate one for the task at hand.