Five must-read books for beginning data scientists.

As a fan of lifelong learning, one thing I like about data science and data analytics is that there are always new things to learn, and most of them have useful practical applications. The explosion of interest in the field in the last few years means there is a bewildering number of online resources, tutorials and courses for learning data science. For someone taking the first steps on their data science journey, it can be difficult to know where to start. The humble book still has its place as a source of knowledge and new ideas, though, and here are 5 books that everybody interested in data science should read.

These 5 books will give you a good introduction to the field of data science and provide a solid base for further learning. I would suggest reading them in the order listed below, but there’s nothing stopping you reading them in any order you like. In choosing this list, I tried to include books written in an instructional style as well as some written in a more narrative style. So here goes:

The Signal and the Noise by Nate Silver

If you can only read one book about data science, then make it this one. OK, strictly speaking it’s about prediction rather than data science, but in reading it you will learn much about the scope of predictive analytics and, by extension, data science. Silver writes in an engaging way about the application of predictive models to a diverse collection of problems. He describes some spectacular successes and failures of predictive modelling and explains why particular approaches did or didn’t work. A book brimming with ideas and insights.

Practical Statistics for Data Scientists by Peter Bruce and Andrew Bruce

This book does exactly what it says on the tin: it is an introduction to statistics, and it manages this while using hardly any math at all. Practical Statistics for Data Scientists focuses on explaining statistical concepts in a clear and concise way and keeps mathematical notation to a minimum. Knowledge of statistics is obviously important in data science. This book certainly won’t teach you all you need to know, but it’s a good start. If you already have some background in statistics, you might want to read something more advanced, and if you are interested in AI and machine learning, you’ll also want to brush up on your linear algebra and calculus.

An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani

This book introduces and explains the most popular algorithms for supervised and unsupervised machine learning. At the end of each chapter there are exercises you can work through in R. I’d recommend not skipping them, as they are a good test of how well you have grasped the material in each chapter. After reading this book and completing all the exercises, you should have a solid grasp of the main machine learning algorithms and a good basis for further study.

Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman and Jeffrey David Ullman

I remember being at what was otherwise a pretty good talk by an eminent data scientist when he mentioned that if you can fit the data you’re analysing in a spreadsheet, then you aren’t doing data science. That’s certainly not correct: big data and data science are not the same thing, with big data being mainly a subset of the field of data science. Nonetheless, it is important for data scientists to understand the principles behind working with big data, typically defined as a dataset too big to fit on a single machine, although it could be many multiples of that size. This book teaches data mining and machine learning methods for use with large datasets, with a focus on MapReduce as a framework for parallel processing of big data. There are good exercises in this one too, and my advice would be not to skip them.

Weapons of Math Destruction by Cathy O’Neil

I’ve reviewed this book previously and don’t have much else to add. Suffice it to say that the ability to create data products that can adversely affect people’s lives brings with it ethical responsibilities. This book highlights that, which is why it’s an essential read.

Of course, there are many more books that could be included in the list above. Some honourable mentions include (but are not limited to): The Visual Display of Quantitative Information by Edward Tufte (one area of data science not covered in the above list is data visualisation, and this book by Tufte is a classic); The Seven Pillars of Statistical Wisdom by Stephen Stigler (a great book about statistics, though not specifically about data science); The Numbers Game by Chris Anderson and David Sally (an excellent book about the application of data analytics to football; it is one of my favourites but isn’t broad enough in scope to include in the list above); Big Data: A Revolution That Will Transform How We Live, Work and Think by Kenneth Cukier and Viktor Mayer-Schönberger (a good book, though I enjoyed The Signal and the Noise more); and π – Faith in Chaos, which isn’t a book at all, or really about data science, but is about numbers, mysticism and other interesting things.
