Monthly Archives: August 2013

A detailed synopsis of the book ‘Data Science for Business’

This is a detailed synopsis of the book ‘Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking’, written by Foster Provost and Tom Fawcett and published by O’Reilly Media.

The preface alone, tackling the ever-present discussion of cross-over from academia to commerce, illustrates some nifty and jaw-dropping knowledge re-use techniques that put the reader in the right frame of mind to be creative in exploiting big data.

Chapter 1 centres the argument, with good coverage of the well-known project where Walmart’s competitor Target managed to use their sales data to predict when their customers were likely to start families, and so change their buying patterns. See the work by Duhigg (2012) in the bibliography for the full story.

Chapter 2 addresses the terminology in the hope that we can start standardising the language of this new field. I am not convinced their choices will stand the test of time, but the task certainly needs doing if we are all to speak the same language. There are many names for the same things, because so many fields converge in Data Science. The authors quite rightly go back to the first principles of the pioneers, including my personal hero Claude Shannon, with his work on information theory dating from 1948.

If, as I do, you learn by experience rather than theory, you may be interested in their explanation of the emerging CRISP-DM model (the Cross Industry Standard Process for Data Mining), introduced by Shearer (2000).

By chapter 6 (Similarity, Neighbours and Clusters) things get really interesting and certainly stretch what I see in most regular commercial decision making. The authors introduce another ingredient which suddenly makes the whole mixture fizz: the fundamental concept of similarity between data items. Their bibliography leads to the ‘Dictionary of Distances’ by Deza & Deza (Elsevier Science, 2006), which explains the many ways of calculating the distance between two things. Let me explain just a few:

  • Euclidean distance. If you know your lat and long and you know the lat and long of your customer, think Pythagoras and use squares to approximate how far away they are, as the crow flies.
  • Manhattan distance. A better way when customers can’t fly direct, but have to use a grid of roads to reach you.
  • Cosine distance. A calculation used in text classification to measure the similarity of two documents.
  • Edit distance or the Levenshtein metric. Also used in biology. How similar are two gene strings? How similar are two pieces of copy? Same calculation.
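To make the four measures above concrete, here is a minimal sketch in plain Python. It is my own illustration, not code from the book; the function names are invented for the example, and treating lat and long as flat x/y coordinates is an assumption that only holds roughly over short distances.

```python
import math


def euclidean(a, b):
    """Straight-line distance: Pythagoras generalised to any number of dimensions."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Grid distance: the sum of the absolute differences along each axis."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 minus the cosine of the angle between two vectors (e.g. word counts).

    Assumes neither vector is all zeros.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def levenshtein(s, t):
    """Minimum number of single-character inserts, deletes or substitutions."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete from s
                            curr[j - 1] + 1,      # insert into s
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]
```

For two points three units apart on one axis and four on the other, `euclidean` gives 5.0 and `manhattan` gives 7, which is exactly the bird versus the road grid. The edit distance between "kitten" and "sitting" is 3.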

Chapter 7 gets really personal: how accurate are you yourself? How many decisions do you make? How often are they correct? What is your performance as a data analyst? What is the cost/benefit of all this? And introducing … the confusion matrix.
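For anyone who has not met a confusion matrix, the sketch below counts the four possible outcomes of a yes/no decision and then attaches a cost or benefit to each. It is my own illustration under stated assumptions: the sample decisions and the monetary values are invented, and the function names are hypothetical rather than taken from the book.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for boolean labels."""
    counts = Counter(zip(actual, predicted))
    return {
        "TP": counts[(True, True)],    # said yes, was yes
        "FN": counts[(True, False)],   # said no, was yes
        "FP": counts[(False, True)],   # said yes, was no
        "TN": counts[(False, False)],  # said no, was no
    }


def accuracy(cm):
    """Share of decisions that were correct."""
    return (cm["TP"] + cm["TN"]) / sum(cm.values())


def expected_value_per_decision(cm, value):
    """Average cost/benefit per decision, given a value for each outcome."""
    return sum(cm[k] * value[k] for k in cm) / sum(cm.values())


# Invented example: ten customers, did they respond and did we target them?
actual = [True, True, True, False, False, False, False, False, False, False]
predicted = [True, True, False, True, False, False, False, False, False, False]
cm = confusion_matrix(actual, predicted)

# Hypothetical values: a hit earns 50, a wasted offer costs 5.
value = {"TP": 50.0, "FP": -5.0, "FN": 0.0, "TN": 0.0}
print(cm, accuracy(cm), expected_value_per_decision(cm, value))
```

The point of separating the two calculations is that accuracy treats every mistake alike, whereas the cost/benefit view lets a cheap mistake and an expensive one count differently.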

The subject matter of Chapter 8 (ROC curves, graphs for visualising model performance, and profit graphs) will in truth be familiar to most, but I for one will be re-reading it a few times to scavenge tips for tuning my own algorithms.

Chapter 9 turns to such activities as Targeting Online Consumers With Advertisements. Read it for the fascinating discussion of the 2013 study by Michal Kosinski, David Stillwell and Thore Graepel on how what people “Like” on Facebook is quite predictive of many unexpected traits.

Accompanied by a really cool footnote / academic joke (I think): “For those unfamiliar with Facebook, it is a social networking site that allows people to . . .”

Chapter 10 is on text. (SEO)

Chapter 11 revisits the ‘why’ of all this: is it worth doing and (chapter 12) does it tell us something we did not really suspect before? This leads to chapter 13: a look at Business Strategy which, quite frankly, is where I would have started if I had written this book.

A cracking book and I cannot recommend it too highly. The information they impart borders on the commercially sensitive.

Warning: you do need to have retained your high-school level grasp of maths notation to absorb the contents.


I do that too! Data Mining and Data-Analytic Thinking: Data Science for Business

Foster Provost and Tom Fawcett have set out to write the go-to reference on Big Data: ‘Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking’, published by O’Reilly Media.

They have produced an authoritative book that is both a pathfinder and a lighthouse. It is a long, clearly-written book that shows what can be done using Big Data, where to go and what techniques to use to get it done, and what to watch out for.

Thank you for writing this book. The authors and their many references are already established and respected. The book brings the issues and their business applications together in one essential place. Already, just one month after release (25th July 2013), the eBook has gathered praise quotes from a dozen industry names. I am honoured to have received a complimentary review copy.

So to add to the recommendations, I pitch my review slightly differently: Who in business should buy this book? What does this book add to what we are already doing in business with Data and Data Mining?

On first reading, if you work in analysis, IT, Business Intelligence, Management Reporting, Marketing or SEO, I guarantee your reaction at some point will be ‘I do that too’.

For me the ‘Aha!’ realisation came a few pages into chapter 2. The authors discuss database searches for the most profitable items in a business. All businesses do that every day! But not always in the way the academics think.

The book surprised me by covering a broader range of topics than I had previously considered to be Data Science. Here are some great success stories to illustrate what data science is. Buy the book to see how these things really work and how the leading companies are applying themselves to these challenges. These studies border on the commercially sensitive.

  • How a supermarket can use their sales analysis to predict when people are expecting a baby, and so gain an advantage by making offers before their competitors.
  • How advertisers use Facebook Likes to profile and segment their audience.
  • How Netflix make their movie recommendations.
  • How to compare web pages for plagiarism.
  • How to tell how far away a customer is from their mobile app.

Chapter 10 talks about text analysis. In contrast to most of the book, I would say here that small and medium sized businesses are ahead of Google and the academics. While the search engines refine their algorithms to extract news and meaning from bare text, there is a whole industry sector manipulating the source data to fool the algorithms and keep one step ahead: it is called Search Engine Optimisation.

If you are just starting out in using Big Data for your business decisions, you need to know the importance of Maths. In particular there are two challenges in the mathematics underpinning Data Science that I should warn you about even if you do not read the book:

  • One is causation and correlation. When you find the beer-buying customers are also the nappy-buying customers, that is just the first step. Some very careful thinking is needed before you draw any conclusions about which is cause and which is effect, and about how you might adjust your marketing or product mix to assist your customers accordingly.
  • The other is what is now called ‘Overfitting’. Gaze hard enough and you will find trends in data just like you can find shapes in clouds or patterns on the back of your eyelids. If you search too hard through too much data, you invalidate correlation coefficients and confidence calculations. Or to put it another way, every cloud looks like something.
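The ‘every cloud looks like something’ effect is easy to reproduce. The sketch below is my own illustration, not the book’s: it generates a target and a pile of candidate features that are all pure random noise, then reports the strongest correlation it can find. The function names, the row and feature counts, and the seed are arbitrary choices for the example.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def best_spurious_correlation(n_rows, n_features, seed=0):
    """Strongest |correlation| between a random target and random features.

    Everything here is noise, so any 'pattern' found is spurious.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_rows)]
        best = max(best, abs(pearson(feature, target)))
    return best


one = best_spurious_correlation(n_rows=20, n_features=1)
many = best_spurious_correlation(n_rows=20, n_features=1000)
print(f"best of 1 feature:     {one:.2f}")
print(f"best of 1000 features: {many:.2f}")
```

With only 20 rows, searching through 1000 meaningless features will almost always turn up one that looks convincingly correlated. The usual defence is to test whatever you find on data that played no part in the search.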

A great book. For everyone who can still manage their high-school level maths, I recommend you buy this book. For everyone else, I recommend you be aware of the book and the issues within it and get it on the corporate bookshelf. For myself I look forward to checking back regularly for future editions as the science develops. Five stars.