A detailed synopsis of the book ‘Data Science for Business’

A detailed synopsis of the book ‘Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking’, written by Foster Provost and Tom Fawcett and published by O’Reilly Media.

The preface alone, tackling the ever-present discussion of cross-over from academia to commerce, illustrates some nifty and jaw-dropping knowledge re-use techniques that put the reader in the right frame of mind to be creative in exploiting big data.

Chapter 1 centres the argument, with good coverage of the well-known project where Walmart’s competitor Target managed to use their sales data to predict when their customers were likely to start families and change their buying patterns accordingly. See the work by Duhigg (2012) in the bibliography for the full story.

Chapter 2 addresses the terminology in the hope that we can start standardising the language of this new field. I am not convinced their choices will stand the test of time, but the task certainly needs doing if we are all to speak the same language. There are many names for the same things, because so many fields converge in the field of Data Science. The authors quite rightly go back to the first principles of the pioneers, including my personal hero Claude Shannon, with his work on information theory dating from 1948.

If, as I do, you learn by experience rather than theory, you may be interested in their explanation of the emerging CRISP-DM model: the Cross Industry Standard Process for Data Mining, introduced by Shearer (2000).

By chapter 6 (Similarity, Neighbours and Clusters) things get really interesting and certainly stretch what I see in most regular commercial decision making. The authors introduce another ingredient which suddenly makes the whole mixture fizz: the fundamental concept of similarity between data items. Their bibliography leads to the ‘Dictionary of Distances’ by Deza & Deza (Elsevier Science, 2006), which explains the many ways of calculating the distance between two things. Let me explain just a few:

  • Euclidean distance. If you know your lat and long and you know the lat and long of your customer, think Pythagoras and use squares to approximate how far away they are, as the crow flies.
  • Manhattan distance. A better way when customers can’t fly direct, but have to use a grid of roads to reach you.
  • Cosine distance. A calculation used in text classification to measure the similarity of two documents.
  • Edit distance, or the Levenshtein metric. Also used in biology. How similar are two gene strings? How similar are two pieces of copy? Same calculation.
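To make those four measures concrete, here is a minimal Python sketch. It is my own illustration, not code from the book: the function names are mine, and the Euclidean and Manhattan versions assume the points are already flat (x, y) coordinates in the same unit, which is only a rough stand-in for real lat and long.

```python
import math


def euclidean(a, b):
    """Straight-line distance: Pythagoras on the coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Grid distance: sum of the absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus the cosine of the angle between two non-zero vectors,
    for example word-count vectors of two documents."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn s into t."""
    previous = list(range(len(t) + 1))  # distances from "" to prefixes of t
    for i, char_s in enumerate(s, start=1):
        current = [i]  # distance from s[:i] to ""
        for j, char_t in enumerate(t, start=1):
            substitution = 0 if char_s == char_t else 1
            current.append(min(
                previous[j] + 1,                  # delete char_s
                current[j - 1] + 1,               # insert char_t
                previous[j - 1] + substitution,   # substitute or keep
            ))
        previous = current
    return previous[-1]
```

For example, `euclidean((0, 0), (3, 4))` gives 5.0 while `manhattan((0, 0), (3, 4))` gives 7, and `levenshtein("kitten", "sitting")` gives 3. The same `levenshtein` function works unchanged on gene strings or on advertising copy, which is the point of the last bullet.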

Chapter 7 gets really personal: how accurate are you yourself? How many decisions do you make? How often are they correct? What is your performance as a data analyst? What is the cost/benefit of all this? And introducing … the confusion matrix.
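For anyone who has not met a confusion matrix, here is a small Python sketch of the idea for yes/no decisions. It is my own illustration rather than the authors’ code; the function names and the benefit and cost figures in the usage note are hypothetical.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count the four outcomes of a yes/no prediction.

    actual and predicted are equal-length sequences of booleans.
    """
    counts = Counter(zip(actual, predicted))
    return {
        "tp": counts[(True, True)],    # predicted yes, really yes
        "fn": counts[(True, False)],   # predicted no, really yes
        "fp": counts[(False, True)],   # predicted yes, really no
        "tn": counts[(False, False)],  # predicted no, really no
    }


def accuracy(matrix):
    """Share of decisions that were correct."""
    return (matrix["tp"] + matrix["tn"]) / sum(matrix.values())


def net_value(matrix, benefit_per_tp, cost_per_fp):
    """Crude cost/benefit: reward correct 'yes' calls, charge for false alarms.

    The two money amounts are assumptions supplied by the caller.
    """
    return matrix["tp"] * benefit_per_tp - matrix["fp"] * cost_per_fp
```

With `actual = [True, True, True, False, False, False, False, False]` and `predicted = [True, True, False, True, False, False, False, False]`, the matrix holds 2 true positives, 1 false negative, 1 false positive and 4 true negatives, so `accuracy` is 0.75. With a hypothetical benefit of 10 per hit and a cost of 3 per false alarm, `net_value` comes to 17.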

The subject matter of Chapter 8, on ROC curves, graphs for visualizing, and profit graphs, will in truth be familiar to most, but I for one will be re-reading it a few times to scavenge tips for tuning my own algorithms.

Chapter 9 turns to such activities as Targeting Online Consumers With Advertisements. Read it for the fascinating discussion of the 2013 study by Michal Kosinski, David Stillwell and Thore Graepel on how what people “Like” on Facebook is quite predictive of many unexpected traits.

Accompanied by a really cool footnote / academic joke (I think): “For those unfamiliar with Facebook, it is a social networking site that allows people to . . . ”

Chapter 10 is on text. (SEO)

Chapter 11 revisits the ‘why’ of all this: is it worth doing, and (chapter 12) does it tell us something we did not really suspect before? That leads to chapter 13: a look at Business Strategy which, quite frankly, is where I would have started if I had written this book.

A cracking book and I cannot recommend it too highly. The information they impart borders on the commercially competitive.

Warning: you do need to have retained your high-school level grasp of maths notation to absorb the contents.


I do that too! Data Mining and Data-Analytic Thinking: Data Science for Business

Foster Provost and Tom Fawcett have set out to write the go-to reference on Big Data: ‘Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking’, published by O’Reilly Media.

They have produced an authoritative book that is both a pathfinder and a lighthouse. It is a long, clearly-written book that shows what can be done using Big Data, where to go and what techniques to use to get it done, and what to watch out for.

Thank you for writing this book. The authors and their many references are already established and respected. The book brings the issues and their business applications together in one essential place. Already in just 1 month since release (25th July 2013) the eBook has gathered praise quotes from a dozen industry names. I am honoured to receive a complimentary review copy.

So to add to the recommendations, I pitch my review slightly differently: Who in business should buy this book? What does this book add to what we are already doing in business with Data and Data Mining?

On first reading, if you work in analysis, IT, Business Intelligence, Management Reporting, Marketing or SEO, I guarantee your reaction at some point will be ‘I do that too’.

For me the ‘Aha!’ realisation came a few pages into chapter 2. The authors discuss database searches for the most profitable items in a business. All businesses do that every day! But not always in the way the academics think.

The book surprised me by covering a broader range of topics than I had previously considered to be Data Science. Here are some great success stories to illustrate what data science is. Buy the book to see how these things really work and how the leading companies are applying themselves to these challenges. These studies border on the commercially sensitive.

  • How a supermarket can use their sales analysis to predict when people are expecting a baby, and so gain an advantage by making offers before their competitors.
  • How advertisers use Facebook Likes to profile and segment their audience.
  • How Netflix make their movie recommendations.
  • How to compare web pages for plagiarism.
  • How to tell how far away a customer is from their mobile app.

Chapter 10 talks about text analysis. In contrast to most of the book, I would say here that small and medium sized businesses are ahead of Google and the academics. While the search engines refine their algorithms to extract news and meaning from bare text, there is a whole industry sector manipulating the source data to fool the algorithms and keep one step ahead: it is called Search Engine Optimisation.

If you are just starting out in using Big Data for your business decisions, you need to know the importance of Maths. In particular there are two challenges in the mathematics that underpin Data Science that I should warn you about even if you do not read the book:

  • One is causation and correlation. When you find the beer-buying customers are also the nappy-buying customers, that is just the first step towards some very careful thinking before you draw any conclusions about which is cause and which is effect, and how you might adjust your marketing or product mix to assist your customers accordingly.
  • The other is what is now called ‘Overfitting’. Gaze hard enough and you will find trends in data just like you can find shapes in clouds or patterns on the back of your eyelids. If you search too hard through too much data, you invalidate correlation coefficients and confidence calculations. Or to put it another way, every cloud looks like something.
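The ‘every cloud looks like something’ effect is easy to demonstrate for yourself. The Python sketch below is my own illustration, not taken from the book: it invents a random target and a pile of equally random ‘features’, then reports the strongest correlation it can find. The function names, the default of 20 rows and the seed are all assumptions made for the demonstration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


def best_chance_correlation(n_features, n_rows=20, seed=0):
    """Strongest absolute correlation between a random target and
    n_features random columns. Nothing here is truly related to
    anything else, so whatever is found is pure coincidence."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_rows)]
        best = max(best, abs(pearson(feature, target)))
    return best


if __name__ == "__main__":
    for n in (5, 50, 1000):
        print(n, round(best_chance_correlation(n), 2))
```

Because the seed is fixed, each larger search includes the columns of the smaller one, so the best correlation can only stay the same or grow as you look through more columns. With only a handful of rows it usually grows to something that would look impressive on a slide, which is exactly the trap.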

A great book. For everyone who can still manage their high-school level maths, I recommend you buy this book. For everyone else, I recommend you be aware of the book and the issues within it and get it on the corporate bookshelf. For myself I look forward to checking back regularly for future editions as the science develops. Five stars.

The Modern Web, by Peter Gasston



What would you expect in a book titled ‘The Modern Web’? A discussion on the connected world? The global village? How openness and the social web are bringing down levels of violence and crime, raising levels of transparency and honesty, and in a very small way bringing about the Kingdom of Heaven on Earth? How the wide availability of information is disrupting commerce, education, and our social structures?



I read The Modern Web, by Peter Gasston, from O’Reilly publishers in exchange for a public review. I was quite surprised to find the author describes the Modern Web from his perspective as a front-end web designer with only passing familiarity with server-side or any other aspects of the web. In other words, the view from the coal-face or interface between creative design agency and content deliverer.


The book is a revelation of the jobbing skills of the agency web developer!

  • Why every other web page looks the same these days, like a pumped up WordPress template.
  • The trials of implementing creative concepts within the confines of the web standards committees.
  • Playing catch-up as feature after feature leapfrogs into the www.
  • The nuts and bolts of what works today in CSS, HTML5 and the rival browsers.
  • JavaScript with or without jQuery.

My first instinct is to review the author, not the book, and ask Peter whether he realises that a developer with front-end and no back-end will soon be obsolete. My second instinct is to speculate whether the emerging additions of local storage, WebRTC, and beyond will in fact make everyone else obsolete. Maybe much of legacy software development will soon be swamped by the rising tide of HTML5 and JavaScript?

Interesting questions, all raised by the writing of this book, but none of them directly answered. Instead the author makes a workmanlike job of explaining the snapshot that is today (actually the day before yesterday) in the web developer’s world.

Some really useful sections:

  • Ever wondered why Wikipedia authors insist on those strange video formats? Peter Gasston nails it in his ‘Format Wars’ breakout box in chapter 9, where I read the most concise and enlightening explanation of the codecs and patents and their impact on your design decisions.
  • Ever wanted an overview of modern JavaScript on a ‘need-to-know’ basis rather than a ‘learn the fundamentals’ basis? Chapter 5 is your friend, even up to and including explaining Polyfills and Shims.

There is not much else in this book that I did not know already, and I am not even a web developer. But it is useful to have all this stuff in one place.

The Modern Web is also, fittingly, a brilliant source of links. I was intrigued from the start by the retro typeface of the frontis (New Baskerville, I think), the name of his publishers (‘No Starch Inc’), the name of Gasston’s blog (Broken Links, http://broken-links.com/), his technical reviewer, David Storey, and his disarming acknowledgements (“Although I’ve never met him, I’d like to thank David Walsh for maintaining an excellent website that I have used a lot.”)

A brilliant read for following avenues of links, and a great single point of reference for current practice in HTML coding.

I bet he uses Notepad.

Disruptive Possibilities: How Big Data Changes Everything

Disruptive Possibilities: How Big Data Changes Everything – by Jeffrey Needham.

I agreed to write a pre-release review of this brief (70 page), intriguing book for O’Reilly in return for a complimentary copy. I enjoyed it hugely. On the first read-through I was perplexed by the author’s rattlesnake-fast pace, throwing out ideas and theories with self-assurance but little reference or external justification. So I read it again, and if you download this book, I do not hesitate to say you will enjoy the author’s challenging, assertive and clearly experienced take on Big Data and what it means for industry.

And if you come from within the industry itself, as Jeff and I do, you will enjoy it all the more for the rear-mirror view Jeff sketches out of how IT infrastructures reached where they are now.

Of all the technologies that came and went and still remain stitched together to support commercial information processing as it is today, Jeff is strongest on Oracle, a little video, File Systems and Storage Networks. I hope if he revises and improves this book he will add some real-world references to his story, to give the rest of us some concrete examples to enjoy.

I can’t say I remember Jeff Needham from anywhere else before reviewing this book, so I checked out his profile on LinkedIn. If you try it, the author is the Jeff from Hortonworks, source for Apache Hadoop training and certification. Looking at his resume, I think I could once have been a customer! Jeff says he ‘owned’ portions of one of the first ANSI-compliant, 64-bit C compilers back in the last century!

An experienced chap, then.

Writing a pre-release review carries some responsibility: I hope these comments and critiques will encourage you to identify how this book relates to your own position, and download the book. I also have to give out advice that may swiftly need updating if the author, editor or publisher react to me. Namely: the book is incredibly short, without references, and without many supporting case-studies beyond the forcefulness and self-confidence of the author.

You know what Hadoop is, then? And how it came into being? This book will certainly explain that. Jeff was there at the birth, involved deeply in solving the challenges faced by Yahoo and the early internet businesses with high customer volumes and transaction volumes. He focuses on infrastructure, storage, and platform. Software, algorithms and design take second place in Jeff’s world view.

Jeff explains how the responses of the pioneers to these issues are re-usable for the other drivers of high volume computing. Future IT requirements are an exponentially growing multiple of:

  • large customer numbers
  • the development of recording not just entity state and transactions but also all changes, views, and interactions of those
  • data from sensors
  • mobile, of course.

What is ‘Big Data’? The author’s key assertion is that Moore’s Law – the prediction of continued improvement in the computing price/performance ratio – applies more to processing power than to data storage costs. And this, he believes, shapes the way super-computing evolves to embrace massively parallel but ideally very simple information processing operations on cheap hardware: the ‘platform’, he calls it.

( Aside: On the details, I don’t entirely buy everything he says. I bounced the idea off Kelly Sommers (@kellabyte) on Twitter and she commented: “I’ve found scaling RAM far more expensive than storage. Instagram recently saved 75% moving to persisted DB from in-mem DB.”)

The consequence, he explains very convincingly, is nothing short of disruption. Disruption to business models, and disruption to incumbent IT staff and processes.

An extremely convincing and thought-provoking book. Not a source for explaining exactly what Hadoop is built from or how it works, but still well worth the read.

Also refreshingly different in style to most other books in this area. Long, passionate stretches of dense argument that cover the ground as fast as a thriller.


Step By Step Microsoft Excel 2013 by Curtis D. Frye


Curtis Frye, author of Step by Step Microsoft Excel 2013, is a long standing, prolific and popular author. So when O’Reilly offered me a complimentary copy of the latest edition to review, I was intrigued. What is so special about the Step By Step format? What do you learn that you cannot learn from other sources? Is it worth you paying money to buy the book?

For Office Excel strugglers worldwide, the answer is a definite yes. Buy this book, even if you already bought the ones before, and catch up with your Excel skills. The author presents well and concisely, so you can cover simple manoeuvres in short order. For management, strategists, and advanced users, the answer is yes, but a qualified yes. For both groups of interested readers, read on and I will explain why.

Firstly, Curtis Frye is clearly experienced at imparting information. So while there is probably very little that is new here, and the content is all about the presentation and operation side of Excel, what the author teaches he teaches well and quickly.

Next up <spoiler alert/> this 2013 edition of the book has a section that I had not seen in earlier editions lying around the workplaces I pass; and this section alone is worth the cover price: feature comparison.

The main point about Excel, strategically speaking, is features. Features are what persuade corporates to keep renewing their licences. Features are what introduce the quirks. Features make and break productivity. Features are what are driving us away from proprietary products to open-sourced and standards-based alternatives. Features, in short, creep.

So, for the features comparison alone, this book is worth the cover price. You will find it in chapter 1 where the author promises, rightly, that “You will learn how to: Identify the different Excel 2013 programs, Identify new features of Excel 2013”.

But. Why do we see earlier versions of this book lying around everywhere? Mostly in unwanted locations: coffee tables, water-cooler library shelves, on empty desks with spines unbroken . . . Why? Why, with the combined resources of, say, MSDN, MrExcel, Google, and F1 Help, is this thirst for knowledge unfulfilled?

Having read Curtis Frye’s book start to finish, and a good part of earlier editions for comparison, I suspect the answer lies as much in the Excel product as in the book. Top-heavy with features, Excel as a tool leaves the user with the taste of unfulfilment, the lingering feeling that they could be doing so much more. Like a scraped iceberg or a half-peeled onion. And meeting that feeling is really the core point of this book.

The book is beautifully presented. Aimed totally at the beginner (even formulas do not appear until midway through chapter 3), the layout has always been clear. The first page of each chapter lists the coming topics as page images, in angled perspective, like PowerPoint 2013. The typography is lovely and clear, with very thin-stemmed sans-serif fonts that show well in shrunk, distorted or mobile situations. Text chunks are well-sized and broken up by the usual break-out boxes of hints, tips and asides.

The first time I read the book, my eye skipped most of the break-out boxes. On the second read-through, I noticed something more interesting. Those break-out texts are not the usual tips, tricks and power-user advice. More often they represent the author’s struggles and frustrations as he works through the product himself! This is the secret of his success: like you, novice and inquisitive reader, Curtis has also been challenged by the creeping features and complexity of Excel. Look closely and you can detect a parallel subtext in the break-out boxes; a challengingly deadpan acceptance of the Microsoft Office counter-culture.

Like: “IMPORTANT If your data is the wrong type to be represented by the chart type you select, Excel displays an error message.”

Or: “TIP If you don’t know which distribution to choose, use Linear. .. you will most likely know, or be told by a colleague, when to use the others”

Altogether an empathetic author.

Oh, and for power users, a consolation prize at the end of the book: the legendary Excel keyboard shortcuts list!

Good luck.

Shipping Greatness

O’Reilly sent me a copy of “Shipping Greatness” to review. This is a book that presses so many buttons you just have to read it. If necessary, standing up in the bookshop, or in the checkout queue. In my case I was so inspired by my first reading that I launched into shipping an App just to practise the ideas in the book. It took a little longer than I had anticipated, which is why the review got delayed a few weeks.

So, what did I learn from the book?

Chris Vander Mey worked at Google, Amazon, Microsoft and other high tech leaders. He shares all the best bits he learned about shipping software, i.e. the full development lifecycle, including the bits at each end: the conception and the after-sales services.

In this book he shares:

  • how to pitch your idea
  • how to design
  • how to secure team buy-in
  • current successful techniques of UX (user experience) design
  • how to size
  • how to price
  • commercial quality control
  • how to launch
  • incremental releasing
  • how to support

This is totally invaluable. How come software tools get better and better, skills more and more widely known, hardware continues to develop with Moore’s law, and yet only a handful of tech leaders seem to be able to deliver solutions that work? While many businesses struggle with software releases and, in some famous cases, even collapse under the weight of upgrades, mergers or development projects? This book shares the secrets.

Chris Vander Mey name-drops and quotes extensively from the working practices of Bezos, Schmidt et al.

He tells us what works for them and therefore what might work for you. This book is full of gems of advice for anyone involved anywhere in the software development cycle.
And in this wired world, that means all of us.

Open Source and Small Businesses

Well, well, well. Here are some real surprises from some genuinely credible data-mining.

Tim O’Reilly and Hari Ravichandran of Endurance International Group (EIG) managed to pull out big-data statistics from a massive sample of ISP clients. The authors combine these measures with intelligent knowledge of market size, trends and a couple of other pieces. They tell a good story of open source and its role in the economy.

The book is “Economic Impact of Open Source on Small Business: A Case Study” by Mike Hendrickson, Roger Magoulas and Tim O’Reilly.

I bet you want some ideas for yourself? I imagine more potential readers are looking for opportunity rather than for overall market understanding. There are some convincing claims in this book that could benefit many of us:


  • Which CMS is likely to be the most cost-effective?
  • Which programming languages really matter in the present market?
  • What kinds of business can make money from free content?

This is a small publication: read it and throw it away because fast-moving data like this needs to be consumed fresh. You will learn the answers to questions like the above. The book is well worth buying if you are about to set out on a new commercial venture. It is also a valuable signpost through the many technical options in ebusiness. And a sobering counter, in a surprising way, to over-engineers.

Available at O’Reilly