Nigel Marriott looks to data science, infographics and cognitive psychology to expand our definition of what it means to think like a statistician
In 1903, in his book Mankind in the Making, the writer H. G. Wells noted that: “The great body of physical science … [is] only accessible and only thinkable to those who have had a sound training in mathematical analysis, and the time may not be very remote when it will be understood that for complete initiation as an efficient citizen … it is as necessary to be able to compute, to think in averages and maxima and minima, as it is now to be able to read and write.” [1]
The above is a fairly tortuous passage that was shortened and simplified in 1951 by Samuel Wilks when he remarked, during his presidential address to the American Statistical Association, that: “Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.”
It is certainly a snappier quote. However, unlike Wells's original, it does require the reader to understand the phrase “statistical thinking”.
Statistical thinking – as Wilks might have thought of it – can be broken down into six core concepts: expectation and variance – which encompass the “averages and maxima and minima” that Wells refers to – plus distribution, probability, risk and correlation.
A broader perspective
These six concepts are still valid in the twenty-first century, and I have been teaching them as part of training courses for the past decade. However, 63 years after Wilks's presidential address, I believe it is time to expand the definition of statistical thinking – and I would like to put forward three new concepts: data, cognition and visualisation (Figure 1).
Let us start with data – the lifeblood of the statistician. It is not accounted for explicitly in our current definition of statistical thinking, but I argue that it needs to be. A BBC Horizon documentary declared last year that we are living in the “age of big data”, and although Google chairman Eric Schmidt wrote recently that “big data needs statisticians to make sense of it” (see News, page 3), it is the data scientist who appears to be benefiting most from growing interest in the analysis and application of data.
The whole concept of data is changing rapidly. We no longer have just numbers to deal with. Data now also includes free text and images, which present all kinds of challenges for sorting and organisation. Data scientists are perceived as having the skills to deal with this broader class of data, and perhaps that is why interest in the job is growing.
According to Google Trends data (see Figure 2), over the past decade as a whole there have been many more searches for “statistician” than for “data scientist”. However, in recent months, searches for “data scientist” have overtaken those for “statistician”. Searches for the more junior role of “data analyst” have also increased, while those for “statistician” have declined and then flattened out.
This evidence is by no means conclusive, but it would suggest to me that statisticians risk being left behind and not being seen as integral to this new (and big) data movement. Adding “data” to our definition of statistical thinking will not solve this problem in and of itself, but it will convey an important message: that statisticians, the original data scientists, embrace data in all its forms.
Think about it
Next comes “cognition” – a subject explored brilliantly by the Nobel Prize winner Daniel Kahneman in his book Thinking, Fast and Slow, [2] which examines the ability (or inability) of the human race to think statistically. Published in May 2012, the book's stated aim is to help people “improve the ability to identify and understand errors of judgement and choice, in others and eventually ourselves”.
Its origins lie in Kahneman's collaborations with Amos Tversky in the 1970s, when they posed the question: “Are people good intuitive statisticians?”. The answer, then and now, is “no”.
Kahneman's key thesis is that human beings have two types of thinking, which he denotes as System 1 (fast) and System 2 (slow). System 1 is intuitive, effortless, instinctive and based on our experiences, whereas System 2 is rooted in logical reasoning, is energy-intensive and requires memory recall. The book explores how these two types of thinking interact and influence each other. It explains numerous cognitive fallacies that we humans are prone to – which can be blamed largely on the instinctive, effortless response of System 1 – but at the end of each chapter, Kahneman summarises a number of ways we can reduce the risk of making such errors.
Statistical thinking is clearly a System 2 way of thinking, which explains why the human race struggles with it. This includes statisticians, who might assume they know enough to avoid such cognitive pitfalls. In a number of experiments, Kahneman and Tversky showed that statisticians were liable to make the same cognitive errors as non-statisticians. Two examples from Kahneman's book demonstrate this. Try to answer these questions instinctively, without thinking too much about them, before reading on.
- Steve is very shy and withdrawn, invariably helpful but with little interest in people or in the world of reality. A meek and tidy soul, he has a need for order and structure and a passion for detail. Is Steve more likely to be a librarian or a farmer?
- Linda is 31 years old, single, outspoken and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice. Rank these three statements in descending order based on the likelihood that they are true:
  - A: Linda is active in the feminist movement
  - B: Linda is a bank teller
  - C: Linda is a bank teller and is active in the feminist movement
If you said that Steve was more likely to be a librarian you fell for the representative fallacy.
Steve is clearly more representative of the stereotypical librarian, but unless you know the relative frequencies of librarians to farmers, the correct answer is “I don't know”. In fact, according to Kahneman, farmers outnumber librarians by 20 to 1, so Steve is more likely to be a farmer.
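The role of the base rate can be made concrete with a short calculation. The sketch below applies Bayes' rule using the 20-to-1 ratio quoted above; the two "fit" rates (how often the description matches a librarian or a farmer) are purely hypothetical numbers chosen for illustration, and the function name is not from Kahneman's book.

```python
from fractions import Fraction


def posterior_librarian(farmers_per_librarian, fit_librarian, fit_farmer):
    """P(librarian | description) by Bayes' rule.

    Only two occupations are considered: librarian and farmer.
    fit_librarian / fit_farmer are the (assumed) probabilities that a
    librarian / farmer matches Steve's description.
    """
    prior_librarian = Fraction(1, 1 + farmers_per_librarian)
    prior_farmer = 1 - prior_librarian
    evidence = prior_librarian * fit_librarian + prior_farmer * fit_farmer
    return prior_librarian * fit_librarian / evidence


# 20 farmers per librarian comes from the article.
# HYPOTHETICAL: the description fits 40% of librarians but only 10% of farmers.
p = posterior_librarian(20, Fraction(4, 10), Fraction(1, 10))
print(p)  # 1/6
```

Under these assumed rates, a description that is four times more typical of librarians still leaves Steve with only a 1 in 6 chance of being a librarian. With a 20-to-1 base rate, the description would need to be more than 20 times as typical of librarians as of farmers before "librarian" became the better guess.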
Statisticians generally performed better than non-statisticians on the Steve problem, but when it came to the Linda problem, 89% of non-statistical students and 85% of statistical students got it wrong. Any correct answer must rank C last, as in A-B-C, but most people said A-C-B. Yet it is mathematically impossible for the probability of C to be greater than the probability of A or of B, since C is the intersection of outcomes A and B and thus a subset of each. Kahneman was shocked that the students who took this test still got it wrong even when the error was pointed out to them. It seemed that the representative fallacy, which is characteristic of System 1 thinking, was drowning out their ability to engage System 2.
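The subset argument can be checked by simple counting. The sketch below builds a made-up population of women who match Linda's description and counts how many satisfy each statement; the two probabilities and the function name are hypothetical, chosen only to illustrate the point, and do not come from Kahneman's experiments.

```python
import random


def simulate_counts(n_people, p_feminist, p_teller, seed=0):
    """Count how many simulated people satisfy statements A, B and C."""
    rng = random.Random(seed)
    counts = {"A": 0, "B": 0, "C": 0}
    for _ in range(n_people):
        feminist = rng.random() < p_feminist  # statement A holds
        teller = rng.random() < p_teller      # statement B holds
        counts["A"] += feminist
        counts["B"] += teller
        counts["C"] += feminist and teller    # C requires both A and B
    return counts


# HYPOTHETICAL rates: 80% active feminists, 2% bank tellers.
counts = simulate_counts(10_000, p_feminist=0.8, p_teller=0.02)
print(counts)

# Everyone counted under C is also counted under A and under B,
# so C can never be the most common outcome.
assert counts["C"] <= min(counts["A"], counts["B"])
```

Whatever rates are plugged in, the final check holds, because a person can only be counted under C if she has already been counted under both A and B. That is why C cannot be ranked above either of the other two statements.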
Acknowledging “cognition” as a component of statistical thinking is important as a way of differentiating the statistical mindset – one that prizes logical reasoning over instinctive response. Statistical analysis is often used to support data-driven decision-making, but to achieve that we need to make users of data aware of the cognitive fallacies that they might succumb to, and help them avoid such traps.
This brings us neatly on to “visualisation”, which is an important aid to decision-making. Of course, statisticians are well versed in data visualisations – we are taught histograms, box plots, scatter plots, etc. But learning how to present data correctly does not automatically confer on us the skills required to convey a message in visual form. In my job as a statistical consultant, I was once chided by a client who said that whilst I might be an expert in charts, he needed a graphic, and “that is what graphic designers provide”. The field of infographics is the fusion of charts and design, and it is changing the way people view data. By adding “visualisation” to our definition of statistical thinking we are again sending a message: this time, that statistical thinkers are as good at communicating data as they are at analysing and interpreting it.
Statisticians need to keep up with changing times, and we should not be afraid of asking for help if help is needed to achieve this – whether that support comes from graphic designers, neuroscientists or other data professionals. The expertise we need in the twenty-first century is not the sole preserve of statisticians, and we should recognise this. But if we succeed in adding “data”, “cognition” and “visualisation” to our understanding of what statistical thinking is and should be, we will boast a powerful and unique combination of skills.
- 1 Wells, H. G. (1903) Mankind in the Making. London: Chapman & Hall.
- 2 Kahneman, D. (2012) Thinking, Fast and Slow. London: Penguin.