My book's out!
Buy my book "Data to Information to Decision and ten things people get wrong about numbers, data, and models," available on Amazon!
DATA to INFORMATION to DECISION and ten things people get wrong about numbers, data, and models
From space debris missing New York to reading too much into Big Data correlations; from MSNBC hosts confusing millions with trillions to serious people thinking that demand in the new economy was upward-sloping (spoiler: it wasn't); with a detour through the strange world of statistics, where "2 is not different from 3," and Bayesian reasoning applied to jewelry theft, this book shows how quantitative "numbers"-thinking can help make sense of the world better than qualitative "words"-thinking. And how to have fun doing it.
We follow the path from data (a bunch of related numbers) to information (the part of the data that matters for decisions) to decisions (the reason we care about the data in the first place). The path comes with pitfalls, ten of which are:
Using numbers without context ("Tesla made 7000 cars in one week")
Using samples that aren't representative ("most flights arriving at SFO are very full")
Ignoring truncation or censoring of data ("average customer buys 3 pints of ice-cream per visit")
Ignoring hidden factors in causality ("flying private whitens your teeth")
Over-aggregating data ("most people who join a gym get very little benefit, long-term")
Over-fitting models ("Accentuate's model is better than BangorCG's because it fits more of the data")
Seeing patterns in randomness ("Aiden had a perfect sequence of 10 heads in 10 coin flips; must be skill")
Misunderstanding statistical results ("this test is 95% reliable, so my positive means I'm 95% likely to be sick")
Using averages to understand extreme behavior ("if the average runners in California and Montana are about equally fast, the fastest runner is equally likely to be from either state")
Focusing too much on the analysis to the detriment of the decision ("let's wait for more data, there's still uncertainty")
The parenthetical examples are all free puzzles, just for reading the description here! Solutions in the book. (Click to get a free sample.)
Why read this book?
There are many books purporting to be about quantitative matters. Some focus on people, making them little more than nerd gossip; many are primarily stories of events, discoveries, or companies, and may help readers shine at social events, though not necessarily by making them better thinkers; and then there are books that focus on the concepts, principles, tools, and pitfalls of quantitative thinking. This book is of the last kind.
Using a cooking analogy, there are books about celebrity chefs, restaurant intrigue, or the story behind a specific dish; those make for good dinner-table conversation, but don't help in the kitchen.
This book is a primer, a "how to do the basics and avoid the most common disasters in the kitchen," with pointers to some more advanced cookbooks in the last chapter and related kitchen best practices in the technical notes. (Click to buy book.)
Overview
The book is organized around the process of data-informed decision-making, or, as the title puts it, DATA to INFORMATION to DECISION:
Introductory chapters
Four chapters motivate us to pay attention to numbers, data (lots of related numbers), and models (tools to manipulate numbers). Using examples, we introduce concepts and principles of data analysis and decision-making as part of specific applications; with repetition from different perspectives, this holistic introduction tends to be more effective than listing concepts and principles first. (These are summarized at the end.)
Pitfall 1: Numbers without context
Numbers need context to have meaning outside of their mathematical properties: 300 is 3 times 2-squared times 5-squared no matter where it appears, so its mathematical meaning doesn't change; but in the physical world 300 dollars is too expensive for a lunch sandwich (they tend to cost less than $30) and a good price for a new gaming laptop (they tend to cost over $1500). Appropriate comparisons help us give meaning to numbers. Some numbers come with uncertainty attached, so they need even more context: they need a probability distribution.
Leveling-up our skills
Contemporary technology puts in our hands a tool that can (to some extent) replace professional tools of just a few decades ago: the spreadsheet. All that's needed is a change of attitude, from “solving problems” to “building models.” That change plus a technique called grid search can help us become much better at quantitative thinking, without having to learn advanced math or a programming language.
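The book's tool of choice is the spreadsheet, but for readers who already know a little Python, the same grid-search idea can be sketched in a few lines. The pricing problem, demand model, and unit cost below are hypothetical, chosen only to illustrate the technique of trying every candidate on a grid and keeping the best:

```python
# Grid search: evaluate a model over a grid of candidate inputs and keep the best.
# Hypothetical example: choose a price to maximize profit, assuming a simple
# linear demand model (demand = 1000 - 40 * price) and a unit cost of $5.

def profit(price, unit_cost=5.0):
    demand = max(0.0, 1000 - 40 * price)   # assumed demand model
    return (price - unit_cost) * demand

# Try every price from $5.00 to $25.00 in 1-cent steps.
candidates = [5 + 0.01 * i for i in range(2001)]
best_price = max(candidates, key=profit)

print(f"best price: ${best_price:.2f}, profit: ${profit(best_price):.2f}")
```

The same search could be built in a spreadsheet column of candidate prices, a column of computed profits, and a MAX over the results; no calculus required.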
Path from data to decision
We trace the path from data to information to decision: how to use models to separate information from noise in data and how to use decision-support tools to help us make decisions using the information. The full process is illustrated in the figure above. This book focuses on the information extraction and decision-making parts; we cover some measurement issues in the technical notes and leave the execution of actions to books on management and operations.
Pitfall 2: Sample selection biases
Usually we can't collect data for the entire population of interest, so we must use a sample; the sample should represent the population as well as possible (be representative) relative to the question we ask. Sometimes the way we get the sample makes it non-representative: for example, if we're interested in the average height of US persons but use a convenience sample of professional basketball players; often sampling errors aren't as obvious as this, but they are as serious.
Pitfall 3: Data truncation and censoring
Sometimes data that we can't see carries information we'd like to have: customers who choose not to go into a store aren't observed by the store, but their choice has information that might be used by the managers to change things so that those customers choose to go in the next time (and buy stuff). Data that isn't observed is said to be truncated, while data that is observed but distorted (customers come into the store but don't buy anything; treating them as zero revenue hides the unobserved level of their discontent) is said to be censored.
Statistics interlude
Because we use data from a sample to draw conclusions about a population, our observations are not just numbers: they are numbers with probabilities attached (random variables). Statistical testing uses those attached probabilities to determine whether some population value is still possible with some confidence given the sample value. This is why sometimes we read “the measured value was 2.55, but it's not significantly different from a value of 2.75.” What that means is that the measured value in the sample was 2.55, but because of the randomness of the sample we can't exclude the possibility that the real value in the population is 2.75.
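The "2.55 is not significantly different from 2.75" idea can be sketched numerically. The sample below is made up for illustration (it is not from the book); its mean happens to be 2.55, and the confidence interval shows why 2.75 can't be ruled out:

```python
# A minimal sketch of significance testing with an illustrative, made-up sample.
import math
import statistics

sample = [2.1, 2.9, 2.4, 3.0, 2.2, 2.7, 2.5, 2.6, 2.3, 2.8]  # hypothetical data
n = len(sample)
mean = statistics.mean(sample)                  # sample estimate: 2.55
sem = statistics.stdev(sample) / math.sqrt(n)   # standard error of the mean

t_crit = 2.262  # two-sided 95% t critical value for n - 1 = 9 degrees of freedom
low, high = mean - t_crit * sem, mean + t_crit * sem
print(f"estimate: {mean:.2f}, 95% CI: ({low:.2f}, {high:.2f})")

# If a hypothesized population value (here 2.75) falls inside the interval,
# the sample can't rule it out at the 95% confidence level.
print("2.75 ruled out?", not (low <= 2.75 <= high))
```

The sample average is 2.55, but the interval of population values consistent with the sample stretches past 2.75, which is exactly what "not significantly different" means.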
Pitfall 4: Hidden factors and causality
Causality is difficult to determine from observed data: if A and B are correlated (they move together either in the same or opposite directions), does A cause B, B cause A, or are both A and B caused by a hidden factor C? Taking creatine supplements is correlated with drinking protein shakes, but protein shakes don't cause the need for creatine, nor does creatine cause the desire for protein shakes: they are both caused by the hidden factor “trying to build muscle,” which also causes “working out with weights.”
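A small simulation makes the hidden-factor mechanism concrete. All probabilities below are assumed for illustration: a lurking factor ("training hard") drives both observed behaviors, which then correlate with each other even though neither causes the other:

```python
# Simulating a hidden common cause, loosely following the creatine /
# protein-shake example. All probabilities here are made up for illustration.
import random

random.seed(42)
n = 10_000
training = [random.random() < 0.3 for _ in range(n)]        # hidden factor C
creatine = [t and random.random() < 0.7 for t in training]  # A: caused by C
shakes   = [t and random.random() < 0.8 for t in training]  # B: also caused by C

def rate(xs, given):
    """Fraction of xs that are True among positions where given is True."""
    hits = [x for x, g in zip(xs, given) if g]
    return sum(hits) / len(hits)

# Creatine users are far more likely to drink shakes, even though neither
# causes the other; the hidden factor does all the work.
print("P(shakes | creatine)    =", round(rate(shakes, creatine), 2))
print("P(shakes | no creatine) =", round(rate(shakes, [not c for c in creatine]), 2))
```

The two conditional rates differ sharply, so A and B are strongly correlated, yet intervening on A (handing out creatine at random) would do nothing to B.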
Pitfall 5: Over-aggregating data
When we aggregate data we lose detail information: if we average the calories of all meals in a restaurant menu we get some number; but that number will not help calorie-counting patrons choose between the salad and the lasagna. Aggregation of data becomes a problem when the populations being aggregated (the salad and the lasagna) are very different in the dimensions of interest (calories). In certain pathological cases, aggregating data may even reverse the information in the disaggregate data.
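The reversal mentioned above is Simpson's paradox, and a few lines of arithmetic show it. The numbers below are adapted from the classic kidney-stone example often used to illustrate the effect (they are not from the book):

```python
# Simpson's paradox in miniature: treatment A wins inside every subgroup,
# yet loses in the aggregate. Numbers adapted from the classic
# kidney-stone illustration of the reversal.
data = {
    # group: {treatment: (successes, trials)}
    "small": {"A": (81, 87),   "B": (234, 270)},
    "large": {"A": (192, 263), "B": (55, 80)},
}

def rate(successes, trials):
    return successes / trials

for group, treatments in data.items():
    a, b = rate(*treatments["A"]), rate(*treatments["B"])
    print(f"{group}: A = {a:.0%}, B = {b:.0%} -> A better: {a > b}")

# Aggregate over groups: the direction flips.
tot = {t: tuple(map(sum, zip(*(data[g][t] for g in data)))) for t in ("A", "B")}
a_all, b_all = rate(*tot["A"]), rate(*tot["B"])
print(f"overall: A = {a_all:.0%}, B = {b_all:.0%} -> A better: {a_all > b_all}")
```

The flip happens because treatment A was applied mostly to the hard ("large") cases; aggregating hides that the two subgroups differ in the dimension of interest.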
Pitfall 6: Over-fitting models
We use models to extract information from data, so it might be reasonable to think that the better a model fits the data, the better that model is (this is a commonly held misconception). But data includes the information (the part we want) and noise (things we don't want), so a model that fits the data better might be picking up some of that noise. We look into this problem, illustrate an approach that can sometimes help, and show how that approach breaks down with “Big Data” data sets.
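A toy simulation, with data assumed here for illustration, shows why a perfect fit can be a bad sign. The "memorizer" model fits the training data exactly, noise and all; a plain average fits the training data worse but predicts fresh data better:

```python
# A minimal sketch of over-fitting with made-up data: the true signal is a
# constant (5.0) plus noise. A model that memorizes every training point has
# zero training error but replays old noise when judged on new data.
import random

random.seed(7)
TRUE_VALUE = 5.0

def noisy():
    return TRUE_VALUE + random.gauss(0, 1)  # signal + noise

train = [noisy() for _ in range(500)]
test  = [noisy() for _ in range(500)]

simple_prediction = sum(train) / len(train)  # model with one parameter: the mean

def mse(predictions, actual):
    return sum((p - a) ** 2 for p, a in zip(predictions, actual)) / len(actual)

# The memorizer "predicts" the training data exactly: zero training error...
print("memorizer, train error: ", mse(train, train))  # exactly 0.0
print("mean model, train error:", round(mse([simple_prediction] * len(train), train), 2))
# ...but on new data from the same process, the simpler model wins.
print("memorizer, test error:  ", round(mse(train, test), 2))
print("mean model, test error: ", round(mse([simple_prediction] * len(test), test), 2))
```

The memorizer's training error is zero while the mean model's is about 1, yet on the test data their ranking reverses: fitting the data better meant fitting the noise.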
Pitfall 7: Seeing patterns in randomness
When is an apparent pattern a real pattern as opposed to a coincidence? In this chapter we look into probabilities of compound events, where the analysis is not for one specific outcome ("I flipped a coin ten times and got ten heads") but for the universe of outcomes (there were 20,000 people flipping coins ten times; how likely was it that at least one of them got ten heads?). Many apparently unlikely events become trivial events when analyzed in this manner.
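The coin calculation above fits in two lines of Python, a sketch of the general "universe of outcomes" computation:

```python
# One specific person getting ten heads in ten fair flips is rare (1 in 1024),
# but with 20,000 people flipping, someone almost certainly will.
p_one_person = 0.5 ** 10                                 # ten heads in ten flips
p_at_least_one = 1 - (1 - p_one_person) ** 20_000        # complement of "nobody"

print(f"one specific person:    {p_one_person:.6f}")
print(f"at least one of 20,000: {p_at_least_one:.6f}")
```

The trick is computing the complement: the probability that nobody gets ten heads, which with 20,000 independent tries is vanishingly small, so "at least one" is essentially certain.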
Bayesian interlude
We learn to integrate new information with previous information using Bayes's rule. Though the rule itself is simple, there are many errors people tend to make when reasoning about probabilities; these errors can be avoided by following the Bayesian reasoning process. We illustrate the process and cap the interlude with two common puzzles that many people get wrong.
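The "95% reliable test" pitfall from the list above is a classic Bayes's-rule calculation. The disease prevalence and accuracy figures below are assumed for illustration (the book's own numbers may differ):

```python
# Bayes's rule on a medical-test puzzle, under assumed numbers: the disease
# affects 1 in 1,000 people, and the test is 95% accurate both ways.
prior = 0.001        # P(sick)
sensitivity = 0.95   # P(positive | sick)
specificity = 0.95   # P(negative | healthy)

# Total probability of testing positive: true positives + false positives.
p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)

# Bayes's rule: P(sick | positive) = P(positive | sick) * P(sick) / P(positive)
p_sick_given_positive = sensitivity * prior / p_positive

print(f"P(sick | positive) = {p_sick_given_positive:.3f}")  # far below 0.95
```

With a rare disease, the false positives from the vast healthy population swamp the true positives, so a positive result means only about a 2% chance of being sick, not 95%.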
Pitfall 8: Interpreting model results
Building on the statistical interlude and the Bayesian interlude, we learn how to interpret the results of statistical tests and the quality metrics associated with model estimates. We also cover issues of over-reliance on significance to include or exclude variables from a model and the problem known as p-hacking.
Pitfall 9: Judging extremes by the averages
Given two groups of people from the same population, one with 100 people and another with 1000 people, the fastest runner in the group with 1000 people is likely to be faster than the fastest runner in the group with 100 people. In this chapter we explore how the behavior of the extremes changes with group size, mean, and variance, and illustrate how wrong we can be when using averages alone to predict behavior at the extremes (sadly, not an uncommon error).
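A quick simulation makes the group-size effect visible. The speed distribution below is assumed purely for illustration; what matters is that both groups draw from the same one:

```python
# Two groups drawn from the SAME distribution of speeds; the larger group
# almost always contains the faster fastest runner. Distribution parameters
# (normal, mean 15 km/h, sd 2) are assumed for illustration.
import random

random.seed(1)

def fastest(group_size):
    return max(random.gauss(15, 2) for _ in range(group_size))

trials = 2_000
larger_wins = sum(fastest(1000) > fastest(100) for _ in range(trials))
print(f"larger group has the fastest runner in {larger_wins / trials:.0%} of trials")
```

In theory the fastest of all 1100 runners is equally likely to be any of them, so the group of 1000 wins 1000/1100, about 91%, of the time; the averages of the two groups, meanwhile, are indistinguishable.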
Pitfall 10: Decision analysis
Part of making a decision is knowing how and when to make it. When making decisions under uncertainty with information incrementally revealed during the decision process, there's a trade-off to be made between waiting to get better information and the cost of waiting to make the decision. We also consider how to define what the decision is about; how that relates to option value; and the case of when information extraction and decision-making are done by separate people or organizations (adding the intermediate step of communicating the information).
Last chapter
We look at ways to have fun with numbers, and how we can use that fun to keep our quantitative skills honed. Because learning technical material is an audience participation activity, we look into different levels of understanding (to clarify the meaning of “understanding” technical material). At the end, we have recommendations of materials for further exploration.
Shorts
Most chapters are followed by a short vignette related to, but not directly about, the topic of the chapter: typically a related quantitative error, a related point that is a less common source of problems than the pitfall itself, or an interesting application of the material covered.
Technical notes and extras
We keep technical elements mostly out of the main text, so they're at the end for readers who may be interested. There are also some extra materials there to complement the chapters, which would be distracting in the main text. (Some of these materials require a little more math knowledge than the main text.)