This node is meant to be a somewhat more technical retelling of the above, with a focus on how to avoid being duped.

The first way in which statistics can be distorted is through the type of sample used. Rarely is an entire population surveyed in a study. More often, a sample is taken and the data from that sample is extrapolated onto the rest of the population. It is thus vitally important that the sample be judiciously chosen.

Let’s imagine we are looking for the average height of Canadians. We choose three Canadians at random and measure their heights. The mean height we compute from this sample is a random variable, and the term has an important meaning here: its value depends on which three people happened to be drawn. It could be any number within a particular range, but not all values in that range are equally probable. Now let’s imagine that we take 50 samples of three people each and plot the 50 sample means on a histogram. The mean of this histogram (the average of the fifty averages) is the population mean as nearly as we can determine it.

Each of the fifty samples could also be plotted on its own histogram. With only three people per sample, there is a relatively high chance that one unusually tall or short person will turn up and pull that sample’s mean far from the mean of the entire population. In a larger sample, one extreme individual carries less weight, so as the sample size grows, the distribution of the sample averages has less spread.
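For the curious, the experiment above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the population is invented (100,000 heights drawn from a bell curve centred on 170 cm with a spread of 10 cm), and `sample_mean_spread` is a hypothetical helper name, not anything standard.

```python
import random
import statistics

def sample_mean_spread(population, sample_size, trials=50, seed=0):
    """Draw `trials` random samples and return the standard deviation
    of their means, i.e. how widely the sample averages scatter."""
    rng = random.Random(seed)
    means = [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(trials)
    ]
    return statistics.stdev(means)

# Invented population: 100,000 heights in cm (not real Canadian data).
_rng = random.Random(42)
population = [_rng.gauss(170, 10) for _ in range(100_000)]

spread_small = sample_mean_spread(population, sample_size=3)
spread_large = sample_mean_spread(population, sample_size=300)
print(f"spread of 50 sample means, 3 people each:   {spread_small:.2f} cm")
print(f"spread of 50 sample means, 300 people each: {spread_large:.2f} cm")
```

The second figure should come out far smaller than the first, which is the whole point: bigger samples give averages that huddle closer together.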

If we knew the true mean height of the Canadian population, we could mark it on the histogram from one trial. Our estimated value will usually be somewhat too large or too small when compared with it. How far off it will be depends on the sample size of the trial. Since a larger sample represents the whole population more effectively, it makes sense that it would do a better job of estimating the true value.

About 95% of the time, the true mean height of the Canadian population will lie within two standard deviations of the estimated value. The standard deviation in question is that of the histogram of sample means (often called the standard error), not that of individual heights. It becomes smaller as the sample becomes larger, shrinking roughly in proportion to the square root of the sample size. This means that the region of the histogram in which the true value almost certainly lies becomes narrower when a larger sample size is used. This concept may be more familiar than you think.
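The 95% figure can be checked by brute force. The sketch below assumes heights follow a bell curve with a made-up true mean of 170 cm and spread of 10 cm, and it estimates the standard deviation of the sample means from each sample as the sample's own standard deviation divided by the square root of its size. The function name `coverage` is hypothetical.

```python
import math
import random
import statistics

def coverage(true_mean, true_sd, sample_size, trials=2000, seed=1):
    """Fraction of trials in which the true mean lies within two
    standard errors of the sample mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(sample_size)]
        estimate = statistics.mean(sample)
        # Estimated standard deviation of the sample mean.
        std_error = statistics.stdev(sample) / math.sqrt(sample_size)
        if abs(estimate - true_mean) <= 2 * std_error:
            hits += 1
    return hits / trials

rate = coverage(true_mean=170, true_sd=10, sample_size=100)
print(f"true mean within two standard errors: {rate:.1%}")
```

With 100 people per sample the rate should land close to 95%. With very small samples (three people, say) the rule of thumb gets noticeably worse, because the spread estimated from so few people is itself unreliable.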

Consider polls. When a poll result is stated, it is usually in the form: “55% of Canadians say Jean Chretien should play more golf, plus or minus 5%, 19 times out of 20.” The “19 times out of 20” is the same 95% from the above paragraph. This means that 5% represents about twice the standard deviation for the set from which the 55% value is determined. The pollsters are giving you the standard deviation in disguise!
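You can run the pollsters' arithmetic yourself. The sketch below assumes a simple random sample and the usual normal approximation for a proportion, with 1.96 standard deviations standing in for "19 times out of 20". Real polls use weighting and other adjustments, so treat this as a rough guide only. Both function names are hypothetical.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion p
    measured on a simple random sample of n people."""
    return z * math.sqrt(p * (1 - p) / n)

def implied_sample_size(p, margin, z=1.96):
    """Smallest simple random sample that achieves the given margin."""
    return math.ceil(z**2 * p * (1 - p) / margin**2)

moe = margin_of_error(0.55, 400)
n = implied_sample_size(0.55, 0.05)
print(f"margin of error with 400 respondents: {moe:.1%}")
print(f"respondents needed for plus or minus 5%: {n}")
```

Under these assumptions a quoted margin of plus or minus 5% corresponds to roughly 380 respondents, which is worth remembering the next time a poll sounds authoritative.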

Another common method by which statistics are fudged is conditioning. This is the process of selecting specific subsamples within a data set for comparison. An example is the average male wage compared with the average female wage. Which subsamples you choose to compare affects the results you get.

Suppose studies have shown that kids who go to private schools earn 10% more, on average, than those who go to public schools. What does this mean? If we change the conditioning to take neighbourhood and background into account, we see the difference reduced to zero for students from good areas. This essentially means that the marginal impact of going to private school if you already live in a good area (high average income, low unemployment) is quite small. By contrast, students coming from a poor area stand to gain 10% in their average income by going to private school. Such statistical evidence (keep in mind that this is just an example) can lead to government policy decisions. The above conclusion would support a proposal for vouchers allowing poor kids to go to private school, for example.
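To see how conditioning changes the picture, here is a small sketch on invented data. The records are entirely made up and rigged to reproduce the numbers in the example above; `private_premium` is a hypothetical helper.

```python
from statistics import mean

# Invented records: (school type, neighbourhood, annual income).
records = (
    [("public", "well-off", 60_000)] * 50
    + [("public", "poor", 40_000)] * 50
    + [("private", "well-off", 60_000)] * 11
    + [("private", "poor", 44_000)] * 5
)

def private_premium(rows):
    """Percentage by which the mean private-school income
    exceeds the mean public-school income in `rows`."""
    private = mean(inc for school, _, inc in rows if school == "private")
    public = mean(inc for school, _, inc in rows if school == "public")
    return 100 * (private - public) / public

# Unconditioned comparison: everyone lumped together.
overall = private_premium(records)

# Conditioned comparison: the same calculation within each neighbourhood.
by_area = {
    area: private_premium([r for r in records if r[1] == area])
    for area in ("well-off", "poor")
}

print(f"overall premium: {overall:.1f}%")
for area, premium in by_area.items():
    print(f"premium in {area} areas: {premium:.1f}%")
```

In this toy data set much of the overall gap exists only because private school students are more likely to come from well-off areas in the first place. The same data can be quoted as "private school pays 10%" or "private school makes no difference for most students", depending on which slice is shown.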

One final statistical trick I shall examine is that of scale. Somebody can call an increase in inflation from 2% to 3% (as calculated by the Consumer Price Index, for example) a “50%” jump. In actuality, the change was only one percentage point. Whenever percentage changes are used to examine changes in small values, alarmingly large percentage changes can result. For this reason, if you are presented with very large percentage changes you ought to keep in mind that they may simply represent small variations in small quantities. For the GDP of Luxembourg to grow by 10% or even 50% represents very little actual growth compared with the GDP of the United States growing by even 1%.
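The arithmetic behind the scale trick is short enough to write down. In the sketch below the inflation figures come from the example above, while the two economy sizes are round numbers invented for illustration, not real GDP figures. The function names are hypothetical.

```python
def relative_change(old, new):
    """Change expressed as a percentage of the starting value."""
    return 100 * (new - old) / old

def growth_amount(size, percent):
    """Absolute growth when something of `size` grows by `percent`."""
    return size * percent / 100

# Inflation going from 2% to 3%.
inflation_relative = relative_change(2.0, 3.0)  # the scary headline number
inflation_points = 3.0 - 2.0                    # the change in percentage points

# Invented economies, sizes in billions of dollars (not real figures).
small_growth = growth_amount(50, 10)      # small economy growing 10%
large_growth = growth_amount(10_000, 1)   # large economy growing 1%

print(f"inflation rose {inflation_relative:.0f}% in relative terms")
print(f"inflation rose {inflation_points:.0f} percentage point")
print(f"small economy grew by {small_growth:.0f} billion")
print(f"large economy grew by {large_growth:.0f} billion")
```

Whenever you see a percentage change, ask what the starting value was. A percentage means nothing until you know what it is a percentage of.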

Remember, people can only lie to you with statistics if you let them! Be aware of how they work and you will be a less gullible member of society.

# How to lie with statistics (idea)
