Tuesday, February 13, 2007

The magical sample size in polls

Now that political polls are a hot item, it is time to unveil the mysterious sentence that accompanies many public opinion polls (not only political ones). It typically reads something like "the poll included 1033 adults and has a sampling error of plus or minus three percentage points".

No matter what population is being sampled, the sample size is typically around 1,000 and the precision is almost always "±3%" (this is called the margin of error).

If you type "poll" in Google you will find plenty of examples. One example is the Jan 2, 2007 NYT Business section article "Investors Greet New Year With Ambivalence". It concludes that
"Having enjoyed a year that was better than average in the stock market and a much weaker one in housing, home owners and investors appear neither exuberant nor glum about 2007." 
This result is based on a survey described as follows: "The telephone survey was conducted from Dec. 8 to 10 and included 922 adults nationwide and has a sampling error of plus or minus three percentage points."

Discover Card also runs its own survey to measure the economic confidence of small business owners. The survey goes to "approximately 1,000 small business owners", and the company explains that "The margin of error for the sample involving small business owners is approximately +/- 3.2 percentage points with a 95 percent level of confidence".

So how does this work? To specify a sample size one must consider

  1. the population size from which the sample is taken (if it is small, then a correction factor needs to be taken into account), 
  2. the precision of the estimator (how much variability from sample to sample we tolerate), and 
  3. the statistical confidence level or, equivalently, the significance level (denoted alpha), with the corresponding normal distribution percentile z(alpha/2).
The formula that links the sample size to these three factors and the magical 3% margin of error is given by:
3% = estimator standard deviation * correction factor * z(alpha/2)
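To make the formula concrete, here is a minimal Python sketch (the function name and its defaults are mine, not part of any polling organization's methodology):

```python
from math import sqrt

def margin_of_error(estimator_std_dev, correction_factor=1.0, z_alpha_2=1.96):
    # margin of error = estimator standard deviation * correction factor * z(alpha/2)
    return estimator_std_dev * correction_factor * z_alpha_2

# e.g., a proportion estimator with standard deviation sqrt(0.25/1000):
print(round(margin_of_error(sqrt(0.25 / 1000)), 3))  # 0.031, i.e., ~3%
```

The next paragraphs fill in each of the three ingredients for the special case of a proportion.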

In polls, the parameter of interest is a population proportion, p, e.g., the proportion of Democratic voters in the US. The sample estimator is simply the sample proportion of interest (e.g., the proportion of Democratic voters in the sample). This estimator has a standard deviation equal to √(p(1-p)/n), where n is the sample size.

You will notice that the standard deviation is largest when p=0.5. Plugging in this worst case yields a conservative bound on the estimator precision (#2 above). So now we have

3% = √(0.25/n) * correction factor * z(alpha/2)
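A quick numerical check (a sketch with an arbitrary grid of p values) confirms that p=0.5 is indeed the worst case for the standard deviation √(p(1-p)/n):

```python
from math import sqrt

n = 1000
for p in [0.1, 0.3, 0.5, 0.7, 0.9]:
    # standard deviation of the sample proportion for sample size n
    print(p, round(sqrt(p * (1 - p) / n), 4))
# p = 0.5 gives the largest value (~0.0158), so plugging it in is conservative
```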

Regarding the population size, in polls it is typically assumed to be very large, so there is no need for a correction factor. Finally, the popular significance level is 5%, which corresponds to z(0.025) ≈ 1.96, or approximately 2.
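For the curious: one standard form of the correction factor is the finite population correction √((N-n)/(N-1)) for a population of size N. The post does not spell it out, so treat this as my illustrative assumption; a quick check shows why it can safely be ignored for large populations:

```python
from math import sqrt

def fpc(N, n):
    # finite population correction for sampling without replacement
    return sqrt((N - n) / (N - 1))

for N in [10_000, 1_000_000, 300_000_000]:
    print(N, round(fpc(N, 1000), 4))
# 10,000 -> 0.9487; 1,000,000 -> 0.9995; 300,000,000 -> 1.0
```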

AND NOW, LADIES AND GENTLEMEN, WE GET:

3% = √(0.25/n) * 2 = 1/√n

If you plug in n=1,000 on the right-hand side, you will get approximately 3%, which is the relationship between sample size and margin of error used in most public opinion polls. The part that people often find surprising is that no matter how large the population, be it 10,000 or 300,000,000, the same sample size is required to obtain this level of precision at this level of statistical significance.
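A short sketch (the variable names are mine) verifies the 1/√n rule for the sample sizes mentioned above, and inverts it to find the n that delivers a 3% margin of error:

```python
from math import sqrt, ceil

# margin of error under the conservative p = 0.5, z ~ 2 approximation
for n in [500, 922, 1000, 1033, 2000]:
    print(n, f"{1 / sqrt(n):.1%}")
# 922 and 1033 both land near the advertised 3%

# inverting 3% = 1/sqrt(n): the sample size needed for a 3% margin of error
print(ceil((1 / 0.03) ** 2))  # 1112, regardless of population size
```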

Friday, February 02, 2007

The legendary threshold of 5% for p-values

Almost every introductory course in statistics gets to a point where the concept of the p-value is introduced. It is a tough concept, usually one of the hardest for students to internalize, and it takes time to absorb. An interesting paper by Hubbard and Armstrong discusses the confusion surrounding p-values in marketing research textbooks and journal articles.

Another "fact" that usually accompanies the p-value concept is the 5% threshold. One typically learns to compare the p-value (that is computed from the data) to a 5% threshold, and if it is below that threshold, then the effect is statistically significant.

Where does the 5% come from? I pondered that at some point. Since a p-value can be thought of as a measure of risk, a fixed 5% threshold is pretty arbitrary: some applications warrant lower risk levels, while others might tolerate higher ones. According to Jerry Dallal's webpage, the reason is historical: before the age of computers, tables were used for computing p-values, and in Fisher's original tables the levels computed were 5% and a few others. The rest, as they say, is history.