Tuesday, February 13, 2007

The magical sample size in polls

Now that political polls are a hot item, it is time to demystify the sentence that accompanies many public opinion polls (not only political ones). It typically reads: "the poll included 1033 adults and has a sampling error of plus or minus three percentage points".

No matter what population is being sampled, the sample size is typically around 1,000 and the precision is almost always "±3%" (this is called the margin of error).

If you type "poll" in Google you will find plenty of examples. One example is the Jan 2, 2007 NYT Business section article "Investors Greet New Year With Ambivalence". It concludes that
"Having enjoyed a year that was better than average in the stock market and a much weaker one in housing, home owners and investors appear neither exuberant nor glum about 2007." 
The article describes the poll behind this conclusion as follows: "The telephone survey was conducted from Dec. 8 to 10 and included 922 adults nationwide and has a sampling error of plus or minus three percentage points."

Discover Card also runs its own survey to measure the economic confidence of small business owners. The survey goes to "approximately 1,000 small business owners", and Discover explains that "The margin of error for the sample involving small business owners is approximately +/- 3.2 percentage points with a 95 percent level of confidence".

So how does this work? To specify a sample size one must consider

  1. the population size from which the sample is taken (if it is small, then a correction factor needs to be taken into account), 
  2. the precision of the estimator (how much variability from sample to sample we tolerate), and 
  3. the statistical confidence level or, equivalently, the significance level (denoted alpha), with the corresponding normal distribution percentile Zalpha/2.

The formula that links the sample size to these three factors and the magical 3% margin of error is:

3% = estimator standard deviation * correction factor * Zalpha/2

In polls, the parameter of interest is a population proportion, p, e.g., the proportion of Democratic voters in the US. The sample estimator is simply the sample proportion of interest (e.g., the proportion of Democratic voters in the sample). This estimator has a standard deviation equal to √(p(1-p)/n), where n is the sample size (its variance is p(1-p)/n).

You will notice that the variance is largest when p=0.5. Using this worst case gives a conservative bound on the estimator precision (#2 above). So now we have
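To see why p=0.5 is the worst case, here is a small sketch in Python that scans a grid of candidate proportions and reports the one with the largest standard deviation. The function name standard_error and the grid of values are illustrative choices, not part of any polling organization's method.

```python
import math

def standard_error(p, n):
    """Standard deviation of a sample proportion (simple random sampling assumed)."""
    return math.sqrt(p * (1 - p) / n)

n = 1000  # illustrative sample size
candidates = [i / 100 for i in range(1, 100)]  # p = 0.01, 0.02, ..., 0.99

# Pick the proportion whose estimator varies the most from sample to sample.
worst_p = max(candidates, key=lambda p: standard_error(p, n))

print("worst-case p:", worst_p)
print("standard error at worst case:", round(standard_error(worst_p, n), 4))
```

Any other value of p gives a smaller standard deviation, so planning for p=0.5 can only overstate the margin of error, never understate it.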

3% = √(0.25/n) * correction factor * Zalpha/2

Regarding population size, in polls it is typically assumed to be very large, so there is no need for a correction factor. Finally, the most popular significance level is 5% (that is, a 95% confidence level), which corresponds to Z0.025 = 1.96, or approximately 2.
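The percentile can be checked with Python's standard library. This is only a sketch; the variable names alpha and z are my own, and a two-sided interval is assumed, which is why alpha is split in half.

```python
from statistics import NormalDist

alpha = 0.05  # significance level (95% confidence)

# Two-sided interval: put alpha/2 in each tail, so we need the
# (1 - alpha/2) quantile of the standard normal distribution.
z = NormalDist().inv_cdf(1 - alpha / 2)

print(round(z, 2))
```

The value is slightly below 2, so rounding it up to 2 makes the resulting margin of error a touch more conservative.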


3% = √(0.25/n) * 2 = 1/√n

If you plug in n=1,000 on the right-hand side, you will get approximately 3%, which is the relationship between sample size and margin of error used in most public opinion polls. The part that people often find surprising is that no matter how large the population, be it 10,000 or 300,000,000, essentially the same sample size is required for obtaining this level of precision and this level of statistical confidence.
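The numbers can be reproduced with a short Python sketch. Two things here are assumptions on my part: the function name margin_of_error, and the specific correction factor, for which I use the standard finite population correction √((N-n)/(N-1)), since the post does not spell out its formula.

```python
import math

def margin_of_error(n, population=None, z=1.96, p=0.5):
    """Margin of error for a sample proportion.

    If `population` is given, apply the standard finite population
    correction sqrt((N - n) / (N - 1)); this formula is an assumption,
    the post only mentions "a correction factor".
    """
    se = math.sqrt(p * (1 - p) / n)
    if population is not None:
        se *= math.sqrt((population - n) / (population - 1))
    return z * se

# Sample sizes quoted in the polls above, treating the population as very large.
for n in (922, 1000, 1033):
    print(n, "->", round(100 * margin_of_error(n), 2), "percentage points")

# Same sample size, very different population sizes.
for population in (10_000, 300_000_000):
    moe = margin_of_error(1000, population=population)
    print(population, "->", round(100 * moe, 2), "percentage points")
```

With a sample of 1,000 the margin stays close to three percentage points for both population sizes: the correction shaves off only a small fraction when the population is 10,000 and is negligible when it is 300,000,000.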
