I have been suggesting that the best statistical approach, when confronted with conflicting signals such as the employment estimates from the BLS payroll survey, the separate BLS household survey, or the huge database from the private company Automatic Data Processing, is not to selectively throw some of the data out but rather to combine the different measures. Judging from some of the comments this suggestion has received both
Suppose we have available two polls that have surveyed voters for a particular election. The first surveyed 1,000 voters, and found that 52% of those surveyed favored candidate Jones, with a margin of error of plus or minus 3.2%. [By the way, in case you've forgotten your Stat 101, those margins of error for purposes of evaluating the null hypothesis of no difference between the candidates can be approximated as $(1/N)^{0.5}$, or 0.032 when N = 1,000.] The second poll surveyed 500 voters, of whom 54% favored candidate Jones, with the margin of error for the second poll of plus or minus 4.5%. Would you (a) throw out the second poll, because it’s less reliable than the first, and (b) then conclude that the evidence for Candidate Jones is unpersuasive, because the null hypothesis of no difference between the candidates is within the first poll’s margin of error?
If that’s the conclusion you reach, you’re really not making proper use of the data in hand. You should instead be reasoning that, between the two polls, we have in fact surveyed 1,500 voters, of whom a total of 520 + 270 = 790 or 52.7% favor Jones. In a poll of 1,500 people, the margin of error would be plus or minus 2.6%. So, even though neither poll alone is entirely convincing, the two taken together make a pretty good case that Jones is in the lead.
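The arithmetic in this example is easy to check. Here is a minimal sketch that pools the two polls and applies the same (1/N)^0.5 approximation for the margin of error; the function and variable names are my own illustrative choices, and the numbers are the ones from the example.

```python
import math

def margin_of_error(n):
    """Approximate margin of error used in the post: (1/N)^0.5."""
    return math.sqrt(1.0 / n)

# Poll figures from the example: sample size and share favoring Jones.
n1, share1 = 1000, 0.52
n2, share2 = 500, 0.54

# Pool the raw counts as if the two polls were one large sample.
jones_count = n1 * share1 + n2 * share2
n_total = n1 + n2
pooled_share = jones_count / n_total
pooled_moe = margin_of_error(n_total)

# Compare each lead over 50% with the corresponding margin of error.
first_poll_convincing = (share1 - 0.5) > margin_of_error(n1)
second_poll_convincing = (share2 - 0.5) > margin_of_error(n2)
pooled_convincing = (pooled_share - 0.5) > pooled_moe

print(f"pooled share = {pooled_share:.3f}, margin of error = {pooled_moe:.3f}")
print("first poll alone convincing: ", first_poll_convincing)
print("second poll alone convincing:", second_poll_convincing)
print("pooled sample convincing:    ", pooled_convincing)
```

Neither poll clears its own margin of error, but the pooled sample does, if only narrowly.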
In the above example, it’s pretty obvious how to combine the two polls, just by counting the raw number of people covered by each poll and then combining the two as if it were one big sample. But this example illustrates a statistical procedure that works in more general settings as well. We have two different estimates, 0.52 and 0.54, of the same object. We know that the variance of the first estimate is $(0.5)^2/1000$, while the variance of the second estimate is $(0.5)^2/500$ [again, does that sound familiar from Stat 101?]. If we followed the general principle of taking a weighted average of the two, with weights inversely proportional to the variances, that would mean in this case calculating [(1000)(0.52) + (500)(0.54)]/(1000 + 500) = 0.527, which amounts to combining the two estimates in exactly the way that common sense requires for the two-poll example. That principle, of taking a weighted average of different estimates, with weights inversely proportional to the sampling variance of each, turns out to be a good way not just to combine two polls but also to combine independent estimates that may have come from a wide range of different statistical problems.
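For readers who want to see the general principle written out, here is a minimal sketch of an inverse-variance weighted average. The function name is hypothetical, and the variance reported for the combined estimate (one over the sum of the weights) assumes the individual estimates are independent.

```python
def inverse_variance_average(estimates, variances):
    """Weighted average with weights inversely proportional to the variances.

    Assumes the estimates are independent, in which case the variance of the
    combined estimate is 1 / (sum of the inverse variances).
    """
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    combined = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    combined_variance = 1.0 / total_weight
    return combined, combined_variance

# The two polls from the example, with the (0.5)^2 / N sampling variances.
estimates = [0.52, 0.54]
variances = [0.5**2 / 1000, 0.5**2 / 500]

combined, combined_variance = inverse_variance_average(estimates, variances)
print(f"combined estimate = {combined:.3f}")
print(f"combined variance = {combined_variance:.6f}")
```

With these inputs the weights are proportional to 1000 and 500, so the result matches what you get from pooling the raw counts, and the combined variance equals (0.5)^2/1500.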
But what if the second poll not only covered fewer people, but is also less reliable because it is a week older? One way to think about the issue in that case is to notice that the second poll’s estimate differs from the true population proportion because of the contribution of two terms. The first is the sampling error in the original poll (correctly measured by the $(0.5)^2/500$ formula), and the second is the change in that population proportion over the last week. If we knew the variance governing how much public preferences are likely to change within a week, we would just add this to the sampling variance to get the total variance associated with the second estimate, and use this total variance rather than $(0.5)^2/500$ to figure out how strongly to downweight the earlier poll. The earlier poll would then get much less weight than the newer one, but you’d still be better off making some use of the data rather than throwing it out altogether.
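To make that concrete, here is a rough sketch of the adjustment. The weekly drift figure (`weekly_drift_sd`, a standard deviation of 2 percentage points) is purely an assumption for illustration and does not come from any data; the rest follows the inverse-variance rule, treating sampling error and drift as independent.

```python
# Sampling variances for the newer poll (N = 1,000) and the older poll (N = 500).
var_new = 0.5**2 / 1000
var_old_sampling = 0.5**2 / 500

# ASSUMPTION for illustration only: public preferences drift with a standard
# deviation of 2 percentage points per week.
weekly_drift_sd = 0.02
var_old_total = var_old_sampling + weekly_drift_sd**2

def weight_on_old(var_new, var_old):
    """Share of the total inverse-variance weight given to the older poll."""
    w_new, w_old = 1.0 / var_new, 1.0 / var_old
    return w_old / (w_new + w_old)

w_old_no_drift = weight_on_old(var_new, var_old_sampling)
w_old_with_drift = weight_on_old(var_new, var_old_total)

# Combined estimate once the older poll is downweighted for drift.
combined = (1.0 - w_old_with_drift) * 0.52 + w_old_with_drift * 0.54

print(f"weight on older poll, sampling error only: {w_old_no_drift:.3f}")
print(f"weight on older poll, with assumed drift:  {w_old_with_drift:.3f}")
print(f"combined estimate with assumed drift:      {combined:.4f}")
```

The older poll loses weight but keeps some; only if the drift variance were enormous would its weight fall close to zero.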
And what if you believe that one of the polls is systematically biased, but you’re not sure by how much? Many statisticians in that case might give you the OK to go ahead and ignore the second poll. On the other hand, there are many of us who would still want to make some use of that data, accepting some bias in the estimate in order to achieve a smaller mean squared error. In doing so, we acknowledge that we may make a systematic error in inference that you will avoid, but, provided there are substantial benefits to bringing in extra data, we will nevertheless be closer to the truth most of the time than you will.
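The tradeoff can be illustrated with a short calculation. This is only a sketch under stated assumptions: the two polls are independent, the first is unbiased, the second carries a known bias (the values 0.01 and 0.05 are hypothetical), and the weights are the 2/3 and 1/3 used in the pooled example. Mean squared error is variance plus squared bias.

```python
def mse_of_weighted_average(w2, var1, var2, bias2):
    """MSE of (1 - w2) * x1 + w2 * x2 for independent x1, x2.

    x1 is assumed unbiased; x2 is assumed to have bias `bias2`.
    """
    w1 = 1.0 - w2
    variance = w1**2 * var1 + w2**2 * var2
    bias = w2 * bias2
    return variance + bias**2

var1 = 0.5**2 / 1000   # sampling variance of the first poll
var2 = 0.5**2 / 500    # sampling variance of the second poll
w2 = 500 / 1500        # weight on the second poll from the pooled example

# Ignoring the second poll entirely corresponds to w2 = 0.
mse_first_only = mse_of_weighted_average(0.0, var1, var2, bias2=0.0)

# HYPOTHETICAL bias values for the second poll, chosen only for illustration.
mse_small_bias = mse_of_weighted_average(w2, var1, var2, bias2=0.01)
mse_large_bias = mse_of_weighted_average(w2, var1, var2, bias2=0.05)

print(f"MSE using first poll only:       {mse_first_only:.6f}")
print(f"MSE combined, bias of 0.01:      {mse_small_bias:.6f}")
print(f"MSE combined, bias of 0.05:      {mse_large_bias:.6f}")
```

Under these assumptions, a modest bias in the second poll still leaves the combined estimate with the smaller mean squared error, while a large enough bias reverses the ranking. How much bias is tolerable depends on how much variance the extra data removes.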
Examples where such an approach is well established are estimating a spectrum (where we use the value of the periodogram at nearby frequencies, even though we know it would be a biased estimate of the spectrum at the point of interest) and nonparametric regression (where we use observations at values of x other than the one we’re interested in, even though doing so again necessarily introduces some bias into the final estimate).
Robert Clemen, in a paper in the International Journal of Forecasting in 1989, surveyed over 200 different academic studies, and concluded:
Consider what we have learned about the combination of forecasts over the past twenty years…. The results have been virtually unanimous: combining multiple forecasts leads to increased forecast accuracy. This has been the result whether the forecasts are judgmental or statistical, econometric or extrapolation. Furthermore, in many cases one can make dramatic performance improvements by simply averaging the forecasts.
If I ask you what you think U.S. employment growth was in December, and your answer is the December BLS payroll number, one could say you have decided that the optimal weights to use for “combining” the payroll, household survey, and ADP estimates are 1.0, 0.0, and 0.0 respectively. But there’s an awful lot of statistical theory and practical experience to suggest those aren’t the best possible weights.
Or to put it another way, even though the BLS payroll numbers were encouraging, the fact that ADP estimates that the U.S. lost 40,000 jobs in December should surely make you a little less confident about the robustness of employment growth than you otherwise would have been.