Combining forecasts

I have been suggesting that the best statistical approach, when confronted with conflicting signals such as the employment estimates from the BLS payroll survey, the separate BLS household survey, or the huge database from the private company Automatic Data Processing, is not to selectively throw some of the data out but rather to combine the different measures. Judging from some of the comments this suggestion has received both here at Econbrowser as well as at Calculated Risk and Outside the Beltway, I thought it might be useful to say a little more about the benefits of combining forecasts.

Suppose we have available two polls that have surveyed voters for a particular election. The first surveyed 1,000 voters, and found that 52% of those surveyed favored candidate Jones, with a margin of error of plus or minus 3.2%. [By the way, in case you've forgotten your Stat 101, those margins of error for purposes of evaluating the null hypothesis of no difference between the candidates can be approximated as (1/N)^0.5, or 0.032 when N = 1,000]. The second poll surveyed 500 voters, of whom 54% favored candidate Jones, with the margin of error for the second poll of plus or minus 4.5%. Would you (a) throw out the second poll, because it’s less reliable than the first, and (b) then conclude that the evidence for Candidate Jones is unpersuasive, because the null hypothesis of no difference between the candidates is within the first poll’s margin of error?
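For readers who want to check the arithmetic, here is a minimal sketch of the margin-of-error approximation quoted above. The function name `margin_of_error` is my own label, not anything standard; the formula is just (1/N)^0.5, which is roughly two standard errors of a sample proportion when the true split is 50/50.

```python
import math

def margin_of_error(n):
    """Approximate margin of error under the null of a 50/50 split.

    Two standard errors of a proportion at p = 0.5 is
    2 * sqrt(0.25 / n), which simplifies to sqrt(1 / n).
    """
    return math.sqrt(1.0 / n)

moe_first = margin_of_error(1000)   # poll of 1,000 voters
moe_second = margin_of_error(500)   # poll of 500 voters

print(f"first poll:  +/- {moe_first:.3f}")
print(f"second poll: +/- {moe_second:.3f}")
```

Rounded to three decimals, these reproduce the 3.2% and 4.5% figures in the text.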

If that’s the conclusion you reach, you’re really not making proper use of the data in hand. You should instead be reasoning that, between the two polls, we have in fact surveyed 1,500 voters, of whom a total of 520 + 270 = 790 or 52.7% favor Jones. In a poll of 1,500 people, the margin of error would be plus or minus 2.6%. So, even though neither poll alone is entirely convincing, the two taken together make a pretty good case that Jones is in the lead.
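The pooling calculation in the paragraph above can be sketched in a few lines. This is only an illustration of the two-poll example; the variable names are mine, and it assumes the two polls are independent samples from the same population.

```python
import math

# (sample size, share favoring Jones) for each poll in the example
polls = [(1000, 0.52), (500, 0.54)]

total_n = sum(n for n, _ in polls)
total_for_jones = sum(n * share for n, share in polls)

# Treat the two polls as one big sample of 1,500 voters.
pooled_share = total_for_jones / total_n
pooled_moe = math.sqrt(1.0 / total_n)

# Is Jones's lead over 50% larger than the pooled margin of error?
lead_is_outside_moe = (pooled_share - 0.5) > pooled_moe

print(f"pooled share: {pooled_share:.3f}, margin of error: {pooled_moe:.3f}")
print("lead exceeds margin of error:", lead_is_outside_moe)
```

Neither poll clears its own margin of error, but the pooled lead of about 2.7 points is just outside the pooled margin of about 2.6 points.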

In the above example, it’s pretty obvious how to combine the two polls, just by counting the raw number of people covered by each poll and then combining the two as if it were one big sample. But this example illustrates a statistical procedure that works in more general settings as well. We have two different estimates, 0.52 and 0.54, of the same object. We know that the variance of the first estimate is (0.5)^2/1000, while the variance of the second estimate is (0.5)^2/500 [again, does that sound familiar from Stat 101?]. If we followed the general principle of taking a weighted average of the two, with weights inversely proportional to the variances, that would mean in this case calculating [(1000)(0.52) + (500)(0.54)]/(1000 + 500) = 0.527, which amounts to combining the two estimates in exactly the way that common sense requires for the two-poll example. That principle, of taking a weighted average of different estimates, with weights inversely proportional to the sampling variance of each, turns out to be a good way not just to combine two polls but also to combine independent estimates that may have come from a wide range of different statistical problems.
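Here is a rough sketch of that inverse-variance weighting rule, written so it accepts any number of independent estimates. The helper name `inverse_variance_combine` is hypothetical, and the sketch assumes the estimates are independent and unbiased.

```python
def inverse_variance_combine(estimates, variances):
    """Weighted average with weights proportional to 1 / variance.

    Returns the combined estimate and its variance, which for
    independent estimates is 1 / (sum of the inverse variances).
    """
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    combined = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    combined_variance = 1.0 / total_weight
    return combined, combined_variance

# The two polls: variance of each estimate is 0.5^2 / N.
estimates = [0.52, 0.54]
variances = [0.5**2 / 1000, 0.5**2 / 500]

combined, combined_variance = inverse_variance_combine(estimates, variances)
print(f"combined estimate: {combined:.3f}")
print(f"combined variance: {combined_variance:.6f}")
```

For the two-poll case the weights work out to 2/3 and 1/3, so the answer matches the pooled 1,500-voter calculation, and the combined variance equals (0.5)^2/1500.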

But what if the second poll not only covered fewer people, but is also less reliable because it is a week older? One way to think about the issue in that case is to notice that the second poll’s estimate differs from the true population proportion because of the contribution of two terms. The first is the sampling error in the original poll (correctly measured by the (0.5)^2/500 formula), and the second is the change in that population proportion over the last week. If we knew the variance governing how much public preferences are likely to change within a week, we would just add this to the sampling variance to get the total variance associated with the second estimate, and use this total variance rather than (0.5)^2/500 to figure out how strongly to downweight the earlier poll. The earlier poll would then get much less weight than the newer one, but you’d still be better off making some use of the data rather than throwing it out altogether.
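To make the week-old-poll adjustment concrete, here is a small sketch. The weekly drift figure is purely an assumption I made up for illustration (a standard deviation of 2 percentage points); the post does not supply one, and in practice it would have to be estimated.

```python
# Sampling variances from the two-poll example.
var_new_poll = 0.5**2 / 1000
var_old_sampling = 0.5**2 / 500

# ASSUMPTION (hypothetical value): public opinion drifts with a
# standard deviation of 0.02 over one week.
assumed_weekly_drift_sd = 0.02
var_old_total = var_old_sampling + assumed_weekly_drift_sd**2

# Inverse-variance weights, with and without the drift term.
weight_old_no_drift = (1 / var_old_sampling) / (
    1 / var_new_poll + 1 / var_old_sampling
)
weight_old = (1 / var_old_total) / (1 / var_new_poll + 1 / var_old_total)

combined = (1 - weight_old) * 0.52 + weight_old * 0.54

print(f"weight on older poll, sampling error only: {weight_old_no_drift:.3f}")
print(f"weight on older poll, with drift added:    {weight_old:.3f}")
print(f"combined estimate: {combined:.4f}")
```

Under that assumed drift, the older poll's weight falls from one third to a bit over one fifth, but it does not fall to zero.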

And what if you believe that one of the polls is systematically biased, but you’re not sure by how much? Many statisticians in that case might give you the OK to go ahead and ignore the second poll. On the other hand, there are many of us who would still want to make some use of that data, accepting some bias in the estimate in order to achieve a smaller mean squared error. In doing so, we acknowledge that we may make a systematic error in inference that you will avoid, but we will nevertheless be closer to the truth most of the time than you will if there are substantial benefits to bringing in extra data.
Examples where such an approach is quite well-established are estimating a spectrum (where we use the value of the periodogram at nearby frequencies, even though we know it would be a biased estimate of the spectrum at the point of interest) and nonparametric regression (where we use the value when x takes on values other than the one we’re interested in, even though again doing so necessarily introduces some bias to the final estimate).
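The bias-for-variance trade described above can be illustrated with the same two polls. This is only a sketch under assumptions of my own: the first poll is unbiased, the second has a bias whose typical size is 1 percentage point (a made-up number), and the two are independent. Mean squared error is variance plus squared bias.

```python
var_unbiased = 0.5**2 / 1000   # first poll, assumed unbiased
var_biased = 0.5**2 / 500      # second poll's sampling variance
assumed_bias = 0.01            # ASSUMPTION: hypothetical bias of second poll

def mse(weight_on_biased):
    """MSE of (1 - w) * first + w * second for independent estimates."""
    w = weight_on_biased
    variance = (1 - w) ** 2 * var_unbiased + w**2 * var_biased
    bias = w * assumed_bias
    return variance + bias**2

# Weight on the biased poll that minimizes the MSE above.
best_weight = var_unbiased / (var_unbiased + var_biased + assumed_bias**2)

mse_ignore_second = mse(0.0)
mse_best = mse(best_weight)

print(f"MSE ignoring the second poll: {mse_ignore_second:.6f}")
print(f"MSE at weight {best_weight:.3f}:        {mse_best:.6f}")
```

With these assumed numbers, giving the biased poll a weight of about 0.29 lowers the mean squared error relative to ignoring it, at the cost of a small systematic error in the combined estimate.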

Robert Clemen, in a paper in the International Journal of Forecasting in 1989, surveyed over 200 different academic studies and concluded:

Consider what we have learned about the combination of forecasts over the past twenty years…. The results have been virtually unanimous: combining multiple forecasts leads to increased forecast accuracy. This has been the result whether the forecasts are judgmental or statistical, econometric or extrapolation. Furthermore, in many cases one can make dramatic performance improvements by simply averaging the forecasts.

If I ask you what you think U.S. employment growth was in December, and your answer is the December BLS payroll number, one could say you have decided that the optimal weights to use for “combining” the payroll, household survey, and ADP estimates are 1.0, 0.0, and 0.0 respectively. But there’s an awful lot of statistical theory and practical experience to suggest those aren’t the best possible weights.

Or to put it another way, even though the BLS payroll numbers were encouraging, the fact that ADP estimates that the U.S. lost 40,000 jobs in December should surely make you a little less confident about the robustness of employment growth than you otherwise would have been.



11 thoughts on “Combining forecasts”

  1. drbrightside

    I agree 100% (with 2% margin for error). Maybe you should produce an index that combines the polls that we could all refer to and track. I bet Kudlow would pub it. I also don’t put much credence in the BLS jobs number until the following month after the first revision, though the last few months have been seemingly more accurate.

  2. spencer

    Combining multiple data series (polls or samples) increases your probability of correctly identifying an ongoing trend.
    But on the other hand doesn’t it reduce your chances of missing a turning point?
    It is my experience that if you are using economic data series as an input into an investment strategy or decision-making process, the single most important thing you need to do is identify trend changes as rapidly as possible while minimizing the chances of incorrectly calling a trend change.
    So how does combining data as you suggest fit into this evaluation of economic data?

  3. c thomson

    A trend is a trend is a trend,
    The question is, will it bend?
    Or change its course,
    Through some unforeseen force,
    Or carry along to its end?

  4. Emmanuel

    JDH, I don’t want to sound like a broken record, but I still think that combining surveys is inferior to increasing n so that it approaches N:
    These different surveys of employment are just that–surveys. What I’d like to see is NFP evolve to something more like a census to get rid of these endless revisions.
    I’d expect most businesses to have computers by now for keeping in touch with customers via e-mail, etc. So it isn’t a stretch if the Feds could tally employment by sending short online forms every month to businesses. It won’t take much for these firms to say we hired x people and lost y people twelve times a year. For more modest enterprises, it’s reasonably cost-effective to give them $499 computers to send back employment data each month.
    Doing so makes sense to me. That nobody else thinks much of this idea in econoblog land makes me sad [add limpid violin sounds here].

  5. JDH

    Spencer, although most of the traditional literature on forecast combination has assumed a quadratic loss function (in which you care as much about being over as being under), my colleagues Graham Elliott and Allan Timmermann have a paper in the Journal of Econometrics in 2004 that looks at the more general case, and finds similar methods and similar benefits obtain from combining forecasts for those settings as well.

  6. JDH

    Emmanuel, although I start out talking about the problem here using an example of sampling uncertainty, that’s not the primary shortcoming of any of the 3 employment estimates, in which n is really plenty big, e.g., 14 million workers for ADP. Instead, the issue is one of the nature of the limited population that could be available for any single method. For example, there’s just no good way any payroll-based method can handle nonpayroll workers or new establishments. There’s an important conceptual difference between the number of workers a company tells the BLS it has and the number it actually sends checks to (the ADP data). The variance of the measurement error that comes from these issues is more closely related to the second example I give of a week-old poll, where even if you’d sampled every single voter last week, you still wouldn’t have exactly the right number for predicting the election outcome. Thus, even though n is really plenty big here, there are still big gains from combining the different kinds of data. I start out talking about just the first issue of sampling variance, however, because I think it’s easiest to see what some of the basic issues are for that ideal case.

  7. spencer

    Interesting. But I do not use forecasts.
    I am strictly interested in analyzing and interpreting actual data series.

  8. JDH

    I’m not sure we’re talking about different questions, Spencer. You could think of the question, “What was U.S. employment growth in December?” as a forecasting question, in the sense of trying to predict what the BLS payroll figure, as finally revised, will turn out to be. I’m thinking of it more as an inference problem, in the sense that even the “final” estimate isn’t the real thing. I’ve been using the language about forecasting to talk about the latter question because fundamentally it’s the same kind of statistical problem.
    Even if your goal is just to interpret the current (i.e., December) data, I still think you need to consult a number of different indicators.

  9. DickF

    But on the other hand doesn’t it reduce your chances of missing a turning point?
    Wouldn’t all three show a trend if it existed?
    The system the professor is using actually makes the data more accurate and if there was a trend it would be reinforced by inclusion of all three, while an anomaly in one measure would be smoothed by including all three.

  10. Aaron Krowne

    JDH, I agree with using all available data, and thanks for the statistical lesson.

    I suspect that the highest accuracy would come from a Bayesian approach where all data points are thrown in the hopper and where the likelihood factor for each is a combination of time deprecation and a survey-connected noise factor.

    As an additional note on the job stats in question here, I’ve glanced at the recent history of the ADP and BLS data, and it appears that the BLS data tends to strongly “follow” the ADP down, but only weakly follow it up. I suspect, then, we will see some bearish revisions in subsequent months.

  11. rex

    Emmanuel: Your suggestion about getting the data directly from companies via computers is exactly what the ADP employment index does. Naturally, ADP’s universe of clients isn’t representative of the country as a whole (because failing, small or startup companies don’t outsource their payrolls), but it’s a pretty impressive sample of total employment. And it seems silly to just throw it out because the BLS survey came up with a slightly different answer.
