Guest Contribution: “Are Google data really useful for macroeconomic nowcasting?”

Today, we’re pleased to present a guest contribution by Laurent Ferrara (Professor of Economics at Skema Business School, Paris and Director of the International Institute of Forecasters).

The recent sequence of economic, financial and pandemic crises around the globe has considerably shortened the horizon of predictions for macroeconomic forecasters. At the heart of the Covid-19 crisis, the horizon of interest was the end of the week rather than two years ahead. This led practitioners to focus on new types of high-frequency and alternative datasets, thus raising new challenges for econometricians (unstructured data, very large datasets, mixed frequencies, high volatility, short samples …).

Various sources of alternative data have been used in the recent literature, such as web-scraped data, scanner data or satellite data. Generally, those datasets are extremely large and can be considered as big data. One of the main sources of alternative data is Google search data, and the seminal papers on the use of such data for forecasting are those by Hal Varian and co-authors (see for example here). In the area of nowcasting/forecasting, the literature tends to show evidence of some forecasting power for Google data, at least for some specific macroeconomic variables such as the unemployment rate (D’Amuri and Marcucci, 2017), employment (Borup and Montes Schütte, 2020), building permits (Coble and Pincheira, 2017) or car sales (Nymand and Pantelidis, 2018). However, once Google data are properly compared with other sources of information, the jury is still out on the gain that economists can get from using them for forecasting and nowcasting. A side question, highly debated on Econbrowser, is the replicability of those data by practitioners (see here for a discussion between Hal Varian and Simon van Norden).

In a recent paper, published with Anna Simoni in the Journal of Business and Economic Statistics (see here for a mimeo), we ask whether Google data are still useful in nowcasting quarterly GDP growth when controlling for official variables, such as opinion surveys or manufacturing production, that are generally used by forecasters. And if so, when exactly do those alternative data improve nowcasting accuracy? Nowcasting GDP growth is extremely useful for policy-makers to assess macroeconomic conditions in real time. The concept of macroeconomic nowcasting was popularized by Giannone et al. [2008] and differs from standard forecasting approaches in the sense that it aims at evaluating current macroeconomic conditions on a high-frequency basis. The idea is to provide policy-makers with a real-time evaluation of the state of the economy ahead of the release of official Quarterly National Accounts, which always come out with a delay. See for example here for the U.S. economy and here for a recent post on Econbrowser.

Because Google search data are of high dimension, in the sense that the number of variables is large compared to the time series dimension, there is a price to pay for using them: first, we need to reduce their dimensionality from ultra-high to high by using a screening procedure and, second, we need to use a regularized estimator to deal with the pre-selected variables. Regularization techniques are a way to include many, potentially correlated, variables in a linear regression (see for example Ridge estimation). In this respect, we put forward a new approach combining variable pre-selection and Ridge regularization, which makes it possible to handle a large database. In the paper, we provide some theoretical results on the good asymptotic properties of this estimation strategy, which we refer to as Ridge after Model Selection.
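As a rough illustration of the two-step logic described above (screen first, then shrink), here is a minimal Python sketch on synthetic data. It is not the paper's implementation: the screening rule (keeping the columns with the largest absolute marginal correlation with the target), the function names, the number of retained variables `keep` and the penalty `alpha` are all assumptions chosen for the example.

```python
import numpy as np


def screen_by_correlation(X, y, keep):
    """Indices of the `keep` columns of X with the largest |corr(x_j, y)|.

    Assumption: marginal correlation is used as the screening rule purely
    for illustration.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Constant columns have zero variance; give them zero correlation.
    denom = np.where(denom == 0.0, np.inf, denom)
    corr = np.abs(Xc.T @ yc) / denom
    return np.sort(np.argsort(corr)[::-1][:keep])


def ridge_fit(X, y, alpha):
    """Closed-form Ridge on centred data; the intercept is not penalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    beta = np.linalg.solve(gram, Xc.T @ yc)
    return y_mean - x_mean @ beta, beta


def ridge_after_selection(X, y, keep, alpha):
    """Step 1: screen the columns. Step 2: Ridge on the survivors only."""
    selected = screen_by_correlation(X, y, keep)
    intercept, beta = ridge_fit(X[:, selected], y, alpha)
    return selected, intercept, beta


# Synthetic data (hypothetical): 40 quarters, 500 candidate predictors,
# of which only the first two actually drive the target.
rng = np.random.default_rng(0)
T, N = 40, 500
X = rng.standard_normal((T, N))
y = 0.5 * X[:, 0] - 0.3 * X[:, 1] + 0.1 * rng.standard_normal(T)

selected, intercept, beta = ridge_after_selection(X, y, keep=20, alpha=1.0)
# In-sample fitted value for the last quarter, using only selected columns.
nowcast = intercept + X[-1, selected] @ beta
```

With 500 candidate series and only 40 observations, a regression on all columns has far more coefficients than data points; the screening step brings the problem down to 20 regressors before the Ridge penalty handles the remaining correlation among them.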

In addition to those theoretical results, we obtain a number of empirical results that may be of interest to anyone using high-dimensional alternative data for macroeconomic nowcasting. Our objective is to nowcast GDP growth every week of the quarter, for the U.S., the euro area and Germany over 3 types of economic periods: (i) a calm period (2014-16), (ii) a period with a sudden downward shift in GDP growth (2017-18, related to the trade war between the U.S. and China/Europe) and (iii) a recession period with large negative growth rates (2008-09, driven by the Global Financial Crisis). To do so, we use classical macro data (surveys and production), as well as alternative data stemming from Google (Google Search Data, already grouped into categories and sub-categories). We compare various approaches based on their nowcasting ability, as measured by the Root Mean Squared Forecasting Error (RMSFE). Four salient facts emerge from our empirical analysis.

First, we compare a standard regression (with Ridge regularization) with a regression after pre-selection (our Ridge after Model Selection approach). Figure 1 shows the results for the euro area during a calm period (2014-16). We clearly see the gain in nowcasting accuracy from pre-selecting data before they enter the model. The idea is that having too many variables adds too much noise. This is specifically the case with Google Search Data, as some of them are not directly related to economic activity. This result confirms previous results obtained in the context of dynamic factor models (see Bai and Ng, 2008 or Barhoumi et al., 2009).

Figure 1: RMSFEs for the euro area during a calm period (2014-16) stemming from a standard regression with Ridge regularization (blue bars) and from the Ridge after Model Selection approach (orange bars). Evolution of RMSFEs within the 13 weeks of the current quarter. Source: Ferrara and Simoni (2023)
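For readers unfamiliar with the metric, the week-by-week RMSFE shown in the figures can be sketched as follows: for each week of the quarter, take the nowcast errors across all quarters in the evaluation sample, square them, average, and take the square root. The function name and the numbers below are hypothetical and only illustrate the computation.

```python
import numpy as np


def rmsfe_by_week(nowcasts, actual):
    """RMSFE for each week of the quarter.

    nowcasts: array of shape (quarters, weeks), one nowcast per week.
    actual:   array of shape (quarters,), realized GDP growth per quarter.
    """
    nowcasts = np.asarray(nowcasts, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Broadcast each quarter's outcome across that quarter's weekly nowcasts.
    errors = nowcasts - actual[:, None]
    return np.sqrt((errors ** 2).mean(axis=0))


# Made-up example: 3 quarters, 2 weekly nowcasts per quarter (in %).
actual = [0.4, 0.3, 0.5]
nowcasts = [[0.1, 0.4],
            [0.6, 0.3],
            [0.5, 0.2]]
rmsfe = rmsfe_by_week(nowcasts, actual)
```

In the actual exercise there would be 13 columns, one per week of the quarter, so that each bar in the figures corresponds to one entry of the resulting vector.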

Second, we point out the usefulness of Google search data in nowcasting the GDP growth rate for the first four weeks of the quarter, that is, when there is no official information about the state of the current quarter. In Figure 1, we see that at the beginning of the quarter (from week 1 to week 4), Google data indeed provide an accurate picture of the GDP growth rate, in the sense that RMSFEs are reasonably low (between 0.2% and 0.3%) and only slightly higher than those at the end of the quarter, when all the information is available (about 0.2%).

Figure 2: RMSFEs for the euro area during a calm period (2014-16) stemming from a standard regression with Ridge regularization (blue bars), from the Ridge after Model Selection approach (orange bars), from the Ridge after Model Selection approach using only Google data (green bars) and from a basic regression model without any Google data (yellow bars). Evolution of RMSFEs within the 13 weeks of the current quarter. Source: Ferrara and Simoni (2023)

Third, as soon as official data become available, that is, starting from week 5 with the release of the first opinion survey of the quarter (in the euro area case), the relative nowcasting power of Google data rapidly vanishes. We see in Figure 2 that, for week 5, the RMSFE with all data (orange bar) is equivalent to the one without any Google data (yellow bar), that is, with only the macro information contained in the first survey of the quarter. We also note that RMSFEs stemming from the Ridge after Model Selection approach using only Google data (green bars) do not show any decline over time, suggesting that the gain visible in the orange bars starting from week 5 comes from the integration of macro variables.

Fourth, recession periods present a specific pattern, as the model without any pre-selection and with only Google data as information set provides the lowest RMSFEs (green bars in Figure 3). This pattern is also generally visible for German and U.S. data. This result calls for further research, but it might be related to the well-known higher uncertainty that we observe during recessions, meaning that more data must be used to account for it. In any case, this can be seen as a justification for the use of alternative data during crises.

Figure 3: RMSFEs for the euro area during a recession period (2008-09) stemming from a standard regression with Ridge regularization (blue bars), from the Ridge after Model Selection approach (orange bars), from the Ridge after Model Selection approach using only Google data (green bars) and from a basic regression model without any Google data (yellow bars). Evolution of RMSFEs within the 13 weeks of the current quarter. Source: Ferrara and Simoni (2023)

Various robustness checks confirm that those empirical results hold for all the countries/areas in our analysis and remain valid when we enlarge the macroeconomic information set by considering 22 usual variables (sales, exports, employment, …). Last, a true real-time analysis for the euro area with vintages of data confirms the ranking of the various approaches. Overall, all those results point out that Google data can be very useful for GDP growth nowcasting during expansion phases when information is lacking, after a pre-selection step. However, as soon as official macroeconomic information arrives, the marginal gain from Google data tends to rapidly vanish. During recession phases, it seems that forecasters need the largest available information set to assess what’s going on in economic activity.

This post written by Laurent Ferrara.

