A dynamic factor model approach to incorporate Big Data in state space models for official statistics

In this paper we consider the estimation of unobserved components in state space models using a dynamic factor approach to incorporate auxiliary information from high-dimensional data sources. We apply the methodology to unemployment estimation as done by Statistics Netherlands, which uses a multivariate state space model to produce monthly unemployment figures from series observed with the labour force survey (LFS). We extend the model with auxiliary series of Google Trends on job search and economic uncertainty, and claimant counts, partially observed at higher frequencies. Our factor model allows for nowcasting the variable of interest, providing reliable unemployment estimates in real time before the LFS data become available.


Introduction
There is an increasing interest among national statistical institutes (NSIs) to use data that are generated as a by-product of processes not directly related to statistical production purposes in the production of official statistics. Such data sources are sometimes referred to as "Big Data"; examples are time and location of network activity available from mobile phone companies, social media messages from Twitter and Facebook, sensor data, and internet search behaviour from Google Trends. A common problem with this type of data source is that it is likely selective with respect to an intended target population. If such data sources are directly used to produce statistical information, then their potential selection bias must be accounted for, which is often a hard task since Big Data sources are often noisy and generally contain no auxiliary variables, which are required for bias correction. These problems can be circumvented by using such sources as covariates in model-based inference procedures: since they come at a high frequency, they can make detailed survey estimates both more precise and more timely. These techniques are known in the literature as small area estimation and nowcasting (Rao and Molina, 2015).
Official statistics are generally based on repeated samples. Multivariate time series models are therefore potentially fruitful, as they can improve the precision and timeliness of domain estimates by borrowing strength from survey data observed in preceding reference periods and in other domains. The predictive power of these models can be further improved by incorporating auxiliary series that are related to the target series observed with a repeated survey.
In this paper we investigate how auxiliary series derived from big data sources and registers can be combined with time series observed with repeated samples in high dimensional multivariate structural time series (STS) models. We consider Google Trends and claimant counts as auxiliary series for monthly unemployment estimates observed with a continuously conducted sample survey. Big Data sources have the problem that they are noisy and potentially (partly) irrelevant, and, as such, care must be taken when using them for the production of official statistics. We show that, by using a dynamic factor model in state space form, relevant information can be extracted from such auxiliary high-dimensional data sources, while guarding against the inclusion of irrelevant data.
Statistical information about a country's labour force is generally obtained from labour force surveys, since the required information is not available from registrations or other administrative data sources. The Dutch labour force survey (LFS) is based on a rotating panel design, where monthly household samples are observed five times at quarterly intervals. The resulting direct monthly estimates are, however, considered too volatile to provide sufficiently reliable figures for the employed and unemployed labour force. For this reason Statistics Netherlands estimates monthly unemployment figures, together with their month-to-month change, as unobserved components in a state space model where the observed series come from the monthly Dutch LFS, using a model originally proposed by Pfeffermann (1991). This method improves the precision of the monthly estimates for unemployment with sample information from previous periods, and can therefore be seen as a form of small area estimation. In addition it accounts for rotation group bias (Bailar, 1975), serial correlation due to partial sample overlap, and discontinuities due to several major survey redesigns (van den Brakel and Krieg, 2015).
Time series estimates for the unemployment can be further improved by including related auxiliary series. The purpose is twofold. First, auxiliary series can further improve the precision of the time series predictions. In this regard, Harvey and Chung (2000) propose a bivariate state space model to combine a univariate series of the monthly unemployed labour force derived from the UK LFS with the univariate auxiliary series of claimant counts. The latter series represents the number of people claiming unemployment benefits. It is an administrative source, which is not available for every country, and, as in the Netherlands, it can be affected by the same publication delay as the labour force series. Second, auxiliary series derived from Big Data sources like Google Trends are generally available at a higher frequency than the monthly series of the LFS. Combining both series in a time series model makes it possible to produce early predictions for the survey outcomes in real time, at the moment that the outcomes for the auxiliary series are available but the survey data are not yet; this is known in the literature as nowcasting, in other words, "forecasting the present".
In this paper, we extend the state space model used by Statistics Netherlands by combining the survey data with the claimant counts and the high-dimensional auxiliary series of Google Trends on job search and economic uncertainty. A high-dimensional set of auxiliary series could yield more information than a univariate one; moreover, the Google Trends are not affected by publication lags and can be observed at a higher frequency than the labour force series.
This paper contributes to the existing literature by proposing a method to include a high-dimensional auxiliary series in a state space model in order to improve the (real-time) estimation of unobserved components. The model accounts for the rotating panel design underlying the sample survey series, combines series observed at different frequencies, and deals with missing observations at the end of the sample due to publication delays. It handles the curse of dimensionality that arises from including a large number of series related to the unobserved components, by extracting their common factors.
Besides claimant counts, the majority of the information related to unemployment is nowadays available on the internet: from job advertisements to resumé templates and websites of recruitment agencies. We therefore follow the idea originating in Choi and Varian (2009), Askitas and Zimmermann (2009) and Suhoy (2009) of using terms related to jobs and economic uncertainty, searched on Google in the Netherlands. These time series, available from 2004 onwards, are freely downloadable in real time from the Google Trends tool, at a monthly or higher frequency. Since it is unclear from the outset which search terms are relevant, and if so, to what extent, care must be taken not to model spurious relationships with the labour force series of interest, which could have a detrimental effect on the estimation of unemployment, as happened in the widely publicized case of Google Flu Trends (Lazer et al., 2014).
Our method allows us to exploit the high-frequency and/or real-time information of the auxiliary series in order to nowcast the unemployment before the publication of the labour force data. As the number of search terms related to unemployment can easily become large, we employ the two-step estimator of Doz et al. (2011), which combines factor models with the Kalman filter, to deal both with the high dimensionality of the auxiliary series and with the estimation of the state space model. This estimator is generally used to improve the nowcast of variables that are observed, such as GDP (see Giannone et al. (2008) and Hindrayanto et al. (2016) for applications to the US and the euro area), whereas in our setting the unemployment itself is unobserved. D'Amuri and Marcucci (2017), Naccarato et al. (2018), and Maas (2019) are all recent studies that use Google Trends to nowcast and forecast the unemployment, by treating the latter as a known dependent variable in time series models where the Google searches are part of the explanatory variables. To the best of our knowledge, our paper is the first to use Google Trends in order to nowcast the unemployment in a model setting that treats the latter variable as unobserved.
We evaluate the performance of our proposed method via Monte Carlo simulations and find that it can yield large improvements in the Mean Squared Forecast Error (MSFE) of the unobserved components' nowcasts. We then assess, with in-sample and out-of-sample results respectively, whether our high-dimensional state space model improves the accuracy of the unemployment's estimation and nowcast. The out-of-sample exercise consists of a recursive nowcast. We do not venture into forecasting exercises, as Google Trends are considered to be more helpful in predicting the present rather than the future of economic activities (Choi and Varian, 2012). We conclude that Google Trends can significantly improve the fit of the model, although the magnitude of these improvements is sensitive to aspects of the data and the model specification, such as the frequency of observation of the Google Trends, the number of Google Trends factors included in the model, and the level of estimation accuracy provided by the first step of the two-step estimation procedure.
The remainder of the paper is organized as follows. Section 2 discusses the data used in the empirical analysis. Section 3.1 describes the state space model that is currently used by Statistics Netherlands to estimate the unemployment. Section 3.2 focuses on our proposed method to include a high-dimensional auxiliary series in the aforementioned model. Sections 4 and 5 report, respectively, the simulation and empirical results for our method. Section 6 concludes.

Data
The Dutch LFS is conducted as follows. Each month a stratified two-stage cluster sample of addresses is selected. Strata are formed by geographical regions. Municipalities are considered as primary sampling units and addresses as secondary sampling units. All households residing at an address, up to a maximum of three, are included in the sample (in the Netherlands there is generally one household per address). All household members aged 16 or older are interviewed. Since October 1999, the LFS has been conducted as a rotating panel. Each month a new sample, drawn according to the above-mentioned design, enters the panel and is interviewed five times at quarterly intervals. The sample that is interviewed for the j-th time is called the j-th wave of the panel, j = 1, . . . , 5. After the fifth interview, the sample of households leaves the panel. This rotation design implies that in each month five independent samples are observed. The generalized regression (GREG, i.e., design-based) estimator (Särndal et al., 1992) is used to obtain five independent direct estimates for the unemployed labour force, which is defined as a population total. This generates over time a five-dimensional time series of the unemployed labour force. Table 1 provides a visualization of the rotating panel design of the Dutch LFS. Rotating panel designs generally suffer from Rotation Group Bias (RGB), which refers to the phenomenon that there are systematic differences among the observations in the subsequent waves (Bailar, 1975). In the Dutch LFS the estimates for the unemployment based on the first wave are indeed systematically larger than the estimates based on the follow-up waves (van den Brakel and Krieg, 2015). This is the net result of several factors:
• Selective nonresponse among the subsequent waves, i.e., panel attrition.
• Systematic differences due to the different data collection modes applied to the waves. Until 2010, data collection in the first wave was based on face-to-face interviewing. Between 2010 and 2012, data collection in the first wave was based on telephone interviewing for households for which a landline telephone number was available, and face-to-face interviewing for the remaining households. After 2012, data collection in the first wave has been based on a sequential mixed-mode design that starts with web interviewing, with a follow-up using telephone and face-to-face interviewing. Data collection in the follow-up waves is based on telephone interviewing only.
• Differences in wording and questionnaire design between the waves. In the first wave a block of questions is used to verify the status of the respondent in the labour market. In the follow-up waves the questionnaire focuses on changes that occurred since the previous interview, instead of repeating the battery of questions.
• Panel conditioning effects, i.e., systematic changes in the behaviour of the respondents. For example, questions about activities to find a job in the first wave might increase the search activities of the unemployed respondents in the panel. Respondents might also systematically adjust their answers in the follow-up waves, since they learn how to keep the routing through the questionnaire as short as possible.
The Dutch labour force series is subject to a one-month publication delay, which means that the sample estimates for month t become available in month t + 1. In order to obtain more timely and precise estimates of the unemployment, we extend the model by including, respectively, auxiliary series of weekly/monthly Google Trends about job search and economic uncertainty, and monthly claimant counts, in the Netherlands.
Claimant counts are the number of registered people that receive unemployment benefits. The claimant counts for month t become available in month t + 1.
Google Trends are indexes of search activity. Each index measures the fraction of queries that include the term in question, in the chosen geography at a particular time, relative to the total number of queries at that time. The maximum value of the index is set to 100. Depending on the length of the selected period, the data can be downloaded at either a monthly, weekly, or higher frequency. The series are standardized according to the chosen period, and their values can therefore vary with the period's length (Stephens-Davidowitz and Varian, 2015). We use weekly and monthly Google Trends for each search term. Google Trends are available in real time (i.e., they are available in period t for period t, independently of whether the period is a week or a month).
The list of Google search terms used in the empirical analysis of this paper, together with their translation/explanation, is reported in Tables B.1 and B.2. A first set of terms (the one used in a previous version of this paper) was chosen by thinking of queries that could be made by unemployed people in the Netherlands. The remaining terms have been chosen by using the Google Correlate tool, selecting the queries that are highly correlated with each term of the initial set and that have a meaningful relation to unemployment and, more generally, economic uncertainty. Figure 1 displays the time series of the five waves of the unemployed labour force, together with the claimant counts and an example of a job-related Google query. They all seem to follow the same trend, which already shows the potential of using this auxiliary information in estimating the unemployment.

The Dutch labour force model and extensions
We first describe the model in use at Statistics Netherlands in Section 3.1. Next, we explain how high-dimensional auxiliary series can be added to this model in Section 3.2.

The Dutch labour force model
The monthly sample size of the Dutch LFS is too small to produce sufficiently precise estimates directly. In the past, rolling quarterly figures were published at a monthly frequency. This has the obvious drawback that published figures are unnecessarily delayed, since the reference period is the middle month of the rolling quarter. Also, real monthly seasonal effects are smoothed over the rolling quarter. Another problem, which arose after the change from a cross-sectional survey to a rotating panel design in 2000, was that the effects of RGB became visible in the labour force figures. Both problems are solved with a structural time series (STS) model, used by Statistics Netherlands since 2010 for the production of monthly statistics about the Dutch labour force (van den Brakel and Krieg, 2015). In an STS model, an observed series is decomposed into several unobserved components, such as a trend, a seasonal component, one or more cycles with a period longer than one year, regression components, and a white noise component. After writing an STS model in state space form, the Kalman filter can be applied in order to estimate the unobserved components. See Durbin and Koopman (2012) for an introduction to STS modelling.
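To fix ideas, the Kalman filter recursions for a univariate smooth trend model (the trend specification used below) can be sketched as follows. This is a minimal illustration in Python with variable names of our own choosing; the paper's actual computations are done in R on a far richer model.

```python
import numpy as np

def kalman_filter_smooth_trend(y, sigma2_R, sigma2_eps):
    """Kalman filter for a univariate smooth trend model (a minimal sketch,
    not the production model):
        y_t = L_t + eps_t,  L_t = L_{t-1} + R_{t-1},  R_t = R_{t-1} + eta_t.
    State vector: alpha_t = (L_t, R_t)'."""
    T = np.array([[1.0, 1.0], [0.0, 1.0]])  # transition matrix
    Z = np.array([[1.0, 0.0]])              # measurement vector
    Q = np.diag([0.0, sigma2_R])            # innovation only on the slope
    a, P = np.zeros(2), np.eye(2) * 1e7     # (approximately) diffuse start
    filtered = []
    for yt in y:
        a, P = T @ a, T @ P @ T.T + Q             # prediction step
        F = (Z @ P @ Z.T).item() + sigma2_eps     # innovation variance
        K = (P @ Z.T).ravel() / F                 # Kalman gain
        v = yt - (Z @ a).item()                   # one-step prediction error
        a, P = a + K * v, P - np.outer(K, Z @ P)  # update step
        filtered.append(a.copy())
    return np.array(filtered)                     # filtered (L_t, R_t)
```

Running the filter on a series with a stable underlying slope shows the level and slope estimates locking onto the trend after the diffuse start.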
Let $y^k_{j,t}$ denote the GREG estimate for the unemployment in month $t$ based on the sample observed in wave $j$, and let $y^k_t = (y^k_{1,t}, \ldots, y^k_{5,t})'$ denote the vector with the five GREG estimates for month $t$. The $y^k_{j,t}$ are treated as five distinct time series in a five-dimensional time series model in order to account for the rotation group bias. The superscript $k > 1$ indicates that the vector is observed at the low frequency. We need this notation (see e.g. Bańbura et al., 2013) to distinguish between series observed at different frequencies, because later on we will make use of Google Trends, which are available on a weekly basis. If $y^k_t$ is observed at the monthly frequency, as in the case of the unemployed labour force, then $k = 4, 5$ when the high-frequency series is observed at the weekly frequency, since a month can have either 4 or 5 weeks.
The unemployment is estimated, with the Kalman filter, as a state variable in a state space model where $y^k_t$ represents the observed series. The measurement equation takes the form (Pfeffermann, 1991; van den Brakel and Krieg, 2009):
$$ y^k_t = \imath_5 \, \theta^{k,y}_t + \lambda^k_t + e^k_t, \qquad (3.1) $$
where $\imath_5$ is a 5-dimensional vector of ones, and $\theta^{k,y}_t$, i.e. the unemployment, is the common population parameter among the five waves of the unemployed labour force. It is composed of the level of a trend ($L^{k,y}_t$) and a seasonal component ($S^{k,y}_t$):
$$ \theta^{k,y}_t = L^{k,y}_t + S^{k,y}_t. $$
The transition equations for the level ($L^{k,y}_t$) and the slope ($R^{k,y}_t$) of the trend are, respectively:
$$ L^{k,y}_t = L^{k,y}_{t-1} + R^{k,y}_{t-1}, \qquad R^{k,y}_t = R^{k,y}_{t-1} + \eta_{R,t}, \quad \eta_{R,t} \sim N(0, \sigma^2_R), $$
which characterize a smooth trend model. This implies that the level of the trend is integrated of order 2, denoted as I(2), which means that the series of the level is stationary (i.e., mean-reverting) after taking successive differences twice. The slope of the trend, $R^{k,y}_t$, is a first-order integrated series, denoted as I(1); this state variable represents the change in the level of the trend, $L^{k,y}_t$. The trigonometric stochastic seasonal component allows the seasonality to vary over time, and it is modeled as in Durbin and Koopman (2012, Chapter 3):
$$ S^{k,y}_t = \sum_{l=1}^{6} S_{l,t}, \qquad S_{l,t} = S_{l,t-1}\cos h_l + S^{*}_{l,t-1}\sin h_l + \omega_{l,t}, \qquad S^{*}_{l,t} = -S_{l,t-1}\sin h_l + S^{*}_{l,t-1}\cos h_l + \omega^{*}_{l,t}, $$
where $h_l = \pi l / 6$, for $l = 1, \ldots, 6$. The second component in equation (3.1), $\lambda^k_t = (\lambda^k_{1,t}, \ldots, \lambda^k_{5,t})'$, accounts for the RGB. Based on the factors that contribute to the RGB, as mentioned in Section 2, the response observed in the first wave is assumed to be the most reliable one and not to be affected by the RGB (van den Brakel and Krieg, 2009); therefore it is assumed that $\lambda^k_{1,t} = 0$. The remaining four components in $\lambda^k_t$ are random walks that capture time-dependent differences between the follow-up waves and the first wave:
$$ \lambda^k_{j,t} = \lambda^k_{j,t-1} + \eta_{\lambda_j,t}, \quad \eta_{\lambda_j,t} \sim N(0, \sigma^2_{\lambda}), \qquad j = 2, \ldots, 5. $$
As a result, the Kalman filter estimates for $\theta^{k,y}_t$ in (3.1) are benchmarked to the level of the GREG series of the first wave.
The third component in equation (3.1), $e^k_t = (e^k_{1,t}, \ldots, e^k_{5,t})'$, models the autocorrelation among the survey errors ($e^k_{j,t}$) in the follow-up waves due to the sample overlap of the rotating panel design. In order to account for this autocorrelation, the survey errors are treated as state variables, which follow the transition equations below:
$$ e^k_{j,t} = c_{j,t}\,\tilde e^k_{j,t}, \qquad c_{j,t} = \sqrt{\widehat{\mathrm{var}}\!\left(y^k_{j,t}\right)}, \qquad j = 1, \ldots, 5, $$
$$ \tilde e^k_{1,t} \sim N(0, \sigma^2_{\nu_1}), \qquad \tilde e^k_{j,t} = \delta\, \tilde e^k_{j-1,t-3} + \nu^k_{j,t}, \quad \nu^k_{j,t} \sim N(0, \sigma^2_{\nu_j}), \quad j = 2, \ldots, 5, \quad |\delta| < 1. $$
It follows that
$$ \mathrm{var}\!\left(\tilde e^k_{j,t}\right) = \sigma^2_{\nu_j} / (1 - \delta^2), \qquad j = 2, \ldots, 5, $$
with $\widehat{\mathrm{var}}(y^k_{j,t})$ being the design variance of the GREG estimates $y^k_{j,t}$. The scaled sampling errors, $\tilde e^k_{j,t}$, for $j = 1, \ldots, 5$, account for the serial autocorrelation induced by the sampling overlap of the rotating panel. Samples in the first wave are observed for the first time, and therefore their survey errors are not autocorrelated with survey errors of previous periods. The survey errors of the second to fifth waves are correlated with the survey errors of the previous wave three months before. Based on the approach proposed by Pfeffermann et al. (1998), van den Brakel and Krieg (2009) motivate that these survey errors should be modelled as an AR(3) process, without including the first two lags. Moreover, the survey errors of all waves are assumed to be proportional to the standard error of the GREG estimates. In this way the model accounts for heterogeneity in the variances of the survey errors, which is caused by changing sample sizes over time. As a result, the maximum likelihood estimates of the variances of the scaled sampling errors, $\sigma^2_{\nu_j}$, will have values approximately equal to one.
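The stationary variance above can be checked with a small numerical recursion (our own check, not the paper's code): iterating the variance of the survey-error chain across the waves, $v_j = \sigma^2 + \delta^2 v_{j-1}$ starting from the white-noise first wave, settles quickly near $\sigma^2/(1-\delta^2)$.

```python
# Numerical check (not the paper's code) of the stationary variance implied
# by the survey-error recursion e~_{j,t} = delta * e~_{j-1,t-3} + nu_{j,t},
# with all innovation variances equal to sigma2 for simplicity.
delta, sigma2 = 0.5, 1.0
v = sigma2                       # wave 1: pure white noise
variances = [v]
for j in range(2, 6):            # waves 2, ..., 5
    v = sigma2 + delta**2 * v    # variance recursion across waves
    variances.append(v)
# the recursion converges geometrically towards the stationary value
stationary = sigma2 / (1 - delta**2)
```

With $\delta = 0.5$ the fifth-wave variance is already within one percent of the stationary value, illustrating why the stationary approximation in the text is harmless.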
The structural time series model (3.1), as well as the models proposed in the following sections, are fitted with the Kalman filter after putting the model in state space form. We use an exact initialization for the state variables of the sampling error, and a diffuse initialization for the other state variables. The parameters that define the stochastic properties of the measurement and transition equations of a state space model are commonly called hyperparameters; these are the parameters that are assumed to be known in the Kalman filter (Durbin and Koopman, 2012, Chapter 2). In our case the hyperparameters are δ and all the parameters that enter the covariance matrices of the innovations. These hyperparameters are estimated by maximum likelihood using the Broyden-Fletcher-Goldfarb-Shanno (BFGS) optimization algorithm. The additional uncertainty from using maximum likelihood estimates of the hyperparameters in the Kalman filter is ignored in the standard errors of the filtered state variables; since the observed time series contains 185 monthly periods, this additional uncertainty is negligible. See also Bollineni-Balabay et al. (2017) for details. Both the simulation and estimation results in Sections 4 and 5 are obtained using the statistical software R.
Assuming normality of the innovations is common in state space models, because the hyperparameters of the model are estimated by maximizing a Gaussian log-likelihood, which is evaluated by the Kalman filter. Moreover, under normality, the Kalman filter yields the minimum variance unbiased estimator of the state variables. Nonetheless, as long as the state space model is linear, if the true distribution of the error terms is non-Gaussian, the Kalman filter still provides the minimum variance linear unbiased estimator of the state variables (Durbin and Koopman, 2012, Chapter 4). In this case we can further rely on quasi maximum likelihood (QML) theory in order to perform inference based on the QML estimates of the hyperparameters. This means that the hyperparameters can still be consistently estimated by maximizing the Gaussian log-likelihood (or, in general, as Gourieroux et al. (1984) argue, a density function that belongs to the family of linear exponential distributions), but we should use, if needed, the appropriate expression for the covariance matrix of the QML estimators, which captures the additional uncertainty caused by the model's misspecification (Hamilton, 1994, Chapter 13). In Appendix C we conduct a Monte Carlo simulation study and find that deviations from normality are not of concern for the performance of our method. This time series model addresses and solves the aforementioned problems of small sample sizes and rotation group bias. Every month, filtered estimates are published for the trend ($L^{k,y}_t$) and for the population parameter, defined as the filtered trend plus the filtered seasonal effect ($\theta^{k,y}_t$ in (3.1)).

Including high-dimensional auxiliary series
To improve the precision and timeliness of the monthly unemployment figures, we extend the labour force model by including auxiliary series of weekly/monthly Google Trends about job search and economic uncertainty, and monthly claimant counts, in the Netherlands. Since the claimant counts for month t become available in month t + 1, it is anticipated that this auxiliary series is particularly useful to further improve the precision of the trend and population parameter estimates after finalizing the data collection for reference month t. The Google Trends, by contrast, already arrive at a higher frequency during the reference month t. It is therefore anticipated that these auxiliary series can be used to make first provisional estimates for the trend and the population parameter of the LFS during month t, when the sample estimates $y^k_t$ are not yet available but the Google Trends become available on a weekly basis.
Weekly and monthly Google Trends are denoted throughout the paper by $x^{GT}_t$ and $x^{k,GT}_t$, respectively. We denote the dimension of the vector $x^{GT}_t$ by $n$, which can be large. In addition, we can expect the Google Trends to be very noisy, so that the signal about unemployment contained in them is weak. We therefore need to address the high dimensionality of these auxiliary series, in order to make the dimension of our state space model manageable for estimation, and to extract the relevant information from these series. For this purpose we employ a factor model, which achieves both by summarizing the information in these time series in a few common factors.
Moreover, when dealing with mixed-frequency variables and with publication delays, we can encounter "jagged edge" datasets, which have missing values at the end of the sample period. The Kalman filter computes a prediction for the unobserved components in the presence of missing observations for the respective observable variables.
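The mechanics are simple: when an observation is missing, the filter's update step is skipped and the one-step-ahead prediction serves as the estimate. A minimal sketch for a local level model (our own illustration, not the paper's model):

```python
import numpy as np

def filter_with_missing(y, q=0.1, h=1.0):
    """Local level (random walk plus noise) Kalman filter that skips the
    update step whenever an observation is missing (NaN) -- a sketch of
    how the filter handles 'jagged edge' data."""
    a, P = 0.0, 1e7                  # (approximately) diffuse start
    out = []
    for yt in y:
        P = P + q                    # prediction step (level is a random walk)
        if not np.isnan(yt):         # update only when y_t is observed
            K = P / (P + h)          # Kalman gain
            a, P = a + K * (yt - a), (1.0 - K) * P
        out.append(a)                # at missing t, this is the prediction
    return np.array(out)
```

At a missing period the filtered estimate simply carries the prediction forward, while its variance grows; once new data arrive, the update step resumes as usual.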
The two-step estimator by Doz et al. (2011) combines factor models with the Kalman filter and hence addresses both of these issues. In the remainder of this section we explain how this estimator can be employed to nowcast the lower-frequency unobserved components of the labour force model using information from higher-frequency or real-time auxiliary series.
We consider the following state space representation of the dynamic factor model for the Google Trends, with respective measurement and transition equations, as we would like to link it to the state space model used to estimate the unemployment (3.1):
$$ x^{GT}_t = \Lambda f_t + \varepsilon_t, \quad \varepsilon_t \sim N(0, \Psi), \qquad f_t = f_{t-1} + u_t, \quad u_t \sim N(0, I_r), \qquad (3.3) $$
where $x^{GT}_t$ is an $n \times 1$ vector of observed series, $f_t$ is an $r \times 1$ vector of latent factors with $r \ll n$, $\Lambda$ is an $n \times r$ matrix of factor loadings, $\varepsilon_t$ is the $n \times 1$ vector of idiosyncratic components and $\Psi$ its $n \times n$ covariance matrix; $u_t$ is the $r \times 1$ vector of factor innovations and $I_r$ is the $r \times r$ identity matrix (which follows from the identification conditions used in principal component analysis, since the factors are only identified up to rotation). Notice that the dynamic equation for $f_t$ implies the assumption that $x^{GT}_t$ is I(1) of dimension $n$, and $f_t$ is I(1) of dimension $r$. The need for this assumption will become clearer later in this section; the intuition behind it is that the factors and the change in unemployment, $R^{k,y}_t$, must be of the same order of integration.
Among others, Bai (2004) proves the consistency of the principal component analysis (PCA) estimator of I(1) factors, under the assumptions of limited time and cross-sectional dependence and stationarity of the idiosyncratic components, $\varepsilon_t$, and of non-trivial contributions of the factors to the variance of $x^{GT}_t$. We assume no cointegrating relationships among the factors. We further assume normality of the innovations, for the same reasons outlined in Section 3.1.
The consistency of the two-step estimator has been originally proven in the stationary framework by Doz et al. (2011), and extended to the nonstationary case by Barigozzi and Luciani (2017).
In the first step, the factors ($f_t$), the factor loadings ($\Lambda$), and the covariance matrix of the idiosyncratic components ($\Psi$) in model (3.3) are estimated by PCA as in Bai (2004). The matrices $\Lambda$ and $\Psi$ are then replaced, in model (3.3), by their estimates $\hat\Lambda$ and $\hat\Psi = \mathrm{diag}(\hat\psi_{11}, \ldots, \hat\psi_{nn})$ obtained in this first step.
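Under simplifying stationary assumptions, the PCA step can be sketched as follows. Function names, the normalization, and the data-generating choices in the check are our own; the paper's actual first step follows Bai (2004) for I(1) factors.

```python
import numpy as np

def pca_first_step(X, r):
    """First step of the two-step estimator (sketch): estimate r factors,
    their loadings, and a diagonal idiosyncratic covariance from a T x n
    panel X by principal components."""
    Xc = X - X.mean(axis=0)                         # center each series
    S = Xc.T @ Xc / Xc.shape[0]                     # n x n sample covariance
    eigval, eigvec = np.linalg.eigh(S)
    idx = np.argsort(eigval)[::-1][:r]              # r largest eigenvalues
    Lam = eigvec[:, idx] * np.sqrt(eigval[idx])     # loadings (one normalization)
    F = Xc @ eigvec[:, idx] / np.sqrt(eigval[idx])  # unit-variance factor estimates
    # diagonal idiosyncratic variances: total variance minus common part
    Psi = np.diag(np.maximum(np.diag(S) - (Lam**2).sum(axis=1), 1e-8))
    return F, Lam, Psi
```

On a simulated one-factor panel, the first estimated principal component is (up to sign, since the factors are only identified up to rotation) highly correlated with the true factor.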
These estimates are kept fixed in the second step, because their high dimensionality, and the associated curse of dimensionality, complicates re-estimation by maximum likelihood. Moreover, restricting the covariance matrix of the idiosyncratic components $\Psi$ to be diagonal is standard in the literature. In order to make use of the auxiliary series to nowcast the unemployment, we stack together the measurement equations for $y^k_t$ and $x^{k,GT}_t$, respectively (3.1) and the first equation of (3.3) with $\Lambda$ and $\Psi$ replaced by $\hat\Lambda$ and $\hat\Psi$, and express them at the lowest frequency (in our case the monthly observation frequency of $y^k_t$). The transition equations for the RGB and survey error components, in combination with the rotation scheme applied in the Dutch LFS, hamper a formulation of the model at the high frequency. This means that $x^{GT}_t$ needs to be temporally aggregated from the high to the low frequency (either before or after the first step, which estimates $\Lambda$ and $\Psi$). Since the weekly Google Trends $x^{GT}_t$ are I(1) flow variables, as they measure the number of queries made during each week, they are aggregated by summing over the weeks within the month (Bańbura et al., 2013):
$$ x^{k,GT}_{j,t} = \sum_{i=1}^{j} x^{GT}_{i,t}, \qquad j = 1, \ldots, k, $$
where $x^{GT}_{i,t}$ denotes the value in week $i$ of month $t$. The aggregated $x^{k,GT}_{j,t}$ are then rescaled in order to be bounded again between 0 and 100. The subscript $j$ allows for real-time updating of the aggregated Google Trends in week $j$, when new data become available; as such, this index indicates that we aggregate weeks 1 up to $j$. When $j = k$ we are at the end of the month, and we simply write $x^{k,GT}_t$ to indicate the end-of-month aggregate value.
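For the end-of-month case ($j = k$), the flow aggregation and rescaling can be sketched as follows (our own minimal implementation with hypothetical inputs; the real-time partial aggregate over weeks 1 up to $j$ is formed the same way):

```python
import numpy as np

def aggregate_weekly_to_monthly(x_weekly, weeks_per_month):
    """Sketch of the flow aggregation described above: sum the weekly index
    over the k (= 4 or 5) weeks of each month, then rescale the monthly
    series so its maximum is 100, mirroring the normalization of the
    original Google Trends."""
    monthly, i = [], 0
    for k in weeks_per_month:
        monthly.append(sum(x_weekly[i:i + k]))   # within-month flow sum
        i += k
    monthly = np.asarray(monthly, dtype=float)
    return 100.0 * monthly / monthly.max()       # bounded in [0, 100] again
```

For example, a constant weekly index of 10 over a 4-week month followed by a 5-week month yields monthly sums of 40 and 50, rescaled to 80 and 100.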
In order to get the final model, we also include a measurement equation for the univariate auxiliary series of the claimant counts, assuming that its state vector, $\theta^{k,CC}_t$, has the same composition as our population parameter $\theta^{k,y}_t$ (i.e., composed of a smooth trend and a seasonal component). In the resulting model (3.9), the innovations of the trends' slopes, $R^{k,y}_t$ and $R^{k,CC}_t$, and of the factors of the Google Trends, are allowed to be correlated. Harvey and Chung (2000) show that there can be potential gains in precision, in terms of the Mean Squared Error (MSE) of the Kalman filter estimators of $\theta^{k,y}_t$, from exploiting such correlations with related auxiliary series.

Simulation study
We next conduct a Monte Carlo simulation study in order to elucidate to what extent our proposed method can improve the nowcast accuracy of the unobserved components of interest. For this purpose, we consider a simpler model than the one used for the labour force survey. Here $y^k_t$ is univariate and follows a smooth trend model, and $x^k_t$ represents the $(100 \times 1)$-dimensional auxiliary series with one common factor ($r = 1$).
We allow the slope and factor innovations to be correlated, and we investigate the performance of the method for increasing values of the correlation parameter $\rho \in \{0, 0.2, 0.4, 0.6, 0.8, 0.9, 0.99\}$. The auxiliary series $x^k_t$ has the same frequency as $y^k_t$, and it is assumed that all $x^k_t$ are released at the same time without publication delays. The nowcast is done concurrently, i.e. in real time based on a recursive scheme. This means that at each time point of the out-of-sample period, which is the last third of the sample, the hyperparameters of the model are re-estimated by maximum likelihood on the sample available up to that period, always assuming that $y^k_t$ is not yet available at time $t$, contrary to $x^k_t$. This implies that the available data set in period $t$ equals $\{y^k_1, \ldots, y^k_{t-1}, x^k_1, \ldots, x^k_t\}$. The sample size is $T = 150$ and the number of simulations is $n_{sim} = 500$.
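The concurrent (recursive) scheme can be summarized in a short skeleton. The "model" below is a deliberately simple stand-in (an OLS regression of $y$ on $x$), not the paper's state space model, so that the expanding-window logic is self-contained; names are our own.

```python
import numpy as np

def recursive_nowcast(y, x, first_oos):
    """Skeleton of the concurrent (recursive) nowcast scheme: at each
    out-of-sample period t, parameters are re-estimated on the data
    available in real time (y up to t-1, x up to t) and y_t is nowcast."""
    nowcasts = []
    for t in range(first_oos, len(y)):
        beta = np.polyfit(x[:t], y[:t], 1)       # re-estimate each period
        nowcasts.append(np.polyval(beta, x[t]))  # nowcast y_t from x_t only
    return np.array(nowcasts)
```

Replacing the OLS fit with maximum likelihood estimation of the state space model, and the prediction with a Kalman filter run on the jagged-edge data set, gives the scheme used in the simulations.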
We consider three specifications for the idiosyncratic components and the factor loadings:
1. Homoskedastic idiosyncratic components and dense loadings.
2. Homoskedastic idiosyncratic components and sparse loadings. The first half of the elements in the loadings are set equal to zero. This specification reflects the likely empirical case that some of the Google Trends are not related to the change in unemployment.
3. Heteroskedastic idiosyncratic components and dense loadings. The homoskedasticity assumption is relaxed here, as it is again unrealistic for the job-search terms: H ∼ U(0.5, 10), Λ ∼ U(0, 1).
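The data-generating process of the first specification can be sketched as follows; the innovation variances are illustrative assumptions, not necessarily the exact values used in the paper:

```python
import numpy as np

def simulate(T=150, N=100, rho=0.8, seed=0):
    """Toy DGP: univariate smooth-trend y and N auxiliary series driven by
    one I(1) common factor whose innovations are correlated (rho) with the
    slope innovations of y. Homoskedastic idiosyncratic components, dense
    loadings Lambda ~ U(0, 1)."""
    rng = np.random.default_rng(seed)
    # correlated slope and factor innovations
    cov = np.array([[1.0, rho], [rho, 1.0]])
    zeta, eta = rng.multivariate_normal([0.0, 0.0], cov, size=T).T
    slope = np.cumsum(zeta)              # R_t: random-walk slope
    level = np.cumsum(slope)             # L_t: smooth trend (integrated slope)
    y = level + rng.normal(0, 1, T)      # measurement noise
    f = np.cumsum(eta)                   # I(1) common factor
    lam = rng.uniform(0, 1, N)           # dense loadings
    x = np.outer(f, lam) + rng.normal(0, 1, (T, N))
    return y, x, level, slope, f

y, x, level, slope, f = simulate()
```

The sparse-loadings and heteroskedastic specifications are obtained by zeroing half of `lam` and by scaling the idiosyncratic noise with draws from U(0.5, 10), respectively.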
Let α̂ k t|Ω − t denote the nowcast of the state vector α k t . We report its MSFE relative to the same measure calculated from the model that does not include the auxiliary series x k t ; recall that the latter comes down to making one-step-ahead predictions. Here h is the size of the out-of-sample period.
In every setting, both the bias and the variance components of the MSFE tend to decrease with the magnitude of the correlation parameter. The improvement is more pronounced for the slope than for the level of the trend. For the largest value of the correlation, the gain in MSFE with respect to the model without auxiliary information is around 25% for the level and 75% for the slope. Moreover, for low values of ρ, the MSFE does not deteriorate with respect to the benchmark model. This implies that our proposed method is robust to the inclusion of auxiliary information that has no predictive power for the state variables of interest. In Appendix C we report and examine additional simulation results with non-Gaussian idiosyncratic components, and draw the same conclusions as above for the MSFE and the variance of the state variables' nowcasts. The bias instead worsens as we deviate from Gaussianity, but it does not materially affect the MSFE, of which it only accounts for a small part. We therefore conclude that the performance of our method is overall robust to deviations from Gaussianity of the idiosyncratic components.
The decision to focus the simulation study on the nowcast (rather than the in-sample) performance of our method is motivated by the fact that the added value of the Google Trends over the claimant counts is their real-time availability, which can be exploited to nowcast the unemployment. Nonetheless, for completeness, in the empirical application of the next section we also report results on the in-sample performance of our method.

Application to Dutch unemployment nowcasting
In this section we present and discuss the results of the empirical application of our method to nowcasting the Dutch unemployment using the auxiliary series of claimant counts and Google Trends related to job-search and economic uncertainty.
As explained in Section 3.2, the Google series used in the model must be I(1). We therefore test for nonstationarity in the Google Trends with the Elliott et al. (1996) augmented Dickey-Fuller (ADF) test, including a constant and a linear trend. We control for the false discovery rate as in Moon and Perron (2012), who employ a moving block bootstrap approach that accounts for time and cross-sectional dependence among the units in the panel.
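A simplified sketch of this unit-root screening step is given below. We use a plain Dickey-Fuller regression with constant and trend and a fixed 5% critical value; the GLS detrending and the bootstrap false-discovery-rate control of Moon and Perron (2012) are deliberately omitted, so this is only illustrative:

```python
import numpy as np

def adf_tstat(x, trend=True):
    """Dickey-Fuller t-statistic for H0: unit root (no augmentation lags).
    Regresses dx_t on x_{t-1}, a constant and (optionally) a linear trend.
    A simplified stand-in for the GLS-detrended ADF test used in the paper."""
    x = np.asarray(x, dtype=float)
    dx, xlag = np.diff(x), x[:-1]
    n = len(dx)
    cols = [np.ones(n), xlag]
    if trend:
        cols.append(np.arange(n, dtype=float))
    X = np.column_stack(cols)
    beta = np.linalg.lstsq(X, dx, rcond=None)[0]
    resid = dx - X @ beta
    s2 = resid @ resid / (n - X.shape[1])
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    return beta[1] / se

# classify a series as I(1) when the unit root is NOT rejected at the
# (approximate) 5% critical value for the constant-plus-trend case
CRIT_5PCT = -3.41
rng = np.random.default_rng(1)
rw = np.cumsum(rng.normal(size=500))          # I(1): random walk
eps = rng.normal(size=500)
ar = np.zeros(500)
for t in range(1, 500):                       # I(0): stationary AR(1)
    ar[t] = 0.2 * ar[t - 1] + eps[t]
keep_I1 = [name for name, s in [("rw", rw), ("ar", ar)]
           if adf_tstat(s) > CRIT_5PCT]
```

In the paper, the per-series decisions are additionally adjusted for multiple testing across the whole Google Trends panel, which this sketch does not do.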
Before proceeding with the estimation of the model by only including the Google Trends that resulted as being I(1) from the multiple hypotheses testing, we carry out an additional selection of the I(1) Google Trends by "targeting" them as explained and motivated below.
Bai and Ng (2008) point out that having more data to extract factors from is not always better. In particular, if series are added that have zero loadings, and are thus not influenced by the factors, they deteriorate the estimation of factors and loadings by PCA, since PCA assigns a non-zero weight to each series when calculating the estimated factor as a weighted average. Bai and Ng (2008) recommend a simple strategy, which they call "targeting the predictors", to filter out irrelevant series (in our case Google search terms) and improve the estimation of the factors. An initial regression of the series of interest on the high-dimensional input series determines which series are (ir)relevant; the irrelevant series are discarded, and only the relevant ones are kept to estimate the factors and loadings from. In particular, they recommend the elastic net (Zou and Hastie, 2005), a penalized regression technique that performs estimation and variable selection at the same time by setting the coefficients of the irrelevant variables to exactly 0. After the elastic net estimation, only the variables with non-zero coefficients are kept. As we do not observe our series of interest directly, we need to adapt their procedure to our setting: we approximate the unobserved unemployment by its estimate from the labour force model without auxiliary series. Specifically, we regress the differenced estimated change in unemployment from that model, ∆R k,y t , on the differenced I(1) Google Trends using the elastic net penalized regression method, which solves the following minimization problem: The tuning parameters λ and α are selected over a two-dimensional grid so as to minimize the Schwarz (1978) Bayesian information criterion (BIC).
Notice that performing the penalized regression on the differenced (and therefore stationary) data also allows us to avoid including in the model Google Trends that have spurious relations with the change in unemployment. We consider estimating the final model both with all Google Trends included and with only the selected Google Trends included, thereby allowing us to assess the empirical effect of targeting. The final number of nonstationary Google Trends included in the model, n, may differ depending on whether we use the weekly Google Trends aggregated to the monthly frequency according to equation (3.4), or the monthly Google Trends. Whenever we apply PCA, the Google Trends are first differenced and standardized.
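The targeting step can be sketched with a plain coordinate-descent elastic net (our own minimal implementation; in practice one would use a tested library). The toy data, penalty grid, and BIC definition are illustrative assumptions:

```python
import numpy as np

def soft(z, g):
    """Soft-thresholding operator."""
    return np.sign(z) * np.maximum(np.abs(z) - g, 0.0)

def elastic_net(X, y, lam, alpha, n_sweeps=200):
    """Coordinate-descent elastic net (X assumed standardized):
    minimizes (1/2n)||y - Xb||^2 + lam*(alpha*||b||_1 + (1-alpha)/2*||b||_2^2)."""
    n, p = X.shape
    b = np.zeros(p)
    r = y - X @ b
    z = (X ** 2).mean(axis=0)
    for _ in range(n_sweeps):
        for j in range(p):
            rho = X[:, j] @ r / n + z[j] * b[j]
            bj_new = soft(rho, lam * alpha) / (z[j] + lam * (1.0 - alpha))
            r += X[:, j] * (b[j] - bj_new)
            b[j] = bj_new
    return b

def bic(X, y, b):
    """Schwarz BIC with the number of non-zero coefficients as model size."""
    n = len(y)
    rss = np.sum((y - X @ b) ** 2)
    return n * np.log(rss / n) + np.log(n) * np.count_nonzero(b)

# toy targeting: only the first three of 20 "Google Trends" are relevant
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = X[:, 0] - 2.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=200)
grid = [(lam, a) for lam in (0.01, 0.05, 0.1, 0.5) for a in (0.5, 0.9)]
lam_best, a_best = min(grid, key=lambda g: bic(X, y, elastic_net(X, y, *g)))
kept = np.flatnonzero(elastic_net(X, y, lam_best, a_best))  # targeted series
```

The indices in `kept` play the role of the targeted Google Trends that are passed on to the PCA step.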
We further need to make sure that the stationarity assumption on the idiosyncratic components is maintained. Therefore, after having estimated the factors by PCA in model (3.3), we test which of the idiosyncratic components ε t are I(1) with an ADF test without deterministic components, controlling for multiple hypothesis testing as in Moon and Perron (2012). The I(1) idiosyncratic components are modelled as state variables in (3.5), with the following transition equation: with the usual normality assumptions on the ξ k t . The covariance matrix of the idiosyncratic components Ψ is therefore estimated on the levels of the I(0) idiosyncratic components and on the first differences of the I(1) idiosyncratic components. Appendix A.2.3 provides a toy example that elucidates the estimation procedure.
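The mixed levels/differences estimation of a diagonal Ψ can be sketched as follows (the function name and toy data are our own):

```python
import numpy as np

def estimate_psi(E, is_I1):
    """Estimate a diagonal Psi: use the levels of the I(0) idiosyncratic
    components and the first differences of the I(1) ones (which are
    modelled as random-walk state variables)."""
    E = np.asarray(E, dtype=float)
    psi = np.empty(E.shape[1])
    for i in range(E.shape[1]):
        s = np.diff(E[:, i]) if is_I1[i] else E[:, i]
        psi[i] = s.var(ddof=1)
    return np.diag(psi)

# toy check: an I(0) component with unit variance and an I(1) component
# whose innovations have variance 4
rng = np.random.default_rng(0)
T = 5000
E = np.column_stack([rng.normal(0, 1, T),
                     np.cumsum(rng.normal(0, 2, T))])
Psi = estimate_psi(E, is_I1=[False, True])
```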
Finally, we notice that although the first step of the two-step estimation procedure is meant to avoid estimating Ψ and Λ by maximum likelihood (since they are large matrices), this pre-estimation may affect the explanatory power of the Google Trends. We here propose two different ways to obtain (possibly) more accurate estimates of these two matrices: • In Section 3.2 we mention that the first step of the two-step estimator, which estimates Ψ and Λ by PCA, can be carried out on the weekly Google Trends (which are therefore aggregated to the monthly frequency after the first step). Since the sample size of the high frequency data is larger, using weekly Google Trends might improve the estimation accuracy of Ψ and Λ.
• Doz et al. (2011) argue that, from the Kalman filter estimates of the factors, Ψ and Λ can be re-estimated (by least squares), which in turn can be used to re-estimate the factors, and so on. This iterative procedure is equivalent to the Expectation-Maximization (EM) algorithm, which increases the likelihood at each step and therefore converges to a (local) maximum of the likelihood. Notice that since the Kalman filter can (in our setting) only provide monthly estimates, the iterative estimation is done on the low-frequency Google Trends.
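The iteration can be sketched as below. To keep the sketch short and self-contained we replace the Kalman-smoother factor step with a static cross-sectional GLS step, so this only mimics the spirit of the Doz et al. (2011) procedure rather than reproducing it:

```python
import numpy as np

def pca_factors(X, r=1):
    """First-step PCA estimates of r factors and loadings."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :r] * S[:r], Vt[:r].T

def iterate_two_step(X, r=1, n_iter=5):
    """EM-style iteration: given the factors, re-estimate Lambda and a
    diagonal Psi by least squares, then re-extract the factors given
    (Lambda, Psi). The paper's procedure uses the Kalman smoother for the
    factor step; the cross-sectional GLS used here is a static stand-in."""
    F, L = pca_factors(X, r)
    psi = np.var(X - F @ L.T, axis=0)
    for _ in range(n_iter):
        L = np.linalg.lstsq(F, X, rcond=None)[0].T   # loadings by OLS
        psi = np.var(X - F @ L.T, axis=0)            # diagonal Psi
        W = L / psi[:, None]                         # GLS weights
        F = X @ W @ np.linalg.inv(L.T @ W)           # updated factors
    return F, L, np.diag(psi)

# toy data: one strong I(1) factor, heteroskedastic idiosyncratic noise
rng = np.random.default_rng(0)
T, n = 200, 50
f_true = np.cumsum(rng.normal(size=T))
lam_true = rng.uniform(0.5, 1.5, n)
X = np.outer(f_true, lam_true) + rng.normal(size=(T, n)) * rng.uniform(0.5, 2.0, n)
F, L, Psi = iterate_two_step(X, r=1, n_iter=5)
```

The design choice mirrors the text: the loadings/Psi update is cheap least squares, while the factor update exploits the current (Λ, Ψ) estimates, so each pass can only improve the fit.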
Later in this section we check how sensitive our empirical results are to the different estimates of Ψ and Λ.
For the second type of estimation method discussed above, we only perform one additional iteration of the two-step procedure because of its computational burden. We present empirical results for the in-sample estimates and out-of-sample forecasts. With the in-sample estimates we evaluate to what extent the auxiliary series improve the precision of the published monthly unemployment estimates after finalizing the data collection. With the out-of-sample forecasts we evaluate to what extent the auxiliary series improve the precision of provisional estimates in a nowcast procedure during the period of data collection. We always estimate four different models: the labour force model without auxiliary series (baseline), and the labour force model with auxiliary series of claimant counts (CC), of Google Trends (GT), and of both (CC & GT). We compare the latter three models to the baseline one in the in-sample and out-of-sample exercises. The period considered for the estimation starts in January 2004 and ends in May 2019 (T = 185 months). The out-of-sample nowcasts are conducted in real time (concurrently) in the last three years of the sample based on a recursive scheme: each week or month, depending on whether we use weekly or monthly Google Trends, the model, including its hyperparameters, is re-estimated on the sample extended by the latest observations, while assuming that the current observations for the unemployed labour force and the claimant counts are missing. Analogously, when the Google Trends are first targeted with the elastic net, the targeting is re-executed in each week or month of the out-of-sample period on the updated sample.
We define the measure of in-sample estimation accuracy as MSE(α̂ k t|Ω t ) = (1/(T − d)) Σ_{t=d+1}^{T} diag(P̂ k t|Ω t ), where α̂ k t|Ω t is the vector of Kalman filter estimates of the state variables, P̂ k t|Ω t is its estimated covariance matrix in month t, and d is the number of state variables that are needed to estimate the labour force model without auxiliary series and that need a diffuse initialization for their estimation (d = 17). The measure of nowcast accuracy is defined analogously as MSFE(α̂ k t|Ω − j,t ) = (1/h) Σ_t diag(P̂ k t|Ω − j,t ), where P̂ k t|Ω − j,t is the nowcast covariance matrix for the prediction in week j of month t, h is the size of the out-of-sample period, and Ω − j,t = {x k,GT j,t , y k t−1 , x k,CC t−1 , x k,GT t−1 , . . .} is the information set available in week j of month t. This is because the nowcast is done recursively throughout the weeks of the out-of-sample period. We always report the relative MS(F)E with respect to the baseline model; values lower than one are in favour of our method. We note that nowcasting under the baseline model without auxiliary series, or under the baseline model extended with claimant counts, comes down to making one-step-ahead predictions. Expressions for α̂ k t|Ω t , α̂ k t|Ω − t and their covariance matrices, P̂ k t|Ω t and P̂ k t|Ω − t , are given by the standard Kalman filter recursions, see e.g. Durbin and Koopman (2012, Chapter 4).
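Given a sequence of filtered state covariance matrices, the relative accuracy measure can be computed as follows; the covariance matrices in the example are illustrative placeholders, not actual model output:

```python
import numpy as np

def mse_measure(P, d=17):
    """Average the state-error variances diag(P_t) over t = d+1,...,T,
    discarding the d diffuse-initialization periods (d = 17 in the paper)."""
    P = np.asarray(P, dtype=float)              # (T, m, m) filtered covariances
    return np.mean([np.diag(Pt) for Pt in P[d:]], axis=0)

# relative measure with respect to the baseline model: values below one
# favour the model with auxiliary series (toy covariances below)
T, m = 60, 3
base = np.stack([np.eye(m) for _ in range(T)])
aux = np.stack([0.8 * np.eye(m) for _ in range(T)])
rel = mse_measure(aux) / mse_measure(base)
```

The nowcast measure MSFE is computed the same way, using the covariances conditioned on the restricted information set Ω − j,t instead.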
The initial values of the hyperparameters for the maximum likelihood estimation are equal to the estimates for the labour force model obtained in van den Brakel and Krieg (2015). We use a diffuse initialisation of the Kalman filter for all the state variables except for the 13 state variables that define the autocorrelation structure of the survey errors, for which we use the exact initialisation of Bollineni-Balabay et al. (2017).
We use the three panel information criteria proposed by Bai and Ng (2002), which we denote, as in Bai and Ng (2002), by IC 1 , IC 2 and IC 3 , in order to choose how many factors of the Google Trends to include in the model. When the Google Trends are targeted with the elastic net, the information criteria suggest including one or two factors. In the empirical analysis we check the sensitivity of the results with respect to these two different numbers of factors included in the model.
We employ a Wilks (1938) likelihood ratio (LR) test to assess whether the correlation parameters are significantly different from zero, and hence whether adding the auxiliary information might yield a significant improvement over the baseline model. Specifically, we denote by ρ CC = 0, ρ 1,GT = 0 and ρ 2,GT = 0 the null hypotheses of individual insignificance of the correlation parameter with, respectively, the claimant counts, and the first and second factor (when present) of the Google Trends. By ρ GT = 0 and ρ = 0 we instead denote the null hypotheses of joint insignificance of, respectively, the correlations with the Google Trends' factors, and all correlation parameters. If the true distribution of the error terms is non-Gaussian, the LR test, based on the QML estimates, does not in general retain, under the null hypothesis, an asymptotic χ 2 distribution with degrees of freedom equal to the number of restrictions. One exception is when the covariance matrix of the error terms from a regression involving observed variables is replaced by a consistent estimator prior to the maximization of the log-likelihood (Gourieroux and Monfort, 1993). In our case, if the idiosyncratic components of the Google Trends, ε k,GT t , are the only error terms that are not normally distributed, we may fall into this exception: the covariance matrix Ψ is indeed replaced, for the maximization of the log-likelihood, by its consistent PCA estimator obtained in the first step of the two-step estimation procedure. Nonetheless, in the setting of Gourieroux and Monfort (1993) the regressors are observed, whereas in our case they are the unobserved factors. Consequently, it is not trivial to assess whether our model specification indeed falls into the above-mentioned exception.
A formal proof of this is beyond the scope of this paper, but in Appendix C we conduct a simulation study in order to obtain the finite-sample distribution of the LR test under misspecifications of the distribution of the idiosyncratic components. We conclude that the distribution of the LR test is not affected by these misspecifications. At the end of this section we show that there is no evidence that the error terms other than ε k,GT t are not normally distributed. We should therefore be able to perform inference based on the usual asymptotic distribution of the LR test. Table 3 reports the estimated hyperparameters for the four models, as well as the respective maximized log-likelihood values, the relative measures of in-sample and out-of-sample performance, and the p-values from the LR tests, when the monthly Google Trends are used.
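LR statistics and p-values of the kind reported in Table 3 can be computed as follows; the log-likelihood values in the example are hypothetical, and the closed-form chi-square survival function covers only the 1-3 degrees of freedom needed for these particular tests:

```python
import math

def chi2_sf(x, k):
    """Survival function of a chi-square with k in {1, 2, 3} degrees of
    freedom (the cases needed for the correlation-parameter LR tests)."""
    if k == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if k == 2:
        return math.exp(-x / 2.0)
    if k == 3:
        return (math.erfc(math.sqrt(x / 2.0))
                + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))
    raise ValueError("k must be 1, 2 or 3")

def lr_test(llf_restricted, llf_unrestricted, k):
    """Wilks (1938) LR test of k zero-restrictions on correlation parameters."""
    lr = 2.0 * (llf_unrestricted - llf_restricted)
    return lr, chi2_sf(lr, k)

# e.g. testing rho_CC = 0 (one restriction) with hypothetical log-likelihoods
lr, p = lr_test(-17770.0, -17765.0, k=1)
```

For instance, k = 2 corresponds to the joint null ρ GT = 0 on the two Google Trends factor correlations, and k = 3 to ρ = 0 on all correlation parameters.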
The maximum likelihood estimates for the standard error of the seasonal components' disturbance terms tend to zero, indicating that the seasonal effects are time invariant.
Recall from equation (3.2) that the variances of the scaled sampling errors, σ 2 ν j , should take values close to one. Their estimates are divided by (1 − δ 2 ) and are always slightly larger than one, which indicates that the variance estimates of the GREG estimates, used to scale the sampling errors in equation (3.2), somewhat underestimate the true variance of the GREG estimates.
The correlation with the claimant counts is estimated to be above 0.9, and remains large and significant when the Google Trends are included. Similar conclusions can be drawn for the correlations with the Google Trends' factors when the Google Trends are targeted with the elastic net and 39 of them are included in the model. When the additional targeting is not applied, and the 162 I(1) Google Trends are directly included in the model, the correlation parameter with the first factor of the Google Trends is instead always small and insignificant (in this setting we do not include more than one factor). Moreover, for the same number of factors, targeting the Google Trends always yields a better performance in terms of estimation and nowcast accuracy of the state variables of interest than not targeting them. For this reason, we focus the remaining analysis of the empirical results on the targeted Google Trends only.
The best results in terms of both estimation and nowcast accuracy of all the state variables are achieved by the CC & GT model with one factor, yielding gains of around 40% and 20%, respectively, for R̂ k,y t|Ω − t , and around 20% and 25% for both L̂ k,y t|Ω − t and θ̂ k,y t|Ω − t , with respect to the baseline model. Note that this implies that the above-mentioned model also outperforms the model that contains only the claimant counts as auxiliary series. In general, compared with the models that include the claimant counts, the models with Google Trends tend to achieve a better estimation and nowcast of the change in unemployment, R k,y t , rather than of the other two state variables.
Including two factors instead of one clearly increases the complexity of the model, which is reflected in smaller accuracy gains with respect to including only one factor (in the CC & GT model probably also due to the decreased magnitude of the correlation parameter with the claimant counts), especially for the nowcast of the state variables. Nonetheless, the correlations with both factors are individually and jointly significantly different from zero, indicating that both factors bring additional information about the Dutch unemployment. [Table 3 notes: without targeting, n = 162, IC 1 = 3, IC 2 = 1, IC 3 = 10; targeted GT, n = 39, IC 1 = 2, IC 2 = 1, IC 3 = 2; models with r = 1, r = 1 and r = 2 factors. The abbreviation "all corr." denotes that the correlation between the claimant counts and the Google Trends is also estimated. "Targeted GT" indicates that the Google Trends have been targeted with the elastic net before including them in the model.]
Notice that in general all the relative measures of accuracy are below one, indicating that both the claimant counts and the Google Trends improve the estimation and nowcast accuracy of the unemployment and its change. Even when the Google Trends are not targeted and their factor is not significantly related to the unemployment, the measures are never drastically above one, meaning that our method tends to ignore auxiliary series that are not related to the target variable.
Finally, when we specified the covariance matrix (3.9) in Section 3.2, we did not let the claimant counts and the Google Trends be correlated, because our goal is to improve the estimation/nowcast accuracy of the unobserved components of the labour force series, not of the claimant counts or the Google Trends. Nonetheless, if the state variables of equation (3.8) were all cointegrated (i.e., all correlation parameters equal to one), a more efficient estimation method would be to estimate only the variance of their common source of error. We therefore also estimate the CC & GT model with one factor when all series are correlated, and call this model "CC & GT all corr."; Table 3 reports the empirical results for this model as well. Although the nowcast accuracy is similar to that of the same model without the additional correlation between the claimant counts and the Google Trends (which we indicate as ρ 1,CC,GT ), the in-sample accuracy deteriorates (even with respect to the baseline model), and ρ 1,CC,GT is not significantly different from zero. We therefore conclude that the specification of the covariance matrix (3.9) is appropriate.
In Table 4 we report the empirical results for the GT and CC & GT models which employ the targeted Google Trends observed at the weekly frequency, and aggregated to the monthly frequency according to equation (3.4) in order to include them in the models. In this case we still look at the sensitivity of the results with respect to the number of factors included in the model, but also with respect to the two additional methods for the estimation of Λ and Ψ discussed at the beginning of this section.
The measures of accuracy are again broadly lower than one, but the gains are not as large as those observed for the monthly Google Trends. Including two factors improves the accuracy in the GT model, but not in the CC & GT model, except for a more precise nowcast of R k,y t . The correlation parameter with the claimant counts remains large and significant. On the contrary, the correlation parameter with the first factor of the Google Trends is not significantly different from zero, and there is only weak evidence for the second factor being significantly related to the change in unemployment. For this reason we continue the analysis considering two factors in the model.
Estimating Λ and Ψ on the weekly Google Trends improves the measures of accuracy only for the CC & GT model, not for the GT model. An additional iteration of the two-step estimator, in order to obtain more accurate estimates of Λ and Ψ , instead achieves better nowcasts for both the GT and the CC & GT models (and also better in-sample estimates for the latter model), and a performance similar to that of the models which employ the monthly Google Trends and include two factors. Notice that the log-likelihood values for these two models increase with respect to the same model specifications that use the original two-step estimation (without the additional iteration); as pointed out in the explanation of the iterated estimation of Λ and Ψ at the beginning of this section, this is to be expected. Despite the above-mentioned improvements in estimation/nowcast accuracy, the correlation parameters with the Google Trends' factors are always insignificant. The aggregation of the Google Trends from the weekly to the monthly frequency yields time series that are noisier than the Google Trends directly observed at the monthly frequency, and detecting significant results therefore becomes harder.
Finally, even though weekly Google Trends allow us to perform the monthly nowcasts on a weekly basis, we notice that, in general, the precision of the nowcast does not improve monotonically with the number of weeks. If the high-dimensional state space model could be expressed and estimated at the highest frequency, the weekly gains in nowcast accuracy might be more evident; the transition equations for the RGB and the survey errors, however, restrict us to estimating the model at the monthly frequency. Figures 2-4 compare the point nowcasts of, respectively, the change in unemployment, its trend, and the population parameter, obtained with the baseline, the CC, and the GT and CC & GT models that employ monthly Google Trends and include two of their factors. From the first graph it is evident that the models including claimant counts tend to deviate from the baseline model, which, on the contrary, gives results similar to those of the GT model. The point nowcasts of L k,y t and θ k,y t are more similar across the model specifications, with a slight positive difference between the models that include the Google Trends and those that do not at the beginning of the out-of-sample period. Figures D.1 and D.2 show the selection frequency of, respectively, the monthly and weekly Google Trends in the out-of-sample period. Some of the most frequently selected search terms in both cases are: werklozen (unemployed people), baan zoeken (job search), curriculum vitae voorbeeld (curriculum vitae example), ww uitkering (unemployment benefits), ww aanvragen (to request unemployment benefits), resume, tijdelijk werk (temporary job), huizenmarkt zeepbel (housing market bubble). Notice that the last term (as well as "economische crisis" (economic crisis) and "failliet" (bankrupt), which are also frequently selected monthly Google Trends) relates to economic uncertainty rather than to job search.
A previous version of this paper used only job-search related search terms, and did not find them to have explanatory power for the Dutch unemployment, which is now instead significantly improved by the inclusion of search terms related to economic uncertainty. [Table 4 notes: targeted GT, n = 37, IC 1 = 1, IC 2 = 1, IC 3 = 2; log-likelihood values and p-values from the LR tests are reported per model. The number of Google Trends and the number of their factors included in the model are denoted by n and r, respectively. "Weekly Λ̂, Ψ̂" denotes that the latter estimates are obtained using the weekly Google Trends. "Iterated Λ̂, Ψ̂" means that the latter estimates are obtained from an additional iteration of the two-step estimator. "Targeted GT" indicates that the Google Trends have been targeted with the elastic net before including them in the model.]
The results of the empirical analysis can be summarized as follows. Targeting the Google Trends improves their explanatory power for the Dutch unemployment. Monthly Google Trends significantly improve the estimation and nowcast accuracy of the Dutch unemployment and its change, with both one and two factors. The largest gains are obtained when both the claimant counts and the Google Trends are included, and only one factor is considered for the latter series. When two factors are considered, the gains are smaller, but both factors seem to be significantly related to the change in unemployment, indicating that both should be included in the model in order to exploit all the information that the Google Trends provide about the target variable. The sensitivity to the number of factors is somewhat similar for the weekly Google Trends, although there is only weak evidence for their second factor having a significant relation with the change in unemployment. The weekly Google Trends are less informative about the Dutch unemployment, generally yielding smaller improvements in estimation and nowcast accuracy than the monthly Google Trends. The contributions of the two types of Google Trends are comparable only when the two-step estimator is additionally re-iterated for the weekly Google Trends (in order to obtain more precise estimates of Λ and Ψ ). This result suggests that iterating the two-step estimation can improve the explanatory power of the Google Trends, and that the latter series are sensitive to the estimates of Λ and Ψ . Improvements are, instead, not always present when Λ and Ψ are estimated on the weekly data. In general, the claimant counts mainly have a positive impact on the estimation and nowcast accuracy of θ k,y t and L k,y t , whereas the Google Trends mainly improve that of R k,y t .
The point nowcasts of the latter state variable are more sensitive to the type of auxiliary series included than those of θ k,y t and L k,y t . The normality assumptions made and discussed throughout the paper can be tested on the standardized one-step-ahead forecast errors (Durbin and Koopman, 2012, Chapter 7): where F k t is the covariance matrix of the prediction errors v k t estimated with the Kalman filter. The prediction errors for the labour force are defined as v k,y t = y k t − Z y t α̂ k,y t|Ω t−1 , for the claimant counts as v k,CC t = x k,CC t − Z CC α̂ k,CC t|Ω t−1 , and for the Google Trends as v k,GT t = x k,GT t − Λ̂ f̂ k t|Ω t−1 , for t = d+1, . . . , T (the expressions for Z y t and Z CC can be found in Appendix A). We test the assumptions on the estimated CC & GT models with two Google Trends factors, employing, respectively, the monthly Google Trends and the weekly Google Trends with the additional iteration of the two-step estimator, as these yield the best results in terms of estimation and nowcast accuracy of the state variables of interest when two factors of the Google Trends are included.
We test the null hypothesis of univariate normality for each of the prediction errors with the Shapiro and Wilk (1965) and Bowman and Shenton (1975) tests, as suggested in Harvey (1989, Chapter 5) and Durbin and Koopman (2012, Chapter 2), respectively. The former test is based on the correlation between the ordered observations and the associated normal scores, whereas the latter is based on measures of skewness and kurtosis.
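The Bowman-Shenton statistic is simple to compute directly (the Shapiro-Wilk test requires tabulated coefficients, so we only sketch the former; the simulated errors below are illustrative):

```python
import numpy as np

def bowman_shenton(e):
    """Bowman-Shenton statistic on standardized prediction errors:
    n * (S^2/6 + (K - 3)^2/24), with S the sample skewness and K the
    sample kurtosis; asymptotically chi-square(2) under normality."""
    e = np.asarray(e, dtype=float)
    n = len(e)
    z = (e - e.mean()) / e.std()
    S = np.mean(z ** 3)
    K = np.mean(z ** 4)
    stat = n * (S ** 2 / 6.0 + (K - 3.0) ** 2 / 24.0)
    pval = np.exp(-stat / 2.0)       # chi-square(2) survival function
    return stat, pval

rng = np.random.default_rng(0)
stat_n, p_n = bowman_shenton(rng.normal(size=2000))          # Gaussian errors
stat_t, p_t = bowman_shenton(rng.standard_t(3, size=2000))   # fat-tailed errors
```

Fat-tailed errors, such as the Student-t draws above, inflate the kurtosis term and drive the p-value towards zero, which is the pattern the test detects in the Google Trends idiosyncratic components.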
The p-values from the Shapiro-Wilk test are reported in Figures 5 and 6 for the two model specifications discussed above, respectively. For both specifications, there is no (strong) evidence against the normality assumption for the error terms of the labour force and the claimant counts series, as the corresponding p-values are above the significance level of 0.05. This result suggests that the model is correctly specified for these series. The test instead rejects the null hypothesis of normality for most of the idiosyncratic components of the Google Trends. The normality assumption therefore seems inappropriate for the latter series but, as discussed in Sections 3.1 and 5, and examined in the simulation study of Appendix C, this type of misspecification does not affect the consistency of the estimators of the state variables and the hyperparameters, and does not seem to influence the performance of our method, nor the distribution of the LR test which allows us to perform inference on the correlation parameters. [Figures 5 and 6 cover the out-of-sample period, February 2016 to June 2019.]

Conclusions
This paper proposes a method to include a high-dimensional auxiliary series in a state space model in order to improve the estimation and nowcasting of unobserved components. The method is based on a combination of PCA and Kalman filter estimation to reduce the dimensionality of the auxiliary series, originally proposed by Doz et al. (2011), while the auxiliary information is included in the state space model as in Harvey and Chung (2000). In this way we extend the state space model used by Statistics Netherlands to estimate the Dutch unemployment, which is based on monthly LFS data, by including the auxiliary series of claimant counts and Google Trends related to job search and economic uncertainty. The strong explanatory power of the former series in similar settings has already been documented in the literature (see Harvey and Chung (2000) and van den Brakel and Krieg (2016)). We explore to what extent a similar success can be obtained from online job-search and economic-uncertainty behaviour. The advantage of Google Trends is that they are freely available at higher frequencies than the labour force survey and the claimant counts and, contrary to the latter, they are not affected by publication delays. This feature can play a key role in nowcasting the unemployment, as Google Trends are the only information available in real time.
A Monte Carlo simulation study shows that, in a smooth trend model, our proposed method can improve the MSFE of the nowcasts of the trend's level and slope by up to around 25% and 75%, respectively. These results are robust to misspecifications of the distribution of the idiosyncratic components of the auxiliary series. Our method therefore has the potential to improve the nowcasts of the unobserved components of interest.
In the empirical application of our method to Dutch unemployment estimation and nowcasting, we find that our considered Google Trends (when first targeted with the elastic net) do in general yield gains in the estimation and nowcast accuracy (up to 40% and 25%, respectively) of the state variables of interest, with respect to the model which does not include any auxiliary series. This result underlines the advantage of using the high-dimensional auxiliary series of Google Trends, despite the more complex model it entails, and is especially relevant for countries that have no data sources related to unemployment (such as the registry-sourced claimant counts) other than the labour force survey. We also find that, under certain model specifications, including both claimant counts and Google Trends outperforms the model which only includes the former auxiliary series. This result is explained by the fact that the two auxiliary series have a positive impact on the estimation/nowcast accuracy of different unobserved components of the unemployment, thus yielding an overall improvement in the fit of the model. It also indicates that claimant counts and Google Trends do not carry redundant information about the Dutch unemployment.
The magnitude of the above-mentioned gains is, nonetheless, sensitive to the following aspects of the data and the model specification. First, in our empirical application we employ both monthly and weekly Google Trends. The latter need to be aggregated to the monthly frequency in order to be included in the model, but allow the nowcast to be performed on a weekly basis. We find that the monthly series are less noisy and generally provide more accurate estimates/nowcasts of the state variables of interest than the weekly ones. The explanatory power of the monthly Google Trends for Dutch unemployment is further corroborated by LR test results, which favour their inclusion in the model. There is, by contrast, no strong and consistent evidence for this when the weekly Google Trends are employed.
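The weekly-to-monthly aggregation can be performed by averaging the weekly observations that fall within each month. A minimal sketch, with toy values rather than actual Google Trends data and a hypothetical week-to-month assignment:

```python
import numpy as np

# Toy weekly Google Trends values and the (hypothetical) month each week falls in
weekly = np.array([10., 12., 11., 13., 9., 8., 10., 11., 12.])
month = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2])

# Aggregate to the monthly frequency by averaging the weeks within each month
monthly = np.array([weekly[month == m].mean() for m in np.unique(month)])
# monthly -> [11.5, 9.5, 12.0]
```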
Second, PCA involves the estimation of the common factors that drive the Google Trends, and in our method we relate these factors to the unobserved components that constitute Dutch unemployment. Information criteria suggest that the Google Trends are driven by either one or two common factors. We find that including two factors generally yields smaller accuracy gains than including one factor (due to the increased complexity of the model), but there is evidence that the second factor is related to unemployment, so it should be included in the model in order to exploit all the information that the Google Trends provide about unemployment.
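The number of common factors can be selected, for instance, with an information criterion in the spirit of Bai and Ng (2002). The sketch below (an illustration, not necessarily the paper's exact implementation) applies their IC_p2 penalty to a standardized T × n panel:

```python
import numpy as np

def bai_ng_ic(x, r_max):
    """Select the number of factors of a standardized (T x n) panel x
    by minimizing the IC_p2 criterion of Bai and Ng (2002)."""
    T, n = x.shape
    # Eigendecomposition of the sample covariance of the panel
    eigval, eigvec = np.linalg.eigh(x.T @ x / T)
    order = np.argsort(eigval)[::-1]           # eigenvalues in decreasing order
    ic = []
    for r in range(1, r_max + 1):
        load = eigvec[:, order[:r]]            # (n, r) principal-component loadings
        resid = x - x @ load @ load.T          # idiosyncratic residuals
        v = (resid ** 2).mean()                # average squared residual
        penalty = r * (n + T) / (n * T) * np.log(min(n, T))
        ic.append(np.log(v) + penalty)
    return int(np.argmin(ic)) + 1              # number of factors minimizing IC
```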
Finally, our estimation method is based on a two-step procedure. In the first step, the matrix of factor loadings and the covariance matrix of the idiosyncratic components of the Google Trends are estimated by PCA. In the second step, these matrices are replaced by their PCA estimates in order to re-estimate the Google Trends factors and the unobserved components of the labour force series with the Kalman filter. Replacing these matrices by their estimates might affect the explanatory power of the Google Trends. We find that the explanatory power of the weekly Google Trends can be improved (so as to yield gains similar to those obtained with the monthly Google Trends) with an additional iteration of the two-step estimation procedure, which should provide more accurate estimates of the two matrices.
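A minimal sketch of such a two-step procedure, under the simplifying (hypothetical) assumption of random-walk factor dynamics and without the labour force block, could look as follows; it illustrates the idea of Doz et al. (2011), not the paper's full implementation:

```python
import numpy as np

def two_step_factor_kf(x, r=1):
    """Two-step factor estimation (sketch).
    Step 1: PCA estimates of the loadings and idiosyncratic variances.
    Step 2: Kalman filter re-estimates the factors with those matrices
    held fixed, assuming (hypothetically) random-walk factor dynamics."""
    T, n = x.shape
    # --- Step 1: PCA ---
    eigval, eigvec = np.linalg.eigh(np.cov(x, rowvar=False))
    lam = eigvec[:, np.argsort(eigval)[::-1][:r]]      # (n, r) loadings
    f_pca = x @ lam                                    # PCA factor estimates
    resid = x - f_pca @ lam.T
    sigma_eps = np.diag(resid.var(axis=0))             # idiosyncratic variances
    # --- Step 2: Kalman filter with lam and sigma_eps plugged in ---
    q = np.cov(np.diff(f_pca, axis=0), rowvar=False).reshape(r, r)
    f, P = np.zeros(r), np.eye(r) * 10.0               # vague initialisation
    f_kf = np.zeros((T, r))
    for t in range(T):
        # prediction (random walk): f_{t|t-1} = f_{t-1|t-1}
        P = P + q
        # update with the measurement x_t = lam f_t + eps_t
        S = lam @ P @ lam.T + sigma_eps
        K = P @ lam.T @ np.linalg.solve(S, np.eye(n))
        f = f + K @ (x[t] - lam @ f)
        P = (np.eye(r) - K @ lam) @ P
        f_kf[t] = f
    return f_kf, lam, sigma_eps
```

An additional iteration, as discussed above for the weekly Google Trends, would re-estimate the loadings and idiosyncratic variances from the filtered factors and run the filter again.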
As already mentioned, we generally find estimation/nowcast accuracy gains from the inclusion of the Google Trends when they are first "targeted", i.e., when the series relevant for Dutch unemployment are selected with the elastic net penalized regression. If the targeting is not applied first, we find neither gains nor significant relationships between the Google Trends and Dutch unemployment. Nonetheless, in this case the results do not deteriorate with respect to the model without auxiliary series, suggesting that our method is able to discard irrelevant auxiliary series in the estimation/nowcast of the unobserved components of interest. This finding is corroborated by our Monte Carlo simulation study. Hence, our proposed approach provides a framework to analyse the usefulness of "Big Data" sources, with little risk in case the series turn out not to be useful.
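The targeting step can be illustrated with a small coordinate-descent elastic net. This is a sketch of the standard algorithm, not the paper's implementation; `alpha` and `l1_ratio` are hypothetical tuning values that would in practice be chosen by cross-validation:

```python
import numpy as np

def soft_threshold(z, g):
    return np.sign(z) * np.maximum(np.abs(z) - g, 0.0)

def elastic_net(x, y, alpha=0.1, l1_ratio=0.5, n_iter=200):
    """Coordinate-descent elastic net (sketch); x standardized, y centred.
    Minimises (1/2T)||y - Xb||^2
              + alpha * (l1_ratio * ||b||_1 + 0.5 * (1 - l1_ratio) * ||b||_2^2)."""
    T, n = x.shape
    b = np.zeros(n)
    for _ in range(n_iter):
        for j in range(n):
            r_j = y - x @ b + x[:, j] * b[j]        # partial residual
            z = x[:, j] @ r_j / T
            b[j] = soft_threshold(z, alpha * l1_ratio) / (
                x[:, j] @ x[:, j] / T + alpha * (1 - l1_ratio))
    return b
```

The targeted panel then consists of the columns of x with non-zero estimated coefficients.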
One limitation of the current paper is that it does not allow for time variation in the relation between the unobserved components of interest and the auxiliary series. For example, legislative changes may alter the correlation between unemployment and administrative series such as claimant counts. Additionally, one can easily imagine the relevance of specific search terms, as well as of internet search behaviour overall, changing over time. While such time variation may partly be addressed by considering shorter time periods, decreasing the already limited time dimension would strongly deteriorate the quality of the estimators. Therefore, a more structural solution is to extend the current approach by building the potential for time variation directly into the estimation method, while retaining the possibility to use the full sample size. Such extensions are currently under investigation by the authors.

A State space representations
For the sake of simplicity, in this appendix the subscript t (without the superscript k) indicates that the model is expressed at the low (monthly) frequency.
A.1 Labour force model with univariate auxiliary series
Throughout this section it is assumed that the univariate auxiliary series is the claimant counts, so that $x_t = x^{CC}_t$. The observation equation stacks the LFS series $y_t$ and the auxiliary series $x_t$. The state variables for $y_t$ are the level, the slope, the seasonality, the RGB and the survey errors, where $E$ refers to the structure of the autocorrelated sampling errors that are modelled as state variables. The state variables for $x_t$ are the level, the slope and the seasonality.
The transition matrix $T_y$ for $y_t$ is block-diagonal. The block for the level and slope components is the smooth trend block
$$\begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix},$$
and the block for the trigonometric seasonal component is itself block-diagonal with rotation blocks
$$\begin{pmatrix} \cos h_l & \sin h_l \\ -\sin h_l & \cos h_l \end{pmatrix}, \qquad h_l = \pi l/6, \quad l = 1, \ldots, 6.$$
The remaining blocks of $T_y$ are the transition matrices for the RGB component and for the autocorrelated survey errors. The transition matrix $T_x$ for $x_t$ is the same as $T_y$ without the blocks for the RGB component and for the survey errors.

A.2 Labour force model with high-dimensional auxiliary series
Throughout this section it is assumed that the high-dimensional auxiliary series are the Google Trends, so that $x_t = x^{GT}_t$, where $n$ is the number of Google Trends. It is assumed that there is only $r = 1$ factor for the Google Trends.
$Z^y_t$ is the same as in Appendix A.1. The transition equation has the same form as in Appendix A.1, with $T_y$ unchanged and $T_x = 1$. In the vector of innovations, $\eta^y_t$ and the first $(30 \times 30)$ diagonal block of $Q$ are the same as in Appendix A.1.

A.2.1 Extension of the model to incorporate the lags of f t
Consider a regression of $\eta^y_{R,t}$ on the past values of $u_t$. In the measurement equation, $Z_x = \begin{pmatrix} \Lambda_{n \times 1} & 0_{n \times q} \end{pmatrix}$.

A.2.2 Extension of the model to incorporate the seasonality/cycle in f t with a (seasonal) ARIMA model
Assume an ARIMA(3, 1, 1) process for $f_t$:
$$\Delta f_t = \phi_1 \Delta f_{t-1} + \phi_2 \Delta f_{t-2} + \phi_3 \Delta f_{t-3} + \zeta_t + \theta_1 \zeta_{t-1}.$$
The state space representation of the above model is based on Durbin and Koopman (2012, Chapter 3) and illustrated below. Let $f^*_t = \Delta f_t$ and let $\bar{f}_t$ be the state vector whose first two elements are $f_{t-1}$ and $f^*_t$, augmented with two auxiliary elements that accumulate the AR and MA terms. The transition equation for $\bar{f}_t$ takes the form
$$\bar{f}_{t+1} = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 0 & \phi_1 & 1 & 0 \\ 0 & \phi_2 & 0 & 1 \\ 0 & \phi_3 & 0 & 0 \end{pmatrix} \bar{f}_t + \begin{pmatrix} 0 \\ 1 \\ \theta_1 \\ 0 \end{pmatrix} \zeta_{t+1}.$$
Consequently, the observation equation becomes
$$x_t = \Lambda \begin{pmatrix} 1 & 1 & 0 & 0 \end{pmatrix} \bar{f}_t + \varepsilon_t,$$
so that $x_t = \Lambda (f_{t-1} + f^*_t) + \varepsilon_t = \Lambda f_t + \varepsilon_t$.
Note that the transition equation of the full state space model is now expressed in this augmented form. We here allow $u_t$ to be correlated with $\eta^y_{R,t}$.

A.2.3 I(1) idiosyncratic components
Consider the following toy example to clarify the estimation procedure when some of the idiosyncratic components are I(1).
The covariance matrix of the innovation terms in the observation equation is:

A.3 Labour force model with univariate and high-dimensional auxiliary series
Throughout this section both the claimant counts and the Google Trends are included in the model as auxiliary series.
The observation equation is: Z y t is the same as in Appendix A.1, and Z CC is the same as Z x in Appendix A.1. The transition equation takes the form: T y is the same as in Appendix A.1, and T CC and α CC t are, respectively, the same as T x and α x t in Appendix A.1.
The vector of innovations is as in Appendix A.2, where $\eta^y_t$ is the same as in Appendix A.1.

From the set of search terms listed above we discard the Google Trends that have zero values more than half of the time, before performing the empirical analysis. The final dataset is composed of 182 monthly Google Trends and 173 weekly Google Trends.

We then calculate the values of the LR test, for 1000 simulation runs. Notice that, under the null hypothesis that ρ = 0 and a correct specification of the model, the LR test statistic should be asymptotically χ²₁-distributed, which (as mentioned in Section 5) does not necessarily hold if the model is misspecified, e.g. if the true distribution of the error terms is not Gaussian but a Gaussian distribution is used to estimate the model. In Figure C.1 we therefore compare the probability densities of the LR tests obtained as described above to a χ²₁ distribution. We notice that the density of the LR test is not sensitive to deviations of the idiosyncratic components from Gaussianity. For all three distributions of the error terms considered, i.e., Gaussian, Exponential, and Student's t with 4 degrees of freedom, the density of the LR test is close to a χ²₁ distribution. These simulation results suggest that our method allows one to conduct inference as usual based on the LR test, even if the distribution of the idiosyncratic components is misspecified.

Figure C.1: Probability densities of the LR tests obtained under the null hypothesis that ρ = 0, for n_sim = 1000 and for the three model specifications discussed at the beginning of Appendix C, respectively a Gaussian, an Exponential, and a Student's t₄ distribution for the idiosyncratic components ε^{k,x}_t.
We compare these probability densities to a χ²₁ distribution, which is the asymptotic distribution of the LR test under the null hypothesis and a correct specification of the model.

Figure D.1: Frequency of selection of the monthly Google search terms by the elastic net in the out-of-sample period. A value of 1 means that the variable has been selected in every month of the out-of-sample period. We only report search terms that have been selected at least 50% of the time.
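The robustness of the χ²₁ approximation can be illustrated in a stripped-down setting. Below, the state space model is replaced by a simple regression estimated with a Gaussian quasi-likelihood, the errors are drawn from the same three distributions (centred so that the exponential draws have mean zero), and the empirical 95% quantile of the simulated LR statistics is compared with the χ²₁ critical value of about 3.84. This is an illustration of the principle, not a replication of the paper's exact experiment.

```python
import numpy as np

rng = np.random.default_rng(42)
n_sim, T = 1000, 200

def lr_stat(e):
    """Gaussian (quasi-likelihood) LR statistic for H0: beta = 0 in
    y_t = beta * x_t + e_t, with data generated under the null."""
    x = rng.standard_normal(T)
    y = e
    b = x @ y / (x @ x)                      # OLS estimate of beta
    rss1 = ((y - b * x) ** 2).sum()          # unrestricted residual sum of squares
    rss0 = (y ** 2).sum()                    # restricted (beta = 0)
    return T * np.log(rss0 / rss1)

# Simulated LR statistics under three error distributions (cf. Figure C.1)
lr = {
    "gaussian": [lr_stat(rng.standard_normal(T)) for _ in range(n_sim)],
    "exponential": [lr_stat(rng.exponential(size=T) - 1.0) for _ in range(n_sim)],
    "student_t4": [lr_stat(rng.standard_t(4, size=T)) for _ in range(n_sim)],
}

# Under the null the LR statistic is asymptotically chi2(1); its 95%
# critical value is about 3.84, regardless of the error distribution
for name, vals in lr.items():
    print(name, round(float(np.quantile(vals, 0.95)), 2))
```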