gdpc: An R Package for Generalized Dynamic Principal Components

gdpc is an R package for the computation of the generalized dynamic principal components proposed in Peña and Yohai (2016). In this paper, we briefly introduce the problem of dynamic principal components, propose a solution based on a reconstruction criterion and present an automatic procedure to compute the optimal reconstruction. This solution can be applied to the non-stationary case, where the components need not be a linear combination of the observations, as is the case in the proposal of Brillinger (1981). This article discusses some new features that are included in the package and that were not considered in Peña and Yohai (2016). The most important one is an automatic procedure for the identification of both the number of lags to be used in the generalized dynamic principal components and the number of components required for a given reconstruction accuracy. These tools make it easy to use the proposed procedure in large data sets. The procedure can also be used when the number of series is larger than the number of observations. We describe an iterative algorithm and present an example of the use of the package with real data. This vignette is based on Peña, Smucler, and Yohai (2020).


Introduction
Dimension reduction is important for the analysis of multivariate time series, particularly for the high-dimensional data sets that are becoming increasingly common in applications, because the number of parameters in the usual multivariate time series models grows quadratically with the number of variables. The first proposal of a dynamic factor model for time series is due to Brillinger (1964, 1981), who proposed to apply standard techniques of factor analysis to the spectral matrix. He also proposed dynamic principal components, which are two-sided linear combinations of the data that provide an optimal reconstruction. They are obtained by the inverse Fourier transform of the principal components of the spectral density matrices for each frequency. Geweke (1977) proposed a one-sided generalization of the static factor model where the factors and the innovations are mutually independent and follow covariance stationary linear processes, and applied standard estimation methods for factor analysis to the spectral density matrix instead of the covariance matrix. A similar model was used by Sargent and Sims (1977), who named their model the index model. In this model the factors account for all the cross-correlations among the series, whereas the factors and the innovations account for the autocorrelation of the observed series. Peña and Box (1987) proposed a dynamic factor model for a vector of stationary time series with white noise innovations and proved that the loading matrix can be estimated by the eigenvectors corresponding to non-null eigenvalues of the lag covariance matrices of the data. Stock and Watson (2002) used dynamic factors for forecasting, by assuming that the variable to forecast and the explanatory variables useful for its forecasting follow a dynamic factor model. Bai and Ng (2002) developed a criterion to consistently estimate the number of factors, which will be used in our package.
Hu and Chou (2004) proposed a test for the number of factors; a generalization of the model to integrated factors was carried out by Peña and Poncela (2006). Pan and Yao (2008) include general nonstationary processes for the factors. Lam and Yao (2012) proposed a test for the number of factors based on the eigenvalues of the lag covariance matrices. Forni, Hallin, Lippi, and Reichlin (2000) generalized Geweke's dynamic factor model by allowing the idiosyncratic components to be autocorrelated and to contain weak cross-correlations. The authors proposed to estimate the common components by the projection of the data on the dynamic principal components proposed by Brillinger. Forni, Hallin, Lippi, and Reichlin (2005) proposed a one-sided method of estimation of a dynamic factor model and used the method for forecasting. The forecasts generated with this procedure have been compared to the ones derived by Stock and Watson (2002) and the results are mixed (see Forni, Hallin, Lippi, and Zaffaroni (2017)). A modified forecasting approach was proposed by Forni, Hallin, Lippi, and Zaffaroni (2015), although again the finite sample results are mixed. Hallin and Lippi (2013) give a general presentation of the methodological foundations of dynamic factor models. Peña and Yohai (2016) proposed a new approach to defining dynamic principal components that differs from Brillinger's in three ways. First, their generalized dynamic principal components (GDPC, from now on) are built without the assumption that the series are stationary. If the data are not stationary, the GDPC minimize a meaningful reconstruction criterion, unlike Brillinger's approach, which minimizes the expected reconstruction mean squared error. Second, the GDPC need not be linear combinations of the original series. Third, they are computed directly in the time domain, instead of working in the frequency domain.
These GDPC are optimal reconstructors of the observed data, but they can also be used to estimate the common part in high-dimensional dynamic factor models. In fact, it has been shown that the GDPC provide consistent estimates (Smucler 2019) of the common part of dynamic factor models. Since the GDPC are based on both leads and lags of the data, like the dynamic principal components defined by Brillinger, they are not useful for forecasting. However, they are useful in general as a way to summarize the information in sets of dependent time series in which the factor structure may not apply. Finally, the optimal reconstructor property of the GDPC makes them useful for data compression, reducing the resources required to store and transmit data; they could thus be applied to image analysis and other types of spatial data in which dependence is important.
A large number of R packages are available for working with time series. Besides the 'mts' class provided by the stats package, R Core Team (2016), several R packages provide classes and methods for handling time-indexed data. For example, the zoo package, Zeileis and Grothendieck (2005), provides an S3 class and methods for handling totally ordered indexed observations, in particular irregular time series. The xts package, Ryan and Ulrich (2018), is able to uniformly handle R's different time-based data classes. Our gdpc package supports 'mts', 'xts' and 'zoo' objects: if the original data is stored in an object of class 'mts', 'xts' or 'zoo', then the principal components and reconstructed time series will also be stored in an object of class 'mts', 'xts' or 'zoo' respectively, with the same date attributes. Among the multivariate time series R packages, the MTS package, Tsay and Wood (2018), is a general tool-kit for multivariate time series that includes vector autoregressive moving average (VARMA) models, factor models, and multivariate volatility models. freqdom, Hormann and Kidzinski (2017), implements Brillinger's dynamic principal components. pcdpca, Kidzinski, Jouzdani, and Kokoszka (2017), extends Brillinger's dynamic principal components to periodically correlated multivariate time series. An extensive comparison of Brillinger's approach to dynamic principal components and GDPC can be found in Peña and Yohai (2016). The BigVAR package, Nicholson, Matteson, and Bien (2019), estimates vector autoregressive (VAR) models with structured LASSO penalties. Many more packages can be found at https://CRAN.R-project.org/view=TimeSeries. To the best of our knowledge, gdpc (Peña, Smucler, and Yohai 2018) is the only publicly available implementation of GDPC.
This article is organized as follows. In Section 2 we briefly review the definition of the GDPC and present a new proposal to apply the procedure in an automatic way. Several possible methods for choosing the number of components and the number of lags used for each component are discussed. In Section 3 we review the iterative algorithm used to compute the GDPC. In Section 4 we illustrate the use of the gdpc package using artificial and real data sets. We compare the performance regarding computation time and the reconstruction of non-stationary time series of the implementation of Brillinger's method found in the freqdom package to that of GDPC in Section 5. In Section 6 we compare the performance of the different criteria available in the gdpc package to choose the number of lags that define the GDPC. Section 7 includes a small simulation study to estimate the typical computation times of the main algorithm for both stationary and non-stationary time series. Finally, conclusions are provided in Section 8.

Definition of generalized dynamic principal components
Consider a time series vector z_t = (z_{1,t}, . . . , z_{m,t}), where 1 ≤ t ≤ T, and let Z be the T × m matrix whose rows are z_1, . . . , z_T. Here T stands for the number of periods and m for the number of series. We define the first dynamic principal component with k lags as a vector f = (f_t)_{−k+1 ≤ t ≤ T}, such that the reconstruction of each series z_{j,t}, 1 ≤ j ≤ m, as a linear combination of (f_{t−k}, . . . , f_{t−1}, f_t) is optimal with respect to the mean squared error (MSE) criterion. More precisely, given a possible factor f of length (T + k), an m × (k + 1) matrix of coefficients β = (β_{j,h})_{1≤j≤m, 1≤h≤k+1} and α = (α_1, . . . , α_m), the reconstruction of the original series z_{j,t} is defined as

z^{R,b}_{j,t}(f, β, α) = α_j + ∑_{h=0}^{k} β_{j,h+1} f_{t−h},  1 ≤ t ≤ T, 1 ≤ j ≤ m.  (1)

The R superscript in z^{R,b}_{j,t} stands for reconstruction and the b superscript stands for backward, since the reconstruction is defined using the lags of f_t, in contrast with the reconstruction using leads, to be defined later.
The MSE loss function when we reconstruct the m series using f, β and α is given by

MSE(f, β, α) = (1/(T m)) ∑_{j=1}^{m} ∑_{t=1}^{T} (z_{j,t} − z^{R,b}_{j,t}(f, β, α))²,  (2)

and the values that minimize this MSE are denoted by (f̂, β̂, α̂). The value of f is not identified, since we can multiply the coefficients β_{j,h+1} by a constant and divide f_{t−h} by the same constant without changing the model. To solve this issue we take f with zero mean and unit variance. f̂ is then the first GDPC. Note that for k = 0, f̂ is the first ordinary (non-dynamic) principal component of the data. The second GDPC is defined as the first GDPC of the residuals r_{j,t} = z_{j,t} − z^{R,b}_{j,t}(f̂, β̂, α̂). Higher-order GDPC are defined in a similar fashion.
Note that the reconstruction given in Equation 1 can be written using leads instead of lags. To simplify, suppose that α_j = 0 and k = 1, so that

z^{R,b}_{j,t} = β_{j,1} f_t + β_{j,2} f_{t−1}.

Defining f*_t = f_{t−1}, we have

z^{R,b}_{j,t} = β_{j,2} f*_t + β_{j,1} f*_{t+1},

which is the same equation but now using leads: we just shift the series of the principal component one position and interchange the values of the β coefficients to obtain an equivalent representation using leads instead of lags. In general, given f, β and α we can define

f*_t = f_{t−k},  1 ≤ t ≤ T + k,  (3)
β*_{j,h+1} = β_{j,k−h+1},  0 ≤ h ≤ k,  (4)
α* = α,  (5)

and write

z^{R,f}_{j,t}(f*, β*, α*) = α*_j + ∑_{h=0}^{k} β*_{j,h+1} f*_{t+h},

where the f superscript stands for forward. Clearly z^{R,f}_{j,t}(f*, β*, α*) = z^{R,b}_{j,t}(f, β, α), and hence we see that, without loss of generality, we can use either lags or leads of the principal component to reconstruct the series. Moreover, minimizing the function in Equation 2 is equivalent to minimizing

MSE(f*, β*, α*) = (1/(T m)) ∑_{j=1}^{m} ∑_{t=1}^{T} (z_{j,t} − z^{R,f}_{j,t}(f*, β*, α*))².

Since the notation is less cumbersome using leads, the derivation of the optimal solution in Section 3 is presented using leads. Moreover, all internal computations in the gdpc package are performed using leads. However, since the reconstruction using lags is more intuitive, the final output is passed to the form using lags via Equations 3, 4 and 5. It can be shown that the GDPC can also be equivalently defined using k/2 leads and k/2 lags of f_t to define the reconstruction; see Peña and Yohai (2016) for details. This explains the noisy character of the GDPC at the ends of the sample (see Figure 4 for example) and why, like Brillinger's DPC, the GDPC are not directly useful for forecasting.
Note that if we are considering p dynamic principal components with k_i lags each, i = 1, . . . , p, the number of values required to reconstruct the original series is ∑_{i=1}^{p} (T + k_i + m(k_i + 1) + m). In practice, the number of components and the number of lags used for each component need to be chosen. The MSE of the reconstruction decreases when either of these two numbers is increased, but the amount of information needed to be stored for the reconstruction will also increase. The number of components can be chosen so that a pre-specified fraction of the total variance of the data is explained, as is usual in ordinary principal component analysis. Regarding the choice of the number of lags for each component, let k be the number of lags used to fit the component under consideration. Let

ŷ_{j,t} = α_j + ∑_{h=0}^{k} β_{j,h+1} f_{t+h}

be the interpolator, where the target series y_{j,t} equals z_{j,t} for the first component and equals the residuals from the fit with the previous components otherwise. Let r_{j,t} = y_{j,t} − ŷ_{j,t} be the residuals from the fit with the component, let R be the corresponding T × m matrix of residuals and let Σ̂ = (R^⊤ R)/T. Given k_max, k can be chosen among 0, . . . , k_max as the value that minimizes some criterion. This criterion should take into account the MSE of the reconstruction and the amount of information to be stored. The following four criteria are available in gdpc:

• Leave-one-out (LOO) cross-validation:

LOO_k = (1/(T m)) ∑_{j=1}^{m} ∑_{t=1}^{T} r²_{j,t} / (1 − h_{k,tt})²,

where h_{k,tt} are the diagonal elements of the hat matrix H_k = F_k (F_k^⊤ F_k)^{−1} F_k^⊤, with F_k the T × (k + 2) matrix with t-th row (f_t, . . . , f_{t+k}, 1).

• An AIC type criterion (Akaike 1974):

AIC_k = T log( tr(Σ̂)/m ) + 2 m (k + 2).

• A BIC type criterion (Schwarz 1978):

BIC_k = T log( tr(Σ̂)/m ) + m (k + 2) log(T).

• A criterion based on the IC_{p3} proposal of Bai and Ng (2002):

BNG_k = min(T, m) log( tr(Σ̂)/m ) + (k + 1) log(min(T, m)).

The AIC and BIC type criteria are known not to work well when the ratio T/m is small; this is to be expected, since they are based on the assumption that m is fixed and T → ∞. For the cases in which the ratio T/m is small it is preferable to use a criterion based on both m and T going to infinity.
For this reason, we included the criterion BNG_k, based on a proposal of Bai and Ng (2002) for choosing the number of factors in a factor model. We will compare the performance of the criteria in Section 6.
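As an illustration, the LOO criterion can be computed cheaply from the residual matrix and the diagonal of the hat matrix. The following sketch assumes the formulas above; loo_crit is an illustrative helper of ours, not a function exported by gdpc.

```r
# Sketch: leave-one-out criterion for a given k, from the T x m residual
# matrix and the T x (k + 2) design matrix F_k (rows (f_t, ..., f_{t+k}, 1)).
loo_crit <- function(resid_mat, F_k) {
  # Diagonal of the hat matrix H_k = F_k (F_k' F_k)^{-1} F_k'
  h <- rowSums((F_k %*% solve(crossprod(F_k))) * F_k)
  # Average of r^2_{j,t} / (1 - h_{k,tt})^2 over all series and periods
  mean(sweep(resid_mat^2, 1, (1 - h)^2, "/"))
}
```

Since the hat matrix diagonal does not depend on j, it is computed once and reused across all m series.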

Computing the GDPC
To compute the GDPC we note that, given the component, the β coefficients are regression coefficients that can be computed by least squares, and that, given the β coefficients, the component is the solution of a system of linear equations that can be easily computed. To write down the equations we have to solve, we first introduce some notation.
Differentiating Equation 2 with respect to f, it is easy to show that

f = D(β)^{−1} ∑_{j=1}^{m} C_j(β_j)(z_j − α_j 1_T),  (6)

where β_j, j = 1, . . . , m, are the rows of β, z_j is the j-th column of Z, 1_T is a vector of T ones, C_j(β_j) is the (T + k) × T matrix with entries C_j(β_j)_{t+h,t} = β_{j,h+1} for 0 ≤ h ≤ k and zero otherwise, and D(β) = ∑_{j=1}^{m} C_j(β_j) C_j(β_j)^⊤. Differentiating Equation 2 with respect to β_j and α_j we obtain

(β_j^⊤, α_j)^⊤ = (F^⊤ F)^{−1} F^⊤ z_j,  (7)

where F is the T × (k + 2) matrix whose t-th row is (f_t, . . . , f_{t+k}, 1). To define an iterative algorithm to compute (f̂, β̂, α̂) it is enough to provide f^(0) and a rule describing how to compute β^(h), α^(h) and f^(h+1) once f^(h) is known. The following two steps based on Equations 6 and 7 describe a natural rule to perform this recursion.

Step 1: Given f^(h), compute β^(h) and α^(h) by least squares using Equation 7.
Step 2: Given β^(h) and α^(h), compute f^(h+1) using Equation 6 and normalize it to have zero mean and unit variance.
The initial value f^(0) can be chosen equal to the ordinary first principal component, completed with k leads. We stop after niter_max iterations or when

|MSE(f^(h), β^(h), α^(h)) − MSE(f^(h+1), β^(h+1), α^(h+1))| / MSE(f^(h), β^(h), α^(h)) < tol

for some user-supplied values tol and niter_max.
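The iteration can be sketched in a few lines of R. This is a minimal illustrative implementation of the two steps above under our reading of Equations 6 and 7 (the package performs these computations in C++, and gdpc_sketch is our name, not a package function):

```r
# Minimal sketch of the alternating algorithm for the first GDPC with k leads.
# Illustrative only: no input checking, dense matrices throughout.
gdpc_sketch <- function(Z, k, niter_max = 500, tol = 1e-4) {
  T <- nrow(Z); m <- ncol(Z)
  # F(f): T x (k + 2) design matrix with t-th row (f_t, ..., f_{t+k}, 1)
  design <- function(f) cbind(sapply(0:k, function(h) f[(1 + h):(T + h)]), 1)
  f <- prcomp(Z)$x[, 1]                     # ordinary first principal component
  f <- c(f, rep(f[T], k))                   # completed with k leads
  crit_old <- Inf
  for (it in seq_len(niter_max)) {
    # Step 1 (Equation 7): given f, beta and alpha by least squares
    coefs <- qr.coef(qr(design(f)), Z)      # (k + 2) x m coefficient matrix
    # Step 2 (Equation 6): given beta and alpha, f solves a linear system
    D <- matrix(0, T + k, T + k); rhs <- numeric(T + k)
    for (j in seq_len(m)) {
      Cj <- matrix(0, T + k, T)
      for (h in 0:k) Cj[cbind((1:T) + h, 1:T)] <- coefs[h + 1, j]
      D <- D + Cj %*% t(Cj)
      rhs <- rhs + Cj %*% (Z[, j] - coefs[k + 2, j])
    }
    f <- drop(solve(D, rhs))
    f <- (f - mean(f)) / sd(f)              # zero mean, unit variance
    crit <- mean((Z - design(f) %*% coefs)^2)
    if (it > 1 && abs(crit_old - crit) / crit_old < tol) break
    crit_old <- crit
  }
  list(f = f, beta = t(coefs[seq_len(k + 1), , drop = FALSE]),
       alpha = coefs[k + 2, ], mse = crit)
}
```

The sketch mirrors the complexity analysis below: step 1 is m least squares fits with k + 2 predictors, and step 2 solves one (T + k)-dimensional linear system.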
All numerically intensive computations are performed in C++, using the Armadillo linear algebra library, Sanderson and Curtin (2016). The C++ code is integrated with R using packages Rcpp, Eddelbuettel and François (2011), and RcppArmadillo, Eddelbuettel and Sanderson (2014).
In step 1 of the algorithm we need to solve m least squares problems, each with T observations and k + 2 predictor variables. The worst case complexity for solving each of these least squares problems is O(k²T). Hence the worst case complexity for step 1 of the algorithm is O(mk²T). The worst case complexity for solving the linear system in step 2 of the algorithm is O((T + k)³). Hence, at each iteration the worst case complexity is O((T + k)³ + mk²T), which is linear in m. Thus, the algorithm is able to deal with very high-dimensional problems. Note that there are no restrictions on the values of f and, in particular, we do not assume, as in Brillinger (1981), that the GDPC must be linear combinations of the series. Note that in Equation 6 the computation of the component is linear in the observed data given the β parameters. Also, the β parameters are estimated as linear functions of the observations given the GDPC by Equation 7. However, this two-step estimation leads to a component which, in general, will not be a linear function of the observed series. It is shown in Peña and Yohai (2016) that, in particular stationary cases, the components are approximately linear combinations of the data. This is similar to what happens in the static case with standard principal components, where we do not impose this restriction but the optimal solution satisfies it. However, when the series are not stationary this additional flexibility leads to values of f better adapted to the possible non-stationary character of the time series. We will back up this claim with experimental results in Section 5.

Using the gdpc package
There are two main functions available to the user: gdpc and auto.gdpc. We first describe gdpc.

The gdpc function and class
The function gdpc computes a single GDPC with a number of lags that has to be provided by the user. It has the following arguments:
• Z: Data matrix. Each column is a different time series.
• k: Integer. Number of lags to use.
• f_ini: (Optional). Numeric vector. Starting point for the iterations. If no argument is passed the ordinary first principal component completed with k lags is used.
• crit: A string specifying the criterion to be used to evaluate the reconstruction. Options are "LOO", "AIC", "BIC" and "BNG". Default is "LOO".
The output of this function is an S3 object of class 'gdpc', that is, a list with entries:
• expart: Proportion of the variance explained.
• crit: The value of the criterion of the reconstruction, according to what the user specified.
• k: Number of lags used.
• alpha: Vector of intercepts corresponding to f.
• beta: Matrix of loadings corresponding to f. Column number j is the vector of j − 1 lag loadings.
• f: Coordinates of the first dynamic principal component corresponding to the periods 1, . . . , T .
• call: The matched call.
fitted, plot and print methods are available for this class. We illustrate the use of this function with an artificial data set. First, we load the package and generate the artificial data; we then fit a GDPC, storing the result in fit. Entering fit at the prompt prints the object, showing the number of lags used, the value of the criterion specified by the user (the default "LOO" in this case), the MSE of the reconstruction and the fraction of explained variance. The reconstruction is excellent, with more than 99% of the variance of the data explained.
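The data generation and the fit can be sketched as follows. The exact simulation used in the original vignette may differ; here we assume a simple one-factor model with one lag, so that k = 1 is the natural choice:

```r
library(gdpc)

# Artificial data: m series driven by one latent factor and its first lag
# (illustrative design; loadings and scaling are our assumptions).
set.seed(1234)
T <- 200; m <- 10
f <- rnorm(T + 1)                       # latent factor, one extra period for the lag
x <- matrix(0, T, m)
for (j in 1:m) {
  x[, j] <- 10 * sin(2 * pi * (j / m)) * f[1:T] +
            10 * cos(2 * pi * (j / m)) * f[2:(T + 1)] + rnorm(T)
}

fit <- gdpc(x, k = 1)                   # one GDPC with one lag
fit                                     # print: lags, criterion, MSE, explained variance
```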
Now we can plot the loadings and the principal component using the plot method for the 'gdpc' class. The method has the following arguments:
• x: An object of class 'gdpc', usually the result of gdpc or one of the entries of the result of auto.gdpc.
• which: String. Indicates what to plot: the component ("Component") or the loadings ("Loadings"). Default is "Component".
• which_load: Lag number indicating which loadings should be plotted. Only used if which = "Loadings". Default is 0.
• ...: Additional arguments to be passed to the plotting functions.
Continuing with our example, we plot the loadings and the principal component.
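For instance, with argument values as described for the plot method above:

```r
# Plot the principal component, then the lag-0 loadings
plot(fit, which = "Component")
plot(fit, which = "Loadings", which_load = 0)
```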
The reconstruction of the original series can be obtained using the fitted method for the 'gdpc' class. We store it in recons.

R> recons <- fitted(fit)
We can compare the approximate amount of storage needed for the original data set x and the 'gdpc' object fit.

Hence, the amount of memory needed to store fit is about 1.6% of that needed to store x.
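The comparison above can be reproduced with object.size from the utils package:

```r
object.size(x)    # storage needed for the original data matrix
object.size(fit)  # storage needed for the fitted 'gdpc' object
```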

The auto.gdpc function and the 'gdpcs' class
The function auto.gdpc computes several GDPC. The number of components can be supplied by the user or chosen automatically so that a given proportion of variance is explained. The number of lags is chosen automatically using one of the criteria listed in Section 2. The function has the following arguments:
• Z: Data matrix. Each column is a different time series.
• crit: A string specifying the criterion to be used to choose the number of lags. Options are "LOO", "AIC", "BIC" and "BNG". Default is "LOO".
• normalize: Integer. Either 1, 2 or 3. Indicates whether the data should be standardized. Default is 1. See details below.
• auto_comp: Logical. If TRUE compute components until the proportion of explained variance is equal to expl_var, otherwise use num_comp components. Default is TRUE.
• expl_var: Numeric, between 0 and 1. Desired proportion of explained variance (only used if auto_comp==TRUE). Default is 0.9.
• num_comp: Integer. Number of components to be computed (only used if auto_comp==FALSE). Default is 5.
• niter_max: Integer. Maximum number of iterations. Default is 500.
• ncores: Integer. Number of cores to be used for parallel computations. Default is 1.
• verbose: Logical. Should progress be reported? Default is FALSE.
The argument normalize indicates whether the data should be normalized. If normalize = 1, the data is analysed in the original units, without mean and variance standardization. If normalize = 2, the data is standardized to zero mean and unit variance before computing the principal components, but the intercepts and loadings are those needed to reconstruct the original series. If normalize = 3 the data are standardized as in normalize = 2, but the intercepts and the loadings are those needed to reconstruct the standardized series. Default is normalize = 1, and hence the data is analysed in its original units.
The choice of "LOO" as the default criterion for choosing the number of lags is justified in Section 6. The optimal number of lags for each component can be computed in parallel, using the R packages doParallel, Analytics and Weston (2018), and foreach, Kane, Emerson, and Weston (2013). The argument ncores indicates the number of cores to be used for the parallel computations. The default value is 1, and hence by default no parallel computations are performed. If verbose=TRUE a message is printed each time a component is successfully computed.
The output of auto.gdpc is an S3 object of class 'gdpcs', that is, a list of length equal to the number of computed components. The i-th entry of this list is an object of class 'gdpc' that stores the information of the i-th dynamic principal component, that is, a list with entries:
• expart: Proportion of the variance explained by the first i components.
• mse: Mean squared error of the reconstruction using the first i components.
• crit: The value of the criterion of the reconstruction, according to what the user specified.
• k: Number of lags chosen.
• alpha: Vector of intercepts corresponding to f.
• beta: Matrix of loadings corresponding to f. Column number j is the vector of j − 1 lag loadings.
• f: Coordinates of the i-th dynamic principal component corresponding to the periods 1, . . . , T .
• call: The matched call.
components, fitted, plot and print methods are available for this class. We illustrate the use of this function with a real data set, pricesSP50, that is part of the package. The data set is formed by fifty series corresponding to the stock prices of the first 50 components of the Standard & Poor's 500 index. Five hundred daily observations starting 1/1/2010 are available. The class of the object storing the data set is 'mts'.
The data set can be loaded and part of it can be plotted using the following commands:

R> data("pricesSP50")
R> plot(pricesSP50[, 1:4], main = "Four components of the S&P500 index")
The plot is shown in Figure 2.
Next, we apply auto.gdpc to the data set, storing the result in fit_SP. Since some of the series are significantly more variable than others, we apply the procedure to the normalized data set using the option normalize=2. Since convergence is somewhat slow for this data set, we increased the maximum number of iterations from the default 500 to 1000 by setting niter_max=1000. We used 8 cores to perform the computation, by setting ncores=8. The rest of the arguments are left at their default values. In particular, the number of lags for each component is chosen as the value that minimizes the LOO criterion and the number of components is chosen so that the reconstruction explains at least 90% of the variance of the data.

We can obtain the reconstruction of the original time series using the fitted method for the 'gdpcs' class. The method has the following arguments:
• object: An object of class 'gdpcs', usually the result of auto.gdpc.
• num_comp: Integer indicating how many components to use for the reconstruction. Default is 1.
• ...: Additional arguments for compatibility.
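The auto.gdpc call described above can be written as:

```r
# normalize = 2: standardize, but report loadings for the original units;
# niter_max = 1000: raise the iteration cap; ncores = 8: parallel lag search
fit_SP <- auto.gdpc(pricesSP50, normalize = 2, niter_max = 1000, ncores = 8)
fit_SP
```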
Note that the gdpc package supports 'mts', 'xts' and 'zoo' objects. The principal components and reconstructed time series will be stored in an object of class 'mts', 'xts' or 'zoo' respectively, the same as the original data, with the same date attributes. Thus, since the pricesSP50 data was stored in an object of class 'mts', the reconstructed time series will be of class 'mts', with the same date attributes.
We store the reconstructed time series using both computed components in an mts object called recons, assign each of the series its corresponding name, and plot the first four series. The result is shown in Figure 3.
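A sketch of these commands (the plot title is our assumption):

```r
recons <- fitted(fit_SP, num_comp = 2)          # reconstruction with both components
colnames(recons) <- colnames(pricesSP50)        # carry over the series names
plot(recons[, 1:4], main = "Reconstruction of four components of the S&P500 index")
```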
The individual components can be extracted using the components method for the 'gdpcs' class. The method has the following arguments:
• object: An object of class 'gdpcs', usually the result of auto.gdpc.
• which_comp: Numeric vector indicating which components to get. Default is 1.
Since the original data was stored in an object of class 'mts', the components will also be of class 'mts'. We store the components in an 'mts' object called comps.

R> comps <-components(fit_SP, which_comp = c(1, 2))
We can either directly plot the comps time series matrix or use the plot method for the class 'gdpcs'. The method has the following arguments:
• x: An object of class 'gdpcs', usually the result of auto.gdpc.
• plot.type: Argument to be passed to the plot for 'zoo' objects. Used only when the original data set was stored in an object of class 'zoo'. Default is "multiple".
• which_comp: Numeric vector indicating which components to plot. Default is 1.
• . . . : Additional arguments to be passed to the plotting functions.
Using the plot method for the 'gdpcs' class we can plot both principal components. The result is shown in Figure 4.

R> plot(fit_SP, which_comp = c(1, 2))
We can compare the approximate amount of storage needed for the original data set pricesSP50 and the 'gdpcs' object fit_SP.

Hence, the amount of memory needed to store fit_SP is about 14% of that needed to store pricesSP50.

Reconstructing non-stationary data
We compare the performance of GDPC and the dynamic principal components proposed by Brillinger (1981) by conducting a small simulation study. We consider the following VARI(1,1) model. The data is generated according to

z_t = z_{t−1} + x_t,

where x_t satisfies a stationary VAR(1) model

x_t = A x_{t−1} + u_t.

The u_t are i.i.d. with a standard multivariate normal distribution. In each replication the matrix A is generated randomly as A = V Λ V^⊤, where V is an orthogonal matrix generated at random with uniform distribution and Λ is a diagonal matrix whose diagonal elements are independent random variables with uniform distribution on the [0, 0.9] interval. Note that the generated vector time series is not stationary.
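One replication of this data-generating process can be sketched as follows (the method for drawing the random orthogonal matrix V is our assumption):

```r
# One replication of the VARI(1,1) simulation: m series, T periods
set.seed(1)
T <- 100; m <- 100
V <- qr.Q(qr(matrix(rnorm(m * m), m, m)))        # random orthogonal matrix
A <- V %*% diag(runif(m, 0, 0.9)) %*% t(V)       # eigenvalues uniform on [0, 0.9]
x <- matrix(0, T, m)
x[1, ] <- rnorm(m)
for (t in 2:T) x[t, ] <- A %*% x[t - 1, ] + rnorm(m)   # stationary VAR(1)
z <- apply(x, 2, cumsum)                         # integrate: z_t = z_{t-1} + x_t
```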
We computed GDPC using one component and k = 10 lags, and the dynamic principal components of Brillinger (DPC) using one component with 5 leads and 5 lags, computed with the dpca function from the freqdom package (Hormann and Kidzinski 2017). The GDPC attained a better reconstruction of the non-stationary series, and its computing times were also much lower. For example, for the case of T = m = 100, GDPC takes on average 1.69 seconds to compute a solution, whereas the same task takes 90 seconds using DPC.

The performance of the lag selection criteria
In this section we compare the performance of the four different criteria available in the gdpc package to automatically choose the number of lags used to define the GDPC. Since we propose to choose the number of components to explain a given fraction of the variance in the data, we focus on the case in which one component is to be computed. We consider the following two scenarios.
• DFM1: The data is generated as

z_{j,t} = b_{0,j} f_t + b_{1,j} f_{t−1} + e_{j,t},  1 ≤ t ≤ T, 1 ≤ j ≤ m.

The factor f_t follows a stationary AR(1) model, f_t = θ f_{t−1} + u_t, with standard normal innovations and θ generated at random on the (−1, 1) interval at each replication. The loadings b_{0,j}, b_{1,j}, j = 1, . . . , m, are generated at random uniformly on the (−1, 1) interval. The variables e_{j,t} are generated as i.i.d. standard normal. This is a dynamic factor model with one factor and one lag.
• DFM2: The data is generated as

z_{j,t} = b_{0,j} f_t + b_{1,j} f_{t−1} + b_{2,j} f_{t−2} + e_{j,t},  1 ≤ t ≤ T, 1 ≤ j ≤ m.

The factor f_t follows a stationary MA(1) model, f_t = θ u_{t−1} + u_t, with standard normal innovations and with θ generated at random at each replication, uniformly on the (−1, 1) interval. The loadings and the errors are generated as in model DFM1. This is a dynamic factor model with one factor and two lags.
We take (m, T ) ∈ {5, 10, 20, 200, 800} × {200, 400} and do 500 replications. We report the average number of lags chosen by each criterion and the resulting average reconstruction mean squared error. Tables 2 and 3 show the results.
We see that in the cases in which the dimension of the vector time series is small, say m = 5 or m = 10, and the sample size is large, the AIC does well, choosing on average a number of lags close to the true value. In Table 3 we see that, in the cases where the methods choose on average a number of lags close to the true values, the reconstruction errors are close to the variance of the idiosyncratic part, which is 1 in this case. In cases where the methods overestimate the number of lags needed, the reconstruction error is smaller than the idiosyncratic variance; see for example the case of T = 200, m = 5 for the LOO criterion. This is to be expected, since if a larger number of lags than needed is used, the components will also explain part of the variance in the idiosyncratic part. In cases where the number of lags needed is underestimated, the reconstruction error is larger than 1; see for example the case of T = 200 and m = 800 for the BIC criterion.
Since we believe the main appeal of GDPC is for moderate to large sized panels of time series, the default criterion used in auto.gdpc is LOO. However, for small panels better results can be obtained by using the AIC criterion.

Computing times
In this section, we estimate the run times of our implementation of the algorithm described in Section 3 for different problem sizes by conducting a small simulation study. Timings were carried out for the gdpc function, since it is the one that implements the main numerical algorithm. Timings were carried out on R version 3.4.4 on a computer running Linux Ubuntu 18.04 64-bit with an Intel Core i7-7560U @ 2.40GHz x4.
We take (m, T) ∈ {100, 1000, 2000} × {200, 400} and consider the following two models.

Table 3: Average reconstruction mean squared error for scenarios DFM1 and DFM2, when the number of lags is chosen by each method.
• DFM3: The data is generated according to

z_{j,t} = sin(2π(j/m)) f_t + cos(2π(j/m)) f_{t−1} + e_{j,t},  1 ≤ t ≤ T, 1 ≤ j ≤ m,

where the e_{j,t} are i.i.d. standard normal variables. The vector of factors f_t = (f_{1,t}, f_{2,t}) is generated according to a vector autoregressive model f_t = A f_{t−1} + v_t b, where the v_t are i.i.d. standard normal, A is a diagonal matrix with diagonal equal to (−0.8, 0.7) and b = (1, 1)^⊤. Note that the resulting time series is stationary.
• DFM4: The data is generated according to

z_{j,t} = sin(2π(j/m)) f_t + cos(2π(j/m)) f_{t−1} + (j/m) f_{t−2} + e_{j,t},  1 ≤ t ≤ T, 1 ≤ j ≤ m,

where the e_{j,t} are i.i.d. standard normal variables. The factor f_t follows an integrated MA(1) model, f_t = f_{t−1} + θ u_{t−1} + u_t, with standard normal innovations and where θ is generated at random uniformly on the (−0.8, 0.8) interval at each replication. Note that the resulting time series is non-stationary.
We used k = 5 lags for the first model and k = 2 for the second model. We report the average computation times in seconds over 500 replications.
The results are shown in Table 4. We see that in the stationary case (DFM3) the algorithm can compute solutions for data sets with thousands of time series in under 15 seconds. The convergence of the algorithm is slower in the non-stationary case (DFM4), since the computing times increase drastically for problems of the same size as before. However, even in the non-stationary case, the algorithm can compute solutions to problems with thousands of time series in under 1 minute. Moreover, note that computations were performed on an ordinary desktop computer; on a high-performance cluster we expect the algorithm to be able to compute the GDPC for tens of thousands of series in a matter of minutes.

Table 4: Average computing times in seconds for stationary and non-stationary factor models.

Conclusions
The gdpc package provides functions to compute the generalized dynamic principal components for a set of time series. These components are useful to reconstruct the series and to describe their underlying dynamic structure. Also, they can be used as estimators of the common part in a dynamic factor model, as shown in Section 6 and by the theoretical results in Smucler (2019). The package is useful for the analysis of large data sets, even when the number of series is much larger than the length of the series. The gdpc package is available from the Comprehensive R Archive Network at https://CRAN.R-project.org/package=gdpc, including the pricesSP50 data set.