Implementing Modified Burg Algorithms in Multivariate Subset Autoregressive Modeling

The large number of parameters in subset vector autoregressive models often leads one to seek fast, simple, and efficient alternatives or precursors to maximum likelihood estimation. We present the solution of the multivariate subset Yule-Walker equations as one such alternative. In recent work, Brockwell, Dahlhaus, and Trindade (2002) show that the Yule-Walker estimators can in fact be obtained as a special case of a general recursive Burg-type algorithm. We illustrate the structure of this Algorithm, and discuss its implementation in a high-level programming language. Applications of the Algorithm to univariate and bivariate modeling are showcased in examples. Univariate and bivariate versions of the Algorithm, written in Fortran 90, are included in the appendix, and their use is illustrated.


Introduction
AutoRegressive Moving Average (ARMA) models are well-known and popular in the time series literature. Among others, they are extensively used to model economic and electrical systems whose evolution in time (modulo preliminary transformations, de-trending, and de-seasonalizing) can be well approximated by that of a stationary process. Stationarity constrains a process to have second order properties that do not evolve with time, i.e. constant mean, and a covariance structure between any two observations that depends only on the distance in time (the lag) separating them.
In the multivariate setting, one attempts to model the joint behavior of several univariate series over the same span of time. Here, though, the usefulness of multivariate ARMA models is stymied by identifiability issues concerning the model parameters (see, for example, Brockwell and Davis 1991, sec. 11.5). A common alternative is to restrict attention to two special cases: invertible Moving Average (MA) models, and causal AutoRegressive (AR) models, whose parameters are uniquely determined by the second order properties of the process. AR models (called Vector Autoregressive (VAR) in the multivariate case) are often favored over MA models due to their interpretability, simplicity of estimation, and ease of forecasting. They are extensively used in signal processing for modeling various phenomena associated with speech and audio; see for example Godsill and Rayner (1998).
The d-dimensional vector process {X_t, t = 0, ±1, . . .} is said to be a VAR process of order p, VAR(p), if it is a stationary solution of the equations

    X_t = Φ(1) X_{t-1} + · · · + Φ(p) X_{t-p} + Z_t,

where Φ(1), . . . , Φ(p) are (d × d) constant matrices (the VAR coefficient matrices), and {Z_t} is a sequence of zero-mean uncorrelated random vectors, each with covariance matrix Σ. We call the process {Z_t} white noise, and write {Z_t} ∼ WN(0, Σ). The autocovariance function of {X_t} is

    Γ(h) = E[X_{t+h} X_t'],  h = 0, ±1, . . . .
A VAR(p) is therefore a linear regression of the current value of the series on its previous p values i.e. an autoregression. We say that we are modeling the series on the lagged set {1, . . . , p}.
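As a quick numerical illustration of the definition, the following sketch simulates a causal bivariate VAR(1) and checks the lag-1 relation Γ(1) = Φ(1)Γ(0) against sample autocovariances. The coefficient and noise matrices are illustrative choices of ours, not taken from the paper.

```python
import numpy as np

# Illustrative sketch (not from the paper): simulate a causal bivariate
# VAR(1), X_t = Phi X_{t-1} + Z_t with {Z_t} ~ WN(0, Sigma), and check
# the lag-1 Yule-Walker relation Gamma(1) = Phi Gamma(0) empirically.
rng = np.random.default_rng(0)
Phi = np.array([[0.5, 0.1],
                [0.0, 0.4]])       # eigenvalues 0.5 and 0.4: causal
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])
L = np.linalg.cholesky(Sigma)      # to draw Z_t with covariance Sigma

n, burn = 2000, 200
X = np.zeros((n + burn, 2))
for t in range(1, n + burn):
    X[t] = Phi @ X[t - 1] + L @ rng.standard_normal(2)
X = X[burn:]                       # discard the start-up transient

def gamma_hat(X, h):
    """Sample autocovariance Gamma_hat(h) = (1/n) sum_t X_{t+h} X_t'."""
    n = X.shape[0]
    return (X[h:].T @ X[:n - h]) / n

err = np.linalg.norm(gamma_hat(X, 1) - Phi @ gamma_hat(X, 0))
print(err < 0.5)                   # → True: the YW relation holds
```

The check is a sanity test of stationarity: for a VAR(1), every autocovariance is determined by Φ and Γ(0) through Γ(h) = Φ Γ(h−1).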
One can generalize this to modeling on a lagged subset K = {k_1, . . . , k_m} ⊆ {1, . . . , p}, with k_1 < · · · < k_m ≡ p, and the coefficient matrices pertaining to the lags not present in K constrained to be zero. Such models are called Subset Vector AutoRegressive (SVAR; SAR in the univariate case), and take the form

    X_t = Φ_K(k_1) X_{t-k_1} + · · · + Φ_K(k_m) X_{t-k_m} + Z_t,  {Z_t} ∼ WN(0, Σ).  (1)

SVAR models are appropriate in situations where one does not wish to include all the lags of the complete (full-set) VAR model. Two such instances are the following.
Modeling of seasonal time series. If B denotes the backward shift operator, i.e. B^k X_t = X_{t-k} for any positive integer k, then causal SAR models with appropriate values of the coefficients ψ, φ_1, . . . , φ_p, and orders s, p, will exhibit approximate cyclical behavior, as evidenced by sharp peaks in the spectral density. This suggests that some seasonal time series can effectively be modeled as SAR processes. Figure 1 shows a realization from the SAR model X_t − 0.99X_{t-3} = Z_t (lag set K = {3}), along with a plot of the spectral density function of the process on the interval (0, π). The spectral density peaks at a frequency of 2π/3 radians per unit time, which corresponds to a period of length 3.
Fitting best subset models. As in linear regression, one can search for the "best" subset AR/VAR model up to some maximum order, p. "Best" can be measured by one's favorite information criterion, such as Akaike (AIC), Bayesian (BIC), Schwarz (SIC), or even Minimum Description Length (MDL). Researchers have devised efficient algorithms to perform this search. One of the earliest attempts was made by McClave (1975), who used an algorithm adapted from linear regression. Penm and Terrell (1982) introduced an algorithm recursive in the maximum lag for best subset identification. Zhang and Terrell (1997) refine the search by inspecting certain statistics. Rather than performing an exhaustive search through all 2^p models, Sarkar and Sharma (1997) propose a statistical method for identifying the best subset. Figure 2 shows the celebrated Canadian lynx trappings data. Because of the ecological oscillations in predator-prey populations, the logarithms of this data set are often modeled as a SAR process; it is a perennial favorite in the SAR literature. The lower part of the figure shows a spectral density estimator (the periodogram) for this data, which suggests the period of the oscillations to be approximately 2π/0.6 ≈ 10.5 years. In Section 4, we will apply the Algorithm of Section 2 to perform an exhaustive search for the best SAR model.
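The approximate cyclical behavior described under the first instance above can be seen numerically. The sketch below (with illustrative helper names) simulates the Figure 1 process X_t − 0.99X_{t−3} = Z_t and confirms that the sample autocorrelation is dominated by lag 3, reflecting the period-3 cycle.

```python
import numpy as np

# Illustrative sketch: simulate the Figure 1 process
# X_t - 0.99 X_{t-3} = Z_t (lag set K = {3}) and verify that the sample
# autocorrelation is dominated by lag 3, reflecting the period-3 cycle.
rng = np.random.default_rng(1)
phi, n, burn = 0.99, 5000, 500
x = np.zeros(n + burn)
for t in range(3, n + burn):
    x[t] = phi * x[t - 3] + rng.standard_normal()
x = x[burn:]

def acf(x, h):
    """Sample autocorrelation at lag h (mean-corrected)."""
    x = x - x.mean()
    return float(x[h:] @ x[:len(x) - h] / (x @ x))

rhos = [acf(x, h) for h in range(1, 4)]   # lags 1, 2, 3
print(int(np.argmax(rhos)) + 1)           # → 3
```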
For a given SVAR model order, one typically wishes to find maximum likelihood (ML) estimates of the parameters, obtained by standard arguments from the Gaussian likelihood of the observations.

Figure 1: The process X_t − 0.99X_{t-3} = Z_t, {Z_t} ∼ WN(0, 1). Top: a realization of the process with Gaussian noise. Bottom: the corresponding spectral density function.

Remark 1
The potentially large number of parameters involved in ML estimation (d²m + (d² + d)/2 of them), and the possible existence of many local minima which are much larger than the global minimum, make the numerical search for the minimizers a difficult problem. The feasibility of ML estimation is therefore highly dependent upon good initial estimates.
For this reason, and because one may wish to avoid ML estimation altogether, it is important to consider alternative fast and simple SVAR estimation methods for obtaining models with high likelihoods. Recently, Brockwell, Dahlhaus, and Trindade (2002) introduced a method for doing just that. Their method, the BDT Algorithm, which we consider in Section 2, is recursive in the model order, parameter estimates of larger order models being constructed from those of smaller order models. Since their paper focuses mostly on theoretical aspects, the main purpose of this article is to serve as a pragmatic complement to it in the following ways: (i) Elucidate the recursive structure of the Algorithm.
(ii) Discuss the main issues involved in implementing the Algorithm in a high-level programming language like Fortran 90.
(iii) Provide coded versions of the Algorithm along with examples that illustrate its usage.
Section 3 accomplishes the first two goals, in the framework of a binary tree of pointer-linked nodes. The examples are presented in Section 4, which also illustrates some meta applications of the Algorithm. (Accompanying data sets are provided in Appendix B.) Fortran 90 programs implementing a univariate (BDT.F90) and a bivariate (BDT2.F90) version of the Algorithm are provided in Appendix C. Appendix A summarizes the function of the principal subroutines in these programs.

Estimation Methods
In this section we discuss alternative SVAR parameter estimation methods to ML. The first is a generalization of the well-known Yule-Walker method of moments estimator for full-set modeling, and has not previously appeared in the literature in this form. The second is a flexible recursive Burg-type algorithm, introduced by Brockwell, Dahlhaus, and Trindade (2002), whose structure and implementation are the main focus of this paper. In order to introduce both methods, we will need to consider not only the (forward) SVAR model (1), but also the backward SVAR model

    X_t = Σ_{j ∈ K*} Ψ_{K*}(j) X_{t+j} + W_t,  {W_t} ∼ WN(0, V_{K*}),  (3)

where K* = {k_m − k_{m−1}, . . . , k_m − k_1, k_m}, and suppose that the process {X_t} is causal, meaning that the current value of the series can be expressed as a function of current and past values of the white noise sequence as

    X_t = Σ_{j=0}^{∞} Θ_j Z_{t-j},

for some sequence of matrices {Θ_j} with absolutely summable entries.

Non-Recursive Estimation: The Yule-Walker Equations
If we multiply both sides of (1) by X_{t-i}', i = 0, k_1, . . . , k_m in turn, and (noting the causal representation) take expectations, we obtain the so-called Yule-Walker (YW) equations:

    Γ(k_i) = Σ_{j=1}^{m} Φ_K(k_j) Γ(k_i − k_j),  i = 1, . . . , m,  (4)

    Γ(0) = Σ_{j=1}^{m} Φ_K(k_j) Γ(−k_j) + U_K.  (5)

For the backward SVAR model (3), the corresponding YW equations are obtained in the same way, with Ψ_{K*} and V_{K*} in place of Φ_K and U_K. Now define R_K and G_K to be matrices of autocovariances as follows: with k_0 = 0, define the (i, j)th, i, j = 1, . . . , m + 1, block-matrix entry of R_K to be

    Γ(k_{j-1} − k_{i-1}),

and G_K obtained from R_K by striking out the first block row and column. The m + 1 forward YW equations can now be succinctly written in block-matrix form as

    [I, −Φ_K(k_1), . . . , −Φ_K(k_m)] R_K = [U_K, 0, . . . , 0],

and the backward YW equations analogously in terms of R_{K*}. Writing Φ_K = [Φ_K(k_1), . . . , Φ_K(k_m)] and γ_K = [Γ(k_1), . . . , Γ(k_m)], we can write (4)-(5) in the reduced block-matrix form

    Φ_K G_K = γ_K,  U_K = Γ(0) − Σ_{j=1}^{m} Φ_K(k_j) Γ(k_j)'.

These can now be solved for Φ_K and U_K:

    Φ_K = γ_K G_K^{-1},  (10)

    U_K = Γ(0) − Σ_{j=1}^{m} Φ_K(k_j) Γ(k_j)',  (11)

where G_K^{-1} denotes any generalized inverse of G_K. The solution Φ_K gives the minimum mean-squared error linear predictor of X_t in terms of X_{t-i}, i ∈ K. Its mean-squared error is U_K. Analogous results hold for the backward YW equations. When fitting SVAR model (1) to a set of observations x_1, . . . , x_n from the zero-mean random vectors X_1, . . . , X_n, one of the simplest approaches is to substitute sample estimates for the autocovariances in (10) and (11). Taking the usual estimator of the autocovariance matrix at lag h ≥ 0 to be

    Γ̂(h) = n^{-1} Σ_{t=1}^{n-h} X_{t+h} X_t',  with Γ̂(−h) = Γ̂(h)',  (12)

the resulting method of moments estimates are

    Φ̂_K = γ̂_K Ĝ_K^{-1},  (13)

    Û_K = Γ̂(0) − Σ_{j=1}^{m} Φ̂_K(k_j) Γ̂(k_j)'.  (14)

These are the so-called YW estimates in the full-set case, and we will refer to their subset generalization by the same name. The fitted YW SVAR model is therefore

    X_t = Φ̂_K(k_1) X_{t-k_1} + · · · + Φ̂_K(k_m) X_{t-k_m} + Z_t,  {Z_t} ∼ WN(0, Û_K).  (15)
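A minimal sketch of the subset YW estimates (13)-(14) follows, assuming the block-matrix conventions above; the helper names are ours, not those of the paper's Fortran code.

```python
import numpy as np

# A minimal sketch of the subset Yule-Walker estimates (13)-(14).  The
# helper names are ours, not those of the paper's Fortran code.
def gamma_hat(X, h):
    """Sample autocovariance Gamma_hat(h); Gamma_hat(-h) = Gamma_hat(h)'."""
    if h < 0:
        return gamma_hat(X, -h).T
    n = X.shape[0]
    return (X[h:].T @ X[:n - h]) / n

def yw_subset(X, K):
    """YW estimates (Phi_K, U_K) for the lag set K = [k_1, ..., k_m]."""
    d, m = X.shape[1], len(K)
    # G_K: (i, j) block is Gamma_hat(k_j - k_i), i, j = 1, ..., m.
    G = np.block([[gamma_hat(X, K[j] - K[i]) for j in range(m)]
                  for i in range(m)])
    g = np.hstack([gamma_hat(X, k) for k in K])   # [Gamma(k_1),...,Gamma(k_m)]
    Phi = g @ np.linalg.pinv(G)                   # (13), generalized inverse
    U = gamma_hat(X, 0) - sum(Phi[:, j*d:(j+1)*d] @ gamma_hat(X, K[j]).T
                              for j in range(m))  # (14)
    return Phi, U

# Check on data simulated from the univariate model x_t = 0.8 x_{t-3} + z_t.
rng = np.random.default_rng(2)
x = np.zeros(4300)
for t in range(3, 4300):
    x[t] = 0.8 * x[t - 3] + rng.standard_normal()
Phi, U = yw_subset(x[300:, None], [3])
print(Phi.round(1)[0, 0])                         # → 0.8
```

Note that this direct solution inverts the md × md matrix Ĝ_K, which is exactly the cost the recursive BDT Algorithm of the next subsection avoids.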

Recursive Estimation: The BDT Algorithm
Defining the empirical forward and backward prediction error residuals ε̂_K and η̂_{K*} associated with models (1) and (3) as

    ε̂_K(t) = X_t − Σ_{j=1}^{m} Φ̂_K(k_j) X_{t-k_j},   η̂_{K*}(t) = X_t − Σ_{j ∈ K*} Ψ̂_{K*}(j) X_{t+j},

Brockwell, Dahlhaus, and Trindade (2002) introduce a family of SVAR model parameter estimators based on Burg's (1968) recursive algorithm. Their BDT Algorithm takes the following form.
with initial conditions given at the single-lag nodes, and the sets J and J*, formed from the sets K and K*, respectively, by omitting k_m. The recursions construct the estimates for the models on K and K* from those already computed for the smaller sets J and J*.
A variety of different estimators can be obtained by an appropriate selection of the boxed reflection coefficient expression in (16). Brockwell, Dahlhaus, and Trindade (2002) note that one particular choice of Φ̂_K(k_m) gives precisely the YW estimators (13) and (14) (reformulated via similar recursions, the resulting Algorithm is known as Levinson-Durbin), but that selecting Φ̂_K(k_m) to be the minimizer of a weighted sum of forward and backward prediction errors tends to produce models with consistently higher Gaussian likelihoods. By selecting different weight matrices A and B, they propose a total of three additional methods: Burg, Vieira-Morf, and Nuttall-Strand, each being a plausible subset generalization of an existing full-set analogue with the same name.
The BDT Algorithm necessarily couples together the forward and backward modeling problems. Arranging the elements of K on the number line, as shown in Figure 3, allows us to better visualize this coupling. The forward set of lags, K, are simply the distances of the elements of K from the origin; the backward set of lags, K*, are the corresponding distances from k_m. Note that the YW estimator Φ̂_K(k_m), obtained from (13), requires the inversion of Ĝ_K, which is of dimension md. Recursive algorithms are better suited to searching for a best subset model with a specified maximum number of lags, and involve inversion of matrices whose dimension is at most d (d² in some instances).

Figure 3: The set of lags K = {k_1, . . . , k_m} arranged on the number line.
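For reference, the classical full-set univariate Burg algorithm, of which the BDT Algorithm is a subset generalization, can be sketched as follows. This is the standard full-set recursion, not the BDT recursion itself, and the function names are ours.

```python
import numpy as np

# For reference: the classical full-set univariate Burg recursion, of
# which the BDT Algorithm is a subset generalization.  This sketch is
# the standard full-set algorithm, not the BDT recursion itself.
def burg_ar(x, p):
    """Burg estimates (phi_1..phi_p, noise variance) for an AR(p)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    f = x.copy()                       # forward prediction errors
    b = x.copy()                       # backward prediction errors
    phi = np.zeros(0)
    E = x @ x / n                      # stage-0 error variance
    for m in range(1, p + 1):
        ff = f[m:]                     # f_{m-1}(t),   t = m..n-1
        bb = b[m - 1:n - 1]            # b_{m-1}(t-1), t = m..n-1
        k = 2.0 * (ff @ bb) / (ff @ ff + bb @ bb)   # reflection coefficient
        phi = np.append(phi - k * phi[::-1], k)     # Levinson-type update
        f[m:], b[m:] = ff - k * bb, bb - k * ff     # update both error series
        E *= 1.0 - k * k
    return phi, E

# Check on data simulated from x_t = 1.1 x_{t-1} - 0.3 x_{t-2} + z_t.
rng = np.random.default_rng(5)
x = np.zeros(3300)
for t in range(2, 3300):
    x[t] = 1.1 * x[t - 1] - 0.3 * x[t - 2] + rng.standard_normal()
phi, E = burg_ar(x[300:], 2)
print(phi.round(1))                    # close to [1.1, -0.3]
```

The reflection coefficient here minimizes the (equally weighted) sum of forward and backward prediction error variances, which is the full-set analogue of the boxed choice in (16).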

Building a binary tree of linked nodes
The recursive solution of the equations defining the BDT Algorithm generates a collection of estimators of SVAR models of increasing orders, until the required order is reached. Suppose modeling on the set of lags {1, 3, 7} is desired. To determine where application of the Algorithm should begin, we first need to work down to derived subsets of lags comprised of just one lag. This is done by successively forming the J and J* subsets of lags for each parent set of lags K, as shown in Figure 4. Each of the subsets J and J* then assumes the role of a parent lag set, K, and the procedure is repeated. In the resulting binary tree structure, we will refer to all the modeling information pertaining to a set of lags as a node. The number of lags in a node will define its level in the tree. The strategy for this recursive tree-building will then be as follows:

    while level > 1 do
        for each node in current level do
            compute lags in J and J* subnodes
            direct pointers to J and J* subnodes
        od
        level := level - 1
    od

For programming, it will be necessary to rewrite the backward model equations (17), (18), (19), and (21) in the unstarred lags format obtained by noting that (K*)* = K, and (J*)* = J. The unstarred lags representation clarifies how the backward model estimates should be computed for any given node. Letting node be a user-defined derived data type, and K = {k_1, . . . , k_m} denote a generic set of lags for a given node, this data type should then consist of the following components:
(i) lags - vector containing the current set of lags, K, on which modeling is desired (type integer).
(ii) level - scalar specifying the level of the node in the tree (type integer).
(iii) Φ_K, Ψ_K - estimates of the coefficient matrices for the forward and backward modeling problems (type real).
(iv) U_K, V_K - estimates of the white noise covariance matrices for the forward and backward modeling problems (type real).
(v) ε_K(t), η_K(t) - vectors of prediction error residuals for the forward and backward modeling problems (type real).
(vi) reg, str - pointers to the J and J* subnodes, respectively, one level below the current level (type node, defined recursively).
The tree can be linked by recursive calls to a RECURSIVE SUBROUTINE, with level, lags, and a pointer of type node as arguments. This routine should also set a flag to signal when a particular node has been initialized (linked in the list) but the remaining components, (iii)-(vi), not yet evaluated (node unfilled). The flag, for example setting the first row and column entry of U_K to zero, will be used by a subsequent node-filling routine. Two pointers should emanate from each node, reg pointing to J, and str to J*. At level 1, these pointers should point nowhere (NULLIFY). From Figure 4, we note that both nodes at level 2 have the set {2} as their J* subnode. Two copies of this subnode can be made, each linked to its appropriate parent node. A more complex program could avoid this duplication of nodes; with duplication, exactly 2^m − 1 nodes are created for modeling on m initial lags.
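The node structure and tree-building strategy above can be sketched in Python, with object references playing the role of Fortran pointers; class and field names are illustrative (str is renamed str_ to avoid the Python built-in).

```python
from dataclasses import dataclass
from typing import List, Optional

# Sketch of the pointer-linked tree of lag-set nodes; class and field
# names are illustrative, and object references play the role of the
# Fortran POINTER components.
@dataclass
class Node:
    lags: List[int]
    level: int
    reg: "Optional[Node]" = None    # J  subnode, one level down
    str_: "Optional[Node]" = None   # J* subnode, one level down
    filled: bool = False            # the "unfilled" flag

def build_tree(lags):
    """Recursively link the tree below the node with lag set `lags`."""
    node = Node(lags=list(lags), level=len(lags))
    if len(lags) > 1:
        km = lags[-1]
        J = lags[:-1]                                 # K with k_m omitted
        Jstar = sorted(km - k for k in lags[:-1])     # K* with k_m omitted
        node.reg = build_tree(J)
        node.str_ = build_tree(Jstar)
    return node                     # level-1 nodes keep reg = str_ = None

top = build_tree([1, 3, 7])
print(top.reg.lags, top.str_.lags)            # → [1, 3] [4, 6]
print(top.reg.str_.lags, top.str_.str_.lags)  # → [2] [2]
```

The last line exhibits the duplication noted in the text: both level-2 nodes carry {2} as their J* subnode.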

Filling the nodes
Once the tree with all appropriate linking pointers is in place, we will need to evaluate the remaining components, (iii)-(vi), of each node. This can be done by "walking" through the tree, following the linked list of nodes. Once again, a RECURSIVE SUBROUTINE taking a pointer as argument can be employed to achieve this, first verifying via the flag alluded to earlier that each node has not yet been filled.
The recursion should be implemented in such a way that the tree is walked to level 1, where the node-filling can begin. From Figure 4, we note that this involves filling nodes {1}, {2}, and {4} first. With these, we can now fill nodes {1, 3} and {4, 6} at level 2. The operation terminates at the top node, {1, 3, 7}, if the pointer to this node is passed as the original argument to the recursive node-filling subroutine. The pseudo-code for this phase of the implementation could therefore be:

    call node_filling_routine with pointer to top node as argument
    while level of current node > 1 do
        call node_filling_routine with reg pointer as argument
        if current node unfilled then fill it fi
        call node_filling_routine with str pointer as argument
        if current node unfilled then fill it fi
    od
    if level of current node = m then fill top node fi

This coding will give filling-precedence to nodes at low levels that emanate from parents with respect to which they are J subnodes. In the example of Figure 4, this results in the filling order {1}, {2}, {1, 3}, {4}, {2}, {4, 6}, {1, 3, 7}. Note that in the univariate case there is no distinction between forward and backward model parameters for the same set of lags; that is, the YW equations give Û_K ≡ V̂_K, and Φ̂_K ≡ Ψ̂_K, for any set K. Both this and the fact that all parameters are scalars greatly simplify the programming task when d = 1.
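The filling-precedence just described can be checked with a small self-contained sketch that records the order in which the nodes of the {1, 3, 7} tree would be filled; function names are illustrative.

```python
# Self-contained sketch of the node-filling walk: both subnodes are
# filled before their parent, with the J (reg) branch walked before the
# J* (str) branch.  Function names are illustrative.
def subnodes(K):
    """Return the lag sets of the J and J* subnodes of K."""
    km = K[-1]
    return K[:-1], sorted(km - k for k in K[:-1])

def fill_order(K, order=None):
    """Record the order in which nodes below (and including) K are filled."""
    if order is None:
        order = []
    if len(K) > 1:
        J, Jstar = subnodes(K)
        fill_order(J, order)        # walk the reg (J) branch first
        fill_order(Jstar, order)    # then the str (J*) branch
    order.append(K)                 # fill this node after its subnodes
    return order

print(fill_order([1, 3, 7]))
# → [[1], [2], [1, 3], [4], [2], [4, 6], [1, 3, 7]]
```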

Examples
Included in Appendix C are BDT.F90 and BDT2.F90. These are, respectively, univariate and bivariate Fortran 90 programs implementing the BDT Algorithm. The programs utilize a few linear algebra subroutines in the International Mathematical and Statistical Library (IMSL). In this section we document how to run the programs in order to fit a particular SAR/SVAR model to a given data set, and illustrate some potential meta applications that involve repeated modeling with each of the programs in the inner loop. (We have not provided the programs for Examples 3-5, but they are available from the author upon request.)

Example 2: Running BDT2.F90
To fit the bivariate SVAR model to the sun2.tsm data (Appendix B), we ran the compiled version of BDT2.F90, supplying the required inputs at the prompt (>).

Example 3: Best Subset Searching
The lynx data of Figure 2 is often cited in the literature in connection with SAR modeling. Using YW estimation and their own respective non-exhaustive search algorithms, Tong (1977), Penm and Terrell (1982), Zhang and Terrell (1997), and others identify a SAR(1, 2, 4, 10, 11), i.e. K = {1, 2, 4, 10, 11}, as the best SAR model according to a variety of information criteria. It is important to realize that some of these search methods are non-exhaustive and statistical in nature, and will therefore not guarantee a correct identification with certainty.
Using the BDT Algorithm, we performed an exhaustive search for the minimum AICC SAR model for the mean-corrected base 10 logarithms of the lynx data. Letting p denote the maximum lag considered in the search (meaning that 2^p models had to be checked), we considered p = 4, 8, and 12, in turn. This set of searches was performed for each of the YW and Burg estimation methods. The number of subsets out of the 2^p that resulted in non-causal fitted models, as well as the corresponding computational (CPU) times taken by each search, were recorded. The AICC (RSS/n) of the best SAR model was also computed. The results are summarized in Table 1. The best SAR model with maximum lag 12 found by the YW method coincides with that identified by other researchers as discussed above; the model arrived at by the new subset Burg method adds lag 3, and has a slightly lower value of AICC. Note also that the proportion of subsets resulting in non-causal fitted models (meaning that a SAR model with these lags is inappropriate for the data) remained steady at approximately 1/3 across all searches. At the same value of p, CPU times for Burg are slightly higher than those for YW, both growing exponentially with p. The computations were carried out on a Sun Enterprise 450 unix server, equipped with about 4G of memory. The most severe limiting factor in this type of computation is available memory, since at least 2^p pointers have to be allocated.
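An exhaustive best-subset search of this kind can be sketched as follows, substituting ordinary least squares for the BDT recursions and one common form of AICC as the criterion; all function names and the toy data are illustrative, not the lynx analysis itself.

```python
import numpy as np
from itertools import combinations

# Sketch of an exhaustive best-subset search over all 2^p - 1 nonempty
# lag sets, substituting ordinary least squares for the BDT recursions
# and one common form of AICC as the criterion; names are illustrative.
def fit_sar_ols(x, K):
    """Least-squares fit of x_t = sum_{k in K} phi_k x_{t-k} + z_t."""
    km = max(K)
    y = x[km:]
    Z = np.column_stack([x[km - k:len(x) - k] for k in K])
    phi, *_ = np.linalg.lstsq(Z, y, rcond=None)
    rss = float(np.sum((y - Z @ phi) ** 2))
    return phi, rss

def aicc(rss, n, m):
    """AICC = n log(RSS/n) + 2(m + 1)n / (n - m - 2)."""
    return n * np.log(rss / n) + 2.0 * (m + 1) * n / (n - m - 2)

def best_subset(x, p):
    n = len(x)
    best_crit, best_K = np.inf, None
    for m in range(1, p + 1):
        for K in combinations(range(1, p + 1), m):
            _, rss = fit_sar_ols(x, list(K))
            crit = aicc(rss, n, m)
            if crit < best_crit:
                best_crit, best_K = crit, K
    return best_crit, best_K

# Toy check on data simulated from the SAR({3}) model x_t = 0.8 x_{t-3} + z_t.
rng = np.random.default_rng(3)
x = np.zeros(1500)
for t in range(3, 1500):
    x[t] = 0.8 * x[t - 3] + rng.standard_normal()
crit, K = best_subset(x[300:], 4)
print(3 in K)                       # → True: lag 3 is always selected
```

The double loop makes the 2^p growth of the search explicit; the BDT implementation replaces the inner least-squares fit with the order-recursive update.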
In Table 2, we present the parameter estimates of the best SAR model identified by each respective method when p = 12. The constrained ML estimates were obtained via the ITSM2000 package (Brockwell and Davis, 2002), starting with the Burg estimates. Table 2: Best SAR models fitted to the mean-corrected base 10 logarithms of the Canadian lynx data, as identified by each respective method. The Maximum Likelihood estimates were obtained by starting with the Burg estimates, and constraining the ML search to the same SAR lags.


Examples 4 and 5: Adaptive Behavior
In these simulated bivariate examples, we illustrate the component-wise convergence of the Burg and YW estimates to their true values, as a function of observation number or time. This adaptive behavior is important in online applications where it is desirable to monitor the convergence of the estimates, particularly when observation number is still low. Note however that the BDT Algorithm is recursive in the model order, not observation number, and is therefore not truly adaptive in that sense. The entire Algorithm needs to be re-run from scratch whenever a new observation becomes available.
The characteristic polynomial of SVAR model (1) is

    P(z) = det(I − Φ_K(k_1) z^{k_1} − · · · − Φ_K(k_m) z^{k_m}).

The model is causal if the zeroes of its characteristic polynomial are all greater than one in magnitude. It is well-known that in the univariate full-set case, the YW estimators can be severely biased if the roots of the AR characteristic polynomial are close to the unit circle (quasi-non-stationarity).
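In the univariate case the causality condition can be checked directly from the zeroes of P(z), as in this sketch (the function name is ours):

```python
import numpy as np

# Sketch: check causality of a univariate SAR model directly from the
# zeroes of P(z) = 1 - sum_{k in K} phi_k z^k.
def is_causal(K, phis):
    km = max(K)
    coeffs = np.zeros(km + 1)          # coeffs[k] multiplies z^k
    coeffs[0] = 1.0
    for k, phi in zip(K, phis):
        coeffs[k] = -phi
    roots = np.roots(coeffs[::-1])     # np.roots wants highest degree first
    return bool(np.all(np.abs(roots) > 1.0))

# The Figure 1 process X_t - 0.99 X_{t-3} = Z_t is causal ...
print(is_causal([3], [0.99]))          # → True
# ... but pushing the coefficient past 1 moves a zero inside the circle.
print(is_causal([3], [1.01]))          # → False
```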
To allow for the expected dependence of performance on the location of the zeroes of P (z), we considered causal models with different configurations of these zeroes. A total of 250 observations were sequentially simulated from the basic SVAR(2) model, and we started estimation at observation number 10.
The results are displayed in Figure 5, where we plot the component-wise departures of the estimated SVAR coefficient matrices from their true values, Φ̂_ij − Φ_ij, i, j = 1, 2, as a function of observation number. The pattern of convergence between the two methods is similar in Example 4, but dramatically different in Example 5. Although based on a single simulated realization, presented only to illustrate a meta application of the bivariate BDT Algorithm, this phenomenon is nevertheless consistent with what has been noted about the behavior of YW versus Burg in quasi-non-stationary modeling. Since both estimators and the MLE all have the same asymptotic distribution, there is little cause for concern with large samples; it is with small samples that one should exercise caution when selecting an estimation method. The more extensive analysis by Brockwell, Dahlhaus, and Trindade (2002) suggests that Burg is in general a better estimator than YW.
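The re-run-from-scratch monitoring scheme can be sketched for a univariate AR(1) with the lag-1 YW estimator; the model, seed, and sample sizes are illustrative choices, not those of Examples 4 and 5.

```python
import numpy as np

# Sketch of the monitoring scheme: the estimator is re-run from scratch
# at each sample size n, since the Algorithm is recursive in model
# order, not in time.  Model and sample sizes are illustrative.
rng = np.random.default_rng(4)
phi_true, N = 0.7, 1000
x = np.zeros(N)
for t in range(1, N):
    x[t] = phi_true * x[t - 1] + rng.standard_normal()

def yw_phi1(x):
    """Lag-1 Yule-Walker estimate gamma_hat(1) / gamma_hat(0)."""
    n = len(x)
    return (x[1:] @ x[:-1] / n) / (x @ x / n)

# One full re-estimation per monitoring point, exactly as in the text.
estimates = [yw_phi1(x[:n]) for n in range(10, N + 1, 10)]
print(abs(estimates[-1] - phi_true) < 0.15)   # → True: settles near 0.7
```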

Conclusion
We have discussed the popularity of multivariate subset autoregressive models, and highlighted the importance of fast, simple, and efficient methods for the estimation of their parameters. One such set of estimators is obtained via the classical Yule-Walker method-of-moments, which we have presented as the solution to a system of simultaneous linear matrix equations. A recently introduced more general estimation method, the BDT Algorithm, is recursive in the order of the fitted model, thus avoiding the (potentially large) matrix inversions required in solving the Yule-Walker equations. By suitably modifying the reflection coefficient calculation, this Algorithm can produce a variety of estimators with different finite sample properties, among them Yule-Walker. We have illustrated the recursive structure of this Algorithm, and discussed its implementation in a high-level programming language like Fortran 90. The speed of the Algorithm was assessed in finding a best subset model for the Canadian lynx data, and shown, in problems of moderate size, to be a feasible alternative to non-exhaustive search techniques which do not guarantee correct subset identification. We concluded with two simulated bivariate examples that illustrate the adaptive performance of the Yule-Walker and Burg estimators, implemented via the BDT Algorithm. We find that the Burg estimates tend to stabilize more quickly than Yule-Walker, and are far less affected by proximity of the model to non-stationarity.

Acknowledgements
The author would like to acknowledge NSF for partial support of this research through grant number DMS-9972015, and the invaluable suggestions provided by the referee who also reviewed the code.

A Description of Principal Program Subroutines
As already stated, the core of the subset modeling programs Burg and Burg2 is the globally visible MODULE Tree, with SUBROUTINE Make_Tree its driving subroutine. In this section, we will provide a brief description of the essential functions of its main constituent subroutines.

A.1 Build_Node_Tree
This is a RECURSIVE SUBROUTINE that initializes the tree of nodes by allocating pointers to and from nodes. It takes the level, lags, and a pointer of type node as arguments. It begins execution at the unique node of level m (the top node), creating pointers to the J and J* subnodes (this_node%reg and this_node%str, respectively). Following these pointers to level m − 1, Build_Node_Tree subsequently allocates pointers to the subnodes in level m − 2. It achieves this by calling itself with the appropriate arguments: level should be the current level minus one, together with the pointers this_node%reg and this_node%str. The procedure is repeated, always following pointer this_node%reg before this_node%str, until level 1 is reached. At this point, the two pointers are initialized and made to point nowhere (NULLIFIED). By the order of precedence inherent in it, the routine then backs up one level and proceeds to follow pointer this_node%str to the "dead end" at level 1.
In this fashion, the tree is initialized from left (J) to right (J*), with the pointer to the subnode J* of the rightmost node being allocated last. Referring back to Figure 4, the nodes for the tree of this example will be initialized in the order {1, 3, 7}, {1, 3}, {1}, {2}, {4, 6}, {4}, {2}. In order for subsequent routines to identify an initialized but unfilled (constituents of node empty) node, Build_Node_Tree will set this_node%v (this_node%vf%mat(1,1) in BDT2.F90) to zero, upon allocation of pointers.

A.2 Fill_Tree
A RECURSIVE SUBROUTINE taking a pointer of type node as argument. Its function is to traverse the now initialized tree and, using the flag for an unfilled node, fill each node by calling Fill_Node.

A.3 Fill_Node
A RECURSIVE SUBROUTINE, called by Fill_Tree, whose function is to fill the particular node that its pointer argument points to. It is in this routine that the BDT Algorithm proper is applied, modifying the reflection coefficient calculation according to the selected method. Care must be taken when calculating the forward and backward prediction errors, ε_K(t) and η_K(t), before termination of the routine. Each should be calculated over a sufficiently large range of t values (1 ≤ t ≤ n + k_m for Yule-Walker, and 1 + k_m ≤ t ≤ n for the remaining methods), since subsequent nodes may use them.

A.4 Print_Node_Tree

With its pointer argument, the RECURSIVE SUBROUTINE Print_Node_Tree will traverse the now completed tree of nodes, and proceed to print the estimated coefficients and white noise variance stored in each node.

A.5 Causal_Check
This routine is needed in the bivariate program only, in order to ensure that the fitted VAR model is causal before proceeding with the likelihood calculations. In the univariate program, this function is performed within the likelihood calculation routine itself. The strategy is to use the state-space representation to write a VAR(p) as a VAR(1), as follows. Random vectors {X_t, . . . , X_{t-k_m}} from model (1) will satisfy the relationships

    X_t = Φ_K(k_1) X_{t-k_1} + · · · + Φ_K(k_m) X_{t-k_m} + Z_t,
    X_{t-j} = X_{t-j},  j = 1, . . . , k_m − 1,

which can be written in the compact form

    Y_t = A Y_{t-1} + W_t,  (28)

where Y_t = (X_t', X_{t-1}', . . . , X_{t-k_m+1}')' and W_t = (Z_t', 0', . . . , 0')'. In block-matrix form, the vectors Y_t and W_t have block length k_m (dimension dk_m), while the square matrix A has block dimension k_m. Note that the only nonzero entries of the first block-matrix row of A are {Φ_K(k_1), Φ_K(k_2), . . . , Φ_K(k_{m-1}), Φ_K(k_m)}, occurring at block-matrix column numbers {k_1, k_2, . . . , k_{m-1}, k_m}, respectively; below the first block row, A carries identity matrices on the block subdiagonal and zeroes elsewhere. The covariance matrix of W_t is the block-diagonal matrix diag(Σ, 0, . . . , 0), where we use Σ in place of U_K. Equation (28) is now a VAR(1) of dimension dk_m, and its causality (and thus that of the original process) can be assessed by determining if all eigenvalues of A are less than 1 in absolute value.
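The companion-matrix construction and eigenvalue check described here can be sketched as follows; the example coefficient matrices and function names are illustrative.

```python
import numpy as np

# Sketch of the state-space causality check: embed the SVAR in a VAR(1)
# of block dimension k_m and inspect the eigenvalues of A.  The example
# coefficient matrices are illustrative.
def companion(K, Phis, d):
    """A in Y_t = A Y_{t-1} + W_t; Phis[i] is Phi_K(K[i]), each d x d."""
    km = max(K)
    A = np.zeros((d * km, d * km))
    for k, Phi in zip(K, Phis):
        A[:d, d * (k - 1):d * k] = Phi       # first block row, block col k
    # identity blocks on the subdiagonal shift the stacked lags down by one
    A[d:, :d * (km - 1)] = np.eye(d * (km - 1))
    return A

def is_causal(K, Phis, d):
    eigs = np.linalg.eigvals(companion(K, Phis, d))
    return bool(np.all(np.abs(eigs) < 1.0))

# Bivariate example with lag set K = {1, 3}.
Phi1 = np.array([[0.4, 0.1], [0.0, 0.3]])
Phi3 = np.array([[0.2, 0.0], [0.1, 0.2]])
print(is_causal([1, 3], [Phi1, Phi3], d=2))  # → True
```

The eigenvalues of A are the reciprocals of the zeroes of P(z), so the condition "all eigenvalues inside the unit circle" is equivalent to the zero condition stated for the characteristic polynomial.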