In this vignette, I will introduce you to the main features of the
ldt
package for dealing with Seemingly Unrelated Regression
models. I will demonstrate how to perform common tasks such as
estimating an SUR model and automatically finding significant
coefficients. I will also discuss model uncertainty and how to define an
SUR model set and automatically search for the best models within this
set. Additionally, we will explore the use of Principal Component
Analysis as an alternative approach when dealing with a large number of
potential explanatory variables.
One of the main ideas behind ldt
is to minimize user
discretion. As such, an analysis in ldt
is generally based
on a dataset and a set of rules that convert this dataset into a list of
potential regressors and/or predictors. This rule-based approach to
selecting data not only avoids discretion but is also expected due to
the word “automatically” used in the previous paragraph.
In this example, I will create an artificial dataset with dependent variables and both relevant and irrelevant explanatory variables. The dependent and relevant explanatory variables are a sample from a known SUR model. While we can discuss how well the estimation process can find the true parameters, this is not the main goal, here. Instead, I will focus on how to estimate, search, project, and report the results.
Let’s get started!
Let’s start by assuming that we know the structure of the system. We can do this by simulating data from a known SUR model. The following command generates a sample from such a system of equations:
num_obs <- 100
num_eq <- 2L
num_exo <- 3L
sample <- sim.sur(num_eq, num_exo, num_obs, TRUE)
print(sample$coef)
> Y1 Y2
> Intercept -0.5604756 0.07050839
> X1 -0.2301775 0.12928774
> X2 1.5587083 1.71506499
We know the parameters of the system because they are included in the
output of the sim.sur
function. This system has 2 equations
or dependent variables. Each equation has an intercept and 2 exogenous
variables. There are no restrictions imposed on the system. The sample
size is 100. All coefficients of the model are generated randomly and
are listed in the output sample
. This includes a positive
definite matrix of regression covariance and a matrix of coefficients
(where the first row includes the intercept coefficients). The LaTeX
code for the equations of the system is in the eqsLatex
element. It results in the following representation:
Furthermore, the matrix representation is in the
eqsLatexSys
element and it results in the following
formula:
Remember that these are the parameters of the system.
We can use the systemfit
package to estimate the model.
This includes preparing the equations of the system (the first two lines
of the following code) and estimating (the third line):
exp_names <- paste0(colnames(sample$x), collapse = " + ")
fmla <- lapply(1:ncol(sample$y),
function(i) as.formula(paste0("Y", i, " ~ -1 +", exp_names)))
fit_sf <- systemfit::systemfit(fmla, data = data.frame(sample$y, sample$x),
method = "SUR")
fit_sf_sys <- sim.sur(fit_sf$residCov,
matrix(fit_sf$coefficients, ncol = 2), 10, TRUE)
The last line is used for reporting purposes. Like before, I report the LaTeX formula:
\[\begin{aligned} \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = \begin{bmatrix} -0.47 & -0.28 & 1.59 \\ 0.05 & 0.10 & 1.71 \end{bmatrix} \begin{pmatrix} 1 \\ X_1 \\ X_2 \end{pmatrix} + \begin{pmatrix} E_1 \\ E_2 \end{pmatrix}, \Sigma = \begin{bmatrix} 1.72 & 0.19 \\ 0.19 & 0.67 \end{bmatrix} \end{aligned}\]Note that this representation is not very appropriate for the estimated model because it does not report the estimated standard errors and there are better ways to do so. However, it suits our purpose here. Also, note that you can get better results (in terms of lower difference between actual and estimated coefficients) by increasing the sample size.
The following code does the same by using the estim.sur
function in the ldt
package:
fit <- estim.sur(sample$y, sample$x[,-1], addIntercept = TRUE)
res_table <- coefs.table(fit, regInfo = c("obs", "sigma2"),
formatLatex = FALSE)
The second line converts the estimated result into a table for
presentation. It has some other arguments that can be used to control
the format or level of information in the table. Here is the result of
reporting the contents of the table with kable
function:
Y1 | Y2 | |
---|---|---|
Intercept | -0.47*** | 0.05 |
X1 | -0.28* | 0.10 |
X2 | 1.59*** | 1.71*** |
obs | 100 | 100 |
sigma2 | 1.67 | 0.65 |
Difference between the results of systemfit
and
estim.sur
functions (apart from other factors such as
difference in the estimator or the initialization of covariance matrix
in FGLS estimation) here is probably due to the adjustment of degrees of
freedom.
Before we move on from this section, let me introduce the coefs.plot function in the ldt package. This function can be used to graphically plot the estimated coefficients or to compare them with the actual parameters. In this example, we will focus on the coefficient of X2 in the first equation. The following code can be used to plot it:
coefs <- fit$estimations$coefs
stds <- fit$estimations$stds
coefs.plot(intervals = list(list(value = coefs[3,1], label = "X2->Y1", col = "red",
xmin = coefs[3,1] - 2 * stds[3,1],
xmax = coefs[3,1] + 2 * stds[3,1])),
points = list(list(value = sample$coef[3,1], col = "blue",
label = "Actual")),
xlab = "Value", ylab = "")
In the previous subsection, we knew the exact structure of the
system’s equations. Apart from common assumptions such as linearity and
normality, we also knew the explanatory variables in the equations.
Unless you have selected a very low number of observations
(num_obs
parameter), the estimated parameters are
significant and relatively close to the actual values.
However, this may not be realistic enough. Let’s consider a situation where we do not know the explanatory variables of the equations. We assume that there are some irrelevant variables in the system. Of course, we limit the level of uncertainty and other practical issues by restricting the number of these variables. The following code reflects our assumptions:
num_obs <- 100
exo <- matrix(c(-1,1,-0.5,0,0,0,0,1,-1,0.5), ncol = 2)
sample <- sim.sur(ncol(exo), exo, num_obs, TRUE)
As you can see, there are 4 zero restrictions on the system’s parameters. The first dependent variable is not a function of the last two exogenous variables and the second dependent variable is not a function of the first two exogenous variables.
Similar to the previous section, we can estimate this model using
systemfit
. However, I will focus on ldt
. In
this section, I ask ldt
to search for significant
coefficients in the system. See the following code:
fit <- estim.sur(sample$y, sample$x[,-1], addIntercept = TRUE,
searchSigMaxIter = 10, searchSigMaxProb = 0.05)
res_table <- coefs.table(fit, regInfo = c("obs", "sigma2"),
formatLatex = FALSE)
There are two new arguments in the estimation function:
searchSigMaxIter
and searchSigMaxProb
. This
means that a significant coefficient is one with t-statistics larger
than , and we are omitting insignificant variables and re-estimating the
regression at most times. As before, the second line converts the
estimated result into a table for presentation. The result is reported
in the following table:
Y1 | Y2 | |
---|---|---|
Intercept | -0.95*** | 0(r) |
X1 | 1.02*** | 0(r) |
X2 | -0.45** | 1.00*** |
X3 | 0(r) | -0.94*** |
X4 | 0(r) | 0.48*** |
obs | 100 | 100 |
sigma2 | 3.43 | 0.85 |
In the previous subsection, we had all the relevant information about
the system, except for the error term generated in sim.sur
.
Unless you have selected a very low number of observations
(num_obs
parameter) or the system is relatively large or
complex, you should be able to find the structure or actual parameters.
Note that we are not discussing the power of t-test or identification
issues.
However, this setup may not be realistic enough. In reality, there is
much uncertainty about the choice of variables and theory might not be
enough to reduce it significantly. As mentioned in the introduction,
ldt
tries to minimize user discretion and automate things.
This might include defining rules over an external dataset. Although I
used external datasets in previous versions of ldt
vignettes, I decided to rely on simulated data to reduce confusion. To
simulate a dataset, I will add some irrelevant data to the analysis in
this section. Let’s use the data from the previous subsection, but with
an additional line that overrides the matrix of explanatory data in the
sample and adds many irrelevant data:
sample$x <- cbind(sample$x, matrix(rnorm(num_obs * 50), ncol = 50,
dimnames = list(NULL,paste0("z",1:50))))
Therefore, there are 53 irrelevant and 2 relevant variables in the sample. Note that the number of irrelevant data is relatively large.
First, let’s see how significance search works in this situation. As before, we use:
fit <- estim.sur(sample$y, sample$x[,-1], addIntercept = TRUE,
searchSigMaxIter = 10, searchSigMaxProb = 0.05)
res_table <- coefs.table(fit, regInfo = c("obs", "sigma2"), formatLatex = FALSE,
expList = c("Intercept", "X1", "X2", "X3", "X4",
"z3", "z6", "z14", "z32", "z42", "z47"))
Y1 | Y2 | |
---|---|---|
Intercept | -0.59*** | 0.18** |
X1 | 1.05*** | 0(r) |
X2 | 0(r) | 1.18*** |
X3 | 0(r) | -0.94*** |
X4 | 0(r) | 0.49*** |
z3 | 0.46*** | 0.24*** |
z6 | -0.20*** | 0(r) |
z14 | -0.35** | -0.21*** |
z32 | 0(r) | 0.09** |
z42 | -0.50*** | -0.25*** |
z47 | -0.38** | -0.18** |
obs | 100 | 100 |
sigma2 | 2.70 | 0.64 |
To reduce the length of the table, I ran coefs.table
command twice: first without specifying expList
, and after
finding the name of unrestricted coefficients, with
expList
. With this top-to-bottom approach, we can get a
more parsimonious model. However, you can see the effect of variable
selection uncertainty in the table: a relevant variable is omitted and
some irrelevant variables’ coefficients are significant.
Given a specific goodness-of-fit criterion or performance metric,
there is a function in ldt
that can do brute-force search
to find the best model. It systematically and efficiently tries to
estimate and evaluate all possible combinations of variables and selects
the best one. We can discuss whether this best model is the true model
or not. A more important issue is that while theoretically attractive,
for large datasets this approach is not computationally feasible due to
the large number of possible combinations of variables and therefore the
relatively large size of the theoretical model set.
Assuming that we know the maximum number of relevant variables, the code to search the theoretical model set is as follows:
search_res <- search.sur(sample$y, sample$x, numTargets = ncol(sample$y),
xSizes = c(1:4),
metricOptions = get.options.metric(typesIn = c("sic")))
search_res
This code is time-consuming and is not evaluated here. However, on my system, the elapsed time is 17 seconds and the number of searched models is 368830. Note that if we change our previous guess and assume that the maximum number of relevant variables is larger, for example 5, the size of the practical model set becomes 3847591 (10 times larger) and this is estimated in 210 seconds (12 times larger). We are using SIC as the performance metric and these elapsed times increase if we rely on out-of-sample simulation and their corresponding metrics. Therefore, we need to focus on some subsets of the model set. But how should we prioritize one part of the model set over another before the estimation and evaluation stage?
One might reduce the number of potential explanatory variables by
using theory or statistical testing, similar to the top-to-bottom
approach we talked about before. Since ldt
dislikes user
discretion, it provides a more systematic approach. The idea behind it
is simple: smaller models are estimated, variables are selected, larger
models are estimated with a lower number of potential explanatory
variables. Here is the code:
x_size_steps = list(c(1, 2), c(3), c(4), c(5))
count_steps = c(NA, 20, 10, 9)
search_step_res <-
search.sur.stepwise(sample$y, sample$x, numTargets = ncol(sample$y),
xSizeSteps = x_size_steps, countSteps = count_steps,
metricOptions = get.options.metric(typesIn = c("sic", "aic")),
searchItems = get.items.search(bestK = 10)
)
search_step_res
> method: sur
> expected: 1,876, searched: 1,876 (100%), failed: 0 (0%)
> elapsed time: 0.06728618 minutes
> --------
> 1. sic:
> Y1 (best=514.36)
> Y2 (best=514.36)
> 2. aic:
> Y1 (best=488.308)
> Y2 (best=488.308)
The first two lines define the steps. We use all variables
(NA
in count_steps
means all) to estimate
models with sizes defined as the first element of
x_size_steps
(1, 2). Then we select 20 number of variables
(2nd element in count_steps) from the information provided by the best
models and estimate models with sizes determined by the second element
of x_size_steps
(3). And so on.
The size of the model subset and running time are greatly reduced. However, let’s see its performance.
To study or report results, we should use the summary
function. The output of a search project in ldt
does not
contain estimation results but only the minimum level of information to
replicate them. The summary
function does the job and
estimates the models. Here is the code:
Usually, there is more than one model in the summary. This is because
the output is first “target-variable-specific” and second
“evaluation-specific”. I report some of them by creating a list of
estimated models and like before by using coefs.table
function:
mod_list <- list("SIC" = ssum$sic$target1$model$bests$best1,
"AIC" = ssum$aic$target1$model$bests$best1)
res_table <- coefs.table(mod_list, regInfo = c("obs", "sigma2"), formatLatex = FALSE)
Like before, the second line gets the coefficient table for presentation.
SIC | SIC | AIC | AIC | |
---|---|---|---|---|
dep. | Y1 | Y2 | Y1 | Y2 |
Intercept | -0.74*** | 0.11 | -0.74*** | 0.11 |
X1 | 1.11*** | 0.04 | 1.11*** | 0.04 |
X2 | -0.45** | 1.01*** | -0.45** | 1.01*** |
X3 | -0.31* | -1.09*** | -0.31* | -1.09*** |
X4 | -0.06 | 0.45*** | -0.06 | 0.45*** |
obs | 100 | 100 | 100 | 100 |
sigma2 | 3.28 | 0.82 | 3.28 | 0.82 |
Note that the headers indicate the metric used to compare the models. The first row shows the name of the target variable in the search.
The algorithm has omitted all irrelevant variables and found the restrictions on the second equation. However, for the first equation, X3 is significant at the 90% confidence level, even though it should be insignificant.
It’s important to note that the results are specific to the selected seed. Drawing general conclusions is beyond the scope of this text.
In this section, we will discuss another aspect of ldt
:
out-of-sample evaluation. Since ldt
is rather atheoretical,
it tries to compare models based on their predictive power. Of course,
we are not generally talking about time and future in SUR models, and by
out-of-sample evaluation, we are referring to the cross-validation
practice.
The following code is similar to the code in the previous section, but we define a out-of-sample process:
metric_options <- get.options.metric(typesOut = c("rmse", "crps"),
seed = -seed, simFixSize = 5, trainRatio = 0.75)
search_step_res <-
search.sur.stepwise(sample$y, sample$x, numTargets = ncol(sample$y),
xSizeSteps = x_size_steps, countSteps = count_steps,
metricOptions = metric_options,
searchItems = get.items.search(bestK = 10)
)
search_step_res
> method: sur
> expected: 1,926, searched: 1,926 (100%), failed: 0 (0%)
> elapsed time: 0.06706197 minutes
> --------
> 1. rmse:
> Y1 (best=1.704)
> Y2 (best=0.966)
> 2. crps:
> Y1 (best=0.969)
> Y2 (best=0.551)
We use 0.75 ratio of the observations (determined by
trainRatio
) for estimating and the rest for testing, and we
repeat this experiment 5 times (determined by simFixSize
).
I report the result similar to the previous discussion:
ssum <- summary(search_step_res, y = sample$y, x = sample$x)
mod_list <- list("RMSE_T1" = ssum$rmse$target1$model$bests$best1,
"RMSE_T2" = ssum$rmse$target2$model$bests$best1,
"CRPS_T1" = ssum$crps$target1$model$bests$best1,
"CRPS_T2" = ssum$crps$target2$model$bests$best1)
res_table <- coefs.table(mod_list, regInfo = c("obs", "sigma2"), formatLatex = FALSE)
res_table <- res_table[,c(1,4,5,8)]
Please note that a column name such as RMSE_T1
indicates
that the column belongs to a model selected based on the “RMSE”
criteria, and the evaluation pertains to the first target variable. The
last line is included for presentation purposes and serves to remove any
unimportant columns. Below is the table of results:
RMSE_T1 | RMSE_T2 | CRPS_T1 | CRPS_T2 | |
---|---|---|---|---|
dep. | Y1 | Y2 | Y1 | Y2 |
Intercept | -0.61*** | -0.61*** | ||
X1 | 1.16*** | 1.16*** | ||
z35 | -0.35* | -0.35* | ||
z42 | -0.49*** | -0.49*** | ||
z47 | -0.55*** | -0.55*** | ||
X2 | 1.00*** | 1.00*** | ||
X3 | -1.05*** | -1.05*** | ||
obs | 100 | 100 | 100 | 100 |
sigma2 | 3.00 | 1.02 | 3.00 | 1.02 |
You may achieve better results by modifying the arguments of the
get.options.metric function and using the
get.items.modelcheck
function to change the default value
of searchItems
(Recall that our goal is to exclude
irrelevant z?
variables and find zero restrictions). One
important parameter to consider is the number of simulations,
simFixSize
. Its current value is 5. By increasing this
value, you can allow the models to demonstrate their capabilities more
effectively during the search process. Of course, increasing the
simFixSize
is expensive!
When searching and estimating a model set, ldt
gathers
various types of information, including the coefficients in the context
of seemingly unrelated systems. You can adjust the arguments of the
search.sur
function to report the evaluation metric of all
estimations or to summarize the estimation information for a
coefficient. For example, you might want to know the combined
distribution of a coefficient in all estimated models or perform an
extreme bound analysis.
Let’s use the sample from the previous example for this purpose. We can use the get.items.search function to request other types of information:
search_items <- get.items.search(type1 = TRUE, bestK = 10,
inclusion = TRUE,
cdfs = seq(-0.8, 0.3, 0.01),
extremeMultiplier = 1.95,
mixture4 = TRUE)
Apart from the bestK parameter, which is set to a high value to meet
step-wise search requirements, we changed the default value of five
parameters. We set type1
to TRUE
, meaning we
need the first type of information (coefficients in SUR).
inclusion = TRUE
means inclusion weights are saved and
returned. cdfs = seq(...)
means we want the weighted
average of CDFs in all estimated models at the given points. The
sequence value helps us get the correct shape of the combined
distribution. Also note that this sequence is selected after running the
code a second time when we have information about extreme bounds.
extremeMultiplier = 1.95
calculates and returns the
combined confidence interval of a coefficient in all models.
mixture4 = TRUE
returns the first four moments of the
combined distribution, which we will use to estimate a GLD
distribution.
Let’s repeat the estimation with these new options:
search_step_res <-
search.sur.stepwise(sample$y, sample$x, numTargets = ncol(sample$y),
xSizeSteps = x_size_steps, countSteps = count_steps,
metricOptions = get.options.metric(typesIn = c("aic")),
searchItems = search_items)
Inclusion weight is a simple concept: it is the average weight of models that contain a specific variable. Therefore, if a variable has a higher inclusion weight, it contributes more information to explain the dependent variable.
Focusing on the second target variable, Y2
, we can see
from search_step_res$aic$target2$model$inclusion
matrix
that the average inclusion weight in all models is 4.9210088^{-110}.
Let’s plot some of the highest inclusion weights:
Recall that the second dependent variable is not a function of
z?
variables (i.e., irrelevant ones), intercept and
X1
.
The CDF, extreme bounds, and mixture4 convey different levels of
information about the combined density of a coefficient across all
estimated models. Let’s focus on the effect of X2
on
Y1
.
First, we extract the actual and interval estimation:
actual <- list(value = sample$coef[3,1], label = "Actual")
b <- search_step_res$aic$target1$coefs$bests$item3$best1
interval <- list(value = b$mean, label = "X2 -> Y1",
xmin = b$mean - 1.95 * sqrt(b$var),
xmax = b$mean + 1.95 * sqrt(b$var))
Since we will use the coefs.plot
function discussed
earlier, we set the values as elements of a list. Next, we get the
extreme bounds:
e <- search_step_res$aic$target1$coefs$extremeBounds
bound <- list(xmin = e[3,1], xmax = e[3,2], label= "E.B.")
Next, we use the CDFs to plot the true density function:
cdfs <- list(type = "cdfs",
cdfs = sapply(search_step_res$aic$target1$coefs$cdfs, function(c)c[3,1]),
xs = search_items$cdfs,
label = "CDFs", smoothFun = function(y) smooth(y), col = "brown", lty = 1)
Then we get the GLD distribution using the gld
package:
g <- search_step_res$aic$target1$coefs$mixture[3,]
g <- s.gld.from.moments(g[1], g[2], g[3], g[4], type = 3)
glds <- list(type = "gld",
p1 = g[1], p2 = g[2], p3 = g[3], p4 = g[4],
label = "GLD", quantiles = seq(0.001, 0.999, 0.001), lty = 2, col = "black")
Plotting is done using the coefs.plot
function, similar
to the previous example.
coefs.plot(intervals = list(interval), bounds = list(bound),
points = list(actual), distributions = list(cdfs, glds),
xlab = "Value", ylab = "")
Consider a system of equations where there are many relevant variables, meaning that we cannot divide the data into relevant and irrelevant and each variable has something to offer. Estimating a large model might not be wise due to low degrees of freedom or other problems such as multicollinearity. In such a system, we might still want to search for the most significant variables or select a subset of the most important variables. However, we can also use principal component analysis (PCA).
As before, let’s start by defining our system:
As you can see, we assume that there are 60 relevant exogenous variables in the system. The following code shows the estimation with PCA:
By using get.options.pca
, I am asking ldt
not to include the first 2 variables in the PCA analysis. Also, I want
it to include 4 principal components in the regression. The following
table shows the result:
Y1 | Y2 | |
---|---|---|
Intercept | -1.09 | -0.06 |
X1 | -0.27 | 3.51*** |
X_PC1 | -0.91** | 1.97*** |
X_PC2 | -1.44*** | 1.64*** |
X_PC3 | 1.05** | 0.77 |
X_PC4 | 0.02 | 0.60 |
obs | 100 | 100 |
sigma2 | 53.38 | 81.09 |