Expectation-Maximization Algorithm Estimation Method in Automated Model Selection Procedure for Seemingly Unrelated Regression Equations Models

Model selection is the process of choosing a model from a set of possible models. A model's ability to generalise means it can fit both current and future data. Although many procedures for selecting models automatically have emerged, there has been a lack of studies on procedures for selecting multiple equations models, particularly seemingly unrelated regression equations (SURE) models. Hence, this study concentrates on an automated model selection procedure for the SURE model that integrates the expectation-maximization (EM) algorithm estimation method, named SURE(EM)-Autometrics. The procedure extends Autometrics, which is applicable only to single equations. To assess the performance of SURE(EM)-Autometrics, a simulation analysis was conducted under two strengths of correlation among equations and two levels of significance for a two-equation model with up to 18 variables in the initial general unrestricted model (GUM). Three econometric models were utilised as a testbed for the true specification search. The results were divided into four categories, where a tight significance level of 1% contributed a high percentage of models in which all equations contained variables precisely comparable to the true specifications. An empirical comparison of four model selection techniques was then conducted using water quality index (WQI) data. System selection, which selects all equations in the model simultaneously, proved more efficient than single equation selection. SURE(EM)-Autometrics dominated the comparison, placing at the top of the rankings for most of the error measures. Hence, integrating EM algorithm estimation is appropriate for improving the performance of automated model selection procedures for multiple equations models.


Introduction
To make scientific discoveries or anticipate future outcomes, data analysts use a variety of statistical models and methodologies to analyse observable data. Regardless of the data and fitting processes used, selecting the best acceptable model or approach from a pool of candidates is an important step. Model selection is an essential part of data analysis for scientific investigations and is crucial for obtaining accurate statistical inferences or predictions [1], [2]. The model selection procedure begins with the estimation of a model first specified by the researcher. Afterwards, the results of hypothesis tests on the individual parameters are used to identify significant variables or to conduct diagnostic checking of the model's assumptions [3]. This entire procedure may be done automatically or manually. However, it is quite difficult to perfect this intuitive judgement in manual model selection. Repetitive manual retraining and recalibrating of models is frequently prohibitively expensive, time-consuming, and in some circumstances impossible [4].
During the model development process, different researchers may adopt distinct modelling paradigms or strategies. Consequently, a number of alternative models for the same data set are created even when they are based on the same methodology. The different models that arise illustrate the variability that may occur when the model specification procedure is performed manually. Ultimately, this circumstance creates a gap between specialists and novice users, where novice users may include beginners in statistical or economic modelling who struggle to understand the model itself [1]. This difference in understanding has inspired demand for an automated technique offering a more convenient and definitive answer. With the advancement and spread of data modelling, the necessity of automated model specification has grown exponentially. In bridging the gap between experts and novices in model selection, the employment of expert systems approaches seems an optimal solution. The specification of multiple equation models is an example of this challenging task [5].

Seemingly Unrelated Regression Equations (SURE) Models
Multiple equation models apply to groups of related variables in some situations. With these complex factors present, it is reasonable to evaluate the models simultaneously as a system of equations. The word "system" is used to view the equations together rather than individually. The advantages can be appreciated further when all the equations are combined to describe the dynamic composition of the real operation. A group of equations may provide more information than the sum of its single equations. For instance, the more that is known about the causal linkages and structures involved, the more accurate the forecasts obtained will be [6].
Zellner [7] proposed the SURE model, which uses multiple equations and is an extension of the standard linear regression model. Every equation can be estimated independently, even though the error terms are considered to be correlated across equations. Because each equation must be able to stand on its own, each has its own dependent variable and a potentially separate set of regressors. The equations thus appear unrelated on account of their individuality, yet they are nonetheless linked through the correlated error terms.
The SURE model is frequently used in the fields of economics and financial modelling [8]-[10]. It may, however, be used in other disciplines as well, such as transportation [11]-[13], agriculture [14], management [15] and medicine [16]. Improvements and expansions of the original equations have also spurred further studies [17]-[19]. As a result, the SURE model is applicable in nearly every element of life. SURE modelling gains estimation efficiency by integrating information on several equations and by imposing or testing restrictions that involve parameters in separate equations [20]. Assume that the system consists of M equations,

y_1 = β_11 x_1,1 + β_12 x_1,2 + ⋯ + β_1k_1 x_1,k_1 + u_1
y_2 = β_21 x_2,1 + β_22 x_2,2 + ⋯ + β_2k_2 x_2,k_2 + u_2
⋮
y_M = β_M1 x_M,1 + β_M2 x_M,2 + ⋯ + β_Mk_M x_M,k_M + u_M

which can generally be written as

y_i = X_i β_i + u_i,   i = 1, …, M,

where y_i is the vector of T identically distributed observations for each random variable, X_i is a nonstochastic matrix of fixed variables of rank k_i, β_i is the vector of unknown coefficients, and u_i is a vector of disturbances.

Estimation Methods
In economic analysis, the SURE model may be identified in several instances [7]. One example is when the equations have contemporaneously uncorrelated disturbances. If all of the response variables are believed to be connected to the same regressors, this is a typical multivariate linear regression model, and ordinary least squares (OLS) provides the most effective and best linear unbiased estimators. Another example is when the equations have the same regressors but the disturbances are contemporaneously correlated. Because OLS ignores the correlations of the disturbances, it is commonly accepted that the generalised least squares (GLS) estimation technique is more efficient. Nevertheless, in the majority of cases the covariance of disturbances Ω is unknown, making GLS impractical. As a result, feasible generalised least squares (FGLS), which uses a consistent estimator in place of Ω, was presented [7].
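The FGLS procedure for a SURE system can be sketched as follows. This is a minimal illustration, not code from the study; `fgls_sure` is a hypothetical helper. Each equation is first fitted by OLS, the residuals give a consistent estimate of the contemporaneous disturbance covariance Σ, and the stacked system is then estimated by GLS with Ω = Σ ⊗ I_T:

```python
import numpy as np

def fgls_sure(X_list, y_list):
    """Feasible GLS for a SURE system (illustrative sketch).

    Each equation i has its own regressor matrix X_i (T x k_i) and
    response y_i (length T). Steps: (1) equation-by-equation OLS,
    (2) estimate the contemporaneous disturbance covariance Sigma from
    the OLS residuals, (3) joint GLS on the stacked system with
    Omega = Sigma kron I_T. Returns the stacked coefficient vector."""
    T, M = len(y_list[0]), len(y_list)
    # Step 1: OLS per equation to obtain residuals
    resid = np.empty((T, M))
    for i, (X, y) in enumerate(zip(X_list, y_list)):
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid[:, i] = y - X @ b
    # Step 2: consistent estimate of the M x M disturbance covariance
    sigma = resid.T @ resid / T
    # Step 3: block-diagonal stacked design and GLS with Omega = Sigma kron I_T
    X_block = np.zeros((M * T, sum(X.shape[1] for X in X_list)))
    col = 0
    for i, X in enumerate(X_list):
        X_block[i * T:(i + 1) * T, col:col + X.shape[1]] = X
        col += X.shape[1]
    y_stack = np.concatenate(y_list)
    omega_inv = np.kron(np.linalg.inv(sigma), np.eye(T))
    A = X_block.T @ omega_inv @ X_block
    b = X_block.T @ omega_inv @ y_stack
    return np.linalg.solve(A, b)
```

In a two-equation system with strongly correlated disturbances, the coefficients returned for both equations exploit the cross-equation correlation, which is the source of the efficiency gain discussed above.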
Srivastava and Giles [20] found that most studies comparing FGLS to OLS used models with two equations. However, when sample sizes are limited and/or the disturbance correlation coefficients in SURE models are near zero, FGLS can be less efficient than OLS; OLS outperforms FGLS when disturbance correlations are weak. The effectiveness of FGLS may also be lost when all equations have identical regressors, as in a vector autoregressive system. FGLS can likewise be risky when the specification omits variables. Because the equations are treated as a system, a specification error in one equation can affect the estimates in the others, unlike OLS estimation, which applies to one equation at a time so that equations with no specification errors are still estimated without bias. As a result, some researchers still oppose FGLS estimation, since single equation misspecification might affect all estimates in the system. Additionally, the FGLS estimator does not exist in the high-dimensional SURE model [21]. Thus, FGLS is not always an effective estimator, and therefore another estimation method for the SURE model, the EM algorithm, is investigated in this study.
The EM algorithm is also applicable to the SURE model, which is considered a repeated measures analysis. The SURE model's regression parameters and variance-covariance matrix may be estimated using a two-stage Aitken estimation procedure. Additionally, the EM method is considered simple to implement because it does not entail any evaluation of the likelihood or its derivatives, besides being able to estimate the 'missing' data values [22].
McLachlan and Krishnan [22] and Ng et al. [23] listed some attractive properties of the EM algorithm relative to other iterative algorithms, such as Newton-Raphson and Fisher's scoring method, for searching estimators:
i. The EM algorithm is numerically stable.
ii. The EM algorithm normally has consistent global convergence.
iii. The EM algorithm is easy, as it relies on calculations of complete data.
iv. The EM algorithm is normally easy to programme, since no evaluation of the likelihood or its derivatives is involved.
v. Only small storage is needed for the EM algorithm, so it can be run on a small computer.
vi. The low cost of each iteration offsets the larger number of iterations the EM algorithm requires compared with other procedures.
vii. Estimated values of the 'missing' data can be computed using the EM algorithm.
Sohn [24] discovered numerous useful characteristics of the EM method when it was used to estimate integrated choice and latent variable models. The EM algorithm significantly reduced calculation time, since it does not involve any time-consuming numerical computation of derivatives or of the Hessian of the simulated likelihood function, and it avoids a lack of empirical identification. The EM algorithm also excelled at decreasing sampling errors when the sample size was small (250 and 500), and even with large samples of 1000 and 2000 it outperformed its competitor. Furthermore, Ryzin and Vulcano [25] demonstrated that the convergence time of the EM algorithm is between two and six times faster than that of the direct ML estimator, while the quality of the estimates is comparable. Their findings were also validated in experiments involving many parameters or a high degree of censoring.
Owing to these extensive advantages, many studies have employed the EM algorithm in a wide range of applications, for example pattern recognition [26], biomedical and health [27]-[29], nature [30] and crime [31]. Overall, the EM algorithm is an outstanding tool to be studied.

Automated Model Selections
The work by Hoover and Perez [32] was a pioneer in the field of automated model selection. Hendry and Krolzig [33] built on Hoover and Perez's work by improving the data mining algorithms and developed PcGets, a software package for empirical modellers. Multiple equations models can be selected either one equation at a time or all equations at the same time, and the automated approach has facilitated these selections in a fraction of the time required by the manual approach. SURE-PcGets by Ismail [34] is one such method, in which the PcGets selection stages are integrated with the SURE model. SURE-PcGets updated the testing for contemporaneous correlation of disturbances and the model formulation of PcGets, and its simulation results demonstrated the algorithm's efficacy in finding the true model. Doornik and Hendry [35] then created Autometrics by utilising the notion of a tree search strategy. The Autometrics algorithm is comparable to PcGets but employs a tree search approach with adjustments to the pre-search simplification and goal function. Yusof [36] created SURE-Autometrics in response to the success of SURE-PcGets by utilising Autometrics, another single equation selection technique, together with the SURE model. The tree search strategy incorporated in Autometrics enables extensive exploration of pathways and enhances the likelihood of discovering terminal models using the pruning, bunching and chopping reduction principles. This finally results in the final model being the 'best' of all.
SURE-Autometrics was constructed by adapting Ismail's algorithm [34]. In that study, two approaches were considered: (i) a model selected by OLS compared with a model selected by FGLS; and (ii) a model selected by OLS and subsequently estimated by FGLS, against a model selected and estimated by FGLS concurrently. The former strategy was to establish how substantially single-equation and system estimation varied, whilst the latter sought to illustrate the distinctions between single and simultaneous selection processes. Yusof [36] demonstrated that when the correlation strength was strong, SURE-Autometrics was able to remove more irrelevant variables when the model was selected and estimated concurrently.
This prompted another approach utilising the iterative feasible generalised least squares (IFGLS) estimation method instead of FGLS in an extended version of SURE-Autometrics, known as SURE(IFGLS)-Autometrics [37], [38]. The results supported the earlier findings on SURE-Autometrics while showing slightly better performance. These initial attempts to use Autometrics to integrate the joint estimation and model selection of the whole SURE system opened a new avenue of study. Accordingly, the purpose of this study was to evaluate the performance of SURE-Autometrics when the EM algorithm estimation method is used instead. The algorithm has been renamed SURE(EM)-Autometrics to better reflect its capabilities.

SURE(EM)-Autometrics
The development of SURE(EM)-Autometrics builds on the original SURE-Autometrics, whose five phases are described in this section. The original SURE-Autometrics algorithm was changed by using the EM algorithm instead of the FGLS estimation method for SURE models. Therefore, the EM algorithm estimation method is applied in each phase of SURE(EM)-Autometrics.

Phase 1: Specification of initial GUMS
The algorithm begins with Phase 1, where the modeller specifies the initial form of every equation in the SURE model. This includes the number of equations, the main level of significance and the variables with their lags. Meanwhile, OLS is used to estimate each single equation in the SURE model separately. Any misspecification of the equations is also checked through diagnostic testing, such as tests for normality of errors, parameter constancy, autocorrelation, unconditional homoscedasticity and conditional homoscedasticity. This phase also includes testing for contemporaneous correlation of disturbances among equations and EM algorithm estimation.

Phase 2: Pre-search Reduction
The purpose of this phase is to lessen the computational effort by removing highly insignificant variables. Nevertheless, the whole algorithm can still function in the absence of Phase 2. Three types of reduction are used to remove insignificant variables: closed, common and common-X lag reductions. The closed lag reduction tests a set of lags starting with the biggest lag and ending when a lag cannot be removed. The common lag reduction tests grouped lags with the same lag number and orders the joint (common) significances beginning with the least significant lag group. Lastly, the common-X lag reduction follows the same steps as the common lag reduction, but the lag of Y is excluded from the procedure. The three lag reductions are employed in chronological order and again in reverse chronological order. Finally, an encompassing test is implemented to make certain that the reduced model is a legitimate reduction of the initial GUM system.
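As a rough illustration of the closed lag reduction, the sketch below drops lag groups from the largest lag downward and stops at the first group whose joint exclusion test rejects. `closed_lag_reduction` and `_rss` are hypothetical helpers, and the fixed `f_crit` stands in for the proper F critical value at the chosen significance level; none of this is the study's actual implementation:

```python
import numpy as np

def _rss(X, y):
    """Residual sum of squares of an OLS fit."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    return float(e @ e)

def closed_lag_reduction(X, y, lag_groups, f_crit=4.0):
    """Sketch of the closed lag reduction (illustrative only).

    lag_groups: list of column-index lists, ordered from the largest
    lag downward. A group is removed if the F-statistic for its joint
    exclusion falls below f_crit; the first group that cannot be
    removed stops the reduction, mirroring the description above."""
    keep = list(range(X.shape[1]))
    for group in lag_groups:
        reduced = [j for j in keep if j not in group]
        rss_full = _rss(X[:, keep], y)
        rss_red = _rss(X[:, reduced], y)
        q = len(group)                   # number of restrictions
        df = len(y) - len(keep)          # residual degrees of freedom
        F = ((rss_red - rss_full) / q) / (rss_full / df)
        if F < f_crit:
            keep = reduced               # lag group jointly insignificant
        else:
            break                        # stop at the first lag that matters
    return keep
```

The same skeleton, applied in reverse order and with a common grouping of equal lag numbers, would correspond to the other two reductions.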

Phase 3: Variable Reduction over Root Branches
Phase 3 involves a tree search in which the whole space of models created by the variables in the initial model is searched. The tree search involves three main principles: (i) pruning is executed when a single variable is evaluated for removal; (ii) bunching is used when variables are combined for deletion rather than deleting one variable at a time; and (iii) chopping happens when the search process permanently removes a highly insignificant branch.

Phase 4: Search for Nested Terminal
Phase 4 deals with additional inspections of the terminal models with the aim of finding variables that ought to be in the GUMS system. In order to find various terminal models, a minimal bunch is eliminated along the present path.

Phase 5: Selections of Final Model
In this phase, the algorithm finishes iterating when the new GUMS is the same as the previous GUMS. If there are multiple candidate terminal models, the one with the lowest information criterion value is chosen as the final model.
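The tie-break among candidate terminal models can be sketched as below. This is a minimal illustration assuming the Schwarz (BIC) criterion; the helpers `bic` and `pick_final_model` are ours, and the study does not state which information criterion it uses:

```python
import numpy as np

def bic(X, y):
    """Schwarz information criterion for an OLS-fitted equation (sketch)."""
    T, k = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ b) ** 2))
    # Gaussian log-likelihood term plus the log(T) complexity penalty
    return T * np.log(rss / T) + k * np.log(T)

def pick_final_model(candidates, y):
    """Choose the terminal model with the smallest information criterion."""
    return min(candidates, key=lambda X: bic(X, y))
```

A terminal model that omits a genuinely relevant variable incurs a large fit penalty, so the criterion favours the specification closest to the true one among the surviving terminals.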

Expectation-Maximization (EM) Algorithm
Dempster et al. [39] extensively described the EM algorithm. Each iteration of the algorithm includes two steps: the E (expectation) step and the M (maximisation) step. Let θ_0 be an initial set of parameter estimates. The E-step entails calculating the conditional expectation of the log-likelihood for all of the data, where the expectation is taken with respect to the distribution of the 'missing' data, conditional on θ_0 and on the observed data. In the M-step, the expected log-likelihood generated in the E-step is maximised with respect to θ, yielding a new estimate θ_1. The procedure then returns to the E-step, substituting θ_1 for θ_0, and cycles through the E- and M-steps until convergence is achieved.
McLachlan and Krishnan [22] agreed that computing the ML estimator using the EM method is frequently made achievable by intentionally presenting the problem as an incomplete data problem, even if it does not appear to be one at first. Given that this is an early attempt at using an EM algorithm to select the SURE model with time series missing data, this research was constructed with only one missing rate of 30%, in accordance with the rates used in previous simulation designs such as Emura and Shiu [40]. The following EM algorithm was used:
Step 1: Start from initial values based on the means calculated from the available data.
Step 2: Perform the FGLS estimation of the SURE model parameters on the completed data set.
Step 3: Forecast the missing values using the estimates found above.
Step 4: Replace the missing values with the forecasted values.
Step 5: Return to Step 2 and continue until the parameter estimates converge.
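The five steps above can be sketched as follows. This is a minimal single-equation illustration in which plain OLS stands in for the FGLS system estimation of Step 2, and `em_impute` is a hypothetical helper rather than the study's implementation:

```python
import numpy as np

def em_impute(X, y, tol=1e-8, max_iter=200):
    """EM-style imputation loop following the five steps above (sketch).

    Missing responses (np.nan in y) are initialised with the mean of
    the observed values (Step 1), the regression is re-estimated on the
    completed series (Step 2, with OLS standing in for FGLS), the
    fitted values replace the missing entries (Steps 3-4), and the
    cycle repeats until the coefficients converge (Step 5)."""
    miss = np.isnan(y)
    y = y.copy()
    y[miss] = np.nanmean(y)                           # Step 1
    b_old = np.zeros(X.shape[1])
    for _ in range(max_iter):
        b, *_ = np.linalg.lstsq(X, y, rcond=None)     # Step 2
        y[miss] = X[miss] @ b                         # Steps 3-4
        if np.max(np.abs(b - b_old)) < tol:           # Step 5
            break
        b_old = b
    return b, y
```

For this simple linear case the loop converges to the OLS fit on the observed rows; the attraction of the scheme in the SURE setting is that Step 2 can be the full FGLS system estimation without any likelihood derivatives being needed.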

Simulation Analysis
The simulation study was designed to evaluate the overall performance of the suggested algorithm in searching for the true model specification, based on the experiments in [34], [41] and [42]. The first model, M1, was denoted the 'empty model' since it consisted purely of random errors, while M2 comprised the first lag of the dependent variable. M3 augmented M2 by inserting one independent variable (x_t) and its first lag (x_(t-1)). The three econometric models in Table 1 were utilised as a testbed for the true specification search. During Phase 1 of the procedure, additional unnecessary variables were added to these true models. As the data-generating process was known, the performance of SURE(EM)-Autometrics was evaluated by computing the percentage of final models selected that were comparable to the true models; the goal was to achieve a high percentage of such outcomes. Table 2 summarises all the conditions set up for the simulation study. The simulation results are divided into four categories, with the criteria adopted from [36]; Table 3 lists the categories. A result falls into its assigned category only if all the equations in the model meet the requirements.

Empirical Analysis
On top of the SURE(EM)-Autometrics procedure, this empirical study considered three other selections: Autometrics-SURE, Autometrics-SURE(EM), and SURE-Autometrics. These selections were categorised according to whether the equations were selected individually or collectively, and which estimation method was utilised in the final models, as listed in Table 4. Autometrics is a single equation model algorithm that is built into the PcGive programme.
Given that the SURE model contains multiple equations, each one was estimated independently using OLS and selected separately. Nevertheless, FGLS was used to estimate the final model in Autometrics-SURE. Meanwhile, Autometrics-SURE(EM) is similar to Autometrics-SURE but uses the EM algorithm to estimate the final model. SURE-Autometrics and SURE(EM)-Autometrics are automated model selection algorithms that concentrate on the multiple equations model. In SURE-Autometrics, model selection was done concurrently using FGLS estimation, whereas SURE(EM)-Autometrics integrated EM estimation. The dependent variable in this study was the weekly water quality index (WQI) of a river in Malaysia, comprising 80 observations. Meanwhile, the independent variables were Dissolved Oxygen (DO) (% saturation), Dissolved Oxygen (DO) (mg/L), Biochemical Oxygen Demand (BOD), Chemical Oxygen Demand (COD), Suspended Solids (SS), pH and Ammoniacal Nitrogen (NH3N). The sub-indices in the analysis were created using these independent variables. Data sets were collected from two sampling stations, namely S7 and S8, which yielded a two-equation model.

Simulation Analysis
This simulation started with all WQI data available in hand. The two monitoring stations represent the two equations in the models. The true models M1, M2 and M3 were used as initial GUMS consisting of 18 independent variables, at 1% and 5% levels of significance, in a sample of 550 observations with strong (0.9) and weak (0.2) disturbance correlations. Table 5 presents the percentages gained in each category by SURE(EM)-Autometrics for two equations with strong disturbance correlation.
The results reveal that SURE(EM)-Autometrics displayed a gradual increment from M1 to M3 at the 1% level. Category 1 provided considerably high percentages of more than 80%, with these values surging to 90% in M2 and slightly more than 90% in M3. The highest Category 1 percentage, in M3, could be caused by the inclusion of the variable most correlated with the dependent variable, x_i4t, together with the strong disturbance correlation. Nevertheless, less than 10% of final models were grouped in each of Categories 2 and 4, and none in Category 3. The tight significance level of 1% possibly prevented significant irrelevant variables from entering the final model.
At the 5% level of significance, the percentages of SURE(EM)-Autometrics in Category 1 fell drastically compared with the same category at the 1% level, pushing more final models into Category 2. M3 performed best, followed by M2 with moderate accomplishment and finally M1 with the lowest scores. All of the final models included the true specifications; thus, no models were found under Category 3. Finally, there were still models whose equations fell under different categories and therefore contributed percentages under Category 4.
As for the final models of SURE(EM)-Autometrics with a weak disturbance correlation (ρ = 0.2) at the 1% level, M3 still covered the most models in Category 1 compared with the other two initial GUMS, while M2 was not far behind. In this case the strength of the disturbance correlation did not influence the model selection processes, unlike the consequence of including x_i4t in the model. For the empty model M1, the algorithm produced 12% of final models exactly matching the true specifications.
On the other hand, at the 5% level of significance, none of the final models was found in Category 1 for M1, but close to 10% for M2 and M3 under weak disturbance correlation. Overall, most of the true specifications were found to be nested in the final models (Category 2). No model fell into Category 3, but mixed results were attained for Category 4. Consequently, the weak correlation made the SURE model less efficient. The overall findings were consistent with the results of Yusof [36], where the algorithm performed well when the number of equations and the number of predictors in the true specification models were as small as possible.

Empirical Analysis
Stations S7 and S8 are represented by a system of two equations. The estimated models obtained using the different model selection procedures are displayed in Table 6 (WQI for S7) and Table 7 (WQI for S8), where *** denotes significance at 1%, ** at 5% and * at 10%. Following the experiments by Suzilah [34] and Yusof [36], two types of error measurement were used in this study: the root mean square error (RMSE) and the geometric root mean square error (GRMSE). Low values of these indicators indicate good forecasting performance. The RMSE is one of the measures most commonly used by practitioners to evaluate the effectiveness of forecasting models; it is an appropriate measure of accuracy when forecasts are performed across successive periods with the same forecast horizon and the cost function is quadratic. Meanwhile, Fildes [13] proposed that the GRMSE be used when there are infrequent outliers in the data (and errors), as well as when an exceptionally poor forecast produces a significantly large error term.
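The two error measures can be sketched as follows (a minimal illustration; the function names are ours). The RMSE averages the squared errors arithmetically, while the GRMSE takes the square root of their geometric mean, so a single very large error inflates it far less:

```python
import numpy as np

def rmse(actual, forecast):
    """Root mean square error."""
    e = np.asarray(actual) - np.asarray(forecast)
    return float(np.sqrt(np.mean(e ** 2)))

def grmse(actual, forecast):
    """Geometric root mean square error: the square root of the
    geometric mean of the squared errors, which damps the influence
    of an occasional exceptionally bad forecast."""
    e = np.asarray(actual) - np.asarray(forecast)
    return float(np.exp(np.mean(np.log(e ** 2))) ** 0.5)
```

For errors of 1, 1, 1 and 10, for example, the RMSE is pulled above 5 by the outlier while the GRMSE stays below 2, which is exactly the robustness property Fildes's proposal exploits.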

Model selection procedures
After computing the two error measures for each procedure up to three steps ahead, the medians for all equations in the SURE model were determined and ranked from 1 (the best) to 4 (the worst). This provided an assessment of the performance of each selection procedure. The results for one, two, and three-step-ahead forecasts using a two-equation model are shown in Table 8.
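The ranking step can be sketched as below, a minimal illustration with a hypothetical helper `rank_procedures` that ranks each procedure by the median of its error measures (1 = best):

```python
import numpy as np

def rank_procedures(errors):
    """Rank procedures by the median of their error measures (sketch).

    errors: dict mapping a procedure name to a list of error values
    (e.g. RMSEs over equations and forecast horizons). Returns a dict
    of ranks, where 1 denotes the smallest median."""
    medians = {name: float(np.median(v)) for name, v in errors.items()}
    ordered = sorted(medians, key=medians.get)    # ascending medians
    return {name: i + 1 for i, name in enumerate(ordered)}
```

Using the median rather than the mean keeps the ranking from being dominated by a single badly forecast horizon, consistent with the motivation for the GRMSE above.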
SURE(EM)-Autometrics consistently performed at the top in this empirical analysis, ranking first or second for all RMSEs and GRMSEs. It was followed by SURE-Autometrics, which ranked first in two circumstances. Both Autometrics-SURE(EM) and Autometrics-SURE received worse rankings than their competitors. These findings reveal that system selection for multiple equations systems is more competent than individual selection, supporting the earlier findings of Ismail, Yusof and Muda [43] and Yusof [36], where multiple equations model selection algorithms performed well in one and two step-ahead forecasts. In addition, the EM algorithm proved superior to FGLS in selecting the equations simultaneously.

Conclusions
SURE(EM)-Autometrics has successfully demonstrated excellent performance in selecting a multiple equations model, in particular two-equation SURE models. Because the model is made up of many equations, the performance of the algorithm is reflected by the percentages calculated when all equations are comparable to the true model in the simulation analysis. The high percentages in searching the final models similar to true specifications were mainly found when the correlation of disturbance was strong at a 1% level of significance.
On top of the simulation analysis, an empirical analysis was also conducted, and SURE(EM)-Autometrics proved to be the most successful selection procedure, ranking first or second relative to the other procedures. This again shows that system selection is considerably more efficient than individual selection.
The 'best' model may be selected concurrently from the several equations of the SURE model by using EM algorithm estimation within the algorithm. This indicates that, in this study, the EM algorithm estimation method has been verified to be more trustworthy than FGLS. The findings offer a valuable understanding of the EM algorithm estimation method in multiple equations model selection, notably for the SURE model. Consequently, this new estimation method also gives meaningful outcomes in identifying the 'best' parsimonious model from a general model and ultimately improves the model's forecast ability.