Inputting Data

Help for inputting data in the multiple regression analysis and forecasting template for Excel

User Guide Navigation:
Business Spreadsheets User Guides: Project Planning and Management Multiple Regression Analysis and Forecasting Investment and Business Valuation Real Options Valuation Portfolio Optimization Portfolio Performance Tracking Customer Invoicing
Multiple Regression Analysis and Forecasting Help Topics: Inputting Data Analyzing the Output Forecasting the Dependent Variable

Arriving at suitable input data for the Regression Forecasting model can be somewhat of an iterative process. Depending on the strength of the resulting output, independent variables can eliminated, replaced, or added until a statistically sound and satisfactory predictive regression equation eventuates. In selecting suitable independent variables, several factors should be taken into consideration:

Availability. Clearly, the independent data corresponding to the same observation periods as the dependent must be available to run the model. However, the availability of forecast data should also be considered. This is because the forecast data can be utilized in the regression equation to forecast the dependent variable. While the model allows automatic forecasting of independent variable data based on underlying trend lines, already calculated forecast data is likely to be based on stronger underlying analysis. Such forecasts might be sourced from in-house business plans or third party data providers.
Reliability. Input data should be acquired from a reliable source. This is particularly important if the data is estimated. If it is anticipated that forecasts of the independent data will be utilized to forecast the dependent variable, investigation should be made into the soundness of formulas and estimation processes.
Relevance. Independent data should have a relevant relationship to the dependent variable. While seemingly unrelated variables may possess correlations when testing for suitable independent 'value driver' data, care should be given that the relationship is not merely coincidental. Thought should be made that the independent variable data has at least a logical cause to influence the dependent variable.

Data should be input into the 'Input' sheet. The Dependent variable data and title should be entered into column A, and the Independent variables’ data and titles should be entered into columns B outward. There is no limit to the number of independent variables or observations and additional variables can be accommodated across the model automatically. Row formatting can be copied down is desired for observations.

Text based categorical and logistic/binary data types are supported and are automatically detected and converted to continuous data proxies for regression. The numerical variables created from categorical and logistic variables are added as additional variables to retain the mapping with the original variable columns which are marked OFF to be excluded from the analysis.

Feature Selection

Feature selection automatically identifies the combination of input variables that provides the best fit predictive regression equation without multicollinearity. The Minimum R-squared is first applied to each independent variable against the dependent and those under the R-squared are flagged as OFF. A value of zero will ignore this part of the feature selection. The Variance Inflation Factor (VIF) is used to test the adjusted R-Squared relationship of each independent variable with all other independent variables. An iterative process is undertaken to disable variables where the maximum VIF calculated as 1/(1-RSQ) to over this threshold. A rule of thumb for the VIF value is 10 lower values disabling more variables and higher values being more lenient on the multicollinearity tolerance. A VIF value of zero will ignore this part of the feature selection.

Variables can also be flagged as OFF manually for testing. Setting both feature selection variables to zero essentially disabled the feature selection and forces all numeric variables to be processed. The results will then highlight relative impact and multicollinearity for variables to then be manually turned OFF.

Screenshot: Inputting Data

Back