Introduction
In the previous tutorial we figured out how to fit a simple linear regression model. A dependent variable guided by a single independent variable is a good start, but it is of limited use in real-world scenarios. Generally, a dependent variable depends on multiple factors. For example, the rent of a house depends on many factors like the neighborhood it is in, its size, the number of rooms, attached facilities, the distance to the nearest station, the distance to the nearest shopping area, and so on. How do we deal with such scenarios? Let's jump into multivariate linear regression and figure this out.
Multivariate Linear Regression
This is quite similar to the simple linear regression model we discussed previously, but with multiple independent variables contributing to the dependent variable, and hence multiple coefficients to determine and more involved computation due to the added variables. Jumping straight into the equation of multivariate linear regression,
$$$Y_i = \alpha + \beta_{1}x_{i}^{(1)} + \beta_{2}x_{i}^{(2)} + \dots + \beta_{n}x_{i}^{(n)}$$$
$$Y_i$$ is the estimate of the $$i^{th}$$ component of the dependent variable $$y$$, where we have $$n$$ independent variables and $$x_{i}^{(j)}$$ denotes the $$i^{th}$$ component of the $$j^{th}$$ independent variable/feature. Similarly, the cost function is as follows,
$$$E(\alpha, \beta_{1}, \beta_{2},\dots,\beta_{n}) = \frac{1}{2m}\sum_{i=1}^{m}(y_{i}-Y_{i})^{2}$$$
where we have $$m$$ data points in the training data and $$y$$ is the observed value of the dependent variable.
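To make the formulas concrete, here is a minimal NumPy sketch of the hypothesis and the cost function. The variable names and array shapes (`X`, `alpha`, `beta`) are illustrative assumptions, not part of any fixed API.

```python
import numpy as np

def predict(X, alpha, beta):
    """Hypothesis: Y_i = alpha + beta_1*x_i^(1) + ... + beta_n*x_i^(n).

    X     : (m, n) array of independent variables
    alpha : scalar intercept
    beta  : (n,) array of coefficients
    """
    return alpha + X @ beta

def cost(X, y, alpha, beta):
    """Cost E = (1 / 2m) * sum_i (y_i - Y_i)^2."""
    m = len(y)
    residuals = y - predict(X, alpha, beta)
    return (residuals ** 2).sum() / (2 * m)
```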
As far as the formulation of the equation or the cost function goes, this is a pretty straightforward generalization of simple linear regression. But computing the parameters is the matter of interest here.
Computing parameters
Generally, when it comes to multivariate linear regression, we don't throw in all the independent variables at once and start minimizing the error function. First one should focus on selecting the independent variables that contribute most to the dependent variable. For this, we construct a correlation matrix for all the independent variables and the dependent variable from the observed data. The correlation values give us an idea of which variables are significant and by what factor. From this matrix we pick independent variables in decreasing order of correlation value and run the regression model to estimate the coefficients by minimizing the error function. We stop when there is no prominent improvement in the estimation function from including the next independent feature. This method can still get complicated when there is a large number of independent features that contribute significantly to the dependent variable. Let's first discuss the normal equation method, which is similar to the one we used in univariate linear regression.
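As a rough sketch of that selection step, assuming the observed data sits in a pandas DataFrame of numeric columns with the dependent variable in a column named `rent` (the file name and column name here are hypothetical examples), the features could be ranked like this:

```python
import pandas as pd

# df holds the observed data; "rent" is the dependent variable (example name)
df = pd.read_csv("housing.csv")

# Correlation matrix over all variables (Pearson correlation by default)
corr = df.corr()

# Rank independent variables by the strength of their correlation with the target
ranking = corr["rent"].drop("rent").abs().sort_values(ascending=False)
print(ranking)
```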
Normal Equation
Now let us talk in terms of matrices, as it is easier that way. As discussed before, if we have $$n$$ independent variables in our training data, our matrix $$X$$ has $$n+1$$ columns, where the first column is a $$0^{th}$$ term added to each vector of independent variables with a value of 1 (it multiplies the constant term $$\alpha$$). So, $$X$$ is as follows,
$$$
X =
\begin{bmatrix}
X_{1} \\
\vdots \\
X_{m} \\
\end{bmatrix}
$$$
$$X_{i}$$ contains $$n+1$$ entries: a leading 1 followed by the $$n$$ feature values of the $$i^{th}$$ training example. So, matrix $$X$$ has $$m$$ rows and $$n+1$$ columns (the $$0^{th}$$ column is all 1s and the rest hold one independent variable each).
$$$
Y =
\begin{bmatrix}
Y_{1} \\
Y_{2} \\
\vdots \\
Y_{m} \\
\end{bmatrix}
$$$
and coefficient matrix C,
$$$
C =
\begin{bmatrix}
\alpha \\
\beta_{1} \\
\vdots \\
\beta_{n} \\
\end{bmatrix}
$$$
and our final equation for our hypothesis is,
$$$Y = XC$$$
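In code, constructing the design matrix just means prepending a column of ones to the feature matrix, after which the hypothesis is a single matrix product. Here is a small NumPy sketch; the function and variable names are illustrative.

```python
import numpy as np

def design_matrix(features):
    """Prepend a column of ones so the first coefficient in C acts as the intercept alpha."""
    m = features.shape[0]
    return np.hstack([np.ones((m, 1)), features])

def hypothesis(X, C):
    """Matrix form of the hypothesis: Y = XC."""
    return X @ C
```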
To calculate the coefficients, we need $$n+1$$ equations, and we get them from the minimizing condition of the error function. Equating the partial derivative of $$E(\alpha, \beta_{1}, \beta_{2}, ..., \beta_{n})$$ with respect to each of the coefficients to 0 gives a system of $$n+1$$ equations. Solving these is a complicated step, but it yields the following nice closed-form result for the matrix $$C$$,
$$$
C = (X^{T}X)^{-1}X^{T}y
$$$
where $$y$$ is the matrix of observed values of the dependent variable.
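Here is a minimal NumPy sketch of the normal equation as a direct translation of the formula above; in practice, solving the linear system (for example with `np.linalg.solve` or `np.linalg.lstsq`) is numerically safer than forming the inverse explicitly. The names here are illustrative.

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form least-squares solution: C = (X^T X)^{-1} X^T y.

    X : (m, n+1) design matrix with a leading column of ones
    y : (m,) observed values of the dependent variable
    """
    return np.linalg.inv(X.T @ X) @ X.T @ y

# A numerically more stable alternative solves (X^T X) C = X^T y directly:
# C = np.linalg.solve(X.T @ X, X.T @ y)
```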
This method works well when $$n$$ is considerably small (roughly up to a few hundred features). As $$n$$ grows large, the matrix inversion and multiplications above take a large amount of time. In a future tutorial let's discuss a different method that can be used for data with a large number of features.