“The road to machine learning starts with Regression. Are you ready?”
If you aspire to become a data scientist, regression is the first algorithm you need to learn and master, not just to clear job interviews but to solve real-world problems. Even today, many consultancy firms continue to use regression techniques at scale to help their clients. No doubt, it’s one of the easiest algorithms to learn, but it takes persistent effort to reach the master level.
Running a regression model is a no-brainer. A simple model <- lm(y ~ x) does the job. But optimizing this model for higher accuracy is a real challenge. Let’s say your model gives adjusted R² = 0.678; how will you improve it?
In this article, I’ll introduce you to crucial concepts of regression analysis with practice in R. Data is given for download below. Once you have finished reading this article, you’ll be able to build, improve, and optimize regression models on your own. Regression has several types; however, in this article I’ll focus on simple linear and multiple regression.
Note: This article is best suited for people new to machine learning with requisite knowledge of statistics. You should have R installed on your laptop.
Regression is a parametric technique used to predict a continuous (dependent) variable given a set of independent variables. It is parametric in nature because it makes certain assumptions (discussed next) about the data set. If the data set follows those assumptions, regression gives excellent results. Otherwise, it struggles to provide convincing accuracy. Don’t worry. There are several tricks (we’ll learn them shortly) we can use to obtain convincing results.
Mathematically, regression uses a linear function to approximate (predict) the dependent variable given as:
Y = β₀ + β₁X + ε
where, Y – Dependent variable
X – Independent variable
β₀ – Intercept
β₁ – Slope
ε – Error
β₀ and β₁ are known as coefficients. This is the equation of simple linear regression. It’s called ‘simple’ because there is just one independent variable (X) involved, and ‘linear’ because Y is modeled as a linear function of the coefficients. In multiple regression, we have many independent variables (Xs). If you recall, the equation above is nothing but the line equation (y = mx + c) we studied in school. Let’s understand what these parameters say:
Y – This is the variable we predict.
X – This is the variable we use to make a prediction.
β₀ – This is the intercept term. It is the prediction value you get when X = 0.
β₁ – This is the slope term. It explains the change in Y when X changes by 1 unit.
ε – This represents the residual value, i.e. the difference between actual and predicted values.
Error is an inevitable part of the prediction-making process. No matter how powerful the algorithm we choose, there will always remain an irreducible error (ε), which reminds us that the “future is uncertain.”
Yet, we humans have a unique ability to persevere, i.e. we know we can’t completely eliminate the error term (ε), but we can still try to reduce it as far as possible. Right? To do this, regression uses a technique known as Ordinary Least Squares (OLS).
So the next time you say you are using linear/multiple regression, you are actually referring to the OLS technique. Conceptually, the OLS technique tries to minimize the sum of squared errors Σ[Actual(y) - Predicted(y')]² by finding the best possible values of the regression coefficients (β₀, β₁, etc.).
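To make the idea of minimizing the sum of squared errors concrete, here is a minimal sketch in R. The data are simulated, and the true values 2 and 3 as well as the helper name sse are assumptions made purely for illustration. It checks that nudging the coefficients away from the values lm() finds only makes the sum of squared errors larger.

```r
# Simulated data: illustrative only, not the article's data set
set.seed(42)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)

# Hypothetical helper: sum of squared errors for a candidate line
sse <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

# Coefficients chosen by OLS
fit <- lm(y ~ x)
b <- unname(coef(fit))

sse_ols     <- sse(b[1], b[2])
sse_shifted <- sse(b[1] + 0.5, b[2])   # move the intercept away from the OLS value
sse_tilted  <- sse(b[1], b[2] - 0.5)   # move the slope away from the OLS value

print(c(ols = sse_ols, shifted = sse_shifted, tilted = sse_tilted))
```

Any other perturbation you try should give the same picture: the OLS coefficients sit at the bottom of the squared-error surface.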
Is OLS the only technique regression can use? No! There are other techniques such as Generalized Least Squares, Percentage Least Squares, Total Least Squares, Least Absolute Deviations, and many more. Then, why OLS? Let’s see.
Let’s understand OLS in detail using an example:
We are given a data set with 100 observations and 2 variables, namely Height and Weight. We need to predict weight (y) given height (x1). The OLS equation can be written as:
Y = β₀ + β₁(Height) + ε
When using R, Python or any computing language, you don’t need to know how these coefficients and errors are calculated. As a matter of fact, most people don’t care. But you must know, and that’s how you’ll get close to becoming a master.
The formula to calculate these coefficients is easy. Let’s say you are given the data, and you don’t have access to any statistical tool for computation. Can you still make any prediction? Yes!
The most intuitive and closest approximation of Y is the mean of Y, i.e. even in the worst-case scenario our predictive model should at least be more accurate than predicting the mean. The formulas to calculate the coefficients go like this:
β₁ = Σ(xi - xmean)(yi - ymean) / Σ(xi - xmean)², where i = 1 to n (no. of obs.)
β₀ = ymean - β₁(xmean)
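The slope and intercept formulas above can be checked by hand in a few lines of R. This is only a sketch: the height and weight values are simulated (the means, spreads, and the relationship between them are assumptions), and the result is compared against what lm() reports.

```r
# Simulated height (cm) and weight (kg); all numbers are illustrative assumptions
set.seed(1)
height <- rnorm(100, mean = 170, sd = 10)
weight <- -100 + 1 * height + rnorm(100, sd = 5)

# Slope: sum of cross-deviations divided by sum of squared deviations of x
b1 <- sum((height - mean(height)) * (weight - mean(weight))) /
      sum((height - mean(height))^2)

# Intercept: forces the fitted line through the point (xmean, ymean)
b0 <- mean(weight) - b1 * mean(height)

# Compare with R's built-in OLS fit
fit <- lm(weight ~ height)
print(c(b0 = b0, b1 = b1))
print(coef(fit))
```

The two printed pairs should agree up to floating-point rounding, which is a useful sanity check that lm() is doing nothing more mysterious than the formulas describe.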
Now you know that ymean plays a crucial role in determining the regression coefficients and, in turn, accuracy. In OLS, the error estimates can be divided into three parts:
Residual Sum of Squares (RSS) – Σ[Actual(y) – Predicted(y)]²
Explained Sum of Squares (ESS) – Σ[Predicted(y) – ymean]²
Total Sum of Squares (TSS) – Σ[Actual(y) – ymean]²
The most important use of these error terms is in the calculation of the Coefficient of Determination (R²).
R² = 1 - (RSS/TSS) = ESS/TSS
The R² metric tells us the proportion of variance in the dependent variable explained by the independent variables in the model. In the upcoming section, we’ll see the importance of this coefficient and more metrics to compute the model’s accuracy.
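As a quick check of how the three sums of squares fit together, the sketch below computes them for a small model and derives R² two ways: as 1 - RSS/TSS and as ESS/TSS. These agree for OLS with an intercept because TSS = ESS + RSS. The built-in cars data set is used only as a stand-in; it is not the data used later in this article.

```r
# Built-in 'cars' data (speed vs. stopping distance), used only as a stand-in
fit <- lm(dist ~ speed, data = cars)

actual    <- cars$dist
predicted <- fitted(fit)

rss <- sum((actual - predicted)^2)        # variation the model leaves unexplained
ess <- sum((predicted - mean(actual))^2)  # variation explained by the model
tss <- sum((actual - mean(actual))^2)     # total variation around the mean

# For OLS with an intercept, TSS = ESS + RSS, so both forms of R-squared agree
r2_from_rss <- 1 - rss / tss
r2_from_ess <- ess / tss

print(c(r2_from_rss, r2_from_ess, summary(fit)$r.squared))
```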
As we discussed above, regression is a parametric technique, so it makes assumptions. Let’s look at the assumptions it makes:
The error terms must be uncorrelated, i.e. the error at ε(t) must not indicate the error at ε(t+1). Presence of correlation in error terms is known as autocorrelation. It drastically affects the regression coefficients and standard error values, since they are based on the assumption of uncorrelated error terms.
The presence of these assumptions makes regression quite restrictive. By restrictive I mean that the performance of a regression model is conditional on the fulfillment of these assumptions.
Once these assumptions get violated, regression makes biased, erratic predictions. I’m sure you are tempted to ask me, “How do I know these assumptions are getting violated?”
Of course, you can check performance metrics to estimate violation. But the real treasure is present in the diagnostic a.k.a residual plots. Let’s look at the important ones:
1. Residual vs. Fitted Values Plot
Ideally, this plot shouldn’t show any pattern. But if you see any shape (curve, U shape), it suggests non-linearity in the data set. In addition, if you see a funnel shape pattern, it suggests your data is suffering from heteroskedasticity, i.e. the error terms have non-constant variance.
2. Normality Q-Q Plot
As the name suggests, this plot is used to check whether the errors are normally distributed. It uses standardized values of residuals. Ideally, the points should fall along a straight line. If you find a curved, distorted line, then your residuals have a non-normal distribution (a problematic situation).
3. Scale Location Plot
This plot is also useful to determine heteroskedasticity. Ideally, this plot shouldn’t show any pattern. The presence of a pattern indicates heteroskedasticity. Don’t forget to corroborate the findings of this plot with the funnel shape in residual vs. fitted values.
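To see what a funnel shape looks like before you meet one in real data, here is a small sketch that deliberately simulates heteroskedastic data. The data-generating choices (noise that grows in proportion to x) are assumptions made for illustration. The last two lines give a rough numeric counterpart to the visual check; this is an informal indicator, not a formal test.

```r
# Simulated data whose noise grows with x (an assumption for illustration)
set.seed(7)
x <- seq(1, 200)
y <- 2 + 3 * x + rnorm(200, sd = 0.5 * x)

fit <- lm(y ~ x)

# Residuals vs. fitted: the spread should widen from left to right
plot(fitted(fit), residuals(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. Fitted (simulated funnel)")
abline(h = 0, lty = 2)

# Rough numeric counterpart: do absolute residuals grow with the fitted values?
spread_trend <- cor(abs(residuals(fit)), fitted(fit))
print(spread_trend)
```

A clearly positive value here matches the widening spread in the plot; for data with constant variance it would hover near zero.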
If you are a non-graphical person, you can also perform quick tests / methods to check assumption violations:
There is little you can do when your data violates regression assumptions. An obvious solution is to use tree-based algorithms, which capture non-linearity quite well. But if you are adamant about using regression, here are some tips you can implement:
Determining model fit is tricky. The metrics used to determine model fit can take different values depending on the type of data. Hence, we need to be extremely careful while interpreting regression analysis. Following are some metrics you can use to evaluate your regression model:
Let’s use our theoretical knowledge and build a model in practice. As mentioned above, you should have R installed on your laptop. I’ve taken the data set from the UCI Machine Learning Repository. Originally, the data set is available as a .txt file. To save you some time, I’ve converted it into .csv, and you can download it here.
Let’s load the data set and do initial data analysis:
#set working directory
> path <- "C:/Users/Data/UCI"
> setwd(path)
#load data and check data
> mydata <- read.csv("airfoil_self_noise.csv")
> str(mydata)
This data has 5 independent variables and Sound_pressure_level as the dependent variable (to be predicted). In predictive modeling, we should always check for missing values in the data. If any data is missing, we can use methods like mean, median, or predictive-modeling imputation to make up for it.
#check missing values
> colSums(is.na(mydata))
This data set has no missing values. Good for us! Now, to avoid multicollinearity, let’s check the correlation matrix.
> cor(mydata)
If you look carefully, you’ll see that Angle_of_Attack and Displacement show a correlation of about 75%. It’s up to us to decide whether this level is damaging. Usually, a correlation above 80% (a subjective cutoff) is considered high. Since this pair falls below that, we can let it pass and won’t remove any variable.
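Scanning a correlation matrix by eye gets tedious as the number of variables grows. The sketch below shows one way to list the pairs above a chosen cutoff. The helper name high_cor_pairs, the 0.8 default, and the toy data frame are all assumptions made for illustration; you would pass your own data frame of numeric columns instead.

```r
# Hypothetical helper: list variable pairs whose |correlation| exceeds a cutoff
high_cor_pairs <- function(df, cutoff = 0.8) {
  cm <- cor(df)
  # Keep each pair once by looking only above the diagonal
  idx <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)
  data.frame(var1 = rownames(cm)[idx[, "row"]],
             var2 = colnames(cm)[idx[, "col"]],
             cor  = cm[idx],
             stringsAsFactors = FALSE)
}

# Toy data: b is nearly a copy of a, c is unrelated (illustrative assumption)
set.seed(3)
a <- rnorm(200)
toy <- data.frame(a = a, b = a + rnorm(200, sd = 0.1), c = rnorm(200))

print(high_cor_pairs(toy))
```

Lowering the cutoff (for example to 0.7) makes the check stricter, which fits the point that the threshold is a judgment call.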
In R, the base function lm is used for regression. We can run regression on this data by:
> regmodel <- lm(Sound_pressure_level ~ ., data = mydata)
> summary(regmodel)
Call:
lm(formula = Sound_pressure_level ~ ., data = mydata)
Residuals:
    Min      1Q  Median      3Q     Max
-17.480  -2.882  -0.209   3.152  16.064
Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)           1.328e+02  5.447e-01  243.87   <2e-16 ***
`Frquency(Hz)`       -1.282e-03  4.211e-05  -30.45   <2e-16 ***
Angle_of_Attack      -4.219e-01  3.890e-02  -10.85   <2e-16 ***
Chord_Length         -3.569e+01  1.630e+00  -21.89   <2e-16 ***
Free_stream_velocity  9.985e-02  8.132e-03   12.28   <2e-16 ***
Displacement         -1.473e+02  1.501e+01   -9.81   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.809 on 1497 degrees of freedom
Multiple R-squared: 0.5157, Adjusted R-squared: 0.5141
F-statistic: 318.8 on 5 and 1497 DF, p-value: < 2.2e-16
~ . tells lm to use all the independent variables. Let’s understand the regression output in detail:
Take the coefficient of Chord_Length as an example. When Chord_Length is increased by 1 unit, holding other variables constant, Sound_pressure_level decreases by 35.69.
The adjusted R² implies that our model explains ~51% of the total variance in the data. And the overall p-value of the model is significant. Can we still improve this model? Let’s try. Now, we’ll check the residual plots, understand the pattern, and derive actionable insights (if any):
> #set graphic output
> par(mfrow=c(2,2))
> #create residual plots
> plot (regmodel)
Among all, the Residual vs. Fitted plot catches my attention. It isn’t pronounced, but I see signs of heteroskedasticity in this data. Remember the funnel shape? You can see a similar pattern. To overcome this, we’ll build another model with log(y).
> regmodel <- update(regmodel, log(Sound_pressure_level)~.)
> summary(regmodel)
Call:
lm(formula = log(Sound_pressure_level) ~ `Frquency(Hz)` + Angle_of_Attack +
    Chord_Length + Free_stream_velocity + Displacement, data = mydata)
Residuals:
      Min        1Q    Median        3Q       Max
-0.146939 -0.023272 -0.000701  0.025425  0.122213
Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)           4.891e+00  4.393e-03 1113.31   <2e-16 ***
`Frquency(Hz)`       -1.054e-05  3.396e-07  -31.05   <2e-16 ***
Angle_of_Attack      -3.369e-03  3.137e-04  -10.74   <2e-16 ***
Chord_Length         -2.878e-01  1.315e-02  -21.89   <2e-16 ***
Free_stream_velocity  8.071e-04  6.559e-05   12.31   <2e-16 ***
Displacement         -1.244e+00  1.211e-01  -10.28   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.03878 on 1497 degrees of freedom
Multiple R-squared: 0.5235, Adjusted R-squared: 0.5219
F-statistic: 329 on 5 and 1497 DF, p-value: < 2.2e-16
Though the improvement is small, we’ve increased our adjusted R² to 52.19%. The funnel shape wasn’t strongly evident to begin with, which implies that the effect of non-constant variance was not severe.
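One thing to keep in mind with a log(y) model is that its coefficients are no longer in the units of y. As a small worked example, the sketch below takes the Chord_Length coefficient from the log-model summary above and converts it into a multiplicative effect on Sound_pressure_level. The variable names are hypothetical, and a full one-unit change in Chord_Length may lie outside the range of the data, so treat this as an illustration of the arithmetic rather than a practical forecast.

```r
# Coefficient of Chord_Length copied from the log-model summary above
b_chord <- -2.878e-01

# In a log(y) model, a one-unit increase in x multiplies y by exp(coefficient),
# holding the other variables constant
multiplier <- exp(b_chord)
pct_change <- (multiplier - 1) * 100

print(round(multiplier, 3))  # roughly 0.75
print(round(pct_change, 1))  # roughly a 25% decrease
```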
Let’s divide the data set into train and test sets to check our final evaluation metric. We’ll keep 70% of the data in train and 30% in test, so that the model has enough training data to identify the obvious emerging patterns.
#sample
> set.seed(1)
> d <- sample(x = nrow(mydata),size = nrow(mydata)*0.7)
> train <- mydata[d,] #1052 rows
> test <- mydata[-d,] #451 rows
#train model
> regmodel <- lm(log(Sound_pressure_level)~.,data = train)
> summary(regmodel)
#test model
> regpred <- predict(regmodel,test)
#convert back to original value
> regpred <- exp(regpred)
> library(Metrics)
> rmse(actual = test$Sound_pressure_level,predicted = regpred)
[1] 5.03423
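If you would rather not depend on the Metrics package, RMSE is easy to compute in base R. The sketch below uses toy vectors (chosen only to keep the arithmetic easy to follow) and the hypothetical name rmse_manual; the same expression would work with test$Sound_pressure_level and regpred in place of the toy vectors.

```r
# Toy vectors, chosen only to keep the arithmetic easy to follow
actual    <- c(3, 5, 7)
predicted <- c(2, 5, 9)

# Root mean squared error: square the errors, average them, take the root
# Errors are 1, 0, -2, so the squared errors are 1, 0, 4 and their mean is 5/3
rmse_manual <- sqrt(mean((actual - predicted)^2))
print(rmse_manual)  # sqrt(5/3), about 1.29
```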
RMSE is more helpful when compared against other models. Right now, we can’t say whether an error of 5.03 is the best we could expect. I’ve got you started solving regression problems. Now, you should spend more time and try to obtain a lower error than 5.03. For example, you should next check for outlier values using a box plot:
#save the output of boxplot
> d <- boxplot(train$Displacement, varwidth = TRUE, outline = TRUE, border = TRUE, plot = TRUE)
> d$out #list outlier observations
Outliers have a substantial impact on regression’s accuracy. Treating outliers is a tricky task. It requires in-depth understanding of data to acknowledge the existence of these high leverage points. I would suggest you read more about it, and if you are unable to find a way let me know in comments. I’m excited to see if you can do it!
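As a starting point for that exploration, here is a minimal sketch of how the box plot rule flags values, using a toy vector rather than the article's data. boxplot.stats() is the base R function that computes the statistics behind boxplot(), including the 1.5 × IQR rule; the toy values are an assumption for illustration.

```r
# Toy vector with one obviously extreme value (illustrative assumption)
x <- c(1, 2, 3, 4, 5, 100)

# boxplot.stats() applies the same 1.5 * IQR rule that boxplot() draws
out <- boxplot.stats(x)$out
print(out)

# Positions of the flagged values, useful for inspecting the matching rows
print(which(x %in% out))
```

Whether a flagged value should be removed, capped, or kept is a separate decision that depends on what the observation means in your data.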
My motive in writing this article is to get you started at solving regression problems, with a greater focus on the theoretical aspects. Running an algorithm isn’t rocket science, but knowing how it works will surely give you more control over what you do.
In this article, I’ve discussed the basics and semi-advanced concepts of regression. I’ve also explained best practices you are advised to follow when facing low model accuracy. We learned about regression assumptions, violations, model fit, and residual plots, with hands-on practice in R. If you are a Python user, you can run regression with LinearRegression().fit(x_train, y_train) after importing LinearRegression from scikit-learn’s sklearn.linear_model module.
Let me know if there is anything you don’t understand while reading this article. I’d love to answer your questions and will try to reply to your queries within an hour. You can drop me an email here.