How To Interpret An OLS Model In Python
Hello all, thank you for reading this blog. As you may know from my previous posts, I am an aspiring data science student attending a data science bootcamp with no previous experience in the world of technology. As I've begun to dive into the fundamentals of statistics for machine learning, I have been overwhelmed by a myriad of statistical concepts that I have never seen or heard of before. I have just begun to experiment with Linear Regression, and I am deepening my conceptual understanding of the metrics associated with the accuracy of a Linear Regression model. In my short journey, I have begun to see the Ordinary Least Squares model as my saving grace (and my worst enemy).
Ordinary Least Squares models give us loads of quality information about a Linear Regression, but if you're like me, you may have a difficult time interpreting the information they return and understanding how it applies to your model. Follow along as I go over a broad explanation of some of the most important metrics within this statsmodels output.
What Is An OLS Model And When Will I Use It?
Ordinary Least Squares Regression, more commonly referred to as Linear Regression, is a popular statistical technique used to depict the relationship between one or more independent variables, x, and one dependent variable, y. The ideal Linear Regression model is one in which the line of best fit passes through every single point. The easiest way to interpret the perfect Linear Regression is that the independent variables perfectly explain the variance in the dependent variable, meaning there are no other independent variables that affect the dependent variable. Knowing the relationship between your x and y is very powerful because you can use the x variable(s) to predict y when you do not know what y is. Of course, in real life perfect Linear Regressions are rare, and you will often have to put a lot of work and analysis into your model to increase your prediction accuracy.
The Ordinary Least Squares (OLS) model is the easiest and simplest way to view all of the important metrics for your Linear Regression in Python. If you know these metrics, you can better analyze your regression model as a whole and learn which features you will need to improve on to get your predictions as accurate as possible.
How Do I Interpret The Results Of An OLS Model?
For this example, I have built a Linear Regression model to predict home prices and I want to determine the accuracy and dependability of my model. You can clearly see in the OLS Regression Results that my dependent variable has been set to 'price' and my independent variables are 'sqft_living', 'grade', and 'bedrooms'.
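A model like the one described above can be built in just a few lines with statsmodels. The DataFrame below is synthetic stand-in data that I generated just so the example runs on its own; in the real project the columns would come from an actual home sales dataset:

```python
# Sketch of fitting the OLS model from this post, assuming a DataFrame
# with 'price', 'sqft_living', 'grade', and 'bedrooms' columns.
# The data here is randomly generated purely to make the example runnable.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "sqft_living": rng.uniform(500, 4000, n),
    "grade": rng.integers(3, 13, n),
    "bedrooms": rng.integers(1, 6, n),
})
# Synthetic price: a linear combination of the features plus noise
df["price"] = (150 * df["sqft_living"] + 20000 * df["grade"]
               + 5000 * df["bedrooms"] + rng.normal(0, 50000, n))

# Fit the regression and print the full OLS Regression Results table
model = smf.ols("price ~ sqft_living + grade + bedrooms", data=df).fit()
print(model.summary())
```

The printed summary is the table discussed throughout this post: the dependent variable, R Squared, Adjusted R Squared, and the condition number all appear in it.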
It's also important to mention that I will not be going over the formulas used to compute these statistics in depth; there is plenty of documentation, along with articles and videos, that will do this much better than I can.
Great, with that being said, let's dive deeper into the OLS model and break down some of the most significant values it displays that may not be as intuitive for those of us without a background in statistics. Today we'll look at R Squared, Adjusted R Squared, and the Condition Number.
R Squared and Adjusted R Squared
The R Squared value, also known as the Coefficient of Determination, is one of the first metrics you will see after generating an OLS model. Like all metrics within an OLS model, it is meant to show you how well your Linear Regression predictions are fitted to the actual values. More specifically, your R Squared value is the percentage of the variation in your dependent variable that is explained by your model. As you can probably guess, the elusive "perfect Linear Regression model" I mentioned earlier would have an R Squared value of 100%, because 100% of the change in the y variable can be explained by a change in the x variable(s). So how would you interpret an R Squared of 54%, as we see in the example OLS model?
In the case of predicting home values, an R Squared of 54% would be considered low, and we should consider building a new model by adding or deleting features, or possibly even trying a different regression entirely. There may, however, be some instances where an R Squared value of 54% is not enough reason to change the model at all. This "Statistics By Jim" article can explain some of those instances for you.
Adjusted R Squared is very similar to the R Squared value; much of the time the two numbers may even be nearly the same, as is seen in the OLS model example above. The difference is that Adjusted R Squared takes into account the number of independent variables a model uses: it only increases when a new variable improves the model more than would be expected by chance, and it can actually decrease as unhelpful variables are added. You may be tempted to add as many variables as possible to get your plain R Squared value up, but this Investopedia.com article can explain to you why that may not be such a good idea.
Condition Number
A high condition number is an indicator that there is likely strong multicollinearity between two or more of the independent variables in a linear regression. Multicollinearity is exactly what it sounds like: the variables are collinear with one another, meaning there is likely some relationship between multiple x variables that we weren't aware of. This article by Conner Leavitt can further explain why multicollinearity in regression modeling is bad.
In our example, the OLS model explicitly mentions that our condition number is high and multicollinearity may be present. In our case, we may want to try dropping an x variable to see if our condition number decreases, or we could use the .corr() method on the entire DataFrame to view the correlations between all variables. Generally, an absolute correlation of 0.7 or more is an indicator that multicollinearity between those two variables is present.
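The .corr() check above can be sketched like this. The data is synthetic, with 'grade' deliberately constructed to correlate with 'sqft_living' so the flagging step has something to find (in a real dataset you would just call .corr() on your existing DataFrame):

```python
# Spotting multicollinearity candidates with pandas .corr().
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300
sqft = rng.uniform(500, 4000, n)
df = pd.DataFrame({
    "sqft_living": sqft,
    # 'grade' is made deliberately correlated with 'sqft_living' for this demo
    "grade": (sqft / 400 + rng.normal(0, 1, n)).round(),
    "bedrooms": rng.integers(1, 6, n).astype(float),
})

# Pairwise correlation matrix of all features
corr = df.corr()
print(corr.round(2))

# Flag any pair of features whose absolute correlation exceeds 0.7
high = [(a, b) for a in corr.columns for b in corr.columns
        if a < b and abs(corr.loc[a, b]) > 0.7]
print("Highly correlated pairs:", high)
```

Any pair that shows up in the flagged list is a candidate for dropping one of its two variables and refitting the model.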
And Many More...
These three metrics do not even scratch the surface; as you can see, there are still so many valuable metrics to explore in the OLS model. Until my next blog post, you can review this article to help explain the significance of a few of the other metrics.

