Housing data investigation and linear regression modeling

Saif Kasmani
Jun 15, 2020 · 4 min read

Project: As students in Flatiron School’s online data science boot camp, we were given our Module 2 project: find price predictors and build a linear regression model capable of predicting sale prices for homes in King County, Washington, the county that includes Seattle.

Data we worked with: kc_house_data.csv, which can also be found here

-> https://github.com/learn-co-students/dsc-mod-2-project-v2-1-onl01-dtsc-ft-030220/blob/master/kc_house_data.csv

Here is a general outline of the data set:

-> https://github.com/learn-co-students/dsc-mod-2-project-v2-1-onl01-dtsc-ft-030220/blob/master/column_names.md

Exploratory Data Analysis: We explored the data from several angles in this project; a few of the more interesting findings are explained below.

As we can see, there does appear to be a linear relationship between higher house grades and higher sale prices. Here grade represents the construction quality of a house on a scale from 1 to 13; the description of the grades can be found here: https://info.kingcounty.gov/assessor/esales/Glossary.aspx?type=r . We also see that the quality of houses improved steadily from 1943 onward.
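The grade-to-price relationship can be sketched on toy data. The real notebook works on the full kc_house_data columns; the numbers below are made up purely for illustration:

```python
import pandas as pd

# Toy stand-in for the housing data (real prices and grades differ)
df = pd.DataFrame({
    "grade": [5, 5, 7, 7, 9, 9, 11, 11],
    "price": [200_000, 220_000, 400_000, 430_000, 700_000, 750_000, 1_100_000, 1_200_000],
})

# Median price per grade; a roughly linear relationship shows up as a
# steadily increasing series
median_by_grade = df.groupby("grade")["price"].median()
print(median_by_grade.is_monotonic_increasing)  # → True
```

In the notebook the same idea is visualized as a scatter or box plot of price against grade rather than a printed series.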

We then looked into the effects of renovation and whether more recent renovations had a stronger effect on the sale price of a house.

While the relationship in our data was not strong, it was interesting to see a slight trend: more recently renovated homes tended to have slightly higher possible selling prices than those renovated longer ago.

Given the lack of information on waterfront properties, we were curious whether a house’s proximity to water would have an effect on the sale price, as well as the house’s distance from downtown Seattle.

There did indeed appear to be a correlation between the price of a house and its distance from Seattle, with houses closer to the city tending to command higher prices.
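A distance-to-downtown feature like this is typically derived from the data set’s `lat` and `long` columns with a great-circle (haversine) calculation. A minimal sketch, assuming approximate downtown Seattle coordinates of (47.6062, -122.3321):

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points."""
    r = 3958.8  # Earth's mean radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

SEATTLE = (47.6062, -122.3321)  # assumed downtown reference point
# A house near Bellevue (~47.61, -122.20) comes out at roughly 6 miles
print(round(haversine_miles(47.61, -122.20, *SEATTLE), 1))
```

Applying this row-wise to `lat`/`long` gives a single distance column that can be plotted against price.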

Model: We first checked for multicollinearity using a correlation heat map.
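A heat map like this is usually built from the pairwise correlation matrix. A minimal sketch on stand-in data (the real notebook uses the housing columns; the seaborn call is the typical way to render it):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy stand-in for a few housing features
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["sqft", "grade", "price"])

corr = df.corr()          # pairwise Pearson correlations
print(corr.shape)         # → (3, 3)
# import seaborn as sns
# sns.heatmap(corr, annot=True, cmap="coolwarm")  # renders the heat map
```

Cells close to ±1 off the diagonal flag feature pairs that are candidates for removal before fitting.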

We then went on to create our OLS model using the function below:

import statsmodels.api as sm

def make_ols_sm(df, cols, add_constant=False, target='price_o'):
    x = df[cols]
    if add_constant:
        x = sm.add_constant(x)
    ols = sm.OLS(df[target], x)
    res = ols.fit()
    print(res.summary())
    return res

cols = ['time_since_renovated', 'condition_grade_sq', 'Renovated', 'basement', 'view']
res = make_ols_sm(df, cols)

The features we used to predict the price are :

Time since renovation

Condition and Grade

Has the house been renovated

Does it have a basement

View count

Here we got an R² score of 0.89.

This means the model explains a good amount of the variance in the data and is worth carrying forward for testing and accuracy calculation on the test data.

An R² below 0.5 suggests the model is tending towards underfitting.

An R² above 0.9 suggests the model is tending towards overfitting.
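As a reminder of what the score measures, R² is one minus the ratio of residual to total variance, which sklearn computes directly. A small worked example (toy numbers, not from the housing data):

```python
from sklearn.metrics import r2_score

y_true = [3, 5, 7, 9]
y_pred = [2.8, 5.2, 6.9, 9.1]

# SS_res = 0.04 + 0.04 + 0.01 + 0.01 = 0.10; SS_tot = 20
# R² = 1 - 0.10 / 20 = 0.995
print(round(r2_score(y_true, y_pred), 3))  # → 0.995
```

A perfect fit gives 1.0, and predicting the mean for every point gives 0.0, which is what anchors the underfitting/overfitting rules of thumb above.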

While not all models were saved, we made numerous attempts with various adjustments, all of which produced similar or diminishing results.

We also tried the recursive feature elimination method here, but high amounts of multicollinearity seemed unavoidable with it: in the data set we were provided, almost all of the features were correlated with one another.
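For reference, recursive feature elimination repeatedly fits a model and drops the weakest feature until a target count remains. A minimal sketch with sklearn’s `RFE` on synthetic data (only columns 0 and 2 actually drive the target):

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 4 * X[:, 0] + 2 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Keep the 2 strongest features; RFE drops the smallest-coefficient
# feature one at a time until only 2 remain
rfe = RFE(LinearRegression(), n_features_to_select=2).fit(X, y)
print(rfe.support_)  # columns 0 and 2 should be the ones kept
```

The catch we ran into is that RFE ranks features by coefficient size within the model, which does nothing to break up correlated feature groups, so multicollinearity can survive the elimination.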

At first glance, it looked like we had finally made a great model for our provided data.

However, further investigation showed that our test results weren’t consistent with each other. Essentially, we had forgotten to include a constant in our model; without an intercept, the regression line is forced through the origin, which makes little sense given that a house will never sell for $0. After including the constant we got a much different result, one that confirmed what our initial test results had been trying to tell us, as you will see in the cross validation section below.

Calculating VIF scores to check for multicollinearity:

from statsmodels.stats.outliers_influence import variance_inflation_factor

X = df[cols]
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
list(zip(cols, vif))
VIF scores:
[('time_since_renovated', 2.8410498876413746),
('condition_grade_sq', 3.1143028769918084),
('Renovated', 1.0503986065798132),
('basement', 1.7405733555157932),
('view', 1.1190630106116475)]

The VIF scores are all well below the common threshold of 5, indicating no problematic multicollinearity.

Cross Validation:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
cv_scores = cross_val_score(LinearRegression(), X, df['price_o'], scoring='neg_mean_squared_error', cv=5, n_jobs=-1)
cv_scores

We then sanity-checked the model’s predictions on a sample taken from the data frame:

x_sample = df.sample(n=1)
x_sample.head()
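The `neg_mean_squared_error` scores above come back negative by sklearn convention (higher is better), so they are usually flipped and square-rooted to get an RMSE in price units. A sketch on synthetic data showing the conversion:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

cv_scores = cross_val_score(LinearRegression(), X, y,
                            scoring="neg_mean_squared_error", cv=5)

rmse = np.sqrt(-cv_scores)  # flip the sign, then take the root
print(rmse.mean())          # average per-fold RMSE, in the target's units
```

On the housing data, an RMSE in dollars is far easier to interpret than a raw negative MSE, and large fold-to-fold swings in it are exactly the kind of inconsistency that flagged our missing constant.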

Conclusion:

Well, this may explain why our residuals looked the way they did, and also why our predictions seemed to vary so much. While we may have addressed the multicollinearity, the predictive power of our model still leaves much to be desired. Ideally, I’d like to re-evaluate some of our feature selection to see whether the model can be improved without reintroducing multicollinear relationships. The entire notebook can be found here: https://github.com/saifzkb/dsc-mod-2-project-v2-1-onl01-dtsc-ft-041320
