Predict Housing Prices

The goal here is to build a machine learning model that predicts housing prices in California using the California census data. The data contains metrics such as population, median income, and median house value for each district. The data is mostly raw and lacks any form of preprocessing, so a lot of time is invested in cleaning and preparing it before a mathematical model can extract any insights. The data is modeled with Linear Regression, a Decision Tree regressor, and a Random Forest regressor (plus a Support Vector regressor for comparison).

Import a few common libraries, make matplotlib plot figures inline, and define a function to save the figures:

In [1]:
import numpy as np
import os

np.random.seed(42)

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "end_to_end_project"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

import warnings
warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")

Function to fetch the data from the server.

It only needs to be called when fresh data has to be downloaded from the server.

In [2]:
import os
import tarfile
from six.moves import urllib

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url= HOUSING_URL, housing_path= HOUSING_PATH):
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path= housing_path)
    housing_tgz.close()
In [3]:
#fetch_housing_data()

Reading the raw data

Read and study the unprocessed data. Plotting it in various ways gives us some insight into which features are likely to have an impact on the housing prices, which features should be removed, and which new features can be derived from the existing ones.

In [4]:
import pandas as pd

def load_housing_data(housing_path= HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)
In [5]:
housing = load_housing_data()
housing.head()
Out[5]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY
In [6]:
housing.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude             20640 non-null float64
latitude              20640 non-null float64
housing_median_age    20640 non-null float64
total_rooms           20640 non-null float64
total_bedrooms        20433 non-null float64
population            20640 non-null float64
households            20640 non-null float64
median_income         20640 non-null float64
median_house_value    20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
In [7]:
housing["ocean_proximity"].value_counts()
Out[7]:
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64
In [8]:
housing.describe()
Out[8]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
count 20640.000000 20640.000000 20640.000000 20640.000000 20433.000000 20640.000000 20640.000000 20640.000000 20640.000000
mean -119.569704 35.631861 28.639486 2635.763081 537.870553 1425.476744 499.539680 3.870671 206855.816909
std 2.003532 2.135952 12.585558 2181.615252 421.385070 1132.462122 382.329753 1.899822 115395.615874
min -124.350000 32.540000 1.000000 2.000000 1.000000 3.000000 1.000000 0.499900 14999.000000
25% -121.800000 33.930000 18.000000 1447.750000 296.000000 787.000000 280.000000 2.563400 119600.000000
50% -118.490000 34.260000 29.000000 2127.000000 435.000000 1166.000000 409.000000 3.534800 179700.000000
75% -118.010000 37.710000 37.000000 3148.000000 647.000000 1725.000000 605.000000 4.743250 264725.000000
max -114.310000 41.950000 52.000000 39320.000000 6445.000000 35682.000000 6082.000000 15.000100 500001.000000
In [9]:
%matplotlib inline
import matplotlib.pyplot as plt
housing.hist(bins= 50, figsize=(20,15))
save_fig("attribute_histogram_plots")
plt.show()
Saving figure attribute_histogram_plots
In [10]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
In [11]:
test_set.head()
Out[11]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
20046 -119.01 36.06 25.0 1505.0 NaN 1392.0 359.0 1.6812 47700.0 INLAND
3024 -119.46 35.14 30.0 2943.0 NaN 1565.0 584.0 2.5313 45800.0 INLAND
15663 -122.44 37.80 52.0 3830.0 NaN 1310.0 963.0 3.4801 500001.0 NEAR BAY
20484 -118.72 34.28 17.0 3051.0 NaN 1705.0 495.0 5.7376 218600.0 <1H OCEAN
9814 -121.93 36.62 34.0 2351.0 NaN 1063.0 428.0 3.7250 278000.0 NEAR OCEAN
In [12]:
housing["median_income"].hist()
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x22d9d84e4a8>
In [13]:
# Divide by 1.5 to limit the number of income categories
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)

# Cap the categories at 5: where() keeps values for which the condition is True
# and replaces the rest with 5.0
housing["income_cat"].where(housing["income_cat"] < 5, 5.0, inplace=True)
In [14]:
housing["income_cat"].value_counts()
Out[14]:
3.0    7236
2.0    6581
4.0    3639
5.0    2362
1.0     822
Name: income_cat, dtype: int64
In [15]:
housing["income_cat"].hist()
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x22d9da3d1d0>
In [16]:
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits= 1, test_size= 0.2, random_state= 42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
In [17]:
strat_test_set["income_cat"].value_counts() / len(strat_test_set)
Out[17]:
3.0    0.350533
2.0    0.318798
4.0    0.176357
5.0    0.114583
1.0    0.039729
Name: income_cat, dtype: float64
In [18]:
housing["income_cat"].value_counts() / len(housing)
Out[18]:
3.0    0.350581
2.0    0.318847
4.0    0.176308
5.0    0.114438
1.0    0.039826
Name: income_cat, dtype: float64
In [19]:
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat",axis=1, inplace=True)

Discover and Visualize the data to gain insights

In [20]:
housing = strat_train_set.copy()  # make a copy to be on safe side
In [21]:
housing.plot(kind= "scatter", x="longitude", y="latitude", alpha=0.1)
save_fig("better_visualization_plot")
Saving figure better_visualization_plot
In [22]:
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
            s=housing["population"]/100, label="population", figsize=(10,7),
            c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
            sharex=False)
plt.legend()
save_fig("housing_prices_scatterplot")
Saving figure housing_prices_scatterplot
In [23]:
import matplotlib.image as mpimg
california_img=mpimg.imread(PROJECT_ROOT_DIR + '/images/end_to_end_project/california.png')
ax = housing.plot(kind="scatter", x="longitude", y="latitude", figsize=(10,7),
                       s=housing['population']/100, label="Population",
                       c="median_house_value", cmap=plt.get_cmap("jet"),
                       colorbar=False, alpha=0.4,
                      )
plt.imshow(california_img, extent=[-124.55, -113.80, 32.45, 42.05], alpha=0.5,
           cmap=plt.get_cmap("jet"))
plt.ylabel("Latitude", fontsize=14)
plt.xlabel("Longitude", fontsize=14)

prices = housing["median_house_value"]
tick_values = np.linspace(prices.min(), prices.max(), 11)
cbar = plt.colorbar()
cbar.ax.set_yticklabels(["$%dk"%(round(v/1000)) for v in tick_values], fontsize=14)
cbar.set_label('Median House Value', fontsize=16)

plt.legend(fontsize=16)
save_fig("california_housing_prices_plot")
plt.show()
Saving figure california_housing_prices_plot

Looking for Correlations

In [24]:
corr_matrix = housing.corr()
In [25]:
corr_matrix["median_house_value"].sort_values(ascending=False)
Out[25]:
median_house_value    1.000000
median_income         0.687160
total_rooms           0.135097
housing_median_age    0.114110
households            0.064506
total_bedrooms        0.047689
population           -0.026920
longitude            -0.047432
latitude             -0.142724
Name: median_house_value, dtype: float64
In [26]:
from pandas.plotting import scatter_matrix

attributes = ["median_house_value", "median_income", "total_rooms",
             "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
save_fig("scatter_matrix_plot")
Saving figure scatter_matrix_plot
In [27]:
housing.plot(kind="scatter", x="median_income", y="median_house_value",
            alpha=0.1)
plt.axis([0, 16, 0, 550000])
save_fig("income_vs_house_value_scatterplot")
Saving figure income_vs_house_value_scatterplot

Experimenting with Attribute Combinations

Deriving new features from the existing ones may give us more meaningful attributes.

In [28]:
housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"]=housing["population"]/housing["households"]
In [29]:
corr_matrix = housing.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)
Out[29]:
median_house_value          1.000000
median_income               0.687160
rooms_per_household         0.146285
total_rooms                 0.135097
housing_median_age          0.114110
households                  0.064506
total_bedrooms              0.047689
population_per_household   -0.021985
population                 -0.026920
longitude                  -0.047432
latitude                   -0.142724
bedrooms_per_room          -0.259984
Name: median_house_value, dtype: float64

rooms_per_household:

In [30]:
housing.plot(kind="scatter", x="rooms_per_household", y="median_house_value",
            alpha=0.2)
plt.axis([0, 5, 0, 520000])
plt.axis()
Out[30]:
(0.0, 5.0, 0.0, 520000.0)
bedrooms_per_room:
In [31]:
housing.plot(kind="scatter", x="bedrooms_per_room", y="median_house_value",
            alpha=0.2)
plt.axis([0, 5, 0, 520000])
plt.axis()
Out[31]:
(0.0, 5.0, 0.0, 520000.0)
population_per_household:
In [32]:
housing.plot(kind="scatter", x="population_per_household", y="median_house_value",
            alpha=0.2)
plt.axis([0, 5, 0, 520000])
plt.axis()
Out[32]:
(0.0, 5.0, 0.0, 520000.0)
In [33]:
housing.describe()
Out[33]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value rooms_per_household bedrooms_per_room population_per_household
count 16512.000000 16512.000000 16512.000000 16512.000000 16354.000000 16512.000000 16512.000000 16512.000000 16512.000000 16512.000000 16354.000000 16512.000000
mean -119.575834 35.639577 28.653101 2622.728319 534.973890 1419.790819 497.060380 3.875589 206990.920724 5.440341 0.212878 3.096437
std 2.001860 2.138058 12.574726 2138.458419 412.699041 1115.686241 375.720845 1.904950 115703.014830 2.611712 0.057379 11.584826
min -124.350000 32.540000 1.000000 6.000000 2.000000 3.000000 2.000000 0.499900 14999.000000 1.130435 0.100000 0.692308
25% -121.800000 33.940000 18.000000 1443.000000 295.000000 784.000000 279.000000 2.566775 119800.000000 4.442040 0.175304 2.431287
50% -118.510000 34.260000 29.000000 2119.500000 433.000000 1164.000000 408.000000 3.540900 179500.000000 5.232284 0.203031 2.817653
75% -118.010000 37.720000 37.000000 3141.000000 644.000000 1719.250000 602.000000 4.744475 263900.000000 6.056361 0.239831 3.281420
max -114.310000 41.950000 52.000000 39320.000000 6210.000000 35682.000000 5358.000000 15.000100 500001.000000 141.909091 1.000000 1243.333333

Prepare the data for ML Algorithms

In [34]:
housing = strat_train_set.drop("median_house_value", axis=1) # drop labels for training data
housing_labels = strat_train_set["median_house_value"].copy()
In [35]:
housing_labels.head()
Out[35]:
17606    286600.0
18632    340600.0
14650    196900.0
3230      46300.0
3555     254500.0
Name: median_house_value, dtype: float64
In [36]:
housing.head()
Out[36]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income ocean_proximity
17606 -121.89 37.29 38.0 1568.0 351.0 710.0 339.0 2.7042 <1H OCEAN
18632 -121.93 37.05 14.0 679.0 108.0 306.0 113.0 6.4214 <1H OCEAN
14650 -117.20 32.77 31.0 1952.0 471.0 936.0 462.0 2.8621 NEAR OCEAN
3230 -119.61 36.31 25.0 1847.0 371.0 1460.0 353.0 1.8839 INLAND
3555 -118.59 34.23 17.0 6592.0 1525.0 4459.0 1463.0 3.0347 <1H OCEAN
In [37]:
from sklearn.preprocessing import Imputer

imputer  = Imputer(strategy="median")
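Note: Imputer was deprecated in scikit-learn 0.20 and later removed. If running this notebook on a newer version, the equivalent class (as far as I know) is SimpleImputer:

# Equivalent on scikit-learn >= 0.20, where Imputer is deprecated/removed
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")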
In [38]:
housing_num = housing.drop("ocean_proximity", axis=1)
housing_num.head()
Out[38]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income
17606 -121.89 37.29 38.0 1568.0 351.0 710.0 339.0 2.7042
18632 -121.93 37.05 14.0 679.0 108.0 306.0 113.0 6.4214
14650 -117.20 32.77 31.0 1952.0 471.0 936.0 462.0 2.8621
3230 -119.61 36.31 25.0 1847.0 371.0 1460.0 353.0 1.8839
3555 -118.59 34.23 17.0 6592.0 1525.0 4459.0 1463.0 3.0347
In [39]:
imputer.fit(housing_num)
Out[39]:
Imputer(axis=0, copy=True, missing_values='NaN', strategy='median', verbose=0)
In [40]:
imputer.statistics_
Out[40]:
array([-118.51  ,   34.26  ,   29.    , 2119.5   ,  433.    , 1164.    ,
        408.    ,    3.5409])
In [41]:
housing_num.median().values
Out[41]:
array([-118.51  ,   34.26  ,   29.    , 2119.5   ,  433.    , 1164.    ,
        408.    ,    3.5409])

Transforming the training set

In [42]:
X = imputer.transform(housing_num)
In [43]:
housing_tr = pd.DataFrame(X, columns=housing_num.columns)
housing_tr.head()
Out[43]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income
0 -121.89 37.29 38.0 1568.0 351.0 710.0 339.0 2.7042
1 -121.93 37.05 14.0 679.0 108.0 306.0 113.0 6.4214
2 -117.20 32.77 31.0 1952.0 471.0 936.0 462.0 2.8621
3 -119.61 36.31 25.0 1847.0 371.0 1460.0 353.0 1.8839
4 -118.59 34.23 17.0 6592.0 1525.0 4459.0 1463.0 3.0347

Handling Text and Categorical Attributes

In [44]:
housing_cat = housing["ocean_proximity"]
housing_cat.head(10)
Out[44]:
17606     <1H OCEAN
18632     <1H OCEAN
14650    NEAR OCEAN
3230         INLAND
3555      <1H OCEAN
19480        INLAND
8879      <1H OCEAN
13685        INLAND
4937      <1H OCEAN
4861      <1H OCEAN
Name: ocean_proximity, dtype: object
In [45]:
                                 
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
housing_cat_encoded = encoder.fit_transform(housing_cat)
housing_cat_encoded[:10]
Out[45]:
array([0, 0, 4, 1, 0, 1, 0, 1, 0, 0], dtype=int64)
In [46]:
from sklearn.preprocessing import LabelBinarizer  

encoder = LabelBinarizer()
housing_cat_1hot = encoder.fit_transform(housing_cat)
housing_cat_1hot
Out[46]:
array([[1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1],
       ...,
       [0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 1, 0]])

Custom Transformers

In [47]:
from sklearn.base import BaseEstimator, TransformerMixin

#column index
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True):
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self #nothing to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                        bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
In [48]:
attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)

housing_extra_attribs = pd.DataFrame(
    housing_extra_attribs,
    columns=list(housing.columns)+["rooms_per_household","population_per_household"])
housing_extra_attribs.head()
Out[48]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income ocean_proximity rooms_per_household population_per_household
0 -121.89 37.29 38 1568 351 710 339 2.7042 <1H OCEAN 4.62537 2.0944
1 -121.93 37.05 14 679 108 306 113 6.4214 <1H OCEAN 6.00885 2.70796
2 -117.2 32.77 31 1952 471 936 462 2.8621 NEAR OCEAN 4.22511 2.02597
3 -119.61 36.31 25 1847 371 1460 353 1.8839 INLAND 5.23229 4.13598
4 -118.59 34.23 17 6592 1525 4459 1463 3.0347 <1H OCEAN 4.50581 3.04785

Building a Pipeline for preprocessing the numerical attributes:

In [49]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
        ('imputer', Imputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

housing_num_tr = num_pipeline.fit_transform(housing_num)
In [50]:
housing_num_tr
Out[50]:
array([[-1.15604281,  0.77194962,  0.74333089, ..., -0.31205452,
        -0.08649871,  0.15531753],
       [-1.17602483,  0.6596948 , -1.1653172 , ...,  0.21768338,
        -0.03353391, -0.83628902],
       [ 1.18684903, -1.34218285,  0.18664186, ..., -0.46531516,
        -0.09240499,  0.4222004 ],
       ...,
       [ 1.58648943, -0.72478134, -1.56295222, ...,  0.3469342 ,
        -0.03055414, -0.52177644],
       [ 0.78221312, -0.85106801,  0.18664186, ...,  0.02499488,
         0.06150916, -0.30340741],
       [-1.43579109,  0.99645926,  1.85670895, ..., -0.22852947,
        -0.09586294,  0.10180567]])
In [51]:
# Create a class to select numerical or categorical columns,
# since scikit-learn doesn't handle pandas DataFrames directly (at the time of writing)
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values 

Note: The following code was copied from https://github.com/scikit-learn/scikit-learn/pull/7375/files#diff-1e175ddb0d84aad0a578d34553f6f9c6, since LabelBinarizer's fit_transform() only takes 2 arguments (self and the labels), but a scikit-learn Pipeline needs to pass it 3 (self, the features, and the labels).

There are 3 ways to solve this issue:

1) use the scikit-learn 0.20 OneHotEncoder, which supports string features (still in development at the time this code was written)

2) use scikit-learn 0.18

3) use a custom LabelBinarizer which can handle 3 arguments

I have gone with the 3rd option, as the code was readily available in the scikit-learn GitHub repo; for comparison, a sketch of the 1st option is included right after the class below.

In [52]:
class LabelBinarizerPipelineFriendly(LabelBinarizer):
    def fit(self, X, y=None):
        """Fit the binarizer on the X input (the extra y argument is ignored)."""
        super(LabelBinarizerPipelineFriendly, self).fit(X)
        return self  # fit should return self, as scikit-learn expects
    def transform(self, X, y=None):
        return super(LabelBinarizerPipelineFriendly, self).transform(X)

    def fit_transform(self, X, y=None):
        return super(LabelBinarizerPipelineFriendly, self).fit(X).transform(X)
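For comparison, here is a minimal sketch of the 1st option. It assumes scikit-learn >= 0.20, where OneHotEncoder accepts string categories and ColumnTransformer replaces DataFrameSelector + FeatureUnion; it is not used anywhere else in this notebook:

# Sketch of option 1 (requires scikit-learn >= 0.20); reuses the num_pipeline defined earlier
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

num_attribs_sketch = list(housing_num)      # the 8 numerical columns
cat_attribs_sketch = ["ocean_proximity"]

preprocess_sketch = ColumnTransformer([
    ("num", num_pipeline, num_attribs_sketch),
    ("cat", OneHotEncoder(), cat_attribs_sketch),   # one-hot encodes the string column directly
])
# housing_prepared_sketch = preprocess_sketch.fit_transform(housing)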
In [53]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
        ('selector', DataFrameSelector(num_attribs)),
        ('imputer', Imputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('label_binarizer_pipeline_friendly', LabelBinarizerPipelineFriendly()),
    ])
In [54]:
from sklearn.pipeline import  FeatureUnion

full_pipeline = FeatureUnion(transformer_list = [
        ("num_pipeline", num_pipeline),
        ("cat_pipeline", cat_pipeline),
    ])
In [55]:
housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared
Out[55]:
array([[-1.15604281,  0.77194962,  0.74333089, ...,  0.        ,
         0.        ,  0.        ],
       [-1.17602483,  0.6596948 , -1.1653172 , ...,  0.        ,
         0.        ,  0.        ],
       [ 1.18684903, -1.34218285,  0.18664186, ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [ 1.58648943, -0.72478134, -1.56295222, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.78221312, -0.85106801,  0.18664186, ...,  0.        ,
         0.        ,  0.        ],
       [-1.43579109,  0.99645926,  1.85670895, ...,  0.        ,
         1.        ,  0.        ]])
In [56]:
housing_prepared.shape
Out[56]:
(16512, 16)

Select and train a model

Trying Linear Regression without cross-validation

In [57]:
#training
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)
Out[57]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [58]:
# trying the full pipeline on a few training instances

some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)

print("Prediction:", lin_reg.predict(some_data_prepared))
Prediction: [210644.60459286 317768.80697211 210956.43331178  59218.98886849
 189747.55849879]

Compare against the actual values

In [59]:
print("Labels:", list(some_labels))
Labels: [286600.0, 340600.0, 196900.0, 46300.0, 254500.0]
In [60]:
some_data_prepared
Out[60]:
array([[-1.15604281,  0.77194962,  0.74333089, -0.49323393, -0.44543821,
        -0.63621141, -0.42069842, -0.61493744, -0.31205452, -0.08649871,
         0.15531753,  1.        ,  0.        ,  0.        ,  0.        ,
         0.        ],
       [-1.17602483,  0.6596948 , -1.1653172 , -0.90896655, -1.0369278 ,
        -0.99833135, -1.02222705,  1.33645936,  0.21768338, -0.03353391,
        -0.83628902,  1.        ,  0.        ,  0.        ,  0.        ,
         0.        ],
       [ 1.18684903, -1.34218285,  0.18664186, -0.31365989, -0.15334458,
        -0.43363936, -0.0933178 , -0.5320456 , -0.46531516, -0.09240499,
         0.4222004 ,  0.        ,  0.        ,  0.        ,  0.        ,
         1.        ],
       [-0.01706767,  0.31357576, -0.29052016, -0.36276217, -0.39675594,
         0.03604096, -0.38343559, -1.04556555, -0.07966124,  0.08973561,
        -0.19645314,  0.        ,  1.        ,  0.        ,  0.        ,
         0.        ],
       [ 0.49247384, -0.65929936, -0.92673619,  1.85619316,  2.41221109,
         2.72415407,  2.57097492, -0.44143679, -0.35783383, -0.00419445,
         0.2699277 ,  1.        ,  0.        ,  0.        ,  0.        ,
         0.        ]])
In [61]:
# Calculating the RMSE over the whole training set for Linear Regression

from sklearn.metrics import mean_squared_error

housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse
Out[61]:
68628.19819848923

Trying DecisionTreeRegressor without cross-validation

In [62]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)
Out[62]:
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
In [63]:
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse
Out[63]:
0.0

An RMSE of 0.0 shows that the DecisionTreeRegressor has badly overfit the training data. Hence, I will now evaluate the DecisionTreeRegressor again, this time using cross-validation.

Trying DecisionTreeRegressor with cross-validation

In [64]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, housing_prepared, housing_labels, 
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)
In [65]:
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard Deviation:", scores.std())

display_scores(tree_rmse_scores)
Scores: [68292.14894598 66782.53562278 70759.47269143 69889.91977163
 70589.76882605 74182.94790873 71548.27958524 70250.63692186
 76559.98348136 69500.68470599]
Mean: 70835.63784610563
Standard Deviation: 2654.4756661778547

Trying LinearRegression with cross-validation

In [66]:
lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels, 
                         scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)
Scores: [66782.73843989 66960.118071   70347.95244419 74739.57052552
 68031.13388938 71193.84183426 64969.63056405 68281.61137997
 71552.91566558 67665.10082067]
Mean: 69052.46136345083
Standard Deviation: 2731.6740017983425

Trying RandomForestRegressor with cross-validation

In [67]:
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared, housing_labels)
Out[67]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)
In [68]:
forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                               scoring = "neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display(forest_rmse_scores)
array([52358.66679404, 49743.79903788, 51335.21486934, 54925.88657132,
       52257.98169035, 55720.86291923, 52842.69505175, 50082.9703802 ,
       55483.30713101, 53171.12873845])

Trying Support Vector Regressor without cross-validation

In [69]:
from sklearn.svm import SVR

svm_reg = SVR(kernel="linear")
svm_reg.fit(housing_prepared, housing_labels)
housing_predictions = svm_reg.predict(housing_prepared)
svm_mse = mean_squared_error(housing_labels, housing_predictions)
svm_rmse = np.sqrt(svm_mse)
svm_rmse
Out[69]:
111094.6308539982

Fine-tuning the models

Now that we have shortlisted some promising models, the next goal is to fine-tune them. One way is to fiddle with the hyperparameters manually until we find a great combination of values, but that is a tedious and time-consuming process. Instead we can take the help of Scikit-Learn's:

1) GridSearchCV

2) RandomizedSearchCV

In [70]:
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features':[2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators':[3,10], 'max_features': [2, 3, 4]},
    ]

forest_reg = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(forest_reg, param_grid, cv=5, 
                           scoring='neg_mean_squared_error', return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)
Out[70]:
GridSearchCV(cv=5, error_score='raise',
       estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=42, verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid=[{'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]}, {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='neg_mean_squared_error', verbose=0)
The best hyperparameter combination found:
In [71]:
grid_search.best_params_
Out[71]:
{'max_features': 8, 'n_estimators': 30}
In [72]:
grid_search.best_estimator_
Out[72]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features=8, max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=30, n_jobs=1, oob_score=False, random_state=42,
           verbose=0, warm_start=False)
Looking at the score of each hyperparameter combination tested during the grid search:
In [73]:
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
63647.85444595992 {'max_features': 2, 'n_estimators': 3}
55611.50159876327 {'max_features': 2, 'n_estimators': 10}
53370.06407363344 {'max_features': 2, 'n_estimators': 30}
60959.138858487866 {'max_features': 4, 'n_estimators': 3}
52740.58416665252 {'max_features': 4, 'n_estimators': 10}
50374.14214614731 {'max_features': 4, 'n_estimators': 30}
58661.2866461823 {'max_features': 6, 'n_estimators': 3}
52009.973979776936 {'max_features': 6, 'n_estimators': 10}
50154.11777368494 {'max_features': 6, 'n_estimators': 30}
57865.36168014446 {'max_features': 8, 'n_estimators': 3}
51730.07550866553 {'max_features': 8, 'n_estimators': 10}
49694.85143334442 {'max_features': 8, 'n_estimators': 30}
62874.407393096284 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
54643.49980834466 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
59437.89228588419 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
52735.358293621044 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}
57490.01682787995 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}
51008.261567163354 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}
In [74]:
pd.DataFrame(grid_search.cv_results_)
Out[74]:
mean_fit_time mean_score_time mean_test_score mean_train_score param_bootstrap param_max_features param_n_estimators params rank_test_score split0_test_score ... split2_test_score split2_train_score split3_test_score split3_train_score split4_test_score split4_train_score std_fit_time std_score_time std_test_score std_train_score
0 0.109160 0.001009 -4.051049e+09 -1.106013e+09 NaN 2 3 {'max_features': 2, 'n_estimators': 3} 18 -3.850668e+09 ... -4.194135e+09 -1.116843e+09 -3.906732e+09 -1.112813e+09 -4.169669e+09 -1.129842e+09 0.004321 0.001276 1.431223e+08 2.173798e+07
1 0.361192 0.015338 -3.092639e+09 -5.819353e+08 NaN 2 10 {'max_features': 2, 'n_estimators': 10} 11 -3.052380e+09 ... -3.124982e+09 -5.780873e+08 -2.865117e+09 -5.713421e+08 -3.169914e+09 -5.797944e+08 0.020881 0.002536 1.306954e+08 7.584886e+06
2 1.071234 0.039592 -2.848364e+09 -4.396234e+08 NaN 2 30 {'max_features': 2, 'n_estimators': 30} 9 -2.692176e+09 ... -2.943808e+09 -4.374429e+08 -2.619893e+09 -4.374715e+08 -2.968460e+09 -4.451903e+08 0.073755 0.007619 1.604534e+08 2.883885e+06
3 0.190187 0.003124 -3.716017e+09 -9.850011e+08 NaN 4 3 {'max_features': 4, 'n_estimators': 3} 16 -3.729600e+09 ... -3.736527e+09 -9.172986e+08 -3.404974e+09 -1.035901e+09 -3.914186e+09 -9.711998e+08 0.003927 0.006249 1.690029e+08 4.047487e+07
4 0.616044 0.012728 -2.781569e+09 -5.160154e+08 NaN 4 10 {'max_features': 4, 'n_estimators': 10} 8 -2.667093e+09 ... -2.891599e+09 -4.960301e+08 -2.613393e+09 -5.422542e+08 -2.949550e+09 -5.158794e+08 0.007450 0.006164 1.278498e+08 1.498960e+07
5 1.775046 0.045129 -2.537554e+09 -3.878685e+08 NaN 4 30 {'max_features': 4, 'n_estimators': 30} 3 -2.387199e+09 ... -2.663178e+09 -3.789712e+08 -2.397951e+09 -4.036920e+08 -2.649850e+09 -3.846171e+08 0.103493 0.004933 1.209935e+08 8.424973e+06
6 0.253776 0.005043 -3.441147e+09 -9.030212e+08 NaN 6 3 {'max_features': 6, 'n_estimators': 3} 14 -3.119576e+09 ... -3.587747e+09 -9.360639e+08 -3.331544e+09 -9.025026e+08 -3.577062e+09 -8.612945e+08 0.007995 0.006749 1.884229e+08 2.639683e+07
7 0.843010 0.015907 -2.705037e+09 -5.014210e+08 NaN 6 10 {'max_features': 6, 'n_estimators': 10} 6 -2.553481e+09 ... -2.762945e+09 -4.996537e+08 -2.519522e+09 -4.989516e+08 -2.906270e+09 -5.063617e+08 0.004442 0.000533 1.464963e+08 3.357661e+06
8 2.526129 0.049886 -2.515436e+09 -3.840197e+08 NaN 6 30 {'max_features': 6, 'n_estimators': 30} 2 -2.371924e+09 ... -2.607962e+09 -3.805596e+08 -2.351220e+09 -3.856159e+08 -2.662399e+09 -3.904866e+08 0.065170 0.003369 1.283580e+08 3.796810e+06
9 0.312826 0.000690 -3.348400e+09 -8.884890e+08 NaN 8 3 {'max_features': 8, 'n_estimators': 3} 13 -3.351347e+09 ... -3.396841e+09 -8.596460e+08 -3.131753e+09 -8.893698e+08 -3.509451e+09 -9.146734e+08 0.023116 0.001380 1.226683e+08 2.730057e+07
10 1.021553 0.014727 -2.676001e+09 -4.923247e+08 NaN 8 10 {'max_features': 8, 'n_estimators': 10} 5 -2.572358e+09 ... -2.844608e+09 -4.730979e+08 -2.462797e+09 -5.154156e+08 -2.777049e+09 -4.979127e+08 0.072044 0.001543 1.393253e+08 1.446900e+07
11 2.787518 0.040647 -2.469578e+09 -3.809175e+08 NaN 8 30 {'max_features': 8, 'n_estimators': 30} 1 -2.358884e+09 ... -2.591134e+09 -3.772512e+08 -2.319816e+09 -3.881153e+08 -2.528200e+09 -3.807496e+08 0.068374 0.002663 1.089395e+08 4.853344e+06
12 0.156347 0.007147 -3.953191e+09 0.000000e+00 False 2 3 {'bootstrap': False, 'max_features': 2, 'n_est... 17 -3.792367e+09 ... -4.050371e+09 -0.000000e+00 -3.668520e+09 -0.000000e+00 -4.087237e+09 -0.000000e+00 0.013939 0.004482 1.898516e+08 0.000000e+00
13 0.546384 0.016827 -2.985912e+09 -6.056027e-01 False 2 10 {'bootstrap': False, 'max_features': 2, 'n_est... 10 -2.808029e+09 ... -3.125519e+09 -0.000000e+00 -2.788623e+09 -0.000000e+00 -3.100391e+09 -2.967449e+00 0.049823 0.002991 1.535103e+08 1.181156e+00
14 0.205054 0.007173 -3.532863e+09 -1.214568e+01 False 3 3 {'bootstrap': False, 'max_features': 3, 'n_est... 15 -3.604830e+09 ... -3.552984e+09 -0.000000e+00 -3.610963e+09 -0.000000e+00 -3.455760e+09 -6.072840e+01 0.013442 0.002181 7.251650e+07 2.429136e+01
15 0.706700 0.015508 -2.781018e+09 -5.272080e+00 False 3 10 {'bootstrap': False, 'max_features': 3, 'n_est... 7 -2.756941e+09 ... -2.831963e+09 -0.000000e+00 -2.672258e+09 -0.000000e+00 -2.793018e+09 -5.465556e+00 0.021242 0.000898 6.329307e+07 8.093117e+00
16 0.259610 0.004598 -3.305102e+09 0.000000e+00 False 4 3 {'bootstrap': False, 'max_features': 4, 'n_est... 12 -3.143457e+09 ... -3.440323e+09 -0.000000e+00 -3.047980e+09 -0.000000e+00 -3.337950e+09 -0.000000e+00 0.023184 0.002688 1.867866e+08 0.000000e+00
17 0.864915 0.018158 -2.601843e+09 -3.028238e-03 False 4 10 {'bootstrap': False, 'max_features': 4, 'n_est... 4 -2.531436e+09 ... -2.606596e+09 -0.000000e+00 -2.437626e+09 -0.000000e+00 -2.726341e+09 -0.000000e+00 0.050260 0.007248 1.082086e+08 6.056477e-03

18 rows × 23 columns

In [75]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
        'n_estimators': randint(low=1, high=200),
        'max_features': randint(low=1, high=8),
    }

forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring='neg_mean_squared_error', random_state=42)
rnd_search.fit(housing_prepared, housing_labels)
Out[75]:
RandomizedSearchCV(cv=5, error_score='raise',
          estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=42, verbose=0, warm_start=False),
          fit_params=None, iid=True, n_iter=10, n_jobs=1,
          param_distributions={'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x0000022DA23D7C18>, 'max_features': <scipy.stats._distn_infrastructure.rv_frozen object at 0x0000022D9B196710>},
          pre_dispatch='2*n_jobs', random_state=42, refit=True,
          return_train_score='warn', scoring='neg_mean_squared_error',
          verbose=0)
In [76]:
cvres = rnd_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
49147.15241724505 {'max_features': 7, 'n_estimators': 180}
51396.876896929905 {'max_features': 5, 'n_estimators': 15}
50797.05737322649 {'max_features': 3, 'n_estimators': 72}
50840.744513982805 {'max_features': 5, 'n_estimators': 21}
49276.17530332962 {'max_features': 7, 'n_estimators': 122}
50775.46331678437 {'max_features': 3, 'n_estimators': 75}
50681.383924974936 {'max_features': 3, 'n_estimators': 88}
49612.152530468346 {'max_features': 5, 'n_estimators': 100}
50473.01751424941 {'max_features': 3, 'n_estimators': 150}
64458.25385034794 {'max_features': 5, 'n_estimators': 2}

Analyzing the Best Models and Their Errors

In [77]:
feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances
Out[77]:
array([7.33442355e-02, 6.29090705e-02, 4.11437985e-02, 1.46726854e-02,
       1.41064835e-02, 1.48742809e-02, 1.42575993e-02, 3.66158981e-01,
       5.64191792e-02, 1.08792957e-01, 5.33510773e-02, 1.03114883e-02,
       1.64780994e-01, 6.02803867e-05, 1.96041560e-03, 2.85647464e-03])
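The raw array is easier to read once paired with attribute names. A quick sketch, assuming the derived attributes keep the order produced by CombinedAttributesAdder and the one-hot columns follow LabelBinarizer's alphabetically sorted classes:

# Pair each importance with its attribute name (order assumed to match the prepared feature matrix)
extra_attribs = ["rooms_per_household", "population_per_household", "bedrooms_per_room"]
cat_one_hot_attribs = ["<1H OCEAN", "INLAND", "ISLAND", "NEAR BAY", "NEAR OCEAN"]
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)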

Evaluate Our System on the Test Set

In [78]:
final_model = grid_search.best_estimator_

X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)

final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
In [79]:
final_rmse  # final RMSE on the test set
Out[79]:
47766.00396643308
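A point estimate alone does not say how precise it is. As a rough sketch (assuming the mean squared error is approximately normally distributed), a 95% confidence interval for the test RMSE can be computed with scipy:

# Approximate 95% confidence interval for the generalization RMSE
from scipy import stats

confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                         loc=squared_errors.mean(),
                         scale=stats.sem(squared_errors)))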