Project 1: Predict Real Estate Prices in Beijing

August 20, 2023

1. Overview

In this project, we will build a linear regression model to help people better understand the factors that can affect the price of a house in Beijing.

The main sections of this article are: (1) Exploratory Data Analysis (EDA), (2) Developing Linear Regression model for predicting real estate price, and (3) Use that model to make predictions.

The Python Notebook containing the complete model development process and the data used in this project can be found at Google Drive.

The model is also deployed on Microsoft Azure Machine Learning Studio, and the documentation could be found here.

2. Exploratory Data Analysis (EDA)

The dataset Google Drive of real estate price in Beijing is from Python Data Science: Technical Details and Business Practices (link). Since the computation was performed on Google Colab platform, we will first mount the Google Drive and load some useful packages.

from google.colab import drive
drive.mount('/content/drive')

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import statsmodels.api as sm
# lower case ols means with intercept, upper case OLS means without intercept
from statsmodels.formula.api import ols 
import math
import matplotlib
from numpy import corrcoef,array
from IPython.display import HTML, display
pd.set_option('display.max_columns', None)

By copying the dataset from Google Drive to the working directory and load it (see command and results below), we can see that the dataset is composed of 16210 rows and 8 columns.

District – which district is the house located
Bedroom – number of bedrooms
Livingroom – number of living rooms
Area – area of house in sqft
Floor – location of house: low floor, mid floor, or high floor
Subway – whether the house is close to a subway station
School – whether the house is in a premium school district
Price – unit price in USD/sqft

These first several columns above represent the independent variables in this dataset, and the last column (i.e., price) is the dependent variable. In the following subsections, we will examine each of them closely.

!cp /content/drive/MyDrive/210-Projects/810_Linear_Regression/sndHsPr_conv.csv ./
sndHsPr = pd.read_csv(r'sndHsPr_conv.csv')
sndHsPr

	district	bedroom	livingroom	area	floor	subway	school	price
0	chaoyang	1	0	496	middle	1	0	623
1	chaoyang	1	1	636	middle	1	0	593
2	haidian	5	2	3003	high	1	1	913
3	haidian	3	2	2228	high	1	1	739
4	fengtai	2	1	574	low	1	1	908
…	…	…	…	…	…	…	…	…
16205	chaoyang	3	1	822	middle	1	0	668
16206	chaoyang	3	1	823	middle	1	0	667
16207	chaoyang	2	1	665	high	1	0	639
16208	chaoyang	1	1	462	high	1	0	672
16209	fengtai	2	1	968	middle	1	0	595

2.1. Price

As a first step, we will take a close look at the dependent variable, price, which is unit price in US dollar per square feet (USD/sqft).

fig = plt.figure(figsize=(10, 4))
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)

sns.distplot(sndHsPr.price, fit=stats.norm, bins=40, ax=ax1) # Histogram
ax1.set_xlabel('Unit price (USD/sqft)')
ax1.set_ylabel('Density')

sm.qqplot(sndHsPr.price, fit=True, line='45', ax=ax2)
plt.show()

From the Histogram plot and the Q-Q plot above, we notice that the distribution of price is quite different from normal distribution. To alleviate this issue and improve the Normality of the distribution, we will apply log transformation to price by typing sndHsPr['log_price'] = np.log(sndHsPr.price).

The distribution of log(price) is shown below. Furthermore, if we check the skewness of the distribution by using sndHsPr.price.skew() and sndHsPr.log_price.skew(), we will get 0.679 and -0.012. , So after log transformation, the distribution of price is indeed more close to normal distribution.

2.2. District

Districts of Beijing source

As shown in the picture above, the city of Beijing is composed of Central urban area, Inner suburb area, Outer suburb area, and Remote area. In this study, we will consider 2 districts in the Central urban area (i.e., Dongcheng, Xicheng), as well as 4 districts in the Inner suburb area (i.e., Chaoyang, Haidian, Fengtai, Shijingshan).

fig = plt.figure(figsize=(12, 4))
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)

sndHsPr.district.value_counts().plot(kind = 'pie',autopct='%.1f%%', ax=ax1)

dat1=sndHsPr[['district','price']]
dat1.district=dat1.district.astype("category")
dat1.district.cat.set_categories(["shijingshan","fengtai","chaoyang","haidian","dongcheng","xicheng"],inplace=True)
sns.boxplot(x='district',y='price',data=dat1, ax=ax2)
plt.ylabel("Unit price (USD/sqft)")
plt.xlabel("District")
plt.title("District vs. unit price")

The pie chart and the box plot are generated by the command above. From these plots, we can find that (1) each District is fairly represented in the dataset, there is no undersampling or oversampling issue; and (2) the unit price of real estate in each district is quite different, ranging from ~500 to ~1,100 USD/sqft. So the phrase “Location! Location! Location!” is indeed true.

2.3. Bedroom

Next we will examine the independent variable Bedroom, which means how many bedrooms does the house have. The command used to generate plots are very similar to the command shown above, so I will not show it here and you could find it in the Notebook. From the plots below, we can find that (1) more than 95% of houses are with 1, 2, or 3 bedrooms, it is very rare to find a house with 4 or 5 bedrooms; (2) Unit price of house is quite insensitive to the number of bedrooms.

2.4. Living room

The plots below visualize the independent variable Livingroom, which means how many living rooms does the house have. Similar to the plots above, we can see that (1) majority of houses have 1 to 2 living rooms, and that (2) number of living room has minor effect in determining the unit price of a house.

2.5. Floor

Independent variable Floor indicates that whether the house is in low floor, mid floor, or high floor. First of all, it is quite intuitive that each category takes about 1/3 of the dataset. In addition, we also notice that the effects of floor level in determining the unit price of a house is minimal.

2.6. Subway

Next we will examine the variable Subway, which indicates if the house is close to a subway station (“1” means True and “0” means False). From the plot below, we can see that more than 80% of houses are close to a subway station. This is in good consistency in our expectation, as in this study we only considered the Central urban area and the Inner suburb area. If the Outer suburb area and Remote area were also included in the dataset, I would expect this percentage to be lower.

Also from this plot can we see that Subway is a significant factor in determining the unit price of a house. This is understandable, because living near a subway station can save a lot of time in commuting, especially during rush hours.

2.7. School

School is an independent variable indicating whether the house is in a premium school district. From the visualization below, we can see that around 70% of houses are in premium school district, and that School plays a significant role in determining the house price.

2.8. Area

Finally, we will examine the independent variable Area, which is the area of house in square feet (sqft). Similar to the dependent variable Price, we also took log transformation to improve its Normality. This can be observed from the histogram and Q-Q plots below (first row: before log transformation, second row: after log transformation), as well as the scatter plots in the third row.

3. Developing linear regression model

This section is composed of 3 parts. First we will perform subsampling and statistical test for each variable. Then we will build a linear regression model and conduct Regression Diagnostics. Finally we will use the newly built model to make predictions.

3.1. Statistical test

Since the dataset has 16210 samples and is too large for statistical test, we will conduct stratified sampling based on District first, and then we will check the p value for each dependent variable. The code block below shows the subroutine we will employ to conduct subsampling.

def get_sample(df, sampling="simple_random", k=1, stratified_col=None):
    """
    Get samples from input dataframe
    Inputs:
        - df: input dataframe, type: pandas.dataframe
        - sampling: sampling method, type: str
            Available options: ["simple_random", "stratified", "systematic"]
            meaning: Simple random sampling, Stratified sampling, systematic sampling
        - k: number of proportion of samples, type: int or float
            (int, must > 0; float, must in (0,1) range)
            if 0 < k < 1 , then k is the proportion of samples
            if k >= 1 , then k is the total number of samples；
            or number of samples in each subpopulation (stratum) for Stratified sampling
        - stratified_col: list of variables for Stratified sampling
    Outputs:
        sampling results, type: pandas.dataframe
    """
    import random
    import pandas as pd
    from functools import reduce
    import numpy as np
    import math

    len_df = len(df)
    if k <= 0:
        raise AssertionError("k can not be negative")
    elif k >= 1:
        assert isinstance(k, int), "k need to be a positive integer"
        sample_by_n = True
        if sampling == "stratified":
            alln = k * df.groupby(by=stratified_col)[stratified_col[0]].count().count()
            # alln=k*df[stratified_col].value_counts().count()
            if alln >= len_df:
                raise AssertionError("Value of k is too large")
    else:
        sample_by_n = False
        if sampling in ("simple_random", "systematic"):
            k = math.ceil(len_df * k)

    # print(k)
    if sampling == "simple_random":
        print("simple random sampling")
        idx = random.sample(range(len_df), k)
        res_df = df.iloc[idx, :].copy()
        return res_df
    elif sampling == "systematic":
        print("systematic sampling")
        step = len_df // k + 1  # step=len_df//k-1
        start = 0  # start=0
        idx = range(len_df)[start::step]  # idx=range(len_df+1)[start::step]
        res_df = df.iloc[idx, :].copy()
        # print("k=%d,step=%d,idx=%d"%(k,step,len(idx)))
        return res_df
    elif sampling == "stratified":
        assert stratified_col is not None, "please give list of variables for Stratified sampling"
        assert all(np.in1d(stratified_col, df.columns)), "please check list of variables for Stratified sampling"

        grouped = df.groupby(by=stratified_col)[stratified_col[0]].count()
        if sample_by_n == True:
            group_k = grouped.map(lambda x: k)
        else:
            group_k = grouped.map(lambda x: math.ceil(x * k))

        res_df = df.head(0)
        for df_idx in group_k.index:
            df1 = df
            if len(stratified_col) == 1:
                df1 = df1[df1[stratified_col[0]] == df_idx]
            else:
                for i in range(len(df_idx)):
                    df1 = df1[df1[stratified_col[i]] == df_idx[i]]
            idx = random.sample(range(len(df1)), group_k[df_idx])
            group_df = df1.iloc[idx, :].copy()
            res_df = res_df.append(group_df)
        return res_df
    else:
        raise AssertionError("sampling is illegal")

We use this code to conduct stratified sampling based on District (400 samples for each District, 2400 observations in total). Then we computed the p value of each variable, and the results are shown in the table below. From this result, we can see that variables such as Bedroom, Living room, and Floor have a high p value, so we will not include them in the model.

sndHsPr_400 = get_sample(sndHsPr, sampling="stratified", k=400, stratified_col=['district'])

print("p value of District: \t%.4f" % sm.stats.anova_lm(ols('price ~ C(district)', data=sndHsPr_400).fit())._values[0][4])
print("p value of Bedroom: \t%.4f" % sm.stats.anova_lm(ols('price ~ C(bedroom)', data=sndHsPr_400).fit())._values[0][4])
print("p value of Livingroom: \t%.4f" % sm.stats.anova_lm(ols('price ~ C(livingroom)', data=sndHsPr_400).fit())._values[0][4])
print("p value of Floor: \t%.4f" % sm.stats.anova_lm(ols('price ~ C(floor)', data=sndHsPr_400).fit())._values[0][4])
print("p value of Subway: \t%.4f" % sm.stats.anova_lm(ols('price ~ C(subway)', data=sndHsPr_400).fit())._values[0][4])
print("p value of School: \t%.4f" % sm.stats.anova_lm(ols('price ~ C(school)', data=sndHsPr_400).fit())._values[0][4])
print("p value of Area: \t%.4f" % sm.stats.anova_lm(ols('price ~ area', data=sndHsPr_400).fit())._values[0][4])

Variable	p value
District	0.0000
Bedroom	0.2601
Livingroom	0.3774
Floor	0.1719
Subway	0.0000
School	0.0000
Area	0.0047

3.2. Model development

With the help of statsmodels package, we can build and diagnose the linear regression model by using the command block below conveniently.

lm1 = ols("price ~ C(district)+C(subway)+C(school)+area", data=sndHsPr_400).fit()
lm1_summary = lm1.summary()
lm1_summary

sndHsPr_400['pred1']=lm1.predict(sndHsPr_400)
sndHsPr_400['resid1']=lm1.resid
sndHsPr_400.plot('pred1','resid1',kind='scatter')

From the results above, we can see that the adjusted r^2 = 0.598, which is quite decent for real world problems. On the other hand, we also observed heteroscedasticity in the scatter plot of the residual, which is in consistency with the distribution of dependent variable Price we examined in Section 2.1.

To address this critical issue, we will use log_price and log_area to replace price and area and refit the model. From the results below, we can see that most important of all, heteroscedasticity is eliminated. Also, adjusted r^2 is slighted improved from 0.598 to 0.608 as well.

lm2 = ols("log_price ~ C(district)+C(subway)+C(school)+log_area", data=sndHsPr_400).fit()
lm2_summary = lm2.summary()
lm2_summary

sndHsPr_400['pred2']=lm2.predict(sndHsPr_400)
sndHsPr_400['resid2']=lm2.resid
sndHsPr_400.plot('pred2','resid2',kind='scatter')

Finally, we would like to investigate the interaction effects of District and School. From the box plot below, we can see that in Shijingshan district, price of a house in premium school district is lower than those not in premium school district. This is counterintuitive, and it is also inconsistent with other districts. Further investigation found that samples in Shijingshan district are highly unbalanced and only 5 out of 400 of samples (1.25%) are in premium school district.

fig = plt.figure(figsize=(20, 10))
ax1 = fig.add_subplot(2,3,1)
ax2 = fig.add_subplot(2,3,2)
ax3 = fig.add_subplot(2,3,3)
ax4 = fig.add_subplot(2,3,4)
ax5 = fig.add_subplot(2,3,5)
ax6 = fig.add_subplot(2,3,6)

sns.boxplot(x='school', y='price', data = sndHsPr_400[sndHsPr_400.district=='chaoyang'], showmeans=True, ax=ax1).set_title("chaoyang")
sns.boxplot(x='school', y='price', data = sndHsPr_400[sndHsPr_400.district=='dongcheng'], showmeans=True, ax=ax2).set_title("dongcheng")
sns.boxplot(x='school', y='price', data = sndHsPr_400[sndHsPr_400.district=='fengtai'], showmeans=True, ax=ax3).set_title("fengtai")
sns.boxplot(x='school', y='price', data = sndHsPr_400[sndHsPr_400.district=='haidian'], showmeans=True, ax=ax4).set_title("haidian")
sns.boxplot(x='school', y='price', data = sndHsPr_400[sndHsPr_400.district=='shijingshan'], showmeans=True, ax=ax5).set_title("shijingshan")
sns.boxplot(x='school', y='price', data = sndHsPr_400[sndHsPr_400.district=='xicheng'], showmeans=True, ax=ax6).set_title("xicheng")

###Check unusual price trend in shijingshan district -- probably samlpe size is too small and not representative
print('shijingshan - premium school district: \t',sndHsPr_400[(sndHsPr_400['district']=='shijingshan')&(sndHsPr_400['school']==1)].shape[0])
print('shijingshan - Not premium school district: \t',sndHsPr_400[(sndHsPr_400['district']=='shijingshan')&(sndHsPr_400['school']==0)].shape[0])
print('only 5 out of 400 of houses (1.25%) in shijingshan are in premium school district')

In the code block below, we further improve the model by taking into consideration the interaction effects of District and School. There is minor improvement of adjusted r^2 from 0.608 to 0.609, and the heteroscedasticity is also eliminated in the scatter plot of residuals.

lm3 = ols("log_price ~ C(district)*C(subway)+C(school)+log_area", data=sndHsPr_400).fit()
lm3_summary = lm3.summary()
lm3_summary

sndHsPr_400['pred3']=lm3.predict(sndHsPr_400)
sndHsPr_400['resid3']=lm3.resid
sndHsPr_400.plot('pred3','resid3',kind='scatter')

3.3. Model prediction

Question: Assuming a family is interested in buying a 2 bedroom, 1 livingroom house in Dongcheng district, which is 700 sqft, in mid floor, close to subway station, and in premium school district. How much should they expect to pay?

# district        bedroom   livingroom     area    floor  subway  school
# dongcheng       2         1              700     1      1       1

X_test = [['dongcheng', 1, 1, np.log(700)]]
X_test = pd.DataFrame(X_test, columns = ['district', 'school', 'subway', 'log_area'])

unit_price_pred = np.exp(lm3.predict(X_test))
total_price_pred = unit_price_pred * 700
print("predicted unit price: ", round(unit_price_pred[0], 1), "USD/sqft")
print("predicted total price: ", round(total_price_pred[0], 1), "USD")

To answer this question, let’s predict the unit price of this house using our newly developed model. Our model shows that the unit price of such a house is $975.6 per sqft, and the total price they need to pay is $682,915.

4. Conclusion and future work

In this project, I have developed a linear regression model to predict the house price in Beijing, based on its District, whether it is close to subway station, whether it is in preminum school district, and its area. The model can explain about 61% of variations in price, which is quite decent given the limitation of linearity and the scattered nature of the dataset. If explainability is not required legally or ethically, researchers can also consider using artificial neural network or ensemble method such as Gradient-boosted decision trees. These models usually can give better results.