(In English) Machine learning case study: Seattle house prices (Part 1)

From the column: Data Mining for Beginners

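This article is my machine learning course paper, and the data is the classic Seattle house price dataset. The case study covers the whole process of data pre-processing and preliminary analysis, feature engineering, modelling and analysis. The part I am personally most satisfied with is the visualisation in the preliminary analysis; all of the code is available through the link at the end of the article. Because of length limits, this part covers data processing and preliminary analysis, and feature engineering; the remaining modelling work is in Part 2.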

Hedonic modelling of house sale prices in King County, WA

Executive summary

Home valuation is evaluated in the literature through hedonic price modelling. Although advanced statistical techniques have been applied to hedonic price models in various settings, there remain gaps in the existing literature on the role of spatial variables and time measures in prediction. Researchers have analysed these factors separately from the hedonic model, and in most of the literature advanced techniques have been employed in the context of classification rather than prediction. Therefore, this report introduces both spatial and temporal variables and state-of-the-art techniques into prediction models. While previous papers have demonstrated the performance of a limited range of techniques, this report compares a broad range of techniques so as to identify the best-performing prediction model in the context of hedonic property pricing.

Application of hedonic price models to home valuation data in King County, WA found that the top-tier methods are random forest and gradient boosting, whose predictions deviated from the actual prices by only 16.6% and 13.5% respectively on average. The middle tier includes the SVM, MARS and k-nearest-neighbour techniques, all with a mean deviation rate of around 24%. Neural networks, best subset selection, elastic net regularisation and multiple linear regression finish the list, all with a high deviation of close to 29%. Whilst the analysis successfully incorporates the spatial components in the prediction model, the time range of the dataset limits the evaluation of seasonal fluctuations and other time-cyclic activities and their possible improvements to predictions.

Some of the predictive models also produce inferential results, which provide insight into how home characteristics behave in the King County market. Across learning methods, the amount of living space turns out to be the most important predictor of home valuation, followed by environmental factors. Renovating a property also positively affects sale price, although not significantly according to some methods. In agreement with intuition, luxury houses in King County tend to have waterfront locations and a large number of bedrooms and bathrooms. They were also built more recently, which corresponds to the absence of historical luxury properties in the area. Waterfront location naturally offers a premium to any house, estimated at 0.36% of the price ceteris paribus.

Table of contents

Introduction
Dataset and preliminary analysis
Data pre-processing
Methodology
Results and analysis

Introduction

Seattle is one of America's fastest growing cities; in 2016, home price growth in its metropolitan area was the 3rd-highest in the country. Such a strong property market should raise interest in property research. From a personal perspective, home-buyers benefit from knowing the expected price to pay for a house given its characteristics (e.g. number of bedrooms). Home-sellers, in addition to that, might want to know what can be done to fetch a higher selling price - perhaps adding an extra bathroom would raise the expected sale price over and above the cost of renovation. From a commercial perspective, developers gain from knowing which characteristics can maximise returns on a given plot of land - perhaps smaller basements are more desirable for waterfront developments.

Dataset and preliminary analysis

A. Data description

The dataset is a cross-section of home sale prices from May 2014 to May 2015 in King County, WA, USA, comprising 21,613 instances across 21 variables defined as follows:

● id: identity number for each house

● date, price: date and price at which the house was sold

● bedrooms, bathrooms, floors: number of bedrooms, bathrooms, and floors

● sqft_living, sqft_lot: living space and lot size of the house (in square feet)

● sqft_living15, sqft_lot15: average living space and lot size of the 15 nearest neighbours

● waterfront: binary variable indicating whether house is sited by the waterfront

● view: number of views (e.g. 0 if the house is entirely blocked by adjacent buildings)

● condition: condition of house (1=poor/ worn-out, 5=excellent)

● grade: quality of construction (1-3=below min. building standards, 13=excellent)

● sqft_above, sqft_basement: living space above and in the basement respectively

● yr_built, yr_renovated: year in which house was constructed and renovated (if at all)

● zipcode, lat, long: geo-data for each house (zipcode, latitude, longitude)

Although the dataset does not have any missing values, some of the variables have imbalanced distributions. For example, waterfront is very imbalanced; only 0.75% of all houses are by the waterfront.

Given these variables with the goal of building hedonic models, we pose the following questions:

● Which housing characteristics are the most important predictors of sale price?

● What are some key characteristics of luxury houses in King County?

● Does renovating a house increase its sale price, ceteris paribus?

● What is the premium attached to a waterfront site?

● Is there seasonality to sale prices?

B. Exploration of variables

Figure.1a Histogram of the reported house price. Figure.1b Histogram of the number of bedrooms per house. Figure.1c Histogram of the number of bathrooms per house. Figure.1d Histogram of the number of floors per house. All histograms include 21613 observations.

As Figure 1a shows, price is not normally distributed. It is skewed with a long right tail, as the majority of the houses were sold for $1,000,000 or less. To make sure that the non-normality does not affect our predictive accuracy on house prices, we will include techniques that do not require normality. Figure 1b exhibits one clear peak just above 3 bedrooms per house; almost all houses have 5 bedrooms or fewer. Figure 1c, on the other hand, shows multiple peaks, suggesting a multimodal distribution of the number of bathrooms. This reinforces the case for modelling with nonparametric techniques.
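One way to quantify a long right tail, and to see what a logarithmic transformation does to it, is the moment-based skewness coefficient. The sketch below is illustrative only: the prices are invented toy values rather than records from the dataset, and `skewness` is a hypothetical helper written with the Python standard library.

```python
import math
import statistics


def skewness(values):
    """Population skewness: mean cubed deviation divided by the cubed standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)


# Toy prices in USD, invented for illustration (NOT taken from the dataset).
toy_prices = [250_000, 300_000, 350_000, 400_000, 450_000,
              520_000, 600_000, 900_000, 2_500_000]

raw_skew = skewness(toy_prices)
log_skew = skewness([math.log(p) for p in toy_prices])

print(f"skewness of raw prices: {raw_skew:.2f}")
print(f"skewness of log prices: {log_skew:.2f}")
```

A positive coefficient indicates a right tail; on these toy values the log transform reduces the skewness without removing it entirely.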

Figure.2a Density plot of price by number of bedrooms. Figure.2b Density plot of price by the number of floors per house. Figure.2c Density plot of price by the number of views per house. Figure.2d Density plot of price by the condition of the house.

All plots in Figure 2 show prices after a logarithmic transformation, because it makes a skewed distribution easier to visualise. In fact, Figure 2b shows that the transformation brings the price distribution close to normal in some cases. In Figure 2a several distributions exhibit multimodal features, in particular the categories representing 0, 7, 8, 9 and 10 bedrooms. The other categories, on the other hand, follow a smooth, symmetric distribution, each with one distinct peak.

Splitting by view in Figure 2c also results in different kernel densities. Whilst the prices of houses with 3 or 4 views tail off more slowly on the high side than on the low side, the other distributions seem very symmetric. The density plot for prices of houses with condition 1 shows many irregularities. This is because the curve is drawn from only 30 observations. The plots for the other conditions in Figure 2d are smoother and have only one distinct peak. Although the curves for houses of condition 2 and 5 are not symmetric, those representing condition 3 and 4 come close. Moreover, the distributions for conditions 3, 4 and 5 exhibit only little variation relative to each other.

Figure.3a Count of houses that have/haven't been renovated. Figure.3b Boxplot of price by renovated status.

Figure 3a illustrates the imbalance of the predictor 'yr_renovated'. The vast majority of instances have the value 0, while a few have values around 2000. Renovation is likely to affect the condition of a house to some degree. In Figure 3b, the selling price differs considerably between houses with and without renovation: the median selling price of renovated houses is clearly higher than that of houses that have not been renovated. The Wilcoxon test indicates that the difference between the two renovation statuses is statistically significant. Hence, in data pre-processing we consider replacing yr_renovated with renovated_status.
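The report does not say which variant of the Wilcoxon test was used; for two independent groups the usual choice is the rank-sum (Mann-Whitney U) form. A minimal sketch of that test follows, assuming a normal approximation with no tie or continuity correction. The helper names are hypothetical and the prices (in $1000) are invented toy values, not the real data.

```python
import math
from statistics import NormalDist


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided rank-sum test; returns (U statistic of x, z score, p-value)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u1, z, p


# Toy sale prices in $1000, invented for illustration.
renovated = [520, 610, 700, 750, 820, 900, 1100, 980]
not_renovated = [300, 350, 380, 420, 450, 480, 500, 560, 640, 410]

u1, z, p = rank_sum_test(renovated, not_renovated)
print(f"U = {u1:.0f}, z = {z:.2f}, p = {p:.4f}")
```

Because the test works on ranks rather than raw values, it does not rely on prices being normally distributed, which suits the skewed price distribution described earlier.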

Figure.4 Correlation heatmap of features (variables are symmetrically distributed)

Figure 4 shows the correlation between all numeric predictors. id is dropped because it does not carry any information. yr_built and yr_renovated are discarded because they duplicate the information in age and renovated_status; date, zipcode, latitude and longitude are omitted because they capture information in a non-numeric way, so it is inappropriate to include them in a correlation matrix. Some variables are quite highly correlated with each other. Variables in blue square frames show a strong correlation: price is highly correlated with sqft_living, bathrooms and grade, which is consistent with the observations above. The most interesting observation in this matrix is that grade has a noticeable correlation of 0.77 with sqft_living15, which may imply one of two things: (1) houses in a community are very likely to be similar to one another, so the grade of a house is positively correlated with the living space of its 15 nearest neighbours, or (2) when a house is evaluated and graded by a rating agency, the neighbourhood affects the grade, so houses whose 15 neighbours have large living spaces receive considerably higher grades.
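The entries of a correlation heatmap are unit-free, which is what makes values such as 0.77 comparable across pairs of variables; a covariance, by contrast, changes with the measurement units. The sketch below illustrates this with the Pearson coefficient on invented toy values (not dataset records); `covariance` and `pearson` are hypothetical helpers.

```python
import statistics


def covariance(x, y):
    """Sample covariance with the n - 1 denominator."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def pearson(x, y):
    """Pearson correlation: covariance scaled by both sample standard deviations."""
    return covariance(x, y) / (statistics.stdev(x) * statistics.stdev(y))


# Toy values, invented for illustration.
grade = [6, 7, 7, 8, 9, 10]
sqft_living15 = [1200, 1500, 1400, 2100, 2600, 3100]
sqm_living15 = [v * 0.092903 for v in sqft_living15]  # same areas in square metres

print("covariance (sqft):", round(covariance(grade, sqft_living15), 2))
print("covariance (sqm): ", round(covariance(grade, sqm_living15), 2))
print("correlation (sqft):", round(pearson(grade, sqft_living15), 4))
print("correlation (sqm): ", round(pearson(grade, sqm_living15), 4))
```

Changing the unit rescales the covariance but leaves the correlation unchanged.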

As noted above, the distribution of house prices is heavy-tailed, so we are curious about the tail behaviour. For the purpose of analysis, we define 'luxury houses' as the top 5% most expensive houses in the dataset, in this case those with prices higher than $1.23M. The others are defined as normal houses.
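The split into luxury and normal houses amounts to cutting the price distribution at its 95th percentile. A minimal sketch of that step, using evenly spaced toy prices rather than the real data, might look as follows; the exact threshold depends on the quantile interpolation method, which the report does not specify.

```python
import statistics

# Toy prices in USD, evenly spaced and invented for illustration only.
toy_prices = [200_000 + 25_000 * i for i in range(40)]

# quantiles(..., n=20) returns the 19 cut points at 5%, 10%, ..., 95%;
# the last one is the 95th percentile.
threshold = statistics.quantiles(toy_prices, n=20)[-1]

luxury = [p for p in toy_prices if p > threshold]
normal = [p for p in toy_prices if p <= threshold]

print(f"threshold: {threshold:,.0f}")
print(f"luxury share: {len(luxury) / len(toy_prices):.1%}")
```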

Figure.5a (left) Percentage of waterfront houses in the two groups. Figure.5b (right) Boxplot of the age distribution of luxury houses and normal houses.

Figure 5a shows the fraction of waterfront houses in the two groups: roughly 13% of luxury houses are located in waterfront areas, whereas a negligible fraction of normal houses have a waterfront site. In Figure 5b, the normal houses have a median age of 40 years, while luxury houses have a lower median age of 30 years. In addition, the variance is larger for the luxury houses: there is a noticeable number of luxury houses younger than 20 years, and all related features (square feet of living space, bedrooms, bathrooms, grade) are superior to those of normal houses (see Appendix-A Figure.1). This is reasonable because a considerable number of modern, high-end houses were built in the 1990s-2000s, stimulated by the healthy economy, low mortgage rates, adjustable rate mortgages and lax federal oversight of financial institutions (Chesly, 2009).

Figure.6a (left) Geographical distribution of normal houses. Figure.6b (right) Geographical distribution of luxury houses.

Compared with normal houses, luxury houses are located along coastlines and around inland lakes, which is why a considerably large fraction of luxury houses have waterfront sites. The densest concentration of luxury houses is in the eastern coastal areas around Union Bay. About 20% of luxury houses are in this area, followed by the environs of Lake Union and the west coastline. Small numbers of luxury houses are in the northwest coastal area and around Green Lake.

C. Analysis of spatial data

There are two sets of spatial variables in this dataset: zipcode and coordinates (latitude and longitude). Zipcodes are allocated to districts almost arbitrarily, and there is no general rule relating the numerical order of zipcodes to latitude and longitude (see Appendix-A Figure.2). Hence, the predictor zipcode should be treated as a categorical variable.
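Treating zipcode as categorical typically means encoding it as indicator (one-hot) columns, so that no model can read meaning into its numeric order. The report does not state which encoding was used; the following is a minimal sketch of one common option on a handful of toy values.

```python
# Toy zipcode values, for illustration only.
toy_zipcodes = ["98103", "98115", "98103", "98004"]

# One indicator column per distinct zipcode, in sorted order.
levels = sorted(set(toy_zipcodes))
one_hot = [[int(z == level) for level in levels] for z in toy_zipcodes]

print(levels)
for z, row in zip(toy_zipcodes, one_hot):
    print(z, row)
```

For linear models, one level is usually dropped as the reference category to avoid perfectly collinear columns.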

Figure.7a (left) Median price per sqft by coordinate grid. Figure.7b (right) Median price of sales (in $1000).

By geographical grid, the area with the most real estate transactions is the northwest region, where Greenwood and Ballard are located, followed by the southwestern coastal area and Columbia City (see Appendix-A Figure.3). Figure 7a shows the median price per square foot. The highest prices per square foot are found around inland lakes and bays, followed by the southwestern coastal area. Figure 7b shows the median price of houses in different areas; the houses around Union Bay (in the middle-east of Seattle) stand out with the highest prices per house. Broadly, the distribution of price per square foot across the grid is consistent with transaction volume: areas with higher transaction volume also have higher prices per square foot, which suggests that higher demand leads to higher prices, consistent with economic theory.
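A grid summary of this kind can be produced by snapping each sale to a latitude/longitude cell and taking the median within each cell. The sketch below assumes a cell size of 0.05 degrees, which is a hypothetical choice since the report does not give the grid resolution, and uses invented toy sales.

```python
import math
import statistics
from collections import defaultdict

CELL_DEG = 0.05  # hypothetical grid resolution in degrees


def grid_cell(lat, long, cell=CELL_DEG):
    """Map a coordinate to the integer index of the grid cell containing it."""
    return (math.floor(lat / cell), math.floor(long / cell))


# Toy sales as (lat, long, price in USD, sqft_living), invented for illustration.
toy_sales = [
    (47.61, -122.33, 600_000, 1500),
    (47.62, -122.34, 900_000, 2000),
    (47.63, -122.31, 500_000, 1000),
    (47.31, -122.21, 300_000, 1500),
    (47.32, -122.22, 250_000, 1000),
]

price_per_sqft = defaultdict(list)
for lat, long, price, sqft in toy_sales:
    price_per_sqft[grid_cell(lat, long)].append(price / sqft)

median_ppsf = {cell: statistics.median(v) for cell, v in price_per_sqft.items()}
sales_count = {cell: len(v) for cell, v in price_per_sqft.items()}

for cell in median_ppsf:
    print(cell, sales_count[cell], median_ppsf[cell])
```

Keeping the per-cell count alongside the median makes it possible to compare transaction volume with price per square foot cell by cell.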

From the spatial analysis we conclude that the geographical position of a house is an informative predictor of its price, so either zipcode or coordinates should be included in the subsequent prediction models to get better results.

Data pre-processing

1. Missing values and outliers

Though this dataset seems tidy at first glance and has no missing values, it is not immune to other issues. Eight outliers were identified after careful checking. Seven are houses whose 'bedrooms' and 'bathrooms' are both 0; the eighth has 33 bedrooms, which is implausible in the real world. Since the dataset was compiled from historic house transactions, recording errors are the likely cause of these implausible values. All eight instances were discarded.
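The two exclusion rules described above can be written as a single row filter. This is a sketch on invented toy rows: the function name is hypothetical, and the cutoff of 33 bedrooms simply mirrors the single record mentioned in the text rather than any general rule.

```python
# Toy rows, invented for illustration only.
toy_rows = [
    {"id": 1, "bedrooms": 3, "bathrooms": 2.25},
    {"id": 2, "bedrooms": 0, "bathrooms": 0.0},    # no bedrooms and no bathrooms
    {"id": 3, "bedrooms": 33, "bathrooms": 1.75},  # implausible bedroom count
    {"id": 4, "bedrooms": 0, "bathrooms": 1.0},    # kept: only one of the two is zero
    {"id": 5, "bedrooms": 4, "bathrooms": 3.0},
]


def is_suspect(row):
    """Flag rows matching either exclusion rule."""
    no_rooms = row["bedrooms"] == 0 and row["bathrooms"] == 0
    too_many_bedrooms = row["bedrooms"] >= 33
    return no_rooms or too_many_bedrooms


kept = [r for r in toy_rows if not is_suspect(r)]
dropped_ids = [r["id"] for r in toy_rows if is_suspect(r)]

print("dropped ids:", dropped_ids)
print("rows kept:", len(kept))
```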

2. Variable manipulation

Careful checking reveals strong (in fact exact) multicollinearity among three variables because, by definition, sqft_living = sqft_above + sqft_basement. In principle, any one of these three variables could be deleted, since the other two are enough to capture all the information. For the sake of interpretation, 'sqft_above' was dropped for two reasons: (1) in real estate, 'sqft_living' is an intuitive and natural attribute for both buyers and sellers; (2) 'sqft_basement' captures whether the house is built with a basement, which is meaningful for developers.
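The identity can be verified row by row before a column is dropped, and the dropped column remains recoverable from the other two. A small sketch on invented toy rows:

```python
# Toy rows, invented for illustration only.
toy_rows = [
    {"sqft_living": 1180, "sqft_above": 1180, "sqft_basement": 0},
    {"sqft_living": 2570, "sqft_above": 2170, "sqft_basement": 400},
    {"sqft_living": 1960, "sqft_above": 1050, "sqft_basement": 910},
]

# Check that sqft_living = sqft_above + sqft_basement holds for every row.
identity_holds = all(
    r["sqft_living"] == r["sqft_above"] + r["sqft_basement"] for r in toy_rows
)

# Drop sqft_above; it can still be reconstructed from the remaining two columns.
reduced_rows = [{k: v for k, v in r.items() if k != "sqft_above"} for r in toy_rows]
recovered_above = [r["sqft_living"] - r["sqft_basement"] for r in reduced_rows]

print("identity holds:", identity_holds)
print("recovered sqft_above:", recovered_above)
```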

In data exploration, 'yr_renovated' was found to be highly imbalanced: more than 95% of samples have a 'yr_renovated' value of 0, and the other houses were renovated in the 1990s and 2000s. Hence another variable, 'renovated_status', is added to represent whether a house has been renovated. 'yr_built' and 'date' are time variables whose absolute values are not very indicative in themselves. In house sales, the age of a house is much more informative than the year in which it was built. Thus a new variable, 'age', is added, representing how long ago a house was built. In addition, lat and long are dropped for two reasons: (1) the spatial information is already captured by zipcode; (2) they are highly specific to each house, with a very large number of unique values (e.g. 5,034 unique lat values among the 21,613 instances, roughly 23%), so including them would risk overfitting the prediction models. The final 15 predictors used in the prediction models are age, bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterfront, view, condition, grade, sqft_basement, renovated_status, zipcode, sqft_living15 and sqft_lot15.
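The variable manipulation in this section can be summarised as one transformation per row. The sketch below rests on two assumptions that the report does not state: age is measured at the sale date (sale year minus yr_built), and the sale date is available as a Python `date`. The function name and the toy row are hypothetical.

```python
from datetime import date

# The final 15 predictors listed in the text.
PREDICTORS = [
    "age", "bedrooms", "bathrooms", "sqft_living", "sqft_lot", "floors",
    "waterfront", "view", "condition", "grade", "sqft_basement",
    "renovated_status", "zipcode", "sqft_living15", "sqft_lot15",
]

# Original columns removed during pre-processing.
DROPPED = ("id", "date", "yr_built", "yr_renovated", "sqft_above", "lat", "long")


def engineer(row):
    """Derive age and renovated_status, then remove the dropped columns."""
    out = {k: v for k, v in row.items() if k not in DROPPED}
    out["age"] = row["date"].year - row["yr_built"]  # assumption: age at sale
    out["renovated_status"] = int(row["yr_renovated"] > 0)
    return out


# Toy row with the 21 original variables, invented for illustration only.
toy_row = {
    "id": 1, "date": date(2014, 10, 13), "price": 450_000,
    "bedrooms": 3, "bathrooms": 2.0, "sqft_living": 1800, "sqft_lot": 5000,
    "floors": 1.0, "waterfront": 0, "view": 0, "condition": 3, "grade": 7,
    "sqft_above": 1300, "sqft_basement": 500, "yr_built": 1964,
    "yr_renovated": 1995, "zipcode": "98103", "lat": 47.69, "long": -122.34,
    "sqft_living15": 1700, "sqft_lot15": 5100,
}

features = engineer(toy_row)
print(sorted(features))
```

The result holds the 15 predictors plus the target, price.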