
Economics, Finance, and Business

G6.4

Valuations of residential properties using a neural network

Gary Grudnitski

Abstract With the advent of large computerized databases, computational techniques are being relied on more frequently to estimate residential property values. As an alternative to the most commonly used computational technique of multiple regression, this application describes how a neural network was applied to estimate the selling price of single-family residential properties in one area of a large California city. For the holdout sample of 100 properties, the average absolute difference between the actual selling price and the estimated selling price generated by the neural network was 9.48%. In terms of comparative accuracy, the network was able to achieve, on average, more accurate valuations of properties than the multiple regression model in the holdout sample. The network also produced more accurate valuations than the multiple regression model for 57 out of the 100 residential properties in the holdout sample.

G6.4.1

Design process

Accurate, economical and justifiable valuation of residential property is of great importance to mortgage holders who wish to value their portfolios, to prospective lenders who are contemplating the issuance of new mortgages, and to local government authorities who must know the worth of their tax base. As large computerized databases become increasingly common, computational techniques, especially multiple regression, are being relied on more frequently to assess residential property values.

Residential property, like many other commodities, can be viewed as a bundle of attributes. A problem in valuing residential property exists, however, because the prices of a property's individual components are both unobservable and devoid of an implicit market. Empirically, the choice of pricing equations that value a property's individual components often appears to be dictated by the nature of the available data and the tendency of those providing the estimates to fixate on goodness-of-fit criteria. On one hand, this is understandable because pricing equations for residential property represent, in reduced form, an interaction between both supply and demand, and thus make the specification of an exact functional form difficult. On the other hand, however, housing price estimates that critically depend on the functional form chosen can be negatively impacted by this imprecision in the specification of pricing equations.

In an attempt to mitigate the negative effects on estimates of property values due to imprecision in the specification of the valuation equation, what follows is a description of how a standard backpropagation neural network (Rumelhart and McClelland 1986) is applied to estimate the selling price of single-family residences. To measure the relative performance of the network, the prices it produces are compared to estimates generated by a multiple regression model.
© 1997 IOP Publishing Ltd and Oxford University Press, Handbook of Neural Computation, release 97/1

C1.2


Table G6.4.1. An example of data downloaded from the MLS describing a sales transaction.

PT1 SINGLE FAMILY-DETACHED 08/28/93 03:49 PM
LP: 189,000 STATUS: SOLD MT: 82 LD: 01/12/93 XD: 06/12/93
REF# 69 SP: 186,000 OLP: 189,000 FIN: OMD: 05/27/93 LNO: 93 6000909
AD: 13131 OLD WEST DR ZIP: 92129 APN: 3151703700
MC: 30F3 XST: TED WMS PRWY COM: RP NCD: CRM
YB: 1987 ZN: NONE BR: 3 OPBR: BATH: 2.5 ESF: 1638 SSF: ASSES TRM:
LR : 11X17 FP : F PTO: SLAB HOF: 101 TLB: 0
DR : 10X10 TV : C EXT: STUCO HFP: MONTHLY TD1: 0
FAM: 12X14 R/O: R/O ELEC RF : CNSHK HFI: GTC IN1: 0.0 AS1:
KIT: 11X10 DW : DISHWASH SWR: SEWER OF : 0 LT1:
MBR: 13X20 MW : MICRO BI SPA: NONE OFP: NONE KNO TD2: 0
BR2: 11X10 TC : IRR: SPRINKLE TOF: NONE KNO IN2: 0.0 AS2:
BR3: 11X11 HT : FAG FLR: SLAB LDY: GAR LT2:
BR4: 0 WH : ALU: NONE KNO LSZ: 8500 AST: NONE KNOWN
BR5: 0 SEC: EQPT OWN GUEST: NONE ACS: 0.00 BF : NONE KNOWN
XRM: 0 VU : NK AGEREST: NONE LSF: 0
EQP: D,E,F,G,K STY: 2 STO PL: YES CL: CFA PKG: 2G
REMARKS: THIS PLAN 3 CAMBRIDGE HAS IT ALL! MINT CONDITION WITH NEW BERBER CARPET NEW WINDOW TREATMENTS, NEW FLOORING IN BATHROOMS SEC SYS, 2 PATIOS, PATIO COVER, BUILT-IN GAS BRICK BBQ, SOFT WTR SYS, REFINISHED KITCHEN CABINETS, LANDSCAPED WITH AUTO SPRINKLERS. SHOWS TERRIFIC! GATE CODE * 0289

G6.4.1.1 Description

Source data representing the sale of a residential property were obtained from the San Diego Board of Realtors' multiple listing service (MLS). For this application, data on single-family homes sold during 1992–93 in Rancho Penasquitos, a northern suburb of San Diego County, California, were electronically downloaded. A typical entry for one of these properties is shown in table G6.4.1. From the downloaded MLS residential sales data, a parser, written in C, extracted the following nine descriptors for each property (these descriptors are shown in bold in table G6.4.1): SP is the actual selling price; YB is the age of the structure in years, derived by subtracting the year the house was built from 1992; BR is the number of bedrooms; BATH is the number of bathrooms in increments of 1/4 baths; ESF is the estimated total square footage of the house; LSZ is the lot size measured in square feet; STY is the number of stories; PL/SPA is 1 if a pool or spa existed (0 otherwise); and PKG is the number of car-garages. For the sample, descriptive statistics for the continuous variables are presented in table G6.4.2. In addition, for the PL/SPA variable, 31% of the houses in the sample had either a pool or spa.

Data from the parser were then passed to an Excel spreadsheet. Using the spreadsheet, each of the values of the variables was normalized according to equation (G6.4.1) and output to the neural network software.

$$i_{\mathrm{norm}} = \frac{i - \mathrm{min}}{\mathrm{range}} \qquad \text{(G6.4.1)}$$

where $i_{\mathrm{norm}}$ is the vector of normalized values of the variable, $i$ is the vector of original values of the variable, min is the minimum original value of the variable, and range is the range of the original values of the variable (maximum minus minimum).

G6.4.1.2 Topology

The topology of the network to estimate the selling price of a house is depicted in figure G6.4.1. This standard backpropagation network consisted of an input layer of eight neurons, a hidden layer of N neurons, and an output layer of a single neuron. The eight neurons in the input layer of the network captured the attributes believed to determine a property's value. The single neuron in the output layer represented the network's determination of the selling price of a house. Values estimated by the network fell within a range of 0 to 1 to achieve comparability to the actual selling prices of these houses, which had also been transformed according to equation (G6.4.1).
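The normalization of equation (G6.4.1), and the inverse mapping needed to turn a network output in the range 0 to 1 back into a dollar figure, can be sketched as follows. This is a minimal illustration rather than the spreadsheet procedure used in the study; the function names `normalize` and `denormalize` are hypothetical, and the three prices are simply the sample minimum, mean, and maximum from table G6.4.2.

```python
from __future__ import annotations

from typing import Sequence


def normalize(values: Sequence[float]) -> list[float]:
    """Min-max scale one variable to [0, 1], as in equation (G6.4.1)."""
    lo, hi = min(values), max(values)
    span = hi - lo  # the "range" term of the equation
    if span == 0:
        raise ValueError("range is zero; the variable is constant")
    return [(v - lo) / span for v in values]


def denormalize(scaled: float, lo: float, hi: float) -> float:
    """Invert equation (G6.4.1): map a value in [0, 1] back to original units."""
    return scaled * (hi - lo) + lo


# Selling prices: sample minimum, mean, and maximum from table G6.4.2.
prices = [150_000, 214_112, 365_000]
scaled_prices = normalize(prices)

# A network output of 0.5 corresponds to the midpoint of the price range.
midpoint_price = denormalize(0.5, min(prices), max(prices))
```

Because the minimum and range are taken from the data being scaled, the same two constants have to be retained to convert estimates back to prices.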

Table G6.4.2. Descriptive statistics of the continuous variables in the sample.

Variable      Variable definition      Overall mean        Minimum
abbreviation                           (std. deviation)    (maximum)
SP            Selling price ($)        214 112 (32 997)    150 000 (365 000)
YB            Age (yrs)                9.19 (5.32)         0 (22)
BR            Number of bedrooms       3.72 (0.67)         2 (6)
BATH          Number of bathrooms      2.51 (0.41)         2 (4)
ESF           Total square footage     1991 (413)          1100 (3009)
LSZ           Lot size (sq. ft.)       8246 (4464)         3746 (44 866)
STY           Number of stories        1.75 (0.43)         1 (2)
PKG           Number of car-garages    2.16 (0.37)         1 (3)

Figure G6.4.1. Topology of the neural network.
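As a rough illustration of this topology, the sketch below computes the output of a network with eight inputs, N hidden neurons, and one output neuron. It assumes logistic units in both layers (which keeps the output between 0 and 1) and a bias term per neuron; the bias terms, the function names, and the small random weights are assumptions made for the example, not details taken from the study.

```python
import math
import random


def logistic(x: float) -> float:
    """Logistic (sigmoid) activation, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One forward pass through an 8-N-1 network.

    w_hidden holds one row of input weights per hidden neuron.
    """
    hidden = [
        logistic(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return logistic(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


N_INPUTS, N_HIDDEN = 8, 2  # eight property attributes, two hidden neurons
rng = random.Random(0)     # fixed seed so the example is repeatable

# Illustrative weights only; a trained network would supply these.
w_hidden = [[rng.uniform(-1, 1) for _ in range(N_INPUTS)] for _ in range(N_HIDDEN)]
b_hidden = [rng.uniform(-1, 1) for _ in range(N_HIDDEN)]
w_out = [rng.uniform(-1, 1) for _ in range(N_HIDDEN)]
b_out = rng.uniform(-1, 1)

attributes = [0.5] * N_INPUTS  # one property, already normalized to [0, 1]
estimate = forward(attributes, w_hidden, b_hidden, w_out, b_out)
```

The value of `estimate` is a normalized selling price and would still need to be mapped back to dollars using the minimum and range of the original prices.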

G6.4.2

Training methods

The data set was randomly divided into three subsets. The first subset of the data, made up of 119 properties, was used to train the network. The second subset of the data, called the training-test set, consisted of 30 properties. It was used to check the ability of the supposedly trained neural network to generalize (i.e. to prevent overtraining), and to select the optimal number of hidden-layer neurons (Masters 1993, p 183). The third subset of the data consisted of 100 properties, and was used to assess the ability of the network to estimate property values accurately.

The neural network software was written in C for a personal computer and is available as shareware from Roy W Dobbins (Eberhart and Dobbins 1990). The network was run on a 33 MHz 486DX. With random starting weights between ±5.0, and a learning coefficient and momentum factor of 0.1 and 0.6, respectively, networks employing a logistic activation function and having from two to four neurons in their hidden layer were trained.

Figure G6.4.2 graphs the average absolute error (i.e. |estimated selling price − actual selling price|/actual selling price) of the training set against the average absolute error of the training-test set for two to four hidden-layer neurons at 2000, 4000, 6000, 8000 and 10 000 training iterations. Figure G6.4.2 indicates, for this training and training-test sample, the superiority of a network with two neurons in its hidden layer. Specifically, contrast the plot of the error of the network with two neurons in its hidden layer to the plot of the error of the network with three neurons in its hidden layer. While the error of the network with two neurons in its hidden layer moves consistently down and to the left as the number of iterations increases from 2000 to 10 000, the plot of the training-test error for the network with three neurons in its hidden layer initially declines from 0.0844 at 2000 iterations to 0.0831 at 4000 iterations, but then begins to rise fairly uniformly to 0.0848 at 10 000 iterations.

B3.5

Figure G6.4.2. Average absolute error for the training set and training-test set when the number of hidden-layer neurons is varied from 2 to 4.
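The training procedure described above can be sketched in a few dozen lines. This is not the Eberhart and Dobbins software; it is a minimal pattern-by-pattern backpropagation loop with momentum that uses the stated settings (starting weights within ±5.0, learning coefficient 0.1, momentum 0.6, logistic activation) and records training and training-test errors at checkpoints. Several things are assumptions made for the example: bias weights, one "iteration" meaning one pass over the training set, the synthetic data produced by `make_data`, and checkpoints of 20, 40, and 60 passes instead of 2000 to 10 000 so that the example runs quickly.

```python
import math
import random


def logistic(x):
    x = max(-60.0, min(60.0, x))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-x))


class Net:
    """Single-hidden-layer network; each weight row ends with a bias weight."""

    def __init__(self, n_in, n_hidden, rng, init=5.0):
        self.w1 = [[rng.uniform(-init, init) for _ in range(n_in + 1)]
                   for _ in range(n_hidden)]
        self.w2 = [rng.uniform(-init, init) for _ in range(n_hidden + 1)]
        self.v1 = [[0.0] * (n_in + 1) for _ in range(n_hidden)]  # last weight changes
        self.v2 = [0.0] * (n_hidden + 1)

    def forward(self, x):
        xb = list(x) + [1.0]
        hb = [logistic(sum(w * v for w, v in zip(row, xb))) for row in self.w1] + [1.0]
        return xb, hb, logistic(sum(w * v for w, v in zip(self.w2, hb)))

    def update(self, x, target, lr=0.1, momentum=0.6):
        xb, hb, y = self.forward(x)
        delta_out = (target - y) * y * (1.0 - y)
        # Hidden deltas are computed from the output weights before they change.
        delta_hid = [delta_out * self.w2[j] * hb[j] * (1.0 - hb[j])
                     for j in range(len(self.w1))]
        for j, hj in enumerate(hb):
            self.v2[j] = lr * delta_out * hj + momentum * self.v2[j]
            self.w2[j] += self.v2[j]
        for j, row in enumerate(self.w1):
            for i, xi in enumerate(xb):
                self.v1[j][i] = lr * delta_hid[j] * xi + momentum * self.v1[j][i]
                row[i] += self.v1[j][i]


def avg_abs_error(net, data, lo, hi):
    """Mean of |estimated - actual| / actual, measured in original price units."""
    total = 0.0
    for x, t in data:
        est = net.forward(x)[2] * (hi - lo) + lo
        actual = t * (hi - lo) + lo
        total += abs(est - actual) / actual
    return total / len(data)


def train_with_checkpoints(net, train, check, checkpoints, lo, hi):
    """Train and record (pass, training error, training-test error) rows."""
    history = []
    for epoch in range(1, max(checkpoints) + 1):
        for x, t in train:
            net.update(x, t)
        if epoch in checkpoints:
            history.append((epoch, avg_abs_error(net, train, lo, hi),
                            avg_abs_error(net, check, lo, hi)))
    return history


def make_data(n, rng):
    """Synthetic stand-in for normalized MLS records (not the study's data)."""
    rows = []
    for _ in range(n):
        x = [rng.random() for _ in range(8)]
        rows.append((x, 0.2 + 0.6 * sum(x) / 8))
    return rows


rng = random.Random(1)
train, check = make_data(119, rng), make_data(30, rng)
results = {}
for n_hidden in (2, 3, 4):
    results[n_hidden] = train_with_checkpoints(
        Net(8, n_hidden, rng), train, check, (20, 40, 60), 150_000, 365_000)

# Choose the (hidden size, pass count) with the lowest training-test error.
best = min(((h, *row) for h, rows in results.items() for row in rows),
           key=lambda r: r[3])
```

Selecting on the training-test error rather than the training error is what guards against overtraining: a configuration whose training error keeps falling while its training-test error rises, like the three-neuron network in figure G6.4.2, is not chosen.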

G6.4.3

Output interpretation

In terms of overall estimation of the selling price of the 100 properties in the test sample, the trained network with two neurons in its hidden layer resulted in an average absolute error of 9.48%. The smallest and largest individual absolute errors in estimating the selling price of the test sample residential properties were 0.3% and 38.7%, respectively. Figure G6.4.3 graphs the absolute error of the network's prediction, ordered by the size of the absolute error, for the test sample of 100 properties. It shows that 28% of the determinations were in error by less than 5%, 65% of the determinations were in error by less than 10%, and 12% of the determinations of the network were in error by more than 20%.

Figure G6.4.3. Absolute error for the 100 test-set properties.
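The summary figures reported above (average, smallest, and largest absolute error, and the share of estimates within 5%, within 10%, and beyond 20%) can be computed as in the sketch below. The function names and the four price pairs are hypothetical and serve only to show the arithmetic; they are not drawn from the test sample.

```python
def absolute_errors(actual, estimated):
    """Per-property |estimated - actual| / actual."""
    return [abs(e - a) / a for a, e in zip(actual, estimated)]


def summarize(errors):
    """Average, extremes, and the share of errors in each band."""
    n = len(errors)
    return {
        "mean": sum(errors) / n,
        "min": min(errors),
        "max": max(errors),
        "under_5pct": sum(e < 0.05 for e in errors) / n,
        "under_10pct": sum(e < 0.10 for e in errors) / n,
        "over_20pct": sum(e > 0.20 for e in errors) / n,
    }


# Hypothetical actual and estimated selling prices for four properties.
actual = [200_000, 250_000, 180_000, 300_000]
estimated = [206_000, 230_000, 225_000, 297_000]

errors = absolute_errors(actual, estimated)  # 3%, 8%, 25%, 1%
summary = summarize(errors)

# Sorting the errors gives the ordering plotted in a chart like figure G6.4.3.
ordered = sorted(errors)
```

Note that the "under 10%" band includes the "under 5%" band, which matches how the percentages are reported in the text.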


G6.4.4

Comparison with multiple regression

A linear multiple regression model was derived based on the 119 properties in the training sample. The regression coefficients and their corresponding t values are given in table G6.4.3.

Table G6.4.3. Statistics for the multiple regression model.

Variable      Variable definition           Coefficient          t value
abbreviation                                (std. error)         (Prob > |t|)
Intercept                                   132 422 (125 042)    11.00 (0.0001)
YB            Age (yrs)                     1769 (208)           8.51 (0.0001)
BR            Number of bedrooms            2878 (1964)          1.47 (0.1458)
BATH          Number of bathrooms           1789 (5227)          0.34 (0.7328)
ESF           Total square footage          36 (4)               8.95 (0.0001)
LSZ           Lot size (sq. ft.)            0.73 (0.29)          2.56 (0.0118)
STY           Number of stories             1337 (3349)          0.40 (0.6909)
PL/SPA        Existence of a pool or spa    5502.23 (3788.47)    1.45 (0.1505)
PKG           Number of car-garages         3281 (4238)          0.77 (0.4405)
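For readers who want to see how coefficients, standard errors, and t values of the kind listed in table G6.4.3 are obtained, the sketch below fits an ordinary least-squares model through the normal equations using only the Python standard library. The study does not say which software fitted its regression, so this is an illustration, not the procedure used. The eight records with two predictors (square footage and age) are hypothetical, as are the function names `invert` and `ols`.

```python
import math


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def ols(rows, y):
    """Fit y = b0 + b1*x1 + ...; return coefficients, std errors, t values."""
    X = [[1.0] + list(r) for r in rows]  # leading 1.0 is the intercept column
    n, p = len(X), len(X[0])
    xtx = [[sum(X[k][i] * X[k][j] for k in range(n)) for j in range(p)]
           for i in range(p)]
    xty = [sum(X[k][i] * y[k] for k in range(n)) for i in range(p)]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(p)) for i in range(p)]
    resid = [y[k] - sum(b * v for b, v in zip(beta, X[k])) for k in range(n)]
    s2 = sum(e * e for e in resid) / (n - p)  # residual variance
    se = [math.sqrt(s2 * inv[i][i]) for i in range(p)]
    t = [b / s if s > 0 else float("inf") for b, s in zip(beta, se)]
    return beta, se, t


# Hypothetical records: (total square footage, age in years) and selling price.
houses = [(1100, 5), (1500, 20), (1800, 2), (2000, 15),
          (2400, 9), (3000, 12), (1300, 0), (2700, 22)]
prices = [149_500, 143_500, 188_000, 175_000,
          207_000, 231_000, 166_500, 202_000]

coef, std_err, t_values = ols(houses, prices)


def predict(house):
    """Estimated price for one house from the fitted coefficients."""
    return coef[0] + sum(b * v for b, v in zip(coef[1:], house))
```

Each t value is the coefficient divided by its standard error, which is the relationship between the two numeric columns of the table.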

In terms of statistical performance, the multiple regression model had an adjusted R-squared of 0.689 and an F value of 33.7. In terms of estimation performance, the multiple regression model resulted in an average absolute error of 11.6% in estimating the selling price of the test sample properties. Thirty-six per cent of the determinations of the multiple regression model were in error by less than 5%, 54% of the determinations were in error by less than 10%, and 9% of the determinations were in error by more than 20%. Further, for 57 out of 100 test sample properties, the absolute error of the multiple regression model exceeded that of the network.

G6.4.5

Conclusion

While for this sample of residential properties the network produced more accurate overall estimates of selling prices than the multiple regression model, the network's average absolute error was still relatively high and some of its errors were unacceptably large. These weaknesses are likely to be attributable to two sources. First and most importantly, a number of potentially significant variables have been omitted from the pricing equation. These include view characteristics of the property such as canyon, mountain, and ocean; specific neighborhood location parameters, such as those that might be obtained by reference to the Thomas Guide 0.25 square-mile grid identifier; and other physical attributes of a house such as the existence of air conditioning, the type of roof, and the presence of a security system. A second factor that contributed to the size of the network error was the source data. The source data describing a property were supplied by the listing agent and are subject to buyer verification. Although these agents attempt to describe the property as completely as possible, frequently the data were incomplete or erroneous.

References
Eberhart R C and Dobbins R W (eds) 1990 Neural Network PC Tools (San Diego, CA: Academic)
Masters T 1993 Practical Neural Network Recipes in C++ (San Diego, CA: Academic)
Rumelhart D E and McClelland J L 1986 Parallel Distributed Processing vol 1 (Cambridge, MA: MIT Press)
