
Data Mining: Predicting Laptop Retail Price Using Regression

By: Britney Robinson and Joi Officer
Advisor: Dr. Fred Bowers

Abstract
Regression is a statistical technique used regularly in data mining. It can model a company's growth patterns and support predictions about the company's success based on the products it chooses to sell. In this project, regression was used to build models that predict the retail price of laptops sold in London electronics stores. The modeling used the XLMiner data mining software and a database containing 7,956 records. The research found that multiple linear regression was more effective than simple linear regression in predicting the retail price of a laptop.

Divide each equation by 2, then solve the resulting system of two equations for a and b.
$a = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}$

Results
The simple linear regression model is as follows: $\hat{y} = 0.299x_1 + 442.795$, with $r = 0.485$, where $x_1$ is HD size (GB).
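As a quick check, the simple model can be evaluated at the HD sizes that appear in the validation records. The sketch below is illustrative only; the function name `predict_simple` is an assumption and not part of the original analysis.

```python
def predict_simple(hd_size_gb: float) -> float:
    """Predicted retail price ($) from HD size (GB), using the reported simple model."""
    slope = 0.299        # reported coefficient of HD size
    intercept = 442.795  # reported constant term
    return slope * hd_size_gb + intercept


# Evaluate the model at the HD sizes listed in the validation table.
for hd in (40, 80, 120):
    print(hd, predict_simple(hd))
```

For 40, 80, and 120 GB this gives approximately 454.755, 466.715, and 478.675, which agree with the predicted prices reported in Table 3 (454.75, 466.71, 478.67) up to the last displayed digit.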

$b = \frac{\sum y_i - a\sum x_i}{n}$
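The closed-form least-squares solution for the slope a and intercept b can be sketched in a few lines. The function name `least_squares_fit` and the sample points are hypothetical; the formulas are the standard least-squares expressions for a line y = ax + b.

```python
def least_squares_fit(xs, ys):
    """Return (a, b) for the line y = a*x + b that minimizes the sum of squared errors."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    denom = n * sum_x2 - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    a = (n * sum_xy - sum_x * sum_y) / denom  # slope
    b = (sum_y - a * sum_x) / n               # y-intercept
    return a, b


# Hypothetical points lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
a, b = least_squares_fit(xs, ys)
print(a, b)
```

Because the sample points lie exactly on a line, the fit recovers that line's slope and intercept; with real data the result is the line of best fit.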

The multiple linear regression model is as follows: $\hat{y} = 49.931x_1 + 0.399x_2 + 49.383x_3 + 18.922x_4 + 51.063x_5 + 49.085x_6 - 34.769$, with $r^2 = 0.93$.
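The multiple model can be applied to a record in the same way. In the sketch below, the mapping of x1 through x6 to attributes follows the Database section; the dictionary keys, the function name `predict_multiple`, the 0/1 encoding of the yes/no attributes, and the example record are all assumptions made for illustration.

```python
# Reported coefficients of the multiple linear regression model, keyed by attribute.
COEFFICIENTS = {
    "bundled_applications": 49.931,  # x1 (assumed encoding: 1 = yes, 0 = no)
    "hd_size_gb": 0.399,             # x2
    "processor_speed_ghz": 49.383,   # x3
    "integrated_wireless": 18.922,   # x4 (assumed encoding: 1 = yes, 0 = no)
    "battery_life_hours": 51.063,    # x5
    "ram_gb": 49.085,                # x6
}
INTERCEPT = -34.769


def predict_multiple(record: dict) -> float:
    """Predicted retail price ($) for a record holding all six predictors."""
    return INTERCEPT + sum(coef * record[name] for name, coef in COEFFICIENTS.items())


# Hypothetical record; only the HD size and processor speed echo values from Table 2.
example = {
    "bundled_applications": 1,
    "hd_size_gb": 40,
    "processor_speed_ghz": 1.5,
    "integrated_wireless": 0,
    "battery_life_hours": 4,
    "ram_gb": 2,
}
print(predict_multiple(example))
```

With these assumed inputs the sketch evaluates to about 407.62, the same as the first predicted price in Table 2. That agreement suggests, but does not confirm, that the assumed values and encoding correspond to that record.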

It can be shown by the second partial derivative test that (a, b) is a minimum point of the function.

Multiple Linear Regression
The coefficient of determination ($r^2$) is the proportion of the total variation in the outcome variable that is explained by the independent variables. Interpreted correctly, the $r^2$ value helps determine whether the fitted model is strong enough to use for estimation and prediction. The coefficient of determination ranges from 0 to 1.

Several records were used to validate the models. Tables 2 and 3 show four of the records used to validate both models.
Table 2 columns: HD size (GB), Processor Speed (GHz), Integrated Wireless, Bundled Applications, RAM (GB), Battery Life (Hours), Predicted Retail Sale Price ($), Actual Retail Sale Price ($)

Introduction
Regression analysis establishes a relationship between a dependent (outcome) variable and a set of predictors. As a data mining technique, regression is a form of supervised learning. Supervised learning partitions the database into training and validation data. The techniques used in this research were simple linear regression and multiple linear regression. Some distinctions between the use of regression in statistics versus data mining are listed below in Table 1.
Statistics                                         | Data Mining (supervised learning)
The data is a sample from a population.            | The data is taken from a large database (e.g. 1 million records).
The regression model is constructed from a sample. | The regression model is constructed from a portion of the data (training data).
Table 1

Record | HD size (GB) | Processor Speed (GHz) | Predicted Retail Sale Price ($) | Actual Retail Sale Price ($)
1      | 40           | 1.5                   | 407.62                          | 400.00
4      | 80           |                       | 469.21                          | 455.00
5      | 40           |                       | 434.31                          | 395.00
79     | 120          |                       | 566.42                          | 585.00

Table 2 depicts the validation of the multiple linear regression model using records 1, 4, 5, and 79, respectively.

The $r^2$ value can be calculated using the following equation: $r^2 = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2}$
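The coefficient of determination can be computed directly as explained variation divided by total variation. The sketch below assumes the predicted values come from a least-squares fit with an intercept on the same data, which is the setting in which this ratio lies between 0 and 1; the function name `r_squared` and the sample values are hypothetical.

```python
def r_squared(ys, y_hats):
    """Coefficient of determination: explained variation / total variation."""
    y_bar = sum(ys) / len(ys)
    explained = sum((y_hat - y_bar) ** 2 for y_hat in y_hats)
    total = sum((y - y_bar) ** 2 for y in ys)
    return explained / total


# Hypothetical actual values for x = 1, 2, 3, 4 ...
ys = [1.0, 2.0, 2.0, 3.0]
# ... and the fitted values from the least-squares line y = 0.6x + 0.5 on that data.
y_hats = [1.1, 1.7, 2.3, 2.9]

print(r_squared(ys, y_hats))
```

For this toy data the ratio is about 0.9, meaning the fitted line accounts for roughly 90% of the variation in the outcome.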

HD size (GB)               | 40     | 80     | 40     | 120
Predicted Retail Price ($) | 454.75 | 466.71 | 454.75 | 478.67
Actual Retail Price ($)    | 400.00 | 455.00 | 395.00 | 585.00

Table 3 depicts the validation of the simple linear regression model using records 1, 4, 5, and 79, respectively.

Simple Linear Regression

The strength and direction of a linear relationship between two variables can be measured by the correlation coefficient, r. The following formula can be used to calculate r:
$r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left[n\sum x_i^2 - \left(\sum x_i\right)^2\right]\left[n\sum y_i^2 - \left(\sum y_i\right)^2\right]}}$

By extension of the Method of Least Squares, we can develop the multiple regression model. The coefficients and constant term in the multiple regression formula can be derived in the same manner in which we derived the simple linear regression equation. The multiple regression formula is as follows: $\hat{y} = a_0 + a_1x_1 + a_2x_2 + a_3x_3 + \cdots + a_nx_n$, where $\hat{y}$ is the outcome variable and the $x_i$ are the predictors.

Conclusion
Based on the research, the multiple linear regression model was more effective at predicting the retail price of a laptop than the simple linear regression model: it reached $r^2 = 0.93$, whereas the simple model's $r = 0.485$ corresponds to $r^2 \approx 0.24$. A better regression model can therefore be constructed from several attributes than from a single attribute. Since this research was based on data from laptop sales that occurred in January 2008, this work could be expanded by including data from February 2008 through December 2008.

The value of r ranges from -1 to +1. Values near -1 or +1 indicate a strong linear relationship, and values near 0 indicate a weak one.
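The correlation coefficient can be computed from running sums, as in the sketch below. The function name `pearson_r` and the sample lists are hypothetical; the formula is the standard computational form of the Pearson correlation coefficient.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)

    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        raise ValueError("r is undefined when either variable is constant")
    return numerator / denominator


# Hypothetical data: a perfectly increasing and a perfectly decreasing relationship.
print(pearson_r([1, 2, 3], [2, 4, 6]))
print(pearson_r([1, 2, 3], [6, 4, 2]))
```

The two calls illustrate the endpoints of the range: a perfect increasing linear relationship gives r = 1, and a perfect decreasing one gives r = -1.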

Database
Data Name: Laptop Sales January 2008
Number of Records: 7,956
Number of Attributes in the Dataset: 17
Attribute used for Simple Linear Regression: HD size (GB)
Attributes used for Multiple Regression: bundled applications [x1], HD size (GB) [x2], processor speed (GHz) [x3], integrated wireless [x4], battery life (hours) [x5], RAM (GB) [x6]

The formula for simple linear regression is as follows: $\hat{y} = ax + b$, where $\hat{y}$ is the outcome variable and x is the predictor. The slope a and the y-intercept b can be calculated using the method of least squares.

Method of Least Squares
The method of least squares is a minimization technique. In regression, it finds the line that minimizes the sum of the squared vertical distances between the predicted y-values and the actual y-values; this is the regression line (the line of best fit).

Derivation of the Method of Least Squares
Given points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, the coefficient and constant for the regression line can be derived from:
$F(a, b) = \sum_{i=1}^{n} \left(y_i - (a x_i + b)\right)^2$


Procedure
The attribute with the strongest correlation (r) with retail price was used to construct the simple linear regression model. Since HD size had the strongest r (0.485), it was chosen as the predictor variable, and Excel was used to construct the model. The attributes with the strongest r values were used to construct the multiple linear regression model: HD size, battery life, integrated wireless, bundled applications, processor speed, and RAM. XLMiner, a data mining tool that builds multiple regression models, was used to construct this model.
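The attribute-selection step can be sketched as ranking candidate predictors by the absolute value of their correlation with price. The data below is a small hypothetical example, not the laptop sales database, and the helper name `correlation` is an assumption.

```python
from statistics import mean


def correlation(xs, ys):
    """Pearson r computed from deviations about the means."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical prices and candidate attributes (illustrative values only).
prices = [400, 450, 500, 600]
attributes = {
    "hd_size_gb": [40, 80, 80, 120],
    "battery_life_hours": [4, 6, 5, 4],
}

# Rank attributes from strongest to weakest linear relationship with price.
ranked = sorted(
    attributes,
    key=lambda name: abs(correlation(attributes[name], prices)),
    reverse=True,
)
best = ranked[0]
print(best)
```

Ranking by the absolute value treats strong negative relationships as equally informative as strong positive ones; whether the original study ranked by r or by |r| is not stated.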

Acknowledgements
We would like to thank the following persons for their support throughout our research: Dr. Sidbury, Dr. Lee, Dr. Lawrence, Mr. Duffie, and Dr. Bowers. Support for this work has been provided by a grant from the Advancing Spelman's Participation in Informatics Research and Education Program and the National Science Foundation Award #HRD-0714553.

First, minimize F(a,b) by calculating the partial derivatives with respect to a and b. Then set the derivatives equal to zero.
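The minimization step can be written out explicitly. This is a sketch that assumes F(a, b) is the usual sum of squared residuals, $F(a, b) = \sum_{i=1}^{n} (y_i - (a x_i + b))^2$; that form is an assumption, chosen because it is consistent with the divide-by-2 step in the derivation.

```latex
\begin{align*}
\frac{\partial F}{\partial a} &= -2\sum_{i=1}^{n} x_i\,(y_i - a x_i - b) = 0, \\
\frac{\partial F}{\partial b} &= -2\sum_{i=1}^{n} (y_i - a x_i - b) = 0.
\end{align*}
% Dividing each equation by 2 and rearranging gives the normal equations:
\begin{align*}
a\sum_{i=1}^{n} x_i^2 + b\sum_{i=1}^{n} x_i &= \sum_{i=1}^{n} x_i y_i, \\
a\sum_{i=1}^{n} x_i + n\,b &= \sum_{i=1}^{n} y_i.
\end{align*}
```

These are two linear equations in the two unknowns a and b, so they can be solved by substitution or elimination.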
