
World Applied Programming, Vol (3), Issue (10), October 2013.

479-481
ISSN: 2222-2510
2013 WAP journal. www.tijournals.com

A New Method on Spam Filtering


Mina Khoshrangbaf
Computer Dept., Shabestar branch, Islamic Azad University, Shabestar, Iran.
ma.khoshrang@gmail.com

Ali Farzan
Computer Dept., Shabestar branch, Islamic Azad University, Shabestar, Iran.
alifarzanam@gmail.com

A.R. Ayremlou
Computer Dept., Shabestar branch, Islamic Azad University, Shabestar, Iran.
t_ayremlou@yahoo.com

Abstract: Unsolicited email is not only a nuisance but also potentially dangerous. Existing filtering methods
work fairly well on conventional unsolicited commercial email (also known as SPAM). This paper
investigates the features of emails and seeks the most discriminative ones for categorizing emails into the SPAM or
Normal-Mail class. Several classifiers are then designed and their performance is evaluated in terms of
accuracy, sensitivity and specificity. The decision tree outperforms all other classifiers, with a
specificity of 0.99, a sensitivity of 0.96 and an accuracy of 0.98.
Keywords: SPAM Filtering, feature selection, classification
I. INTRODUCTION

In recent years, the increasing use of e-mail has led to the emergence and further escalation of problems caused by
unsolicited bulk e-mail messages, commonly referred to as Spam [1]. Evolving from a minor nuisance to a major
concern, given the high circulating volume and offensive content of some of these messages, Spam is beginning to
diminish the reliability of e-mail [2]. Personal users and companies are affected by Spam due to the network bandwidth
wasted receiving these messages and the time spent by users distinguishing between Spam and normal messages. A
business model relying on Spam marketing is usually advantageous because the costs for the sender are small, so that a
large number of messages can be sent to maximize the returns; this aggressive behavior is one of the defining
characteristics of Spammers [3]. The economic impact of Spam has led some countries to adopt legislation [2, 4-5],
although its effect is limited by the fact that many such messages are sent from many different countries [6].
Another approach is the use of Spam filters, which attempt to identify Spam messages based on analysis of the message
contents and additional information [7-8]. Although a variety of approaches have been proposed, spam filtering based on
the contents of the mail has so far remained a significant one.
In spam filtering, appropriate pre-processing steps are required before a mail is fed into a classifier; these include
tokenization, lemmatization, stop-word removal and representation [1]. Among them, the representation of the mail is
particularly significant [9]. Based on e-mail messages received at a large e-mail service provider, two studies [10], [11] investigated
the aggregate global characteristics of spamming botnets (networks of compromised machines involved in spamming)
including the size of botnets and the spamming patterns of botnets. These studies provided important insights into the
aggregate global characteristics of spamming botnets by clustering spam messages received at the provider into spam
campaigns using embedded URLs and near-duplicate content clustering, respectively. However, their approaches are
better suited for large e-mail service providers to understand the aggregate global characteristics of spamming botnets
instead of being deployed by individual networks to detect internal compromised machines. Xie et al. developed an
effective tool named DBSpam to detect proxy-based spamming activities in a network relying on the packet symmetry
property of such activities [12]. Their algorithm can only detect spam proxies that translate and forward upstream
non-SMTP packets (for example, HTTP) as SMTP commands to downstream mail servers.
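The pre-processing steps named at the start of this paragraph can be illustrated with a minimal Python sketch. It covers only tokenization, stop-word removal and a bag-of-words representation; lemmatization is omitted because it needs a language resource. The stop-word list, the function names and the sample message are hypothetical and are not taken from the cited works.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real filters use larger ones.
STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of", "for", "you"}

def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that carry little discriminative information."""
    return [t for t in tokens if t not in STOP_WORDS]

def bag_of_words(tokens):
    """Represent a message as a mapping from token to occurrence count."""
    return Counter(tokens)

# Hypothetical sample message.
message = "Claim the free prize: a FREE prize is waiting for you"
tokens = remove_stop_words(tokenize(message))
representation = bag_of_words(tokens)
print(tokens)
print(representation)
```

The resulting counts are the kind of representation that a classifier would receive as input.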
In this work, the Spam-Mail dataset (obtained from http://archive.ics.uci.edu/ml/datasets/Spambase), a popular
standard dataset, is used to test the proposed algorithm. The dataset includes 4601 records of 57 features. Each record
is labeled with one of two class labels, SPAM or Normal-Mail. A total of 1813 records are SPAM samples and the
remaining 2788 are Normal-Mails.
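The class counts above fix a useful reference point: a classifier that labels every record as Normal-Mail already reaches the majority-class accuracy. The short sketch below computes it from the stated counts; the variable names are illustrative only.

```python
# Class counts as stated for the dataset.
N_SPAM = 1813
N_NORMAL = 2788

total = N_SPAM + N_NORMAL
spam_fraction = N_SPAM / total

# Accuracy of a trivial classifier that always predicts the larger class.
majority_baseline = max(N_SPAM, N_NORMAL) / total

print(total)
print(round(spam_fraction, 3))
print(round(majority_baseline, 3))
```

Any accuracy close to this baseline indicates that a classifier is adding little beyond the class prior.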


Mina Khoshrangbaf, et al. World Applied Programming, Vol (3), No (10), October 2013.

II. METHODS

As mentioned above, the dataset contains many features, and this high dimensionality calls for complicated,
high-dimensional classifiers. On the other hand, many of these features do not contribute much to classifying the
records and can be eliminated at the cost of losing a small part of the information. The discrimination
power of the features is therefore evaluated in several ways. Student's two-sample t-test is applied to the dataset to
determine the significance of each feature. According to the results, the feature related to the word "will" is not
significant and is therefore not used in the classification.
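As an illustration of the test statistic involved, the sketch below computes the pooled-variance (Student) two-sample t statistic for one feature. The toy values and the function name are hypothetical; deciding significance additionally requires comparing the statistic with a critical value of the t distribution, which is not shown here.

```python
from math import sqrt
from statistics import mean, variance

def student_t_statistic(sample_a, sample_b):
    """Two-sample t statistic with pooled variance (equal variances assumed)."""
    n_a, n_b = len(sample_a), len(sample_b)
    pooled_var = ((n_a - 1) * variance(sample_a) +
                  (n_b - 1) * variance(sample_b)) / (n_a + n_b - 2)
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error

# Hypothetical values of one feature in the two classes.
spam_values = [1, 2, 3, 4, 5]
normal_values = [2, 3, 4, 5, 6]

t_value = student_t_statistic(spam_values, normal_values)
degrees_of_freedom = len(spam_values) + len(normal_values) - 2
print(t_value, degrees_of_freedom)
```

A statistic close to zero, as for a feature whose class means barely differ, suggests that the feature does not separate the classes.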
The other method of specifying the discrimination power of features is Fisher's Discriminant Ratio (FDR). FDR is
sometimes used to quantify the separability capabilities of individual features. It resembles the test statistic q
that appears in statistical hypothesis tests. FDR can be formulated as:

$$ FDR = \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2} $$

where $\mu_i$ and $\sigma_i$ represent the mean value and standard deviation of the feature values in class $i$,
respectively. Because of the high dimensionality of the feature set, and for the sake of simplicity, the FDR values of all
features are not shown here; only the few features with the highest FDR values are investigated. The feature for the
word "your" has the highest FDR value, equal to 0.3467, and appears to be the best single feature for classifying data
patterns. The FDR values of the 8 most discriminative features are depicted in Figure 1.

[Bar chart. Vertical axis: FDR (0 to 0.4). Horizontal axis: Most Discriminative Features. Bar values: 0.3467, 0.2084, 0.2101, 0.1967, 0.1786, 0.1743, 0.1452, 0.1328.]

Figure. 1. FDR values of eight most discriminative features out of 57 features
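The ranking shown in the figure can be sketched as follows, assuming the common two-class form of FDR, (mu1 - mu2)^2 / (sigma1^2 + sigma2^2). Whether the population or the sample variance was used is not stated in the text, so the population variance here is an assumption. The feature names and values are hypothetical toy data.

```python
from statistics import mean, pvariance

def fisher_discriminant_ratio(class_a, class_b):
    """FDR of one feature: squared mean difference over summed variances."""
    numerator = (mean(class_a) - mean(class_b)) ** 2
    # Assumption: population variance; the text does not say which estimator was used.
    denominator = pvariance(class_a) + pvariance(class_b)
    return numerator / denominator

# Hypothetical feature values per class: (SPAM values, Normal-Mail values).
features = {
    "feature_a": ([0.0, 2.0], [3.0, 5.0]),
    "feature_b": ([1.0, 3.0], [2.0, 4.0]),
}

scores = {name: fisher_discriminant_ratio(spam, normal)
          for name, (spam, normal) in features.items()}

# Features sorted from most to least discriminative.
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)
```

Features whose class means are far apart relative to their spread receive the highest scores.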

Thresholding, one of the simplest methods of classification, can be applied to the "your" feature to evaluate its
discrimination power. Four thresholding methods are adopted: rigrsure (based on Stein's Unbiased Risk Estimate),
heursure (a heuristic variant of rigrsure), sqtwolog (the universal threshold sqrt(2 log n)) and minimaxi (based on the
minimax principle). Accuracy, sensitivity and specificity are the three measures used to evaluate the performance of a
classifier. Their values are given in Table 1. For this application, specificity is the most important of the
three measures, because misclassifying a normal mail as SPAM is far more costly than
misclassifying a SPAM as normal mail. By this measure, heursure and sqtwolog are the best of the four methods.
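A single-feature threshold classifier and the three measures can be sketched as below, taking SPAM as the positive class (label 1) and normal mail as the negative class (label 0). The threshold value and the toy data are hypothetical; the sketch does not reproduce how the four named methods choose their thresholds.

```python
def classify(values, threshold):
    """Label a record as SPAM (1) when its feature value exceeds the threshold."""
    return [1 if v > threshold else 0 for v in values]

def evaluate(y_true, y_pred):
    """Return accuracy, sensitivity and specificity with SPAM as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),   # fraction of SPAM caught
        "specificity": tn / (tn + fp),   # fraction of normal mail kept
    }

# Hypothetical feature values and true labels.
values = [0.0, 0.1, 0.5, 1.2, 2.0, 0.3]
labels = [0, 0, 0, 1, 1, 1]

predictions = classify(values, threshold=0.4)
result = evaluate(labels, predictions)
print(predictions)
print(result)
```

Raising the threshold trades sensitivity for specificity, which is the trade-off weighed in the paragraph above.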
Table 1. Performance measures of threshold-based classifiers

Thresholding Method    Accuracy    Sensitivity    Specificity
rigrsure               0.65        0.22           0.94
heursure               0.60        0.003          0.99
sqtwolog               0.60        0.003          0.99
minimaxi               0.61        0.03           0.98
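As a consistency check on the reported values, accuracy can be written as the class-weighted mean of sensitivity and specificity. The check below assumes that the measures were computed on all 4601 records, with P = 1813 SPAM and N = 2788 Normal-Mail records; this is an assumption, since the evaluation set is not stated explicitly.

$$
\text{Accuracy} = \frac{\text{Sensitivity} \cdot P + \text{Specificity} \cdot N}{P + N}
$$

$$
\text{rigrsure: } \frac{0.22 \cdot 1813 + 0.94 \cdot 2788}{4601} = \frac{398.86 + 2620.72}{4601} \approx 0.656
$$

$$
\text{heursure: } \frac{0.003 \cdot 1813 + 0.99 \cdot 2788}{4601} = \frac{5.439 + 2760.12}{4601} \approx 0.601
$$

Both values are close to the reported accuracies of 0.65 and 0.60, allowing for rounding of the inputs.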


Another classification method adopted here is the decision tree. It is a hierarchical classification method that tries to
reduce the entropy of the data by recursively subdividing it into purer subsets. The proposed method is evaluated
in two ways: by resubmitting the training data set to the tree, and by the well-known cross-validation.
Results are shown in Table 2. Resubmission yields better results than cross-validation because it measures how well the
tree has memorized the training data, whereas cross-validation measures how well it generalizes to unseen records.
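The entropy-reduction criterion described above can be sketched as follows. The text does not say which tree-induction algorithm was used, so the information-gain computation below is only a generic illustration; the function names and labels are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting parent into left and right."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Hypothetical labels: 1 = SPAM, 0 = Normal-Mail.
parent = [1, 1, 0, 0]
pure_split = information_gain(parent, [1, 1], [0, 0])
useless_split = information_gain(parent, [1, 0], [1, 0])
print(entropy(parent), pure_split, useless_split)
```

A tree builder of this kind would choose, at each node, the split with the largest gain, which yields the purest subsets.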

Table 2. Performance of the designed decision tree

Evaluation Method           Accuracy    Sensitivity    Specificity
Resubmission                0.98        0.96           0.99
10-fold Cross-Validation    0.91        0.89           0.93
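The gap between the two rows can be reproduced in miniature. The sketch below swaps in a one-nearest-neighbour classifier instead of a decision tree, because it memorizes its training data completely and keeps the code short; the data, the fold assignment and all names are hypothetical.

```python
def one_nn_predict(train_x, train_y, x):
    """Predict the label of the training point closest to x."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

def accuracy(train_x, train_y, test_x, test_y):
    """Fraction of test points whose predicted label matches the true label."""
    hits = sum(one_nn_predict(train_x, train_y, x) == y
               for x, y in zip(test_x, test_y))
    return hits / len(test_x)

def k_fold_accuracy(xs, ys, k):
    """Each record is tested once, by a model that never saw it in training."""
    n = len(xs)
    hits = 0
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        hits += sum(one_nn_predict(train_x, train_y, xs[i]) == ys[i]
                    for i in test_idx)
    return hits / n

# Hypothetical one-feature data with two noisy labels.
xs = [0.1, 0.2, 0.3, 0.9, 1.0, 1.1, 0.95, 0.25]
ys = [0, 0, 0, 1, 1, 1, 0, 1]

resubstitution = accuracy(xs, ys, xs, ys)   # test on the training data itself
cross_validated = k_fold_accuracy(xs, ys, k=4)
print(resubstitution, cross_validated)
```

Testing on the training data rewards memorization, so the cross-validated figure is the more realistic estimate of performance on new mail.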

III. RESULTS AND CONCLUSION

According to the evaluation results presented above, the decision tree outperforms all other
methods and can be chosen as the best classifier. Its specificity is as good as that of the best thresholding methods,
while its sensitivity, and consequently its accuracy, is much better than that of the simple thresholding methods.

REFERENCE
[1] Guzella, T. S; Caminhas, W. M. A review of machine learning approaches to Spam filtering, Expert Systems with Applications, vol. 36, pp. 10206-10222, 2009.
[2] Hoanca, B. How good are our weapons in the spam wars?, Technology and Society Magazine, IEEE, vol. 25, pp. 22-30, 2006.
[3] Martín-Herrán, G; et al. Competing for consumer's attention, Automatica, vol. 44, pp. 361-370, 2008.
[4] Carpinter, J; Hunt, R. Tightening the net: A review of current and next generation spam filtering tools, Computers & Security, vol. 25, pp. 566-578, 2006.
[5] Stern, H. A survey of modern spam tools, 2008.
[6] Talbot, D. Where SPAM is born, Technology Review, vol. 111, p. 28, 2008.
[7] Goodman, J; et al. Spam and the ongoing battle for the inbox, Communications of the ACM, vol. 50, pp. 24-33, 2007.
[8] Hayes, B. How many ways can you spell V1@gra?, American Scientist, vol. 95, pp. 298-302, 2007.
[9] Sebastiani, F. Machine learning in automated text categorization, ACM Computing Surveys (CSUR), vol. 34, pp. 1-47, 2002.
[10] Xie, Y; et al. Spamming botnets: signatures and characteristics, 2008, pp. 171-182.
[11] Zhuang, L; et al. Characterizing botnets from email spam records, 2008, pp. 1-9.
[12] Xie, M; et al. An effective defense against email spam laundering, 2006, pp. 179-190.
