
Proceedings of The Fourth International Conference on Informatics & Applications, Takamatsu, Japan, 2015

Statistical Analysis Between Malware and Benign Based on IA-32 Instruction
Dongwoo Kang, Donghoon Lee, Jaewook Jung, and Dongho Won
College of Information and Communication Engineering
Sungkyunkwan University
2066 Seobu-ro, Jangan-gu, Suwon-si, Gyeongki-do, 440-746, Korea
{dwkang, dhlee, jwjung, dhwon}@security.re.kr
ABSTRACT
Malicious software is one of the serious threats in the information society. As malicious software has evolved, techniques for detecting it have also progressed. Statistical data about existing malicious software is most important for detecting new malicious software. Studies of statistical malicious software analysis have so far focused mainly on the opcode, which is only one part of the whole instruction. This paper analyses statistical data that considers the whole instruction: not only the opcode but also 5 types of operands. We find that the majority of instructions in both benign and malicious software are related to function calls, and these cannot be a good predictor for detecting malicious software. However, as the frequency of an instruction in benign software gets smaller, the association between rare instructions and malicious software classes grows. This paper also discovers some instructions that are used only in malicious software.

KEYWORDS
Malware, Statistical, Instruction, computer security,
Unique Instruction

1

INTRODUCTION

Malware (malicious software) is any software used to disrupt computer operation, gather sensitive information, or gain access to private computer systems [1]. Malware is a collective term for executable code created for malicious purposes, and it is classified into viruses, worms, and trojans according to its ability to replicate and to infect a target. A virus is a set of instructions that copies itself, or a modified version of itself, into a program or executable region. A worm is a set of instructions that copies itself, or a modified version of itself, without infecting other programs. A worm exists as code in memory or as an executable file, and when executed it copies that file or code from one system to another. A trojan appears to be a benign program but does damage once installed or run. Those on the receiving end of a trojan are usually tricked into opening it because it appears to be legitimate software or a legitimate file from a legitimate source. Recently, ransomware such as CryptoWall has spread domestically and worldwide: it covertly encrypts the documents and spreadsheets on an internet user's computer and then offers the decryption program in exchange for money. According to a Dell SecureWorks report, this malware infected 625,000 computers in 5 months, encrypted 1.24 billion files, and caused damage costing $1,101,900 [2].
In this way, the symptoms and spreading patterns of malicious code are becoming more complicated and intelligent. Inevitably, existing anti-virus programs need to be expanded into integrated security programs that diagnose and treat a variety of malware. According to the McAfee Labs Threats Report, the total number of malware samples has been increasing linearly since 2012, and the number of new malware samples is also increasing linearly. There are over 307 new threats every minute, or more than 5 every second. The total number of malware samples is about 3 times what it was 2 years ago, while the number of new malware samples is about 5 times as large [3].

ISBN: 978-1-941968-16-1 2015 SDIWC

Because the number of malware samples to classify grows day by day, shortening the analysis time is the most important issue in detection. Techniques that use similarity measurement and artificial intelligence to spot new malware have been proposed. In this paper, we investigate the opcode analysis methods that have been used as the main source of statistical data on existing malicious code, and we attempt a more detailed statistical analysis of malware using IA-32 instructions.
This paper analyses statistical differences between malware and benign software. Section 2 reviews related work on malware analysis and the limitations of statistical analysis based on opcodes. Sections 3 and 4 explain how we collect the samples and extract instructions, and present the results of the statistical analysis. Sections 5 and 6 discuss our findings, future research, and potential developments.
2

RELATED WORK

Nowadays signature-based detection is widely known as the most commonly used way to detect malware. Signature-based detection depends on a database of known malware, which keeps growing. As the malware database grows, detection of known malware becomes more reliable. However, as mentioned in the introduction, the number of new malware samples is on the rise, so signature-based detection does not always succeed. It is therefore important to analyse statistical data on existing malicious code in order to detect new malicious code. Research to find clearer statistical differences that distinguish benign software from malware has also continued.
Li et al. (2005) [4] implemented 1-gram analysis at the binary level, not the opcode level, and proposed a fingerprint (fileprint) for file type identification. However, it is not very effective when the malicious code is not placed at the beginning or end of the file.
Weber et al. (2002) [5] implemented and provided a tool, PEAT (Portable Executable Analysis Toolkit), which reports instruction frequencies, instruction patterns, register offsets, jump and call offsets, entropy of opcode values, and code and ASCII probabilities.
Bilar (2007) [6] focused on the opcode as a predictor and verified its validity. He analysed 14 commonly used opcodes and 14 rarely used opcodes. The result was that rarely used opcodes are more effective predictors than commonly used opcodes.
Santos et al. (2010) [7] proposed a new technique to detect variants of known malware families. They abstracted the opcode sequences of benign and malware executables and used cosine similarity to evaluate the resemblance between benign software and malware. They further evaluated the similarity between malware families. The system was able to identify malware variants and to distinguish benign executables.
Shabtai et al. (2012) [8] examined the effectiveness of malware detection with different N-gram sizes. They showed that the 2-gram, which treats 2 consecutive opcodes as one unit, performed best. This result supports the conclusion of Moskovitch et al. (2008) [9].
Santos et al. (2013) [10] proposed a new technique to detect unknown malware families based on the frequency of appearance of opcode sequences, using decision trees, SVM, kNN, and Bayesian networks.
Nissim et al. (2014) [11] used statistical data on byte N-grams to present an active learning (AL) framework and introduced two new AL methods that assist anti-virus vendors in focusing their analytical efforts by acquiring the files that are most probably malicious.
O'Kane et al. (2014) [12] investigated the full spectrum of opcodes that could be used to detect malware. They used a novel combination of feature filtering and feature selection to investigate the ability to detect malware with different N-gram sizes; N = 3 and N = 4 showed the best detection rates.
In this way, research so far has focused on byte-level or opcode sequences (see Figure 1). However, an instruction consists of more than its opcode, and excluding the operands and considering the opcode only can produce a statistical distortion. Even when the opcodes are the same, there are many types of operands, and instructions with different operands are as different as chalk and cheese (for example: mov eax [ebx], mov Imm eax).
Therefore, in this study we use the instruction prefixes, opcode, ModR/M, and SIB fields of the IA-32 instruction format. This compares favourably with previous research because it considers not only the opcode but also the operands at the same time.

[Figure 1: an instruction listing (push, mov, call, test, jnz, xor, pop, retn) shown on the left with opcodes and operands, and on the right with the opcodes only]
Figure 1. Limitation of previous research

3

COLLECT INSTRUCTIONS

In this section, we describe how we collect the benign and malware sample executables and how we translate machine language into instructions.
3.1

Benign Collect

For the benign set, we extract the default executable files of Windows 7 Ultimate K. The set consists of 200 executable files. To balance file sizes, we divide program size into three intervals: (1-10KB), (10-100KB), and (100KB-1000KB(=1MB)). Every executable file falls into one of these size blocks. The proportion of files in each size block is 35% (70 files), 45% (90 files), and 20% (40 files), respectively.
3.2

Malware Collect

For the malware set, we classify malware into 7 classes (backdoor, dos, rootkit, trojan, virtool, virus, worm). As with the benign set, each class consists of 200 Win32 executable files: 35% (70 files) of 1-10KB, 45% (90 files) of 10-100KB, and 20% (40 files) of 100KB-1000KB (=1MB). To increase variety, each malware class includes at least 10 variants. Table 1 shows each class and its variants. All of the malware files come from the VX Heavens database.

Table 1. Variants of each malware classes

Name      Variants
Backdoor  Angelfire, Banito, CrashCool, Danton, Frenzy, Infexor, Izram, Jinmoze, Lantinus, MoSucker
DoS       ARPKiller, Agent, Aleph, Chalcol, DBomb, DK-ToyBox, Delf, Hucsyn, Jman, Kod, Lanxue
Rootkit   DarkShell, Fuzen, Mag, Namana, Pakes, Podnuha, Qandr, Ressdt, TDSS, Vanti
Trojan    Aditer, Agent, Alcalup, Fatoos, Filco, Mista, Monderd, Spabot, Srizbi, Whispy
Virtool   Ainder, BlindEye, Cicho, Delf, Facker, Jointer, Krepus, LdPinch, Scramble, Voodoo
Virus     Afgan, Belial, Bolzano, Cruck, Deemo, Enumiacs, Henky, Krepper, Kuto, Neshta
Worm      Busan, Bymer, Donk, Fasong, Fesber, Mobler, Petik, Sever, Sorin, Wogue

3.3

Instruction Breakdown

We use the Python package diStorm3 to decode x86 binary files and analyse their structure. We analyse only the text section of each executable file, which holds the instructions that the program actually executes. Each file consists of a DOS header, stub code, a PE header, one section header per image section, and the image sections themselves. There are 4 types of section: text, data, resource, and relocation (see Figure 2). The text section, on which we focus, holds the program code and is mapped as execute/read-only. To extract only the text section, we use the text section header. Table 2 shows the important members of a section header.
We first check the characteristics of each file to determine whether it is an executable file. If it is, we use its text section header to map the section's RVA to RAW. We first search for the section that includes the RVA and then calculate the file offset (RAW), which is formally given by the following:

RAW = RVA - VirtualAddress + PointerToRawData.

[Figure 2: layout of a PE file on disk (FILE, by offset) and mapped in MEMORY (by address): DOS Header, Stub Code, PE Header (Optional/File Header), Section Headers (.text, .data, .rsrc, .reloc), and then the .text, .data, .rsrc, and .reloc sections separated by padding]
Figure 2. PE format

Table 2. Important Member of Sections Header structure

Name              Description
VirtualSize       Actual size of the code or data
VirtualAddress    RVA to where the loader should map the section
SizeOfRawData     The size of the section after it has been rounded up to the file alignment size
PointerToRawData  File-based offset of where the raw data emitted by the compiler or assembler can be found
Characteristics   Set of flags that indicate the section's attributes (such as code/data, readable, or writeable)

Because the section header gives the size of the section, the text section can be extracted from the whole file. After extracting the text section from the executable file, we translate each piece of machine language into an instruction using the IA-32 instruction format. An IA-32 instruction consists of 6 fields (see Figure 3). The opcode is essential, and the other fields are optional. Figure 4 shows a simple translation from machine language to IA-32 instructions. Next, we classify each operand into five types. Table 3 shows each operand type with a brief explanation.

[Figure 3: the six fields of an IA-32 instruction: Instruction Prefixes, Opcode, ModR/M (Mod, Reg/Opcode, R/M), SIB (Scale, Index, Base), Displacement, and Immediate]
Figure 3. IA-32 Instruction Format

Table 3. Types of operand

Operand Type             Description
Immediate                The constant value
Register                 The value held in the register
Absolute Memory          The address is calculated using a register expression
Absolute Memory Address  The address is absolute
Far Memory               Like absolute, but with the selector/segment specified too

Finally, we regard each instruction as a single word and calculate the frequency of each instruction with the Python modules collections and operator. The datasets are then normalized in Microsoft Excel 2010. All of this work was executed on a Microsoft Windows 7 Ultimate K 64-bit operating system.
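The RVA-to-RAW mapping used in this subsection (RAW = RVA - VirtualAddress + PointerToRawData, applied to the section that contains the RVA) can be sketched as follows. This is a minimal illustration and not the authors' code: the `SectionHeader` class, the helper names, and the header values are hypothetical, and using the larger of VirtualSize and SizeOfRawData as the section extent is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class SectionHeader:
    """Subset of the section header members listed in Table 2."""
    name: str
    virtual_size: int         # VirtualSize
    virtual_address: int      # VirtualAddress (RVA of the section start)
    size_of_raw_data: int     # SizeOfRawData
    pointer_to_raw_data: int  # PointerToRawData (file offset of the section)


def find_section(rva: int, sections: Iterable[SectionHeader]) -> Optional[SectionHeader]:
    """Return the section whose mapped range contains the RVA, if any."""
    for sec in sections:
        # Assumption: the section extends over the larger of its two sizes.
        span = max(sec.virtual_size, sec.size_of_raw_data)
        if sec.virtual_address <= rva < sec.virtual_address + span:
            return sec
    return None


def rva_to_raw(rva: int, sections: Iterable[SectionHeader]) -> int:
    """Apply RAW = RVA - VirtualAddress + PointerToRawData."""
    sec = find_section(rva, sections)
    if sec is None:
        raise ValueError(f"RVA {rva:#x} is not inside any section")
    return rva - sec.virtual_address + sec.pointer_to_raw_data


# Hypothetical header values, chosen only for illustration.
SECTIONS = [
    SectionHeader(".text", 0x1A00, 0x1000, 0x1C00, 0x0400),
    SectionHeader(".data", 0x0300, 0x3000, 0x0400, 0x2000),
]

print(hex(rva_to_raw(0x1234, SECTIONS)))  # 0x634
```

With the offset and SizeOfRawData of the text section known, the bytes of that section can be sliced out of the file and passed to the decoder.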

4

RESULT & ANALYSIS

In this section, we describe the statistical data for the benign and malware sets and analyse each of them.
4.1

Instruction Frequency

First, we describe the instruction frequencies of the malware and benign sets, that is, which instructions make up a large proportion of the programs.
4.1.1

Benign

The benign samples included about 4,040,120 instructions, and 594 different instruction types were found. 64 instruction types (about 9%) accounted for 90% of all instructions. The top 10 instruction types make up 50% of all instructions, and the top 20 instruction types make up 70% of all instructions.
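The frequency statistics in this subsection can be reproduced in outline with `collections.Counter`. The sketch below is illustrative only: it assumes each decoded instruction is already available as a mnemonic plus a list of operand types, the helper names are hypothetical, and the sample data is invented.

```python
from collections import Counter


def instruction_token(mnemonic, operand_types):
    """Join a mnemonic and its operand types into one word, e.g. PUSH[Register]."""
    return f"{mnemonic}[{','.join(operand_types)}]"


def types_needed_for_coverage(counts, target_percent):
    """Number of most frequent instruction types that reach the target share."""
    total = sum(counts.values())
    covered = 0
    for n_types, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        # Integer comparison avoids floating-point rounding at the threshold.
        if covered * 100 >= target_percent * total:
            return n_types
    return len(counts)


# Hypothetical decoded instructions: (mnemonic, operand types).
decoded = (
    [("ADD", ["AbsoluteMemory", "Register"])] * 5
    + [("PUSH", ["Register"])] * 3
    + [("POP", ["Register"]), ("RET", [])]
)
counts = Counter(instruction_token(m, ops) for m, ops in decoded)

for token, count in counts.most_common():
    print(f"{token:32s} {100 * count / sum(counts.values()):5.1f}%")
print(types_needed_for_coverage(counts, 50))  # 1
print(types_needed_for_coverage(counts, 80))  # 2
```

Treating the whole token as the counting key is what lets instructions that share an opcode but differ in operand types be counted separately.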

[Figure 4: a machine-code byte sequence broken into its opcode and ModR/M fields and translated to MOV ECX, EAX]
Figure 4. Translate machine language to IA-32 instructions
4.1.2

Malware

As with the benign set, we count the total instructions of the malware set. The malware samples included about 31,264,571 instructions and 1,030 different instruction types across the 7 classes. Table 4 shows the number of instructions and the number of instruction types for each class. 96 instruction types (about 1%) accounted for 90% of all instructions. The top 10 instruction types make up 48% of all instructions, and the top 20 instruction types make up 63% of all instructions. Table 5 shows the 10 most frequent benign instructions and compares them with each malware class.
We see that the high-ranked instructions of benign software and malware are similar. Furthermore, the high-ranked instructions are related to function calls. Whenever a function is called, the counts of the push[register], move[register, register], pop[register], move[register, register], and retn instructions increase because of the stack frame (see Figure 5).


Table 4. Statistic about Malware instruction

Classes   Total number of inst.  Number of inst. type
Virus     3708544                905
Worm      4805446                865
Backdoor  4068544                913
Dos       4459540                873
Virtool   4835244                1013
Trojan    4695292                949
Rootkit   4691961                1004

4.2

Malware Instruction Ratio

Table 5. Top 10 instructions for Benign

Instruction                   Benign(%)  Virus(%)    Worm(%)     Backdoor(%)  Dos(%)      Virtool(%)  Trojan(%)   Rootkit(%)
ADD[AbsoluteMemory,Register]  14.51      23.18 (1)   13.31 (1)   20.52 (1)    15.43 (1)   16.27 (1)   12.01 (1)   15.69 (1)
PUSH[Register]                6.01       6.01 (3)    7.33 (3)    7.6 (2)      5.98 (3)    5.65 (2)    7.05 (2)    5.19 (2)
MOV[Register,AbsoluteMemory]  4.83       5.26 (4)    7.59 (2)    3.88 (3)     5.19 (4)    3.58 (4)    5.6 (3)     1.32 (9)
INT 3[]                       4.32       6.47 (2)    1.59 (15)   1.34 (15)    6.1 (2)     2.86 (6)    1.74 (13)   0.34 (74)
CALL[Immediate]               3.68       3.17 (6)    4.73 (6)    2.1 (10)     3.17 (7)    2.11 (9)    3.88 (6)    1.29 (11)
POP[Register]                 3.6        3.18 (5)    4.92 (5)    3.08 (4)     3.93 (5)    3.88 (3)    5.46 (4)    3.96 (4)
MOV[Register,Register]        3.39       3.1 (7)     5.35 (4)    1.87 (11)    3.69 (6)    2.5 (8)     4.75 (5)    1.3 (10)
JZ[Immediate]                 3.38       1.96 (10)   2.27 (10)   1.33 (16)    1.95 (11)   1.5 (12)    2.11 (10)   0.61 (33)
PUSH[Immediate]               3.25       2.36 (8)    2.32 (9)    2.57 (7)     1.91 (13)   1.47 (13)   2.08 (11)   0.94 (14)
MOV[AbsoluteMemory,Register]  3.05       2.29 (9)    3.56 (7)    2.29 (9)     2.2 (9)     1.75 (11)   2.47 (9)    1.18 (12)

The number in parentheses is the rank of the instruction within each malware class.

[Figure 5: a caller listing that pushes registers and calls a subroutine, shown beside the called function's stack-frame code (push ebp; mov ebp, esp; pop ebp; mov esp, ebp; retn), annotated as push[register], move[register, register], pop[register], move[register, register], and retn[Immediate] or [MemoryAddress]]
Figure 5. Function Call

Because there are so many instructions, we design 4 instruction ratio blocks based on the benign instruction frequency: [over 1%], [1% - 0.1%], [0.1% - 0.01%], and [0.01% - 0.001%]. Instructions with a ratio under 0.001% are excluded because they are too few to analyse and could distort the statistical data. The four blocks contain 22, 55, 104, and 102 instructions, respectively.
We assume that if there is no association between malware class and instruction, the benign instruction ratios also apply to malware. We compare the ratio of each instruction among all benign instructions with its ratio among all instructions of each malware class, and examine the gap between the two percentages in each ratio block. We divide the gap into 3 categories: Similar, High or Low, and Very High or Very Low. Similar means that the malware percentage is between 0.5 and 2 times the benign percentage. High or Low means that it is between 2 and 5 times (High) or between 0.2 (=1/5) and 0.5 times (Low) the benign percentage. Very High or Very Low means that it is over 5 times or under 0.2 times the benign percentage. We then calculate the proportion of instructions that falls into each category.
Tables 6-8 show the results. When the benign instruction ratio is high (over 1%), about 27.27% to 90.91% of malware instructions have a similar ratio, and few or no instructions have a very high or very low ratio compared with the benign ratio (0% - 27.27%). Conversely, as the benign instruction ratio decreases, the gap between the benign and malware percentages increases. When the instruction ratio is between 0.01% and 0.001%, only about 11.76% to 35.29% of malware instructions have a similar ratio, and about 23.53% to 69.60% have a very high or very low ratio compared with the benign ratio.
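The ratio blocks and gap categories of this subsection can be expressed compactly in code. The sketch below is an illustration under assumptions: the function names are hypothetical, the handling of values that fall exactly on a boundary (0.2, 0.5, 2, 5 times) is a choice made here because the text does not specify it, and the sample percentages are invented.

```python
def ratio_block(benign_percent):
    """Assign an instruction to a block by its benign frequency (in percent)."""
    if benign_percent >= 1.0:
        return "over 1%"
    if benign_percent >= 0.1:
        return "1% - 0.1%"
    if benign_percent >= 0.01:
        return "0.1% - 0.01%"
    if benign_percent >= 0.001:
        return "0.01% - 0.001%"
    return None  # under 0.001%: excluded from the analysis


def gap_category(benign_percent, malware_percent):
    """Compare the malware share of an instruction with its benign share."""
    ratio = malware_percent / benign_percent  # benign_percent must be > 0
    if 0.5 <= ratio <= 2:
        return "Similar"
    if 0.2 <= ratio <= 5:
        return "High or Low"
    return "Very High or Very Low"


# Invented (benign %, malware %) pairs for three instructions.
samples = [(14.51, 23.18), (0.5, 0.15), (0.004, 0.03)]
for benign, malware in samples:
    print(ratio_block(benign), "->", gap_category(benign, malware))
```

Counting how many instructions of each block land in each category, per malware class, gives the kind of percentages reported in Tables 6-8.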

4.3

Contingency test of Instruction

A more specific measure of association, phi, adjusts the chi-square statistic by the sample size. Phi is most easily defined as:

$$\phi = \sqrt{\frac{\chi^2}{n}}$$

Sometimes phi squared is used as a measure of association, and phi squared is defined as:

$$\phi^2 = \frac{\chi^2}{n}$$

Since phi is usually less than one, and since the square of a number less than one is an even smaller number, phi squared is usually smaller than phi.

Table 6. Percentage of Similar ratio between malware and benign
[Table body: columns Ratio, Virus, Worm, Backdoor, Dos, VirTool, Trojan, Rootkit; one row for each of the four ratio blocks]

Table 7. Percentage of High or Low ratio between malware and benign
[Table body: columns Ratio, Virus, Worm, Backdoor, Dos, VirTool, Trojan, Rootkit; one row for each of the four ratio blocks]

Table 8. Percentage of Very High or Very Low ratio between malware and benign
[Table body: columns Ratio, Virus, Worm, Backdoor, Dos, VirTool, Trojan, Rootkit; one row for each of the four ratio blocks]

A slightly different measure of association is the contingency coefficient. This is another chi-square-based measure of association, and one that also adjusts for different sample sizes, as in this situation. The contingency coefficient can be defined as:

$$C = \sqrt{\frac{\chi^2}{n + \chi^2}}$$

Since $\phi^2 = \chi^2 / n$, it is straightforward to show that:

$$C = \sqrt{\frac{\phi^2}{1 + \phi^2}}$$

When there is no association between two variables, the contingency coefficient is zero, and as the strength of the association increases, the contingency coefficient approaches one.
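As a worked illustration of these measures, the sketch below computes the chi-square statistic of a small contingency table and derives phi and the contingency coefficient from it. The table layout (rows for benign and one malware class, columns for two instructions) and its counts are assumptions made for the example; the paper does not state how its tables were constructed.

```python
import math


def chi_square(table):
    """Pearson chi-square statistic and sample size of a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n


def phi(chi2, n):
    """phi = sqrt(chi2 / n)."""
    return math.sqrt(chi2 / n)


def contingency_coefficient(chi2, n):
    """C = sqrt(chi2 / (n + chi2))."""
    return math.sqrt(chi2 / (n + chi2))


# Invented counts: rows are benign and malware, columns are two instructions.
table = [[30, 10],
         [10, 30]]
chi2, n = chi_square(table)
print(chi2, n)                                     # 20.0 80
print(phi(chi2, n))                                # 0.5
print(round(contingency_coefficient(chi2, n), 4))  # 0.4472
```

The last value also equals sqrt(phi^2 / (1 + phi^2)) = sqrt(0.25 / 1.25), which matches the identity relating C to phi squared.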
Table 9 shows the contingency coefficient for each ratio block. When the benign instruction ratio is over 1%, the contingency coefficient lies between 0.32 and 0.49. As the instruction ratio gets smaller, the contingency coefficient gets bigger; when the benign instruction ratio is between 0.01% and 0.001%, the contingency coefficient rises to between 0.81 and 0.97, about two times as large. This supports the result of the previous subsection: as the benign instruction ratio gets smaller, the ratio gap between malware and benign instructions gets bigger. It is also strong statistical evidence that instructions that are rare in benign software are stronger predictors for detecting malware.

Table 9. Compare of contingency coefficient

Ratio           Virus   Worm    Backdoor  Dos     Virtool  Trojan  Rootkit
-1%             0.3503  0.3608  0.3984    0.3219  0.4143   0.3824  0.4879
1% - 0.1%       0.3721  0.4860  0.6504    0.7233  0.7154   0.4771  0.7271
0.1% - 0.01%    0.7954  0.6248  0.8012    0.9326  0.8972   0.6719  0.8794
0.01% - 0.001%  0.8524  0.8149  0.9151    0.9578  0.9247   0.8845  0.9714

4.4

Some Specific statistic data in Malware

Many instructions are used only in malware. The total number of unique instructions is about 1% of all instructions. Table 10 shows, for each malware class, the share of unique instructions and the number of unique instruction types.
About 0.3 - 2.1% of all instructions are unique instructions, which are extracted only from malware executable files. Also, half of all instruction types do not exist in the benign samples. The instruction diversity of malware is very high compared with benign software. Among the unique instructions, some kinds account for a high proportion. First, instructions related to bit tests (e.g. BTS, BT) are widely used. Bit test instructions such as BTS can be used with a lock prefix so that they execute atomically, which relates to multi-threaded environments. Second, double-precision floating-point instructions (e.g. MULSD, ADDSD, MULPD, SUBSD) make up a large proportion of the malware-unique instructions. Next, instructions related to packing and unpacking (e.g. UNPCKLPS, PUNPCKLBW, PUNPCKHBW) also make up a high proportion of the malware-unique instructions. Figure 6 shows the operation of some unique instructions. Unique instructions related to shuffle and exchange include XADD and PSHUFW. The XADD instruction exchanges the first operand with the second operand and then loads the sum of the two values into the destination operand; the PSHUFW instruction shuffles the words of the source operand according to the encoding in the immediate operand and stores the result in the destination operand.

[Figure 6: operation diagrams of the UNPCKLPS instruction and of the PUNPCKLBW and PUNPCKHBW instructions, showing how the X elements of DEST and the Y elements of SRC are interleaved into DEST]
Figure 6. Unique instructions operation
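Finding the instructions that occur only in malware amounts to a set difference between two frequency tables. The following sketch shows one way to compute the unique instruction types and their share of all malware instructions; the function names and the counts are hypothetical and serve only as an example.

```python
from collections import Counter


def unique_to_malware(malware_counts, benign_counts):
    """Keep only the instruction types that never occur in the benign set."""
    return Counter({token: count
                    for token, count in malware_counts.items()
                    if token not in benign_counts})


def unique_share_percent(malware_counts, benign_counts):
    """Share of malware instructions that belong to malware-only types."""
    unique = unique_to_malware(malware_counts, benign_counts)
    return 100.0 * sum(unique.values()) / sum(malware_counts.values())


# Invented frequency tables for a benign set and one malware class.
benign = Counter({"PUSH[Register]": 50, "MOV[Register,Register]": 30})
malware = Counter({
    "PUSH[Register]": 60,
    "MOV[Register,Register]": 36,
    "XADD[Register,Register]": 3,
    "BTS[Register,Register]": 1,
})

unique = unique_to_malware(malware, benign)
print(len(unique))                            # 2 unique instruction types
print(unique.most_common(1))                  # [('XADD[Register,Register]', 3)]
print(unique_share_percent(malware, benign))  # 4.0
```

Applying `most_common(5)` to the result for each class would give a ranking of the kind listed in Table 10.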

5

FURTHER WORK

In this section, we discuss further work that can build on this research.
5.1

Make more accurate detection tool with Machine Learning

Beyond extracting statistical data, our final aim is to build a more accurate malware detection tool. There are many machine learning algorithms (e.g. Support Vector Machine, Genetic Algorithm, Decision Tree, or combinations of these). They can provide an effective method to detect malware and, furthermore, variants of malware families.

Table 10. Some Unique Instruction in each Malware classes

Class of malware  % of total unique inst.  Total type of unique inst.  Top 5 unique inst.
Virus             1.452                    338                         XADD[Register,Register], BTS[Register,Register], MULSD[Register,Register],
                                                                       ADDSD[Register,Register], MULPD[Register,Register]
Worm              0.501                    307                         BT[AbsoluteMemoryAddress,Register], FLDCW[AbsoluteMemoryAddress],
                                                                       CMOVNS[Register,AbsoluteMemory], BTS[Register,Register],
                                                                       FIMUL[AbsoluteMemoryAddress]
Backdoor          0.6667                   351                         PCMPGTD[Register,Register], WRMSR[], PSUBB[Register,AbsoluteMemory],
                                                                       UNPCKLPS[Register,AbsoluteMemory], FLDCW[AbsoluteMemoryAddress]
Dos               0.3608                   313                         BT[AbsoluteMemoryAddress,Register], FLDCW[AbsoluteMemoryAddress],
                                                                       CVTTPS2PI[Register,AbsoluteMemory], UNPCKLPS[Register,AbsoluteMemory], WRMSR[]
Virtool           1.358                    447                         PSHUFW[Register,AbsoluteMemory,Immediate], PUNPCKHBW[Register,AbsoluteMemory],
                                                                       ADDPS[Register,AbsoluteMemory], SUBPS[Register,AbsoluteMemory],
                                                                       PUNPCKLBW[Register,AbsoluteMemory]
Trojan            0.9567                   388                         MOVMSKPS[Register,Register], BT[AbsoluteMemoryAddress,Register],
                                                                       FLDCW[AbsoluteMemoryAddress], CMOVNS[Register,AbsoluteMemory],
                                                                       FIMUL[AbsoluteMemoryAddress]
Rootkit           2.135                    442                         LSL[Register,AbsoluteMemory], UD2[], FIMUL[AbsoluteMemoryAddress], WBINVD[],
                                                                       MAXPS[Register,AbsoluteMemory]

5.2

Expanded to N-gram instruction

During this research, we found some unusual instruction sequences in malware code, as shown in Figure 7. These deserve deeper research. As can be seen, some instructions have a contextual relationship as sequences of two or more instructions. Expanding the analysis to instruction N-grams and extracting more specific statistical data would provide more solid statistical data about malware.
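The proposed extension to instruction N-grams could start from a sliding window over the instruction tokens. The sketch below is only an illustration of that idea, not part of the reported experiments: the function name is hypothetical, and the token sequence is a toy example of a repeated compare-and-jump pair.

```python
from collections import Counter


def instruction_ngrams(tokens, n):
    """Count every run of n consecutive instruction tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Toy sequence: a compare-and-jump pair repeated three times.
tokens = ["CMP[AbsoluteMemory,Immediate]", "JZ[Immediate]"] * 3

bigrams = instruction_ngrams(tokens, 2)
for gram, count in bigrams.most_common():
    print(count, " -> ".join(gram))
```

A repeated pattern shows up as an N-gram whose count is far higher than that of the other N-grams in the same file, which is the kind of context that single-instruction frequencies cannot capture.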
[Figure 7: repeating instruction sequences found in malware samples]
Virus Win Kdar: CMP[AbsoluteMemory,Immediate], JZ[Immediate] (repeated 6 times)
Worm Win Deffective: OUTS[Register,AbsoluteMemory], INC[Register], ADD[Register,Register] alternating with OUTS[Register,AbsoluteMemory], INC[Register], ADD[AbsoluteMemory,Register] (4 groups)
Trojan Win ICQSpoof: PUSH[Immediate], PUSH[Immediate], CALL[Immediate], MOV[AbsoluteMemory,Register] (repeated 3 times)
Virus Win Emar, Trojan Win Adex: JMP[AbsoluteMemoryAddress], MOV[Register,Register] (repeated 6 times)
Worm Win Onver: ADD[AbsoluteMemory,Register], INC[Register], INC[Register] (repeated 4 times)
Rootkit Win Ressdtbx: SBB[Register,Immediate] followed by ADD[AbsoluteMemory,Register] or ADD[Register,Register] (repeated 6 times)
DoS Win VBcv: MOV[Register,Immediate], CMP[Register,Immediate], MOV[Register,Immediate], PUSH[Immediate], RET[] (repeated 4 times)
Backdoor Win Floboa: ADD[Register,Register] or ADD[AbsoluteMemory,Register], DEC[Register], INC[Register] (repeated 4 times)
Figure 7. Some sequence instruction pattern in malware

Recently, APT (Advanced Persistent Threat) attacks, which intelligently and secretly collect and disclose information from specific targets with clear objectives, have been increasing, and papers on detecting APT attacks are a growing trend [13]. The approach in this paper can be grafted onto other detection methods to increase the probability of detection.
6

CONCLUSION

We analysed benign and malware instructions in 4 parts in Section 4. As a result, there are some points worth noticing. First, the instruction types of malware are more detailed and more varied than those of benign software. Second, the high-ranked instructions of malware and benign software do not differ much. However, as the benign instruction ratio gets smaller, the contrast with the malware instructions becomes greater. Comparing contingency coefficient values, the association between instruction and malware class gets stronger as the ratio gets lower. Furthermore, the contingency coefficient can be interpreted as the strength of association without reference to other factors, so rare instructions can be good predictors for distinguishing between benign software and malware. Finally, some special instructions, such as double-precision floating-point instructions and packed or unpacked instructions, are used only in malware. These can also be important characteristics when detecting malware.
ACKNOWLEDGMENT
This work was supported by Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (No.R0126-15-1111, The Development of Risk-based Authentication Access Control Platform and Compliance Technique for Cloud Security)
REFERENCES

[1] Malware definition. techterms.com. Retrieved 30 April 2015.
[2] CryptoWall Ransomware. http://www.secureworks.com/cyber-threatintelligence/threats/cryptowall-ransomware/. Retrieved 30 April 2015.
[3] McAfee Labs Threats Report. 2014. Intel Security.
[4] Li, Wei-Jen. Fileprints: Identifying file types by n-gram analysis. Information Assurance Workshop, 2005. IAW05. Proceedings from the Sixth Annual IEEE SMC. IEEE, 2005.
[5] Weber. A toolkit for detecting and analyzing malicious software. Computer Security Applications Conference, 2002. Proceedings. 18th Annual. IEEE, 2002.
[6] Bilar, Daniel. Opcodes as predictor for malware. International Journal of Electronic Security and Digital Forensics 1.2 (2007): 156-168.
[7] Santos, Igor. Idea: Opcode-sequence-based malware detection. Engineering Secure Software and Systems. Springer Berlin Heidelberg, 2010. 35-43.
[8] Shabtai, Asaf. Detecting unknown malicious code by applying classification techniques on opcode patterns. Security Informatics 1.1 (2012): 1-22.
[9] Moskovitch, Robert. Unknown malcode detection using OPCODE representation. Intelligence and Security Informatics. Springer Berlin Heidelberg, 2008. 204-215.
[10] Santos, Igor. Opcode sequences as representation of executables for data-mining-based unknown malware detection. Information Sciences 231 (2013): 64-82.
[11] Nissim, Nir. Novel active learning methods for enhanced PC malware detection in windows OS. Expert Systems With Applications 41.13 (2014): 5843-5857.
[12] O'Kane, Philip, Sakir Sezer, and Kieran McLaughlin. N-gram density based malware detection. Computer Applications & Research (WSCAR), 2014 World Symposium on. IEEE, 2014.
[13] Kyungho Son. Design for Zombie PCs and APT Attack Detection based on traffic analysis. Journal of The Korea Institute of Information Security & Cryptology 24.3 (2014): 491-498.