Proceedings of The Fourth International Conference on Informatics & Applications, Takamatsu, Japan, 2015
Statistical Analysis Between Malware and Benign Based on IA-32 Instruction
Dongwoo Kang, Donghoon Lee, Jaewook Jung, and Dongho Won
College of Information and Communication Engineering
Sungkyunkwan University
2066 Seobu-ro, Jangan-gu, Suwon-si, Gyeongki-do, 440-746, Korea
{dwkang, dhlee, jwjung, dhwon}@security.re.kr
ABSTRACT
Malicious software is one of the most serious threats in the information society. As malicious software has evolved, techniques for detecting it have advanced as well. Statistical data about existing malicious software is essential for detecting new malicious software. Statistical studies of malicious software have so far focused mainly on the opcode, which is only one part of the whole instruction. This paper analyses statistical data that considers the whole instruction: not only the opcode but also 5 types of operands. We find that the majority of instructions in both benign and malicious software are related to function calls, and that these instructions cannot serve as good predictors for detecting malicious software. However, as an instruction's frequency in benign software gets smaller, its association with the malware classes grows stronger. This paper also identifies some instructions that are used only in malicious software.
KEYWORDS
Malware, Statistical, Instruction, computer security,
Unique Instruction
1 INTRODUCTION
Malware (malicious software) is any software used to disrupt computer operation, gather sensitive information, or gain access to private computer systems [1]. The term refers collectively to executable code created for malicious purposes, and malware is classified into viruses, worms, and trojans according to its ability to replicate and to infect a target. A virus is a set of instructions that copies itself, or a modified version of itself, into a program or executable region. A worm is a set of instructions that copies itself, or a modified version of itself, without infecting other programs. A worm exists as an executable file or as code in memory and, when executed, copies that file or code from one system to another. A trojan appears to be a benign program but does damage once installed or run. Those on the receiving end of a trojan are usually tricked into opening it because it appears to be legitimate software or a file from a legitimate source. Recently, ransomware such as CryptoWall, which covertly encrypts documents and spreadsheets on an Internet user's computer and then offers the decryption program in exchange for money, has spread domestically and worldwide. According to a Dell SecureWorks report, this malware infected 625,000 computers in 5 months, encrypted 1.24 billion files, and caused damage costing $1,101,900 [2].
In this way, the symptoms and propagation of malicious code are becoming more complicated and intelligent. Inevitably, existing anti-virus programs need to be expanded into integrated security programs that diagnose and treat a variety of malware. According to the McAfee Labs Threats Report, the total number of malware samples has increased roughly linearly every quarter since 2012, and the number of new malware samples has also increased linearly. There are over 307 new threats every minute, or more than 5 every second. The total number of malware samples has increased about 3 times compared with 2 years ago, while the number of new malware samples has increased about 5 times [3].
ISBN: 978-1-941968-16-1 2015 SDIWC
Accordingly, as the number of malware samples grows day by day, shortening analysis time by classifying them is the most important issue in detection. Techniques that use similarity measurement and artificial intelligence to spot new malware have been proposed. In this paper, we examine the opcode analysis methods that have served as the main source of statistical data on existing malicious code, and we attempt a more detailed statistical analysis of malware using IA-32 instructions.
This paper analyses statistical differences between malware and benign software. Section 2 reviews related work on malware analysis and the limitations of opcode-based statistical analysis. Sections 3 and 4 explain how we collect the samples and extract instructions, and present the results of the statistical analysis. Sections 5 and 6 discuss the findings, future research, and potential developments.
2 RELATED WORK
Signature-based detection is currently the most widely used method for detecting malware. It depends on a malware database that keeps growing, and as the database grows its accuracy for known samples remains reliable. However, as mentioned in the introduction, the number of new malware samples is on the rise, so the detection rate of signature-based detection is not always good. It is important to analyse statistical data on existing malicious code in order to detect new malicious code. Accordingly, research to find clearer statistical differences that distinguish benign software from malware has continued.
Li et al. (2005) [4] performed 1-gram analysis at the binary (byte) level rather than the opcode level and proposed fileprints for file type identification. However, the approach is not very effective when the malicious code is not located at the beginning or end of the file.
Weber et al. (2002) [5] implemented and provided a tool, PEAT (Portable Executable Analysis Toolkit), which reports instruction frequencies, instruction patterns, register offsets, jump and call offsets, entropy of opcode values, and code and ASCII probabilities.
Bilar (2007) [6] focused on the opcode as a predictor and verified its validity. He analysed 14 commonly used opcodes and 14 rarely used opcodes. The result was that rarely used opcodes are more effective predictors than commonly used ones.
Santos et al. (2010) [7] proposed a new technique to detect variants of known malware families. They extracted opcode sequences from benign and malware executables and used cosine similarity to evaluate the resemblance between benign software and malware, and further between malware families. The system was able to identify malware variants and also to distinguish benign executables.
Shabtai et al. (2012) [8] examined the effectiveness of malware detection with different N-gram sizes. They showed that the 2-gram, which treats 2 consecutive opcodes as one unit, performed best. This result agrees with the conclusion of Moskovitch et al. (2008) [9].
Santos et al. (2013) [10] proposed a new technique to detect unknown malware families based on the frequency of appearance of opcode sequences, using decision trees, SVM, kNN, and Bayesian networks.
Nissim et al. (2014) [11] used statistical data on byte N-grams to present an active learning (AL) framework and introduced two new AL methods that help anti-virus vendors focus their analytical efforts by acquiring the files that are most probably malicious.
O'Kane et al. (2014) [12] investigated the full spectrum of opcodes that could be used to detect malware. They used a novel combination of feature filtering and feature selection to investigate detection ability with different N-grams, and N = 3 and N = 4 showed the best detection rates.
Thus, research so far has focused on the byte level or on opcode sequences (see Figure 1). However, considering only the opcode and excluding the operands of an instruction can produce statistical distortion. Even when the opcodes are the same, there are many types of operands, and they can be as different as chalk and cheese (for example, mov eax, [ebx] and mov eax, Imm).
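The distortion described above can be illustrated with a minimal Python sketch. The decoded instruction tuples and the helper name instruction_key are hypothetical and chosen only for illustration; they are not taken from the authors' tooling.

```python
from collections import Counter

# Hypothetical decoded instructions: (mnemonic, operand types).
decoded = [
    ("MOV", ("Register", "AbsoluteMemory")),  # e.g. mov eax, [ebx]
    ("MOV", ("Register", "Immediate")),       # e.g. mov eax, 1
    ("MOV", ("Register", "Register")),        # e.g. mov ebx, eax
    ("PUSH", ("Register",)),                  # e.g. push ebx
]


def instruction_key(mnemonic, operand_types):
    """Build a whole-instruction key such as 'MOV[Register,Immediate]'."""
    return f"{mnemonic}[{','.join(operand_types)}]"


# Opcode-only counting merges all three MOV forms into one bucket.
opcode_only = Counter(mnemonic for mnemonic, _ in decoded)

# Whole-instruction counting keeps the operand types apart.
whole_instruction = Counter(instruction_key(m, ops) for m, ops in decoded)

print(opcode_only)
print(whole_instruction)
```

With opcode-only keys the three MOV forms are indistinguishable, while the whole-instruction keys keep them as separate features.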
Therefore, this study uses the instruction prefixes, opcode, ModR/M, and SIB fields of the IA-32 instruction format. Compared with previous research, this has the advantage of considering not only the opcode but also the operands at the same time.

Opcode  Operand       Opcode only
push    ebx           push
push    esi           push
mov     esi, edx      mov
mov     ebx, eax      mov
call    SUB_L         call
test    eax, eax      test
jnz     L             jnz
xor     eax, eax      xor
pop     esi           pop
pop     ebx           pop
retn                  retn
Figure 1. Limitation of research so far

3 COLLECT INSTRUCTIONS
In this section, we describe how we collect the benign and malware sample executables and how we match machine language to instructions.
3.1 Benign Collect
For the benign set, we extract the default executable files of Windows 7 Ultimate K. The set consists of 200 executable files. To balance file sizes, we divide program size into three intervals: (1-10KB), (10-100KB), and (100KB-1000KB (=1MB)). Every executable file falls into one of these size blocks. The share of files in each size block is 35% (70 files), 45% (90 files), and 20% (40 files), respectively.
3.2 Malware Collect
For the malware set, we classify malware into 7 classes (backdoor, DoS, rootkit, trojan, virtool, virus, worm). As with the benign set, each class consists of 200 Win32 executable files: 35% (70 files) of (1-10KB), 45% (90 files) of (10-100KB), and 20% (40 files) of (100KB-1000KB (=1MB)). In addition, to increase variety, each malware class has at least 10 variants. Table 1 shows the classes and their variants. All of the malware files come from the vxheavens database.
Table 1. Variants of each malware class
Name      Variants
Backdoor  Angelfire, Banito, CrashCool, Danton, Frenzy, Infexor, Izram, Jinmoze, Lantinus, MoSucker
DoS       ARPKiller, Agent, Aleph, Chalcol, DBomb, DK-ToyBox, Delf, Hucsyn, Jman, Kod, Lanxue
Rootkit   DarkShell, Fuzen, Mag, Namana, Pakes, Podnuha, Qandr, Ressdt, TDSS, Vanti
Trojan    Aditer, Agent, Alcalup, Fatoos, Filco, Mista, Monderd, Spabot, Srizbi, Whispy
Virtool   Ainder, BlindEye, Cicho, Delf, Facker, Jointer, Krepus, LdPinch, Scramble, Voodoo
Virus     Afgan, Belial, Bolzano, Cruck, Deemo, Enumiacs, Henky, Krepper, Kuto, Neshta
Worm      Busan, Bymer, Donk, Fasong, Fesber, Mobler, Petik, Sever, Sorin, Wogue
3.3 Instruction Breakdown
We use the Python package diStorm3 to decode x86 binary files and analyse their structure. We analyse only the text section of each executable file, which holds the instructions that the program actually executes. Each file has a DOS header, stub code, a PE header, a section header for each image section, and the image sections themselves. There are 4 types of section: text, data, resource, and relocation (see Figure 2). The text section, which we focus on, holds the program code and is mapped as execute/read-only. To extract only the text section, we use the text section header. Table 2 shows the important members of a section header.

Figure 2. PE format (side-by-side layout of the image in FILE, indexed by offset, and in MEMORY, indexed by address: DOS Header, Stub Code, PE Header with File Header and Optional Header, Section Headers .text, .data, .rsrc, .reloc, then Sections .text, .data, .rsrc, .reloc separated by padding)

Table 2. Important Member of Sections Header structure
Name              Description
VirtualSize       Actual size of the code or data
VirtualAddress    RVA to where the loader should map the section
SizeOfRawData     The size of the section after it has been rounded up to the file alignment size
PointerToRawData  File-based offset of where the raw data emitted by the compiler or assembler can be found
Characteristics   Set of flags that indicate the section's attributes (such as code/data, readable, or writeable)

We check each file's characteristics to determine whether it is an executable file. If it is, we use its text section header to map the section's RVA to RAW. First we find the section that contains the RVA, and then we calculate the file offset (RAW), which is given by:
RAW = RVA - VirtualAddress + PointerToRawData.
We already know the section's size from the section header, so we can extract the text section from the whole file. After extracting the text section from the executable file, we translate each machine-language sequence into an instruction using the IA-32 instruction format. An IA-32 instruction consists of 6 fields (see Figure 3). The opcode is the only mandatory field; the others are optional. Figure 4 shows a simple translation of machine language into an IA-32 instruction. Next, we classify each operand into one of five types. Table 3 shows the operand types with brief explanations.

Field                 Size and content
Instruction Prefixes  Up to four prefixes of 1 byte each (optional)
Opcode                1-, 2-, or 3-byte opcode
ModR/M                1 byte (if required); subfields Mod, Reg/Opcode, R/M
SIB                   1 byte (if required); subfields Scale, Index, Base
Displacement          Address displacement of 1, 2, or 4 bytes or none
Immediate             Immediate data of 1, 2, or 4 bytes or none
Figure 3. IA-32 Instruction Format

Table 3. Types of operand
Operand Type             Description
Immediate                The constant value
Register                 The value held in the register
Absolute Memory          The address is calculated using a register expression
Absolute Memory Address  The address calculated is absolute
Far Memory               Like absolute, but with a selector/segment specified too

Finally, we regard each instruction as a single word and calculate each instruction's frequency with the Python modules collections and operator. The datasets are then normalized in Microsoft Excel 2010. All of this work was executed on a Microsoft Windows 7 Ultimate K, 64-bit operating system.
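The RVA-to-RAW mapping of Section 3.3 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the SectionHeader type, the rva_to_raw helper, and the section values are hypothetical, and the containment check assumes that VirtualSize bounds each section in memory.

```python
from typing import NamedTuple, Optional, Sequence


class SectionHeader(NamedTuple):
    """Subset of the section header members listed in Table 2."""
    name: str
    virtual_address: int      # VirtualAddress (RVA of the section)
    virtual_size: int         # VirtualSize
    pointer_to_raw_data: int  # PointerToRawData (file offset)


def rva_to_raw(rva: int, sections: Sequence[SectionHeader]) -> Optional[int]:
    """Find the section containing rva and convert it to a file offset."""
    for sec in sections:
        start = sec.virtual_address
        if start <= rva < start + sec.virtual_size:
            # RAW = RVA - VirtualAddress + PointerToRawData
            return rva - sec.virtual_address + sec.pointer_to_raw_data
    return None  # rva is not inside any known section


# Hypothetical section table, for illustration only.
sections = [
    SectionHeader(".text", 0x1000, 0x2E00, 0x400),
    SectionHeader(".data", 0x4000, 0x0500, 0x3400),
]

print(hex(rva_to_raw(0x1234, sections)))  # 0x1234 - 0x1000 + 0x400
```

Returning None for an RVA outside every section makes a malformed header visible to the caller instead of producing a bogus offset.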
4 RESULT & ANALYSIS
In this section, we describe the statistical data on benign software and malware and analyse it.
4.1 Instruction Frequency
First, we describe the instruction frequencies of the malware and benign sets, that is, which instructions make up a large proportion of the programs.
4.1.1 Benign
The benign samples included about 4,040,120 instructions, and 594 different instruction types were found. 64 instruction types (about 11% of the types) accounted for 90% of all instructions. The top 10 instruction types make up 50% of all instructions, and the top 20 instruction types make up 70%.
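Coverage figures of this kind can be derived from a frequency table with the collections module mentioned in Section 3.3. The sketch below is an assumption about how such numbers might be computed; the function names and the toy counts are hypothetical and do not reproduce the paper's data.

```python
from collections import Counter


def top_k_coverage(counts: Counter, k: int) -> float:
    """Share of all instructions covered by the k most frequent types."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(c for _, c in counts.most_common(k)) / total


def types_needed(counts: Counter, share: float) -> int:
    """Smallest number of top-ranked types whose counts reach `share`."""
    total = sum(counts.values())
    running = 0
    for n, (_, c) in enumerate(counts.most_common(), start=1):
        running += c
        if running >= share * total:
            return n
    return len(counts)


# Toy frequency table, for illustration only.
counts = Counter({
    "PUSH[Register]": 50,
    "MOV[Register,Register]": 30,
    "CALL[Immediate]": 15,
    "XADD[Register,Register]": 5,
})

print(top_k_coverage(counts, 2))
print(types_needed(counts, 0.9))
```

In this toy table the top 2 types cover 80 of 100 instructions, and 3 types are needed to reach 90%.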
Figure 4. Translate machine language to IA-32 instructions (the opcode's operand types [G] and [E] are resolved through the [ModR/M] byte, yielding MOV ECX, EAX)
4.1.2 Malware
As with the benign set, we count the total instructions of the malware set. The malware samples included about 31,264,571 instructions and 1,030 different instruction types across the 7 classes. Table 4 shows each class's number of instructions and number of instruction types. 96 instruction types (about 9% of the types) accounted for 90% of all instructions. The top 10 instruction types make up 48% of all instructions, and the top 20 instruction types make up 63%. Table 5 shows the 10 most frequent benign instructions and compares them with each malware class.
We see that the highest-ranked instructions in benign software and malware are similar. Furthermore, the highest-ranked instructions are related to function calls. Whenever a function is called, the push[register], move[register, register], pop[register], move[register, register], and retn instructions are executed because of the stack frame (see Figure 5).
Table 4. Statistic about Malware instruction
Classes   Total number of inst.  Number of inst. type
Virus     3708544                905
Worm      4805446                865
Backdoor  4068544                913
Dos       4459540                873
Virtool   4835244                1013
Trojan    4695292                949
Rootkit   4691961                1004
Table 5. Top 10 instructions for Benign
Columns: Instruction, Benign (%), Virus (%), Worm (%), Backdoor (%), Dos (%), Virtool (%), Trojan (%), Rootkit (%). The number in parentheses is the rank of the instruction within each malware class. The numeric cells are not recoverable from the source; the ten instructions, in benign rank order, are:
1. ADD[AbsoluteMemory,Register]
2. PUSH[Register]
3. MOV[Register,AbsoluteMemory]
4. INT 3[]
5. CALL[Immediate]
6. POP[Register]
7. MOV[Register,Register]
8. JZ[Immediate]
9. PUSH[Immediate]
10. MOV[AbsoluteMemory,Register]

Figure 5. Function Call (a caller's call instruction enters a callee whose prologue, push ebp and mov ebp, esp, and epilogue, mov esp, ebp, pop ebp, and retn, correspond to push[register], move[register, register], pop[register], and retn[Immediate] or [MemoryAddress])

4.2 Malware Instruction Ratio
Because there are so many instructions, we design 4 instruction ratio blocks based on the benign instruction frequency: [-1%], [1% - 0.1%], [0.1% - 0.01%], and [0.01% - 0.001%]. Instructions with a ratio under 0.001% are excluded because they are too few to analyse and could distort the statistical data. The blocks contain 22, 55, 104, and 102 instructions, respectively.
We assume that if there is no association between the malware classes and an instruction, the benign instruction ratio also applies to malware. We therefore compute each instruction's share of the total instructions in each malware class and compare the percentage gap in each ratio block. We divide the gap into 3 categories: Similar, High or Low, and Very High or Very Low. Similar means that the malware ratio is between 0.5 and 2 times the benign ratio. High or Low means that it is between 2 and 5 times (High) or between 0.2 (=1/5) and 0.5 times (Low). Very High or Very Low means that it is over 5 times or under 0.2 times. We then calculate the share of instructions that falls into each category.
Tables 6-8 show the results. When the benign instruction ratio is high ([-1%]), about 27.27% to 90.91% of malware instructions have a similar ratio, and few or no instructions are very high or very low compared with the benign ratio (0% - 27.27%). However, as the benign instruction ratio decreases, the gap between benign and malware increases. When the instruction ratio is between 0.01% and 0.001%, only about 11.76% to 35.29% of malware instructions have a similar ratio, and about 23.53% to 69.60% have a very high or very low ratio compared with the benign ratio.
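The block assignment and gap categories of Section 4.2 can be written down as a short Python sketch. The function names are hypothetical, ratios are assumed to be fractions (0.01 means 1%), and the treatment of the exact boundary values (0.2, 0.5, 2, 5) is an assumption, since the text does not say which side they belong to.

```python
def ratio_block(benign_ratio: float):
    """Return the ratio block of a benign frequency, or None if excluded."""
    if benign_ratio >= 1e-2:
        return "[-1%]"
    if benign_ratio >= 1e-3:
        return "[1% - 0.1%]"
    if benign_ratio >= 1e-4:
        return "[0.1% - 0.01%]"
    if benign_ratio >= 1e-5:
        return "[0.01% - 0.001%]"
    return None  # under 0.001%: too rare, excluded from the analysis


def classify_gap(benign_ratio: float, malware_ratio: float) -> str:
    """Categorise how far the malware ratio is from the benign ratio."""
    gap = malware_ratio / benign_ratio  # benign_ratio > 0 inside any block
    if 0.5 <= gap <= 2:
        return "Similar"
    if 2 < gap <= 5:
        return "High"
    if 0.2 <= gap < 0.5:
        return "Low"
    return "Very High" if gap > 5 else "Very Low"


print(ratio_block(0.005), classify_gap(0.004, 0.012))
```

Counting the categories per block and per malware class would then give shares of the kind reported in Tables 6-8.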
4.3 Contingency test of Instruction
A more specific measure of association, phi, adjusts the chi-square statistic by the sample size. Phi is most easily defined as:
\[ \phi = \sqrt{\frac{\chi^2}{n}} \]
Sometimes phi squared is used as a measure of association, and phi squared is defined as:
\[ \phi^2 = \frac{\chi^2}{n} \]
Since phi is usually less than one, and the square of a number less than one is an even smaller number, phi squared is usually smaller than phi.
Table 6. Percentage of Similar ratio between malware and benign (columns: Ratio, Virus, Worm, Backdoor, Dos, VirTool, Trojan, Rootkit; one row per ratio block)
Table 7. Percentage of High or Low ratio between malware and benign (columns: Ratio, Virus, Worm, Backdoor, Dos, VirTool, Trojan, Rootkit; one row per ratio block)
Table 8. Percentage of Very high or Very low ratio between malware and benign (columns: Ratio, Virus, Worm, Backdoor, Dos, VirTool, Trojan, Rootkit; one row per ratio block)
A slightly different measure of association is the contingency coefficient. This is another chi-square-based measure of association, and one that also adjusts for different sample sizes, as in this situation. The contingency coefficient can be defined as:
\[ C = \sqrt{\frac{\chi^2}{n + \chi^2}} \]
Since \phi^2 = \chi^2 / n, it is straightforward to show that:
\[ C = \sqrt{\frac{\phi^2}{1 + \phi^2}} \]
When there is no association between two variables, the contingency coefficient is zero, and as the strength of association increases, the contingency coefficient approaches one.
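As a worked illustration of these measures, the following Python sketch computes the chi-square statistic of a contingency table and derives phi and the contingency coefficient from it. The function names and the 2x2 table are hypothetical, the paper does not state how its tables were laid out, and the sketch assumes every row and column total is positive.

```python
import math


def chi_square(table):
    """Pearson chi-square statistic and sample size of a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n


def phi(chi2, n):
    """phi = sqrt(chi^2 / n)"""
    return math.sqrt(chi2 / n)


def contingency_coefficient(chi2, n):
    """C = sqrt(chi^2 / (n + chi^2))"""
    return math.sqrt(chi2 / (n + chi2))


# Hypothetical 2x2 table, e.g. rows = benign/malware samples and
# columns = instruction present/absent.
table = [[10, 20], [20, 10]]
chi2, n = chi_square(table)
print(chi2, phi(chi2, n), contingency_coefficient(chi2, n))
```

For this table every expected count is 15, so chi-square is 4 x 25/15, phi is 1/3, and C is the square root of 0.1, which agrees with C = sqrt(phi^2 / (1 + phi^2)).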
Table 9 shows the contingency coefficient for each ratio block. When the benign instruction ratio is over 1%, the contingency coefficient ranges from 0.32 to 0.49. As the instruction ratio gets smaller, the contingency coefficient gets bigger; when the benign instruction ratio is between 0.01% and 0.001%, the contingency coefficient rises to between 0.81 and 0.97, about two times higher. This supports the result of the previous subsection: as the benign instruction ratio gets smaller, the ratio gap between malware and benign instructions gets bigger. It is strong statistical evidence that an instruction that is rare in benign software is a stronger predictor for detecting malware.
Table 9. Compare of contingency coefficient
Ratio           Virus   Worm    Backdoor  Dos     Virtool  Trojan  Rootkit
-1%             0.3503  0.3608  0.3984    0.3219  0.4143   0.3824  0.4879
1% - 0.1%       0.3721  0.4860  0.6504    0.7233  0.7154   0.4771  0.7271
0.1% - 0.01%    0.7954  0.6248  0.8012    0.9326  0.8972   0.6719  0.8794
0.01% - 0.001%  0.8524  0.8149  0.9151    0.9578  0.9247   0.8845  0.9714
4.4 Some Specific statistic data in Malware
Many instructions are used only in malware. The total number of unique instructions is about 1% of all instructions. Table 10 shows, for each malware class, the total number of unique instructions and the number of unique instruction types.
About 0.3 - 2.1% of the instructions in each class are unique instructions that are extracted only from malware executable files. Also, about half of all instruction types do not exist in the benign samples. The instruction diversity of malware is very high compared with benign software. Among the unique instructions, several groups account for a high proportion. First, instructions related to bit test (e.g. BTS, BT) are widely used. A bit test-and-set instruction such as BTS can be used with a LOCK prefix so that it executes atomically, which relates it to multi-threaded environments. Second, double-precision floating-point instructions (e.g. MULSD, ADDSD, MULPD, SUBSD) comprise a large proportion of the malware-unique instructions. Next, packed and unpacked instructions (e.g. UNPCKLPS, PUNPCKLBW, PUNPCKHBW) also make up a high proportion of the malware-unique instructions. Figure 6 shows the operation of some unique instructions. There are also unique instructions related to shuffle and exchange, including XADD and PSHUFW. The XADD instruction exchanges the first operand with the second operand and then loads the sum of the two values into the destination operand, and the PSHUFW instruction shuffles the words according to the encoding in the immediate operand and stores the result.
Figure 6. Unique instructions operation (panels: UNPCKLPS instruction operation; PUNPCKLBW and PUNPCKHBW instruction operation; each panel shows how the SRC elements Y and the DEST elements X are interleaved into DEST)
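The notion of a malware-unique instruction in Section 4.4 amounts to a set difference between two frequency tables. The sketch below is a minimal illustration under that reading; the function name and the toy counts are hypothetical and are not the paper's data.

```python
from collections import Counter


def unique_to_malware(malware: Counter, benign: Counter):
    """Return the instructions seen only in malware and their share."""
    unique = Counter({key: count for key, count in malware.items()
                      if key not in benign})
    total = sum(malware.values())
    share = sum(unique.values()) / total if total else 0.0
    return unique, share


# Toy frequency tables, for illustration only.
malware = Counter({
    "PUSH[Register]": 90,
    "XADD[Register,Register]": 6,
    "WRMSR[]": 4,
})
benign = Counter({"PUSH[Register]": 100, "POP[Register]": 80})

unique, share = unique_to_malware(malware, benign)
print(unique.most_common(5), share)
```

Here len(unique) plays the role of the "total type of unique inst." column of Table 10 and share the role of its "% of total unique inst." column.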
Table 10. Some Unique Instruction in each Malware classes
Class of malware  % of total unique inst.  Total type of unique inst.  Top 5 unique inst.
Virus             1.452                    338                         XADD[Register,Register], BTS[Register,Register], MULSD[Register,Register], ADDSD[Register,Register], MULPD[Register,Register]
Worm              0.501                    307                         BT[AbsoluteMemoryAddress,Register], FLDCW[AbsoluteMemoryAddress], CMOVNS[Register,AbsoluteMemory], BTS[Register,Register], FIMUL[AbsoluteMemoryAddress]
Backdoor          0.6667                   351                         PCMPGTD[Register,Register], WRMSR[], PSUBB[Register,AbsoluteMemory], UNPCKLPS[Register,AbsoluteMemory], FLDCW[AbsoluteMemoryAddress]
Dos               0.3608                   313                         CVTTPS2PI[Register,AbsoluteMemory], UNPCKLPS[Register,AbsoluteMemory], WRMSR[]
Virtool           1.358                    447                         PSHUFW[Register,AbsoluteMemory,Immediate], PUNPCKHBW[Register,AbsoluteMemory], ADDPS[Register,AbsoluteMemory], SUBPS[Register,AbsoluteMemory], PUNPCKLBW[Register,AbsoluteMemory]
Trojan            0.9567                   388                         MOVMSKPS[Register,Register], CMOVNS[Register,AbsoluteMemory], FIMUL[AbsoluteMemoryAddress]
Rootkit           2.135                    442                         LSL[Register,AbsoluteMemory], UD2[], FIMUL[AbsoluteMemoryAddress], WBINVD[], MAXPS[Register,AbsoluteMemory]

5 FURTHER WORK
In this section, we discuss further work that builds on this research.
5.1 Make more accurate detection tool with Machine Learning
Beyond extracting statistical data, the final aim is to build a more accurate malware detection tool. There are many machine learning algorithms (e.g. Support Vector Machine, Genetic Algorithm, Decision Tree, or combinations of these) that could provide an effective method to detect malware and, furthermore, variants of malware families.
5.2 Expanded to N-gram instruction
During this research, we found some unusual instruction sequences in malware code, as shown in Figure 7. These deserve deeper research. As can be seen, some instructions have a contextual relationship as a combination of two or more instructions in sequence. Expanding the analysis to instruction-sequence N-grams and extracting more specific statistical data would provide more solid statistical data about malware.
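A sliding-window count is one simple way to extend the single-instruction statistics to N-grams. The following Python sketch assumes each instruction has already been turned into a key such as PUSH[Immediate]; the function name and the example sequence are hypothetical.

```python
from collections import Counter


def instruction_ngrams(sequence, n):
    """Count every run of n consecutive instruction keys."""
    return Counter(
        tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)
    )


# Hypothetical sequence: an alternating PUSH/RET pattern repeated 4 times.
sequence = ["PUSH[Immediate]", "RET[]"] * 4

bigrams = instruction_ngrams(sequence, 2)
print(bigrams.most_common(1))
```

A sequence shorter than n yields an empty Counter, so short code sections need no special handling.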
Virus Win Kdar: CMP[AbsoluteMemory,Immediate], then JZ[Immediate] x 6
Worm Win Deffective: OUTS[Register,AbsoluteMemory], INC[Register], ADD[Register,Register], INC[Register], ADD[AbsoluteMemory,Register], INC[Register], INC[Register]
Trojan Win ICQSpoof: PUSH[Immediate], PUSH[Immediate], CALL[Immediate], MOV[AbsoluteMemory,Register], PUSH[Immediate], PUSH[Immediate], CALL[Immediate], PUSH[Immediate], PUSH[Immediate], CALL[Immediate]
Virus Win Emar, Trojan Win Adex: JMP[AbsoluteMemoryAddress], MOV[Register,Register]
Worm Win Onver: INC[Register] x 8
Rootkit Win Ressdt bx: SBB[Register,Immediate]
DoS Win VB cv: MOV[Register,Immediate], CMP[Register,Immediate], then PUSH[Immediate], RET[] x 4
Backdoor Win Floboa: DEC[Register], INC[Register] x 4
Figure 7. Some sequence instruction pattern in malware (sample names as extracted; digits and punctuation in the names are not recoverable)

Recently, APT (Advanced Persistent Threat) attacks, which target specific victims with clear objectives and intelligently and secretly collect and disclose information, have increased, and papers on detecting APT attacks are a growing trend [13]. The approach in this paper can be grafted onto other detection methods to increase the probability of detection.
6 CONCLUSION
We analysed benign and malware instructions in 4 parts in Section 4. The results contain several noteworthy points. First, the instruction types of malware are more detailed and more varied than those of benign software. Second, the highest-ranked instructions of malware and benign software do not differ much. However, as the benign instruction ratio gets smaller, the contrast with malware instructions becomes greater. Comparing contingency coefficient values, the association gets stronger as the ratio gets lower. Furthermore, the contingency coefficient is interpreted as the strength of association without reference to other factors, so rare instructions can be good predictors for distinguishing benign software from malware. Finally, some special instructions, such as double-precision floating-point instructions and packed or unpacked instructions, are used only in malware. These can also be an important characteristic for detecting malware.
ACKNOWLEDGMENT
This work was supported by Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (No. R0126-15-1111, The Development of Risk-based Authentication Access Control Platform and Compliance Technique for Cloud Security).
REFERENCES
[1] Malware definition. techterms.com. Retrieved 30 April 2015.
[2] CryptoWall Ransomware. http://www.secureworks.com/cyber-threatintelligence/threats/cryptowall-ransomware/. Retrieved 30 April 2015.
[3] McAfee Labs Threats Report. 2014. Intel Security.
[4] Li, Wei-Jen. Fileprints: Identifying file types by n-gram analysis. Information Assurance Workshop, 2005. IAW '05. Proceedings from the Sixth Annual IEEE SMC. IEEE, 2005.
[5] Weber. A toolkit for detecting and analyzing malicious software. Computer Security Applications Conference, 2002. Proceedings. 18th Annual. IEEE, 2002.
[6] Bilar, Daniel. Opcodes as predictor for malware. International Journal of Electronic Security and Digital Forensics 1.2 (2007): 156-168.
[7] Santos, Igor. Idea: Opcode-sequence-based malware detection. Engineering Secure Software and Systems. Springer Berlin Heidelberg, 2010. 35-43.
[8] Shabtai, Asaf. Detecting unknown malicious code by applying classification techniques on opcode patterns. Security Informatics 1.1 (2012): 1-22.
[9] Moskovitch, Robert. Unknown malcode detection using OPCODE representation. Intelligence and Security Informatics. Springer Berlin Heidelberg, 2008. 204-215.
[10] Santos, Igor. Opcode sequences as representation of executables for data-mining-based unknown malware detection. Information Sciences 231 (2013): 64-82.
[11] Nissim, Nir. Novel active learning methods for enhanced PC malware detection in windows OS. Expert Systems With Applications 41.13 (2014): 5843-5857.
[12] O'Kane, Philip, Sakir Sezer, and Kieran McLaughlin. N-gram density based malware detection. Computer Applications & Research (WSCAR), 2014 World Symposium on. IEEE, 2014.
[13] Kyungho Son. Design for Zombie PCs and APT Attack Detection based on traffic analysis. Journal of The Korea Institute of Information Security & Cryptology 24.3 (2014): 491-498.