KEYWORDS
Malware, Statistical, Instruction, computer security,
Unique Instruction
INTRODUCTION
Proceedings of The Fourth International Conference on Informatics & Applications, Takamatsu, Japan, 2015
RELATED WORK
tion frequencies, instruction patterns, register offsets, jump and call offsets, entropy of opcode values, code and ASCII probabilities.
Bilar (2007) [6] focused on the opcode as a predictor and verified its validity. The study analysed 14 commonly used opcodes and 14 rarely used opcodes. The result was that rarely used opcodes are more effective predictors than commonly used opcodes.
Santos et al. (2010) [7] proposed a new technique to detect variants of known malware families. They abstracted opcode sequences from benign and malicious executables and used cosine similarity to evaluate the resemblance between benign files and malware. They further evaluated the similarity between malware families. The system was able to identify malware variants and to distinguish benign executables.
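As a minimal sketch of the cosine-similarity idea, the following compares two programs by their plain opcode counts. This is a simplification for illustration, not the weighted opcode-sequence representation of the cited work, and the opcode listings are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(ops_a, ops_b):
    """Cosine similarity between the opcode-count vectors of two programs."""
    freq_a, freq_b = Counter(ops_a), Counter(ops_b)
    # Dot product over the opcodes that occur in both programs.
    dot = sum(freq_a[op] * freq_b[op] for op in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty listing is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


# Hypothetical opcode listings, not taken from any dataset.
sample_a = ["push", "mov", "call", "test", "jnz", "pop", "retn"]
sample_b = ["push", "mov", "mov", "call", "pop", "retn"]
print(round(cosine_similarity(sample_a, sample_b), 3))
```

A value near 1 means the two listings use opcodes in similar proportions; a value of 0 means they share no opcode.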
Shabtai et al. (2012) [8] examined the effectiveness of malware detection when using different N-gram sizes. They showed that the 2-gram, which treats 2 consecutive opcodes as one unit, performed best. This result supports the conclusion of the research by Moskovitch et al. (2008) [9].
Santos et al. (2013) [10] proposed a new technique to detect unknown malware families based on the frequency of appearance of opcode sequences, using decision trees, SVM, kNN and Bayesian networks.
Nissim et al. (2014) [11] used statistical data on byte N-grams to present an active learning (AL) framework and introduced two new AL methods that assist anti-virus vendors in focusing their analytical efforts by acquiring those files that are most probably malicious.
Philip et al. (2014) [12] investigated the full spectrum of opcodes that could be used to detect malware. Using a novel combination of feature filtering and feature selection, they investigated the ability to detect malware with different N-gram sizes; the best detection rates were obtained with N = 3 and N = 4.
Thus, research so far has focused on byte-level or opcode sequences (shown in Figure 1). However, within the framework of an instruction, excluding the operand and considering only the opcode can produce a statisti-
[Figure 1: disassembly listing split into Opcode and Operand columns: push ebx; push esi; mov esi, edx; mov ebx, eax; call SUB_L...; test eax, eax; jnz L...; xor eax, eax; pop esi; pop ebx; retn]
COLLECT INSTRUCTIONS
Benign Collect
Malware Collect
tool, virus, worm). As with the benign set, each class consists of 200 Win32 executable files: 35% (70 files) of 1-10 KB, 45% (90 files) of 10-100 KB, and 20% (40 files) of 100 KB-1000 KB (= 1 MB). Also, to increase variety, each malware class has at least 10 variants. Table 1 shows each class and its variants. All of the malware files come from the vxheavens database.
Table 1. Variants of each malware class
Name | Variants
Backdoor | Angelfire, Banito, CrashCool, Danton, Frenzy, Infexor, Izram, Jinmoze, Lantinus, MoSucker
DoS | ARPKiller, Agent, Aleph, Chalcol, DBomb, DK-ToyBox, Delf, Hucsyn, Jman, Kod, Lanxue
Rootkit | DarkShell, Fuzen, Mag, Namana, Pakes, Podnuha, Qandr, Ressdt, TDSS, Vanti
Trojan | Aditer, Agent, Alcalup, Fatoos, Filco, Mista, Monderd, Spabot, Srizbi, Whispy
Virtool | Ainder, BlindEye, Cicho, Delf, Facker, Jointer, Krepus, LdPinch, Scramble, Voodoo
Virus | Afgan, Belial, Bolzano, Cruck, Deemo, Enumiacs, Henky, Krepper, Kuto, Neshta
Worm | Busan, Bymer, Donk, Fasong, Fesber, Mobler, Petik, Sever, Sorin, Wogue
3.3 Instruction Breakdown
We use a Python package named diStorm3 to decode x86 binary files and analyse their structure. We analyse only the text section of the executable file, which holds the program's actual executable instructions. Each file has a DOS header, stub code, a PE header, a header for each image section, and the image sections themselves. There are 4 types of section: text, data, resource and relocation (shown in Figure 2). The text section, which we focus on, holds the program code and is mapped as execute/read-only. To extract only the text section, we use the text section header. Table 2 shows some important fields of the text section header.
We check the file's characteristics to determine whether it is an executable file. If it is, we use its text section header to map each section's RVA to RAW. First, we search for the section that includes the RVA, and then, using a proportional equation, calculate the file offset (RAW), which is formally given by the following:
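The relation commonly used for this mapping is RAW = RVA - VirtualAddress + PointerToRawData, applied to the section whose address range contains the RVA. The sketch below assumes that this is the intended equation; the `SectionHeader` type and the example values are hypothetical, and only the field names follow the PE section header.

```python
from typing import NamedTuple, Optional, Sequence


class SectionHeader(NamedTuple):
    """Hypothetical container holding the section-header fields used here."""
    name: str
    virtual_address: int      # VirtualAddress: RVA where the section is mapped
    virtual_size: int         # VirtualSize: size of the section in memory
    pointer_to_raw_data: int  # PointerToRawData: file offset of the raw data
    size_of_raw_data: int     # SizeOfRawData: size of the raw data in the file


def rva_to_raw(rva: int, sections: Sequence[SectionHeader]) -> Optional[int]:
    """Map an RVA to a file offset, or return None if no section contains it."""
    for sec in sections:
        # Assumption: fall back to the raw size when VirtualSize is zero.
        span = sec.virtual_size or sec.size_of_raw_data
        if sec.virtual_address <= rva < sec.virtual_address + span:
            return rva - sec.virtual_address + sec.pointer_to_raw_data
    return None


# Illustrative values only.
sections = [
    SectionHeader(".text", 0x1000, 0x1F4, 0x400, 0x200),
    SectionHeader(".data", 0x2000, 0x80, 0x600, 0x200),
]
print(hex(rva_to_raw(0x1010, sections)))
```

With these illustrative headers, an RVA inside `.text` is shifted by the difference between the section's file offset and its RVA.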
[Figure 2. PE format: layout of the executable in the FILE (by offset) and in MEMORY (by address): DOS Header, Stub Code, PE Header (Optional/File Header), Section Headers (text, data, rsrc, reloc), followed by the text, data, rsrc and reloc sections separated by padding]
Table 2
Name | Description
VirtualSize | Actual size of the code or data
VirtualAddress | RVA to where the loader should map the section
SizeOfRawData | The size of the section after it has been rounded up to the file alignment size
PointerToRawData | File-based offset of where the raw data emitted by the compiler or assembler can be found
Characteristics | Set of flags that indicate the section's attributes (such as code/data, readable, or writeable)
We already know the section's size from the section header, so the text section can be extracted from the whole file. After extracting the text section from the executable file, we translate each machine-language instruction into an instruction using the IA-32 instruction format. An IA-32 instruction consists of 6 components (shown in Figure 3). The opcode is the essential component, and the others are optional. Figure 4 shows a simple translation of machine language to IA-32 instructions.
Next, we classify each operand into five types. Table 3 shows each type of operand with a brief explanation.
Finally, we regard each instruction as a single word and calculate each instruction's frequency with the Python packages collections and operator. The datasets are then normalized in Microsoft Excel 2010. All of this work was executed on
[Figure 3: IA-32 instruction format: Instruction Prefixes (optional), Opcode, ModR/M (if required), SIB (if required), address Displacement (or none), Immediate data (or none); ModR/M fields: Mod, Reg/Opcode, R/M; SIB fields: Scale, Index, Base]
Table 3
Operand Type | Description
Immediate
Register
Absolute Memory
Far Memory
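The "instruction as a single word" counting step can be sketched with the same two standard-library modules the text names, collections and operator. The token format MNEMONIC[OperandType,...] follows the instruction names used in this paper's tables (for example FLDCW[AbsoluteMemoryAddress]). The input is assumed to be already-decoded (mnemonic, operand types) pairs; how they are obtained from diStorm3 is left out, and the sample listing is hypothetical.

```python
from collections import Counter
from operator import itemgetter


def instruction_token(mnemonic, operand_types):
    """Build one 'word' such as MOV[Register,Register] or RET[]."""
    return "{}[{}]".format(mnemonic.upper(), ",".join(operand_types))


def instruction_frequencies(decoded):
    """Return (token, count, percent of total) tuples, most frequent first."""
    counts = Counter(instruction_token(m, ops) for m, ops in decoded)
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=itemgetter(1), reverse=True)
    return [(tok, n, 100.0 * n / total) for tok, n in ranked]


# Hypothetical decoded listing: (mnemonic, operand types) pairs.
decoded = [
    ("push", ["Register"]),
    ("push", ["Register"]),
    ("mov", ["Register", "Register"]),
    ("call", ["Immediate"]),
    ("pop", ["Register"]),
    ("ret", []),
]
for token, count, percent in instruction_frequencies(decoded):
    print(token, count, round(percent, 2))
```

Treating the opcode together with its operand types as one token is what separates, say, MOV[Register,Register] from MOV[Register,AbsoluteMemory] in the frequency tables.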
Instruction Frequency
Benign
Malware
Virus | 3708544 | 905
Worm | 4805446 | 865
Backdoor | 4068544 | 913
DoS | 4459540 | 873
Virtool | 4835244 | 1013
Trojan | 4695292 | 949
Rootkit | 4691961 | 1004
[Figure 4: example translation of a disassembly listing into typed instructions, e.g. push ebp to push[register]; mov ebp, esp to move[register,register]; pop ebp to pop[register]; retn to retn[Immediate] or [Memory Address]]
Instruction | Benign(%) | Virus(%) | Worm(%) | Backdoor(%) | Dos(%) | Virtool(%) | Trojan(%) | Rootkit(%)
ADD[AbsoluteMemory,Register] | 14.51 | 23.18 (1) | 13.31 (1) | 20.52 (1) | 15.43 (1) | 16.27 (1) | 12.01 (1) | 15.69 (1)
PUSH[Register] | 6.01 | 6.01 (3) | 7.33 (3) | 7.6 (2) | 5.98 (3) | 5.65 (2) | 7.05 (2) | 5.19 (2)
MOV[Register,AbsoluteMemory] | 4.83 | 5.26 (4) | 7.59 (2) | 3.88 (3) | 5.19 (4) | 3.58 (4) | 5.6 (3) | 1.32 (9)
INT 3[] | 4.32 | 6.47 (2) | 1.59 (15) | 1.34 (15) | 6.1 (2) | 2.86 (6) | 1.74 (13) | 0.34 (74)
CALL[Immediate] | 3.68 | 3.17 (6) | 4.73 (6) | 2.1 (10) | 3.17 (7) | 2.11 (9) | 3.88 (6) | 1.29 (11)
POP[Register] | 3.6 | 3.18 (5) | 4.92 (5) | 3.08 (4) | 3.93 (5) | 3.88 (3) | 5.46 (4) | 3.96 (4)
MOV[Register,Register] | 3.39 | 3.1 (7) | 5.35 (4) | 1.87 (11) | 3.69 (6) | 2.5 (8) | 4.75 (5) | 1.3 (10)
JZ[Immediate] | 3.38 | 1.96 (10) | 2.27 (10) | 1.33 (16) | 1.95 (11) | 1.5 (12) | 2.11 (10) | 0.61 (33)
PUSH[Immediate] | 3.25 | 2.36 (8) | 2.32 (9) | 2.57 (7) | 1.91 (13) | 1.47 (13) | 2.08 (11) | 0.94 (14)
MOV[AbsoluteMemory,Register] | 3.05 | 2.29 (9) | 3.56 (7) | 2.29 (9) | 2.2 (9) | 1.75 (11) | 2.47 (9) | 1.18 (12)
The number in parentheses is the rank of the instruction within each malware class's instruction ranking.
4.2
Because there are so many instructions, we design 4 instruction ratio blocks based on the benign instruction frequency: [-1%] (over 1%), [1% - 0.1%], [0.1% - 0.01%] and [0.01% - 0.001%]. Instructions with a ratio under 0.001% are excluded because they are too few to analyse and can distort the statistical data. The four blocks contain 22, 55, 104 and 102 instructions, respectively.
We assumed that if there is no association between malware classes and instructions, the benign instruction ratios would also apply to malware. We apply each instruction ratio to the malware's instructions, calculated against the total instruction count, and compare the gap of percentage in each ratio block. We divide the gap of percentage into 3 parts: Similar, High or Low, and Very High or Very Low. Similar means that the malware percentage is between 0.5 and 2 times the benign percentage. High or Low means that it is between 2 and 5 times (High) or between 0.2 (= 1/5) and 0.5 times (Low). Very High or Very Low means that it is over 5 times or under 0.2 times. We then calculate the ratio of instructions that fall into each interval.
Tables 6-8 show the results. When the ratio of benign instructions is high, about 27.27% to 90.91% of malware instructions have a similar ratio, and no or few instructions are very high or very low compared with the benign instructions (0% - 27.27%). However, as the ratio of benign instructions decreases, the gap of percentage between benign and malware increases. When the instruction ratio is between 0.01% and 0.001%, about 11.76% to 35.29% of malware instructions have a similar ratio, and about 23.53% to 69.60% of malware instructions have a very high or very low ratio compared with the benign instructions.
4.3
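The ratio blocks and the three gap categories of Section 4.2 can be sketched as follows. The thresholds come from the text; whether each boundary is inclusive is an assumption, and the function names and example percentages are hypothetical.

```python
def ratio_block(benign_pct):
    """Assign an instruction to a ratio block by its benign frequency (in %)."""
    if benign_pct >= 1.0:
        return "[-1%]"
    if benign_pct >= 0.1:
        return "[1% - 0.1%]"
    if benign_pct >= 0.01:
        return "[0.1% - 0.01%]"
    if benign_pct >= 0.001:
        return "[0.01% - 0.001%]"
    return None  # under 0.001%: excluded from the analysis


def gap_category(benign_pct, malware_pct):
    """Classify how far the malware percentage is from the benign one."""
    if benign_pct <= 0:
        raise ValueError("benign percentage must be positive")
    times = malware_pct / benign_pct
    if 0.5 <= times <= 2:
        return "Similar"
    if 0.2 <= times <= 5:
        return "High or Low"
    return "Very High or Very Low"


# Illustrative percentages only.
print(ratio_block(4.0), gap_category(4.0, 0.3))
```

Counting, per block and per malware class, how many instructions land in each category gives the kind of percentages that Tables 6-8 report.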
[Equation fragment: $\chi^2 / n$]
[Table 6: rows are the four ratio blocks; columns are Ratio, Virus, Worm, Backdoor, Dos, VirTool, Trojan, Rootkit]
Table 7. Percentage of High or Low ratio between malware and benign
[Table 7: rows are the four ratio blocks; columns are Ratio, Virus, Worm, Backdoor, Dos, VirTool, Trojan, Rootkit]
Table 8. Percentage of Very High or Very Low ratio between malware and benign
[Table 8: rows are the four ratio blocks; columns are Ratio, Virus, Worm, Backdoor, Dos, VirTool, Trojan, Rootkit]
$$C = \sqrt{\frac{\chi^2}{n + \chi^2}}$$
When there is no association between two variables, the contingency coefficient is zero, and as the strength of the association increases, the contingency coefficient approaches one.
Table 9 shows the contingency coefficient for each ratio block.
Table 9
Ratio | Virus | Worm | Backdoor | Dos | Virtool | Trojan | Rootkit
-1% | 0.3503 | 0.3608 | 0.3984 | 0.3219 | 0.4143 | 0.3824 | 0.4879
1% - 0.1% | 0.3721 | 0.4860 | 0.6504 | 0.7233 | 0.7154 | 0.4771 | 0.7271
0.1% - 0.01% | 0.7954 | 0.6248 | 0.8012 | 0.9326 | 0.8972 | 0.6719 | 0.8794
0.01% - 0.001% | 0.8524 | 0.8149 | 0.9151 | 0.9578 | 0.9247 | 0.8845 | 0.9714
When the benign instruction ratio is over 1%, the contingency coefficient lies between 32% and 49% (0.3219 to 0.4879). However, as the instruction ratio gets smaller, the contingency coefficient gets bigger; when the benign instruction ratio is between 0.01% and 0.001%, the contingency coefficient rises to 81% - 97% (0.8149 to 0.9714), about two times higher. This supports the finding of the previous subsection: as the benign instruction ratio gets smaller, the ratio gap between malware and benign instructions gets bigger. It is also strong statistical evidence that instructions that are rare in benign files are stronger predictors when detecting malware.
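A contingency coefficient of this kind can be computed from a table of observed counts as sketched below, using Pearson's chi-square statistic and C = sqrt(chi2 / (n + chi2)). How the paper arranged its counts into a contingency table is not stated, so the tables used here are hypothetical and serve only to show the two extremes.

```python
import math


def chi_square(table):
    """Pearson chi-square statistic and sample size of an r x c count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    if n == 0:
        raise ValueError("table has no observations")
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return chi2, n


def contingency_coefficient(table):
    """Pearson's contingency coefficient C = sqrt(chi2 / (n + chi2))."""
    chi2, n = chi_square(table)
    return math.sqrt(chi2 / (n + chi2))


# Hypothetical counts: rows could be benign/malware, columns two instructions.
independent = [[10, 20], [20, 40]]   # same proportions in both rows
associated = [[50, 0], [0, 50]]      # each instruction occurs in one row only
print(contingency_coefficient(independent), contingency_coefficient(associated))
```

Note that for a small table the coefficient cannot reach one; for a 2 x 2 table its maximum is sqrt(1/2), which is the value the second example produces.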
4.4
[Figure: UNPCKLPS instruction operation, showing elements X and Y of the SRC and DEST operands interleaved into DEST]
[Figure: PUNPCKLBW / PUNPCKHBW instruction operation, showing elements X and Y of the SRC and DEST operands interleaved into DEST]
Class | % of total unique inst. | total type of unique inst. | Instructions
Virus | 1.452 | 338 | XADD[Register,Register], BTS[Register,Register], MULSD[Register,Register], ADDSD[Register,Register], MULPD[Register,Register]
Worm | 0.501 | 307 | BT[AbsoluteMemoryAddress,Register], FLDCW[AbsoluteMemoryAddress], CMOVNS[Register,AbsoluteMemory], BTS[Register,Register], FIMUL[AbsoluteMemoryAddress]
Backdoor | 0.6667 | 351 | PCMPGTD[Register,Register], WRMSR[], PSUBB[Register,AbsoluteMemory], UNPCKLPS[Register,AbsoluteMemory], FLDCW[AbsoluteMemoryAddress]
Dos | 0.3608 | 313 | BT[AbsoluteMemoryAddress,Register], FLDCW[AbsoluteMemoryAddress], CVTTPS2PI[Register,AbsoluteMemory], UNPCKLPS[Register,AbsoluteMemory], WRMSR[]
Virtool | 1.358 | 447 | PSHUFW[Register,AbsoluteMemory,Immediate], PUNPCKHBW[Register,AbsoluteMemory], ADDPS[Register,AbsoluteMemory], SUBPS[Register,AbsoluteMemory], PUNPCKLBW[Register,AbsoluteMemory]
Trojan | 0.9567 | 388 | MOVMSKPS[Register,Register], BT[AbsoluteMemoryAddress,Register], FLDCW[AbsoluteMemoryAddress], CMOVNS[Register,AbsoluteMemory], FIMUL[AbsoluteMemoryAddress]
Rootkit | 2.135 | 442 | LSL[Register,AbsoluteMemory], UD2[], FIMUL[AbsoluteMemoryAddress], WBINVD[], MAXPS[Register,AbsoluteMemory]
FURTHER WORK
Support Vector Machine, Genetic Algorithm, Decision Tree or a combination of these). This provides an effective method to detect malware and, furthermore, variants of malware families.
5.2
During the research, we found some unusual instruction sequences in malware code, as in Figure 7.
This deserves deeper research. As can be seen, some instructions have a contextual relationship as a combination of two or more instructions in sequence. Expanding the instruction-sequence N-gram and extracting more specific statistical data would provide more solid statistical data about malware.
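One way to look for such repeated sequences is to count N-grams over the typed instruction tokens. The sketch below is a minimal illustration; the helper names and the repeated CMP/JZ listing are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """All runs of n consecutive instruction tokens, as tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def most_common_ngrams(tokens, n, top=3):
    """The most frequent n-grams with their counts, most frequent first."""
    return Counter(ngrams(tokens, n)).most_common(top)


# Hypothetical listing: the same two instructions repeated six times.
listing = ["CMP[AbsoluteMemory,Immediate]", "JZ[Immediate]"] * 6
for gram, count in most_common_ngrams(listing, 2):
    print(" -> ".join(gram), count)
```

An N-gram that accounts for an unusually large share of a file's instructions is a candidate for the kind of repeated pattern described here.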
Recently, aimed at specific targets with
clear objectives intelligently, secretly collect and
[Figure 7: repeated instruction sequences found in malware samples; sample labels are given as extracted, with digits and punctuation missing:
VirusWinKdar: CMP[AbsoluteMemory,Immediate], JZ[Immediate], repeated
WormWinDeffective: OUTS[Register,AbsoluteMemory], INC[Register], ADD[Register,Register] or ADD[AbsoluteMemory,Register], repeated
TrojanWinICQSpoof: PUSH[Immediate], PUSH[Immediate], CALL[Immediate], MOV[AbsoluteMemory,Register], repeated
VirusWinEmar, TrojanWinAdex: JMP[AbsoluteMemoryAddress], MOV[Register,Register], repeated
WormWinOnver: ADD[AbsoluteMemory,Register], INC[Register], INC[Register], repeated
RootkitWinRessdtbx: SBB[Register,Immediate], ADD[Register,Register] or ADD[AbsoluteMemory,Register], repeated
DoSWinVBcv: MOV[Register,Immediate], CMP[Register,Immediate], MOV[Register,Immediate], PUSH[Immediate], RET[], repeated
BackdoorWinFloboa: ADD[Register,Register] or ADD[AbsoluteMemory,Register], DEC[Register], INC[Register], repeated]
CONCLUSION
bigger when the ratio is lower. Furthermore, the contingency coefficient is interpreted as the strength of association without reference to other factors, so rare instructions, which are high in scarcity, can be good predictors to distinguish between benign files and malware. Finally, malware uses some special instructions, such as double-precision floating-point instructions and packed or unpacked instructions. This can also be an important characteristic when detecting malware.
ACKNOWLEDGMENT
This work was supported by the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (No. R0126-15-1111, The Development of Risk-based Authentication Access Control Platform and Compliance Technique for Cloud Security).
REFERENCES