
Early Stage Malware Prediction Using Recurrent Neural Networks

Matilda Rhode, Pete Burnap, Kevin Jones
Cardiff University
{rhodem, burnapp}@cardiff.ac.uk
Airbus Group
kevin.jones@airbus.com

arXiv:1708.03513v1 [cs.CR] 11 Aug 2017

Abstract: Certain malware variants, such as ransomware, highlight the importance of detecting malware prior to the execution of the malicious payload. Static code analysis can be vulnerable to obfuscation techniques. Behavioural data collected during file execution is more difficult to obfuscate, but typically takes a long time to capture.

In this paper we investigate the possibility of predicting whether or not an executable is malicious. We use sequential dynamic data and find that an ensemble of recurrent neural networks is able to predict whether an executable is malicious or benign within the first 4 seconds of execution with 93% accuracy. This is the first time a file has been predicted to be malicious during its execution rather than using the complete log file post-execution.

I. INTRODUCTION

The aftermath of successfully executed malware attacks can be costly, whether this entails network repair or data recovery. Ransomware attacks, for example, typically demand financial compensation in exchange for a decryption key to unlock data, and saw a 300% rise in 2016 compared with 2015 [1]. Such attacks may also incur the less quantifiable cost of lost user trust and/or organisational reputation due to service disruption. In this paper, we investigate the possibility of predicting whether or not a file is malicious based on the initial behaviour of the executable. The motivation for this research is to detect malware before the malicious payload is executed, thus using behavioural data to prevent damage rather than as a means to understand the repairs necessary after the damage has been done. As far as we are aware, this is the first paper attempting to predict whether or not a file is malicious using initial behavioural activity. To date, dynamic malware detection research has made use of entire file execution logs, or simply chosen some fixed time at which to stop recording. Using sequential machine activity data, we explore the predictive capabilities of recurrent neural networks to determine whether or not a file is malicious.

Malware detection today needs to be automated due to the increasing volume of files submitted for analysis. The free file analysis tool VirusTotal can receive more than one million distinct files for investigation in a single day (1.04 million on 26th April 2017) [2]. Signature-based detection, which identifies malware by comparing known signatures to file contents, can be evaded by obfuscating code. Signature-based systems are particularly unsuited to detecting zero-day malware unless it shares signatures with previously known strains [3]. Conversely, anomaly-based systems compare a baseline of normal activity and capture deviations from this baseline. Anomaly detection creates an alert for any aberrant behaviour rather than for recognised signatures. Deviations from baseline activity must then be investigated, by which time a malicious executable may have already achieved its aims. Machine learning seeks to remedy the tendency of signature-based systems to falsely classify malware as benign and the tendency of anomaly-based systems to falsely classify benign behaviour as a risk.

The data used by machine learning algorithms to classify executables may be static, dynamic or a combination of the two. Static data is scraped directly from source code, and can be used to classify files before execution. Static data, however, can be manipulated to evade accurate classification [4]. Dynamic data is recorded during file execution in a virtual machine and seeks to capture the actual processes carried out by the file. It is more difficult to mask malicious activities when data is collected in this way, as malware must at least carry out the processes necessary to realise its objectives.

Existing research using machine learning and dynamic data to detect malware uses the entire execution cycle (e.g. [5], [6], [7]) or chooses some arbitrary time at which to cap recording (e.g. [8], [9], [10]). Some approaches determine the length of time for which to record data based on file properties (e.g. [11], [12]). Such approaches focus on reducing dynamic data recording time on an instance-by-instance basis, meaning that full dynamic analysis is still carried out for some samples.

Our approach will primarily use the initial seconds of file execution. This reduces the amount of information available to the classifier. Owing to the sequential nature of dynamic data, we believe that a classifier capable of understanding the relationships between features over time will capture the most information about each sample and thus be capable of more accurate classification. Recurrent neural networks (RNNs) are able to process time series data [13] and we show that RNNs outperform other widely-used machine learning algorithms in predicting malware.

The key contributions of this paper are:
- Demonstration that malware can be predicted from the first 4 seconds of file execution with 93% classification accuracy.
- Analysis of classification accuracy as time into file execution progresses.
- We show that it is possible to reduce the time spent recording sample data by 98% (4.8 minutes per sample to 4 seconds per sample) and achieve an accuracy of 93%, while 98% accuracy can be achieved using the first 20 seconds of data (93% time reduction).

Reference | Reported time collecting dynamic data
[10]      | 30 seconds
[17]      | 3.33 minutes (200 seconds)
[18]      | 5 minutes
[19]      | 5 minutes, logging 10 times per sample
[16]      | Fixed time and 5-10 minutes mentioned but overall time cap not explicitly stated
[8]       | At least 15 steps - exact time unreported
[5]       | No time cap mentioned - implicit full execution
[20]      | No time cap mentioned - implicit full execution
[6]       | No time cap mentioned - implicit full execution

TABLE I: REPORTED TIMES COLLECTING DYNAMIC BEHAVIOURAL DATA PER SAMPLE

II. RELATED WORK
A. Malware Classification
Malware detection is a complex classification problem, partly because adversaries adapt malware to avoid detection. The resilience of machine learning-based detection can depend on the kind of data used. Static data, collected directly from code, is quick to collect and can be analysed prior to file execution. Saxe and Berlin [14] distinguish malware from benignware with an accuracy of 97.45% using a deep feed-forward neural network trained on static program features. Grosse et al. [4] demonstrate, however, that static data can be obfuscated to cause a classifier previously achieving 97% accuracy to fall as low as 20% when classifying adversarially crafted samples, with some accuracy increase possible using defensive techniques such as training on obfuscated samples. This vulnerability to obfuscation renders static models less reliable than dynamic models [15].

Dynamic runtime behaviours, such as system calls to the operating system kernel, are the most common features used in dynamic analysis. Beyond resilience to obfuscation, dynamic data has been found to give better classification accuracy. Damodaran et al. [16] compare malware classification of static and dynamic models using Hidden Markov Models (HMMs) and find that dynamic data yields the highest accuracy, with a 0.98 Area Under the Curve score. Huang and Stokes [5] were able to classify malicious and benign executables with an accuracy of 99.64% from 114 System API calls and features derived from those API calls using a shallow feed-forward neural network.

Dynamic data, though more robust to obfuscation, takes much longer to collect and so is impractical (by comparison with static analysis) for analysing files in real time as part of an endpoint defensive system. Waiting for the entire file to execute in a virtual machine or emulator before classifying creates an impractical delay if used as part of an endpoint security system.

B. Reducing dynamic data recording time

Existing methods to reduce dynamic data recording time focus on efficiency. The core concept is only to record dynamic data if it will improve accuracy, either by omitting some files from dynamic data collection or by stopping data collection early. The time taken to record dynamic data is not always reported, but 5 minutes or a full execution of the sample are commonly reported (Table I). The cutoff for data samples may be based on sequence length rather than time, as in the case of Pascanu et al.'s work [8]. Data recording can be parallelised or run in an emulator, which gives slightly faster analysis (e.g. [5]), but reducing recording time per sample still enables more data to be collected in a shorter time.

Shibahara et al. [11] decide when to stop analysis for each sample based on changes in network communication, reducing the total time taken by 67% compared with a conventional method that analyses samples for 15 minutes each. Neugschwandtner et al. [21] used static data to determine dissimilarity to known malware variants using a clustering algorithm, and thus the implicit value of dynamic analysis. This approach demonstrated an improvement in classification accuracy by comparison with randomly selecting which files to dynamically analyse, or selecting based on sample diversity. Similarly, Bayer et al. [12] create behavioural profiles to try and identify polymorphic variants of known malware, reducing the number of files undergoing full dynamic analysis by 25%.

Each of these approaches, which calculate the expected benefit of dynamic analysis for each sample, may still execute an entire file. We look at cutting analysis short for all files and the resulting impact on accuracy, with a view to a practical endpoint solution for detecting malware.

C. Recurrent neural networks for prediction

Time series data captures the changes in parameters as well as the parameter values themselves. For this reason we anticipate that a model capable of analysing time series data will give the highest classification accuracy. Recurrent neural networks and Hidden Markov Models are both able to capture sequential changes, but RNNs hold the advantage in situations with a large possible universe of states and memory over an extended chain of events [22], and are therefore better suited to detecting malware using behavioural data.

RNNs use hidden layers which are deep in time, rather than space, to model changes between inputs in a sequence. Recurrent models have been employed successfully in the field of natural language processing (NLP) and can be used to predict and generate text based on an initial short sequence of words or characters [13] or an image [23].

Weights in neural networks represent the importance of information travelling between two neurons. During training these weights are updated repeatedly to reduce the error rate in categorising the training data against its specified labels. Prior to the development of Long Short-Term Memory (LSTM) cells by Hochreiter and Schmidhuber [24], RNN weights were liable to increase or decrease exponentially over long sequences [25]. LSTM cells can hold information back from the network until such a time as it is relevant, or forget information, thus mitigating the problems surrounding weight updates. The success of LSTM has prompted a number of variants, though few of these have significantly improved on the classification abilities of the original model [26]. Gated Recurrent Units (GRUs) [27], however, have been shown to have comparable classification to LSTM cells, and in some instances can be faster to train [28].

Sequential data holds more information than data removed from its sequential context, and so provides additional information for difficult classification tasks such as identifying malware. Pascanu et al. [8] compared RNNs with Echo State networks to learn the "language" of malware and detect reordered malware. The authors found that Echo State networks outperformed RNNs for binary classification.

Kolosnjaji et al. [29] conducted extensive experiments with deep neural networks, including recurrent networks, to classify malware into families using API call sequences. By combining a convolutional neural network with LSTM cells, the authors were able to attain a recall of 89.4%, but do not address the binary classification problem of distinguishing malware from benignware.

Existing dynamic malware detection prioritises classification accuracy, even when trying to reduce the dynamic data recording time. Huang and Stokes [5] have demonstrated, using a very large dataset of 4.5 million training samples and 2 million testing samples, that an accuracy of 99.6% can be achieved. In this paper, our goal is to examine whether it is possible to predict the likelihood that a file is malicious with a reasonable level of accuracy so that dynamic classification can be incorporated into an online malware detection system. For example, is it possible to have just 5% of samples misclassified within the first 3 seconds of execution? If not, with what degree of accuracy can we predict malware using 3 seconds of dynamic data? Does accuracy increase monotonically with more data or is there an optimal data sequence length for peak accuracy?

III. SYSTEM DESCRIPTION

[Fig. 1. Data pipeline of proposed system: executable files -> virtual machine -> data -> data preprocessing -> preprocessed data -> classifier training / classification -> predictions]

Figure 1 outlines our proposed system for early malware prediction in both the training and classification phases. Our goal is to reduce the overall time taken by this pipeline, both for training and classification, whilst maintaining a high level of classification accuracy, to investigate the viability of using this system as part of an endpoint detection solution. We want to reduce the overall time for one sample to pass through the pipeline for classification. Per sample this equates to:

time in virtual machine (seconds or minutes) + preprocessing time (fraction of a second) + classification time (fraction of a second)

Predicting malware, if sufficiently quick, could enable dynamic malware detection prior to the file being executed in the desired environment (if deemed benign). In the future we could compare prediction accuracy over time with the time at which the malicious payload is executed. Few papers report the time for which data is recorded (see Table I), implying that data may be captured for the entire duration of file execution. Others imply fifteen minutes [11] or five to ten minutes [16] before recording is stopped. Even if we take the five minute estimate, it is clear that the time in the virtual machine constitutes the bulk of the classification time per sample. Preprocessing, assuming it is automated, is likely to take fractions of seconds, to decorrelate or select features or convert data into some other format, for example. Classification itself will be similarly quick. From [5], for example, we can deduce a classification time of 0.0056 seconds per sample using an FFNN. Therefore, reducing the time in the virtual machine by 50% will deliver a far greater reduction in overall time per sample than halving either preprocessing or classification time.

A. Data capture and dataset

The dataset comprises machine activity data collected in a virtual machine using the Cuckoo Sandbox malware analysis tool [30]. We captured 11 features each second for five minutes for 594 malicious and 594 trusted portable executable files provided by VirusTotal [31]. If a file finished execution before five minutes, data recording ended. The average sample recording time was 4.8 minutes; this took almost five days (95.4 hours) to collect for all 1188 samples. The dataset size is consistent with dynamic malware detection research (e.g. [6], [10], [32], [33], [16], [19], [34]).

We capture numeric data, primarily machine activity data, which has advantages in preprocessing and model training time over categorical data. Many dynamic malware detection methods use system API calls (e.g. [5], [35], [32], [8]), which must be converted to numeric form before being fed into a machine learning classifier. Continuous numeric data does not need to be transformed. Often, categorical data will be transformed into one-hot vectors. One-hot vectors are vectors of the size of the number of possible categorical values, in which each value is assigned an index; a value's presence is represented by a 1 at its index and its absence by a 0. Many possible values can cause the dimensions of the input data to be large, which makes the model slower to tune, as the number of tunable parameters between the inputs and the first hidden layer of the network is proportional to the shape of the input data. Numeric data can represent many different values within a single input and so does not increase the network parameters as dramatically when capturing many kinds of data.
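To illustrate the dimensionality point, here is a minimal sketch of one-hot encoding, assuming a hypothetical four-call vocabulary (the paper itself avoids categorical inputs entirely):

```python
# Hypothetical categorical feature: an API call drawn from a small vocabulary.
calls = ["open", "read", "write", "close"]
index = {call: i for i, call in enumerate(calls)}

def one_hot(call):
    vec = [0] * len(calls)   # one input dimension per possible value
    vec[index[call]] = 1     # presence marked by a single 1
    return vec

print(one_hot("read"))  # [0, 1, 0, 0]
# A vocabulary of 10,000 distinct calls would need 10,000 input dimensions,
# whereas one continuous numeric feature occupies a single input.
```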
1) Features: Despite the rapid evolution of malware, the objectives of malicious files are often shared with previously seen malware strains, for example gaining unauthorised access to a network or some data, or sending information out of a network. Benignware objectives are much broader, and may be defined as anything which is not malware. The alternative to behavioural analysis is to extract features from the malware code, which is not always available [9] and may be reordered [8] or otherwise obfuscated [16]. Similar to [36], we use machine activity data to monitor file behaviour. The activities monitored represent all the machine activity data we were able to capture, and resulted in 11 data points each second for each sample.

The features captured were: total number of processes being executed, the maximum number of processes being carried out (derived from the highest assigned process identification number), user and system CPU use (each as a percentage of total CPU capacity), memory use, swap use, number of packets received, number of packets sent, number of bytes received, number of bytes sent, and the number of milliseconds elapsed since the file began running. The milliseconds elapsed are included in the recurrent neural network even though the snapshots are taken at one-second intervals, because of slight deviations that occur during data capture.
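The paper collects these features through the Cuckoo Sandbox; the exact collection hooks are not given. As a rough illustration only, the following sketch gathers comparable per-second snapshots with the psutil library (an assumed stand-in, not the authors' tooling):

```python
import time
import psutil  # assumed stand-in for the Cuckoo-based collector used in the paper

def snapshot():
    """One per-second record of the 11 machine activity features described above."""
    cpu = psutil.cpu_times_percent(interval=None)
    mem = psutil.virtual_memory()
    swap = psutil.swap_memory()
    net = psutil.net_io_counters()
    pids = psutil.pids()
    return [len(pids), max(pids),                # total processes, max process ID
            cpu.user, cpu.system,                # user and system CPU (%)
            mem.used, swap.used,                 # memory and swap use
            net.packets_recv, net.packets_sent,  # packet counts
            net.bytes_recv, net.bytes_sent,      # byte counts
            time.time() * 1000]                  # ms timestamp (elapsed time by differencing)

trace = []
for _ in range(300):          # up to five minutes of one-second snapshots
    trace.append(snapshot())
    time.sleep(1)
```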
Figure 2 illustrates the relative frequency distributions of the values held by each of the 11 features for benign and malicious samples. These data represent values held across the entire data capture window (about five minutes per file). Table II gives summary statistics for the features. Bytes received and bytes sent exhibit the greatest contrast in distribution between malware and benignware, which may be indicative of sending data out of the network, a common goal of malware seeking to capture information. This is also consistent with Burnap et al.'s finding that network traffic data was the most predictive feature for identifying malicious URLs [36]. Time-stamps exhibit near identical data, as expected, as data was collected at one-second intervals for both benign and malicious samples.

[Fig. 2. Frequency distributions of input features for benign and malicious samples]

Feature             | Benign Min | Benign Max | Malicious Min | Malicious Max
Time                | 0          | 310.92     | 0             | 298.46
Total Processes     | 40         | 62         | 40            | 58
Max. Process ID     | 2440       | 4092       | 2712          | 3068
CPU User (%)        | 0          | 100        | 0             | 100
CPU System (%)      | 0          | 63         | 0             | 80
Memory Use (MB)     | 540        | 1545       | 571           | 1,332
Swap Use (MB)       | 655        | 2026       | 657           | 1,780
Packets Received    | 1,186      | 85,951     | 1,169         | 215,263
Packets Sent        | 390        | 60,146     | 357           | 268,326
Bytes Received (MB) | 1.60       | 81.83      | 1.59          | 15.45
Bytes Sent (MB)     | 0.05       | 101.87     | 0.04          | 354.50

TABLE II: MINIMUM AND MAXIMUM VALUES OF EACH INPUT FOR BENIGN AND MALICIOUS SAMPLES

B. Data Preprocessing

By using continuous numeric data, we avoid converting System API calls into large vectors as other dynamic behavioural analysis approaches are forced to (e.g. [5], [8]). We further implement two preprocessing measures aimed at speeding up classification and training time.

Prior to training and classification, we normalise the data to increase model convergence in training. By roughly keeping data between 1 and -1, the model is able to converge more quickly, as the neurons within the network operate within this range [37].

We achieve this by normalising around the zero mean and unit variance of the training data. For each feature i, we establish the mean, μ_i, and variance, σ_i, of the training data. These values are stored, after which every feature value x_i is scaled:

    x_i' = (x_i - μ_i) / σ_i
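A minimal sketch of this scaling in NumPy, assuming X_train and X_test are arrays of shape (samples, timesteps, 11) (the array names are illustrative, not from the paper):

```python
import numpy as np

# Fit the scaling parameters on the training data only, then reuse them at test time.
flat = X_train.reshape(-1, X_train.shape[-1])
mu = flat.mean(axis=0)      # per-feature mean over every snapshot
sigma = flat.std(axis=0)    # per-feature spread

X_train = (X_train - mu) / sigma
X_test = (X_test - mu) / sigma   # same stored parameters: no test-set leakage
```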
For the recurrent neural networks only, the data is divided into equal-sized batches to increase the training speed over training one sample at a time. Random samples are omitted until the batch size is a factor of the dataset size.

To emulate the early stopping of data capture that we will use in the final system, we must truncate the data sequences. We select the desired sequence length, s, which corresponds to the number of seconds of data captured in the virtual machine, since we take a data snapshot each second. Sequences are then omitted or truncated relative to the chosen sequence length, s. We remove all samples with sequences shorter than s and then truncate all data after time step s for the remaining sequences. For neural networks the sequences must all be the same length; we cannot pad out the short sequences using masking (as is possible with categorical data) because masking uses vectors of zeros, and zero has a numerical significance for continuous data, so it cannot be learnt as a class representing "no data". These sequences are omitted for the other classifiers to enable comparison between the algorithms.
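A simple sketch of the omit-or-truncate step, under the assumption that each sample is an array of per-second snapshots (names are illustrative):

```python
import numpy as np

def truncate_sequences(samples, labels, s):
    """Keep samples with at least s one-second snapshots and cut the rest to s.

    samples: list of arrays shaped (timesteps_i, 11); labels: matching 0/1 list.
    Returns arrays shaped (n_kept, s, 11) and (n_kept,).
    """
    kept_x, kept_y = [], []
    for x, y in zip(samples, labels):
        if len(x) >= s:              # omit sequences shorter than s
            kept_x.append(x[:s])     # truncate everything after time step s
            kept_y.append(y)
    return np.stack(kept_x), np.asarray(kept_y)
```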
C. Classifier training and prediction

We are interested in whether behavioural data about benign and malicious executables captured early in file execution can give an indication as to whether or not the file is malicious, and the frequency with which this indication can be deemed accurate.

We anticipate that RNNs will yield the best predictive capabilities due to their ability to process sequential data, and compare them with other machine learning algorithms used for malware classification: Random Forest, Support Vector Machine (SVM), Naive Bayes, J48 Decision Tree, K-Nearest Neighbour and Multi-Layer Perceptron algorithms ([10], [6], [38]). Previous research indicates that Random Forest, Decision Tree or SVM are likely to perform the best of those considered.

As well as truncating the sequence length, we are also interested in the training and testing times of the models, though this is secondary to recording time in the virtual machine, which is expected to provide the most scope for reducing processing time per sample. Support Vector Machines and RNNs, for example, take longer to train than random forests.

To simulate the manner in which an early detection tool might be used, we hold out 50% of the data for testing. In practice, the model used for predicting malware will be trained using some dataset, but for the system to be of any use, the model resulting from training must generalise well to detect new samples. The accuracy achieved on the unseen test set best simulates the real-world situation of testing the classifier in detecting malware, and provides the benchmark for comparing algorithms. Malicious and benign samples were randomly assigned to either the training or test set to give two datasets, each comprising 594 samples (297 malicious and 297 benign). We use 10-fold cross validation on the training set to tune the hyperparameters of the RNN (see Section III-D). Finally, we test on the unseen test set.

The dataset is balanced; very minor imbalances may occur as a result of omitting short sequences during training and testing, so we use accuracy rather than F-scores to measure model success. F-scores help to evaluate the results of imbalanced datasets but are biased towards favouring true positives over true negatives.

We rely on Keras [39] with a TensorFlow [40] backend to implement the RNN and use the WEKA [41] machine learning toolkit to compare accuracy with other machine learning classifiers. To reduce training and testing time, the RNN is trained and tested on an Nvidia GTX 1080 GPU.

D. Recurrent neural network configuration

The choice of hyperparameters can be crucial in determining the success of a neural network. Further, the solution space(s) delivering good results can be quite small by comparison with the search space of possible configurations. Hyperparameters are often hand-chosen (e.g. [8], [42]) and some argue that the selection process resembles an art more than a science [43].

In the domain of malware detection, it is important that this process be automated. Machine learning for malware detection has been a response to the need for automation, given the rate at which samples must be analysed. Malware evolves in response to new opportunities for gain and to evade detection. As such, the same RNN configuration may not consistently capture the relevant features for detecting malware. Automatic hyperparameter selection enables cheap and frequent re-searching of the possible configuration space in case a better alternative exists. We conduct a random search of the hyperparameter space, as it is trivial to implement, works well in a large search space and is more efficient than grid search at finding good configurations [44].

Some aspects of the model are chosen in advance for their potential to reduce model training and testing time. Weights are initialised with the uniform distribution outlined in [37] to speed up convergence. Between the input layer and output layer are hidden layers comprising GRU cells [27], chosen over LSTM cells for the potential reduction in training time [28]. The hyperparameters changed during random search and their possible values are indicated in Table III.

Hyperparameter            | Possible values
Hidden layers             | 1, 2, 3
Bidirectional             | True, False
Hidden neurons            | 1 - 500
Learning rate             | 0.001 - 1.0 (0.001 increments)
Weight updating algorithm | Stochastic gradient descent, Adam
Epochs                    | 1 - 500
Dropout rate              | 0 - 0.5 (0.1 increments)
Weight regularisation     | None, l1, l2, l1 and l2
Bias regularisation       | None, l1, l2, l1 and l2
Batch size                | 1 - 59

TABLE III: POSSIBLE HYPERPARAMETER VALUES
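A condensed sketch of sampling one configuration from Table III and building the corresponding model. This uses the present-day tf.keras API rather than the authors' 2017 code, and omits the weight/bias regularisation and sequence-length choices for brevity; it is an illustration under those assumptions, not the paper's implementation:

```python
import random
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import GRU, Dense, Dropout, Bidirectional
from tensorflow.keras.optimizers import Adam, SGD

def sample_config():
    """Draw one random point from the search space of Table III."""
    return {
        "layers": random.choice([1, 2, 3]),
        "bidirectional": random.choice([True, False]),
        "neurons": random.randint(1, 500),
        "learning_rate": random.randint(1, 1000) / 1000.0,  # 0.001-1.0 in 0.001 steps
        "optimiser": random.choice(["adam", "sgd"]),
        "epochs": random.randint(1, 500),
        "dropout": random.choice([0.0, 0.1, 0.2, 0.3, 0.4, 0.5]),
        "batch_size": random.randint(1, 59),
    }

def build_model(cfg, timesteps, n_features=11):
    model = Sequential([Input(shape=(timesteps, n_features))])
    for i in range(cfg["layers"]):
        recurrent = GRU(cfg["neurons"],
                        return_sequences=(i < cfg["layers"] - 1))  # stacking needs sequences
        model.add(Bidirectional(recurrent) if cfg["bidirectional"] else recurrent)
        if cfg["dropout"] > 0:
            model.add(Dropout(cfg["dropout"]))
    model.add(Dense(1, activation="sigmoid"))  # benign = 0, malicious = 1
    opt = (Adam if cfg["optimiser"] == "adam" else SGD)(learning_rate=cfg["learning_rate"])
    model.compile(optimizer=opt, loss="binary_crossentropy", metrics=["accuracy"])
    return model
```

Each sampled configuration would then be scored with 10-fold cross validation on the training set, keeping the best-performing ones.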

The number of hidden layers refers to the intermediate layers between the input and output layers. This is equivalent to model depth and can improve accuracy through layers of weights giving a hierarchical feature selection structure, but depth may also cause the model to overfit to the training data unless some regularisation technique is used. The number of GRU cells in the hidden layers lies between 1 and 500. Too few neurons may not capture enough information for accurate classification, but too many can again cause the model to overfit to the training data.

Pascanu et al. [8] had some success using a bidirectional RNN, hypothesising that adversaries are likely to reorder malware and that bidirectional layers can help to reduce the efficacy of reordering by processing the time series data both progressing and regressing in time, but it may be that only progressive sequences are helpful for prediction.

The number of epochs gives the number of times the model weights are updated after the data is passed through the model, and the learning rate dictates how much the weights can be updated at each epoch. We allow the weight updating algorithm to be either stochastic gradient descent or the Adam optimiser [45]. Adam is a computationally efficient implementation of stochastic gradient descent which does not require tuning of its own hyperparameters, but stochastic gradient descent is commonly used in successful malware classification with neural networks (e.g. [5]).

We also vary regularisation techniques designed to prevent the model from overfitting to the training data. Dropout randomly removes a pre-specified proportion of units from the network [46]. Weight and bias regularisation can be used to penalise large weights (l1 regularisation) and to allow some weights to become large whilst penalising others (l2 regularisation).

The batch size varies between 1 and 59. The batch size denotes the number of samples that will be passed through the network in one propagation. Very small batch sizes will tend to increase the time that the network weights take to converge, as the estimates of the best weight update are based on a very small, and therefore possibly unrepresentative, subset of the data. Larger batch sizes may equally contain redundant information if there is significant similarity between samples in the dataset [37]. 59 is the upper limit as this is the largest batch size possible for 10-fold cross validation with the training set of 594 samples, comprising 50% of the dataset.

Although there are only 10 parameters to tune, there are 4.5 trillion different possible configurations. As well as the hyperparameters above, we randomly select the sequence length of the data. Although the goal is to find the best classifier for the shortest sequence, selecting an arbitrary length such as 5 or 10 seconds into file execution may not produce any models capable of high classification accuracy. We do not know whether a model will increase monotonically in accuracy with more data or peak at a particular time into the file execution. Randomising the sequence length used for training and classification reduces the chances of having a blinkered view of model capabilities.

IV. EXPERIMENTAL RESULTS

A. Predicting malware using early-stage data

We want to determine whether machine learning can be used to detect malware during the early stages of file execution, and whether there is a trade-off between time into file execution and detection accuracy. Each of the RNNs and other ML classifiers is fed an additional second of execution data at a time to determine the predictive capabilities of the models.

The configurations of the best-performing RNNs are detailed in Table IV. These configurations were simply those which achieved the highest accuracy in 10-fold cross validation on the training set.

Hyperparameter            | Config. A | Config. B | Config. C
Depth                     | 1         | 2         | 2
Bidirectional             | True      | True      | True
Hidden neurons            | 56        | 244       | 195
Learning rate             | 0.001     | 0.025     | 0.001
Weight updating algorithm | Adam      | S.G.D.    | Adam
Epochs                    | 151       | 497       | 318
Dropout rate              | 0         | 0         | 0
Weight regularisation     | None      | None      | l1 and l2
Bias regularisation       | None      | l1 and l2 | None
Batch size                | 58        | 47        | 56

TABLE IV: HYPERPARAMETER CONFIGURATIONS OF THE THREE HIGHEST-SCORING RNNS FOUND THROUGH RANDOM SEARCH

The trend of the accuracy metric is shown in Figure 3; the upper graph shows the accuracy trends on the training set, and the lower graph on the unseen test set. On the unseen test set, the RNNs significantly outperform the other classifiers after 3 seconds of execution. Table V records the highest accuracy achieved on the test set by the RNNs and the other algorithms.

The RNN cannot usefully learn from 1 second of data, as there is no sequence, so accuracy is equivalent to a random guess. Using just 1 second of machine activity data, the random forest is able to classify 86% of unseen samples correctly. Using just 4 seconds of data the RNN correctly classifies 91% of unseen samples, and achieves 98% accuracy at 19 seconds into execution, whereas the next best algorithm reaches a maximum of 90% at any point during the first 30 seconds. The RNN improves in accuracy as the amount of sequential data increases.

The dataset size here is relatively small, but consistent with those used in similar research. To capitalise on the data available, we conduct a 10-fold cross-validation on the entire dataset, thus increasing the amount of data available for training (to 1050 for each fold) without reducing the test set to a single small batch of (possibly unrepresentative) samples.

Classifier         | Sequence Length | Accuracy (%) | FP (%) | FN (%)
RandomForest       | 11              | 89.96        | 11.61  | 9.54
SVM                | 8               | 77.25        | 22.78  | 22.71
NaiveBayes         | 8               | 65.13        | 40.62  | 10.88
MLP                | 22              | 86.82        | 15.29  | 10.8
KNN                | 11              | 83.98        | 14.9   | 17.08
AdaBoost           | 14              | 86.11        | 14.56  | 13.19
DecisionTree (J48) | 16              | 87.72        | 12.14  | 12.41
RNN A              | 19              | 98.1         | 2.06   | 1.73
RNN B              | 19              | 98.23        | 1.41   | 2.14
RNN C              | 25              | 98.39        | 1.79   | 1.42

TABLE V: HIGHEST ACCURACY DURING FIRST 30 SECONDS OF EXECUTION WITH CORRESPONDING FALSE POSITIVES (FP) AND FALSE NEGATIVES (FN)

Time (s) | A Acc. (%) | A FP (%) | A FN (%) | B Acc. (%) | B FP (%) | B FN (%) | C Acc. (%) | C FP (%) | C FN (%)
2        | 86.61      | 11.86    | 14.92    | 86.86      | 8.81     | 17.46    | 85.9       | 10.79    | 17.41
3        | 89.58      | 10.85    | 10.0     | 89.75      | 7.12     | 13.39    | 85.17      | 9.15     | 20.51
4        | 91.53      | 7.29     | 9.66     | 91.53      | 6.1      | 10.85    | 90.42      | 8.47     | 10.68
5        | 92.12      | 7.46     | 8.31     | 91.78      | 6.78     | 9.66     | 90.76      | 7.63     | 10.85
6        | 92.37      | 7.29     | 7.97     | 93.9       | 4.41     | 7.8      | 93.98      | 7.12     | 4.92
7        | 93.73      | 5.42     | 7.12     | 94.15      | 4.41     | 7.29     | 94.07      | 5.93     | 5.93
8        | 95.08      | 3.9      | 5.93     | 94.83      | 4.41     | 5.93     | 93.73      | 5.42     | 7.12
9        | 95.25      | 4.41     | 5.08     | 95.17      | 4.24     | 5.42     | 94.15      | 5.59     | 6.1
10       | 95.76      | 3.73     | 4.75     | 95.76      | 2.88     | 5.59     | 95.0       | 4.58     | 5.42
15       | 97.2       | 2.54     | 3.05     | 96.86      | 2.71     | 3.56     | 95.93      | 4.41     | 3.73
20       | 96.78      | 3.22     | 3.22     | 97.8       | 1.86     | 2.54     | 96.1       | 2.88     | 4.92
25       | 98.14      | 2.03     | 1.69     | 98.47      | 1.36     | 1.69     | 96.78      | 2.88     | 3.56
30       | 98.31      | 1.69     | 1.69     | 97.3       | 2.62     | 2.77     | 97.12      | 3.22     | 2.54

TABLE VI: ACCURACY, FALSE POSITIVES (FP) AND FALSE NEGATIVES (FN) FOR THE BEST THREE RNN CONFIGURATIONS (A, B, C) FOUND THROUGH RANDOM SEARCH, USING 10-FOLD CROSS VALIDATION ON THE ENTIRE DATASET

[Fig. 3. Classification accuracy with increasing data collection time]

The 10-fold cross validation gives similar results (Table VI) to the train-test split, but demonstrates a more consistent trend of accuracy increasing with execution time.

Considering our initial question - is it possible to predict malware in 3 seconds? - the results indicate that this may be possible with 90% accuracy. We do not achieve the desired 95% accuracy until 8 seconds. We can, however, use these figures as initial estimates of the trade-off between accuracy and time into file execution. Beyond 15 seconds, the value of an extra second of data is much lower than during the initial 5 seconds.

B. Impact of varying data snapshot intervals

In the following experiments we analyse the results of 10-fold cross validation across the entire dataset (1188 samples), as in Table VI, for the reasons outlined in the previous section.

The results from Section IV-A indicate that the RNNs improve in classification accuracy with longer sequence lengths. This may be because later execution data gives a better indication of whether or not samples are malicious, or because more data gives the model the ability to learn the nuances between malicious and benign machine activity better. We hypothesise that malicious activity begins as soon as possible once the executable begins running, because this reduces the overall runtime of the file and thus the window of opportunity for being disrupted (by a detection system, analyst, or technical failure). More sequential data may enable the model to learn relationships between multiple features which are indicative of malware; activities A and B may be innocuous individually, but typically represent malicious activity when combined.

We vary the interval between feature sets to explore the question of whether later execution data or longer sequences have a higher correlation with model accuracy. Using data collected in the first minute of file execution, we increase the interval between data snapshots from 1 second to 2, 3, 4 and 5 seconds, as sketched below.
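A widened snapshot interval is equivalent to subsampling the one-second trace (a small sketch under that assumption):

```python
def widen_interval(sequence, k):
    """Keep every k-th one-second snapshot, giving a k-second interval."""
    return sequence[::k]

# e.g. 60 one-second snapshots at a 5-second interval -> 12 feature sets
```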

Figure 4 shows a general trend of increasing accuracy with more feature sets for each of the different time intervals, with the 3-, 4- and 5-second intervals outperforming the 1-second interval data. When mapped onto real time, as illustrated in Figure 5, it is clear that the one-second interval gains the highest accuracy the most quickly. With only four feature sets, data collected at 5-second intervals (95% accurate) outperforms the 1-second interval data (92% accurate). However, in real time the 5-second data is using data from 20 seconds into the file, at which point the 1-second data reaches an F-score of 0.97. In real time, shorter intervals generally outperformed longer ones.

[Fig. 4. Accuracy vs. number of feature sets for increasing interval time between data snapshots]

[Fig. 5. Accuracy vs. real time data collection time for increasing interval time between data snapshots]

Minutes and seconds are human constructs, and the choice of one-second intervals is just an arbitrary small amount of time chosen for these experiments. The pattern indicated here hints that the interval could be further reduced to improve accuracy. Further, it appears that sequence length, rather than a particular time into file execution, gives the classifier the information required to make an accurate prediction. This trend of reducing sequence length to gain higher accuracy earlier is likely to have limitations, and should be investigated in future work.

C. Improving prediction accuracy with an ensemble classifier

Accuracy appears to increase with longer sequence lengths. We hypothesise that this is because longer sequences temper anomalous or unrepresentative feature sets. When adding 1 more second of data to a 2-second sequence, the new data comprises one third of the data, but when increasing from 19 to 20 seconds, the new set is just 5% of the data classified. Analysis of false positives and false negatives shows that the model grows in confidence over time. Table VII shows how predictions move towards 0 and 1 on average for both correctly and incorrectly classified samples.

Time (s) | TN   | FN   | TP   | FP   | FN (%) | FP (%)
2        | 0.10 | 0.73 | 0.89 | 0.32 | 15     | 12
3        | 0.06 | 0.75 | 0.91 | 0.25 | 10     | 9
4        | 0.04 | 0.80 | 0.95 | 0.23 | 10     | 8
5        | 0.03 | 0.86 | 0.96 | 0.22 | 8      | 8

TABLE VII: AVERAGE PREDICTIONS FOR TRUE NEGATIVES (TN), FALSE NEGATIVES (FN), TRUE POSITIVES (TP) AND FALSE POSITIVES (FP), WHERE BENIGN = 0 AND MALICIOUS = 1, WITH INCREASING TIME, AND CORRESPONDING PERCENTAGE OF INCORRECTLY CLASSIFIED MALWARE (FN) AND BENIGNWARE (FP)

We attempt to temper the misguided confidence of the models in misclassifying executables as sequences grow longer. During training, neural network weights are adjusted according to some weight update rule (here Adam or SGD), but may not reach the globally optimal set of weights. Different hyperparameter configurations are likely to settle on different final weights for the network. For this reason, it may help to use an ensemble classifier, which combines the predictions of multiple models. Ensembles can often yield better predictions than single models. Dietterich [47] suggests that ensemble models can improve on individual models by averaging out poor predictions, mitigating the skew of models converged at local minima, or capturing the complexity of a problem for which a single model will not suffice. An ensemble of models may improve classification since we have not fine-tuned the hyperparameters, but used a random search, and are unlikely to have found the globally optimal hyperparameter set for classification. Configurations A, B, and C use the same model structure but are trained on different data for each additional second into the feature set; no single model in Table VI consistently achieves the highest accuracy across time steps, implying that each places different emphasis on different inputs and abstracted combinations of input data.

We create two ensemble classifiers: one to capture the differences between model configurations and one to capture the different learning emphases across sequence lengths.

For the first, we adjust the batch size for models A, B and C to the batch size for model A (58), so that classification can be parallelised, and average the predictions for each sample, as sketched below.
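A minimal sketch of this first ensemble, assuming p_a, p_b and p_c hold each model's per-sample malicious probability (variable names are illustrative):

```python
import numpy as np

# Per-sample malicious probabilities from the three configurations,
# e.g. p_a = model_a.predict(X_test).ravel()
ensemble_p = (p_a + p_b + p_c) / 3.0               # average the three predictions
ensemble_label = (ensemble_p >= 0.5).astype(int)   # 0 = benign, 1 = malicious
```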

Adjusting the batch size gives slightly different accuracy scores for models B and C to those reported in Table VI. The improvements from this ensemble, outlined in Table VIII, are most marked when classifying the shorter sequences. Though the gains in accuracy are modest, the increase in confidence for accurate predictions is statistically significant at the 1% level for sequences of 4 seconds or less, and at the 5% level for sequences of 5 seconds. We calculate this score by taking the difference between the binary class, b, of the sample and the model prediction, p. Accurate and confident predictions gain a score close to 1, and inaccurate or less confident predictions a score closer to 0:

    confidence = 1 - |b - p|

Time (s) | Best accuracy of config.s A, B, C (%) | Ensemble accuracy (%)
2        | 88.14 (B)                             | 88.81**
3        | 90.25 (A)                             | 90.08**
4        | 91.61 (A)                             | 92.8**
5        | 92.37 (A)                             | 92.37*
6        | 93.64 (B)                             | 94.24
7        | 95.34 (C)                             | 95.34
8        | 94.83 (A,B)                           | 95.34

TABLE VIII: BEST ACCURACY OF CONFIG.S A, B AND C WITH ENSEMBLE CLASSIFIER ACCURACY. * = INCREASE IN PREDICTIVE CONFIDENCE OVER BEST INDIVIDUAL MODEL IS STATISTICALLY SIGNIFICANT AT 5%; ** = AT 1%
For the second ensemble, different models classify every sub-sequence of the data of at least 2 feature sets, as well as the entire sequence itself, before the predictions are averaged. The number of models increases proportionally with sequence length, and the number of classifications grows quadratically, as n(n-1)/2, where n is the sequence length. For example, if the start and end of a sequence are represented as [start, end], and the model m used to classify a sequence of length n is represented by m_n, then using 4 seconds of data we must perform the following classifications:

    m_4([1, 4])
    m_3([1, 3]), m_3([2, 4])
    m_2([1, 2]), m_2([2, 3]), m_2([3, 4])
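A small sketch that enumerates these windows (1-based, inclusive, matching the paper's notation); each window would be classified by the model trained on its length and the outputs averaged:

```python
def subsequences(n):
    """All contiguous windows of length >= 2 in an n-snapshot sequence."""
    return [(start, start + length - 1)
            for length in range(2, n + 1)
            for start in range(1, n - length + 2)]

print(subsequences(4))
# [(1, 2), (2, 3), (3, 4), (1, 3), (2, 4), (1, 4)] -> 6 = 4*(4-1)/2 classifications
```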
Time (s) | Single model (Config. A) (%) | Ensemble of time sub-sequence models (Config. A) (%)
2        | 87.08                        | 87.08
3        | 88.93                        | 89.01
4        | 91.08                        | 91.16*
5        | 91.67                        | 92.29*
6        | 93.01                        | 93.12*
7        | 94.43                        | 94.06*
8        | 94.68                        | 95.47*

TABLE IX: SINGLE MODEL CLASSIFIER ACCURACY AGAINST ENSEMBLE CLASSIFIER USING ALL SUB-SEQUENCES OF THE DATA. * = INCREASE IN PREDICTIVE CONFIDENCE OVER SINGLE MODEL IS STATISTICALLY SIGNIFICANT AT 5%; ** = AT 1%

[Fig. 6. Frequency distribution of prediction values from sequence length 2 to 5 for all 10 folds. False predictions shown in green, true predictions shown in dark blue]

The results in Table IX indicate that, again, the ensemble classifier delivers marginal gains by comparison with the original model at almost every point into execution. The increase in the confidence of predictions is not as significant in the early stages as for the first (multi-configuration) ensemble. We believe that the short-sequence predictions are more divergent for the early models, as a single feature set is able to play a more significant role in a short sequence, thus having a stronger influence over the final prediction.

D. Interpreting the classifier

If the gains in accuracy for the ensemble classifier are due to differences in the features learned by the networks, this could help to protect against adversarial manipulation of data. We attempt to interpret what configurations A, B, and C are using to distinguish malware from benignware. These preliminary tests seek to gauge whether it is possible to analyse the decisions made by the trained neural networks.

Neural networks do not lend themselves to interpretation easily; Model A has nearly 23,000 trainable parameters and 112 nodes in the hidden layers. However, interpretation can help experts to understand the results of algorithms and, as Ribeiro et al. [48] argue, to trust those results.

We look at the impact of the 11 features at both the testing and training phases. The former is intended to give insight into what the model has learnt, and the latter into the degree to which an RNN is able to find an alternative method of classification in the absence of a given feature.

Ribeiro et al. [48] propose a system for black-box testing machine learning classifiers to interpret individual (or groups of) model predictions. This system is predicated on the ability to turn inputs "off", which is easy in the case of categorical data, for which presence or absence can be represented as a binary value. It is more difficult to turn numeric data "off", as zero bears a relationship to the set of possible numeric values for that input. Because we preprocess the data by normalising to the training set mean of zero and unit variance, the mean of the training samples for each feature is zero. By setting the test data for a feature (or set of features) to zero, we can therefore approximate the absence of that information. We examine the impact on the overall accuracy, the true positive rate and the true negative rate of omitting (sets of) features.

We assess the overall impact of turning features off by observing the fall in accuracy and dividing it by the number of features turned off. A single feature incurring a 5 percentage point loss attains an impact factor of -5, but two features creating the same loss would be awarded -2.5 each. Finally, we take the average across impact scores to assess the importance of each feature when a given number of features are switched off, as sketched below.
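A sketch of this feature-occlusion scoring, assuming a compiled Keras model and standardised test data of shape (samples, timesteps, 11) (function and variable names are illustrative):

```python
import numpy as np

def impact_scores(model, X_test, y_test, base_acc, feature_sets):
    """Zero out feature columns (the training mean after standardisation) and
    share each resulting accuracy change equally among the omitted features."""
    shares = {}
    for features in feature_sets:              # e.g. [(3,), (3, 7), (3, 7, 9)]
        X_off = X_test.copy()
        X_off[:, :, list(features)] = 0.0      # "switch off" these feature columns
        acc = model.evaluate(X_off, y_test, verbose=0)[1]
        change = (acc - base_acc) * 100        # percentage points; negative = loss
        for f in features:
            shares.setdefault(f, []).append(change / len(features))
    # average each feature's per-omission share of the accuracy change
    return {f: np.mean(v) for f, v in shares.items()}
```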
         | 1 feature off      | 2 features off     | 3 features off
Model A  | -2.35 (Bytes Sent) | -2.22 (Bytes Sent) | -2.42 (Packets Received)
Model B  | -2.51 (Bytes Sent) | -2.41 (Bytes Sent) | -2.56 (CPU System)
Model C  | -3.13 (Time)       | -3.11 (Time)       | -3.1 (Time)

TABLE X: MOST INFLUENTIAL FEATURES WHEN 1, 2 AND 3 FEATURES HAVE BEEN TURNED OFF, 4 SECONDS INTO FILE EXECUTION

[Fig. 7. Impact scores for features with 1, 2 and 3 features turned off 4 seconds into file execution]

Table X and Figure 7 indicate a resemblance between the models using configurations A and B, and a different emphasis for the model based on configuration C. A and B appear to use packets, bytes and system CPU use to differentiate the two classes, which reflects the difference in frequency distribution of the inputs shown in Figure 2. Model C appears to have an unlikely reliance on the precise time since the last data snapshot.

By altering the training data, we can see how robust configuration C is to training without the temporal difference data. The changes in accuracy detailed in Table XI see a maximum fall of 7 percentage points in accuracy, and a gain of 0.4 percentage points in accuracy for the omission of Swap Use for Configuration B. By ranking the changes in accuracy we can see some clear patterns between the three configurations. Omission of swap use has a negligible impact on accuracy, whereas omission of CPU system use unanimously incurs the greatest loss in accuracy. The total number of processes appears in the top 4 for A, B, and C, though its individual impact score (when 1 feature is missing) does not generally appear in the top 4. Its impact score increases relative to others as more features are omitted; this may indicate that total processes are combined with other inputs to create discriminating features, though the input is not highly impactful alone.

Omitted feature  | Config. A (%) | Config. B (%) | Config. C (%) | Rank A | Rank B | Rank C
(None)           | 91.53         | 91.53         | 90.42         | -      | -      | -
CPU System       | 86.38         | 86.72         | 83.62         | 1      | 1      | 1
Total Processes  | 88.62         | 89.83         | 87.59         | 2      | 3      | 4
Packets Sent     | 89.48         | 91.55         | 88.62         | 3      | 9      | 7
Bytes Sent       | 89.83         | 90            | 87.41         | 4-6    | 4      | 3
Packets Received | 89.83         | 90.69         | 88.45         | 4-6    | 6      | 6
Memory Use       | 89.83         | 91.90         | 88.79         | 4-6    | 10-11  | 9-10
Time             | 90            | 89.66         | 87.76         | 7      | 2      | 5
CPU User         | 90.17         | 90.86         | 88.79         | 8-9    | 7      | 9-10
Bytes Received   | 90.17         | 91.38         | 86.55         | 8-9    | 8      | 2
Max. Process ID  | 90.34         | 90.52         | 88.97         | 10     | 5      | 8
Swap Use         | 90.52         | 91.90         | 90.52         | 11     | 10-11  | 11

TABLE XI: IMPACT ON ACCURACY AT 4 SECONDS OF REMOVING ONE FEATURE BEFORE TRAINING

We can begin to understand the decisions made by the 4-second models using these methods; these preliminary findings indicate that we may be able to characterise model decisions. Characterising the decision process may in turn form the basis for analytics tools to help security analysts further understand malware behaviour and patterns within malware families.

V. LIMITATIONS AND FUTURE WORK

The experiments in this paper sought to explore the plausibility and limitations of online dynamic malware detection. Whilst the results presented indicate that the entire file execution is not necessary to achieve high classification accuracy, this is the first step in a series of questions leading to a possible endpoint solution for detecting malware on a device or in a network.

In the future we would like to examine the robustness of our proposed system in light of the following:
- A larger dataset
- Reduced hyperparameter search time
- Robustness to adversarial samples

Our experiments indicated that recurrent neural networks can outperform other machine learning classifiers trained on the sequential dataset. It is acknowledged that more data prevents neural networks from over-fitting to the training set and allows them to learn better representations of the classes [46]. Our own experiments in Sections IV-B and IV-C indicate that more data improved accuracy for our proposed system. There are exceptions, as Huang and Stokes [5] found that increasing their very large dataset for malware classification offered little improvement to classification results; the authors proffer that for their malware classification task, increasing dataset size does not give the improvements seen in other domains such as image classification. However, results in [38], [49], [32] indicate that accuracy increases with volume of data up to a point, before plateauing.

Beyond possible increases in accuracy, the system should be tested with more data. If hundreds of thousands of new samples are being committed for analysis each day, the information that the model can glean from these samples needs to be incorporated into the model. More data increases model training time, and it may be necessary to select a sub-sample of new malware each day upon which to retrain the model. Certainly it is necessary to choose a subset of all malware samples reported over time.

If the model is retrained daily, this places greater emphasis on the hyperparameter selection being fast. Rather than global optimisation, it may be preferable to reach the highest accuracy possible in the shortest amount of time. Future work could examine methods for navigating the hyperparameter search space sequentially to reduce the time taken to find a good representation.

Finally, the robustness of this approach is limited if adversaries know that the first 5 seconds are being used to determine whether a file will run in the network. By planting long sleeps or benign behaviour at the start of a malicious file, adversaries could avoid detection in the virtual machine. We hypothesised that malicious executables begin attempting their objectives as soon as possible to mitigate the chances of being interrupted, but this would be likely to change if malware authors knew which parts of the file were being examined by security systems. One potential solution could be to train the model using such samples; this is a technique suggested by [4] in the authors' attempt to classify adversarially crafted static malware data. Another solution, for a live classification system, could be to monitor file behaviour over a sliding window (of 5 seconds, for example), which would potentially be able to detect malicious behaviour regardless of the time into the file at which it occurred; in this case the model should be retrained using machine activity data capturing multiple processes at once, together with API call sequences, so that the calls from particular files can be distinguished from one another.

VI. CONCLUSIONS

In this paper we have shown that it is possible to achieve an accuracy greater than 93% with just 4 seconds of dynamic data, using an ensemble of recurrent neural networks, and an accuracy of 98% in less than 20 seconds.

The best RNN configurations discovered through random search each employed bidirectional hidden layers, indicating that making use of the sequence progressing as well as regressing in time aided the distinction between malicious and benign behavioural data.

The RNN models outperformed other machine learning classifiers in analysing the unseen test set, though the other algorithms performed competitively on the training set. This indicates that the RNN was more robust against overfitting to the training set than the other algorithms.

To date, this is the first analysis of the extent to which a file can be predicted to be malicious during its execution rather than using the complete log file post-execution.

REFERENCES

[1] C. Crime and I. P. Section, "How to protect your networks from ransomware."
[2] VirusTotal, "Statistics - VirusTotal," 2017. [Data for 26 April 2017].
[3] P. Vinod, R. Jaipur, V. Laxmi, and M. Gaur, "Survey on malware detection methods," in Proceedings of the 3rd Hackers' Workshop on Computer and Internet Security (IITKHACK09), pp. 74-79, 2009.

[4] K. Grosse, N. Papernot, P. Manoharan, M. Backes, and P. McDaniel, "Adversarial perturbations against deep neural networks for malware classification," arXiv preprint arXiv:1606.04435, 2016.
[5] W. Huang and J. W. Stokes, "MtNet: A multi-task neural network for dynamic malware classification," in Proceedings of the 13th International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment - Volume 9721, DIMVA 2016, (New York, NY, USA), pp. 399-418, Springer-Verlag New York, Inc., 2016.
[6] I. Firdausi, C. Lim, A. Erwin, and A. S. Nugroho, "Analysis of machine learning techniques used in behavior-based malware detection," in 2010 Second International Conference on Advances in Computing, Control, and Telecommunication Technologies, pp. 201-203, Dec 2010.
[7] W. Hardy, L. Chen, S. Hou, Y. Ye, and X. Li, "DL4MD: A deep learning framework for intelligent malware detection," in Proceedings of the International Conference on Data Mining (DMIN), p. 61, The Steering Committee of The World Congress in Computer Science, Computer Engineering and Applied Computing (WorldComp), 2016.
[8] R. Pascanu, J. W. Stokes, H. Sanossian, M. Marinescu, and A. Thomas, "Malware classification with recurrent networks," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1916-1920, April 2015.
[9] G. Wagener, A. Dulaunoy, et al., "Malware behaviour analysis," Journal in Computer Virology, vol. 4, no. 4, pp. 279-287, 2008.
[10] R. Tian, R. Islam, L. Batten, and S. Versteeg, "Differentiating malware from cleanware using behavioural analysis," in Malicious and Unwanted Software (MALWARE), 2010 5th International Conference on, pp. 23-30, IEEE, 2010.
[11] T. Shibahara, T. Yagi, M. Akiyama, D. Chiba, and T. Yada, "Efficient dynamic malware analysis based on network behavior using deep learning," in 2016 IEEE Global Communications Conference (GLOBECOM), pp. 1-7, Dec 2016.
[12] U. Bayer, E. Kirda, and C. Kruegel, "Improving the efficiency of dynamic malware analysis," in Proceedings of the 2010 ACM Symposium on Applied Computing, pp. 1871-1878, ACM, 2010.
[13] A. Graves, "Generating sequences with recurrent neural networks," arXiv preprint arXiv:1308.0850v5, 2014.
[14] J. Saxe and K. Berlin, "Deep neural network based malware detection using two dimensional binary program features," in 2015 10th International Conference on Malicious and Unwanted Software (MALWARE), pp. 11-20, Oct 2015.
[15] L. Nataraj, V. Yegneswaran, P. Porras, and J. Zhang, "A comparative assessment of malware classification using binary texture analysis and dynamic analysis," in Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence, AISec '11, (New York, NY, USA), pp. 21-30, ACM, 2011.
[16] A. Damodaran, F. D. Troia, C. A. Visaggio, T. H. Austin, and M. Stamp, "A comparison of static, dynamic, and hybrid analysis for malware detection," Journal of Computer Virology and Hacking Techniques, vol. 13, no. 1, pp. 1-12, 2017.
[17] S. S. Hansen, T. M. T. Larsen, M. Stevanovic, and J. M. Pedersen, "An approach for detection and family classification of malware based on behavioral analysis," in 2016 International Conference on Computing, Networking and Communications (ICNC), pp. 1-5, Feb 2016.
[18] S. Hsiao, Y. S. Sun, and M. C. Chen, "Virtual machine introspection based malware behavior profiling and family grouping," CoRR, vol. abs/1705.01697, 2017.
[19] S. Tobiyama, Y. Yamaguchi, H. Shimada, T. Ikuse, and T. Yagi, "Malware detection with deep neural network using process behavior," in 2016 IEEE 40th Annual Computer Software and Applications Conference (COMPSAC), vol. 2, pp. 577-582, June 2016.
[20] B. Kolosnjaji, A. Zarras, G. Webster, and C. Eckert, "Deep learning for classification of malware system call sequences," in Australasian Joint Conference on Artificial Intelligence, pp. 137-149, Springer, 2016.
[21] M. Neugschwandtner, P. M. Comparetti, G. Jacob, and C. Kruegel, "Forecast: skimming off the malware cream," in Proceedings of the 27th Annual Computer Security Applications Conference, pp. 11-20, ACM, 2011.
[22] Z. C. Lipton, "A critical review of recurrent neural networks for sequence learning," CoRR, vol. abs/1506.00019, 2015.
[23] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: A neural image caption generator," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[24] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[25] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157-166, 1994.
[26] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, "LSTM: A search space odyssey," IEEE Transactions on Neural Networks and Learning Systems, 2016.
[27] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.
[28] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," CoRR, vol. abs/1412.3555, 2014.
[29] B. Kolosnjaji, A. Zarras, G. Webster, C. Eckert, and Q. Bai, "Deep learning for classification of malware system call sequences," pp. 137-149, Cham: Springer International Publishing, 2016.
[30] C. Guarnieri, A. Tanasi, J. Bremer, and M. Schloesser, "The Cuckoo Sandbox," 2012.
[31] B. Quintero, E. Martínez, V. Manuel Álvarez, K. Hiramoto, J. Canto, and A. Bermúdez, "VirusTotal," 2004.
[32] Z. Yuan, Y. Lu, and Y. Xue, "DroidDetector: android malware characterization and detection using deep learning," Tsinghua Science and Technology, vol. 21, no. 1, pp. 114-123, 2016.
[33] F. Ahmed, H. Hameed, M. Z. Shafiq, and M. Farooq, "Using spatio-temporal information in API calls with machine learning algorithms for malware detection," in Proceedings of the 2nd ACM Workshop on Security and Artificial Intelligence, pp. 55-62, ACM, 2009.
[34] M. Imran, M. T. Afzal, and M. A. Qadir, "Using hidden Markov model for dynamic malware analysis: First impressions," in 2015 12th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), pp. 816-821, Aug 2015.
[35] G. E. Dahl, J. W. Stokes, L. Deng, and D. Yu, "Large-scale malware classification using random projections and neural networks," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3422-3426, IEEE, 2013.
[36] P. Burnap, A. Javed, O. F. Rana, and M. S. Awan, "Real-time classification of malicious URLs on Twitter using machine activity data," in 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pp. 970-977, Aug 2015.
[37] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, "Efficient backprop," in Neural Networks: Tricks of the Trade, pp. 9-48, Springer, 2012.
[38] W.-C. Wu and S.-H. Hung, "DroidDolphin: a dynamic Android malware detection framework using big data and machine learning," in Proceedings of the 2014 Conference on Research in Adaptive and Convergent Systems, pp. 247-252, ACM, 2014.
[39] F. Chollet, "Keras," 2015.
[40] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," arXiv preprint arXiv:1603.04467, 2016.
[41] E. Frank, M. A. Hall, and I. H. Witten, The WEKA Workbench. Online Appendix for "Data Mining: Practical Machine Learning Tools and Techniques," Morgan Kaufmann, 4th ed., 2016.
[42] W. Yan and L. Yu, "On accurate and reliable anomaly detection for gas turbine combustors: A deep learning approach," in Proceedings of the Annual Conference of the Prognostics and Health Management Society, 2015.
[43] J. Snoek, H. Larochelle, and R. P. Adams, "Practical Bayesian optimization of machine learning algorithms," in Advances in Neural Information Processing Systems, pp. 2951-2959, 2012.
[44] J. Bergstra and Y. Bengio, "Random search for hyper-parameter optimization," Journal of Machine Learning Research, vol. 13, no. Feb, pp. 281-305, 2012.
[45] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," CoRR, vol. abs/1412.6980, 2014.
[46] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929-1958, 2014.
[47] T. G. Dietterich, "Ensemble methods in machine learning," pp. 1-15, Berlin, Heidelberg: Springer Berlin Heidelberg, 2000.
[48] M. T. Ribeiro, S. Singh, and C. Guestrin, "'Why should I trust you?': Explaining the predictions of any classifier," CoRR, vol. abs/1602.04938, 2016.
[49] M. E. Aminanto and K. Kim, "Deep learning in intrusion detection system: An overview," 2016.
