System
Abstract
Automatic virus analysis is an important component of the IBM/Symantec
Digital Immune System [?]. It attempts to determine whether a
given object is a computer virus and creates antigen if it is. It must do so
without human intervention in order to respond to a virus threat faster
than the virus can spread.
Using this system, we are able to respond to new viruses automatically.
In this paper, I discuss our implementation and why it works so well for
this type of security threat. I end by discussing the barriers that need to
be overcome in order to deal with other security threats in a similar manner.
1 Malware
The Digital Immune System is designed specifically to deal with the threat of
computer viruses. However, viruses are merely a subset of all malicious software
(Malware), which in turn is a subset of all software. Malware accounts for a
large proportion of computer security incidents, if not the vast majority, but
it affects Microsoft Windows systems (see footnote 1) almost exclusively.
Malware is loosely defined as software with malicious intent. If we take the
entire set of software (see figure 1), we assume the vast majority of available
software is not programmed with any malicious intent. Some of this software
contains noticeable bugs that may or may not cause some damage. However,
we normally assume the intent is benign, and do not consider such anomalies
malicious.
A small portion of the available software was programmed with malicious
intent. This is what we call Malware. However, this term is imprecisely defined.
In order to detect Malware, we need to define a measurable property with which
we can detect it.
Of all Malware, viruses are the most precisely defined, having a property,
which we will call the virus property, that is defined to be malicious. I define
worms as a type of virus, as they share this virus property. That leaves Trojan
horses. These are problematic in that there is no single property we can use to
identify a Trojan horse.
[Figure 1: Malware. Labels: Good, Bad, Bugs, Trojans, Viruses]
1 This is due nearly entirely to the pervasiveness of this family of operating system, and
Before returning to Trojan horses, I will explore viruses in more detail.
Cohen’s PhD thesis [?] and book [?] remain the most rigorous theoretical
treatment of viruses to this day. In his thesis, Cohen constructs a mathematical
definition of the virus property for a Turing machine, which is very general and
comprehensive. However, for practical purposes we use a definition like this one:
There are other similar definitions, some more and some less precise than the
one I gave. In discussing viruses, I will also refer to the virus property for
convenience as the property of self-replication.
N.B.: These definitions assume that you can define “program” and “routine”
for a given operating system. For a system where such entities do not exist, we
would need a different definition. The definition also implies that the virus may
modify itself while copying, and the modified version must likewise exhibit the
virus property.
The definition is usually extended to include the creation of new executable
objects, i.e., without requiring an existing host executable to modify. In this
case, it is vital that these new objects are likely to be executed. This will depend
heavily on the operating system and typical user behavior.
This definition has false positives built in. A “copy” program may copy
itself, and therefore exhibits the virus property. By convention, we do not call
such programs viruses, although, by definition, we should. We must identify
such exceptions to the rule and prevent these benign programs from being
flagged as viruses.
Based on a more general and formal definition, Dr. Cohen proved, amongst
other things, that the virus property is undecidable. This has serious conse-
quences as we will discuss later.
A virus may also contain further properties or payloads apart from its
self-propagation, such as destruction or espionage, which we consider properties
separate from the virus property. Although these are of concern to the affected
user, they are not important in detecting the virus itself.
[Figure: labels Software, Observer 1, Observer 2]
Viruses have taken many forms in the last 14 years, and we will continue to
see many variations in the future. Currently, the most common types are the
so-called macro viruses, the Win32 viruses and script viruses [Edi00]. Recently
we have seen increasing numbers of Visual Basic Script (VBScript) viruses, and we
expect to see more in the future.
The AV industry has managed to define the virus property as malicious in the
public consciousness. This is important for us, but it may not be obvious why.
Self-propagation need not be malicious and in nature is usually encouraged.
There have been attempts to create “good” self-replicating agents (viruses).
However, the AV industry has done a good job of pointing out that there is nothing
that can be done in computing with self-replicating agents that cannot be done
in a more secure and controlled manner with non-self-replicating agents. This
allows us to use “self replicating” as a synonym of “malicious” in the context of
software.
This means that we now have a measurable property that viruses must ex-
hibit: self-replication. This property is well-defined and also always malicious,
so identifying this property will not cause false positives. Thus, we have a start-
ing point for automating the analysis of viruses. Such a system needs no human
to make the judgment call: “is the intent of this software malicious?”. The
system itself can identify a malicious object by observing the self-replication.
The downside is that there will be false negatives: viruses that are not
detected. There will always be viruses for which the “tools” for measuring the
virus property have not been invented. For this reason, humans will be called
upon to teach the analysis system how to detect these new viruses. More on
this later.
[Figure: Virus scanner, with components String matching, Parsers and Emulator, scanning the byte stream RUBARB...VIRUSCODE...RUBARB]
We do not have the same luxury with other types of malware. It is hard to pin
one particular property on “Trojan horses”. In general, “intent” is hard even for
a human to identify and is impossible to measure, but malicious intent is what
makes code a Trojan horse.
It may be possible to define some subclasses of Trojan horses. But even if we
try to define something as specific as “password stealing Trojan” we may run
into trouble. Does this include a program that reads the password file and sends
one password to another machine? This could also be a poorly designed pass-
word authentication system. So, even in cases where we can define a property,
we cannot automatically associate this property with “maliciousness” without
human help.
The antivirus industry deals with viruses piece by piece: for every virus
found, a specific antigen is found and deployed. This means that the industry
is perpetually playing catch up. This is far from satisfactory, but works for two
reasons. Firstly, the industry can usually keep false positives very low by
only detecting known viruses. Secondly, we can remove the offending virus from
the system, but only if the virus has been analyzed beforehand (see footnote 2).
The classical virus scanner used to be a string search tool and a set of strings.
A version of the UNIX program “grep” that takes binary string patterns would
be sufficient. More recently, virus scanners have become much more complex.
The early string-based virus scanners were susceptible to false positives. One
way of dealing with this was to ensure that the string was unlikely to cause false
positives. This was done by making sure it was long enough and chosen carefully.
A set of checksums over certain parts of the virus is used to augment the string.
This allows so-called “exact identification”. Virus removal instructions are also
embedded in what we now call a virus signature.
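To make the combination of search string and checksum concrete, here is a minimal Python sketch. It is not the signature format of any real scanner: the names Signature, make_signature and scan are hypothetical, and CRC-32 merely stands in for whatever checksum a product actually uses. A string match only nominates a location; the checksum over the surrounding region must then agree before the file is reported.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Signature:
    name: str
    pattern: bytes       # search string that nominates a candidate location
    region_length: int   # bytes, counted from the match start, covered by the checksum
    checksum: int        # CRC-32 of that region in a known sample


def make_signature(name, sample, offset, pattern_length, region_length):
    """Build a signature from a known infected sample (illustrative only)."""
    pattern = sample[offset:offset + pattern_length]
    region = sample[offset:offset + region_length]
    return Signature(name, pattern, region_length, zlib.crc32(region))


def scan(data, signatures):
    """Return the names of signatures whose string matches AND whose checksum agrees."""
    hits = []
    for sig in signatures:
        start = data.find(sig.pattern)
        while start != -1:
            region = data[start:start + sig.region_length]
            if len(region) == sig.region_length and zlib.crc32(region) == sig.checksum:
                hits.append(sig.name)
                break
            # The string matched but the checksum did not: try the next occurrence.
            start = data.find(sig.pattern, start + 1)
    return hits
```

A file that happens to contain the search string but differs in the checksummed region is not reported, which is the point of augmenting the string.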
The intolerance for false positives cannot be overemphasized. Other intrusion
detection tools have an astoundingly high false positive rate, but also fairly
high false negative rates. In contrast, the antivirus industry is a fairly mature
part of the security industry and has learned that false positives often cost
customers as much as false negatives. Dealing with fixed objects and not sentient
beings has the advantage that the product can be thoroughly tested by third
parties and the vendor.
[Figure: labels Central node, Company’s perimeter, Company node, Client’s PC]
2 However, generic or heuristic disinfection is being incorporated into virus scanners, but it
Antivirus software has to be updated very frequently. Very soon, we should be
updating our antivirus software on a daily, perhaps even hourly, basis to maintain
basic protection. This means even more skilled manpower is needed to analyze
viruses around the clock. Even then, an unknown virus that is spreading on
our machine will not be found using that version, so we need more than that:
we need a Digital Immune System.
Dataflow
1. Determine whether the sample replicates and generate enough samples for
analysis.
2. Then it must generate an antigen for that virus.
3. Finally, that antigen must be tested for false positives and false negatives.
We work under the assumption that such a system will work well nearly all the
time, and fail in the worst possible manner when it fails. We therefore put a lot
of thought and effort into the tests.
The system is designed with reliability in mind to maintain a high degree of
availability. The system is composed of a number of modules that perform well-
defined tasks and are called from a central queuing and scheduling entity that
we call the Dataflow system (see figure 5). Each module runs to completion and
then terminates. The Dataflow system monitors these modules and terminates
them if they appear to have crashed or are hung. The only part of the system
that never terminates is the Dataflow system itself.
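The supervision described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the Dataflow system itself: modules are modeled as child processes, a fixed time budget stands in for hang detection, and the names python_module, run_module and run_pipeline are hypothetical.

```python
import subprocess
import sys


def python_module(code):
    """Hypothetical stand-in for a real module: a short-lived Python child process."""
    return [sys.executable, "-c", code]


def run_module(command, timeout_seconds):
    """Run one module to completion; terminate it if it exceeds its time budget."""
    try:
        result = subprocess.run(command, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so nothing is left running.
        return "hung"
    return "ok" if result.returncode == 0 else "crashed"


def run_pipeline(modules, timeout_seconds=60.0):
    """Run (name, command) pairs in order and stop at the first failure.

    Returns (failed_module_name, status), or (None, "ok") if every module succeeded.
    """
    for name, command in modules:
        status = run_module(command, timeout_seconds)
        if status != "ok":
            return name, status
    return None, "ok"
```

Because each module is a separate process that terminates, a crash or hang in one module cannot take the supervising loop down with it.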
Classifier
The roles of the modules are to prepare, to replicate, to analyze and to test. We
call the preparation phase the classifier as it determines what type of sample it
is and sets up directories and data structures for the next phases.
Replication controller
The replication phase comprises two modules. The replication controller sets
up one or more concurrent replications and then passes control to the replicators,
via the Dataflow system. When they are finished, they pass control back to the
controller, which determines whether there are enough replicants. It is the
controller that sets the general strategy for replication, but the replicators do
the actual work according to the scripts they have been given. This means that the
controller must do a part of the actual virus analysis.
There may be many rounds of replication involved, until enough (or any)
replicants are produced. Some viruses are difficult to replicate and require
quite obscure techniques. The controller picks these techniques as they become
necessary.
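The controller's strategy of escalating through techniques can be sketched as a simple loop. This is a sketch under assumptions, not our controller: a replicator is modeled as a plain function from a sample to a list of replicants, and the name control_replication and the technique labels a caller passes in are hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: a replicator takes the sample and returns any replicants.
Replicator = Callable[[bytes], List[bytes]]


def control_replication(sample: bytes,
                        techniques: List[Tuple[str, Replicator]],
                        enough: int = 3) -> Tuple[List[bytes], List[str]]:
    """Try one technique per round, in order, until enough distinct replicants exist.

    Returns the replicants collected and the names of the techniques that were tried.
    """
    replicants: List[bytes] = []
    tried: List[str] = []
    for name, replicate in techniques:
        tried.append(name)
        for candidate in replicate(sample):
            if candidate not in replicants:
                replicants.append(candidate)
        if len(replicants) >= enough:
            break  # later, more expensive techniques are never started
    return replicants, tried
```

Ordering the list from cheap to expensive means the harder techniques are only paid for by the samples that need them.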
Replication
The purpose of the replication is to replicate the virus onto files we have original
copies of. We need these files for generating the antigen and for testing; this
is also standard practice amongst human virus researchers. If replication is
successful, we know that the object is really a virus and we have the samples to
prove it.
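Keeping original copies makes it straightforward to find candidate replicants after a run: any file whose content no longer matches its recorded digest, or that did not exist before, is worth examining. The following Python sketch illustrates this under assumptions. File contents are held in memory as bytes, SHA-256 is an arbitrary choice of digest, and snapshot and changed_files are hypothetical names. A changed file is only a candidate; whether it is a replicant is decided by later analysis.

```python
import hashlib
from typing import Dict


def snapshot(files: Dict[str, bytes]) -> Dict[str, str]:
    """Record a digest of every original file before the sample is run."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}


def changed_files(baseline: Dict[str, str], after_run: Dict[str, bytes]) -> Dict[str, bytes]:
    """Return files that were modified or newly created during the run."""
    changed = {}
    for name, data in after_run.items():
        # baseline.get(name) is None for new files, so they are always reported.
        if baseline.get(name) != hashlib.sha256(data).hexdigest():
            changed[name] = data
    return changed
```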
The actual replication is done in an isolated environment. We currently
use emulators or virtual machines to run an entire virtual PC, although we
have developed pure hardware solutions as well for experimental purposes. Our
experiments have shown that although using a real replicator machine instead
of a virtual or emulated one sometimes achieves better results and replicates
faster (see footnote 3), it is far more difficult to integrate into the
automation of the analysis center.
The virus is inserted into a prepared image with the target operating system
and applications preinstalled. When the emulator is started, a program that we
call the replication controller is run. It executes commands on the operating
system in an attempt to replicate the suspected virus. The program that does
this is script driven and has the ability to detect virus activity on its own. For
example, in some cases, the virus only installs itself to be run the next time the
machine is rebooted. In this case, the script reboots the machine.
3 Although the setup time is typically much longer
[Figure: replication setup. 1. Installer: Sample, Goats, Applications. 2. Emulator/Virtual Machine: Operating System, Apps, Disk image, File Mon, Reg Mon, RC, Comm. server. 3. Extractor: Modified files.]
Analysis
The analysis takes the replicants that were generated and attempts to generate
antigen. This phase is proprietary to the target antivirus, but involves finding
good detection signatures. In the simplest case, the virus samples are all se-
quenced, and a string is selected to have a statistically low false positive and
false negative rate. Then identification and removal information is generated
based on the replicants and their originals [?][?]. There is usually a quick test
of the generated signature at this point before the new signature is sent to be
integrated into the master set of signatures.
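As a rough illustration of the simplest case, the sketch below picks a fixed-length byte string that occurs in every replicant and in none of a set of clean files. This swaps in a plain exclusion test for the statistical selection described above and is not the proprietary method; the names common_substrings and select_signature and the default length of 8 bytes are assumptions.

```python
from typing import List, Optional, Set


def common_substrings(samples: List[bytes], length: int) -> Set[bytes]:
    """All byte strings of the given length that occur in every sample."""
    if not samples:
        return set()

    def windows(data: bytes) -> Set[bytes]:
        return {data[i:i + length] for i in range(len(data) - length + 1)}

    common = windows(samples[0])
    for sample in samples[1:]:
        common &= windows(sample)
    return common


def select_signature(replicants: List[bytes],
                     clean_corpus: List[bytes],
                     length: int = 8) -> Optional[bytes]:
    """Pick a string present in every replicant and absent from every clean file."""
    candidates = common_substrings(replicants, length)
    usable = [c for c in candidates if not any(c in clean for clean in clean_corpus)]
    # min() is used only to make the choice deterministic for this sketch.
    return min(usable) if usable else None
```

Requiring the string in every replicant guards against false negatives on the samples at hand, and excluding anything seen in clean files guards against the most obvious false positives.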
[Figure: analysis stages. Raw samples; Preanalysis & sorting; Subtraction of originals; Pure samples; Pure viruses; Sequencing; Signature]
Serialization
The signature is added to the current set of signatures at a serialization point
in the system.
Testing
The new combined signature set is tested against the samples that were gener-
ated as well as a standard set of clean files.
Then, at last, the new signature file is released to the communications system
for transportation to the customers.
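The release gate implied by this test can be sketched as follows. This is a simplified illustration: detection is reduced to a substring search, signatures are a mapping from a name to a byte string, and evaluate_signatures and safe_to_release are hypothetical names.

```python
from typing import Dict, List, Tuple


def evaluate_signatures(signatures: Dict[str, bytes],
                        infected: List[bytes],
                        clean: List[bytes]) -> Tuple[List[int], List[int]]:
    """Return indices of missed infected samples and of wrongly flagged clean files."""
    def detected(data: bytes) -> bool:
        return any(pattern in data for pattern in signatures.values())

    false_negatives = [i for i, data in enumerate(infected) if not detected(data)]
    false_positives = [i for i, data in enumerate(clean) if detected(data)]
    return false_negatives, false_positives


def safe_to_release(signatures: Dict[str, bytes],
                    infected: List[bytes],
                    clean: List[bytes]) -> bool:
    """Release only if every replicant is detected and no clean file is flagged."""
    false_negatives, false_positives = evaluate_signatures(signatures, infected, clean)
    return not false_negatives and not false_positives
```

Testing the combined set, rather than only the new signature, also catches a new signature that interferes with files the existing set previously handled correctly.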
There is another dimension to these modules: the different types of viruses that
the system handles. In the classifier phase, it is determined what type of sample
we were sent. Currently we handle PC-DOS, Macro, Win32 and can partially
handle VB-Script viruses. In future there will be other types of viruses for which
new modules will have to be written and, of course, the modules will have to be
improved as the viruses evolve.
3.2 Status
A version of the Digital Immune System clients, communication system and
analysis center has been successfully tested in a pilot program for some select
Symantec customers. In regression tests, it handles most of the macro viruses
and about half of the known Win32 and DOS viruses, although this number
is always increasing. It should also be noted that we expect to do better in
production use as the test set we use includes viruses that either do not run or
barely run. Input from the wild should usually only include those viruses that
can actually replicate; otherwise they would never have spread in the first place.
Currently, the system is being expanded for production use, so that it will be
capable of handling the expected load a widescale deployment would generate.
This should be deployed this year.
However, even if the communications part of the Digital Immune System is
finished, the analysis center is in constant flux, as new types and subtypes of
viruses are created. The analysis center can handle any previously known type
of virus, but must be taught how to handle new types of viruses. Fortunately,
the introduction of new types of viruses occurs far less frequently than new
viruses of existing types. Most viruses are members of large families, as they
are merely modified versions of other viruses.
5 Conclusion
Computer security is inherently hard; this much is known for sure. As new
potential for misuse shows up every day, we can only hope to patch up the
problems as fast as they get abused. With the Digital Immune System analysis
system, we have gone very far in automating the process of dealing with threats
from new viruses, an important part of computer security. We’ve been helped
by the fact that viruses are well-defined entities and that we can equate the
virus property unequivocally with maliciousness.
The success of this approach raises the question of whether we can find similar
techniques to deal with other types of Malware or even other forms of intrusion.
A first step seems to be finding a precise definition and then developing tests
for the properties of the entities. I believe this is possible with Malware, but
whether this approach will result in a useful system remains to be seen. The
approach applied to other forms of intrusion will only have limited success as
we are dealing directly with human beings.
References
[Edi00] Editors. Prevalence table. Virus Bulletin, page 3, September 2000.