How Anti-Virus Works (Some techniques)

Cowpipe · Jun 22, 2014

With all the allegations surrounding Tiranium antivirus spiralling out of control, and several people suggesting that the team restart from scratch and put together a new anti-virus product, I thought now would be a good chance to explain a few of the techniques that can be used. I would suggest that if they are doing it in VB.NET they start with just a file scanner, as any kind of real-time protection is extremely difficult to implement in .NET and other managed languages; for that, learning C/C++ is recommended.

Anyway for anyone interested I'll try to keep my explanation simple and easy to understand, providing a brief overview of the techniques and a start for anyone interested in developing their own anti-virus product.

File Scanning Basics:

When an anti-virus looks through your computer it doesn't inspect every single file. Instead, multiple techniques are used to determine whether a file can be considered a risk (and so worth scanning) or not. First of all there is the file extension: extensions such as .exe, .com, .scr, .dll, .cpl, .ocx and .sys are all types of executables as we know them (in the PE or MZ format). Then we have potential "hosts", which include .pdf, .wmf, .doc, .ppt, .xml and .zip (yup, wonder how many of you are too young to remember ZIP bombs?). File extensions that cannot be executed without being renamed are typically skipped unless "Smart file extensions" or a similar option is switched off.
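As a rough illustration of that first triage step, here is a minimal Python sketch. The two extension lists are simply the ones from this post, and the function names (risk_class, files_worth_scanning) are made up for the example; a real product would use longer, configurable lists.

```python
from pathlib import Path

# Extension lists taken from the post; illustrative, not exhaustive.
EXECUTABLE_EXTS = {".exe", ".com", ".scr", ".dll", ".cpl", ".ocx", ".sys"}
HOST_EXTS = {".pdf", ".wmf", ".doc", ".ppt", ".xml", ".zip"}


def risk_class(filename: str) -> str:
    """Return 'executable', 'host' or 'skip' based only on the file extension."""
    ext = Path(filename).suffix.lower()
    if ext in EXECUTABLE_EXTS:
        return "executable"
    if ext in HOST_EXTS:
        return "host"
    return "skip"


def files_worth_scanning(root: str):
    """Yield (path, class) for every file under root that is not skipped."""
    for path in Path(root).rglob("*"):
        if path.is_file():
            cls = risk_class(path.name)
            if cls != "skip":
                yield path, cls
```

Calling files_worth_scanning on a folder gives you the shortlist that the more expensive analysis steps would then work through.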

Of course when on "Quick Scan" mode, a much faster scan can be performed by just searching for these files within Program Files and system areas, as well as in memory (a scan of the processes currently running, tracing them to loaded DLL* files etc).

File Identification:

Just because a file is called imateapot.exe, that doesn't necessarily mean it is an executable. For example, if our antivirus simply relied on the extension of a file then, as a virus writer, all we would need to do is create a malicious Java application and give it an EXE extension. Our anti-virus would start trying to analyse the JAR file as if it were an executable PE/MZ*, meaning any signatures* we tried to apply would not match. It's like trying to look for a bag of frozen chips in the fresh fruit department of a supermarket. So the scanner has to look at the contents of the file to work out what type it really is.
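One common way to identify a file regardless of its name is to check the first few bytes (the "magic number"). The sketch below is only an illustration covering three formats; the function names are my own, and a real scanner would recognise far more types and look deeper than the first bytes.

```python
def identify(data: bytes) -> str:
    """Guess the real file type from its first bytes, ignoring the file name."""
    if data[:2] == b"MZ":
        return "mz-executable"   # DOS and PE executables start with 'MZ'
    if data[:4] == b"PK\x03\x04":
        return "zip-archive"     # ZIP local file header; JAR files are ZIP archives
    if data[:5] == b"%PDF-":
        return "pdf"
    return "unknown"


def identify_file(path: str) -> str:
    """Read just enough of the file to identify it."""
    with open(path, "rb") as fh:
        return identify(fh.read(8))
```

With this in place, a JAR renamed to imateapot.exe comes back as "zip-archive", so it can be handed to the right analyser instead of the PE one.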

File Analysis Basics:

So once we know that a file is a 'risk' file, we can begin to analyse it in detail using some of the techniques I'll describe below, all of which can be done in VB.NET (the language Tiranium is programmed in).

So there are basically three main ways that an anti-virus can see the difference between a malicious file and a benign or safe file. The first is using signatures (static and heuristic), this is known as "static analysis", the second is using "dynamic analysis", for example behavioural monitoring, sandboxing, loading the virus into a protected area of memory and monitoring everything it does, redirecting every action away from your computer and into the anti-virus engine for analysis. And the third is reputation based analysis, which is another form of static analysis.

As this post is already pretty long, I'll simply outline some of the basics of each method, with advantages and disadvantages, and I'll try to explain the technical bits as best as I can. Feel free to ask if there is anything you're unsure of, or indeed to point out anything that I have explained badly.

Signature Based Analysis:

Every executable has a specific structure which includes the file header (see 'terms'), the 'text' section, which is 'normally' where executable code is placed, and the import table (which reveals which DLLs the program relies on to operate [see 'terms']). Signature based analysis is, in a nutshell, scanning specific parts of the executable as above to look for patterns, also known as 'features'.

In traditional signature based analysis the anti-virus constructs a virtual map of the executable file and from that produces something called a binary tree (strictly speaking, what I describe here is a prefix tree, or "trie"), which is in essence a way of scanning through the entire file while only looking up the signatures from the database that are still possible to match at that point. For example, in the sentence "Humpty Dumpty Fell Off The Wall", let's pretend that "Fell Off The Wall" is the signature indicating we've caught the Humpty Dumpty virus.

The binary tree method scans through our file and sees the text (or bytes, in an executable) "Humpty". Our database doesn't contain any signatures that start with "Humpty", so we skip past that, and again for "Dumpty". "Fell" matches two signatures in our database, "Fell Off The Wall" and "Fell Off The Chair"; the latter is a different virus.

So comparing each word in our file against its possible matches in the database, we construct a tree which looks something like the following:

Fell
|
Off
|
The ------ Chair (route 2)
|
Wall (route 1)

And "walking" along that tree we can see that route 1 (signature 1 in our database) is a match for something found in our executable. We can now either flag the file straight away or, if we want to be sure it's malicious, perform a further in-depth analysis. This could include comparing the MD5 or SHA1 hash of the executable to a database of known hashes for this particular virus. (Basically, a hash is a way of reducing a set of data to a fixed-length series of letters and numbers determined by that exact data.)
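To make the tree walk concrete, here is a small Python sketch of the Humpty Dumpty example using a prefix tree built from nested dictionaries, followed by the optional hash check. It works on words to match the example above; on a real executable the tokens would be bytes. The signature names and the known_hashes set are hypothetical, and this is only one simple way of doing it, not how any particular engine is written.

```python
import hashlib


def build_trie(signatures):
    """Build a prefix tree from {signature_name: list_of_tokens}."""
    root = {}
    for name, tokens in signatures.items():
        node = root
        for tok in tokens:
            node = node.setdefault(tok, {})
        node[None] = name  # the None key marks the end of a complete signature
    return root


def scan(tokens, trie):
    """Return (position, signature_name) for each signature found in tokens."""
    hits = []
    for start in range(len(tokens)):
        node = trie
        for tok in tokens[start:]:
            if tok not in node:
                break  # no signature can still match from this start position
            node = node[tok]
            if None in node:
                hits.append((start, node[None]))
    return hits


def confirm_by_hash(data: bytes, known_hashes) -> bool:
    """Second opinion: is the SHA1 of the whole file in a set of known bad hashes?"""
    return hashlib.sha1(data).hexdigest() in known_hashes


# Hypothetical signature database for the example in the post.
SIGNATURES = {
    "HumptyDumpty.Wall": "Fell Off The Wall".split(),
    "HumptyDumpty.Chair": "Fell Off The Chair".split(),
}
trie = build_trie(SIGNATURES)
hits = scan("Humpty Dumpty Fell Off The Wall".split(), trie)
```

Here hits contains a single match for "HumptyDumpty.Wall" starting at the third word; "Humpty" and "Dumpty" are skipped immediately because nothing in the tree starts with them.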

Heuristic Analysis:

This technique is similar to the above, only the signatures are much more general and other techniques are used as well. One is something called entropy, which is a measure of how random or disorganised a file is (the more random a file's contents, the more likely it has been compressed or packed, as most compilers usually produce orderly, logical code). Another is detailed "PE analysis", which is essentially looking at the structure of the executable file: is it a standard structure, or is it unusual (an attempt to hide something, maybe a virus)?
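For anyone curious what an entropy check looks like, below is a minimal Python sketch of Shannon entropy measured in bits per byte. The 7.0 threshold in looks_packed is purely an assumption for illustration; real engines tune this and usually measure each section of the executable separately.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte: 0.0 for constant data, 8.0 for uniformly spread bytes."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def looks_packed(data: bytes, threshold: float = 7.0) -> bool:
    """Flag data whose entropy is above an (assumed, illustrative) threshold."""
    return shannon_entropy(data) > threshold
```

Compressed or encrypted data tends to score close to 8, while ordinary code and text score noticeably lower, which is why a very high value is treated as a hint that something has been packed.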

With heuristic analysis the aim is to look for new forms of malware that have yet to be discovered (eg: Humpty Dumpty's yet undiscovered nemesis, the machine gun)... One of the best ways to do this is to look, as many anti-virus companies have done, at something called "document classification". This is an area worthy of an article all on its own and is fascinating, but essentially there are two main concepts useful to malware analysis: Term Frequency (TF) and Inverse Document Frequency (IDF), usually combined as TF-IDF. The first technique simply measures how commonly a word occurs in a document.

For example, a document about cooking will likely contain words such as "prepare", "ingredients", "chop", "bake" etc, and these words will appear frequently in a cooking article but not in an article about banks. Of course other words such as "and", "the", "them" will appear frequently too. This is called "noise", which means that a computer has to have a way to determine that "and" and "the" contribute nothing to telling us what a document is about, whereas the words "ingredient" and "chop" do.

Inverse Document Frequency to the rescue. Without going into detail, this method allows us to separate the wheat from the chaff and determine the importance of a word to a document. It does this by giving little or no weight to words that appear very often across a whole set of unrelated documents, so in a library we would find millions of books containing "and" often but only a few hundred with "ingredient".
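Here is a tiny Python sketch of the idea, using the plain textbook formula (term frequency multiplied by the log of "number of documents divided by number of documents containing the word"). The three mini documents are made up for the example, and libraries often use smoothed variants of this formula.

```python
import math


def tf_idf(term, doc, corpus):
    """TF-IDF of term in doc, where doc is a list of words and corpus a list of docs."""
    df = sum(1 for d in corpus if term in d)  # how many documents contain the term
    if df == 0 or not doc:
        return 0.0
    tf = doc.count(term) / len(doc)           # how common the term is in this document
    return tf * math.log(len(corpus) / df)


# Made-up mini library: one cooking text and two unrelated ones.
cooking = "chop the ingredients and bake the cake".split()
banking = "the bank and the loan".split()
travel = "the train and the ticket".split()
corpus = [cooking, banking, travel]

noise_score = tf_idf("and", cooking, corpus)   # appears everywhere, so it scores zero
topic_score = tf_idf("chop", cooking, corpus)  # appears only in the cooking text
```

"and" appears in every document, so its score is zero, while "chop" only appears in the cooking text and gets a positive score. That is the wheat being separated from the chaff.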

Applying this to malware analysis there are a number of techniques which are in the realm of "data mining" and "machine learning", again articles all of their own are needed for me to explain these with any level of detail, so I'll simply skip to two common implementations.

The most often used implementation is something called byte n-grams. A byte is a building block of a computer program, and an n-gram is a run of n consecutive items (here, n consecutive bytes). So now instead of looking for words such as "ingredient" in our files, we look for series of bytes which occur most commonly in malicious programs. For example, let's say "4A 3B 2F 2F 3A" represents the word "machine gun" which, evil tool that it is, appears most frequently in backdoor trojans. Looking for that series of bytes will flag the file to us as suspicious, and finding combinations of these byte signatures will allow us to be up to 98% sure that we have detected a malicious program.
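A minimal Python sketch of extracting byte n-grams and checking them against a list of "suspicious" ones is below. The byte sequence is the pretend one from this post (so n is 5 here), and SUSPICIOUS and the function names are hypothetical; in practice the list of n-grams comes out of training on large sets of malicious and clean files rather than being typed in by hand.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int) -> Counter:
    """Count every overlapping run of n consecutive bytes in data."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def matching_ngrams(data: bytes, suspicious: set, n: int) -> set:
    """Return the suspicious n-grams that occur somewhere in data."""
    return set(byte_ngrams(data, n)) & suspicious


# The pretend "machine gun" sequence from the post (hypothetical, 5 bytes long).
SUSPICIOUS = {bytes.fromhex("4A3B2F2F3A")}

sample = b"\x00\x01" + bytes.fromhex("4A3B2F2F3A") + b"\x02"
found = matching_ngrams(sample, SUSPICIOUS, 5)
```

A classifier would then look at which combination of n-grams was found (and how often) to decide how confident it is, rather than flagging on a single hit.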

This specimen can now be automatically uploaded to human malware analysts for a proper looking over, and if it's malicious, static signatures can be made to identify it as a particular strain, "Trojan.HumptyDumpty.MachineGun", rather than "Trojan.Gen!3297" or whatever.

In future posts if anybody is interested I'll cover some basic dynamic analysis concepts as well as reputation analysis, but unfortunately I've just realised how long and probably boring this post already is. I hope that if you were bored to death, you at least learned something, and thank you very very much for reading, I've put over half an hour into typing all this out lol

Terms:

DLL stands for Dynamic Link Library and is used by Windows to hold "functions", or repeatable routines, which can be called by many different programs. For example, a DLL called kernel32.dll has a function called CreateFile which is the bridge between the program or virus and the operating system. So if the virus wants to download a file from the internet, it may call on CreateFile in kernel32.dll to handle all the heavy lifting involved with saving the file to the hard drive (which I won't go into for sake of simplicity).

PE and MZ stand for Portable Executable and Mark Zbikowski (the designer of the MS-DOS executable format). An MZ executable is the kind that runs in MS-DOS, hence whenever you open a PE file in a hex editor, you'll see the string "This program cannot be run in DOS mode". This is actually a miniature MS-DOS program called the MZ stub, which is designed to alert DOS users that they need to run the application on Windows. Checking the size and structure of this section (called a file header or MZ_HEADER) is one of the methods anti-virus programs use to look for suspicious files. (Some viruses used to hide code in this section, which older anti-viruses used to skip to speed up scan times.)
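As a small example of inspecting that header, the Python sketch below checks for the "MZ" marker, reads the 4-byte field at offset 0x3C (known as e_lfanew, which stores where the PE header begins) and confirms that the "PE" signature is really there. The function name is my own, and this is only the very first step of PE analysis.

```python
import struct


def pe_header_offset(data: bytes):
    """Return the offset of the PE signature, or None if data is not a PE file."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return None  # too short for a DOS header, or missing the 'MZ' marker
    # e_lfanew: little-endian 32-bit offset of the PE header, stored at 0x3C.
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        return None  # the pointer does not lead to a valid PE signature
    return e_lfanew
```

Everything between the end of the 64-byte DOS header and the returned offset is the stub area mentioned above, so an unusually large or oddly filled gap there is the sort of thing a scanner can take a closer look at.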

Special Note For Programmers:

If you're seriously interested in making your own anti-virus application in .NET I'd suggest you check out the fantastic PicNet library (https://github.com/PicNet/PicNetML). It's an implementation in .NET of a data mining library called WEKA, which is also fantastic but sadly not usable directly from .NET.

You can use PicNet to mine a "training set" or collection of executables for signatures as I detailed above, and this will give you an excellent starting point for building your very own signature database!

If any of you are actually interested in this kind of thing, I'm happy to give you some guidance and help with it.

Manzai · Jun 22, 2014

Great !

Cowpipe · Jun 22, 2014

Update: Added note for programmers.

marg · Jun 22, 2014

Very interesting Cowpipe..!

Cowpipe · Jun 22, 2014

marg said:
Very interesting Cowpipe..!

There seems to be a genuine mystery around how anti-virus scanners work, so I thought I would shed a little light on it. This is just the tip of the iceberg really; it's a fascinating area. It can be helpful in my opinion to know how an anti-virus scanner works when reviewing, so that you at least have an explanation as to why that particular scanner performs better or worse than another, other than "number of signatures", "competence of dev team", "it's based off clam av, what did you expect?" etc

marg · Jun 22, 2014

Isn't Clam AV kind of old for an AV? I am not sure though?

Cowpipe · Jun 22, 2014

marg said:
Isn't Clam AV kind of old for an AV? I am not sure though?

The ClamAV databases are getting old now but the actual engine itself is still being developed actively, though I don't rate it personally, especially considering their website still considers "Worm.MyDoom" a 'current threat'

It may still be a threat but 'current'?

Though ClamAV focuses more on 'worms' rather than trojans, which is why it has poor detection rates. Worms tend to spread through email more than trojans and my understanding is that ClamAV is more orientated to scanning mail boxes rather than computers, it's a different set of "training data" that they've used. At least that's my understanding of it, I haven't looked through the project in a long time.

marg · Jun 22, 2014

Thank You for the info..!

Lailson · Jun 22, 2014

Great post friend!
Very didactic and informative. I am fascinated by this area; it answered many of my questions and maybe I'll create a program. Anxiously waiting for a "continuation".

Cowpipe · Jun 22, 2014

Lailson said:
Great post friend!
Very didactic and informative. I am fascinated by this area; it answered many of my questions and maybe I'll create a program. Anxiously waiting for a "continuation".

Thank you

If enough people are interested in this post I'll create a series, perhaps actually include some code and test programs as well so people can make their very own "anti-virus" to play with

Oxygen · Jun 22, 2014

Thanks for this.

sid_16 · Jun 22, 2014

Great post and good information about the anti virus operation!

Cats-4_Owners-2 · Jun 22, 2014

Thank you, Cowpipe! ..wish I 'could' say "Elementary, my Dear Cowpipe!", but that's your line, Dr;

and it's far from elementary as well.

It's brilliant!!

I've these visions of "MalwareTips Antivirus"!!!

Cowpipe · Jun 22, 2014

Cats-4_Owners-2 said:
Thank you, Cowpipe! ..wish I 'could' say "Elementary, my Dear Cowpipe!", but that's your line, Dr; and it's far from elementary as well. It's brilliant!! I've these visions of "MalwareTips Antivirus"!!!

You know you might not be far off with that vision of yours. I've had the urge to put together my own anti-virus for a while now but haven't had the time, so perhaps when things calm down with my current projects, perhaps in the next few weeks, I'll begin to write one, and of course invite everybody in to see how it works

xxtoss23 · Jun 22, 2014

Great

MrExplorer · Jun 22, 2014

Thanks for the great Information

Littlebits · Jun 22, 2014

Cowpipe said:
The ClamAV databases are getting old now but the actual engine itself is still being developed actively, though I don't rate it personally, especially considering their website still considers "Worm.MyDoom" a 'current threat' It may still be a threat but 'current'?

Though ClamAV focuses more on 'worms' rather than trojans, which is why it has poor detection rates. Worms tend to spread through email more than trojans and my understanding is that ClamAV is more orientated to scanning mail boxes rather than computers, it's a different set of "training data" that they've used. At least that's my understanding of it, I haven't looked through the project in a long time.

ClamAV was originally developed as an AV scanner for Linux. Later, ClamWin was created to run on Windows as an on-demand scanner only, and then Immunet Free AV finally included real-time protection support for Windows, which also added cloud community support.

ClamAV is still widely used on Linux servers that power websites and is included in some Linux-based router firewalls.

Enjoy!!

nissimezra · Jun 22, 2014

Great post thx

omidomi · Jun 22, 2014

Great

Kate_L · Jun 23, 2014

Good info, I see people that love computer security as much as I do, this makes me happy. I should give something to this community also
