- Jun 16, 2014
So with all the allegations surrounding Tiranium antivirus kind of spiralling out of control, and several people suggesting that the team restart from scratch and put together a new anti-virus product, I thought now would be a good chance to explain a few of the techniques that can be used. I would suggest that if they are doing it in VB.NET they start with just a file scanner, as any kind of real-time protection is extremely difficult to implement in .NET and other managed languages; for that, learning C/C++ is recommended.
Anyway, for anyone interested, I'll try to keep my explanation simple and easy to understand, providing a brief overview of the techniques and a starting point for anyone interested in developing their own anti-virus product.
File Scanning Basics:
When an anti-virus looks through your computer it doesn't inspect every single file. Instead, multiple techniques are used to determine whether a file can be considered a risk (and so worth scanning) or not. The first is the file extension: extensions such as .exe, .com, .scr, .dll, .cpl, .ocx and .sys are all types of executables as we know them (in the PE or MZ format). Then we have potential "hosts", which include .pdf, .wmf, .doc, .ppt, .xml and .zip (yup, wonder how many of you are too young to remember ZIP bombs?). File extensions that cannot be executed without being renamed are typically skipped unless "Smart file extensions" or similar is switched off.
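To make the extension check concrete, here's a rough sketch in Python (my choice purely for readability, the same idea ports to VB.NET). The function names and the `smart_extensions` flag are made up for illustration, and the two lists only contain the extensions mentioned above.

```python
from pathlib import Path

# Extension lists taken from the post; a real product would carry longer ones.
EXECUTABLE_EXTENSIONS = {".exe", ".com", ".scr", ".dll", ".cpl", ".ocx", ".sys"}
HOST_EXTENSIONS = {".pdf", ".wmf", ".doc", ".ppt", ".xml", ".zip"}


def classify_by_extension(filename: str) -> str:
    """Return 'executable', 'host' or 'skip' based only on the extension."""
    ext = Path(filename).suffix.lower()
    if ext in EXECUTABLE_EXTENSIONS:
        return "executable"
    if ext in HOST_EXTENSIONS:
        return "host"
    return "skip"


def should_scan(filename: str, smart_extensions: bool = True) -> bool:
    """With smart extensions switched off, every file is scanned."""
    if not smart_extensions:
        return True
    return classify_by_extension(filename) != "skip"
```

For example, `should_scan("notes.txt")` is false with the default setting, but becomes true once `smart_extensions=False` is passed.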
Of course, when in "Quick Scan" mode, a much faster scan can be performed by searching for these files only within Program Files and system areas, as well as in memory (a scan of the processes currently running, tracing them to their loaded DLL* files etc).
File Identification:
Just because a file is called imateapot.exe, that doesn't necessarily mean it is an executable. For example, if our anti-virus simply relied on the extension of a file then, as a virus writer, all we would need to do is create a malicious Java application with an EXE extension. Our anti-virus would start trying to analyse the JAR file as if it were an executable PE/MZ*, meaning any signatures* we tried to apply would not match. It's like trying to look for a bag of frozen chips in the fresh fruit department of a supermarket. So the scanner has to work out the real file type from the file's contents rather than from its name.
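One common way to identify a file by content is to look at its first few bytes (its "magic number"). The sketch below is only an illustration with three well-known markers; the function names and type labels are my own, and a real scanner would know far more formats and look deeper than the first bytes.

```python
# A few well-known magic numbers (file-start markers).
MAGIC_NUMBERS = [
    (b"MZ", "pe_or_mz_executable"),   # DOS/Windows executables start with "MZ"
    (b"PK\x03\x04", "zip_archive"),   # ZIP archives, which includes JAR files
    (b"%PDF", "pdf_document"),
]


def identify(data: bytes) -> str:
    """Guess the file type from the leading bytes, ignoring the file name."""
    for magic, kind in MAGIC_NUMBERS:
        if data.startswith(magic):
            return kind
    return "unknown"


def identify_file(path: str) -> str:
    """Read just enough of the file to compare against the markers."""
    with open(path, "rb") as handle:
        return identify(handle.read(8))
```

With this, a JAR renamed to imateapot.exe comes back as `zip_archive`, so the scanner knows not to apply PE signatures to it.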
File Analysis Basics:
So once we know that a file is a 'risk' file, we can begin to analyse it in detail using some of the techniques I'll describe below, all of which can be done in VB.NET, I should note (this is the language Tiranium is programmed in).
So there are basically three main ways that an anti-virus can tell the difference between a malicious file and a benign or safe file. The first is using signatures (static and heuristic); this is known as "static analysis". The second is "dynamic analysis", for example behavioural monitoring or sandboxing: loading the virus into a protected area of memory and monitoring everything it does, redirecting every action away from your computer and into the anti-virus engine for analysis. The third is reputation based analysis, which is another form of static analysis.
As this post is already pretty long, I'll simply outline some of the basics of each method, with advantages and disadvantages. I'll try to explain the technical bits as best as I can, but feel free to ask if there is anything you're unsure of, or indeed anything that I have explained badly.
Signature Based Analysis:
Every executable has a specific structure which includes the file header (see 'Terms'), the 'text' section, which is 'normally' where executable code is placed, and the import table (which reveals which DLLs the program relies on to operate [see 'Terms']). Signature based analysis is, in a nutshell, scanning specific parts of the executable as above to look for patterns, also known as 'features'.
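To show what "specific parts of the executable" means in practice, here is a minimal sketch that walks from the MZ header to the PE header and lists the sections (name, file offset, size). The offsets used are the standard PE layout; the function name is my own, and import table parsing is left out because it needs quite a bit more code.

```python
import struct


def list_pe_sections(data: bytes) -> list[tuple[str, int, int]]:
    """Return (name, raw_offset, raw_size) per section; raise ValueError if malformed."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("missing MZ header")
    # Offset 0x3C of the MZ header holds the file offset of the PE header.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    coff = pe_offset + 4                     # 20-byte COFF header follows
    if coff + 20 > len(data):
        raise ValueError("truncated COFF header")
    (num_sections,) = struct.unpack_from("<H", data, coff + 2)
    (optional_size,) = struct.unpack_from("<H", data, coff + 16)
    table = coff + 20 + optional_size        # section table starts here
    sections = []
    for index in range(num_sections):
        entry = table + index * 40           # each entry is 40 bytes
        if entry + 40 > len(data):
            raise ValueError("truncated section table")
        name = data[entry:entry + 8].rstrip(b"\x00").decode("ascii", errors="replace")
        raw_size, raw_offset = struct.unpack_from("<II", data, entry + 16)
        sections.append((name, raw_offset, raw_size))
    return sections
```

Once you have the offset and size of '.text', a scanner can limit its signature search to that range instead of the whole file.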
In traditional signature based analysis the anti-virus constructs a virtual map of the executable file and from that produces something called a binary tree (strictly speaking, what I describe here is a prefix tree, or "trie"). It is in essence a way of scanning through the entire file and only looking up those signatures from the database which are still possible to match at that point. For example, in the sentence: Humpty Dumpty Fell Off The Wall, let's pretend that "Fell Off The Wall" is the signature indicating we've caught the Humpty Dumpty virus.
The tree method scans through our file and sees the text (or bytes in an executable) "Humpty". Our database doesn't contain any signatures that start with "Humpty" so we skip past that, and again for "Dumpty". "Fell" matches the start of two signatures in our database, "Fell Off The Wall" and "Fell Off The Chair"; the latter is a different virus.
So, comparing each word in our file against its possible matches in the database, we construct a tree which looks something like the following:
Fell
|
Off
|
The ------ Chair (route 2)
|
Wall (route 1)
And "walking" along that tree we can see that route 1 (signature 1 in our database) is a match for something found in our executable. We can now either flag the file straight away or, if we want to be sure it's malicious, perform a further in-depth analysis. This could include comparing the MD5 or SHA1 hash of the executable to a database of known hashes for this particular virus (basically, a hash is a way of reducing a set of data into a fixed-length series of letters and numbers based on that exact data).
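The Humpty Dumpty walk can be sketched in a few lines of Python. This is a toy version under some assumptions of mine: it matches whole words rather than bytes to mirror the example, the signature names are invented, and a production engine would use a faster multi-pattern matcher. The `confirm_by_hash` helper shows the follow-up hash check using SHA1.

```python
import hashlib


def build_trie(signatures: dict[str, list[str]]) -> dict:
    """Build a prefix tree from {signature_name: [token, ...]}."""
    root: dict = {}
    for name, tokens in signatures.items():
        node = root
        for token in tokens:
            node = node.setdefault(token, {})
        node[None] = name  # the None key marks the end of a full signature
    return root


def scan(tokens: list[str], trie: dict) -> list[str]:
    """Return the names of all signatures found anywhere in the token stream."""
    matches = []
    for start in range(len(tokens)):
        node = trie
        for token in tokens[start:]:
            if token not in node:
                break  # no signature can match from here, move on
            node = node[token]
            if None in node:
                matches.append(node[None])
    return matches


def confirm_by_hash(data: bytes, known_hashes: set[str]) -> bool:
    """Second opinion: is the SHA1 of the whole file in our known-bad list?"""
    return hashlib.sha1(data).hexdigest() in known_hashes
```

Scanning the words of "Humpty Dumpty Fell Off The Wall" against a trie built from the "Wall" and "Chair" signatures only ever follows the "Fell" branch, and ends on the "Wall" route.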
Heuristic Analysis:
This technique is similar to the above, only the signatures are much more general and other techniques are used as well. One is something called entropy, which is a measure of how random or disorganised a file is (the more random a file's contents, the more likely it has been compressed or packed, as most compilers produce orderly, logical code). Another is detailed "PE analysis", which is essentially looking at the structure of the executable file: is it a standard structure, or is it unusual (an attempt to hide something, maybe a virus)?
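Entropy is easy to compute yourself. The sketch below uses the usual Shannon entropy formula over byte values, giving a number between 0 (all bytes identical) and 8 (every byte value equally likely). The 7.0 cut-off in `looks_packed` is purely an assumption of mine for illustration, not a figure from any particular product.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte: 0.0 for constant data, 8.0 for uniform bytes."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_packed(data: bytes, threshold: float = 7.0) -> bool:
    """Flag data whose entropy is at or above an (assumed) threshold."""
    return shannon_entropy(data) >= threshold
```

In practice you would probably compute this per section rather than over the whole file, since a packed payload can hide inside an otherwise ordinary looking executable.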
With heuristic analysis the aim is to look for new forms of malware that have yet to be discovered (eg: Humpty Dumpty's yet undiscovered nemesis, the machine gun)... One of the best ways to do this is to look, as many anti-virus companies have done, at something called "document classification". This is an area worthy of an article all on its own and is fascinating, but essentially there are two main concepts useful to malware analysis: Term Frequency (TF) and Inverse Document Frequency (IDF), usually combined as TF-IDF. The first simply measures how commonly a word occurs in a document.
For example, a document about cooking will likely contain words such as "prepare", "ingredients", "chop", "bake" etc, and these words will appear frequently in a cooking article but not in an article about banks. Of course other words such as "and", "the", "them" will appear frequently too. This is called "noise", which means that a computer has to have a way to determine that "and" and "the" contribute nothing to telling us what a document is about, whereas the words "ingredient" and "chop" do.
Inverse Document Frequency to the rescue. Without going into detail, this method allows us to separate the wheat from the chaff and determine the importance of a word to a document. It does this by discounting words that appear very often across a set of (unrelated) documents. So in a library we would find millions of books containing "and", but only a few hundred containing "ingredient".
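For the curious, here is a small sketch of one common TF-IDF variant: term frequency is the word count divided by document length, and IDF is the natural log of (number of documents / number of documents containing the word). There are several other weighting schemes, so treat this as one possibility rather than the definition.

```python
import math
from collections import Counter


def tf_idf(documents: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: score} dictionary per document."""
    doc_count = len(documents)
    # How many documents does each term appear in at least once?
    document_frequency = Counter()
    for doc in documents:
        document_frequency.update(set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(doc_count / document_frequency[term])
            for term, count in counts.items()
        })
    return scores
```

With a cooking text and a banking text as the two documents, "the" and "and" appear in both and score exactly zero, while "chop" only appears in one and gets a positive score.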
Applying this to malware analysis, there are a number of techniques in the realm of "data mining" and "machine learning". Again, articles all of their own are needed for me to explain these in any level of detail, so I'll simply skip to the most common implementation.
The most often used implementation is something called byte n-grams. A byte is a building block of a computer program, and an n-gram is a run of n consecutive objects (here, n consecutive bytes). So now, instead of looking for words such as "ingredient" in our files, we look for series of bytes which occur most commonly in malicious programs. For example, let's say "4A 3B 2F 2F 3A" represents the word "machine gun" and, evil a tool as it is, appears most frequently in backdoor trojans. Finding that series of bytes will flag the file to us as suspicious, and finding combinations of these byte signatures will allow us to be up to 98% sure that we have detected a malicious program.
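Extracting byte n-grams is just a sliding window over the file. In the sketch below the helper names and the scoring rule (the fraction of a known suspicious list that turns up in the file) are my own simplifications; a real engine would feed the n-gram counts into a trained classifier rather than use a flat ratio.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int = 4) -> Counter:
    """Count every run of n consecutive bytes in data."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def suspicious_score(data: bytes, suspicious_ngrams: set[bytes], n: int = 4) -> float:
    """Fraction of the listed suspicious n-grams that occur in data."""
    if not suspicious_ngrams:
        return 0.0
    present = byte_ngrams(data, n)
    hits = sum(1 for gram in suspicious_ngrams if gram in present)
    return hits / len(suspicious_ngrams)
```

Using the made-up "machine gun" bytes from above, `suspicious_score(sample, {bytes.fromhex("4A3B2F2F3A")}, n=5)` gives 1.0 for any sample containing that run and 0.0 otherwise.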
This specimen can now be automatically uploaded to human malware analysts for a proper looking over and, if it's malicious, static signatures can be made to identify it as a particular strain, "Trojan.HumptyDumpty.MachineGun", rather than "Trojan.Gen!3297" or whatever.
In future posts, if anybody is interested, I'll cover some basic dynamic analysis concepts as well as reputation analysis, but unfortunately I've just realised how long and probably boring this post already is. I hope that even if you were bored to death you at least learned something, and thank you very, very much for reading. I've put over half an hour into typing all this out lol
Terms:
DLL stands for Dynamic Link Library and is used by Windows to hold "functions", or repeatable routines, which can be called by many different programs. For example, a DLL called kernel32.dll has a function called CreateFile which is a bridge between the program or virus and the operating system. So if the virus wants to download a file from the internet, it may call on CreateFile in kernel32.dll to handle all the heavy lifting involved with saving the file to the hard drive (which I won't go into for the sake of simplicity).
PE and MZ stand for Portable Executable and Mark Zbikowski. An MZ executable is one made to run in MS-DOS, and every PE file still begins with a small MZ part. Hence, whenever you open a PE file in a hex editor, you'll see the string "This program cannot be run in DOS mode". This is actually a miniature MS-DOS program called the MZ stub, which is designed to alert DOS users that they need to run the application on Windows. Checking the size and structure of this section (called a file header or MZ_HEADER) is one of the methods anti-virus programs use to look for suspicious files. (Some viruses used to hide code in this section, which older anti-viruses used to skip to speed up scan times.)
Special Note For Programmers:
If you're seriously interested in making your own anti-virus application in .NET, I'd suggest you check out the fantastic PicNet library (https://github.com/PicNet/PicNetML). It's an implementation in .NET of a data mining library called WEKA, which is also fantastic but sadly not usable directly from .NET.
You can use PicNet to mine a "training set" or collection of executables for signatures as I detailed above, and this will give you an excellent starting point for building your very own signature database! If any of you are actually interested in this kind of thing, I'm happy to give you some guidance and help with it.