Creating anti-virus signatures

Kate_L · Jun 23, 2014

Hello,

This post covers the three main types of signature detection: hashes, byte-signatures and heuristics. It focuses primarily on creating signatures for Microsoft Portable Executable (PE) files.

I am writing this because there is very little information available on "creating anti-virus signatures", and I see people here who love computer security as much as I do. A few articles exist on how to create signatures with ClamAV, and I will do my best to add to them. I hope the information in this post is helpful.

Packed code creates a dilemma for signature detection. If a file has been packed or compressed, it needs to be uncompressed or dumped before it is scanned. Anti-virus engines use emulators and unpackers to get files into that state before scanning them. When working by hand, tools such as TitanEngine or Immunity Debugger can be used to create dumps or uncompressed files.

Data that has been obfuscated or compressed should never be used as a candidate for a signature. As with file hashes such as MD5, changing a single byte of the input, or the key used to encode it, can change the obfuscated or compressed output. Because those bytes change so easily, there is a good chance they will not be present in other variants.
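
To illustrate the point, here is a minimal Python sketch. It assumes a single-byte XOR as a toy stand-in for a real packer or crypter; the function name and the keys are made up for the example. The same payload encoded with two different keys has no byte in common at any position, yet both decode to the same data, which is what a signature should be built on.

```python
def xor_bytes(data: bytes, key: int) -> bytes:
    """XOR every byte with a single-byte key (toy stand-in for a packer)."""
    return bytes(b ^ key for b in data)


payload = b"This is a bad malware"

# Two "variants" of the same payload, obfuscated with different keys.
variant_a = xor_bytes(payload, 0x41)
variant_b = xor_bytes(payload, 0x7F)

# Count positions where the obfuscated bytes are identical.
same_positions = sum(a == b for a, b in zip(variant_a, variant_b))
print(same_positions)  # 0, because the keys differ

# Once decoded, both variants expose the same bytes again.
print(xor_bytes(variant_a, 0x41) == xor_bytes(variant_b, 0x7F))  # True
```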

There is a lot to say about reverse engineering and code analysis; I will show some examples to make things clearer.

Tools

ClamAV - hex byte scanning, regex, MD5 file scanning, MD5 section scanning, and sigtool (a tool for creating signatures and hashes)
Yara - a powerful rule-based scanner that supports many conditions and data types; it does not support hashing at the time of writing
ssdeep - a tool for creating and comparing context triggered piecewise hashes
TitanEngine - I don't have an epic description

Hash Signatures

The most basic and easiest type of signature is a hash value. A hash function is a procedure or mathematical function that converts an arbitrary amount of data into a single fixed-size value. The most commonly used hash functions are MD5 and SHA-1. Hash signatures are extremely accurate in the sense that they match only an exact copy of the file; changing a single byte produces a different hash.

Code:

md5("This is a bad malware") = "e4fee76675e45750b9e144247f92fd38"

We save the text "This is a bad malware" (21 bytes, with no trailing newline) in myMalware.txt.

MD5-based signatures can be created using ClamAV; Yara does not support file hashing at the time of writing. ClamAV requires two attributes in order to create an MD5 hash signature: the first is the file size in bytes and the second is the MD5 hash. ClamAV comes with a tool called sigtool that can be used to generate MD5 signatures. Sigtool can be found in the "bin" directory in the installation folder of ClamAV.

If you are using Windows, you can Shift + Right Click inside that folder and choose "Open command window here".

Code:

\bin>sigtool.exe --md5 myMalware.txt
e4fee76675e45750b9e144247f92fd38:21:myMalware.txt

The output has three fields separated by colons:

e4fee76675e45750b9e144247f92fd38 = MD5 hash
21 = file size in bytes
myMalware.txt = malware name (here the file name)
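
The same MD5:size:name line can also be built without sigtool. This is only a sketch using Python's standard hashlib; the helper name hdb_signature is my own and is not part of ClamAV.

```python
import hashlib


def hdb_signature(data: bytes, name: str) -> str:
    """Build a 'md5:size:name' line in the layout printed by sigtool --md5."""
    digest = hashlib.md5(data).hexdigest()
    return f"{digest}:{len(data)}:{name}"


# The size field is the number of bytes in the file, 21 for this test string.
line = hdb_signature(b"This is a bad malware", "Not-A-Virus@TestSignature")
print(line)

# Changing a single byte gives a completely different hash,
# so the signature above no longer matches the modified file.
print(hdb_signature(b"This is a bad malwarE", "Not-A-Virus@TestSignature"))
```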

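Now save the line e4fee76675e45750b9e144247f92fd38:21:Not-A-Virus@TestSignature in a file named myFirstSig.hdb. Make sure both the signature file and the "malware" text file we created are in the bin directory, then scan the text file using myFirstSig.hdb as the signature database (for example with clamscan -d myFirstSig.hdb myMalware.txt). The scan output looks like this: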

Code:

Loading virus signature database, please wait... |
Loading virus signature database, please wait... done
D:\Malware Research\ClamWinPortable\App\clamwin\bin\myMalware.txt: Not-A-Virus@TestSignature.UNOFFICIAL FOUND

----------- SCAN SUMMARY -----------
Known viruses: 1
Engine version: 0.98.1
Scanned directories: 0
Scanned files: 1
Infected files: 1
Data scanned: 0.00 MB
Data read: 0.00 MB (ratio 0.00:1)
Time: 0.053 sec (0 m 0 s)

Byte-Signatures

Byte-signatures, or byte detections, are signatures based on a sequence of bytes that is present in a file or data stream. They are a very common form of detection and have been used since the first anti-virus scanners. Their usefulness comes from how precisely they identify a specific sequence of bytes.

Code:

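MalwareName:TargetType:Offset:HexSignature

MalwareName = name reported on detection
TargetType = 1 for PE files, 0 for any file
Offset = where to look for the bytes, * for anywhere in the file
HexSignature = hexadecimal representation of the bytes (opcodes) to match

(This is the ClamAV extended signature format, stored in a .ndb file.)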
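
To show what a scanner does with such a hex string, here is a small Python sketch. It is a naive matcher written for this post and not how ClamAV is implemented; the function name and the sample bytes are invented. It treats ?? as "any byte", which is the only wildcard it supports.

```python
def find_signature(data: bytes, hex_sig: str) -> int:
    """Return the first offset where hex_sig matches, or -1 if it is absent.

    hex_sig is a string of hex byte pairs; the pair '??' matches any byte.
    """
    pairs = [hex_sig[i:i + 2] for i in range(0, len(hex_sig), 2)]
    pattern = [None if p == "??" else int(p, 16) for p in pairs]
    for start in range(len(data) - len(pattern) + 1):
        if all(p is None or data[start + k] == p for k, p in enumerate(pattern)):
            return start
    return -1


# Invented sample: a few header bytes followed by a typical function prologue.
sample = b"\x00\x01MZ\x90\x00\x55\x8b\xec\x83\xec\x10"

print(find_signature(sample, "558bec83ec"))  # 6 (exact match)
print(find_signature(sample, "558b??83ec"))  # 6 (third byte is a wildcard)
print(find_signature(sample, "deadbeef"))    # -1 (not present)
```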

Heuristics

The last type of signature detection is heuristics. Heuristics are used when the malware is too complex for hash and byte-signatures. "Heuristics" is a general term for the different techniques used to detect malware by its behavior.

Each anti-virus engine uses different algorithms and different proprietary techniques. A simple example of creating a heuristics signature would include an API logger and rules based off the APIs.

Rule A
An API call to RtlMoveMemory with a string of "SOFTWARE\Classes\http\shell\open\..."

Rule B
An API call to CreateMutexA with a string of "Mlwr"

Rule C
An API call to GetSystemDirectory

And now it will check:
if ( Rule A && Rule B && Rule C )
then Process = Malware
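
The three rules can be sketched in Python as follows. The API log is assumed to be a plain list of (API name, string argument) pairs; a real API logger produces much richer data, and the helper names here are made up.

```python
from typing import List, Tuple

ApiCall = Tuple[str, str]  # (API name, string argument seen in the call)


def called(log: List[ApiCall], api: str, needle: str = "") -> bool:
    """True if the log contains a call to `api` whose argument contains `needle`."""
    return any(name == api and needle in arg for name, arg in log)


def is_malware(log: List[ApiCall]) -> bool:
    rule_a = called(log, "RtlMoveMemory", "SOFTWARE\\Classes\\http\\shell\\open")
    rule_b = called(log, "CreateMutexA", "Mlwr")
    rule_c = called(log, "GetSystemDirectory")
    # All three rules must trigger, as in the check above.
    return rule_a and rule_b and rule_c


suspicious_log = [
    ("GetSystemDirectory", ""),
    ("CreateMutexA", "Mlwr"),
    ("RtlMoveMemory", "SOFTWARE\\Classes\\http\\shell\\open\\command"),
]
clean_log = [("GetSystemDirectory", ""), ("CreateMutexA", "MyApp")]

print(is_malware(suspicious_log))  # True
print(is_malware(clean_log))       # False
```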

This post is just an introduction to creating anti-virus signatures. There is a lot more to it; I tried to write it so that everybody can understand. If you want to learn, this is a good start, and there are some good free tools. The fun part is that you can make signatures even if you don't know RE or ASM.

A lot of cloud AVs use MD5 / SHA1 / SHA256 for quick detection. There is a website for this: ViruSign.

BTW, the ClamAV team is now accepting "Community Signatures" into the official database. More details are here, if anyone is interested: http://www.clamav.net/lang/en/2014/02/18/introducing-clamav-community-signatures/

Cowpipe · Jun 23, 2014

Fantastic post! Nice, clearly laid out and easily digestible information.

Kate_L · Jun 23, 2014

I will write more, but I want to know what about.

YARA + VirusTotal = love

marg · Jun 23, 2014

Actually, Cowpipe, it's a little over my head, but I can figure out some things. I am really tech challenged, but I try anyway.

Cowpipe · Jun 23, 2014

marg said:
Actually, Cowpipe, it's a little over my head, but I can figure out some things. I am really tech challenged, but I try anyway.

Sorry, meant to say for anyone who isn't knowledgeable about security but is technically minded, got my words mixed up there marg. Which parts don't you understand particularly well? I can try to explain them if you're interested

Oh and just for the record I've got the impression from your posts that you aren't "really" tech challenged, I think you're more knowledgeable than you believe

marg · Jun 23, 2014

Thank you for the kind words, Cowpipe! I am confused about how the cloud works.

Cowpipe · Jun 23, 2014

marg said:
Thank you for the kind words, Cowpipe! I am confused about how the cloud works.

When people talk about "the cloud" they are simply referring to a remote server. An example of "cloud storage" is Dropbox: you can upload your files to 'the cloud' or, in simpler terms, send your files through the internet to the Dropbox servers, which will store them securely.

Cloud servers can also do processing work. When you hear about 'cloud antivirus', that just means the software on your computer is talking over the internet to a server. For example, a suspicious file may be sent to that server, which runs specialist analysis software and usually has more processing power than your computer; the results are then sent back to your computer 'from the cloud'.

The Cloud (Dropbox)--------------the internet------------------cowpipe's computer
|
| ^ picture saved to the cloud
|
the internet
|
| ^ Picture sent through the internet
|
marg's computer

So in the above diagram, both you and I can send files to our Dropbox accounts, which you can think of as our own little corner of the cloud (the Dropbox server, in this case). Downloading pictures works the same way.

In essence, 'cloud computing' is just a name for the trend of doing more processing on the server rather than on the user's computer.

Hopefully that makes sense, but if you're still not sure, let me know and I'll try to clear it up

marg · Jun 23, 2014

Thanks Cowpipe! It makes me feel better knowing the cloud is checking things under 360 TS.

Cowpipe · Jun 23, 2014

marg said:
Thanks Cowpipe! It makes me feel better knowing the cloud is checking things under 360 TS.

The beauty of it is that there are thousands of people scanning their files on the 360 TS cloud. If one user has a new virus on their computer which the cloud can identify, the detection is rolled out automatically, so if that previously unknown virus winds up on your computer, it will be removed straight away. Otherwise you would have to wait potentially hours or days for a human analyst to obtain the sample, write a detection for it and send the update to your anti-virus software, by which time the virus could well have embedded itself.

Kate_L · Jun 23, 2014

You download a file and the AV checks its MD5 / SHA1 / SHA256 against the cloud. If the hash is known to be safe, all is OK; otherwise the file is treated as unsafe and blocked.

Now, how "real" AVs work:

Panda Cloud: it has behavioural analysis and a process monitor. If one of the rules is triggered, it reports the file to the server as a "Suspicious File", and the server sends that verdict to all the PCs worldwide.

Baidu Antivirus: scanning and detection are really fast because they use MD5 / SHA1 / SHA256. The Baidu Cloud uses MD5, and from what I know the "Kaspersky" cloud engine works the same way.

360 Qihoo: QVM uses rules like the ones I showed above (heuristics) plus hashes. The number at the end of the detection name is the rule that was triggered.

I hope you understand it better now. ViruSign has the same system for quick detection. Read more here

Malware1 · Jun 23, 2014

OpenSecLabs said:
Baidu Antivirus: scanning and detection are really fast because they use MD5 / SHA1 / SHA256. The Baidu Cloud uses MD5, and from what I know the "Kaspersky" cloud engine works the same way.

It's not the same. Baidu copies detections from Kaspersky, ESET and Microsoft.

Kate_L · Jun 23, 2014

I didn't know that, thank you for the update.

Kate_L · Jun 24, 2014

First, sorry for the double post. If anyone wants to learn new things, I will make a post about it; just send me a PM.

Oxygen · Jul 8, 2014

Interesting post, thanks for this.

WinXPert · Jul 8, 2014

Just found this out, great post by the way. I used to create AV signatures during good old DOS days with Thunderbyte AV. Doing it in Windows is very new to me and is giving me nose bleeds. Hope the topic moves on. This is the kind of stuff I like to read even if it's beyond my current comprehension.

Learning and reading new stuffs everyday. that's me

Cowpipe · Jul 9, 2014

WinXPert said:
Just found this out, great post by the way. I used to create AV signatures during good old DOS days with Thunderbyte AV. Doing it in Windows is very new to me and is giving me nose bleeds. Hope the topic moves on. This is the kind of stuff I like to read even if it's beyond my current comprehension.

Learning and reading new stuffs everyday. that's me

Wow, I haven't heard anyone talk about TBAV for years!

Did you use to request samples from them?

WinXPert · Jul 9, 2014

Nope. There was no internet when I used it. I got the program from a shareware CD I bought.

I wonder how many of us here knew about TBAV

Littlebits · Jul 9, 2014

How would you create generic signatures? Many AVs use generic signatures which can detect a wide range of threats.

Thanks.

Kate_L · Jul 9, 2014

It is similar to heuristics: it is based on parts of code / behavior, and this can cause False Positives.
Heuristic analysis is often considered a generic AV detection mechanism, not a virus-specific detection mechanism. What is not always considered is that the converse is also true; generic solutions use heuristic rule-sets as part of the diagnostic process

For example:
Trojan@WinLock uses some API and locks the OS. You can take the code that does that and make a generic signature (Trojan[at]WinLock.Gen); this way, the next time a sample comes in, it will be detected as ".Gen". This is how a lot of backdoors and bots are detected (even if the sample is FUD using a crypter).

“Generic detection” is a term applied when the scanner looks for a number of known variants, using a search string that can detect all of the variants. While it may detect a currently unknown variant in which the same search string can be found, it’s only a heuristic detection if it involves the use of a scoring mechanism. Otherwise it’s really a special case of virus-specific detection. Some systems use a hybrid approach, where a scoring system is added to the generic detection capabilities to give a probability of the variance or family membership with differing degrees of certainty. For instance, if the similarity is close enough, the scanner may report “a variant of x,” or if less sure, it may report “probably a variant of x”.
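
Here is a rough Python sketch of that hybrid idea: several search strings for one family, plus a score that decides the wording of the report. The marker strings and the 0.8 / 0.5 cut-offs are invented for the example and are not taken from any real engine.

```python
from typing import List


def classify_variant(data: bytes, family: str, markers: List[bytes]) -> str:
    """Score a sample by the fraction of family search strings it contains."""
    if not markers:
        return "no detection"
    hits = sum(marker in data for marker in markers)
    ratio = hits / len(markers)
    if ratio >= 0.8:   # assumed cut-off for a confident match
        return f"a variant of {family}"
    if ratio >= 0.5:   # assumed cut-off for a weaker match
        return f"probably a variant of {family}"
    return "no detection"


# Hypothetical search strings for a screen-locker family.
winlock_markers = [b"BlockInput", b"DisableTaskMgr", b"LockWorkStation", b"unlock code"]

close = b"..BlockInput..DisableTaskMgr..LockWorkStation..enter unlock code.."
partial = b"..BlockInput....LockWorkStation.."
unrelated = b"..BlockInput.."

print(classify_variant(close, "Trojan@WinLock", winlock_markers))
print(classify_variant(partial, "Trojan@WinLock", winlock_markers))
print(classify_variant(unrelated, "Trojan@WinLock", winlock_markers))
```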

If you have any questions, feel free to ask, or tell me if you want me to write an article about something.

Nico@FMA · Jul 9, 2014

OpenSecLabs said:
It is similar to heuristics: it is based on parts of code / behavior, and this can cause False Positives.
Heuristic analysis is often considered a generic AV detection mechanism, not a virus-specific detection mechanism. What is not always considered is that the converse is also true; generic solutions use heuristic rule-sets as part of the diagnostic process

For example:
Trojan@WinLock uses some API and locks the OS. You can take the code that does that and make a generic signature (Trojan[at]WinLock.Gen); this way, the next time a sample comes in, it will be detected as ".Gen". This is how a lot of backdoors and bots are detected (even if the sample is FUD using a crypter).

“Generic detection” is a term applied when the scanner looks for a number of known variants, using a search string that can detect all of the variants. While it may detect a currently unknown variant in which the same search string can be found, it’s only a heuristic detection if it involves the use of a scoring mechanism. Otherwise it’s really a special case of virus-specific detection. Some systems use a hybrid approach, where a scoring system is added to the generic detection capabilities to give a probability of the variance or family membership with differing degrees of certainty. For instance, if the similarity is close enough, the scanner may report “a variant of x,” or if less sure, it may report “probably a variant of x”.

If you have any questions, feel free to ask, or tell me if you want me to write an article about something.

100% correct. Detection rules are also based on protocol rules.
It is rather simple: file X has a particular set of authorizations within the OS. The moment the file exceeds its own credentials, it is flagged as possibly suspicious, and a behavior-based scanning technique kicks in and sends the file to the cloud, where it is compared with the same file on thousands of client computers.
If all those files behave differently from your own local file X, it will also be flagged as a Gen detection (assuming the malware itself has not been classified yet).

Also, if in the near future a similar file shows the same infection symptoms, the whole family of that file type is blacklisted for the time being, as a better-safe-than-sorry measure.

So in short, let's say file X shows the following symptoms:
1 Creates registry keys without reason
2 Tries to rename / corrupt semi-critical system files
3 Makes lots of connections to outside sources

And let's say, for the sake of argument, that this malware is called W32/Trojan.ihackyousilly.EA
The next time an unknown file shows similar behavior, the malware causing it might not have been identified, but your AV is still capable of stopping it because it looks at the behavior and matches it against known and recorded "bad" behaviors.
It might then label the file as a possible variant of the malware mentioned above.

It is like real-world viruses: you might not know the virus itself, but you can certainly see which family it came from and whether it is a new strain.
Computer malware is classified in pretty much the same way when using HEUR, BEH or TRIP technology within resident and cloud engines.
Scoring a file based on its own code and actions is also a great way to determine whether a file is bad.

Let's say there are 10 points to be awarded in order to classify a file as clean or bad. An AV + cloud can then award points to the file based on its behavior.

As long as the file stays within a threshold of, let's say, 3 points, it is classed as safe; anything beyond that flags it as either suspicious or malware, depending on how close the score gets to 10.
This also means that if the same file is found on a different PC across the world, it automatically gets that same score.
So an AV engine does not need to know the file and does not need to know the associated malware.
It just has to look up the hash (e.g. SHA-2) and file details and read the score.
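
A minimal Python sketch of that scoring idea, for anyone who wants to see it in code. The point values per behavior, the cut-off of 8 for "malware" and the in-memory dictionary standing in for the cloud are all assumptions made for the example; only the 10 point scale and the 3 point threshold come from the description above.

```python
import hashlib
from typing import Dict, Iterable

# Assumed point values for the three symptoms listed earlier.
BEHAVIOR_POINTS = {
    "creates_registry_keys": 2,
    "modifies_system_files": 4,
    "many_outbound_connections": 3,
}
SAFE_THRESHOLD = 3   # up to 3 points the file is classed as safe
MALWARE_CUTOFF = 8   # assumed: this close to 10 means malware
MAX_SCORE = 10

cloud_scores: Dict[str, int] = {}  # stand-in for the cloud database


def label(score: int) -> str:
    if score <= SAFE_THRESHOLD:
        return "safe"
    return "malware" if score >= MALWARE_CUTOFF else "suspicious"


def report(data: bytes, behaviors: Iterable[str]) -> str:
    """Score a file from its observed behaviors and store the score by hash."""
    score = min(MAX_SCORE, sum(BEHAVIOR_POINTS.get(b, 0) for b in behaviors))
    cloud_scores[hashlib.sha256(data).hexdigest()] = score
    return label(score)


def lookup(data: bytes) -> str:
    """On another PC: no analysis needed, just read the stored score."""
    key = hashlib.sha256(data).hexdigest()
    return label(cloud_scores[key]) if key in cloud_scores else "unknown"


print(report(b"file X", BEHAVIOR_POINTS.keys()))  # malware (2 + 4 + 3 = 9)
print(lookup(b"file X"))                          # malware
print(lookup(b"never seen before"))               # unknown
```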
This is why most Asian AV vendors do not have their own signatures and detection rules: they can simply use public databases where unknown flagged files are listed.
Then they pair a known AV engine with their own engine to cover the "known" ones.
It is really simple when you think about it, BUT do not be fooled into thinking that anyone can do this. Creating solid rules based on true malware signatures is a VERY hard and specialist job.
Processing the amount of data needed before a rule is triggered is such a fragile and complex process that, for a new AV company like those Asian ones, it is cheaper and more efficient to buy a last-generation scanning engine from Bitdefender or others who sell their older engines. Building your own infrastructure to support an AV program across thousands of computers, backed by a cloud, is insanely complex work.
And to achieve good results you sure as hell want to do it correctly, because if you do not, people will get hit with wave after wave of FP alerts and will lose files.
This is exactly what we see with most new AV brands. The software might be OK, but the supporting infrastructure and the whole setup are not yet mature enough to support clients the way Symantec, Kaspersky or Sophos do.
So even though HEUR scanning is a MASSIVE plus for ANY AV, it is also the downfall of most new ones.
HEUR detection is based on prediction and scores, and getting that in sync with millions of clients around the world takes know-how that most of these new Asian brands just do not have.

Cheers
