Andy Ful
Thread author
Interesting white paper about Machine Learning used in Kaspersky products (but also in other AVs).
The white paper:

Here are some interesting fragments:
Algorithms must allow us to quickly adapt them to malware writers’ counteractions
(...)
After applying machine learning to malware detection, we have to face the fact that our data distribution isn’t fixed:
• Active adversaries (malware writers) constantly work on avoiding detections and releasing new versions of malware files that differ significantly from those that have been seen during the training phase.
• Thousands of software companies produce new types of benign executables that are significantly different from previously known types. The data on these types was lacking in the training set, but the model, nevertheless, needs to recognize them as benign.
This causes serious changes in data distribution and raises the problem of detection rate degradation over time in any machine learning implementation. Cybersecurity vendors that implement machine learning in their antimalware solutions face this problem and need to overcome it. The architecture needs to be flexible and has to allow model updates ‘on the fly’ between retraining. Vendors must also have effective processes for collecting and labeling new samples, enriching training datasets and regularly retraining models. [Chart in the paper: detection rate (% of malware detected) over time, 85–100%]
Detecting new malware in pre-execution with similarity hashing
(...)
We were interested in features that were robust against small changes in a file. These features would detect new modifications of malware, but would not require more resources for calculation. Performance and scalability are the key priorities of the first stages of anti-malware engine processing.
To address this, we focused on extracting features that could be:
• calculated quickly, like statistics derived from file byte content or code disassembly
• directly retrieved from the structure of the executable, like a file format description.
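The first kind of feature (statistics derived from file byte content) is easy to make concrete. A minimal sketch of my own, not code from the white paper: a normalized byte histogram plus Shannon entropy, both cheap to compute in one pass over the file.

```python
import math
from collections import Counter

def byte_features(data: bytes) -> dict:
    """Cheap statistics of raw byte content: a normalized byte-value
    histogram and Shannon entropy (in bits per byte)."""
    counts = Counter(data)
    n = len(data)
    hist = [counts.get(b, 0) / n for b in range(256)]
    entropy = sum(-p * math.log2(p) for p in hist if p > 0)
    return {"histogram": hist, "entropy": entropy}

# Repetitive content has low entropy; packed/encrypted content is close to 8.
print(byte_features(b"A" * 1024)["entropy"])   # -> 0.0
print(byte_features(b"AB" * 512)["entropy"])   # -> 1.0
```

High entropy is itself a useful signal here, since packed or encrypted payloads tend to look like random bytes.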
Using this data, we calculated a specific type of hash function called a locality-sensitive hash (LSH).
Regular cryptographic hashes of two almost identical files differ as much as hashes of two very different files. There is no connection between the similarity of files and their hashes. However, LSHs of almost identical files map to the same binary bucket – their LSHs are very similar – with high probability. LSHs of two different files differ substantially.
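The paper does not say which LSH scheme Kaspersky uses, but the bucket behavior it describes can be demonstrated with a toy SimHash over byte 2-grams (my own illustration): a small patch barely moves the hash, while unrelated content lands far away in Hamming distance.

```python
import hashlib

def simhash(data: bytes, bits: int = 64) -> int:
    """Toy locality-sensitive hash (SimHash over byte 2-grams):
    similar inputs yield hashes with a small Hamming distance."""
    acc = [0] * bits
    for i in range(len(data) - 1):
        # Hash each 2-gram and let its bits vote +1/-1 per position.
        h = int.from_bytes(hashlib.md5(data[i:i + 2]).digest()[:8], "big")
        for b in range(bits):
            acc[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if acc[b] > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

original = b"example payload " * 64
patched = original + b"tiny patch"   # near-identical "file"
unrelated = bytes(range(256)) * 16   # very different content
# The patched file stays close to the original; the unrelated one does not.
print(hamming(simhash(original), simhash(patched)) <
      hamming(simhash(original), simhash(unrelated)))
```

Contrast this with a cryptographic hash, where the one-byte patch would flip about half the output bits.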
Deep learning against rare attacks
Typically, machine learning deals with tasks in which both malicious and benign samples are represented by numerous examples in the training set. But some attacks are so rare that we have only one example of malware for training. This is typical for high-profile targeted attacks. In this case, we use a very specific deep learning-based model architecture. We call this approach exemplar network (ExNet).
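The paper gives no details of ExNet's internals, but the one-example idea can be conveyed with a generic nearest-exemplar scheme (entirely my sketch, not Kaspersky's architecture): embed each sample as a feature vector and match it against the single known exemplar by cosine similarity.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# One labeled exemplar per rare family; the feature vectors are made up.
exemplars = {"rare_family_A": [0.9, 0.1, 0.0]}

def match(sample, threshold=0.8):
    """Assign a sample to the most similar exemplar, if similar enough."""
    best = max(exemplars, key=lambda name: cosine(sample, exemplars[name]))
    return best if cosine(sample, exemplars[best]) >= threshold else None

print(match([0.85, 0.15, 0.05]))  # -> rare_family_A
print(match([0.0, 0.2, 0.9]))     # -> None
```

In a real system the embedding itself would be learned by the deep model; the hard part is making vectors of related samples land close together.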
Deep learning in post-execution behavior detection
The approaches described earlier were considered in the framework of static analysis, in which an object's description is extracted and analyzed before the object is executed in the real user environment.
Static analysis at the pre-execution stage has a number of significant advantages. The main advantage is that it is safe for the user. An object can be detected before it starts to act on a real user’s machine. But it faces issues with advanced encryption, obfuscation techniques and the use of a wide variety of high-level script languages, containers, and fileless attack scenarios. These are situations when post-execution behavior detection comes into play.
We also use deep learning methods to address the task of behavior detection. In the post-execution stage, we are working with behavior logs provided by the threat behavior engine. The behavior log is the sequence of system events occurring during the process execution, together with corresponding arguments.
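The deep model itself isn't described, but to make "sequence of system events" concrete, here is a made-up behavior log and a simple n-gram featurization (my illustration of how such sequences are often turned into features, not Kaspersky's method; arguments are omitted for brevity).

```python
from collections import Counter

def event_ngrams(log, n=2):
    """Count n-grams of consecutive events in a behavior log."""
    return Counter(tuple(log[i:i + n]) for i in range(len(log) - n + 1))

# Hypothetical behavior log: ordered system events from one process.
log = ["CreateFile", "WriteFile", "WriteFile", "CreateProcess", "DeleteFile"]
features = event_ngrams(log)
print(features[("CreateFile", "WriteFile")])   # -> 1
print(features[("WriteFile", "CreateProcess")])  # -> 1
```

A sequence model (as used here) can go further than n-gram counts by also weighing event order over long ranges and the arguments attached to each event.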
Distillation: packing the updates
The algorithms we use to detect malware in-lab differ from those that are optimal for user products. Some of the most powerful classification models require a large amount of resources like CPU/GPU time and memory, along with expensive feature extractors.
For example, since most modern malware writers use advanced packers and obfuscators to hide payload functionality, machine learning models benefit greatly from execution logs produced by an in-lab sandbox with advanced behavior logging. At the same time, gathering these kinds of logs in a pre-execution phase on a user’s machine would be computationally intensive and could noticeably degrade system performance.
It is more effective to keep and run those ‘heavy’ models in-lab. Once we know that a particular file is malware, we use the knowledge we have gained from the models to train lightweight classifiers that are going to work in our products.
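The teacher-to-student idea behind this can be shown in a deliberately tiny sketch (everything here is hypothetical, not Kaspersky's pipeline): an expensive "heavy" model labels a pool of samples in-lab, and a cheap one-parameter "student" is fitted to those labels for shipping.

```python
# Stand-in for an expensive in-lab scorer (e.g. sandbox + heavy classifier).
def heavy_model(x):
    return 1.0 if x > 5 else 0.0

# Label a pool of unlabeled samples with the heavy "teacher" model...
pool = [1, 2, 3, 4, 6, 7, 8, 9]
teacher_labels = [heavy_model(x) for x in pool]

# ...then fit a cheap one-parameter "student" to those labels:
# the smallest sample the teacher called malicious becomes the cutoff.
threshold = min(x for x, y in zip(pool, teacher_labels) if y == 1.0)

def light_model(x):
    return 1.0 if x >= threshold else 0.0

# The student reproduces the teacher on the pool at a fraction of the cost.
print(all(light_model(x) == heavy_model(x) for x in pool))  # -> True
```

The payoff is that only the lightweight classifier (here, a single threshold) has to run on the user's machine, while the heavy model stays in-lab.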