Andy Ful
Thread author
Interesting white paper about Machine Learning used in Kaspersky products (but also in other AVs).
The white paper:

Here are some interesting fragments:
Algorithms must allow us to quickly adapt them to malware writers’ counteractions
(...)
After applying machine learning to malware detection, we have to face the fact that our data distribution isn’t fixed:
• Active adversaries (malware writers) constantly work on avoiding detections and releasing new versions of malware files that differ significantly from those that have been seen during the training phase.
• Thousands of software companies produce new types of benign executables that are significantly different from previously known types. The data on these types was lacking in the training set, but the model, nevertheless, needs to recognize them as benign.
This causes serious changes in data distribution and raises the problem of detection rate degradation over time in any machine learning implementation. Cybersecurity vendors that implement machine learning in their antimalware solutions face this problem and need to overcome it. The architecture needs to be flexible and has to allow model updates ‘on the fly’ between retraining. Vendors must also have effective processes for collecting and labeling new samples, enriching training datasets and regularly retraining models. [Chart in the paper: detection rate (% of malware detected) over time, 85–100%]
Detecting new malware in pre-execution with similarity hashing
(...)
We were interested in features that were robust against small changes in a file. These features would detect new modifications of malware, but would not require more resources for calculation. Performance and scalability are the key priorities of the first stages of anti-malware engine processing.
To address this, we focused on extracting features that could be:
• calculated quickly, like statistics derived from file byte content or code disassembly
• directly retrieved from the structure of the executable, like a file format description.
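The first kind of feature (statistics derived from file byte content) is easy to make concrete. A minimal sketch of my own, not code from the white paper: a normalized byte histogram plus Shannon entropy, both cheap to compute in one pass over the file.

```python
import math
from collections import Counter

def byte_features(data: bytes) -> dict:
    """Cheap statistics of raw byte content: a normalized byte-value
    histogram and Shannon entropy (in bits per byte)."""
    counts = Counter(data)
    n = len(data)
    hist = [counts.get(b, 0) / n for b in range(256)]
    entropy = sum(-p * math.log2(p) for p in hist if p > 0)
    return {"histogram": hist, "entropy": entropy}

# Repetitive content has low entropy; packed/encrypted content is close to 8.
print(byte_features(b"A" * 1024)["entropy"])   # -> 0.0
print(byte_features(b"AB" * 512)["entropy"])   # -> 1.0
```

High entropy is itself a useful signal here, since packed or encrypted payloads tend to look like random bytes.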
Using this data, we calculated a specific type of hash function called a locality-sensitive hash (LSH).
Regular cryptographic hashes of two almost identical files differ as much as hashes of two very different files. There is no connection between the similarity of files and their hashes. However, LSHs of almost identical files map to the same binary bucket – their LSHs are very similar – with high probability. LSHs of two different files differ substantially.
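The paper does not say which LSH scheme Kaspersky uses, but the bucket behavior it describes can be demonstrated with a toy SimHash over byte 2-grams (my own illustration): a small patch barely moves the hash, while unrelated content lands far away in Hamming distance.

```python
import hashlib

def simhash(data: bytes, bits: int = 64) -> int:
    """Toy locality-sensitive hash (SimHash over byte 2-grams):
    similar inputs yield hashes with a small Hamming distance."""
    acc = [0] * bits
    for i in range(len(data) - 1):
        # Hash each 2-gram and let its bits vote +1/-1 per position.
        h = int.from_bytes(hashlib.md5(data[i:i + 2]).digest()[:8], "big")
        for b in range(bits):
            acc[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if acc[b] > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

original = b"example payload " * 64
patched = original + b"tiny patch"   # near-identical "file"
unrelated = bytes(range(256)) * 16   # very different content
# The patched file stays close to the original; the unrelated one does not.
print(hamming(simhash(original), simhash(patched)) <
      hamming(simhash(original), simhash(unrelated)))
```

Contrast this with a cryptographic hash, where the one-byte patch would flip about half the output bits.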
Deep learning against rare attacks
Typically, machine learning deals with tasks in which both malicious and benign samples are represented by numerous examples in the training set. But some attacks are so rare that we have only one example of malware for training. This is typical for high-profile targeted attacks. In this case, we use a very specific deep learning-based model architecture. We call this approach exemplar network (ExNet).
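The paper gives no details of ExNet's internals, but the one-example idea can be conveyed with a generic nearest-exemplar scheme (entirely my sketch, not Kaspersky's architecture): embed each sample as a feature vector and match it against the single known exemplar by cosine similarity.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# One labeled exemplar per rare family; the feature vectors are made up.
exemplars = {"rare_family_A": [0.9, 0.1, 0.0]}

def match(sample, threshold=0.8):
    """Assign a sample to the most similar exemplar, if similar enough."""
    best = max(exemplars, key=lambda name: cosine(sample, exemplars[name]))
    return best if cosine(sample, exemplars[best]) >= threshold else None

print(match([0.85, 0.15, 0.05]))  # -> rare_family_A
print(match([0.0, 0.2, 0.9]))     # -> None
```

In a real system the embedding itself would be learned by the deep model; the hard part is making vectors of related samples land close together.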
Deep learning in post-execution behavior detection
The approaches described earlier were considered in the framework of static analysis, in which an object's description is extracted and analyzed before the object is executed in the real user environment.
Static analysis at the pre-execution stage has a number of significant advantages. The main advantage is that it is safe for the user. An object can be detected before it starts to act on a real user’s machine. But it faces issues with advanced encryption, obfuscation techniques and the use of a wide variety of high-level script languages, containers, and fileless attack scenarios. These are situations when post-execution behavior detection comes into play.
We also use deep learning methods to address the task of behavior detection. In the post-execution stage, we are working with behavior logs provided by the threat behavior engine. The behavior log is the sequence of system events occurring during the process execution, together with corresponding arguments.
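The deep model itself isn't described, but to make "sequence of system events" concrete, here is a made-up behavior log and a simple n-gram featurization (my illustration of how such sequences are often turned into features, not Kaspersky's method; arguments are omitted for brevity).

```python
from collections import Counter

def event_ngrams(log, n=2):
    """Count n-grams of consecutive events in a behavior log."""
    return Counter(tuple(log[i:i + n]) for i in range(len(log) - n + 1))

# Hypothetical behavior log: ordered system events from one process.
log = ["CreateFile", "WriteFile", "WriteFile", "CreateProcess", "DeleteFile"]
features = event_ngrams(log)
print(features[("CreateFile", "WriteFile")])   # -> 1
print(features[("WriteFile", "CreateProcess")])  # -> 1
```

A sequence model (as used here) can go further than n-gram counts by also weighing event order over long ranges and the arguments attached to each event.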
Distillation: packing the updates
The algorithms we use to detect malware in-lab differ from those that are optimal for user products. Some of the most powerful classification models require a large amount of resources like CPU/GPU time and memory, along with expensive feature extractors.
For example, since most modern malware writers use advanced packers and obfuscators to hide payload functionality, machine learning models benefit greatly from execution logs produced by an in-lab sandbox with advanced behavior logging. At the same time, gathering these kinds of logs in a pre-execution phase on a user’s machine would be computationally intensive and could noticeably degrade system performance.
It is more effective to keep and run those ‘heavy’ models in-lab. Once we know that a particular file is malware, we use the knowledge we have gained from the models to train lightweight classifiers that are going to work in our products.
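The teacher-to-student idea behind this can be shown in a deliberately tiny sketch (everything here is hypothetical, not Kaspersky's pipeline): an expensive "heavy" model labels a pool of samples in-lab, and a cheap one-parameter "student" is fitted to those labels for shipping.

```python
# Stand-in for an expensive in-lab scorer (e.g. sandbox + heavy classifier).
def heavy_model(x):
    return 1.0 if x > 5 else 0.0

# Label a pool of unlabeled samples with the heavy "teacher" model...
pool = [1, 2, 3, 4, 6, 7, 8, 9]
teacher_labels = [heavy_model(x) for x in pool]

# ...then fit a cheap one-parameter "student" to those labels:
# the smallest sample the teacher called malicious becomes the cutoff.
threshold = min(x for x, y in zip(pool, teacher_labels) if y == 1.0)

def light_model(x):
    return 1.0 if x >= threshold else 0.0

# The student reproduces the teacher on the pool at a fraction of the cost.
print(all(light_model(x) == heavy_model(x) for x in pool))  # -> True
```

The payoff is that only the lightweight classifier (here, a single threshold) has to run on the user's machine, while the heavy model stays in-lab.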