Here are some thoughts to consider
Exploit Protection Metrics
While you're testing against in-the-wild malware, it would be awesome to see more specific data on how solutions handle exploits and vulnerability-based attacks. For example, a metric that shows how well a product prevents a zero-day exploit from dropping its payload, or how it responds to common attack vectors like malicious Office macros or a compromised browser. This goes beyond just file-based detection and gets into the core of how a security suite's behavioral engine works.
From the point of view of our methodology, the applications and the browser are updated daily (the system itself is the exception), so effective exploitation is almost impossible, in line with patch management practices.
See sections 8.4, 8.5, and 8.6 of the methodology: Methods Of Carrying Out Automatic Tests » AVLab Cybersecurity Foundation
To be honest, I also see a problem here: very few exploits are actually available in the wild.
Resource Impact (CPU, RAM, Disk I/O)
For many users, especially those with older hardware or who are on a budget, system performance is a big deal. A security suite that offers great protection but slows down their computer to a crawl isn't a practical option. Including a metric in the downloadable data that shows the average CPU, RAM, and disk I/O usage during the test, both at idle and during active scanning, would be super helpful. This would let people balance protection with performance.
This test cannot be part of the Advanced In-The-Wild Malware Test. I think AV-Comparatives (AV-C) does this better; we do not have the know-how to perform this test. In addition, interest from vendors may be low, and the costs of maintaining the server and machines to carry out this test may be high. This would require financial support from the community. We are not funded by the government like other labs.
False Positive Rates (with context)
This is a huge one. A product can have a perfect 100% detection rate, but if it's constantly flagging legitimate files or applications as malicious, it's a nightmare to use. Providing a list of false positives, along with a brief explanation of why the product flagged the file, would be invaluable. This transparency helps users understand the "cost" of a more aggressive protection stance. You could even create a separate metric for this, similar to what AV-Comparatives does with their False Positive Test.
To be honest, we have it on our map, and it can be done faster than you think. I think it's the main point in the next major update.
We have plenty of FPs in the wild, so adding clean installers to each edition, for example in a number equal to 1% of the entire malware set, could be interesting.
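As a rough illustration of the 1% idea, here is a minimal Python sketch. The function names and the sample counts are hypothetical and not taken from our test data; it only shows how many clean installers would be added for a given malware set and how a false positive rate could be computed from them.

```python
def clean_sample_count(malware_count: int, percent: int = 1) -> int:
    """Number of clean installers to add, as a percentage of the malware set.

    Uses integer ceiling division so that small sets still get at least one file.
    """
    if malware_count < 0 or percent < 0:
        raise ValueError("malware_count and percent must be non-negative")
    return (malware_count * percent + 99) // 100


def false_positive_rate(flagged_clean: int, total_clean: int) -> float:
    """Fraction of clean files that a product wrongly flagged as malicious."""
    if total_clean <= 0:
        raise ValueError("total_clean must be positive")
    return flagged_clean / total_clean


# Hypothetical edition: 500 malware samples, one clean installer flagged.
clean_total = clean_sample_count(500)
fp_rate = false_positive_rate(1, clean_total)
print(clean_total, fp_rate)
```

With a clean set this small, a single false positive moves the rate a lot, which is one reason the share of clean files would need some thought.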
Network-Level Protection Data
You mentioned C&C connections in your initial post, which is fantastic. To build on that, it would be cool to see a breakdown of a product's network-level protection. For example, a metric that shows how many malicious URLs or IPs were blocked before any malware was even downloaded. This highlights the effectiveness of a product's web and network shields, which are a critical first line of defense.
We're already doing it – see the PRE_LAUNCH level
- web-level protection
- IP/URL reputation
- on-the-fly scanning
- payload scanning, e.g. Check Point + ZoneAlarm
- on-access scanning (but before launch)
All of this is first-layer protection, “network protection,” as PRE_LAUNCH in our Advanced In-The-Wild Malware Test.
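To show how such a breakdown could be read from the downloadable data, here is a minimal Python sketch that tallies per-sample outcomes by level. The label "FAIL" and the list-of-strings input format are assumptions made for this example, and POST_LAUNCH is written with an underscore only for consistency in the code; this is not our actual export format.

```python
from collections import Counter

# PRE_LAUNCH and POST_LAUNCH follow the levels named above;
# "FAIL" is a hypothetical label for a sample that was not stopped.
LEVELS = ("PRE_LAUNCH", "POST_LAUNCH", "FAIL")


def summarize(outcomes):
    """Return the share of samples that ended at each level."""
    counts = Counter(outcomes)
    total = len(outcomes)
    if total == 0:
        return {level: 0.0 for level in LEVELS}
    return {level: counts[level] / total for level in LEVELS}


# Hypothetical results for four samples tested against one product.
sample_outcomes = ["PRE_LAUNCH", "PRE_LAUNCH", "POST_LAUNCH", "FAIL"]
print(summarize(sample_outcomes))
```

A high PRE_LAUNCH share in such a summary would point to strong web and network shields, since those samples were stopped before they could be launched.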
Offline Protection Scenarios
What happens when the system isn't connected to the internet? Many threats are still able to execute or spread on an air-gapped network. Testing how a product's signature-based and heuristic engines perform without a cloud connection would provide a more complete picture of its capabilities.
By cutting off access to the AV network / cloud scanning, you also cut the malware off from the Internet. Technically, many samples will not work, and it will be difficult to prove the usefulness of such a test. This test was more valuable a decade ago or more, when cloud development was not as widespread. Note that even if you want to scan your system online, you need to download the latest signatures.
As for OFFLINE protection, I think it should be a feature that cuts off all processes from the Internet except those of the AV/EDR. Offline isolation is available in business solutions.
This is more complicated. Most AVs use heuristic and behavioral analysis both locally and in the cloud. The comprehensive analysis is currently time- and resource-consuming and available only in the cloud. In contrast, the local analysis depends on already trained models and does not consume much time or resources.
By testing products offline, one can only see how important the cloud backend is for a particular AV. You will not get information on how strong the heuristic and behavioral engines are. Of course, such information about local abilities can be important for some users and worth testing.
Very good opinion, which is why most manufacturers will not go for this type of testing, because you cut them off from their infrastructure.
The way I see it, the offline test isn't meant to be the definitive measure of an AV's effectiveness.
Instead, it's a way to assess a very specific aspect of its design, its resilience and performance when its cloud resources are unavailable.
Think of it as two separate, but equally important, metrics.
Online Performance
The most crucial test. This is the AV's full potential, leveraging all of its local and cloud-based resources to provide the best possible protection. This is the scenario that reflects most users' day-to-day experience.
Offline Performance
This is the 'stress test.' It's a way to measure how well the on-device engines can handle a threat on their own. This is a crucial data point for users who may have unreliable internet connections or for scenarios where malware might be designed to disrupt cloud communication.
By combining the results of both tests, you get a much more complete picture of the product's overall capabilities and its reliance on a constant internet connection. I completely agree that the overall strength of the heuristic and behavioral engines is best judged when they're working in tandem with the cloud backend.
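One way to combine the two results is to report both rates and the gap between them. The Python sketch below is only an illustration: the function name, the "cloud_gap" field, and the numbers are hypothetical and do not come from any real test.

```python
def cloud_reliance(online_blocked: int, offline_blocked: int, total: int) -> dict:
    """Compare protection rates with and without the cloud backend.

    cloud_gap is the drop in protection rate when the cloud is unavailable.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    online_rate = online_blocked / total
    offline_rate = offline_blocked / total
    return {
        "online_rate": online_rate,
        "offline_rate": offline_rate,
        "cloud_gap": online_rate - offline_rate,
    }


# Hypothetical product: 200 samples, all blocked online, 150 blocked offline.
report = cloud_reliance(online_blocked=200, offline_blocked=150, total=200)
print(report)
```

As noted earlier in the thread, a large gap mostly tells you how much the product depends on its cloud backend. It does not by itself say how strong the heuristic and behavioral engines are.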
There was also an opinion about malware with modified SHA to test 0-day protection:
Please note that our tests also include 0-day threats, but it is practically impossible to show which file is a 0-day threat for a given AV. We cover this in general by measuring protection after launch, i.e. POST-LAUNCH:
- it can be assumed that many of these samples are 0-day for the manufacturer, but this also depends on the protection in the browser, whether it is implemented or weak, such as Comodo/Xcitium,
- the same file may be a 0-day for AV1 but not for AV2, because they may have different signatures. The same file downloaded from a URL/IP may be blocked by one product's URL blacklist and downloaded under the other, so for one it will be a PRE_LAUNCH result and for the other a POST-LAUNCH result.
The modified sample is no longer from the in-the-wild set, so the test cannot be classified as Real World. In addition, such a modified file cannot be delivered from the original URL/IP; it must be a different protocol. This is a separate test, but it can be automated. We can add it to the road map and implement it in the future.
We have already considered your opinion in AMTSO, and there are more questions than answers in the case of modified samples, but the topic is still open.
Additionally, there is another problem: depending on where they are placed, moving or adding empty bytes may result in different detection, sometimes a missed detection, sometimes an additional one. What's more, why not extend the test to include false positives for clean files with additional bytes? It's getting messy.
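To make the "modified SHA" point concrete, here is a minimal Python sketch using only the standard hashlib module and a harmless placeholder byte string. The helper names are hypothetical. It shows that adding empty (zero) bytes changes the hash, and that the placement of those bytes produces yet another distinct file.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def append_padding(data: bytes, count: int) -> bytes:
    """Add `count` zero bytes at the end of the data."""
    return data + b"\x00" * count


def insert_padding(data: bytes, offset: int, count: int) -> bytes:
    """Insert `count` zero bytes at the given offset."""
    return data[:offset] + b"\x00" * count + data[offset:]


# Harmless placeholder content standing in for a file.
original = b"harmless placeholder content"
appended = append_padding(original, 4)
inserted = insert_padding(original, 5, 4)

print(sha256_hex(original))
print(sha256_hex(appended))
print(sha256_hex(inserted))
```

All three hashes differ, so a pure hash lookup treats each variant as an unknown file. Whether the other engines still detect it is exactly what such a separate test would have to measure.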
