AV-Comparatives: randomness in AV lab testing.

Disclaimer
  1. This test shows how an antivirus behaves with certain threats, in a specific environment and under certain conditions.
    We encourage you to compare these results with others and make informed decisions on which security products to use.
    Before buying an antivirus, you should consider factors such as price, ease of use, compatibility, and support. Installing a free trial version allows an antivirus to be tested in everyday use before purchase.

Andy Ful (Thread author)
Let's assume that we have the following results for a test:
2 AVs with 0 undetected malware
4 AVs with 1 undetected malware
4 AVs with 2 undetected malware
3 AVs with 3 undetected malware
1 AV with 4 undetected malware

This distribution can be very close to the probabilities for a hypothetical AV when:
  • k = 60 (the number of samples that compromised the hypothetical AV in the large pool of 300,000 samples);
  • m = 300000 (the number of samples in the large pool);
  • n = 10250 (the number of samples included in the AV lab test).
We can calculate the probabilities of missing exactly 0, 1, 2, 3, or 4 samples in the test:
p(0) = 0.124 (15.5 * 0.124 ~ 2)
p(1) = 0.264 (15.5 * 0.264 ~ 4)
p(2) = 0.275 (15.5 * 0.275 ~ 4)
p(3) = 0.188 (15.5 * 0.188 ~ 3)
p(4) = 0.095 (15.5 * 0.095 ~ 1)
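For readers who want to reproduce these numbers, here is a minimal Python sketch of the draw-without-replacement model described in this thread (my own illustration, not AV-Comparatives' code): n test samples are drawn without replacement from a pool of m samples, k of which the hypothetical AV misses, so the number of missed test samples follows a hypergeometric distribution.

```python
# Sketch of the random-draw model: n test samples are drawn without
# replacement from a pool of m samples, k of which the hypothetical AV
# misses, so the number of missed test samples is hypergeometric.
from math import lgamma, exp

def log_comb(a: int, b: int) -> float:
    """Natural log of the binomial coefficient C(a, b), via lgamma."""
    return lgamma(a + 1) - lgamma(b + 1) - lgamma(a - b + 1)

def hypergeom_pmf(x: int, m: int, k: int, n: int) -> float:
    """P(exactly x missed samples among n drawn from m, with k missed overall)."""
    return exp(log_comb(k, x) + log_comb(m - k, n - x) - log_comb(m, n))

m, k, n = 300_000, 60, 10_250
for x in range(5):
    p = hypergeom_pmf(x, m, k, n)
    print(f"p({x}) = {p:.3f}   15.5 * p({x}) ~ {15.5 * p:.1f} AVs")
# Prints p(0)..p(4) ~ 0.124, 0.264, 0.275, 0.188, 0.095, matching the list above.
```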

We can see that these probabilities are approximately proportional to the number of AVs for each count of undetected malware. The proportionality constant is about 15.5. We can compare these statistics to the AV-Comparatives Malware Protection test for March 2020:
3 AVs with 0 undetected malware (F-Secure, G Data, NortonLifeLock)
4 AVs with 1 undetected malware (ESET, K7, TotalAV, Total Defense)
3 AVs with 2 undetected malware (Avast, AVG, Bitdefender)
3 AVs with 3 undetected malware (Avira, Kaspersky, VIPRE)
1 AV with 4 undetected malware (Panda)

The difference is minimal. For example, if Norton and Total Defense had each missed one more sample, the results for the 14 AVs would be very close to random trials of one hypothetical AV.

It seems that a similar conclusion was reached by AV-Comparatives, because it awarded all 14 AVs.
Eight of the products (*) received lower awards due to false alarms.

Edit:
It seems that the same conclusion can be derived from the cluster analysis in the report:

[Image: cluster-analysis table from the AV-Comparatives report]

The AVs mentioned in my statistical model belong to cluster one (see the last column) and were awarded. The other AVs belong to other clusters.

Here is what AV-Comparatives says about the importance of clusters:
"Our tests use much more test cases (samples) per product and month than any similar test performed by other testing labs. Because of the higher statistical significance this achieves, we consider all the products in each results cluster to be equally effective, assuming that they have a false-positives rate below the industry average."
Real-World Protection Test Methodology - AV-Comparatives (av-comparatives.org)
 

Andy Ful (Thread author)
So, what can we say about the AV-Comparatives Malware Protection tests?
There were about 15,000 different malware variants in March 2020 (according to SonicWall statistics), and the statistical model from the previous post assumed m = 300,000 samples in the large pool. It is probable that the tested samples included many polymorphic siblings, so one SonicWall malware variant was duplicated on average 20 times (20 * 15,000 = 300,000).
The large pool could also be several times greater if we proportionally increased the number k of missed samples (confirmed: the differences are minimal and unimportant even for m = 8,000,000, k = 1,600).
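The scaling check mentioned above can be reproduced with the same hypergeometric sketch (again my own illustration): keeping the ratio k/m fixed while enlarging the pool barely changes the probabilities, because the distribution approaches a Poisson distribution with mean n*k/m ~ 2.05.

```python
# Sketch: scaling m and k proportionally (same ratio k/m) barely changes the
# distribution of missed test samples, because it converges to a Poisson
# distribution with mean n * k / m (~2.05 here).
from math import lgamma, exp

def log_comb(a, b):
    return lgamma(a + 1) - lgamma(b + 1) - lgamma(a - b + 1)

def hypergeom_pmf(x, m, k, n):
    return exp(log_comb(k, x) + log_comb(m - k, n - x) - log_comb(m, n))

n = 10_250
for m, k in [(300_000, 60), (8_000_000, 1_600)]:
    probs = " ".join(f"{hypergeom_pmf(x, m, k, n):.3f}" for x in range(5))
    print(f"m = {m:>9}, k = {k:>5}:  {probs}")
# The two rows agree to within about 0.005 at every point, so the
# differences are indeed minimal.
```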

Now we can also understand what probably happened to Trend Micro, which was compromised 82 times in the March 2020 test and not at all in September 2019.

Trend Micro could simply have missed a few SonicWall malware variants that had many polymorphic variations among the tested samples. The other AVs apparently detected the malware variants that had many polymorphic variations.

Furthermore, k = 60 (missed samples in the large pool) is very small compared to m = 300,000 (the number of samples in the large pool). This suggests that the samples were mostly not 0-day, but rather a few days old on average.
 

Andy Ful (Thread author)
Now, an example of test scores that can be interpreted without calculations.
0 missed - 1 AV
1 missed - 1 AV
2 missed - 2 AVs
3 missed - 2 AVs
4 missed - 0
...
11 missed - 0
12 missed - 2 AVs
13 missed - 0
...
17 missed - 0
18 missed - 1 AV
19 missed - 0
...
42 missed - 0
43 missed - 1 AV

For the first 6 AVs, it is possible to find a statistical model similar to random trials of one hypothetical AV.
But no such model can explain the big gaps to the rest of the AVs (see the sketch below). So, the first 6 AVs should be treated as equally good and awarded. The last 4 AVs should not be awarded.
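As a rough sanity check (my own sketch, using a Poisson approximation with the mean miss rate fitted to the six best products above; the Poisson stand-in is an assumption for illustration), the probability that an equally good product would miss 12 or more samples is negligible:

```python
# Rough check: fit a mean miss rate to the six best products and ask how likely
# 12+ misses would be for an equally good product. A Poisson distribution is
# used as a simple stand-in for the hypergeometric model; the numbers 0..3 are
# the misses of the first six AVs listed above.
from math import exp, factorial

best_six = [0, 1, 2, 2, 3, 3]
lam = sum(best_six) / len(best_six)   # fitted mean miss rate, ~1.83

def poisson_tail(k: int, lam: float) -> float:
    """P(X >= k) for a Poisson(lam) random variable."""
    return 1.0 - sum(exp(-lam) * lam ** i / factorial(i) for i in range(k))

print(f"lambda     = {lam:.2f}")
print(f"P(X >= 12) = {poisson_tail(12, lam):.1e}")  # a few times 1e-7
print(f"P(X >= 18) = {poisson_tail(18, lam):.1e}")  # essentially zero
# So products missing 12, 18 or 43 samples cannot plausibly belong to the same
# "equally good" group as the first six.
```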

The above example was in fact taken from a recent real test by MRG Effitas:

[Image: MRG Effitas Q2 2020 360° Assessment results]

"Out of ten tested security products, the following six managed to meet the specification to attain our Q2 2020 360 Degree Certification.
• Avast Business Antivirus
• Bitdefender Endpoint Security
• CrowdStrike Falcon Protect
• ESET Endpoint Security
• Microsoft Windows Defender
• Symantec Endpoint Protection
"
 

Andy Ful (Thread author)
What is really done in AV lab tests?
The labs first reduce the huge number of in-the-wild malicious samples, and then perform the tests on samples that are in some way representative of all samples in the wild.
In numbers, per month it looks like this:

A few million malware samples ----> a few hundred test samples

How the AV labs manage to do this is a kind of magic.

Anyway, as can be seen from the SonicWall reports, the number of different malware variants is a few tens of thousands per month. These are probably good representatives of all samples in the wild.
So, one must introduce a statistical model, because there are many ways of choosing the set of tested samples from a set of different malware variants that is hundreds of times larger.

The statistical model in this thread is one of many possible models. We simply draw the n samples at random from the larger set of m samples. The number of possible draws is enormous and is given by the binomial coefficient B(m, n). For example:
B(30000, 100) ~ 4.7 * 10^289, which is much greater than the number of atoms in the Universe.
The details of the model are in the OP. When we apply this model to the real data, it is assumed that the tested AVs have very different sets of missed malware in the wild.
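A quick way to reproduce the B(30000, 100) figure in Python (my own sketch):

```python
# The number of ways to draw n test samples from a pool of m is the binomial
# coefficient C(m, n); for m = 30000, n = 100 it is already ~4.7e289.
from math import comb, lgamma, log

m, n = 30_000, 100
exact = comb(m, n)                            # exact arbitrary-precision integer
print(f"C({m}, {n}) ~ {float(exact):.2e}")    # ~4.7e+289, still fits in a float

# For larger arguments (e.g. C(300000, 10250)) the value overflows a float,
# so estimate its order of magnitude via log-gamma instead:
def log10_comb(a: int, b: int) -> float:
    return (lgamma(a + 1) - lgamma(b + 1) - lgamma(a - b + 1)) / log(10)

print(f"log10 C(300000, 10250) ~ {log10_comb(300_000, 10_250):.0f}")  # ~19400
```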
 

Andy Ful (Thread author)

@McMcbrad,


Although I agree with most things you mentioned in your post, a few things have to be clarified:
  1. In this thread, I tried to show that most scores in AV tests can be easily explained by simple random statistical models.
  2. I also confirmed that there are many such models. In many cases, another useful model can be constructed by proportionally changing the values of m and k. So these models are independent of the SonicWall statistics (although those were a good starting point).
  3. I do not insist that any of these models reflects reality, but I claim that such models cannot be rejected on the basis of the publicly available information about testing procedures.
  4. From the fact that random models can explain the results of most AVs in one particular test, it follows that such a test alone cannot give us (the readers) sufficient information to compare the AVs' protection in the wild.
  5. Some useful information can be derived only by comparing several similar tests, for example by looking for consistently high (or consistently low) scores.
  6. The conclusions drawn from the models presented in this thread are consistent with the awards proposed by the AV testing labs. They are also consistent with the cluster analysis (first cluster) made by AV-Comparatives in the Malware Protection tests.
There are some things on which we probably do not fully agree, but they are not relevant to this thread.(y)

Edit: The post was slightly edited to be clearer.
 

plat
Assuming each malware sample has equal weight, statistically, the larger the sample size AND the larger the subject pool, the less statistically significant each missed sample becomes. Here's where it gets "fun" to add the color red to the graphs! Drama, drama, look at the difference--when it could be statistically insignificant.

Your calculations are the utopia of AV lab testing, Andy Ful. They should all be so clean. Further, as someone else stated above me, one quarter's results should not be the gospel ad infinitum for a given brand, as there is so much inherent variability--both for samples and subjects. These lab tests aren't clean; that's virtually impossible.

Now you have the addition of "deviations from defaults" as part of the testing regimen of AV Comparatives for Business, for one. That's where the serious money is, both in security solutions and targets for threats like ransomware. Not all "deviations" are created equal, right?
 

Andy Ful (Thread author)
AV-Comparatives Malware Protection tests 2019-2020 (four tests), part two.

This is a continuation of the post in another thread, where the impact of polymorphic samples was skipped:

In this post, I am going to examine the cumulative results of the last two years (March 2019, September 2019, March 2020, September 2020), on the assumption that a strangely high number of missed samples was caused not by several different malware but by one polymorphic malware. Most AVs had such strange results. For example, Kaspersky had 13 missed samples in March 2019 and 9 missed samples in September 2019. What if there were in fact only two polymorphic malware, one in 13 variants and the second in 9 variants? Let's look at the results, where 9+ missed samples were replaced by one polymorphic sample:

----------------Missed samples----Clusters
Avast, AVG.........1+0+2+0...........1,1,1,1
F-Secure ............1+1+0+1...........1,1,1,1
McAfee .............1+1+1+0............1,1,1,1

Norton...............(2)+(2)+0+2.......1,1,1,1

ESET...................1+1+1+2 ..........1,1,1,1
Kaspersky...........1+1+3+1...........1,1,1,1
Panda ................1+1+4+1...........1,1,1,1
Microsoft............2+4+1+0...........1,1,1,1
Bitdefender.........1+5+2+1...........1,1,1,1

K7.......................5+5+1+2...........1,1,1,1
Avira* ................0+4+3+4............1,1,1,2
VIPRE ................4+1+3+4 ...........1,1,1,2
Total Defense.....5+1+1+4............1,1,1,2


As we can see, the differences between the AVs almost vanished. So, in the Malware Protection tests, even four different tests from two years are probably not sufficient to see important differences between popular AVs. The final score can depend strongly on how many polymorphic samples and polymorphic variations were present in the tests. Without knowing this, an AV comparison based on such tests is not reliable at all.
Polymorphic samples could also explain Trend Micro's baffling results across the four tests (0 missed samples in the two tests from 2019, and 82 + 175 = 257 missed samples in 2020).😵

The situation is clearer and easier to explain in the case of the Real-World tests, because from the results we know that polymorphic samples are absent. :)

 

Andy Ful (Thread author)
The important differences can be seen neither in this test nor in any other, nor can you predict them in any way. ...
Ha, ha. I am not brave enough to claim the above firmly, because no one would believe it without solid reasoning, and many people will not believe it even with very solid reasoning. But I have tried hard to show why we cannot easily interpret the results of AV tests, and why something that seems easy to understand (colorful charts of AV scores) is not easy at all, due to MAGIC.

Millions of malware in the wild --------> MAGIC -------> thousands of tested samples

So, a magician would probably be more appropriate to show the difference. :)
 

Lenny_Fox
Let's consider the example of the initial pool of 30,000 sufficiently different malware variants in the wild and a particular AV which failed to detect 100 of them.
Next, we make a trial to choose 380 samples from these 30,000 and calculate the probabilities of finding 0, 1, 2, or 3 undetected malware among these 380 samples.
Nice read with impressive statistics, but your assumption is not correct: no test lab in the world manages to catch 30,000 new malware samples per month.

It is very hard to find malware samples in the wild that can bypass an updated Windows 10 PC. Next, you have to verify that the malware is genuine without alerting all the AVs. After the genuine-malware check, you have to check whether the infection is repeatable and still alive. The last step is to decide which variants of the same malware to include, based on prevalence.

AV-Comparatives uses between 150 and 250 samples monthly,
AV-Test between 100 and 200 monthly.
All other, smaller test labs manage at best 25 to 75 monthly.


Your calculation is applicable to zoo malware, but the sample sets of zoo malware are around 5,000 to 15,000 per month (depending on the size of the test lab).
 

Andy Ful (Thread author)
Nice read with impressive statistics, but your assumption is not correct: no test lab in the world manages to catch 30,000 new malware samples per month.
Please read the thread carefully. This number (30,000) is not related to any AV testing lab. Search for SonicWall Real-Time Deep Memory Inspection (RTDMI). :)(y)
Furthermore, the statistics change only a little when you proportionally increase this number and the number of missed samples in the wild. Of course, the true numbers are not publicly known; only the AV vendors could give us insight there.
 

Lenny_Fox
@Andy Ful

Sonic Wall said:
SonicWall reveals that a new Capture Cloud engine has discovered hundreds of new malware variants not seen before by sandboxing technology (now I am really terrified).

Through the use of previously unannounced patent-pending technology, SonicWall Capture Labs security researchers engineered an advanced method for identifying and mitigating threats through deep memory inspection — all in real time. (of course :) in what scenario would you catch malware with batch based stand alone off-line computing)

SonicWall RTDMI is a patent-pending technology and process utilized by the SonicWall Capture Cloud to identify and mitigate even the most insidious modern threats, including Intel Meltdown exploits. bla bla bla (don't forget to add patent-pending a few times) bla bla bla patent-pending bla bla bla patent-pending bla bla bla

Which insight did I miss reading this? Please help a clueless member to see the point you are making.
 

Lenny_Fox
Anyway, the extended analysis suggests that Malware Hub samples are fresher than those used by AV-Comparatives.
You can't compare them, because you don't have the AV-Comparatives sample set. It is pure and utter speculation with fake facts (as I explained in my first post). A complex formula with the wrong input does not produce a valid outcome. Andy, PLEASE.

I may be overreacting a bit in these times of conspiracy theories. What is next? Are the people from the test labs also known child abusers working for the 1%?
 

Andy Ful (Thread author)
You can't compare them, because you don't have the AV-Comparatives sample set. It is pure and utter speculation with fake facts (as I explained in my first post). A complex formula with the wrong input does not produce a valid outcome. Andy, PLEASE.

I may be overreacting a bit in these times of conspiracy theories. What is next? Are the people from the test labs also known child abusers working for the 1%?
You are irritated without good reason. :)(y)
Everything I presented in this thread is already well known to any AV testing lab. It is also written in plain text:
"Our tests use much more test cases (samples) per product and month than any similar test performed by other testing labs. Because of the higher statistical significance this achieves, we consider all the products in each results cluster to be equally effective, assuming that they have a false-positives rate below the industry average."
Real-World Protection Test Methodology - AV-Comparatives (av-comparatives.org)
As can be seen from the AV-Comparatives reports, the best AVs are usually in the first cluster (10 AVs or more), so they can be equally effective against malware in the wild (despite the differences in a particular test).

I think that you want the model presented in this thread to be more than it is. Like any model, it is based on some assumptions. These assumptions cannot be proved or rejected, because the real data are not known. They could be verified only if all the data gathered by the AV vendors and AV testing labs were publicly known.

Anyway, even such a simple model is fully compatible with the AV-Comparatives Real-World awards. In fact, the model is very similar to the cluster method used in the AV-Comparatives reports. The model and the cluster method simply say that, when looking at the results of a single test, there is a big random factor. So, several AVs can in fact have different protection in the wild in the tested period than their scores in that one particular test suggest.

The model does not say that nothing interesting can be said about AVs, especially when comparing the results of many reports. (y)
 

Andy Ful (Thread author)
In the previous posts, I tried to show that even after averaging the results of several tests, most AVs seem to have an important random factor. For example:
How the hell WD works on Windows Home & Pro? | MalwareTips Community

In this post, I will present a slightly different approach and compare the results of tests between two years. The tests include the results from the years 2019 and 2020 (AV-Comparatives Real-World, AV-Comparatives Malware Protection, AV-Test, SE Labs).
The data can be derived from the attachment chances_to_be_infected.txt in the post:
How big are your chances to be infected? | MalwareTips Community

The first column contains the numbers of missed samples in Real-World tests. The second column contains the numbers of missed samples in Malware Protection tests. The AVs are sorted by the total number of missed samples.

2019
1. Norton..............(2_______2)
2. TrendMicro.....(7_______0)
3. Avira..................(15______9)
4. Microsoft........(12_____13)
5. F-Secure..........(12_____25)
6. Kaspersky.......(14_____24)
7. McAfee............(35_____30)
8. Avast................(26_____57)


2020 (up to October)
1. F-Secure............(4_______1)
2. Norton...............(6_______2)
3. Kaspersky.........(4_______5)
4. Avast..................(11______2)
5. Microsoft..........(24_____12)
6. Avira...................(27_____17)
7. McAfee..............(41______7)
8. TrendMicro......(4_____257)

The randomness of most AVs' results is obvious; a quick tabulation of the yearly totals is sketched below. For example:
  • In 2019, F-Secure and Kaspersky were behind Avira and Microsoft; in 2020, the opposite was true.
  • In 2019, Trend Micro was close to the top; in 2020, it was last.
  • Only Norton was consistently close to the top, and McAfee consistently close to the bottom.
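Here is the quick tabulation of yearly totals mentioned above (my own sketch; the numbers are copied from the two lists, Real-World misses plus Malware Protection misses per year):

```python
# Totals per vendor and year, copied from the two lists above
# (Real-World misses, Malware Protection misses).
misses = {
    "Norton":     {"2019": (2, 2),   "2020": (6, 2)},
    "TrendMicro": {"2019": (7, 0),   "2020": (4, 257)},
    "Avira":      {"2019": (15, 9),  "2020": (27, 17)},
    "Microsoft":  {"2019": (12, 13), "2020": (24, 12)},
    "F-Secure":   {"2019": (12, 25), "2020": (4, 1)},
    "Kaspersky":  {"2019": (14, 24), "2020": (4, 5)},
    "McAfee":     {"2019": (35, 30), "2020": (41, 7)},
    "Avast":      {"2019": (26, 57), "2020": (11, 2)},
}

for year in ("2019", "2020"):
    ranking = sorted(misses, key=lambda v: sum(misses[v][year]))
    print(year, [(v, sum(misses[v][year])) for v in ranking])
# 2019: Norton(4), TrendMicro(7), Avira(24), Microsoft(25), F-Secure(37), ...
# 2020: F-Secure(5), Norton(8), Kaspersky(9), Avast(13), ... TrendMicro(261)
```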
 

Andy Ful (Thread author)
These tests are baffling me to the extreme. Trend Micro uses more aggressive reputation checks than Norton, and the results here don't match the results of my tests at all. I am unsure what the random factor is.
  1. From time to time, any AV can sporadically miss many samples. That is to be expected when one chooses several thousand samples (about 0.1% of the samples in the wild) from several million samples.
  2. In 2020, Trend Micro had the lowest false-positive rate of all AVs (and missed many samples). In 2019, its false-positive rate was the worst (and its protection was among the best).
  3. If Trend Micro has the best web protection, then its customers could still have been protected if the missed samples were web-originated in the wild. Other AVs (with weaker web protection) could have been bypassed by these samples as 0-days in the wild, yet by the time AV-Comparatives ran the Malware Protection test, the samples could already have been detected by signatures.
Test results cannot be easily translated into AV protection in the wild.
 
