AV-Comparatives Randomness in the AV Labs testing.

Andy Ful · Sep 18, 2020

Post updated in April 2025.

There are about 300.000 new malware threats every day (Windows OS).
Let's consider the example of the initial pool of 30.000 sufficiently different malware variants in the wild and the particular AV that failed to detect 100 of them.
Next, we make for the above results a trial to choose 380 samples from these 30.000 and calculate the probabilities for finding in these 380 samples 0, 1, 2, or 3 undetected malware.
m=30000
n=380
k=100

As it can be easily calculated, the probability of finding x=0, 1, 2, 3, ... undetected malware is as follows:
p( x ) = B( m - k , n - x ) * B( k , x ) / B( m , n )
where B( p , q ) is binomial coefficient.

After some simple calculations, we have:
p(x) = ( m-k )! * k! * ( m - n )! * n! / [ x! * ( k - x )! * ( n - x )! * ( m - k - n + x )! * m! ]

For sufficiently large numbers of samples in the wild ( m >> k , n ) and a small number of missed samples ( x << n ), the function p(x) depends on the infection rate ( r = k/m ) and the number of tested samples ( n ):

p( x ) ~ B( n , x ) * r ^ x * (1 - r ) ^ ( n - x )

So, increasing the number of in-the-wild samples does not change significantly the probabilities if the infection rate k/m does not change and m is big enough.
Here are the results of calculations for the number of tested samples n=380, infection rate k/m=1/300, and x = 0,1,2, 3:
p(0)=0.28
p(1)=0.36
p(2)=0.23
p(3)=0.1

These probabilities show that one particular AV can have a different number of undetected malware (0, 1, 2, 3, ...) when we preselect a smaller pool of samples from the much larger set.

We can compare these probabilities with the results of the AV-Comparatives Real-world test (July-August 2020):
4 AVs with 0 undetected malware
5 AVs with 1 undetected malware
3 AVs with 2 undetected malware
1.5 AVs with 3 undetected malware (I added 0.5 AV for Norton)

Real-World Protection Test July-August 2020 - Factsheet

Take a look at the Real-World Protection July-August 2020 Factsheet, a short overview of the results, more details will in the full release!

www.av-comparatives.org

We can calculate the ratios of the probabilities and numbers of AVs for the particular numbers of undetected malware:
p(0)/p(1) = 0.79 ~ 4 AVs/5 AVs
p(0)/p(2) = 1.2 ~ 4 AVs/3 AVs
p(1)/p(2) = 1.6 ~ 5 AVs/3 Avs
p(0)/p(3) = 2.9 ~ 4 AVs/1.5 AVs
p(1)/p(3) = 3.7 ~ 5 AVs/1.5 AVs
p(2)/p(3) = 2.4 ~ 3 AVs/1.5 AVs
etc.

As we can see, the AV-Comparatives test results for AVs with 0, 1, 2, or 3 undetected malware are very close to the results of the random trials for one particular AV.

It means that F-Secure, G-Data, Panda, TrendMicro, Avast, AVG, BitDefender, Avira, Eset, K7, Microsoft, and Norton could have the same number of undetected malware in the wild (July and August). But anyway, they would have different numbers of undetected samples in the July_August test by pure statistics.

Conclusion.
One test with 380 malware samples is not especially reliable for a period of two months.
Even if the in-the-wild malware detection is the same for any two AVs, they can easily score as 0 undetected malware or 2 undetected malware.

Edit 1.
Post shortened. Added the approximate formula for p(x) and used it to calculate probabilities (instead of the exact formula).

Edit 2.
We do not know how exactly the AV Labs choose the malware samples. But most probably, they chose the test samples from large feeds (over 300,000 suspicious and malicious threats per day) and then remove some morphed samples of the same malware. If so, the approximate formula for p(x) is very accurate.
The example of the malware feed:

Threat Feeds - MRG Effitas

MRG offers threat feeds containing suspicious and malicious binaries, URLs and categorised malware, many less than 24 hours old.

www.mrg-effitas.com

Edit 3.
Corrected the typo error in the formula for p(x).

Like a Western! · Sep 18, 2020

so what

Andy Ful · Sep 18, 2020

Like a Western! said:
so what

So, you lost your time.

oldschool · Sep 18, 2020

Andy Ful said:
As it can be easily calculated the probability to find x=0, 1, 2, 3, ... undetected malware is as follows:
p(x) = B(m-k , n-x)* B(k , x) / B(m , n)
where B(p , q) is binomial coefficient.

Sure, easy for you!

cruelsister · Sep 18, 2020

Finally p values in AV testing! I love it! Consider also that of the many samples the testing sites employ may be coded so similarly to make any actual differences inconsequential, thus reducing any meaningful result. In essence the null hypothesis in action.

But the charts and graphs are just so pretty!

Kongo · Sep 18, 2020

oldschool said:
Sure, easy for you!

Way too many variables, I feel you.

Andy Ful · Sep 18, 2020

Of course, this was only an example. No one knows how AV-Labs make the preselection from large malware set to the final much smaller set of tested samples. This can also have an impact on the scoring differences between AV Labs.

ZeroDay · Sep 18, 2020

Nice example @Andy Ful The p values are getting me a little excited too lol. It makes a lot of sense to work things out that way since we're dealing with probabilities and variables in a situation such as this.

I sense that this thread was inspired by another thread on which you've been having a little debate recently?

Andy Ful · Sep 18, 2020

ZeroDay said:
Nice example @Andy Ful The p values are getting me a little excited too lol. It makes a lot of sense to work things out that way since we're dealing with probabilities and variables in a situation such as this.

I sense that this thread was inspired by another thread on which you've been having a little debate recently?

Yes.

But, it has been also inspired by the nice anecdotal reasoning from the SE Labs report which I included in my post from the 2019 year (look at the red text):
https://malwaretips.com/threads/se-...otection-january-march-2019.92970/post-818342

The author of the report knows much about the randomness problem of the results. Here is the direct link to this report:

Home Anti-Malware Protection 2019 Q1 - SE LABS ®

Home Anti-Malware Protection 2019 Q1

selabs.uk

Digmor Crusher · Sep 18, 2020

Yup, you take the top 7 or 8 AV's market wise and the difference protection wise is so minimal that the only reasons for choosing an AV for anyone should be price and performance on your computer.

Andy Ful · Sep 19, 2020

The minimal differences in AVs protection were discussed many times on MT. It can be also deduced from AV Labs reports. Simply, all awarded AVs in one particular test in a particular award category should be considered to have the same in-the-wild protection (at the time of testing), despite slightly different scoring in the test. This different scoring is mainly due to statistic fluctuations.

These tests can have a meaning for AV vendors because the missed samples can help them to find detection errors and improve the detection engines.
The consumers can conclude something from these tests only when comparing the results of several tests made by several AV Labs. The simplest method is watching how often the AV was awarded, without watching how many % it has got in the test.
Some conclusions can be also made when the AV has got consistently top results (or consistently worse results) in several tests.

Much of the discussion on AV Lab test threads about the not good result of a particular AV or about its suddenly stellar result is useless. Most AVs have to get such results sooner or later due to pure statistics, without decreasing or increasing the real in-the-wild AV protection.

Andy Ful · Sep 19, 2020

As an example, I listed below the AVs that scored (6 6 6) in the AV-Test tests (August 2019 - June 2020)
AhnLab 1
Avira 1
Bitdefender 1
BullGuard 2
F-Secure 3
Kaspersky 5
McAfee 2
Microsoft 1
Norton 5
TrendMicro 2
Vipre 1

So by these simple statistics, such AVs like Kaspersky, Norton should be considered the top AVs in these tests followed by F-Secure. If we include the results from AV-Comparatives (July-October 2019, February-May 2020, July-August 2020), then we can also add F-Secure (and maybe TrendMicro) to the top AVs.
These AVs are probably considered as the top AVs by many people. There can be probably a few other top AVs, but it is not easy to find them on the basis of the chosen tests.

Post edited (extended the testing periods).

Jan Willy · Sep 19, 2020

Andy Ful said:
These AVs are probably considered as the top AVs by many people.

Because nobody can prove that those AV programs are not top.

Andy Ful · Sep 19, 2020

What about Bitdefender?
I checked the detection of fresh samples in the AV-Test ( August 2019 - June 2020 ) + AV-Comparatives ( July-October 2019, February-May 2020, July-August 2020 ).
Bitdefender blocked only one sample less than Kaspersky (2079 + 1837 = 3916 total samples, 9 missed by BIS). So, it is very probable that there is no real difference in protection between KIS and BIS.

silversurfer · Sep 19, 2020

Andy Ful said:
What about Bitdefender?
I checked the detection of fresh samples in the AV-Test ( August 2019 - June 2020 ) + AV-Comparatives ( July-October 2019, February-May 2020, July-August 2020 ).
Bitdefender blocked only one sample less than Kaspersky (2079 + 1837 = 3916 total samples, 9 missed by BIS). So, it is very probable that there is no real difference in protection between KIS and BIS.

It depends on what we believe are fresh samples for this kind of "Real-World Protection" tests, there aren't being used much samples are really fresh like 0-day, so it's better to check by yourself and going to collect samples which are only a few hours old to be known on VT, then you will see that Bitdefender has a delay of detection in general compared to ESET, Symantec, Kaspersky, etc, but might be blocks/detects some fresh malware by BB...

Andy Ful · Sep 19, 2020

silversurfer said:
It depends on what we believe are fresh samples for this kind of "Real-World Protection" tests, ...

I used the term "fresh samples" for all AV-Comparatives samples and all the samples labeled by AV-Test as "Protection against 0-day malware attacks ...". I do not think that these samples are as fresh as in Malware Hub tests, because KIS blocked all 2079 samples in AV-Test tests and missed 8 samples from 1837 samples in AV-Comparatives tests.
Anyway, both Labs include in the blocked/detected events also the blocked access to URLs, so the testing procedure is very different from Malware Hub.
The Malware Hub tests are similar to AV-Comparatives Malware tests.

I could try to estimate if the samples used by AV-Comparatives in Malware tests are as fresh as in MH tests. This can be done by comparing the ratio of samples undetected to the number of all samples for the particular AV, for example, Kaspersky (several tests have to be analyzed). But, if I correctly recall the tests for Kaspersky were done on the tweaked (not default) setup.

Andy Ful · Sep 19, 2020

I made a quick examination of Kaspersky detection from 29.08.2019 (214 samples) and found 1 missed sample that could cause infection. This is about 8 times greater than for the AV-Comparatives Malware test from September 2019 and March 2020 (12 missed malware per 20805 total samples).
(1/214)/(12/20805) ~ 8
I skipped some tests with TAM. It seems that the AV-Comparatives Malware tests do not use as fresh samples as Malware Hub.
But the AV-Comparatives Real-world tests have comparable fresh samples to MalwareHub:
(1/214 )/(8/1837) = 1.1

Edit.
Please treat these estimations as suggestions only. There are some problems with interpreting the results on MH, because of missed samples that were not counted for infection (3 cases in our case). Usually, they are related to malware that detected the VM environment. But, there can be also a kind of spying malware (no persistence) that injects code to already running Windows processes or malware that hides under Svchost, etc. Such malware usually starts and quickly terminates - it is hard to see it working without detailed analysis via Any.Run or another sandbox analysis.

Andy Ful · Sep 20, 2020

I was curious if the result will change if I extend the number of samples (214 samples are very little), so I included next 335 samples (total 214+335=549). The total pule was gathered in the period 15.07.2019-13.05.2020. Kaspersky was compromised 7 times and 14 samples were missed (but not counted as infection). The comparison with AV-Comparatives (1837 samples 7 infections):
(7/549)/ (8/1837) = 3

When comparing it to the previous result we can see how unreliable can be taking a few hundreds of malware samples. The first 214 samples gave 1 infection and the next 335 gave 6. Anyway, the extended analysis suggests that Malware Hub samples are fresher as compared to those used by AV-Comparatives.

Edit.
From previous tests on MH, made for Kaspersky (Endpoint, Cloud, and free) it looks like Kaspersky in Malware Hub tests can be compromised approximately 1 time per 100 fresh samples which is similar to the above results 7/549 ~ 1.3 .

Andy Ful · Sep 21, 2020

Andy Ful said:
...

Is reliable the assumption of 30000 sufficiently different malware variants in the wild for two months? Yes, it is. In the first half of 2019, SonicWall Real-Time Deep Memory Inspection (RTDMI) technology unveiled 74,360 ‘never-before-seen’ malware variants (about 25000 per 2 months).
...

I found the updated number of ‘never-before-seen’ malware variants in Q1 and Q2 for the year 2020:

https://www.sonicwall.com/resources/2020-cyber-threat-report-mid-year-update-pdf/

It is probable that in July-August 2020 the number of never-before-seen variants could be greater than 30000 like in my example. But, as we can see, the statistics still work when we proportionally increase the number of undetected samples. For example:

k = 250 (number of never-before-seen malware that compromised the hypothetical AV in July-August when tested on all 75000 samples seen in-the-wild);
m = 75000 (approximate number of never-before-seen malware variants in July-August 2020);
n = 380 (number of samples tested by the AV-Comparatives in July-August 2020);

p(0) = 0.28 (probability to be not compromised by malware)
p(1) = 0.36 (probability of 1 infection event)
p(2) = 0.23 (probability of 2 infection events)
p(3) = 0.1 (probability of 3 infection events)

By the way, I made a mistake by writing in OP the value of p(3) = 0.01.
The correct value should be 0.1 .

The differences with values calculated in the OP are on the third decimal place, so the values rounded to the second decimal place are the same and the statistics work well.

FireHammer · Sep 21, 2020

Hi, I dont know how you people think, but when I hear the words: undetected Malware!.

AV-Comparatives Randomness in the AV Labs testing.

From Hard_Configurator Tools

Level 9

From Hard_Configurator Tools

Level 85

Level 43

Level 37

From Hard_Configurator Tools

Level 30

From Hard_Configurator Tools

Level 26

From Hard_Configurator Tools

From Hard_Configurator Tools

Level 13

From Hard_Configurator Tools

Super Moderator

From Hard_Configurator Tools

From Hard_Configurator Tools

From Hard_Configurator Tools

From Hard_Configurator Tools

Level 10

Similar threads