AV-Comparatives: Randomness in AV Labs testing.

Disclaimer
  1. This test shows how an antivirus behaves with certain threats, in a specific environment and under certain conditions.
    We encourage you to compare these results with others and make informed decisions on which security products to use.
    Before buying an antivirus you should consider factors such as price, ease of use, compatibility, and support. Installing a free trial version allows an antivirus to be tested in everyday use before purchase.

Andy Ful

Level 72
Verified
Trusted
Content Creator
Dec 23, 2014
6,109
Let's consider the example of an initial pool of 30000 sufficiently different malware variants in the wild and a particular AV which failed to detect 100 of them.
Next, we run a trial on the above: choose 380 samples from these 30000 and calculate the probabilities of finding 0, 1, 2, or 3 undetected malware among these 380 samples.
m=30000
n=380
k=100

As can easily be calculated, the probability of finding x = 0, 1, 2, 3, ... undetected malware is given by the hypergeometric distribution:
p(x) = B(m-k , n-x)* B(k , x) / B(m , n)
where B(p , q) is the binomial coefficient.

After some simple calculations we have:
p(x) = (m-k)! * k! * (m-n)! * n! / [x! * (k-x)! * (n-x)! *(m-k-n+x)! * m!]

Here are the results of calculations for x= 0,1,2, and 3:
p(0)=0.28
p(1)=0.36
p(2)=0.23
p(3)=0.10
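These values can be reproduced from the formula above; here is a minimal Python sketch using the example's m, n, k:

```python
from math import comb

m, n, k = 30000, 380, 100  # population, sample size, undetected in population

def p(x):
    # Hypergeometric probability of exactly x undetected samples in the draw
    return comb(m - k, n - x) * comb(k, x) / comb(m, n)

for x in range(4):
    print(f"p({x}) = {p(x):.2f}")
```

The printed values agree with those above to two decimal places.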

These probabilities show that one particular AV can have a different number of undetected malware (0, 1, 2, 3, ...) when we preselect a smaller pool of samples from the much larger set.

We can compare these probabilities with the results of the AV-Comparatives Real-world test (July-August 2020):
4 AVs with 0 undetected malware
5 AVs with 1 undetected malware
3 AVs with 2 undetected malware
1.5 AVs with 3 undetected malware (I added 0.5 AV for Norton)

We can calculate the ratios of the probabilities and the numbers of AVs for the particular numbers of undetected malware:
p(0)/p(1) = 0.77 ~ 4 AVs/5 AVs
p(0)/p(2) = 1.22 ~ 4 AVs/3 AVs
p(1)/p(2) = 1.57 ~ 5 AVs/3 AVs
p(0)/p(3) = 2.8 ~ 4 AVs/1.5 AVs
p(1)/p(3) = 3.6 ~ 5 AVs/1.5 AVs
p(2)/p(3) = 2.3 ~ 3 AVs/1.5 AVs
etc.

As we can see, the AV-Comparatives test results for AVs which have 0, 1, 2, or 3 undetected malware are very close to the results of the random trials for one particular AV.

It means that F-Secure, G-Data, Panda, TrendMicro, Avast, AVG, BitDefender, Avira, Eset, K7, Microsoft, and Norton could in fact have the same real number of undetected malware (100 out of 30000), and would still get different numbers of undetected samples in the July-August test by pure statistics.

Is the assumption of 30000 sufficiently different malware variants in the wild over two months reliable? Yes, it is. In the first half of 2019, SonicWall Real-Time Deep Memory Inspection (RTDMI) technology unveiled 74,360 ‘never-before-seen’ malware variants (about 25000 per two months).

Is the assumption of 100 undetected malware out of 30000 reliable? Yes, it is.
This gives on average about 1 undetected malware per 380 samples.
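That average is just the expected value of the hypergeometric draw, n*k/m:

```python
m, n, k = 30000, 380, 100
expected_missed = n * k / m  # expected undetected samples per 380-sample test
print(round(expected_missed, 2))  # about 1.27
```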

Conclusion.
One test with 380 malware samples is not especially reliable for a period of two months.
Even if the real malware detection is the same for two AVs, one can easily score 0 undetected malware and the other 2.
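A quick Monte Carlo sketch illustrates this: two hypothetical AVs with identical real detection (both missing 100 of 30000 variants), each scored on its own random 380-sample draw, quite often differ by 2 or more misses. The trial count and seed are arbitrary choices:

```python
import random

m, n, k = 30000, 380, 100
population = [1] * k + [0] * (m - k)  # 1 marks a sample the AV misses

random.seed(0)
trials = 2000
differ_by_2 = 0
for _ in range(trials):
    a = sum(random.sample(population, n))  # misses of AV 1 in its draw
    b = sum(random.sample(population, n))  # misses of AV 2 in its draw
    if abs(a - b) >= 2:
        differ_by_2 += 1

print(f"Fraction of trials where scores differ by >= 2: {differ_by_2 / trials:.2f}")
```

In this sketch, roughly a third of the trials show a gap of two or more, even though both AVs are identical by construction.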

Edit.
About the impact of the greater number of ‘never-before-seen’ malware variants on calculations:
 
Last edited:

cruelsister

Level 38
Verified
Trusted
Content Creator
Apr 13, 2013
2,751
Finally p values in AV testing! I love it! Consider also that many of the samples the testing sites employ may be coded so similarly as to make any actual differences inconsequential, thus reducing any meaningful result. In essence, the null hypothesis in action.

But the charts and graphs are just so pretty!
 

ZeroDay

Level 29
Verified
Aug 17, 2013
1,856
Nice example @Andy Ful The p values are getting me a little excited too lol. It makes a lot of sense to work things out that way since we're dealing with probabilities and variables in a situation such as this.

I sense that this thread was inspired by another thread on which you've been having a little debate recently?
 

Andy Ful

Level 72
Verified
Trusted
Content Creator
Dec 23, 2014
6,109
Nice example @Andy Ful The p values are getting me a little excited too lol. It makes a lot of sense to work things out that way since we're dealing with probabilities and variables in a situation such as this.

I sense that this thread was inspired by another thread on which you've been having a little debate recently?
Yes.:)
But it has also been inspired by the nice anecdotal reasoning from the SE Labs report, which I included in my post from 2019 (look at the red text):
https://malwaretips.com/threads/se-...otection-january-march-2019.92970/post-818342

The author of the report knows much about the randomness problem of the results. Here is the direct link to this report:
 
Last edited:

Andy Ful

Level 72
Verified
Trusted
Content Creator
Dec 23, 2014
6,109
The minimal differences in AV protection were discussed many times on MT. They can also be deduced from AV Labs reports. Simply put, all AVs awarded in one particular test in a particular award category should be considered to have the same in-the-wild protection (at the time of testing), despite slightly different scoring in the test. The different scoring is mainly due to statistical fluctuations.

These tests can be meaningful for AV vendors because the missed samples can help them find detection errors and improve their detection engines.
Consumers can conclude something from these tests only by comparing the results of several tests made by several AV Labs. The simplest method is watching how often the AV was awarded, without watching what % it got in the test.
Some conclusions can be also made when the AV has got consistently top results (or consistently worse results) in several tests.

Much of the discussion in AV Lab test threads about the poor result of a particular AV, or about its suddenly stellar result, is useless. Most AVs will get such results sooner or later due to pure statistics, without any decrease or increase in real in-the-wild protection.
 
Last edited:

Andy Ful

Level 72
Verified
Trusted
Content Creator
Dec 23, 2014
6,109
As an example, I listed below the AVs that scored (6 6 6) in the AV-Test tests (August 2019 - June 2020)
AhnLab 1
Avira 1
Bitdefender 1
BullGuard 2
F-Secure 3
Kaspersky 5
McAfee 2
Microsoft 1
Norton 5
TrendMicro 2
Vipre 1

So by these simple statistics, AVs like Kaspersky and Norton should be considered the top AVs in these tests, followed by F-Secure. If we include the results from AV-Comparatives (July-October 2019, February-May 2020, July-August 2020), then we can also add F-Secure (and maybe TrendMicro) to the top AVs.
These AVs are probably considered the top AVs by many people. There may be a few other top AVs, but it is not easy to identify them on the basis of the chosen tests.

Post edited (extended the testing periods).
 
Last edited:

Andy Ful

Level 72
Verified
Trusted
Content Creator
Dec 23, 2014
6,109
What about Bitdefender?
I checked the detection of fresh samples in the AV-Test ( August 2019 - June 2020 ) + AV-Comparatives ( July-October 2019, February-May 2020, July-August 2020 ).
Bitdefender blocked only one sample fewer than Kaspersky (2079 + 1837 = 3916 total samples, 9 missed by BIS). So it is very probable that there is no real difference in protection between KIS and BIS.
 

silversurfer

Level 74
Verified
Trusted
Content Creator
Malware Hunter
Aug 17, 2014
6,314
What about Bitdefender?
I checked the detection of fresh samples in the AV-Test ( August 2019 - June 2020 ) + AV-Comparatives ( July-October 2019, February-May 2020, July-August 2020 ).
Bitdefender blocked only one sample less than Kaspersky (2079 + 1837 = 3916 total samples, 9 missed by BIS). So, it is very probable that there is no real difference in protection between KIS and BIS.
It depends on what we consider fresh samples for this kind of "Real-World Protection" test. Not many of the samples used are really fresh (0-day), so it's better to check for yourself and collect samples that are only a few hours old on VT. Then you will see that Bitdefender has a delay of detection in general compared to ESET, Symantec, Kaspersky, etc., but it might block/detect some fresh malware by BB...
 

Andy Ful

Level 72
Verified
Trusted
Content Creator
Dec 23, 2014
6,109
It depends on what we believe are fresh samples for this kind of "Real-World Protection" tests, ...
I used the term "fresh samples" for all AV-Comparatives samples and all the samples labeled by AV-Test as "Protection against 0-day malware attacks ...". I do not think that these samples are as fresh as in Malware Hub tests, because KIS blocked all 2079 samples in the AV-Test tests and missed only 8 of the 1837 samples in the AV-Comparatives tests.
Anyway, both Labs also count blocked access to URLs among the blocked/detected events, so the testing procedure is very different from Malware Hub.
The Malware Hub tests are similar to the AV-Comparatives Malware tests.

I could try to estimate whether the samples used by AV-Comparatives in the Malware tests are as fresh as in MH tests. This can be done by comparing the ratio of undetected samples to all samples for a particular AV, for example Kaspersky (several tests have to be analyzed). But, if I recall correctly, the tests for Kaspersky were done on a tweaked (not default) setup.
 
Last edited:

Andy Ful

Level 72
Verified
Trusted
Content Creator
Dec 23, 2014
6,109
I made a quick examination of Kaspersky detection from 29.08.2019 (214 samples) and found 1 missed sample that could cause infection. This rate is about 8 times greater than for the AV-Comparatives Malware tests from September 2019 and March 2020 (12 missed malware per 20805 total samples):
(1/214)/(12/20805) ~ 8
I skipped some tests with TAM. It seems that the AV-Comparatives Malware tests do not use samples as fresh as Malware Hub's.
But the AV-Comparatives Real-world tests use samples of comparable freshness to Malware Hub:
(1/214 )/(8/1837) = 1.1
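The two ratios can be checked directly from the counts quoted above:

```python
# Malware Hub: 1 infection in 214 samples
# AV-Comparatives Malware tests: 12 missed in 20805; Real-world: 8 missed in 1837
r_malware = (1 / 214) / (12 / 20805)
r_realworld = (1 / 214) / (8 / 1837)
print(round(r_malware, 1), round(r_realworld, 1))  # about 8.1 and 1.1
```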

Edit.
Please treat these estimations as suggestions only. There are some problems with interpreting the MH results, because some missed samples were not counted as infections (3 cases here). Usually they are related to malware that detected the VM environment. But there can also be spying malware (no persistence) that injects code into already running Windows processes, or malware that hides under Svchost, etc. Such malware usually starts and quickly terminates - it is hard to see it working without detailed analysis via Any.Run or another sandbox.
 
Last edited:

Andy Ful

Level 72
Verified
Trusted
Content Creator
Dec 23, 2014
6,109
I was curious whether the result would change if I extended the number of samples (214 samples is very few), so I included the next 335 samples (total 214+335=549). The total pool was gathered in the period 15.07.2019-13.05.2020. Kaspersky was compromised 7 times and 14 samples were missed (but not counted as infections). The comparison with AV-Comparatives (1837 samples, 8 missed):
(7/549)/ (8/1837) = 3

Comparing this to the previous result shows how unreliable a few hundred malware samples can be.
The first 214 samples gave 1 infection and the next 335 gave 6. Anyway, the extended analysis suggests that Malware Hub samples are fresher than those used by AV-Comparatives.

Edit.
From previous tests on MH, made for Kaspersky (Endpoint, Cloud, and free) it looks like Kaspersky in Malware Hub tests can be compromised approximately 1 time per 100 fresh samples which is similar to the above results 7/549 ~ 1.3 .
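Using the numbers from this post, the extended ratio (close to the 3 quoted above) and the per-100-samples rate work out as:

```python
mh_rate = 7 / 549    # Malware Hub: 7 infections in 549 samples
avc_rate = 8 / 1837  # AV-Comparatives Real-world: 8 missed in 1837
print(round(mh_rate / avc_rate, 1))  # ratio of the two miss rates, about 2.9
print(round(100 * mh_rate, 1))       # infections per 100 MH samples, about 1.3
```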
 
Last edited:

Andy Ful

Level 72
Verified
Trusted
Content Creator
Dec 23, 2014
6,109
...

Is the assumption of 30000 sufficiently different malware variants in the wild over two months reliable? Yes, it is. In the first half of 2019, SonicWall Real-Time Deep Memory Inspection (RTDMI) technology unveiled 74,360 ‘never-before-seen’ malware variants (about 25000 per two months).
...
I found the updated number of ‘never-before-seen’ malware variants in Q1 and Q2 for the year 2020:

[Image: Never seen 2020.png - SonicWall chart of ‘never-before-seen’ malware variants, Q1-Q2 2020]


It is probable that in July-August 2020 the number of never-before-seen variants was greater than the 30000 in my example. But, as we can see, the statistics still work when we proportionally increase the number of undetected samples. For example:
  • k = 250 (number of never-before-seen malware that compromised the hypothetical AV in July-August when tested on all 75000 samples seen in-the-wild);
  • m = 75000 (approximate number of never-before-seen malware variants in July-August 2020);
  • n = 380 (number of samples tested by the AV-Comparatives in July-August 2020);
p(0) = 0.28 (probability of not being compromised by any malware)
p(1) = 0.36 (probability of 1 infection event)
p(2) = 0.23 (probability of 2 infection events)
p(3) = 0.1 (probability of 3 infection events)

By the way, I made a mistake in the OP by writing the value p(3) = 0.01.
The correct value is 0.1.

The differences from the values calculated in the OP are in the third decimal place, so the values rounded to two decimal places are the same and the statistics still work well.
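This insensitivity follows because scaling m and k by the same factor keeps the expected miss count n*k/m unchanged. A sketch with the hypergeometric formula from the OP, comparing both parameter sets:

```python
from math import comb

def hyper_p(x, m, n, k):
    # Probability of exactly x undetected samples in a draw of n from m
    return comb(m - k, n - x) * comb(k, x) / comb(m, n)

for m, k in [(30000, 100), (75000, 250)]:  # same k/m ratio in both cases
    probs = [round(hyper_p(x, m, 380, k), 2) for x in range(4)]
    print(m, k, probs)
```

Both parameter sets yield raw probabilities that differ only in the third decimal place, as the post states.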
 