AV-Comparatives: randomness in AV lab testing.

Disclaimer
  1. This test shows how an antivirus behaves with certain threats, in a specific environment and under certain conditions.
    We encourage you to compare these results with others and make informed decisions on which security products to use.
    Before buying an antivirus, you should consider factors such as price, ease of use, compatibility, and support. Installing a free trial version allows an antivirus to be tested in everyday use before purchase.

Andy Ful (Thread author)
Let's assume that we have the following results for a test:
2 AVs with 0 undetected malware
4 AVs with 1 undetected malware
4 AVs with 2 undetected malware
3 AVs with 3 undetected malware
1 AV with 4 undetected malware

This distribution can be very close to the probabilities for a hypothetical AV when:
  • k = 60 (the number of samples that compromised the hypothetical AV in the large pool of 300,000 samples);
  • m = 300000 (the number of samples in the large pool);
  • n = 10250 (the number of samples included in the AV lab test).
We can calculate the probabilities of missing exactly 0, 1, 2, 3, or 4 samples in the test:
p(0) = 0.124 (15.5 * 0.124 ~ 2)
p(1) = 0.264 (15.5 * 0.264 ~ 4)
p(2) = 0.275 (15.5 * 0.275 ~ 4)
p(3) = 0.188 (15.5 * 0.188 ~ 3)
p(4) = 0.095 (15.5 * 0.095 ~ 1)
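For readers who want to reproduce these numbers, here is a minimal Python sketch of the draw-without-replacement model described in this thread (my own illustration, not AV-Comparatives' code): n test samples are drawn without replacement from a pool of m samples, k of which the hypothetical AV misses, so the number of missed test samples follows a hypergeometric distribution.

```python
# Sketch of the random-draw model: n test samples are drawn without
# replacement from a pool of m samples, k of which the hypothetical AV
# misses, so the number of missed test samples is hypergeometric.
from math import lgamma, exp

def log_comb(a: int, b: int) -> float:
    """Natural log of the binomial coefficient C(a, b), via lgamma."""
    return lgamma(a + 1) - lgamma(b + 1) - lgamma(a - b + 1)

def hypergeom_pmf(x: int, m: int, k: int, n: int) -> float:
    """P(exactly x missed samples among n drawn from m, with k missed overall)."""
    return exp(log_comb(k, x) + log_comb(m - k, n - x) - log_comb(m, n))

m, k, n = 300_000, 60, 10_250
for x in range(5):
    p = hypergeom_pmf(x, m, k, n)
    print(f"p({x}) = {p:.3f}   15.5 * p({x}) ~ {15.5 * p:.1f} AVs")
# Prints p(0)..p(4) ~ 0.124, 0.264, 0.275, 0.188, 0.095, matching the list above.
```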

We can see that these probabilities are approximately proportional to the number of AVs for each count of undetected malware. The proportionality constant is about 15.5. We can compare these statistics to the AV-Comparatives Malware Protection test for March 2020:
3 AVs with 0 undetected malware (F-Secure, G Data, NortonLifeLock)
4 AVs with 1 undetected malware (ESET, K7, TotalAV, Total Defense)
3 AVs with 2 undetected malware (Avast, AVG, Bitdefender)
3 AVs with 3 undetected malware (Avira, Kaspersky, VIPRE)
1 AV with 4 undetected malware (Panda)

The difference is minimal. For example, if Norton and Total Defense had each missed one more sample, the results for the 14 AVs would be very close to random trials of one hypothetical AV.

It seems that a similar conclusion was reached by AV-Comparatives, because it awarded all 14 AVs.
Eight of the products (*) received lower awards due to false alarms.

Edit:
It seems that the same conclusion can be derived from the cluster analysis in the report:

[Image: cluster-analysis table from the AV-Comparatives report]

The AVs mentioned in my statistical model belong to cluster one (see the last column) and were awarded. The other AVs belong to other clusters.

Here is what AV-Comparatives says about the importance of clusters:
"Our tests use much more test cases (samples) per product and month than any similar test performed by other testing labs. Because of the higher statistical significance this achieves, we consider all the products in each results cluster to be equally effective, assuming that they have a false-positives rate below the industry average."
Real-World Protection Test Methodology - AV-Comparatives (av-comparatives.org)
 

Andy Ful (Thread author)
So, what can we say about the AV-Comparatives Malware Protection tests?
There were about 15,000 different malware variants in March 2020 (according to SonicWall statistics), and the statistical model from the previous post assumed m = 300,000 samples in the large pool. It is probable that the tested samples included many polymorphic siblings, so one SonicWall malware variant was duplicated on average 20 times (20 * 15,000 = 300,000).
The large pool could also be several times greater if we proportionally increased the number k of missed samples (confirmed: the differences are minimal and unimportant even for m = 8,000,000, k = 1,600).
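The scaling check mentioned above can be reproduced with the same hypergeometric sketch (again my own illustration): keeping the ratio k/m fixed while enlarging the pool barely changes the probabilities, because the distribution approaches a Poisson distribution with mean n*k/m ~ 2.05.

```python
# Sketch: scaling m and k proportionally (same ratio k/m) barely changes the
# distribution of missed test samples, because it converges to a Poisson
# distribution with mean n * k / m (~2.05 here).
from math import lgamma, exp

def log_comb(a, b):
    return lgamma(a + 1) - lgamma(b + 1) - lgamma(a - b + 1)

def hypergeom_pmf(x, m, k, n):
    return exp(log_comb(k, x) + log_comb(m - k, n - x) - log_comb(m, n))

n = 10_250
for m, k in [(300_000, 60), (8_000_000, 1_600)]:
    probs = " ".join(f"{hypergeom_pmf(x, m, k, n):.3f}" for x in range(5))
    print(f"m = {m:>9}, k = {k:>5}:  {probs}")
# The two rows agree to within about 0.005 at every point, so the
# differences are indeed minimal.
```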

Now we can also understand what probably happened to Trend Micro, which was compromised 82 times in the March 2020 test and not at all in September 2019.

Trend Micro could simply have missed a few SonicWall malware variants that had many polymorphic variations among the tested samples. The other AVs apparently detected the malware variants that had many polymorphic variations.

Furthermore, k = 60 (missed samples in the large pool) is very small compared to m = 300,000 (the number of samples in the large pool). This suggests that the samples were mostly not 0-day, but rather a few days old on average.
 

Andy Ful (Thread author)
Now, an example of test scores that can be interpreted without calculations.
0 missed - 1 AV
1 missed - 1 AV
2 missed - 2 AVs
3 missed - 2 AVs
4 missed - 0
...
11 missed - 0
12 missed - 2 AVs
13 missed - 0
...
17 missed - 0
18 missed - 1 AV
19 missed - 0
...
42 missed - 0
43 missed - 1 AV

For the first 6 AVs, it is possible to find a statistical model similar to random trials of one hypothetical AV.
But no such model can explain the big gaps to the rest of the AVs (see the sketch below). So, the first 6 AVs should be treated as equally good and awarded. The last 4 AVs should not be awarded.
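As a rough sanity check (my own sketch, using a Poisson approximation with the mean miss rate fitted to the six best products above; the Poisson stand-in is an assumption for illustration), the probability that an equally good product would miss 12 or more samples is negligible:

```python
# Rough check: fit a mean miss rate to the six best products and ask how likely
# 12+ misses would be for an equally good product. A Poisson distribution is
# used as a simple stand-in for the hypergeometric model; the numbers 0..3 are
# the misses of the first six AVs listed above.
from math import exp, factorial

best_six = [0, 1, 2, 2, 3, 3]
lam = sum(best_six) / len(best_six)   # fitted mean miss rate, ~1.83

def poisson_tail(k: int, lam: float) -> float:
    """P(X >= k) for a Poisson(lam) random variable."""
    return 1.0 - sum(exp(-lam) * lam ** i / factorial(i) for i in range(k))

print(f"lambda     = {lam:.2f}")
print(f"P(X >= 12) = {poisson_tail(12, lam):.1e}")  # a few times 1e-7
print(f"P(X >= 18) = {poisson_tail(18, lam):.1e}")  # essentially zero
# So products missing 12, 18 or 43 samples cannot plausibly belong to the same
# "equally good" group as the first six.
```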

The above example was in fact taken from a recent real test by MRG Effitas:

[Image: MRG Effitas Q2 2020 360° Assessment results]

"Out of ten tested security products, the following six managed to meet the specification to attain our Q2 2020 360 Degree Certification.
• Avast Business Antivirus
• Bitdefender Endpoint Security
• CrowdStrike Falcon Protect
• ESET Endpoint Security
• Microsoft Windows Defender
• Symantec Endpoint Protection
"
 

Andy Ful (Thread author)
What is really done in AV lab tests?
The labs first reduce the huge number of in-the-wild malicious samples, and then perform the tests on samples that are in some way representative of all samples in the wild.
In numbers, per month it looks like this:

A few million malware samples ----> a few hundred test samples

How the AV labs manage to do this is a kind of magic.

Anyway, as can be seen from the SonicWall reports, the number of different malware variants is a few tens of thousands per month. These are probably good representatives of all samples in the wild.
So, one must introduce a statistical model, because there are many ways of choosing the set of tested samples from a set of different malware variants that is hundreds of times larger.

The statistical model in this thread is one of many possible models. We simply draw the n samples at random from the larger set of m samples. The number of possible draws is enormous and is given by the binomial coefficient B(m, n). For example:
B(30000, 100) ~ 4.7 * 10^289, which is much greater than the number of atoms in the Universe.
The details of the model are in the OP. When we apply this model to the real data, it is assumed that the tested AVs have very different sets of missed malware in the wild.
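A quick way to reproduce the B(30000, 100) figure in Python (my own sketch):

```python
# The number of ways to draw n test samples from a pool of m is the binomial
# coefficient C(m, n); for m = 30000, n = 100 it is already ~4.7e289.
from math import comb, lgamma, log

m, n = 30_000, 100
exact = comb(m, n)                            # exact arbitrary-precision integer
print(f"C({m}, {n}) ~ {float(exact):.2e}")    # ~4.7e+289, still fits in a float

# For larger arguments (e.g. C(300000, 10250)) the value overflows a float,
# so estimate its order of magnitude via log-gamma instead:
def log10_comb(a: int, b: int) -> float:
    return (lgamma(a + 1) - lgamma(b + 1) - lgamma(a - b + 1)) / log(10)

print(f"log10 C(300000, 10250) ~ {log10_comb(300_000, 10_250):.0f}")  # ~19400
```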
 

Andy Ful (Thread author)

@McMcbrad,


Although I agree with most things you mentioned in your post, a few things have to be clarified:
  1. In this thread, I tried to show that most scores in AV tests can be easily explained by simple random statistical models.
  2. I also confirmed that there are many such models. In many cases, another useful model can be constructed by proportionally changing the values of m and k. So these models are independent of the SonicWall statistics (although those were a good starting point).
  3. I do not insist that any of these models reflects reality, but I claim that such models cannot be rejected on the basis of the publicly available information about testing procedures.
  4. From the fact that random models can explain the results of most AVs in one particular test, it follows that such a test alone cannot give us (the readers) sufficient information to compare the AVs' protection in the wild.
  5. Some useful information can be derived only by comparing several similar tests, for example by looking for consistently high (or consistently low) scores.
  6. The conclusions drawn from the models presented in this thread are consistent with the awards proposed by the AV testing labs. They are also consistent with the cluster analysis (first cluster) made by AV-Comparatives in the Malware Protection tests.
There are some things on which we probably do not fully agree, but they are not relevant to this thread.(y)

Edit: The post was slightly edited to be clearer.
 

plat
Assuming each malware sample has equal weight, statistically, the larger the sample size AND the larger the subject pool, the less statistically significant each missed sample becomes. Here's where it gets "fun" to add the color red to the graphs! Drama, drama, look at the difference--when it could be statistically insignificant.

Your calculations are the utopia of AV lab testing, Andy Ful. They should all be so clean. Further, as someone else stated above me, one quarter's results should not be the gospel ad infinitum for a given brand, as there is so much inherent variability--both for samples and subjects. These lab tests aren't clean; that's virtually impossible.

Now you have the addition of "deviations from defaults" as part of the testing regimen of AV Comparatives for Business, for one. That's where the serious money is, both in security solutions and targets for threats like ransomware. Not all "deviations" are created equal, right?
 

Andy Ful (Thread author)
AV-Comparatives Malware Protection tests 2019-2020 (four tests), part two.

This is a continuation of the post in another thread, where the impact of polymorphic samples was skipped:

In this post, I am going to examine the cumulative results of the last two years (March 2019, September 2019, March 2020, September 2020), on the assumption that a strangely high number of missed samples was caused not by several different malware but by one polymorphic malware. Most AVs had such strange results. For example, Kaspersky had 13 missed samples in March 2019 and 9 missed samples in September 2019. What if there were in fact only two polymorphic malware, one in 13 variants and the second in 9 variants? Let's look at the results, where 9+ missed samples were replaced by one polymorphic sample:

----------------Missed samples----Clusters
Avast, AVG.........1+0+2+0...........1,1,1,1
F-Secure ............1+1+0+1...........1,1,1,1
McAfee .............1+1+1+0............1,1,1,1

Norton...............(2)+(2)+0+2.......1,1,1,1

ESET...................1+1+1+2 ..........1,1,1,1
Kaspersky...........1+1+3+1...........1,1,1,1
Panda ................1+1+4+1...........1,1,1,1
Microsoft............2+4+1+0...........1,1,1,1
Bitdefender.........1+5+2+1...........1,1,1,1

K7.......................5+5+1+2...........1,1,1,1
Avira* ................0+4+3+4............1,1,1,2
VIPRE ................4+1+3+4 ...........1,1,1,2
Total Defense.....5+1+1+4............1,1,1,2


As we can see, the differences between the AVs almost vanished. So, in the Malware Protection tests, even four different tests from two years are probably not sufficient to see important differences between popular AVs. The final score can depend strongly on how many polymorphic samples and polymorphic variations were present in the tests. Without knowing this, an AV comparison based on such tests is not reliable at all.
Polymorphic samples could also explain Trend Micro's baffling results across the four tests (0 missed samples in the two tests from 2019, and 82 + 175 = 257 missed samples in 2020).😵

The situation is clearer and easier to explain in the case of the Real-World tests, because from the results we know that polymorphic samples are absent. :)

 

Andy Ful (Thread author)
The important differences can be seen neither in this test nor in any other, nor can you predict them in any way. ...
Ha, ha. I am not brave enough to claim the above firmly, because no one would believe it without solid reasoning, and many people will not believe it even with very solid reasoning. But I have tried hard to show why we cannot easily interpret the results of AV tests, and why something that seems easy to understand (colorful charts of AV scores) is not easy at all, due to MAGIC.

Millions of malware in the wild --------> MAGIC -------> thousands of tested samples

So, a magician would probably be more appropriate to show the difference. :)
 

Lenny_Fox
Let's consider the example of the initial pool of 30,000 sufficiently different malware variants in the wild and a particular AV which failed to detect 100 of them.
Next, we make a trial to choose 380 samples from these 30,000 and calculate the probabilities of finding 0, 1, 2, or 3 undetected malware among these 380 samples.
Nice read with impressive statistics, but your assumption is not correct: no test lab in the world manages to catch 30,000 new malware samples per month.

It is very hard to find malware samples in the wild that can bypass an updated Windows 10 PC. Next, you have to verify that the malware is genuine without alerting all the AVs. After the genuine-malware check, you have to check whether the infection is repeatable and still alive. The last step is to decide which variants of the same malware to include, based on prevalence.

AV-Comparatives uses between 150 and 250 samples monthly,
AV-Test between 100 and 200 monthly.
All other, smaller test labs manage at best 25 to 75 monthly.


Your calculation is applicable to zoo malware, but the sample sets of zoo malware are around 5,000 to 15,000 per month (depending on the size of the test lab).
 

Andy Ful (Thread author)
Nice read with impressive statistics, but your assumption is not correct: no test lab in the world manages to catch 30,000 new malware samples per month.
Please read the thread carefully. This number (30,000) is not related to any AV testing lab. Search for SonicWall Real-Time Deep Memory Inspection (RTDMI). :)(y)
Furthermore, the statistics change only a little when you proportionally increase this number and the number of missed samples in the wild. Of course, the true numbers are not publicly known; only the AV vendors could give us insight there.
 

Lenny_Fox
@Andy Ful

Sonic Wall said:
SonicWall reveals that a new Capture Cloud engine has discovered hundreds of new malware variants not seen before by sandboxing technology (now I am really terrified).

Through the use of previously unannounced patent-pending technology, SonicWall Capture Labs security researchers engineered an advanced method for identifying and mitigating threats through deep memory inspection — all in real time. (of course :) in what scenario would you catch malware with batch based stand alone off-line computing)

SonicWall RTDMI is a patent-pending technology and process utilized by the SonicWall Capture Cloud to identify and mitigate even the most insidious modern threats, including Intel Meltdown exploits. bla bla bla (don't forget to add patent-pending a few times) bla bla bla patent-pending bla bla bla patent-pending bla bla bla

Which insight did I miss reading this? Please help a clueless member to see the point you are making.
 

Lenny_Fox
Anyway, the extended analysis suggests that Malware Hub samples are fresher than those used by AV-Comparatives.
You can't compare them, because you don't have the AV-Comparatives sample set. It is pure and utter speculation with fake facts (as I explained in my first post). A complex formula with the wrong input does not produce a valid outcome. Andy, PLEASE.

I may be overreacting a bit in these times of conspiracy theories. What is next? Are the people from the test labs also known child abusers working for the 1%?
 

Andy Ful (Thread author)
You can't compare them, because you don't have the AV-Comparatives sample set. It is pure and utter speculation with fake facts (as I explained in my first post). A complex formula with the wrong input does not produce a valid outcome. Andy, PLEASE.

I may be overreacting a bit in these times of conspiracy theories. What is next? Are the people from the test labs also known child abusers working for the 1%?
You are irritated without good reason. :)(y)
Everything I presented in this thread is already well known to any AV testing lab. It is also written in plain text:
"Our tests use much more test cases (samples) per product and month than any similar test performed by other testing labs. Because of the higher statistical significance this achieves, we consider all the products in each results cluster to be equally effective, assuming that they have a false-positives rate below the industry average."
Real-World Protection Test Methodology - AV-Comparatives (av-comparatives.org)
As can be seen from the AV-Comparatives reports, the best AVs are usually in the first cluster (10 AVs or more), so they can be equally effective against malware in the wild (despite the differences in a particular test).

I think that you want the model presented in this thread to be more than it is. Like any model, it is based on some assumptions. These assumptions cannot be proved or rejected, because the real data are not known. They could be verified only if all the data gathered by the AV vendors and AV testing labs were publicly known.

Anyway, even such a simple model is fully compatible with the AV-Comparatives Real-World awards. In fact, the model is very similar to the cluster method used in the AV-Comparatives reports. The model and the cluster method simply say that, when looking at the results of a single test, there is a big random factor. So, several AVs can in fact have different protection in the wild in the tested period than their scores in that one particular test suggest.

The model does not say that nothing interesting can be said about AVs, especially when comparing the results of many reports. (y)
 

Andy Ful (Thread author)
In the previous posts, I tried to show that even after averaging the results of several tests, most AVs seem to have an important random factor. For example:
How the hell WD works on Windows Home & Pro? | MalwareTips Community

In this post, I will present a slightly different approach and compare the results of tests between two years. The tests include the results from the years 2019 and 2020 (AV-Comparatives Real-World, AV-Comparatives Malware Protection, AV-Test, SE Labs).
The data can be derived from the attachment chances_to_be_infected.txt in the post:
How big are your chances to be infected? | MalwareTips Community

The first column contains the numbers of missed samples in Real-World tests. The second column contains the numbers of missed samples in Malware Protection tests. The AVs are sorted by the total number of missed samples.

2019
1. Norton..............(2_______2)
2. TrendMicro.....(7_______0)
3. Avira..................(15______9)
4. Microsoft........(12_____13)
5. F-Secure..........(12_____25)
6. Kaspersky.......(14_____24)
7. McAfee............(35_____30)
8. Avast................(26_____57)


2020 (up to October)
1. F-Secure............(4_______1)
2. Norton...............(6_______2)
3. Kaspersky.........(4_______5)
4. Avast..................(11______2)
5. Microsoft..........(24_____12)
6. Avira...................(27_____17)
7. McAfee..............(41______7)
8. TrendMicro......(4_____257)

The randomness of most AVs' results is obvious; a quick tabulation of the yearly totals is sketched below. For example:
  • In 2019, F-Secure and Kaspersky were behind Avira and Microsoft; in 2020, the opposite was true.
  • In 2019, Trend Micro was close to the top; in 2020, it was last.
  • Only Norton was consistently close to the top, and McAfee consistently close to the bottom.
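Here is the quick tabulation of yearly totals mentioned above (my own sketch; the numbers are copied from the two lists, Real-World misses plus Malware Protection misses per year):

```python
# Totals per vendor and year, copied from the two lists above
# (Real-World misses, Malware Protection misses).
misses = {
    "Norton":     {"2019": (2, 2),   "2020": (6, 2)},
    "TrendMicro": {"2019": (7, 0),   "2020": (4, 257)},
    "Avira":      {"2019": (15, 9),  "2020": (27, 17)},
    "Microsoft":  {"2019": (12, 13), "2020": (24, 12)},
    "F-Secure":   {"2019": (12, 25), "2020": (4, 1)},
    "Kaspersky":  {"2019": (14, 24), "2020": (4, 5)},
    "McAfee":     {"2019": (35, 30), "2020": (41, 7)},
    "Avast":      {"2019": (26, 57), "2020": (11, 2)},
}

for year in ("2019", "2020"):
    ranking = sorted(misses, key=lambda v: sum(misses[v][year]))
    print(year, [(v, sum(misses[v][year])) for v in ranking])
# 2019: Norton(4), TrendMicro(7), Avira(24), Microsoft(25), F-Secure(37), ...
# 2020: F-Secure(5), Norton(8), Kaspersky(9), Avast(13), ... TrendMicro(261)
```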
 

Andy Ful (Thread author)
These tests are baffling me to the extreme. Trend Micro uses more aggressive reputation checks than Norton, and the results here don't match the results of my tests at all. I am unsure what the random factor is.
  1. From time to time, any AV can sporadically miss many samples. That is to be expected when one chooses several thousand samples (about 0.1% of the samples in the wild) from several million samples.
  2. In 2020, Trend Micro had the lowest false-positive rate of all AVs (and missed many samples). In 2019, its false-positive rate was the worst (and its protection was among the best).
  3. If Trend Micro has the best web protection, then its customers could still have been protected if the missed samples were web-originated in the wild. Other AVs (with weaker web protection) could have been bypassed by these samples as 0-days in the wild, yet by the time AV-Comparatives ran the Malware Protection test, the samples could already have been detected by signatures.
Test results cannot be easily translated into AV protection in the wild.
 
