Disclaimer
This test shows how an antivirus behaves with certain threats, in a specific environment and under certain conditions.
We encourage you to compare these results with others and make informed decisions about which security products to use.
Before buying an antivirus you should consider factors such as price, ease of use, compatibility, and support. Installing a free trial version allows an antivirus to be tested in everyday use before purchase.

Andy Ful

Let's assume that we have the below results for the test:
2 AVs with 0 undetected malware
4 AVs with 1 undetected malware
4 AVs with 2 undetected malware
3 AVs with 3 undetected malware
1 AV with 4 undetected malware

These numbers can be very close to the probabilities for a hypothetical AV when:
  • k = 60 (the number of samples that compromised the hypothetical AV in the large pool of 300000 samples);
  • m = 300000 (the number of samples in the large pool);
  • n = 10250 (the number of samples included in the AV Lab test).
We can calculate the probabilities:
p(0) = 0.124 (15.5 * 0.124 ~ 2)
p(1) = 0.264 (15.5 * 0.264 ~ 4)
p(2) = 0.275 (15.5 * 0.275 ~ 4)
p(3) = 0.188 (15.5 * 0.188 ~ 3)
p(4) = 0.095 (15.5 * 0.095 ~ 1)
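These values can be reproduced with a short script. This is only a sketch of the model described above, assuming the hypergeometric form implied by drawing n test samples at random from the pool of m, with k of them missed by the hypothetical AV:

```python
from math import comb

m, k, n = 300_000, 60, 10_250   # pool size, misses in the pool, test-set size

def p_missed(i: int) -> float:
    """Probability that a random test set of n samples contains exactly i
    of the k samples missed by the hypothetical AV (hypergeometric)."""
    return comb(k, i) * comb(m - k, n - i) / comb(m, n)

for i in range(5):
    p = p_missed(i)
    print(f"p({i}) = {p:.3f}  -> about {15.5 * p:.1f} AVs")
# p(0)=0.124, p(1)=0.264, p(2)=0.275, p(3)=0.188, p(4)=0.095
```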

We can see that these probabilities are approximately proportional to the number of AVs with each given number of undetected samples. The proportionality constant is about 15.5. We can compare these statistics to the AV-Comparatives Malware Protection test for March 2020:
3 AVs with 0 undetected malware (F-Secure, G Data, NortonLifeLock)
4 AVs with 1 undetected malware (ESET, K7, TotalAV, Total Defense)
3 AVs with 2 undetected malware (Avast, AVG, Bitdefender)
3 AVs with 3 undetected malware (Avira, Kaspersky, VIPRE)
1 AV with 4 undetected malware (Panda)

The difference is minimal. For example, if Norton and Total Defense had each missed one more sample, then the results for the 14 AVs would be very close to random trials for one hypothetical AV.

It seems that a similar conclusion was made by AV-Comparatives because it awarded all 14 AVs.
The 8 products (*) got lower awards due to false alarms.

Edit: It seems that the same conclusion can be derived from the cluster analysis in the report:

[Image: cluster-analysis table from the AV-Comparatives report]
The AVs mentioned in my statistical model belong to cluster one (see the last column) and were awarded. The other AVs belong to other clusters.
 

Andy Ful

So, what can we say about the AV-Comparatives Malware tests?
There were about 15000 different malware variants in March 2020 (according to the SonicWall statistics), and the statistical model from the previous post assumed m = 300000 malware samples (the large pool). It is probable that the tested samples included many polymorphic siblings, so one SonicWall malware variant was duplicated on average 20 times (20 * 15000 = 300000).
I was not sure at first, but the large pool could also be several times greater if we proportionally increased the number k of missed samples (confirmed: the differences are minimal and unimportant even for m = 8000000, k = 1600).
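A quick way to check that scaling claim is a sketch of my own, reusing the hypergeometric probabilities from the earlier snippet and comparing m = 300000, k = 60 with the proportionally scaled m = 8000000, k = 1600 mentioned above:

```python
from math import comb

def miss_probabilities(m, k, n=10_250, up_to=4):
    """Hypergeometric P(i missed samples in the test) for i = 0..up_to."""
    denom = comb(m, n)
    return [comb(k, i) * comb(m - k, n - i) / denom for i in range(up_to + 1)]

small = miss_probabilities(m=300_000, k=60)
large = miss_probabilities(m=8_000_000, k=1_600)   # same k/m ratio, much larger pool
for i, (a, b) in enumerate(zip(small, large)):
    print(f"p({i}): {a:.3f} vs {b:.3f}")
# The two columns differ by at most a few thousandths (e.g. 0.124 vs 0.129 for p(0)),
# which changes the expected number of AVs by well under one.
```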

Now we can also understand what probably happened to TrendMicro, which was compromised 82 times in the March 2020 test and was not compromised at all in September 2019.

TrendMicro could simply have missed a few SonicWall malware variants that had many polymorphic variations among the tested samples, while the other AVs apparently detected them.

Furthermore, k = 60 (missed samples in the large pool) is very small compared to m = 300000 (the number of samples in the large pool), a miss rate of only 0.02%. This suggests that the tested samples were mostly not 0-day, but rather a few days old on average.
 

Andy Ful

Now, an example of test scorings that can be interpreted without calculations.
0 missed - 1 AV
1 missed - 1 AV
2 missed - 2 AVs
3 missed - 2 AVs
4 missed - 0
...
11 missed - 0
12 missed - 2 AVs
13 missed - 0
...
17 missed - 0
18 missed - 1 AV
19 missed - 0
...
42 missed - 0
43 missed - 1 AV

For the first 6 AVs, it is possible to find a statistical model similar to random trials of one hypothetical AV.
But no such model can explain the big gaps to the rest of the AVs. So, the first 6 AVs should be treated as equally good and awarded. The last 4 AVs should not be awarded.
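To back up the "no calculations needed" claim, here is a rough sketch of mine that fits one hypothetical AV to the first six scores in the example above, using a Poisson approximation of the model with the mean number of misses estimated from those six results, and then checks how likely a score of 12 or more would be under the same model:

```python
from math import exp, factorial

top6 = [0, 1, 2, 2, 3, 3]      # misses of the first six AVs in the example above

lam = sum(top6) / len(top6)    # ~1.83 expected misses for the hypothetical AV

def poisson(i: int) -> float:
    return exp(-lam) * lam**i / factorial(i)

for i in range(5):
    print(f"{i} missed: expected ~{6 * poisson(i):.1f} AVs, observed {top6.count(i)}")

# 12 is the smallest score among the remaining four AVs (12, 12, 18, 43).
tail = 1.0 - sum(poisson(i) for i in range(12))
print(f"P(12+ misses) ~ {tail:.0e}")   # effectively zero, so the gap is not random noise
```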

The above example was in fact taken from a real test made recently by MRG Effitas:

[Image: MRG Effitas Q2 2020 360 Degree test results]
"Out of ten tested security products, the following six managed to meet the specification to attain our Q2 2020 360 Degree Certification.
• Avast Business Antivirus
• Bitdefender Endpoint Security
• CrowdStrike Falcon Protect
• ESET Endpoint Security
• Microsoft Windows Defender
• Symantec Endpoint Protection
"
 

Andy Ful

What is really done in AV Labs tests?
They first try to reduce the huge number of in-the-wild malicious samples, and then perform the tests on samples that are in some way representative of all samples in the wild.
In numbers, per month, it looks like this:

A few million malware samples ----> a few hundred test samples

How the AV Labs manage to do this is a kind of magic.

Anyway, as can be seen from the SonicWall reports, the number of different malware variants is a few tens of thousands per month. They are probably good representatives of all samples in the wild.
So, one must introduce a statistical model, because there are many possible ways of choosing the set of tested samples from the hundreds-of-times-larger set of different malware variants.

The statistical model in this thread is one of many possible models. We simply take n samples at random from the larger set of m samples. The number of possibilities is enormous and can be calculated using the binomial coefficient B(m, n). For example:
B(30000, 100) ~ 4.7 * 10^289, which is much greater than the number of atoms in the Universe.
The details of the model are in the OP. When we apply this model to the real data, it is assumed that the tested AVs have very different sets of missed malware in the wild.
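As a quick check of that figure, a one-off snippet of mine (not from the original post):

```python
from decimal import Decimal
from math import comb

b = comb(30_000, 100)     # ways to pick 100 test samples out of 30,000 variants
print(f"B(30000, 100) ~ {Decimal(b):.1E}")
# -> 4.7E+289, matching the ~4.7 * 10^289 quoted above
```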
 

McMcbrad

Hmm, extremely interesting work, @Andy Ful, and I have to agree with you on one thing.
Consistency is the key to measuring how effective a solution is.

When we talk about a malware set, we can't really come to any sort of conclusion without knowing what exactly was in it.
This is like measuring your calorie intake today based on how many times you have eaten, but without specifying what the food was.
The SonicWall statistics don't mean much either. You might have hundreds, thousands, millions or trillions of new samples, but let's not forget that sophisticated cybercriminal tools have the following perks:

  • Frequent updates
  • Encoding
  • Archiving and other obfuscating techniques
  • "Noise" a.k.a inserting random and pointless logics, methods and classes within the executable
  • polymorphism
  • and many others

Because of all these factors, you might have 1 million new malware samples this month, but let's say 990k belonged to the Emotet and Dridex families and most AVs can detect them based on an AI model.
That automatically means most solutions will get a decent score this month, but what will happen if/when an ENTIRELY new piece of malware is introduced?
With the constant striving to evade detection (the cat-and-mouse game), you might have 1 billion samples that differ from each other yet simultaneously exhibit similar, if not the same, behaviour.
Only consistency within these tests can be some sort of proof that the solution's AI is not reactive, but proactive.

Also, does the malware set consist only of executables, where you have reputation, MOTW, digital signatures, etc., or does it also include scripts and Microsoft Office documents? The former is easy to handle, but the latter might be tricky.
Did the set contain PUPs?
Without a detailed explanation of every single piece of malware they used, I simply won't trust any of their tests and would rather continue testing on my own.
It's also very important to note that home products are not designed to be tough and stellar; that is reserved for corporate solutions. Automation and the absence of (inexperienced) user intervention are key when designing products for personal use, and it's naïve to believe that such products can have a 100% success rate in a proper test. However, this is fine because:
  • Home users are unlikely to be the target of sophisticated attacks. Advanced and persistent malware requires a heavy investment, which most likely won't be worth it.
  • With the introduction of smart phones, many tasks are not carried out on Windows devices.
  • Users rely more and more on trusted platforms and apps to shop, stream, and discover, which reduces their browsing time and together with that, their likelihood of contracting malware.
  • Windows 10 itself has become a lot more secure.
  • Backup and cloud storage services come activated by default with every smartphone, meaning sensitive data is unlikely to be stored on a laptop and users are unlikely to pay a ransom.
  • With the introduction of IoT, cyber-criminals can't focus solely on Windows anymore.
  • 2-factor authentication and new banking security standards have made money stealing, muling and laundering much more difficult and risky.
Social engineering and scams requiring payment in Bitcoin or gift cards (the Amazon order scam, refund scam, tech support scam, FBI scam, tax scam, dating scam and many others) are now the most profitable attack vector, sometimes requiring a minimal set of skills and ensuring the criminal won't get caught easily.
Efforts to reach a 100% malware detection rate would simply be a waste of time and resources, decreasing the ability to block other dangers.
Heavy security (where needed) can be achieved via system policies and hardening, proper solution administration/ endpoint detection and response tools, and user training.
 

Andy Ful


@McMcbrad,


Although I agree with most things you mentioned in your post, a few things have to be clarified:
  1. In this thread, I tried to show that most scorings in the AV tests can be easily explained by simple statistical random models.
  2. I also confirmed that there are many such models. In many cases, another useful model can be constructed by proportionally changing the values of m and k. So, these models are independent of the SonicWall statistics (although those were a good starting point).
  3. I do not insist that any of these models reflects reality, but I claim that such models cannot be rejected on the basis of the publicly available information about the testing procedures.
  4. From the fact that random models can explain the results of most AVs in one particular test, it follows that such a particular test alone cannot give us (the readers) sufficient information to compare the AVs protection in the wild.
  5. Some useful information can be derived only by comparing several similar tests. For example by looking for the consistently high scorings (or consistently low scorings).
  6. The conclusions taken from the models presented in this thread are consistent with awards proposed by AV testing labs. They are also consistent with cluster analysis (first cluster) made by AV-Comparatives in Malware Protection tests.
There are some things we probably do not fully agree on, but they are not relevant to this thread. (y)

Edit: The post was slightly edited to be clearer.
 

McMcbrad


@McMcbrad, Although I agree with most things you mentioned in your post, a few things have to be clarified: ...
Sorry, I read somewhere that someone didn’t trust the tests and I mixed the threads 😅😅😅
 

McMcbrad

The AV Labs do enormous work, but the results from test reports are commonly misunderstood by readers.
Anyway, the testing procedures are far from perfect due to the high complexity of the malware world.
I’ve stopped following those a long time ago. If you think detection/protection testing is not perfect, the performance tests are even worse, with some products deemed both the heaviest and the lightest by two different organisations. It is clear that they don’t have a set standard, even though AMTSO is supposed to regulate them. I believe these tests bring unnecessary confusion to users.
 

Andy Ful

I’ve stopped following those a long time ago. ... I believe these tests bring unnecessary confusion to users.
Anyway, the situation without any tests would be even worse. :unsure:
 

plat1098

Assuming each malware sample has equal weight, statistically, the larger the sample size AND the larger the subject pool, the less statistically significant each missed sample is. Here's where it gets "fun" to add the color red to the graphs! Drama, drama, look at the difference, when it could be statistically insignificant.

Your calculations are the utopia of AV lab testing, Andy Ful. They should all be so clean. Further, as someone else stated above me, one quarter's results should not be the gospel ad infinitum for a given brand, as there is so much inherent variability, both for samples and subjects. These lab tests aren't clean; it's virtually impossible.

Now you have the addition of "deviations from defaults" as part of the testing regimen of AV Comparatives for Business, for one. That's where the serious money is, both in security solutions and targets for threats like ransomware. Not all "deviations" are created equal, right?
 

McMcbrad

Assuming each malware sample has equal weight, statistically, the larger the sample size AND the larger the subject pool, the less statistically significant each missed sample is. ...
Even though it is slightly off-topic, still in the light of test analysis and interpretation, I will add that "assuming each malware sample has the same weight" is the biggest issue here.

Malware comes in many flavours: it could be the Lavasoft Ad-aware toolbar that pops up unexpectedly, or it might be a PSW stealer and your accounts could be hacked in minutes.
It might be a downloader whose C&C is dead, or it might be a hacktool trying to reveal your neighbour’s Wi-Fi password.
Without information on "weight", or severity to put it simply, these numbers are meaningless and so is their cluster analysis.
 

Andy Ful

AV-Comparatives Malware Protection tests 2019-2020 (four tests) part two.

This is a continuation of the post in another thread, where the impact of polymorphic samples was skipped:

In this post, I am going to examine the cumulative results for the last two years (March 2019, September 2019, March 2020, September 2020), on the assumption that a strangely high number of missed samples was caused not by several different malware but by one polymorphic malware. Most AVs had such strange results. For example, Kaspersky had 13 missed samples in March 2019 and 9 missed samples in September 2019. What if there were in fact only two polymorphic malware, one in 13 variants and the other in 9 variants? Let's look at the results, where every 9+ missed-sample score was replaced by one polymorphic sample:

----------------Missed samples----Clusters
Avast, AVG.........1+0+2+0...........1,1,1,1
F-Secure ............1+1+0+1...........1,1,1,1
McAfee .............1+1+1+0............1,1,1,1

Norton...............(2)+(2)+0+2.......1,1,1,1

ESET...................1+1+1+2 ..........1,1,1,1
Kaspersky...........1+1+3+1..... .....1,1,1,1
Panda ................1+1+4+1...........1,1,1,1
Microsoft............2+4+1+0...........1,1,1,1
Bitdefender.........1+5+2+1...........1,1,1,1

K7.......................5+5+1+2...........1,1,1,1
Avira* ................0+4+3+4............1,1,1,2
VIPRE ................4+1+3+4 ...........1,1,1,2
Total Defense.....5+1+1+4............1,1,1,2


As we can see, the differences between AVs almost vanished. So, in the Malware Protection tests, even four different tests from two years are probably not sufficient to see important differences between popular AVs. The final scoring can depend heavily on how many polymorphic samples and polymorphic variations were present in the tests. Without knowing this, comparing AVs on the basis of such tests is not reliable at all.
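For completeness, a small sketch of mine that only totals the numbers from the table above, keeping the same "collapse every 9+ score into one polymorphic sample" assumption:

```python
# Missed samples per test, copied from the table above (9+ results already
# collapsed into a single polymorphic sample). Norton's bracketed (2)+(2)
# results are treated as plain 2+2 here.
results = {
    "Avast, AVG": [1, 0, 2, 0], "F-Secure": [1, 1, 0, 1], "McAfee": [1, 1, 1, 0],
    "Norton": [2, 2, 0, 2], "ESET": [1, 1, 1, 2], "Kaspersky": [1, 1, 3, 1],
    "Panda": [1, 1, 4, 1], "Microsoft": [2, 4, 1, 0], "Bitdefender": [1, 5, 2, 1],
    "K7": [5, 5, 1, 2], "Avira": [0, 4, 3, 4], "VIPRE": [4, 1, 3, 4],
    "Total Defense": [5, 1, 1, 4],
}
for name, misses in sorted(results.items(), key=lambda item: sum(item[1])):
    print(f"{name:14} {sum(misses):2d} missed samples over four tests")
# The cumulative totals range only from 3 to 13 missed samples.
```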
The polymorphic samples could also explain the ridiculous results of the four tests in the case of TrendMicro (0 missed samples in the two tests from 2019 and 82 + 175 = 257 missed samples in 2020 ????).😵

The situation is clearer and easier to explain in the case of the Real-World tests, because from the results we know that the polymorphic samples are absent. :)

 

McMcbrad

AV-Comparatives Malware Protection tests 2019-2020 (four tests) part two. ...
The important differences can be seen neither in this nor in any other test, nor can you predict them in any way. The truth about the so-called "difference" is that there is no difference whatsoever: there are temporary leaders, prone to fail at some point. None of these engines shines with some great innovation or machine learning that others haven't yet discovered and utilised. It's just a matter of which AV the attackers will try to evade.

No matter how we evaluate these results, we are missing critical information on:
  1. What malware was used
  2. What logic goes into the AV product
As we will never learn either of the above, all results, be they consistent or not, perfect or terrible, should be taken with a grain of salt.
 

Andy Ful

The important differences can be seen neither in this nor in any other test, nor can you predict them in any way. ...
Ha, ha. I am not brave enough to claim the above firmly, because no one would believe it without solid reasoning, and many people would not believe it even with very solid reasoning. But I have tried hard to show why we cannot easily interpret the results of AV tests, and why something that seems easy to understand (colorful charts of AV scorings) is not easy at all, due to MAGIC.

Millions of malware in the wild --------> MAGIC -------> thousands of tested samples

So, probably a magician would be more appropriate to show the difference. :)
 