This test shows how an antivirus behaves with certain threats, in a specific environment and under certain conditions. We encourage you to compare these results with others and take informed decisions on what security products to use. Before buying an antivirus you should consider factors such as price, ease of use, compatibility, and support. Installing a free trial version allows an antivirus to be tested in everyday use before purchase.

Not when the population you are researching changes all the time; there you are wrong. The active in-the-wild malware changes every month, and so does its prevalence.


You claim that the changes in results indicate too small a sample set. But this only applies to more or less stable populations. So taking two years of monthly results is arbitrary, since that reasoning does not apply to changing populations.

When you poll how many people have a valid driving license in Poland, the number of inhabitants changes only marginally (people die and children are born). When you test 100 to 250 malware samples every month, the malware population from which the sample is taken changes every month (new malware appears and malware prevalence changes).

Some claim that half a million new malware samples appear every day. Researchers say that around half a million new malware binaries appear every day, of which only a fraction are unique (less than 0.5 percent), coming from an even smaller number of families (less than 0.1 percent). This suggests that the 100 to 250 in-the-wild malware samples are probably taken from a heavily changed population, which would explain the large differences in outcome.

This is different from the "do you have a driving license" poll in Poland, where the number of people who could potentially be asked (the population) changes only marginally in a month. According to Google, Poland had 37.5 million inhabitants in 2021; with about 500,000 deaths and 300,000 births, the population turns over by only about 2.1 percent per year, or roughly 0.2 percent per month.
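As a quick check of that arithmetic, here is the calculation written out. It assumes the 2.1 percent counts deaths and births together as turnover (my reading of the post; the net change would be smaller):

```latex
\frac{500{,}000 + 300{,}000}{37{,}500{,}000} \approx 0.0213 \approx 2.1\%\ \text{per year},
\qquad
\frac{2.1\%}{12} \approx 0.18\% \approx 0.2\%\ \text{per month}
```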

So you are right that real-world malware sample sets are small. But when 500,000 new malware executables appear per day, of which only 0.35% are unique, that means around 1,750 unique malware samples every day, totaling around 60,000 every month. According to the sample size calculator, a sample size of 382 would be sufficient.
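The numbers above can be reproduced with a minimal sketch. It assumes the calculator applies Cochran's formula with a finite population correction at its usual defaults (95% confidence, so z = 1.96, a 5% margin of error, and a 50% population proportion). The function name and constants are illustrative, not taken from the calculator itself.

```python
import math

def cochran_sample_size(population, margin=0.05, z=1.96, proportion=0.5):
    """Cochran sample size with finite population correction (assumed model)."""
    n0 = z**2 * proportion * (1 - proportion) / margin**2  # infinite-population size
    n = n0 / (1 + (n0 - 1) / population)                   # correct for finite population
    return math.ceil(n)

new_binaries_per_day = 500_000
unique_share = 0.0035                      # 0.35% of new binaries are unique
unique_per_day = round(new_binaries_per_day * unique_share)
unique_per_30_days = unique_per_day * 30   # the post rounds this up to about 60,000

monthly_population = 60_000                # figure used in the post
sample_size = cochran_sample_size(monthly_population)

print(unique_per_day, unique_per_30_days, sample_size)
```

Note that 30 days of 1,750 samples gives 52,500, so the 60,000 used in the post is a generous rounding; the resulting sample size barely depends on it.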

When you take into account that unique malware can often be identified because they are part of a malware family (generic fingerprints), the sample set needed could further decrease. This would imply that the real world test samples (100 to 250 per month) used by the mainstream test labs could well be representative.


fwiw, I took several university courses in statistics eons ago, nuances forgotten, and between you and @Andy Ful this discussion seems a little "apples and oranges" -- but friendly too!

Not exactly. The fluctuations are always big if the number of tested samples in one test is very small compared to the total number of samples. Even significant changes in the population cannot change that (as long as the number of tested samples remains relatively small).
You can see this on the charts of the AV-Comparatives, AV-Test, and SE Labs tests (over the years 2019-2022, and probably before this period). Furthermore, what counts as "more or less stable" can depend on the as-yet-unknown statistical model, which may not be very sensitive to changes in the population.
Starting from this point, we can increase the number of tests and observe whether the fluctuations decrease. The changes in the population are now unimportant because we take the cumulative number of missed samples over a much longer period (many tests).
The changing population can have some impact when we compare the results of two periods, like 2019-2020 with 2021-2022. But no one knows how big that impact can be.
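A rough illustration of why cumulative results fluctuate less: the sketch below treats every tested sample as an independent trial with the same miss probability (a plain binomial model, which is an assumption and certainly simpler than real AV tests). The miss rate of 13 in 7548 is borrowed from the two-year figures quoted later in this thread, and the sample counts are only examples.

```python
import math

MISS_RATE = 13 / 7548  # assumed constant miss rate, taken from the two-year data

def miss_fluctuation(samples, miss_rate=MISS_RATE):
    """Expected misses, their standard deviation, and the relative fluctuation."""
    expected = samples * miss_rate
    std = math.sqrt(samples * miss_rate * (1 - miss_rate))
    return expected, std, std / expected

for samples in (300, 626, 7548):  # one monthly test, one 4-month test, two years
    expected, std, relative = miss_fluctuation(samples)
    print(f"{samples:5d} samples: {expected:5.2f} misses expected, "
          f"std {std:4.2f}, relative fluctuation {relative:.0%}")
```

Under these assumptions, the relative fluctuation of the miss count is well above 100% for a single monthly test and drops below 30% for the two-year total.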

Did you notice that this calculator is insensitive to changes in the population size? If you use a population 100 times bigger (6,000,000), you get a sample size of 385. This may be an example of how significant changes in the population can be unimportant for some statistical models.
From the testing methodology (for example, AV-Comparatives), we know that one test (about 300 samples per month) cannot differentiate between AVs in the same award group (usually 2/3 of the tested products). So, several tests are required.
This calculator uses a particular statistical model that is clearly incompatible with the data from AV tests.
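The insensitivity to population size is easy to see if we assume the calculator uses Cochran's formula with a finite population correction (an assumption, but one that reproduces both 382 and 385). Here $N$ is the population size, $p$ the population proportion, $e$ the margin of error, and $z$ the normal quantile for the confidence level:

```latex
n_0 = \frac{z^2\, p\,(1-p)}{e^2}, \qquad n = \frac{n_0}{1 + \dfrac{n_0 - 1}{N}}

% With z = 1.96, p = 0.5, e = 0.05:
n_0 = \frac{1.96^2 \cdot 0.25}{0.05^2} = 384.16

N = 60{,}000:\quad n \approx 381.7 \;\Rightarrow\; 382
\qquad
N = 6{,}000{,}000:\quad n \approx 384.1 \;\Rightarrow\; 385
\qquad
N \to \infty:\quad n \to n_0
```

Once $N$ is much larger than $n_0$, the correction term is close to 1, so the population size hardly matters in this model.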


(1) The fluctuations are always big if the number of tested samples in one test is very very small compared to the number of total samples.

(2) Did you notice that this calculator is insensitive to the changes in the population? If you use a population size 100 times bigger (6000000), then you get the sample size 385. This calculator uses a particular statistical model that is clearly incompatible with the data related to AV tests.

(1) Again, that is not valid when the population itself changes a lot (new malware every day). It is true for relatively stable populations (like demographic research).

(2) Not insensitive; it changes a little. That is how sampling of large data sets works (finding the smallest statistically relevant sample set). The Cochran formula (link) allows you to calculate an ideal sample size given a desired level of precision, a desired confidence level, and the estimated proportion of the attribute present in the population. Cochran's formula is considered especially appropriate in situations with large populations. Have a look at the YouTube video I posted earlier, in which the formula is explained.

Another reason why I think the sample set used by AV-Comparatives could be statistically relevant is that they are associated with the University of Innsbruck and three other universities (link). In the academic research world the holy trinity is (a) using explicit references to earlier research, (b) using statistically relevant sample sizes for both research and reference populations, and (c) being transparent in the result validation. The partners page of AV-Comparatives also states that they developed their testing method in cooperation with the University of Innsbruck. This makes it very unlikely that the sample sizes of the AV-Comparatives test sets are not statistically relevant, IMO.

As @simmerskool noticed this is becoming an apples versus oranges discussion (the third-party links I provided don't seem to convince you), so I rest my case.

Normally I would also say that it is very, very unlikely that one individual (Andy Ful) could outsmart a bunch of universities (on deciding what is and what is not a statistically relevant sample set). But since you have proven able to outsmart one of the largest leading tech firms (with your security tools ConfigureDefender, Simple Windows Hardening, Hard_Configurator, and the upcoming HomeApplocker), I am not betting my right hand against that claim and will settle for "let us agree to disagree".

The calculator depends only on 1 free parameter (Population Proportion) which is set to 50%. So, it calculates a very simple statistical model. When you want to tackle realistic data from AV tests, then there are some important free parameters that can significantly increase the fluctuations:

The sample prevalence (in the tested pool).

How old the samples are (in the tested pool).

How many missed samples are shared by 2, 3, 4, 5, 6, ... AVs (total samples pool).

How many missed samples are shared by 2, 3, 4, 5, 6, ... AVs (tested samples pool).

So, a realistic statistical model for data with possibly big fluctuations must include many free parameters. Such a statistical model does not exist yet (as far as I know). That is why I do not use any concrete statistical model.
One can use a sample size calculator and known formulas, but there is no way to verify how the calculated result differs from the real one.

I know the AV-Comparatives article about statistically relevant sample size, but they do not use it in their statistical analysis for some reason. They use a clustering method instead, which gives the same results for the first awarded group as a random model from my thread:

Let's consider the example of an initial pool of 30000 sufficiently different malware variants in the wild and a particular AV which failed to detect 100 of them. Next, we run a trial that draws 380 samples from these 30000 and calculate the probabilities for finding in...

malwaretips.com
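The linked post is truncated here, so the following is only a sketch of the kind of calculation its preview describes, not a reproduction of it. It assumes the 380 samples are drawn at random without replacement, in which case the number of missed samples among them follows a hypergeometric distribution.

```python
from math import comb

POOL = 30_000          # malware variants in the wild (from the preview text)
MISSED_IN_POOL = 100   # variants the AV fails to detect
DRAWN = 380            # samples drawn for one test

def prob_missed(k):
    """Probability that exactly k of the drawn samples are ones the AV misses."""
    return (comb(MISSED_IN_POOL, k)
            * comb(POOL - MISSED_IN_POOL, DRAWN - k)
            / comb(POOL, DRAWN))

expected_missed = DRAWN * MISSED_IN_POOL / POOL

print(f"expected missed samples per test: {expected_missed:.2f}")
for k in range(5):
    print(f"P({k} missed) = {prob_missed(k):.3f}")
```

With roughly 1.3 misses expected per test, results of 0, 1, 2, or 3 misses are all quite likely for the same AV, which is the fluctuation being discussed.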

Here is how the clustering method works in practice:

The AVs from any particular cluster cannot be differentiated in the test due to experiment errors (mostly due to statistical fluctuations related to the test methodology). In this particular chart, 2/3 of all AVs are in the first cluster.

I will play a little with the calculator, but not for one test results but for two-year cumulative results. In two-year cumulative results the fluctuations are relatively smaller.

Simple statistical model? The 50 percent default (for the population proportion) is the normal distribution model of the famous mathematician Carl Friedrich Gauss, and the backbone of probability calculation. I changed the text of the spoiler in my previous post for clarity (illustrating how malware families might impact the population proportion and the sample size).

I won't be arguing your last post. Thanks for the interesting discussion

You are unnecessarily irritated.
It is good to discuss a problem with someone who tries to understand the details (thanks for that).
The Gauss distribution is a very simple statistical model (only 2 free parameters).

Let's forget about statistical arguments and simply use the proposed sample size calculator to calculate something useful from the last R-W AV-Comparatives test.

We want to find out whether the differences between AVs in blocked samples are statistically meaningful. As calculated by @Max90:

The sample size is 245 and the margin of error is 5%. Let's use these values for further calculations. Avast blocked all 626 samples (a perfect result), so the margin of error for Avast is 5/100 * 626 ~ 31 samples. The real result for Avast must lie somewhere in the interval 595-626. This is highly disappointing, because the worst AV scored 619 and also lies in this interval.

So let's use this calculator to reproduce the conclusion of the AV-Comparatives statistics, which say that the margin of error for Avast is less than 2 samples:
margin of error < 2/626 ~ 0.319%

The number of samples should be 30101 which is 48 times more than the samples in the test. This sample size is also greater than samples in the two-year period in AV-Comparatives + AV-Test + SE Labs (7548 samples). So, if the AV-Comparatives guys are right, then they must use a very different statistical model.
They have far more information about the samples, including the parameters mentioned by me in the previous posts. So, they can create an improved model to reduce the sample size.

Post corrected.
Previously I used 4 samples as the margin of error for Avast, but the correct value is less than 2. Avast (first cluster) is well separated from Total Defense (second cluster) when:
"Avast blocked" - error > "Total Defense blocked" + error

By trial and error, I have got an idea. What if we take as a "Population Proportion" the percentage of malware that could be blocked by Avast in the wild? I took this value (approximately) from my two-year statistics:
Population Proportion ~ (7548 - 13)/7548 ~ 99.828%
So we test the population where all malware samples in the wild are divided into two categories: "blocked by Avast " and "Not blocked by Avast".
Now we can use this value 99.828% as in my previous post (instead of 20%):

Wow! The calculated sample size is 642, almost the same as in the test (626). I am not sure if this is a coincidence or something relevant. It will require some investigation.
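Both sample sizes quoted above (30101 and 642) can be reproduced with a short sketch, under assumptions the posts do not state explicitly: Cochran's formula with finite population correction, a 95% confidence level (z = 1.96), the population of 60,000 from the earlier posts, and a margin of error entered as 0.319%. The 20% proportion for the first figure is inferred from the remark "instead of 20%".

```python
import math

def sample_size(population, margin, proportion, z=1.96):
    """Cochran sample size with finite population correction (assumed model)."""
    n0 = z**2 * proportion * (1 - proportion) / margin**2
    return math.ceil(n0 / (1 + (n0 - 1) / population))

POPULATION = 60_000   # assumption: monthly unique samples from the earlier posts
MARGIN = 0.00319      # 0.319%, i.e. about 2 samples out of 626

size_p20 = sample_size(POPULATION, MARGIN, 0.20)        # inferred earlier input
size_avast = sample_size(POPULATION, MARGIN, 0.99828)   # Avast's two-year block rate

print(size_p20, size_avast)
```

The required sample size scales with p(1 - p), which is why moving the proportion from 20% to 99.828% shrinks the result by a factor of almost 50. Whether that choice of proportion is justified is a separate question.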

The previous post revealed some interesting possibilities. So, now I will apply the calculator to the two-year statistics from the OP. The table will use blocked samples instead of missed samples.

Real-World Triathlon 2021-2022: SE Labs, AV-Comparatives, AV-Test (7548 samples in 24 tests)

Norton is better than Microsoft if:
"Norton 360" - error ~ Microsoft + error ----> error = 7.75 samples
Margin of error = 7.75/7536 *100% = 0.103%
Population Proportion = 7536/7548 *100% = 99.841%

On the basis of the above, the tests from one and a half years (AV-Comparatives + AV-Test + SE Labs) would be sufficient to show (with a Confidence Level of 95%) that Norton can provide better protection than Microsoft.

Analogous calculations for Kaspersky and Microsoft:
Margin of error = 4.75/7530 *100% = 0.0631%
Population Proportion = 7530/7548 *100% = 99.761%

If the statistical model of this calculator is true, then the sample size 7834 would require about 2 years of testing (AV-Comparatives + AV-Test + SE Labs). But, in the case of Kaspersky, the Confidence Level will be much lower (75%).
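The inputs for both comparisons can be recomputed as below. This is a sketch under assumptions: Microsoft's blocked count (7520.5) is not quoted in the posts and is inferred here from the two stated errors, and the population size entered into the calculator is unknown, so the finite population correction is left out. The results should therefore match the quoted sample sizes only in order of magnitude.

```python
from statistics import NormalDist

TOTAL_SAMPLES = 7548
MICROSOFT_BLOCKED = 7520.5   # inferred: 7536 - 2*7.75 and 7530 - 2*4.75

def compare(blocked, rival_blocked, confidence):
    """Error, margin, proportion and uncorrected Cochran size for one AV pair."""
    error = (blocked - rival_blocked) / 2      # blocked - error = rival + error
    margin = error / blocked                   # relative margin of error
    proportion = blocked / TOTAL_SAMPLES       # share of samples blocked
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    n0 = z**2 * proportion * (1 - proportion) / margin**2
    return error, margin, proportion, n0

norton = compare(7536, MICROSOFT_BLOCKED, confidence=0.95)
kaspersky = compare(7530, MICROSOFT_BLOCKED, confidence=0.75)

for name, (error, margin, proportion, n0) in (("Norton", norton),
                                              ("Kaspersky", kaspersky)):
    years = n0 / (TOTAL_SAMPLES / 2)           # 7548 samples per two years
    print(f"{name}: error {error}, margin {margin:.4%}, "
          f"proportion {proportion:.3%}, n0 ~ {n0:.0f} (~{years:.1f} years)")
```

Under these assumptions Norton needs roughly one and a half years of combined tests at 95% confidence, and Kaspersky roughly two years at 75% confidence, in line with the conclusions above.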

Simple definition for the population proportion, written in plain English. Finding confidence intervals and sample proportions, step by steps plus videos.


It is worth knowing that the probabilistic model used in the sample size calculator is different from the probabilistic scenario of AV tests. The first is time-independent. The second is generally time-dependent.
Let's assume that the test is similar to the AV-Comparatives R-W tests (600 samples per 4 months). When we draw 600 samples from the population "Samples in the wild", then:

In the first case, it is possible that among these 600 samples, there will not be any Samples from the first day of the test. The same is true for any particular day (or group of days) of testing.

In the second case among these 600 samples, we can always find a few (approximately 5) Samples from each day.

In the calculator, we can have only one parameter "Population Proportion". In the realistic test, we can have 120 such parameters. When we use the sample size calculator for the real AV testing data, a kind of average "Population Proportion" must be used. In my examples, I used the average from long-term data (two-year testing data). The more appropriate average would be over the period of testing (but hardly possible).
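The difference between the two cases can be sketched as follows. The pool of 500 in-the-wild samples per day is a hypothetical number chosen only for illustration, and the probability of an empty day uses the with-replacement approximation, which is reasonable when the pool is much larger than the draw.

```python
import random
from collections import Counter

DAYS = 120                 # 4 months of testing
POOL_PER_DAY = 500         # hypothetical number of in-the-wild samples per day
DRAWN = 600                # samples used in the test

# First case: one simple random draw from the whole pool (time-independent).
# Approximate chance that a given day contributes no sample at all.
p_day_empty = (1 - 1 / DAYS) ** DRAWN

def simple_random_draw(rng):
    """Count drawn samples per day when sampling the whole pool at once."""
    pool = [day for day in range(DAYS) for _ in range(POOL_PER_DAY)]
    return Counter(rng.sample(pool, DRAWN))

def per_day_draw():
    """Second case: the same number of samples is taken on every day."""
    return Counter({day: DRAWN // DAYS for day in range(DAYS)})

random_counts = simple_random_draw(random.Random(0))
daily_counts = per_day_draw()

print(f"chance that a given day is empty (first case): {p_day_empty:.4f}")
print("samples per day, first case, min/max:",
      min(random_counts.get(d, 0) for d in range(DAYS)),
      max(random_counts.values()))
print("samples per day, second case:", set(daily_counts.values()))
```

In the first case the per-day counts scatter widely around 5, and an occasional day can be missed entirely; in the second case every day contributes exactly 5 samples, so each day effectively has its own "Population Proportion".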

Final conclusion.
Thanks to @Max90, I examined the possibility of using a sample size calculator. This simple statistical model works surprisingly well, although it is a rough approximation of the real scenario and must be combined with long-term data (like in the OP). It seems that this combination roughly confirms the necessity of using some AV clustering in the R-W tests. The number of tested samples is insufficient to reliably differentiate between several AVs contained in the same cluster. Such clustering has been done for several years by AV-Comparatives, AV-Test, and SE Labs.

The calculations with the "sample size calculator" partially confirm that a two-year cumulative statistic is usually required to find statistically significant differences between AVs. In some cases, a three-year (or longer) testing period is needed. But using three years or more is acceptable only when the AV's protection rate does not change significantly. So I will probably stay with a two-year long-term period.

The readers probably noticed that I had a problem with using the sample size calculator in the real scenario of AV testing. What was the problem?
The calculator uses only one parameter "Population Proportion" (which is easy to understand).

But, in the AV test, we have several AVs, and each AV can generate a different population proportion. Which one should be chosen? There is no natural choice. Should we use the average "Population Proportion" over all tested AVs? Another possibility would be using "Population Proportion" = 80% (or 20%) which would approximately describe truly never-before-seen malware which was not blocked by any AV.

After a few hours, I realized that by using some trick the calculator can be used to compare the adjacent clusters, by selecting only 2 AVs (each from a different cluster) and using only the "Population Proportion" of one AV. The trick is that this will work only if the difference in the protection of AVs is not too big.
In this post, I will improve this trick by using an average of the "Population Proportion" generated by both AVs. Here is an improved calculation for Norton and Microsoft from the two-year statistics:

For the Confidence Level 90% the sample size is 6632 (lower than in two-year statistics 2021-2022)
For the Confidence Level 95% the sample size is 9323 (higher than in two-year statistics 2021-2022).
In the two-year statistics (AV-Comparatives + AV-Test + SE Labs) the sample size is only 7548.
So for the two-year statistics from OP, the Confidence Level is probably close to 92% (Norton is better than Microsoft).

Edit: By using different absolute errors (6.18 for Norton and 9.32 for Microsoft), one can optimize the calculated sample size to 8950 at the Confidence Level 95%.


As I read this thread, isn't there another variable? The Avast used in year 2 or year 3 is not necessarily the same Avast that was used in year 1, since the vendors are "upgrading" or at least modifying their AV products every so often. Perhaps that is irrelevant, or is that just part of the averaging?


It is certainly true that AV programs change over time. For example, they adapt to a new threat situation. However, a test from the past indicates how well the AV program was adapted to the threat situation at the time. The results of the past are therefore also decisive for the evaluation of an AV program.