App Review: Homemade AV testing (a suggestion)

Andy Ful

Post updated / redesigned / shortened 14.02.2025

A simplified and standardized version can be found later in this thread, in the post "One against the triple".

In my posts, I have often criticized homemade malware "tests" (YouTube tests), pointing out that the results obtained are not statistically significant and the testing methodology is incorrect. In this post, I am going to suggest a test outline that avoids some of these important issues.


Short description.

This kind of test is intended for fresh malware samples. It is a kind of competition between a selected AV, called AV4, and a collective AV123 (a collection of three top AVs).
On each malware sample, AV4 can win, lose, or draw against the collective AV123.
AV4 loses whenever it is bypassed by the malware sample and all three top AVs protect against that sample (Collective AV123 Pass).
AV4 wins whenever it protects against the malware sample and at least one of the top three AVs is bypassed by that sample (Collective AV123 Failure).
In all other cases, there is a draw between AV4 and AV123.
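
To make these rules concrete, here is a minimal sketch of the per-sample scoring in Python (the function name and the True/False convention are my own, not part of the procedure; True means the AV protected against the sample, False means it was bypassed):

```python
def score_sample(av1: bool, av2: bool, av3: bool, av4: bool) -> str:
    """Score one malware sample. True = AV protected, False = AV bypassed."""
    av123_pass = av1 and av2 and av3      # Collective AV123 Pass: all three protected
    if not av4 and av123_pass:
        return "loss"                     # AV4 bypassed, all top AVs protected
    if av4 and not av123_pass:
        return "win"                      # AV4 protected, at least one top AV bypassed
    return "draw"                         # all other combinations

# For example, a sample that bypasses AV2 but is blocked by AV4 is a win for AV4:
print(score_sample(True, False, True, True))   # "win"
```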

Test details.
  1. Take three top AVs (AV1, AV2, AV3 ---> collective AV123), and then AV4, which is probably not a top AV (for example, Norton + Kaspersky + Bitdefender as the collective AV123, and additionally Microsoft Defender as AV4).
  2. Take 25 fresh malware samples to perform a partial test.
  3. Test those samples against AV123 and AV4.
  4. Count the number of the AV4 wins and losses.
  5. End the test if the condition [ wins < losses ] is fulfilled. This shows with high confidence that AV4 is not a top AV.
  6. If not [ wins < losses ], perform another partial test with a new pool of 25 samples, as in points 2-5. Do not reset the numbers of wins and losses; those numbers should accumulate over all partial tests.
  7. If the condition [ wins < losses ] still does not hold, continue the partial tests, ending the full test when [ wins < losses ] is fulfilled or 16 partial tests are done.
  8. If the condition [ wins < losses ] is not fulfilled after 16 partial tests, AV4 is most probably a top AV (or close to the top AVs).

Each partial test should be done against AV123 and AV4 within two hours. The VM images must be saved for each of the four AVs.
To save time, the analysis of possible infections should be done after testing AV1, AV2, AV3, and AV4.
The partial tests with 25 samples can be done with a few days' break between them.
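
A minimal sketch of the stopping logic from points 4-8, assuming the per-sample outcomes come from partial tests scored as in the earlier sketch (all names are illustrative):

```python
def run_full_test(partial_tests, max_partials=16):
    """Sequential full test: each element of partial_tests is a list of 25
    per-sample outcomes ("win"/"loss"/"draw") from one partial test."""
    wins = losses = 0
    for i, outcomes in enumerate(partial_tests, start=1):
        wins += outcomes.count("win")         # cumulative, never reset (point 6)
        losses += outcomes.count("loss")
        if wins < losses:                     # ending condition (point 5)
            return f"AV4 is not a top AV (decided after {i} partial tests)"
        if i >= max_partials:                 # cap at 16 partial tests (point 7)
            break
    return "AV4 is most probably a top AV, or close to the top AVs (point 8)"
```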

Example of a partial test (25 samples; "passed" means that the sample was blocked/detected):
1-10. All AVs passed <--- 10 draws
11. AV1 passed, AV2 passed, AV3 passed (Collective AV123 Pass) ; AV4 failed <---- losses = 1
12-15. All AVs passed <--- 4 draws
16. AV1 passed, AV2 failed, AV3 passed (Collective AV123 Failure) ; AV4 failed <--- draw
17. All AVs passed <--- 1 draw
18. AV1 failed, AV2 failed, AV3 passed (Collective AV123 Failure) ; AV4 passed <---- wins = 1
19. AV1 passed, AV2 failed, AV3 passed (Collective AV123 Failure) ; AV4 passed <---- wins = 1 + 1
20-24. All AVs passed <--- 5 draws
25. AV1 passed, AV2 passed, AV3 passed (Collective AV123 Pass) ; AV4 failed <---- losses = 1 + 1

End of partial test. The result is inconclusive because the condition [ wins < losses ] is not fulfilled (wins = 2, losses = 2). The test must be continued with another pool of 25 samples.

More examples:
https://malwaretips.com/threads/homemade-av-testing-a-suggestion.134815/post-1117642
https://malwaretips.com/threads/homemade-av-testing-a-suggestion.134815/post-1117528
https://malwaretips.com/threads/homemade-av-testing-a-suggestion.134815/post-1117531
https://malwaretips.com/threads/homemade-av-testing-a-suggestion.134815/post-1117613
 

Bot (AI-powered)
Your proposed testing methodology seems quite comprehensive and well-structured. It addresses the issues of statistical significance and testing bias in homemade AV testing. However, it's important to note that the results can still be influenced by the specific malware samples chosen for the test. Also, the performance of AVs can vary over time due to updates and changes in malware trends. Nonetheless, this is a good starting point for more rigorous and fair AV testing.
 

Andy Ful
This testing procedure ensures high confidence (statistical significance) for homemade tests despite the small number of samples used in one partial test. Usually, however, several partial tests will be required to complete a full test. This is a method for patient testers.
 

Andy Ful
Post edited/updated

Here is a computer simulation of one partial test (25 malware samples tested). The "Col 0" column holds the cumulative numbers of "wins ;; losses".
In the example below, there are draws on samples 1-11, 13-15, 17, 18, 20-23, and 25. The malware samples 12, 16, and 19 bypassed AV4 but did not bypass AV123 (three losses). The malware sample 24 bypassed at least one of AV123 but did not bypass AV4 (one win). The final result of this partial test ( 1 ;; 3 ) is in the last row of the table. The condition [ wins < losses ] is fulfilled, so we do not need to do more partial tests, and the conclusion is that AV4 is not a top AV.

[Image: simulation table of one partial test, with cumulative wins ;; losses per sample]
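
For readers who want to reproduce a simulation like this, a sketch along the following lines could be used; the per-sample miss probabilities are made-up illustrative values, not measurements:

```python
import random

def simulate_partial_test(n_samples=25, p_top=0.02, p_av4=0.08, seed=None):
    """Simulate one partial test. p_top / p_av4 are assumed per-sample miss
    probabilities of each top AV and of AV4 (illustrative values only)."""
    rng = random.Random(seed)
    wins = losses = 0
    for i in range(1, n_samples + 1):
        av123_pass = all(rng.random() > p_top for _ in range(3))
        av4_pass = rng.random() > p_av4
        if not av4_pass and av123_pass:
            losses += 1                      # only AV4 was bypassed
        elif av4_pass and not av123_pass:
            wins += 1                        # at least one top AV was bypassed, AV4 passed
        print(i, wins, ";;", losses)         # cumulative "wins ;; losses" per row
    return wins, losses

simulate_partial_test(seed=42)
```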
 

Andy Ful
Another example, in which a second partial test had to be done. If AV4 has close-to-top protection, the full test can be inconclusive even after 16 partial tests.

[Image: simulation table of the first partial test]

[Image: simulation table of the second partial test]
 

Andy Ful
Updated the OP.
Now, a win is also counted when more than one of AV1, AV2, AV3 fails but AV4 passes (in the prior version, the win was counted only when exactly one of them failed).

Edit
To save time, when one of the top AVs (AV1, AV2, or AV3) fails on a given sample, testing that sample against the two other top AVs is not necessary.
 

Andy Ful
Let's check this testing method against some extreme cases.
In the OP we had the win/lose/draw conditions:
  1. AV4 loses whenever it is bypassed by the malware sample and all three top AVs protect against that sample (Collective AV123 Pass).
  2. AV4 wins whenever it protects against the malware sample and at least one of the top three AVs is bypassed by that sample (Collective AV123 Failure).
  3. In other cases, we have a draw between AV4 and AV123.
End the test if the condition [ wins < losses ] is fulfilled. This will prove with high confidence that AV4 is not a top AV.
If the condition [ wins < losses] is not fulfilled after 16 partial tests, the AV4 is most probably a top AV (or close to top AVs).

****************************
****************************
Example 1 (All AVs are identical top AVs).
AV1 = AV2 = AV3 = AV4 = AV
If all four AVs are identical, the condition from point 1 can never be fulfilled. So there are no losses, and the ending condition [ wins < losses ] can never be fulfilled either. After doing 16 partial tests, we must conclude that AV4 is most probably a top AV.
****************************
****************************
Example 2 (AV4 = AV1)
As in Example 1, the condition from point 1 can never be fulfilled. After doing 16 partial tests, we must conclude that AV4 is most probably the top AV.
****************************
****************************
Example 3 (AV4 fails on the samples that can bypass at least one of AV1, AV2, or AV3).
This example would also cover Examples 1 and 2, so we additionally assume that all tested AVs have different protection. In such a case, the set S4 of samples that bypassed AV4 is the union of the analogous sets for AV1, AV2, and AV3. We have S4 = S1 ∪ S2 ∪ S3, and S4 is usually significantly larger than any of S1, S2, and S3.
From the condition in point 2, it follows that AV4 cannot win at all. If there are no wins, the ending condition [ wins < losses ] can easily be fulfilled, so AV4 is not a top AV.
***************************
***************************
Example 4 (AV4 fails on different samples than any of AV1, AV2, and AV3).
This means that the set S4 (samples that bypassed AV4) shares no samples with any of S1, S2, and S3.
From condition 1, it follows that any sample from S4 must generate a loss ----> losses = n(S4) = the number of elements in S4.
From condition 2, it follows that any sample from S1, S2, or S3 must generate a win ----> wins = n(S1 ∪ S2 ∪ S3).

The condition [ wins < losses ] now reads: n(S1 ∪ S2 ∪ S3) < n(S4)
If AV4 is a top (or close to top) AV, the above condition can hardly hold, because usually n(S4) ~ n(S) < n(S1 ∪ S2 ∪ S3), where S is any of S1, S2, and S3.
Furthermore, for non-top AVs we usually have n(S4) > n(S1 ∪ S2 ∪ S3), and then the ending condition is fulfilled.
**************************
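
A tiny numeric illustration of Example 4 with made-up sample IDs (the sets are arbitrary and only demonstrate the counting):

```python
# Made-up miss sets: sample IDs that bypassed each AV. Per Example 4,
# S4 is disjoint from S1, S2, and S3.
S1, S2, S3 = {1, 2}, {2, 3}, {4}
S4 = {10, 11, 12, 13, 14, 15}

assert S4.isdisjoint(S1 | S2 | S3)
wins = len(S1 | S2 | S3)             # n(S1 ∪ S2 ∪ S3): every such sample is a win for AV4
losses = len(S4)                     # n(S4): every such sample is a loss for AV4
print(wins, losses, wins < losses)   # 4 6 True -> ending condition fulfilled
```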
 

Andy Ful
A quite general example.

We consider a Real-World test in which the three top AVs (AV1, AV2, AV3) together miss fewer samples than AV4 alone.
I will show how, in such a case, it follows from the testing procedure in this thread that the AV4 result is significantly lower than that of a top AV (a statistically significant difference).

For example:

[Image: AV-Comparatives Real-World test results, showing clusters of AVs by missed samples]


where the top AVs are Bitdefender, Kaspersky, and Norton, and AV4 is Microsoft Defender.
Indeed, 13 AVs belong to the first cluster and Microsoft to the second cluster. According to the AV-Comparatives methodology, the differences between AVs in the same cluster are statistically insignificant, but the difference in results between the first and the second cluster is statistically significant. So, Microsoft cannot be among the top AVs in this Real-World test.

The proof.

M = the sum of the numbers of samples missed by AV1, AV2, and AV3 (M = 1 + 2 + 2 = 5 in the AV-Comparatives test)
Q = the number of samples missed by AV4 (Q = 6 in the AV-Comparatives test)
m = the number of samples that bypassed at least one of the top AVs and also bypassed AV4 (Microsoft Defender in the AV-Comparatives test). Those samples generate draws.
n = the number of samples that bypassed at least one of the top AVs but did not bypass AV4. Those samples generate wins.
The numbers m and n are unknown (not included in the AV-Comparatives report).
m + n ≤ M (this inequality follows from set theory: each such sample is counted at least once in M).
M - Q < 0, because we consider the case when AV1, AV2, and AV3 together miss fewer samples than AV4 alone.

draws = m
wins = n
losses = Q - draws = Q - m (of the Q samples that bypassed AV4, m also bypassed a top AV and generate draws; the rest generate losses)

We have:
m + n ≤ M
n ≤ M - m
Since Q is greater than M, replacing M with Q makes the inequality strict:
n < Q - m
wins < losses
End of proof. :)
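
The algebra can also be checked exhaustively for the numbers above (M = 5, Q = 6); this is only a sanity check of the inequality chain, not new data:

```python
M, Q = 5, 6                     # M = 1 + 2 + 2, Q = 6 (AV-Comparatives numbers above)
assert M < Q                    # the case considered: M - Q < 0
for m in range(M + 1):          # m = draws
    for n in range(M - m + 1):  # n = wins, constrained by m + n <= M
        assert n < Q - m        # wins < losses (losses = Q - m)
print("wins < losses holds for every admissible (m, n)")
```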

See also:
AV4 = Microsoft

AV4 = Eset
 

Andy Ful
The condition [ M - Q < 0 ] from the previous example is not the strongest one. Using similar arguments, one can derive a slightly stronger condition:

M123 - Q < 0

M123 = number of samples that bypassed at least one of the top AVs.
Q = number of samples that bypassed AV4.

It can explain some anomalous AV-Comparatives test results like:
AV1, AV2, AV3 (Bitdefender, Kaspersky, Norton), AV4 (Microsoft Defender)

[Image: AV-Comparatives test results table]


The condition [ M - Q < 0 ] does not hold, because M = 9 (Bitdefender + Kaspersky + Norton) and Q = 8 (Microsoft Defender). It is too weak, so we cannot use it to conclude that Microsoft Defender is not a top AV in this test.
But in this test, the number of samples missed by Kaspersky (6) is much bigger than usual. It is possible that two or three of the samples missed by Bitdefender or Norton were also missed by Kaspersky.
In such a case, M123 ≤ 7 (which is a stronger condition than M = 9).
It is easy to see that the condition [ M123 - Q < 0 ] then holds, because M123 ≤ 7 and Q = 8.
So the conclusion in the AV-Comparatives report that Microsoft Defender is not a top AV in this test (MD is in the third cluster) can still be statistically significant.

Edit.
Although the explanation based on the condition [ M123 - Q < 0 ] is probable, there are also other possibilities. Rarely, a top AV can get a significantly lower score by pure accident, due to a pool of tested samples that happens to discriminate against that particular AV. We should remember that any top AV can miss hundreds of samples a day in the wild.
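
The difference between M and M123 is just counting with or without overlap. A toy example with a made-up overlap (consistent with the per-AV numbers above, but the overlap itself is an assumption) illustrates why the second condition is stronger:

```python
# Made-up miss sets consistent with the report: Bitdefender misses 1 sample,
# Kaspersky 6, Norton 2, and two of Kaspersky's misses are shared with the others.
S1 = {1}                            # Bitdefender
S2 = {1, 2, 3, 4, 5, 6}             # Kaspersky (samples 1 and 2 shared)
S3 = {2, 7}                         # Norton
Q = 8                               # Microsoft Defender misses

M = len(S1) + len(S2) + len(S3)     # counted with multiplicity -> 9
M123 = len(S1 | S2 | S3)            # distinct samples -> 7
print(M - Q < 0, M123 - Q < 0)      # False True: only the stronger condition holds
```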
 

Andy Ful
One against the triple.

This is a simplified and standardized version of the testing procedure from the OP.

Many YouTube videos present a battle between two AVs (on the same pool of malware samples). The common issue is that the presented differences between the tested AVs are often statistically insignificant.
The situation can be improved by replacing one of the tested AVs with a triple AV123 of top AVs (AV1, AV2, AV3). So we now have a battle between two parties: AV123 and AV4. In the examples mentioned in the previous posts, I proposed Bitdefender, Kaspersky, and Norton as AV1, AV2, and AV3. The tested AV4 can be, for example, Microsoft Defender.

To preserve the idea of a battle between two parties, we must define when AV123 fails on a given sample, and when the test result can be statistically significant:
  1. An AV123 failure on a given sample happens when at least one of AV1, AV2, or AV3 fails on that sample.
  2. If AV123 failures < AV4 failures, and the test is done on about 400 fresh samples, then AV4 presents (statistically significantly) lower protection.
Some thoughts about testing.
  1. Testing few-day-old samples is pretty much useless.
  2. The more 0-day samples in the pool of samples, the better the test reflects protection in the wild.
  3. It is hard to find and test many 0-day samples in one day, so the test can be divided in time into several "partial tests" with a smaller number of samples.
  4. AVs should be tested at approximately the same time (one partial test should be completed within 2 hours).
  5. The testing procedure requires checking/confirming which sample bypassed the protection of AV123 or AV4. This is usually possible when running each sample against a given AV (except samples already detected by a manual scan) on a clean VM image.
  6. If a sample bypasses one of AV1, AV2, or AV3, then testing that sample against the two other top AVs is not necessary (AV123 has already failed).
  7. The condition [ AV123 failures < AV4 failures ] gives similar results to the statistical methods used by AV-Comparatives and AV-Test (a sketch of one way to attach a p-value to this comparison follows below).
  8. Uploading a sample to VirusTotal, cloud sandboxes, etc. is possible, but only after the sample has been tested. Uploaded samples are often shared with AV vendors.
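
As mentioned in point 7 above, the simple comparison can be given a formal significance number. One option (my own sketch with made-up counts, not the method used by the labs) is a sign test on the decisive samples:

```python
from scipy.stats import binomtest

# Decisive samples are those where exactly one party failed:
# wins   = samples where AV123 failed but AV4 passed
# losses = samples where AV4 failed but AV123 passed
wins, losses = 4, 13            # made-up counts from a hypothetical ~400-sample test

# Under "both parties equally good", each decisive sample is a 50/50 coin flip.
result = binomtest(losses, wins + losses, p=0.5, alternative="greater")
print(f"one-sided p-value = {result.pvalue:.4f}")   # small value -> AV4 significantly weaker
```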
 
