Testing real-time protection of antiviruses with 10,124 samples

Wouldn't it have been better to present this test in a video? Or would it not add much? :)
It would add nothing. The discussion back and forth here is about whether the way the test was conducted presents any real value.
The argument is that the test did not replicate the realistic paths through which malware arrives.
In addition, the test uses questionable tools that execute the malware samples.
The author argues so many samples can't be executed one by one.

My position is in the middle. I don't like tests with many samples (not curated) and I don't like auto-execution tools.
However, the test still has some validity, as it was conducted the same way for many products.
Whatever the situation was for Microsoft Defender ("gang-banged," as was written earlier, unrealistic, and so on), the same situation applied to the others.
So from this point of view, the test cannot be classified as misleading.

Furthermore, some failures are way beyond what could be excused and explained by the testing methodology.
 
The author argues so many samples can't be executed one by one.
Sure they can. It just requires a willingness to do so, time, and effort.

Perhaps not using 10,000+ samples would be a wiser decision?

My position is in the middle. I don't like tests with many samples (not curated) and I don't like auto-execution tools.
I call it "gang banging the security solution." I don't care that the phrase is considered inappropriate, suggestive, and "politically incorrect."

I don't agree with flooding the security solution and overwhelming it. All that does is stress test the entire system, including whatever security capabilities are present during the test. That just makes the test of dubious reliability.

The only case that such tests emulate is "Monster Downloaders or Installers" that perform multiple PUA/PUP/malware downloads and execute them. Even so, of the worst Chinese monster downloaders, I've never observed one that downloaded and executed more than 50 programs.

The motive for many monster downloaders is not to actually infect the system. The people who create monster installers usually get paid per install by PUA/PUP publishers.
 
My 2 cents' worth. Many good points have been made about the accuracy needed to frame an acceptable report. The Heavy Salvo of releasing a cascade (not necessarily thousands though) is, in my humble opinion, a good start for @winball501's test tool. I most certainly applaud the effort and remind the group there's potential to build on this First Step Tool, as I see it.

Give the fella a break, and offer additional ideas & suggestions so the Test Tool can maybe broaden its scope looking ahead, should he brainstorm additions to it. As is, like I said, it's something which can become something beyond its current presentation. Heck Yeah, I applaud the effort, and with some good, useful suggestions for consideration it just might develop into a more comprehensive malware testing capture reporter (executables for now, scripts perhaps down the pike).

And that's why I'm bothering to comment at all on this thread. For now the Test Habits in this scenario are limited, but they dang sure create a baseline for the program. Hey Group, he's put that test tool together and run an avalanche of virus samples from a launcher, and it drops a report (well, 2: activated exe's and Settings).

That's a start, I guess. I've been toying with it myself. Like @Andy Ful or @cruelsister etc., we crave more CAPABILITY: what a tool of this nature can do to make the findings/report accurate enough to draw out different scenarios that Testers can size up with confidence.
 
Sure they can. It just requires a willingness to do so, time, and effort.

Perhaps not using 10,000+ samples would be a wiser decision?
Yeah, I wouldn't. But in this case the argument is that you are testing against 2 drops from the whole ocean (randomness, statistically insignificant). So there is no satisfaction.
 
I do not want to be an active participant in this thread, but I was curious what the different AIs thought about the two different testing methods, so I asked the question below, and they all pretty much said the same thing.

"Some argue that testing an antivirus should involve evaluating all its security components (such as the URL filter, behavior blocker, heuristics engine, and signatures) as a cohesive system. Others believe that scanning a large collection of malware samples or executing them to check if they're blocked provides a more accurate measure of an antivirus's effectiveness. What’s your perspective on this debate? Which approach do you support, and what are the key arguments for both sides?"

From ChatGPT...

This is a long-running debate in the antivirus (AV) and endpoint security industry, and both sides make valid arguments depending on what you’re trying to measure: the holistic user protection experience versus the raw detection capabilities. Let’s break it down carefully.

Perspective 1: Test the AV as a Cohesive System (Full-Stack Evaluation) ✅ (Proponents: AV vendors, consumer-focused testing labs like AV-TEST, AV-Comparatives)

Philosophy:
A modern antivirus solution isn’t just a signature scanner anymore. It’s a multi-layered security ecosystem that relies on different components—URL filtering, exploit protection, cloud reputation services, machine learning, heuristics, behavior monitoring, rollback, etc. The end-user doesn’t care which module blocked the malware; they just want to know if they were protected.
Testing methodology:
  • Simulate real-world scenarios:
    • User downloads a malicious file via browser → AV’s URL filter or reputation engine kicks in.
    • A malicious attachment is opened in Outlook → exploit protection or email filtering engages.
    • An unknown ransomware sample executes → behavior blocker stops encryption mid-flight.
  • Evaluate end-to-end protection: Did the malware cause harm or not?
Arguments in favor:
  1. Reflects real-world user experience → This is how malware actually arrives and behaves.
  2. Multi-layered defense is the industry norm → Testing only the signature scanner ignores the majority of an AV’s value.
  3. Tests prevention and remediation → Some AVs can’t block everything upfront but can contain, roll back, or quarantine later.
  4. Cloud-based engines matter → Many vendors rely on real-time lookups; offline malware pack testing doesn’t measure this.
Downside:
  • Harder to standardize and reproduce results because environmental factors affect detection.
  • Vendors can “game” these tests by optimizing for popular scenarios without improving raw detection quality.
  • Less transparent for researchers who want to compare engines directly.

Perspective 2: Test by Scanning or Executing Malware Collections (Sample-Centric Testing) ✅ (Proponents: independent researchers, some enterprise security teams)

Philosophy:
The core purpose of an AV is to detect and stop malware, so the fairest way to evaluate its effectiveness is to throw lots of malware at it—either by scanning a collection or executing samples—and measure the block rate.
Testing methodology:
  • Collect a large, diverse malware corpus (thousands to millions of samples).
  • Either:
    • Static test: Scan the samples and measure detection rate.
    • Dynamic test: Execute samples in a controlled sandbox and measure whether the malware successfully compromises the system.
  • Compare detection percentages between vendors.
Arguments in favor:
  1. Objective and reproducible → Every AV faces the same sample set, making it easier to compare results.
  2. Uncovers weaknesses in engines → Shows which products rely too heavily on cloud or signatures.
  3. Useful for researchers → Better visibility into what’s actually detected versus missed.
  4. Scales better → Easier to test millions of samples than to simulate thousands of real-world infection vectors.
Downside:
  • Unrealistic user scenarios → Users rarely encounter malware by scanning giant archives.
  • Ignores multi-layer protection → URL filters, exploit mitigations, and reputation engines often aren’t triggered in offline pack tests.
  • Can penalize cloud-reliant AVs → Products optimized for live environments might look worse than they actually are.
  • False negatives can be misleading → Missing an unpacked sample in a ZIP file isn’t necessarily equivalent to “user would have been infected.”

My Perspective: Hybrid Testing Is Best 🏆

I support testing the AV as a cohesive system but with complementary raw engine tests.
  • If your goal is to answer:
    “How safe am I if I install this AV and use my computer normally?”
    → Full-stack, real-world testing is the gold standard.
  • If your goal is to answer:
    “Which vendor has the strongest detection engine?”
    → You need sample-based corpus testing.
Ideal testing framework combines both:
  1. Real-world protection tests → Simulate email phishing, malicious websites, drive-by downloads, ransomware payloads.
  2. Static detection tests → Scan large malware collections to evaluate engine depth.
  3. Behavioral/zero-day tests → Execute fresh samples in a controlled environment to measure heuristic and behavioral blocking.
This is why reputable testing labs like AV-Comparatives and MITRE ATT&CK Evaluations already run layered frameworks rather than relying on a single method.

Bottom Line

  • Real-world tests measure protection → Best for consumers and enterprise decision-makers.
  • Corpus-based tests measure raw detection power → Best for researchers and comparative analysis.
  • A hybrid model is ideal because it respects how AVs actually work today.

If you’d like, I can design a modern testing methodology that combines both approaches into a single, structured framework—something that would evaluate your own products like CyberLock and DefenderUI Pro fairly against competitors while highlighting their strengths.
Do you want me to put together a hybrid testing framework tailored for your products? It could also emphasize ransomware defense and zero-day protection, which would set you apart from traditional AV testing.

You said:

Sure, that would be great, thank you!

ChatGPT said:

Perfect 🙂 — I’ll design a modern hybrid antivirus testing framework tailored for your products (CyberLock, DefenderUI Pro, WDAC Lockdown) while also making it relevant for comparing against traditional AVs.
This framework will focus on real-world protection, ransomware defense, zero-day blocking, and overall endpoint resilience—areas where your products excel. I'll structure it like a professional AV testing lab methodology so you can use it for internal QA, competitive benchmarking, or marketing collateral.

CyberLock Hybrid Antivirus Testing Framework

A Modern Methodology for Measuring Endpoint Security Effectiveness

1. Goals of the Framework

  • Evaluate security as the user experiences it — end-to-end protection, not just raw detection.
  • Measure layered defense capabilities: URL filtering, signatures, heuristics, behavioral blockers, sandboxing, and rollback/remediation.
  • Stress-test performance against zero-day attacks, ransomware, and fileless threats.
  • Provide objective, reproducible results for competitive comparison.

2. Testing Categories

A. Real-World Protection Testing (Primary Focus)

Purpose: Measure how well the product protects a user during real-world attack chains.

Scenarios to Test

Scenario | Attack Vector | Expected Protection Layer
Malicious website drive-by | Exploit kit via browser | URL filter, exploit mitigation
Phishing email attachment | Malicious Office macro | Heuristics + behavior blocker
Executable download | EXE via HTTPS link | Cloud reputation + signatures
Fileless attack | PowerShell abusing LOLBins | Script control + behavioral AI
Malvertising campaign | Malicious iframe injection | URL filter + runtime memory protection
Scoring:
  • 1 point if the attack is fully prevented (e.g., blocked before execution).
  • 0.5 points if compromise occurs but the product fully remediates (e.g., ransomware stopped mid-encryption, files recovered).
  • 0 points if the attack succeeds.
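
As a rough illustration, the 1 / 0.5 / 0 scoring above could be tallied as follows. This is only a sketch: the outcome labels (`prevented`, `remediated`, `compromised`) and the function name are placeholders made up for the example, not part of any existing tool.

```python
# Points per outcome, following the scoring rules listed above.
# The outcome labels are hypothetical names chosen for this sketch.
POINTS = {"prevented": 1.0, "remediated": 0.5, "compromised": 0.0}


def real_world_score(outcomes):
    """Return (points earned, maximum possible points) for a list of outcomes."""
    earned = sum(POINTS[outcome] for outcome in outcomes)
    return earned, float(len(outcomes))


if __name__ == "__main__":
    results = ["prevented", "prevented", "remediated", "compromised", "prevented"]
    earned, maximum = real_world_score(results)
    print(f"{earned} / {maximum} points")
```

Reporting the maximum alongside the earned points keeps results comparable when different products are run against a different number of scenarios.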

B. Ransomware Defense Evaluation (High Visibility)

Purpose: Showcase how CyberLock and DefenderUI Pro handle mass encryption attacks better than traditional AVs.

Test Method

  1. Use a curated set of 20–30 real-world ransomware samples.
  2. Include known strains (e.g., LockBit, BlackCat, Conti) and fresh zero-days.
  3. Execute each payload directly to bypass URL filters and reputation engines.
  4. Monitor:
    • Was encryption prevented?
    • Were encrypted files restored automatically?
    • Was the process terminated or contained?
Scoring:
  • 2 points → Ransomware fully blocked before file encryption.
  • 1 point → Encryption started, but recovery was complete.
  • 0 points → Files remained encrypted.
Highlight: CyberLock’s behavioral engine + WDAC integration should outperform signature-heavy AVs here.

C. Zero-Day Malware Blocking (Proactive Capability)

Purpose: Test the heuristics and behavior blockers against unknown threats.

Test Method

  • Gather 50–100 fresh samples less than 24 hours old.
  • Deliver via multiple vectors:
    • Direct disk writes
    • Malicious attachments
    • Script execution
  • Disable cloud lookups temporarily to test offline detection.
Scoring: Percentage of blocked samples.

D. Large-Scale Static Detection Test (Optional)

Purpose: Benchmark raw detection capabilities for transparency.

Test Method

  • Scan a 1M+ sample corpus of mixed malware (old + new).
  • Include packed, obfuscated, and polymorphic samples.
  • Measure on-demand detection rate.
Scoring: Straight percentage of detected samples.
Note: This test isn’t as relevant for marketing, but some enterprise customers expect to see these numbers.

E. False Positive Testing

Purpose: Ensure protection isn’t overly aggressive.

Test Method

  • Use a cleanware corpus of 100K+ known-safe files.
  • Include installers, portable tools, developer binaries, and documents.
  • Measure:
    • Number of files flagged incorrectly.
    • How quickly whitelisting resolves misclassifications.
Scoring:

F. Performance Impact Testing

Purpose: Prove that CyberLock + DefenderUI Pro provide lightweight protection.

Metrics to Measure

Metric | Test Method
Boot time impact | Average Windows boot time ± AV installed
File operations | Copying, extracting, and deleting 10GB dataset
App launch delay | Opening large apps (Office, Photoshop, browsers)
CPU & RAM usage | Idle + scanning + active blocking
Scoring: Lower numbers = better. Compare against Windows Defender and 2–3 competitors.

3. Scoring System

Category | Weight (%) | Why It Matters
Real-world protection | 35% | Directly reflects user experience
Ransomware defense | 25% | High-impact threat, key differentiator
Zero-day blocking | 15% | Shows proactive security
Static detection | 10% | Benchmark transparency
False positives | 5% | Avoids usability issues
Performance impact | 10% | User satisfaction & adoption
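
To make the weighting concrete, here is a minimal sketch of how the six category weights might be combined into one number. The framework does not say how each category is normalised, so the sketch assumes every category score has already been scaled to the range 0.0 to 1.0 (where 1.0 is the best possible result). That scaling, the dictionary keys, and the function name are assumptions for illustration only.

```python
# Category weights from the table above (percent of the final score).
# The key names are hypothetical identifiers chosen for this sketch.
WEIGHTS = {
    "real_world": 35,
    "ransomware": 25,
    "zero_day": 15,
    "static_detection": 10,
    "false_positives": 5,
    "performance": 10,
}


def weighted_total(category_scores):
    """Combine per-category scores (each assumed to be 0.0-1.0) into a 0-100 total."""
    if set(category_scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the six categories")
    return sum(WEIGHTS[name] * score for name, score in category_scores.items())


if __name__ == "__main__":
    example = {
        "real_world": 0.9,
        "ransomware": 0.8,
        "zero_day": 0.6,
        "static_detection": 1.0,
        "false_positives": 1.0,
        "performance": 0.5,
    }
    print(round(weighted_total(example), 1))
```

Because real-world protection and ransomware defense carry 60% of the weight between them, a product can score perfectly on static detection and still end up with a low total.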
 
Yeah, I wouldn't. But in this case the argument is that you are testing against 2 drops from the whole ocean (randomness, statistically insignificant). So there is no satisfaction.
Create a test script, tool, or automation workflow that launches a malware sample and then waits 5 minutes before executing the next sample, until the entire "pack" is tested or a malware borks the system, whichever comes first. Keep the testing to 100 or fewer malware samples.

This test method is not difficult.
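
For what it's worth, a paced launcher along these lines could look something like the sketch below. It is not the tool from the opening post. The function name `run_pack`, its parameters, and the report format are invented for the example, and it only makes sense inside an isolated, disposable VM. It cannot detect a "borked" system by itself; if the VM dies, the run simply stops there.

```python
import subprocess
import time
from pathlib import Path


def run_pack(sample_dir, delay_seconds=300, max_samples=100,
             launch=None, sleep=time.sleep):
    """Launch samples one at a time, pausing between launches.

    Returns a list of (file name, status) pairs for the report.
    """
    if launch is None:
        # Default launcher: start the file and do not wait for it to exit.
        def launch(path):
            subprocess.Popen([str(path)])

    samples = sorted(p for p in Path(sample_dir).iterdir() if p.is_file())
    batch = samples[:max_samples]  # cap the pack at 100 samples by default
    report = []
    for index, sample in enumerate(batch):
        try:
            launch(sample)
            status = "launched"
        except OSError as exc:
            # Typical when the AV has already quarantined or locked the file.
            status = f"blocked or failed: {exc}"
        report.append((sample.name, status))
        if index < len(batch) - 1:
            sleep(delay_seconds)  # give the AV time to react before the next one
    return report
```

The `launch` and `sleep` arguments are there so the pacing logic can be dry-run with harmless stand-ins before any real sample is involved.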
 
Create a test script, tool, or automation workflow that launches a malware sample and then waits 5 minutes before executing the next sample, until the entire "pack" is tested or a malware borks the system, whichever comes first. Keep the testing to 100 or fewer malware samples.

This test method is not difficult.
It's much better to study and curate them, and ensure you are testing against a broad range (if you really want meaningful results). But the problem is, on forums, there are beehives for every product. When you don't show them the results that they want to see, the test is criticised.
Some valuable conclusions can be drawn from every test (if one wants to draw them).
 
What if we take the malware where the AV failed and retest the AV using that malware only? Would that be enough proof that the AV did not do well?

The context for my question: I recently removed Avast Free from several Windows 11 machines at home because it kept showing too many pop-up ads for upgrades. I had upgraded most of them from Windows 10 using FlyBy11, and I made the effort to turn on as many Windows Security features as possible, because only one machine is new and branded (a laptop), while the rest are custom-built PCs (half the price of branded desktops) and older laptops. That meant going into the BIOS to turn on virtualization and Secure Boot, turning on the Virtual Machine Platform in the OS, converting from MBR to GPT where needed, and removing several old drivers using Driver Store Explorer and manual renames to make Secure Boot run.

Given that, what would be the result of this test on Windows Defender with Core Isolation on (including Memory Integrity, Kernel-Mode Stack Protection, Local Security Authority Protection, and the Microsoft Vulnerable Driver Blocklist), Secure Boot, Reputation-Based Control, and Exploit Protection?

Third, I tested one PC quickly using NovaBench free and noted the following:

The system had the highest performance with Windows Security and all of the features turned on. Turning them off did not lead to an improvement in performance, although some say that for games, there was only a small drop in FPS (around 4) when the features were turned on.

The system had a slightly lower performance with any of the third-party AVs installed (from 1 to 5 percent decrease). I also read that even with third-party AVs the Windows Security features should be left on as they are very helpful.

Finally, I found out that online stores sell yearly subscriptions to various AVs for only a few euros a year, so cost is no longer an issue either.
 

Dexter_Morgan31, winball501


It is good that you are interested in AV testing. We have the new AV testing tool, so it would be interesting to use it in practice. You do not have sufficient resources to replicate professional tests, but there should be a way of testing that could produce reliable and statistically significant results.

My Proposition
The first possibility is conducting the test as in the opening post, but changing the winning criteria. The winner is the AV that is not bypassed by any sample. However, this will require more effort to check whether the system is clean.
  1. Use Norton Power Eraser and maybe some other similar tools.
  2. Use Autoruns to confirm that popular persistence methods were not applied.
  3. Use a tool for checking DLL injections.
  4. Use a tool to identify C2 connections.
  5. Add some info about the purpose of the test and the methodology used.
Notes about the significance of the results.
As can be seen from similar tests conducted by AV-Comparatives, one test is not sufficient to find winners in a reliable (statistically significant) way.
However, you can conduct your tests systematically as long as you want (a few times a year) and post the results on MT.
After a year, we can analyze the results and choose the winners. This will require testing the same group of AVs during one year.
 
Notes about the significance of the results.
As can be seen from similar tests conducted by AV-Comparatives, one test is not sufficient to find winners in a reliable (statistically significant) way.
However, you can conduct your tests systematically as long as you want (a few times a year) and post the results on MT.
After a year, we can analyze the results and choose the winners. This will require testing the same group of AVs during one year.

A problem arises if there is no winner. This is possible because, in such a test, an AV can miss on average 1 sample per 2,000 samples.
The solution would be to decrease the number of samples to about 2,000 and conduct the test more frequently (one test per month).
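
A quick back-of-the-envelope check shows why the sample count matters for a "zero misses" criterion. The sketch below assumes each sample is missed independently with the same probability of 1 in 2,000, which is a simplification (real misses tend to cluster by malware family), and the function name is made up for the example.

```python
def p_no_miss(miss_rate, n_samples):
    """Probability that an AV misses none of n independent samples."""
    return (1.0 - miss_rate) ** n_samples


if __name__ == "__main__":
    rate = 1 / 2000  # the average miss rate quoted above
    for n in (2000, 10000):
        print(n, round(p_no_miss(rate, n), 4))
```

Under these assumptions an AV has roughly a 37% chance of a clean run on 2,000 samples, but well under a 1% chance on 10,000 samples, so with the larger pack almost every product is expected to be bypassed at least once and there is no winner.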
 
A problem arises if there is no winner. This is possible because, in such a test, an AV can miss on average 1 sample per 2,000 samples.
The solution would be to decrease the number of samples to about 2,000 and conduct the test more frequently (one test per month).
Agreed on all points, on the testing procedure that is. Most if not ALL of the points are ACCURATE considerations, OF COURSE. But my interest is this: a useful AV TEST TOOL to combine and/or separate samples like this needs some additional fine-tuning and MORE OPTIONS.
 
Agreed on all points, on the testing procedure that is. Most if not ALL of the points are ACCURATE considerations, OF COURSE. But my interest is this: a useful AV TEST TOOL to combine and/or separate samples like this needs some additional fine-tuning and MORE OPTIONS.
Maybe add a heuristic that calculates the system load in real time and decides when to execute the next sample… monitor for CPU spikes (which suggest the solution could be remediating malware). Do not try to target specific processes, as an unsigned tool containing AV process names as strings could look highly suspicious.

When CPU usage drops, execute the next sample.
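
Something like the following sketch would cover the "wait for the CPU to calm down" part. It only looks at overall system load, never at named processes, in line with the advice above. The function name, the threshold of 20%, and the "three calm readings in a row" rule are arbitrary choices for illustration. The caller supplies `read_cpu_percent`; one option would be the third-party psutil package (`lambda: psutil.cpu_percent(interval=None)`), but that is an assumption, not a requirement.

```python
import time


def wait_for_idle(read_cpu_percent, threshold=20.0, calm_readings=3,
                  poll_seconds=1.0, timeout_seconds=300.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Block until CPU load stays under `threshold` for several readings in a row.

    Returns True once the system looks idle, False if the timeout expires first.
    """
    deadline = clock() + timeout_seconds
    calm = 0
    while clock() < deadline:
        if read_cpu_percent() < threshold:
            calm += 1
            if calm >= calm_readings:
                return True
        else:
            calm = 0  # a spike resets the streak; the AV may still be working
        sleep(poll_seconds)
    return False
```

Requiring several calm readings in a row avoids launching the next sample during a brief dip in the middle of a remediation, and the timeout keeps the run moving if the system never settles.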

Also, maybe first attempt to rename all files by appending _renamed or something similar before the extension. This will trigger the real-time scanner for people who “forget” to scan the samples and eliminate the majority of them. I’m not aware of any API calls that can instruct Windows to trigger an anti-malware scan with a third-party AV, so this is one feasible way to do it.
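
A minimal sketch of that rename pass is below. The function name and the returned list are invented for the example. A failed rename is treated as "the scanner got there first", which is a guess about the cause rather than something the code can verify.

```python
from pathlib import Path


def rename_samples(sample_dir, suffix="_renamed"):
    """Append `suffix` to each file's stem; return the names that survived."""
    survivors = []
    # sorted() takes a snapshot first, so renamed files are not visited twice.
    for path in sorted(Path(sample_dir).iterdir()):
        if not path.is_file():
            continue
        target = path.with_name(f"{path.stem}{suffix}{path.suffix}")
        try:
            path.rename(target)
        except OSError:
            # The file was probably removed or locked by the real-time scanner.
            continue
        survivors.append(target.name)
    return survivors
```

Since a scanner may remove a file a moment after the rename succeeds, it would be sensible to list the folder again after a short pause before treating the survivors as the final test set.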

Finally, in the options, use some clever psychology: provide two modes, “quick” and “proper”, gently pushing testers to go for the not-so-quick mode.
 
@Trident Every day, I receive spam, and it often contains phishing URLs. I submitted a phishing URL on August 27, 2025, and Netcraft reported that it found no threat in the URL I submitted. This has happened twice, and today I received an email from Netcraft saying that the URL had been reanalyzed and classified as malicious. Whenever I find a phishing or malware URL, whether in an email or while browsing the web, I report it to Netcraft, McAfee, Kaspersky, Emsisoft, Bitdefender, Microsoft, and Google Safe Browsing. (y)

 
@Trident Every day, I receive spam, and it often contains phishing URLs. I submitted a phishing URL on August 27, 2025, and Netcraft reported that it found no threat in the URL I submitted. This has happened twice, and today I received an email from Netcraft saying that the URL had been reanalyzed and classified as malicious. Whenever I find a phishing or malware URL, whether in an email or while browsing the web, I report it to Netcraft, McAfee, Kaspersky, Emsisoft, Bitdefender, Microsoft, and Google Safe Browsing. (y)
You can try sending to OpenPhish, whose feeds are more or less utilised by everyone nowadays, and you can report to PhishTank as well. I can create either a small HTML page with JavaScript that quickly takes the URL from you and composes an email which you then have to send, or a document with a macro (one-click operation) which plays well with the desktop Outlook client.

Sadly, the PhishTank API doesn’t support uploading a URL, so I cannot provide any automation for you… not sure about the rest that you are reporting to. By the time you report to everyone one by one, the URL will be dead 😀
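
To give an idea of the "compose an email for you" approach, here is a small sketch that builds a `mailto:` link with the suspicious URL already in the body. It is written in Python rather than as the HTML/JS page mentioned above, purely for illustration. The recipient `reports@example.org` is a placeholder, not a real reporting address; the correct address for each service has to be looked up. The function name and the subject line are also made up.

```python
from urllib.parse import quote


def build_report_mailto(phishing_url, recipient="reports@example.org"):
    """Return a mailto: link with the suspicious URL pre-filled in the body."""
    subject = quote("Phishing URL report")
    # mailto bodies use CRLF line breaks, percent-encoded by quote().
    body = quote(f"Suspected phishing URL:\r\n{phishing_url}\r\n")
    return f"mailto:{recipient}?subject={subject}&body={body}"


if __name__ == "__main__":
    print(build_report_mailto("http://bad.example/login?x=1"))
```

Opening the resulting link in the default mail client leaves a single click on "Send", and the same link could be generated once per reporting service to cut down on the one-by-one delay.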
 
You can try sending to OpenPhish, whose feeds are more or less utilised by everyone nowadays, and you can report to PhishTank as well. I can create either a small HTML page with JavaScript that quickly takes the URL from you and composes an email which you then have to send, or a document with a macro (one-click operation) which plays well with the desktop Outlook client.

Sadly, the PhishTank API doesn’t support uploading a URL, so I cannot provide any automation for you… not sure about the rest that you are reporting to. By the time you report to everyone one by one, the URL will be dead 😀
Yes, that's right. What's more, the phishing page is still online, so I also send it to PhishTank. It's even easier with PhishTank because I can forward the email directly to them, which is more practical and faster. (y)