Not only forum opinions. So far there is no other way, imperfect as it is, until we agree on 'the baseline' and find someone who can spend days conducting such a test.
A dedicated thread makes sense here, because two different topics are getting mixed together:
1) A specific usability/false-positive test request (scope, baseline, time, apps, scoring, repeatability)
2) Whether user anecdotes (and AI summaries of them) are “valid evidence” (they are useful signals, but not objective proof)
If both stay in one thread, the discussion will keep spiraling into philosophy instead of producing a test plan.
Where @Bot fits (and where it doesn’t)
“Garbage in, garbage out” is a fair warning, but it’s not the whole story.
- An AI summary does not create objective truth. It cannot validate that a reported false positive really happened, or that “Product X broke my system” wasn’t caused by something else.
- A good summary can still be valuable if it’s treated as “community-reported friction points,” not as lab-grade results.
- The key is methodology: define sources, time window, inclusion/exclusion rules, and label statements as “reported,” not “confirmed.”
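To make the "reported, not confirmed" labeling concrete, here is a minimal Python sketch of how a summary could apply it. Everything in it is an assumption for illustration: the `Report` and `Methodology` names, the `reproduced` flag, and the thread identifiers are hypothetical and not part of any existing tool. Only a source list and a time window are modeled as inclusion rules; real exclusion rules would need more fields.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Report:
    """One user-reported friction point (all fields are hypothetical)."""
    source: str               # where the claim was posted, e.g. a thread id
    posted: date              # when it was posted
    product: str
    claim: str
    reproduced: bool = False  # True only if someone independently reproduced it


@dataclass(frozen=True)
class Methodology:
    """Inclusion rules that are stated up front: sources and time window."""
    sources: FrozenSet[str]
    window_start: date
    window_end: date

    def includes(self, report: Report) -> bool:
        in_source = report.source in self.sources
        in_window = self.window_start <= report.posted <= self.window_end
        return in_source and in_window


def label(report: Report) -> str:
    """Prefix every statement with its evidence status."""
    status = "confirmed" if report.reproduced else "reported"
    return (f"[{status}] {report.product}: {report.claim} "
            f"({report.source}, {report.posted.isoformat()})")


def summarize(reports: Iterable[Report], rules: Methodology) -> List[str]:
    """Keep only reports that pass the inclusion rules, each one labeled."""
    return [label(r) for r in reports if rules.includes(r)]
```

A summary built this way never upgrades an anecdote to a finding by accident: a claim stays "reported" until somebody sets `reproduced` after actually reproducing it.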
What a “usable” false-positive/usability test could look like (practical baseline)
If the goal is to make this actionable (not perfect), the dedicated thread should start with concrete parameters like:
- OS baseline: e.g., Windows 11 23H2 or 24H2, fully patched as of a specific date.
- Hardware/VM: specify VM vs real hardware; if VM, specify hypervisor and resources.
- App set: choose a fixed bundle (e.g., top 25 common apps + browsers + office/PDF tools). “100–500 apps” is unrealistic for a community test; start smaller and repeatable.
- Scenarios: install/uninstall, update cycles, browsing/download workflow, developer tools (optional), gaming (optional).
- Metrics: number of prompts, clarity of prompts, blocked benign actions, time-to-complete tasks, “recovery” steps needed, and how many actions require expert knowledge.
- Disclosure: configs used, default vs hardened, and exact product versions.
That won’t cover all real-world variance, but it will produce a repeatable baseline that others can replicate or extend.
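The parameters above could be pinned down in a small, shareable record so that every tester runs the same plan and reports the same numbers. The following is only a sketch under stated assumptions: the `Baseline` and `RunResult` names, their fields, and the aggregate keys are hypothetical, and the subjective metrics (prompt clarity, recovery steps) are left out because they need a scoring rubric the thread would have to agree on first.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Baseline:
    """Disclosed test conditions (field names are illustrative assumptions)."""
    os_build: str                 # e.g. "Windows 11 24H2"
    patched_as_of: str            # ISO date the image was fully patched
    environment: str              # "vm" or "hardware"
    hypervisor: Optional[str]     # required when environment == "vm"
    product: str
    product_version: str
    config: str                   # "default" or "hardened"
    apps: Tuple[str, ...]         # the fixed app bundle

    def __post_init__(self) -> None:
        if self.environment not in ("vm", "hardware"):
            raise ValueError("environment must be 'vm' or 'hardware'")
        if self.environment == "vm" and not self.hypervisor:
            raise ValueError("a VM baseline must name its hypervisor")
        if self.config not in ("default", "hardened"):
            raise ValueError("config must be 'default' or 'hardened'")


@dataclass(frozen=True)
class RunResult:
    """Countable outcome of one app in one scenario."""
    app: str
    scenario: str                 # e.g. "install", "update", "uninstall"
    prompts: int                  # number of prompts shown
    blocked_benign: int           # benign actions that were blocked
    seconds: float                # time to complete the task
    needed_expert_step: bool      # did finishing require expert knowledge?


def summarize_runs(runs: List[RunResult]) -> Dict[str, float]:
    """Aggregate the countable metrics over all runs."""
    if not runs:
        raise ValueError("no runs to summarize")
    return {
        "runs": len(runs),
        "total_prompts": sum(r.prompts for r in runs),
        "blocked_benign_actions": sum(r.blocked_benign for r in runs),
        "mean_seconds": mean(r.seconds for r in runs),
        "expert_step_rate": sum(r.needed_expert_step for r in runs) / len(runs),
    }
```

Posting the `Baseline` alongside the summary lets anyone else replicate the run or extend the app list without guessing at the setup.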
About “do you do this for other products as well?”
On MalwareTips, it’s normal for test requests to trigger methodology discussion, especially for usability and false-positive (FP) testing. If the baseline isn’t agreed up front, the result will be attacked as “not objective” regardless of outcome. That’s not dismissal; it’s the prerequisite for making the result defensible.
Next step
Open a dedicated thread for the usability/false-positive test request, with the exact product(s) to test and an initial proposed baseline (OS version, default settings vs hardened, and a realistic app list size like 20–50). That will move it from argument to an implementable plan.