Reply to thread

Message: <blockquote data-quote="Andy Ful" data-source="post: 1123461" data-attributes="member: 32260">The OP has been updated.

The probability of finding x = 0, 1, 2, 3, ... undetected malware samples was calculated in the OP:

p( x ) = B( m - k , n - x ) * B( k , x ) / B( m , n )

where B( p , q ) is the binomial coefficient.

<img src="https://malwaretips.com/attachments/1744978145533-png.288143/" alt="1744978145533.png" class="fr-fic fr-dii fr-draggable " data-size="" style="width: 264px" />

I noticed (through numerical experiments) that for a sufficiently large number of samples in the wild ( m >> k , n ) and a small number of missed samples ( x << n ), the function p(x) depends only on the infection rate ( r = k/m ) and the number of tested samples ( n ). In that case we can use the approximate (binomial) formula:

p( x ) ~ B( n , x ) * r ^ x * (1 - r ) ^ ( n - x )

[ATTACH=full]288198[/ATTACH]

So, increasing the number of in-the-wild samples does not significantly change the probabilities, as long as the infection rate k/m stays the same and m is big enough.

We do not know exactly how the AV labs choose the malware samples. Most probably, they pick the test samples from large feeds (over 300,000 suspicious and malicious threats per day) and then remove some morphed samples of the same malware. If so, the approximate formula for p(x) is very accurate.

An example of a malware feed:
[URL unfurl="true"]https://www.mrg-effitas.com/services/threat-feeds-malware/[/URL]

If we know the average infection rate of the top AVs, the formula for p(x) can be used to decide whether a particular AV can be awarded in a test (as a top AV) or not. For example, the threshold t for missed samples can be calculated from:

p( t ) < 0.05

This means that missing t samples disqualifies that AV's result from the top award, because the chance of a top AV producing such a result by pure accident is less than 5%.

Edit: Corrected a typo in the formula for p(x).</blockquote>
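The two formulas in the quoted post can be checked numerically with a short Python sketch. It is only an illustration, not the code behind the quoted charts. It assumes m is the number of samples in the wild, k the number of those the AV misses, n the number of tested samples, and x the number of misses among the tested ones. The function names and the example numbers (m = 300000, k = 300, n = 500) are made up for the example, and the threshold search starts at the most likely number of misses, which is my own reading of the p(t) < 0.05 rule.

```python
from math import comb


def p_exact(x, m, k, n):
    """Exact formula from the quote: B(m-k, n-x) * B(k, x) / B(m, n)."""
    return comb(m - k, n - x) * comb(k, x) / comb(m, n)


def p_approx(x, n, r):
    """Approximate formula from the quote: B(n, x) * r^x * (1-r)^(n-x)."""
    return comb(n, x) * r**x * (1 - r) ** (n - x)


def missed_threshold(n, r, alpha=0.05):
    """Smallest t, at or above the most likely miss count, with p(t) < alpha.

    Assumption: the search starts at the mode of the binomial so that an
    unlikely *low* miss count is never reported as the threshold.
    Returns n + 1 if no such t exists.
    """
    t = int((n + 1) * r)  # mode of the binomial distribution
    while t <= n and p_approx(t, n, r) >= alpha:
        t += 1
    return t


if __name__ == "__main__":
    m, k, n = 300_000, 300, 500  # hypothetical numbers, r = k/m = 0.001
    r = k / m
    for x in range(4):
        print(x, p_exact(x, m, k, n), p_approx(x, n, r))
    print("threshold t:", missed_threshold(n, r))
```

With m much larger than k and n, the two columns printed by the loop should agree closely, which is the observation made in the quote.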
