Malware Detection Issues, Challenges, and Future Directions: A Survey

Andy Ful

This article is a comprehensive state-of-the-art review of malware detection model research.

Malware Detection Issues, Challenges, and Future Directions: A Survey


Abstract:

The evolution of recent malicious software with the rising use of digital services has increased the probability of corrupting data, stealing information, or other cybercrimes by malware attacks. Therefore, malicious software must be detected before it impacts a large number of computers. Recently, many malware detection solutions have been proposed by researchers. However, many challenges limit these solutions to effectively detecting several types of malware, especially zero-day attacks due to obfuscation and evasion techniques, as well as the diversity of malicious behavior caused by the rapid rate of new malware and malware variants being produced every day. Several review papers have explored the issues and challenges of malware detection from various viewpoints. However, there is a lack of a deep review article that associates each analysis and detection approach with the data type. Such an association is imperative for the research community as it helps to determine the suitable mitigation approach. In addition, the current survey articles stopped at a generic detection approach taxonomy. Moreover, some review papers presented the feature extraction methods as static, dynamic, and hybrid based on the utilized analysis approach and neglected the feature representation methods taxonomy, which is considered essential in developing the malware detection model. This survey bridges the gap by providing a comprehensive state-of-the-art review of malware detection model research. This survey introduces a feature representation taxonomy in addition to the deeper taxonomy of malware analysis and detection approaches and links each approach with the most commonly used data types. The feature extraction method is introduced according to the techniques used instead of the analysis approach. The survey ends with a discussion of the challenges and future research directions.
[Attached screenshots from the article]



Full article:
 

struppigel

The malware analysis and detection taxonomy does not make sense to me. E.g., it is implying that automatic signatures are only created for heuristic signatures. And that behaviour and patterns cannot be heuristic.
The distinction into behaviour/signature/heuristic as separate types also does not make sense. They try to partially fix the contradiction by adding behaviour-based to signature-based detection as a secondary type.
The whole graph is messy too; a lot of the final nodes are duplicated. "API calls" is there 6 times.
Why is Op Code noted as a type of behaviour?
 

Andy Ful

Your post shows the difference between theory and practice. :)

The malware analysis and detection taxonomy does not make sense to me. E.g., it is implying that automatic signatures are only created for heuristic signatures. And that behaviour and patterns cannot be heuristic.
I am not sure if the authors use the term "heuristic" in the same way as you do. They use it only in the context of detection models built by researchers, and their meaning is probably more specific:

"A heuristic-based approach has been used in various research by generating generic rules that investigate the extracted data, which are given through dynamic or static analysis to support the proposed model of detecting malicious intent. The generated rules can be developed automatically using machine learning techniques, the YARA tool, and other tools or manually based on the experience and knowledge of expert analysts."

According to this definition, heuristic-based detection is based on static and dynamic data. So, it contains "heuristic signatures", "heuristic behaviors", and "heuristic patterns".
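To make this concrete, here is a toy Python sketch (my own illustration, not from the article) of such a generated heuristic rule: it scores statically extracted imports against hand-picked weights and flags the sample above a threshold. All API names, weights, and the threshold are hypothetical.

# Toy heuristic rule (hypothetical names and weights). In practice such rules
# may be generated by ML, written as YARA rules, or crafted by expert analysts.
suspicious_imports = {
    "VirtualAllocEx": 2,      # remote memory allocation
    "WriteProcessMemory": 3,  # typical process-injection step
    "CreateRemoteThread": 3,  # typical process-injection step
    "IsDebuggerPresent": 1,   # anti-analysis hint
}

def heuristic_score(imports):
    # Sum the weights of suspicious imports found by static analysis.
    return sum(suspicious_imports.get(name, 0) for name in imports)

def is_suspicious(imports, threshold=5):
    # Flag the sample when the combined score reaches the threshold.
    return heuristic_score(imports) >= threshold

# Hypothetical imports recovered from a sample's import table:
sample = ["CreateRemoteThread", "WriteProcessMemory", "LoadLibraryA"]
print(is_suspicious(sample))  # True (score 6 >= 5)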

The distinction into behaviour/signature/heuristic as separate types also does not make sense.
Yes. But this is a review of many articles that did just that. Researchers usually have to simplify reality to compare their results with others'. It is similar in other disciplines: in medicine, for example, a surgeon sees the human body very differently than a psychiatrist does.
Anyway, the authors agree with you.

The whole graph is messy too; a lot of the final nodes are duplicated. "API calls" is there 6 times.
It must be so because API calls can appear independently at different stages (analysis, detection, or feature extraction). For example, one set of API calls can be important when analyzing unknown malware, while a different set is used for detecting it. The same is true for signature-based and behavior-based detection. In many cases, malware is detected first by behavior (the API calls it uses) when no signature exists yet, and some of those API calls can later be used to create the signature.
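As a rough illustration (my own toy sketch with hypothetical API names, not from the article), the same recorded API-call trace can feed a behavior rule and also be condensed into a signature:

import hashlib

# Hypothetical API-call trace recorded during dynamic analysis.
trace = ["NtCreateFile", "NtWriteFile", "CryptEncrypt", "DeleteFileW", "CryptEncrypt"]

def looks_like_ransomware(calls):
    # Behavior-based view: a rule over the observed calls (encrypt-then-delete pattern).
    return "CryptEncrypt" in calls and "DeleteFileW" in calls

def trace_signature(calls):
    # Signature-based view: the same data condensed into a fingerprint that can be
    # matched later without re-running the sample.
    return hashlib.sha256("|".join(sorted(set(calls))).encode()).hexdigest()

print(looks_like_ransomware(trace))  # True
print(trace_signature(trace)[:16])   # short fingerprint derived from the same calls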

Why is Op Code noted as a type of behaviour?

This opcode is taken from memory while the malware is running.
 

Andy Ful

@struppigel,

In my opinion, this article is only a proposition based on many kinds of research. I posted it here to show readers how complex the malware analysis/detection world is and how people try to grasp the problem. (y)
AV vendors often use such research to improve their products, especially the research related to AI and Machine Learning. Some methods are used several years after the research, both by good and bad guys.
 

struppigel

@struppigel,
AV vendors often use such research to improve their products, especially the research related to AI and Machine Learning.
Not really, because most of the research has become so detached from reality that it lacks practicability. Especially in AI research I mostly see things that are not usable, suffer from the base rate fallacy when evaluating their results, do not take a reasonable false positive rate into account, have unacceptable performance, and try to be an all-in-one solution, which never works.
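To illustrate the base rate fallacy with made-up but plausible numbers (my own toy calculation, not taken from any paper):

# Base rate fallacy with made-up numbers: even a detector with a 99% detection
# rate and a 1% false positive rate produces mostly false alarms when malware
# is rare among the scanned files.
base_rate = 0.001           # 0.1% of scanned files are actually malicious
detection_rate = 0.99       # true positive rate
false_positive_rate = 0.01  # alarms on clean files

true_alerts = base_rate * detection_rate
false_alerts = (1 - base_rate) * false_positive_rate
precision = true_alerts / (true_alerts + false_alerts)
print(f"Share of alerts that are real malware: {precision:.1%}")  # about 9%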

Most of the time AV vendors invent their own techniques but then sadly have to keep quiet about how they work so as not to benefit criminals or competitors (I am a huge fan of sharing knowledge, but it does not work for participants in capitalism).

According to this definition, heuristic-based detection is based on static and dynamic data. So, it contains "heuristic signatures", "heuristic behaviors", and "heuristic patterns".
Your explanation is a great summary of why the distinction into heuristic/behaviour/signature does not work.
It would be the same as dividing people into these categories:
  • naked people
  • people over 40
  • people that are medical doctors
And then you encounter a 45-year-old naked medical doctor and suddenly realize: oh wow, I need subcategories for medical doctors who are naked and over 40.
And I need subcategories for the naked people as well, who are medical doctors and over 40.
And I need subcategories for the over-40-year-olds, who are medical doctors and naked.

And, oh my, here I have a person with clothing who is only 10 and not a medical doctor, what a surprise. Let us just add this as a subcategory of medical doctors, even though it does not fit entirely, because why not. So here is the medical doctors subcategory "not actually a medical doctor, but clothed and below 40 years old".

I totally understand how the authors came to create this graph, since they summarize a lot of papers in one. But it does not make sense nor help anyone to mix up the terms of other papers this way except for showing that this makes stuff very complicated. I bet each of these papers has a different understanding of what the terms mean, so mixing them does not work. Plus, many just parrot the nonsensical stuff you can read everywhere else, which for some reason has never gotten a sanity check (by which I mean the nonsensical distinction of heuristic/signature/behaviour).

@struppigel,

In my opinion, this article is only a proposition based on many kinds of research. I posted it here to show readers how complex the malware analysis/detection world is and how people try to grasp the problem. (y)
Yes, it works well for your intentions, and it was not criticism of you for posting this. On the contrary, I kinda enjoyed reading it.

I still wanted to point out, for anyone who is actually trying to understand this, that it needs to be taken with a grain of salt.

opcode is taken from memory while the malware is running.
It is not behaviour then. The data is obtained dynamically and the scan is done after execution, but the signature is still based on a pattern and not related to any behaviour of the sample.
 

upnorth

Most of the time AV vendors invent their own techniques but then sadly have to keep quiet about how they work so as not to benefit criminals or competitors (I am a huge fan of sharing knowledge, but it does not work for participants in capitalism).
Small side note on this part, because I think it's so perfectly said. We have even had some members on this forum become very frustrated when a vendor's whitepaper would not describe the technical parts well enough in their personal view, but they also forgot or ignored the legal disclaimers. I would guess the latter. Here's one great example from back in the day when IBM made a huge blunder, gave out too much information, and another company snatched it right out from under their nose:
 

Andy Ful

Not really, because most of the research has become so detached from reality that it lacks practicability. Especially in AI research I mostly see things that are not usable, suffer from the base rate fallacy when evaluating their results, do not take a reasonable false positive rate into account, have unacceptable performance, and try to be an all-in-one solution, which never works.
That is right, practice and research often run in parallel. But scientific research does not work the way you think. The very important thing is that it uses unified language and formalism so people can compare, repeat, and confirm the results (AV vendors avoid such things). Exploring the wrong paths is as important as the rare successes. AV vendors hire people with experience in Machine Learning, Big Data, AI, etc. Most of them read similar articles to keep track of new research. They are like engineers and inventors who are able to put into practice the research done in physics, chemistry, electronics, etc.
There are many examples related to Machine Learning and AI. The theoretical models adopted by AV vendors were discovered and researched many years ago. Most techniques were first used in medicine, speech recognition, computer vision, computer gaming, data mining, etc. People who work for AV vendors often use the standard, general-purpose Machine Learning tools (usually written in Python).
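As a purely illustrative sketch of the kind of off-the-shelf Python/ML tooling I mean (random stand-in features and labels, not a real detection model):

# Purely illustrative: random stand-in "features" and labels, classified with an
# off-the-shelf library (scikit-learn). Real feature vectors would come from
# static/dynamic analysis, not from a random generator.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((1000, 20))        # 1000 samples, 20 extracted features (stand-in)
y = rng.integers(0, 2, 1000)      # 0 = benign, 1 = malicious (stand-in labels)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(f"Accuracy on held-out stand-in data: {model.score(X_test, y_test):.2f}")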

Most of the time AV vendors invent their own techniques but then sadly have to keep quiet about how they work so as not to benefit criminals or competitors (I am a huge fan of sharing knowledge, but it does not work for participants in capitalism).
Yes, keeping quiet benefits the AV vendors, but not always the customers. Anyway, the same is true for scientific research.

Your explanation is a great summary of why the distinction into heuristic/behaviour/signature does not work.
That was my intention and the intention of the authors of this article. But it would be hard to avoid these terms, because most people use them (heuristic/behaviour/signature). They are very popular among researchers, so using something else would make the article even harder to understand. So the authors only show that heuristic/behaviour/signature, and probably some other terms, should be treated together as a unified approach.

...
I totally understand how the authors came to create this graph, since they summarize a lot of papers in one. But it does not make sense nor help anyone to mix up the terms of other papers this way except for showing that this makes stuff very complicated.

Let's leave the final decision to the researchers. They like reviews for some reason and most reviews must mix up the terms and look complicated. :)

It is not behaviour then. The data is obtained dynamically and the scan is done after execution, but the signature is still based on a pattern and not related to any behaviour of the sample.
This can depend on the definitions of signature and behavior. These terms are not precisely defined. Many people think of a signature as something related to static detection. But in many cases, the code of the malware file on disk is different from the final code executed in memory. In such cases, the code in memory can depend on the behavior. The malware can initially load its code into memory and then modify parts of it while running. The detection can then include both the behavior and the opcode created by that behavior. Of course, in many cases, these things can also be separated.
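As a toy sketch of what I mean (my own illustration with hypothetical event names and a made-up opcode pattern), a rule could require both the self-modification behavior and the resulting in-memory opcode:

# Toy sketch (hypothetical event names, made-up opcode pattern): the rule fires
# only when self-modifying behavior was observed AND the resulting in-memory
# code contains the opcode pattern, i.e. behavior and opcode used together.
OPCODE_PATTERN = bytes.fromhex("31c05068")  # made-up pattern, illustration only

def combined_detection(behavior_events, memory_dump):
    modified_own_code = any(
        e["api"] == "VirtualProtect" and e["protection"] == "RWX"
        for e in behavior_events
    )
    return modified_own_code and OPCODE_PATTERN in memory_dump

events = [{"api": "VirtualProtect", "protection": "RWX"}]
dump = b"\x90" * 32 + OPCODE_PATTERN + b"\x90" * 32
print(combined_detection(events, dump))  # True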
 

struppigel

That is right, practice and research often run in parallel. But scientific research does not work the way you think. The very important thing is that it uses unified language and formalism so people can compare, repeat, and confirm the results (AV vendors avoid such things). Exploring the wrong paths is as important as the rare successes. AV vendors hire people with experience in Machine Learning, Big Data, AI, etc. Most of them read similar articles to keep track of new research. They are like engineers and inventors who are able to put into practice the research done in physics, chemistry, electronics, etc.
There are many examples related to Machine Learning and AI. The theoretical models adopted by AV vendors were discovered and researched many years ago. Most techniques were first used in medicine, speech recognition, computer vision, computer gaming, data mining, etc. People who work for AV vendors often use the standard, general-purpose Machine Learning tools (usually written in Python).
Ah, I was only talking about research tailored towards malware detection, because that is the research that makes use of malware detection taxonomies.
I agree with you that advances in machine learning in general help us.

Regarding: "it uses unified language and formalism so people can compare, repeat, and confirm the results"
They should do that, but in IT security terminology is a mess and every paper uses different definitions, especially for the terms we are talking about here.

That was my intention and the intention of the authors of this article. But it would be hard to avoid these terms, because most people use them (heuristic/behaviour/signature). They are very popular among researchers, so using something else would make the article even harder to understand. So the authors only show that heuristic/behaviour/signature, and probably some other terms, should be treated together as a unified approach.
I was not suggesting avoiding those terms. They just do not relate to each other in the way this taxonomy wants to make us believe: they are not types or boxes you can put detection mechanisms into. Instead, they are characteristics that detection mechanisms can have.

This can depend on the definitions of signature and behavior. These terms are not precisely defined. Many people think of a signature as something related to static detection. But in many cases, the code of the malware file on disk is different from the final code executed in memory. In such cases, the code in memory can depend on the behavior. The malware can initially load its code into memory and then modify parts of it while running. The detection can then include both the behavior and the opcode created by that behavior. Of course, in many cases, these things can also be separated.
Didn't you just say there is unified language?

Opcode is not behavior, even if it is created by behavior. If that were the logic, then all files would be behavior too, because they are created by file-creation behavior. Where does it end?

I think we can agree that a behavior-based detection mechanism must be related to behavior in some way. However, that is not the case with opcode in memory, because it can simply be there without ever being executed. To detect the opcode in memory, it is completely irrelevant what the malware did prior to putting it there. In most cases, it will be the Windows loader that put it in memory and not the malware.

Dynamic extraction of data does not mean behavior either. The word for that is already there: dynamic.
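For illustration (my own toy sketch with a made-up opcode pattern): a memory scan for an opcode pattern needs only the bytes themselves, nothing about how they got there.

# A memory scan for an opcode pattern needs only the bytes; nothing about how
# they got there (loader, malware, another scanner) plays any role.
OPCODE_PATTERN = bytes.fromhex("31c05068")  # made-up pattern, illustration only

def scan_memory(dump: bytes) -> bool:
    return OPCODE_PATTERN in dump

print(scan_memory(b"\x90" * 16 + OPCODE_PATTERN))  # True, regardless of origin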
 

Andy Ful

There are some things that AI/ML can do better than human experts; one example is chess or Go. But this is probably not true in malware detection, where human experts are still needed. Furthermore, the human brain works differently from AI/ML, so the approach taken by ML/AI can look strange to a malware analyst. A chess grandmaster feels the same when reading articles about the AI/ML approach in chess engines.
 

Andy Ful

To detect the opcode in memory, it is completely irrelevant what the malware did prior to putting it there. In most cases, it will be the Windows loader that put it in memory and not the malware.
In many cases, important opcode already loaded into memory can be modified after the malware starts executing. So, it is relevant what the malware did before the opcode modification; in such a case, the detection is rather behavior-based. If AVs could fully emulate malware execution in all cases (before executing it in the real system), then I would agree with you.
 

struppigel

It is not relevant what the malware did before, because just from the presence of a certain opcode we cannot conclude anything about the behavior that happened prior. Anything can put any opcode into memory. E.g., it might be there because another security product is scanning for said opcode patterns.
I think I need to remove myself from this discussion now. I do not like the feeling that things go in circles, if you know what I mean.
 

Andy Ful

It is not relevant what the malware did before, because just from the presence of a certain opcode we cannot conclude anything about the behavior that happened prior.
Behavior-based data does not have to contain such information. It can simply be a collection of suspicious behaviors labeled with timestamps. Being able to conclude something about the behavior that happened before (or after) can be useful for a human analyst, but it is not always necessary when an ML model is used.
Anyway, the discussion is becoming too technical and rather semantic. Your understanding of behavior-based data is slightly different from the one used in the article. I am not the right person to settle this problem, and it is not really important in this thread. I can accept any definition if one were commonly accepted. :)
 

struppigel

From what I understand, you see the opcode detected in memory as part of a behavior-based signature that also includes API calls etc. which might have been recorded, whereas I have seen opcode scanning used in isolation from other data. So in a larger context, such a signature is behavior- and pattern-based at the same time.

Yes, I think we can agree that proper definitions are pretty important before using any of these terms in papers or in any discussion, even if they are that widely used.
 
