Coding Styles Survive Binary Compilation, Lead Investigators Back to Programmers

Exterminator

Algorithm may be used for malware code attribution
Researchers from three universities and the US Army Research Laboratory have created a machine learning algorithm that can accurately identify which programmer wrote a piece of code, even after the code has been compiled into an executable binary.

Previously, the same researchers put together a similar algorithm that identified programmers from the coding style of their source code (code stylometry).

The new research continues that work and extends the algorithm to cases where the source code is not available and only the compiled executable binary can be examined.

De-anonymizing programmers may halt the creation of controversial software.
By providing a proof-of-concept in their paper, the researchers are sounding the alarm on situations where programmers may not want to associate their name with controversial software.

The researchers trained the algorithm on source code samples (compiled into binaries) from 600 programmers who participated in the Google Code Jam competition.

All the programmers had to implement the same functionality, but each did it in their own way and in a coding style unique to them. As a result, the algorithm learned to distinguish different coding styles after decompiling the executable binaries (a process that, contrary to what many think, does not reproduce the original source code perfectly).
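To make the idea of learning a style from samples concrete, here is a minimal sketch. It is not the researchers' pipeline: it swaps in a much simpler technique (a nearest-centroid classifier over instruction-bigram frequencies, compared by cosine similarity), and the author names and instruction sequences are made up for illustration.

```python
from collections import Counter, defaultdict
import math

def ngram_profile(tokens, n=2):
    """Relative frequencies of the token n-grams in one sample."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values()) or 1
    return {gram: count / total for gram, count in counts.items()}

def cosine(a, b):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(value * b.get(gram, 0.0) for gram, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def train(samples_by_author, n=2):
    """Average each author's sample profiles into a single centroid."""
    centroids = {}
    for author, samples in samples_by_author.items():
        summed = defaultdict(float)
        for tokens in samples:
            for gram, freq in ngram_profile(tokens, n).items():
                summed[gram] += freq
        centroids[author] = {g: v / len(samples) for g, v in summed.items()}
    return centroids

def predict(centroids, tokens, n=2):
    """Return the author whose centroid is most similar to the sample."""
    profile = ngram_profile(tokens, n)
    return max(centroids, key=lambda author: cosine(profile, centroids[author]))

# Hypothetical training data: instruction mnemonics from two invented authors.
training = {
    "alice": [
        ["push", "mov", "sub", "mov", "call", "leave", "ret"],
        ["push", "mov", "sub", "call", "mov", "leave", "ret"],
    ],
    "bob": [
        ["xor", "lea", "cmp", "jne", "lea", "cmp", "jne", "ret"],
        ["xor", "lea", "cmp", "jne", "xor", "ret"],
    ],
}
centroids = train(training)
unknown = ["push", "mov", "sub", "mov", "call", "mov", "leave", "ret"]
print(predict(centroids, unknown))
```

The unknown sample shares several bigrams with the first author's samples and none with the second's, so the sketch attributes it to the first author. A real system would need far richer features and many more samples per author.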

The algorithm has a high de-anonymization accuracy
According to the researchers, the algorithm de-anonymized executable binaries written by 20 programmers with an accuracy of 96%, after the machine learning classifier had been trained on only 8 executable binaries per programmer.

After analyzing binaries from all 600 programmers, the researchers reported 52% accuracy, which is more than acceptable for a recently created algorithm that has not seen years of development.
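To put these figures in context, it helps to compare them with blind guessing. The short calculation below assumes every programmer is equally likely to be the author (an assumption made here for illustration, not a detail taken from the paper), so a random guess is right once in every N tries.

```python
def chance_accuracy(num_authors):
    """Accuracy of guessing uniformly at random among num_authors candidates."""
    return 1.0 / num_authors

# (number of programmers, accuracy reported in the article)
for authors, reported in [(20, 0.96), (600, 0.52)]:
    baseline = chance_accuracy(authors)
    ratio = reported / baseline
    print(f"{authors} programmers: chance {baseline:.2%}, "
          f"reported {reported:.0%}, about {ratio:.0f}x better than guessing")
```

Under that assumption, guessing gets 5% for 20 programmers and roughly 0.17% for 600, so the 52% result is further above chance in relative terms than the 96% result.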

"Stripping and removing symbol information from the executable binaries reduces the accuracy to 66%, which is a surprisingly small drop," says Mrs. Caliskan-Islam, one of the researchers. "This suggests that coding style survives complicated transformations."

[Figure: Algorithm structure overview]

The researchers also concluded that the de-anonymization accuracy goes up if the programmer is more skilled, since advanced programmers often create their own style of coding, very distinct from scholastic, standard approaches.

Authors of controversial software may want to stay away from GitHub
Because of open source coding repositories like GitHub, state agencies can build a database of all developers and their coding styles, and then easily compare the coding style used in "anti-establishment" software to detect the culprit.

The researchers said that when the algorithm was tested on GitHub repositories, it achieved a 62% de-anonymization accuracy. They did note, however, that the algorithm is of little use on collaborative projects where multiple programmers contribute to the same source code.

Despite all the privacy implications this research may have, the algorithm can also be used by security researchers to track down malware authors. Currently, researchers say that the algorithm is not yet ready to take on malware code, which is often very well obfuscated.

"Our results so far suggest that while stylistic analysis is unlikely to provide a 'smoking gun' in the malware case, it may contribute significantly to attribution efforts," Mrs. Caliskan-Islam also noted.

Aylin Caliskan-Islam presented the research paper at the 32nd Chaos Communication Congress (32C3). The video of the talk can be downloaded from the conference's website.

 

LabZero

Good news in malware battle!

Before this, it was almost impossible to trace a compiled program, for example one written in C, back to its source. Once the code had been transformed into executable instructions, there was only limited scope for understanding its operation by reading the disassembled output in assembly.
 

Rishi

This still depends on a previous sample being available from a known coder. Unknown, self-styled coders in the wild will be hard to catch, but once profiling is done it can still give leads on a malicious coder in the future.
 

jamescv7

This concept should have been put into practice long ago, because cases of controversial software occur many times. Developers argue that any similarity is a coincidence or that the code was only used for reference, and the dispute ends up leading to no unique result.

It is good news for getting better analysis. Code complexity sometimes makes that hard, since you need to analyze manually from the basics up to the complicated parts.
 
