Abstract
This article explores adversarial attacks on neural networks employed in malware detection, with a particular emphasis on white-box attacks. Unlike in traditional domains, modifying executable files requires meticulous attention due to the sensitivity of binary values. The proposed method identifies specific bytes within a malicious binary that can be edited, enabling adversarial modifications without compromising the file's functionality. The study achieves a nearly 99% success rate in causing misclassification of the resulting adversarial examples. It also acknowledges the challenges associated with black-box attacks and discusses possible reasons for the lack of transferability. Finally, it optimizes the process using explainable machine learning techniques, reducing the number of edited bytes while still achieving a 93% success rate.
Introduction
As widely known in the literature, machine learning algorithms, such as deep neural networks, are in general susceptible to adversarial examples: it is possible to generate inputs, by applying worst-case perturbations to existing ones, such that the crafted input is misclassified with high confidence [1]. Given an input for a neural network (or any other ML model), an adversarial example is a perturbed version of the original that preserves its functionality and appears similar to it, yet is misclassified by the network. The problem arises in a large number of domains where neural networks are applied: for example, adversarial examples can be generated from images or speech, producing inputs that look perfectly normal and are indistinguishable to human perception, but that the network nonetheless misclassifies. In this paper, I investigate how to produce adversarial examples in the case of malware detection. When dealing with images or speech, the goal is to apply small perturbations to the whole input without changing how it is perceived; with executable files, instead, careful attention is required, since even a small modification to the binary values (e.g., changing an offset or an opcode in the executable) can completely change the program's functionality. Therefore, classical adversarial example generation is not directly suitable for this goal and must be adapted. My approach consists of identifying which bytes in the malicious binary can be modified without changing its functionality, and of mapping these changes back to the original binary to produce a new malicious file that evades detection by being classified as benign by the neural network.
Related Work
A. Adversarial Examples
A large fraction of machine learning models has been shown to be vulnerable to manipulations of their inputs that produce adversarial examples: adding a carefully chosen perturbation to craft a new input, indistinguishable to a human from the original, consistently leads a target model to misclassify it. This behavior has been explained by the inherent linearity of the components of the model, even when those components result in a non-linear model, as in neural networks [1]. The misclassification goal can be targeted or untargeted, depending on whether or not the adversary chooses a particular class into which the crafted input should be misclassified. In the case of binary classification, targeted and untargeted attacks produce the same result. Different algorithms have been proposed for untargeted attacks. One of the first and simplest is the Fast Gradient Sign Method (FGSM). Essentially, given an input x with class y, it produces an adversarial example by adding noise to the input in the direction that increases the loss function L, computing:

x_adv = x + ε · sign(∇_x L(x, y))

with L being the loss function, usually the categorical cross-entropy. For targeted attacks, a slight modification of FGSM is used instead, decreasing the loss of the input with respect to a chosen target class y_t:

x_adv = x − ε · sign(∇_x L(x, y_t))
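As an illustration, a minimal TensorFlow sketch of both variants; the model, labels, and perturbation budget eps are generic placeholders, not part of the attack described later:

```python
import tensorflow as tf

def fgsm(model, x, y, eps, targeted=False):
    """One-step FGSM: move each input component by eps in the direction of
    the sign of the loss gradient (away from y if untargeted, towards y if
    y is the chosen target class)."""
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    loss_fn = tf.keras.losses.CategoricalCrossentropy()
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = loss_fn(y, model(x))
    grad = tape.gradient(loss, x)
    if targeted:
        return x - eps * tf.sign(grad)   # decrease the loss w.r.t. the target class
    return x + eps * tf.sign(grad)       # increase the loss w.r.t. the true class
```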
More advanced techniques have been proposed, but, as I will discuss later, they do not suit this problem. State-of-the-art work has shown that adversarial examples can be transferable between models: an adversarial example crafted for a particular model trained on a specific dataset remains misclassified with high probability both by the same network trained on a different dataset and by a different network trained on the same dataset. It has also been shown that adversarial examples can be generated in a black-box fashion, without knowledge of the underlying model, by building a synthetic dataset to train a substitute model [2].
B. Malware Detection
The never-ending cat-and-mouse game between malware writers and antivirus vendors has produced an endless list of techniques on both sides. Every time a malware detection system introduces a new method, a new evasion technique emerges to try to defeat it. As machine learning improves, neural networks appear to be the most promising step toward precise malware detection: they are resilient to metamorphic code and obfuscation techniques and produce encouraging results. Malware detection through a machine learning model can be based either on features extracted from the inspected binary or on the raw bytes themselves. Since malware producers know which features are collected for classification, they can efficiently hide malicious behavior (as they have done against existing anti-virus techniques). The most promising ML approach to analyzing malware therefore seems to be the one based on raw bytes.
The Problem
Therefore, I set the goal of producing adversarial examples against a neural network malware detector, with the hope of raising awareness that neural networks should not be blindly trusted for security purposes.
A. The target Neural Network
I chose MalConv [3], which appears to be the state of the art in malware detection, as the target for crafting adversarial examples. The MalConv neural network detects Windows PE32 executable malware without any need for preprocessing or feature selection. Crafting an adversarial example, in this case, therefore means directly changing bytes in the executable file. The neural network structure is represented in the figure below:
It takes the raw bytes of the executable file and maps each byte to an 8-dimensional floating-point vector. This is done to abstract away the numeric value of each byte, which, in an executable file, simply represents data or an instruction and carries no inherent notion of intensity or closeness. The input dimension is fixed to 1MB due to resource constraints: larger files are cropped, while smaller files are padded to match this dimension. The convolutional layers capture local spatial relations, such as malicious functions or data. The network produces a binary classification value stating whether it believes the analyzed file to be malware.
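For reference, a rough Keras sketch of this architecture; the layer sizes, activations, and the extra padding token follow common MalConv implementations and are indicative rather than a faithful copy of the pre-trained model used later:

```python
from tensorflow.keras import layers, Model

def build_malconv(max_len=2**20, embed_dim=8):
    # Raw bytes, cropped or padded to 1MB, fed as integer tokens.
    inp = layers.Input(shape=(max_len,), dtype='int32')
    # 256 byte values plus one padding token (an assumption about the exact vocabulary).
    emb = layers.Embedding(257, embed_dim)(inp)
    # Gated convolution: one branch extracts features, the other acts as a sigmoid gate.
    conv = layers.Conv1D(128, 500, strides=500, activation='relu')(emb)
    gate = layers.Conv1D(128, 500, strides=500, activation='sigmoid')(emb)
    gated = layers.Multiply()([conv, gate])
    pooled = layers.GlobalMaxPooling1D()(gated)
    dense = layers.Dense(128, activation='relu')(pooled)
    out = layers.Dense(1, activation='sigmoid')(dense)  # P(malware)
    return Model(inp, out)
```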
B. Challenges
There exist multiple challenges in crafting an adversarial example for this problem. First of all, as previously stated, the adversarial input must be a valid PE32 executable, so the bytes of the original file must be changed with care not to destroy its malicious functionality. Additionally, the network itself hinders the creation of adversarial examples: the presence of the embedding layer that translates byte values into 8-dimensional vectors makes the existing algorithms inapplicable as they are. They usually require computing the gradient of the loss of the output class with respect to the input, but the embedding layer at the beginning of the network makes the whole network non-differentiable with respect to its raw byte input. I will tackle these challenges separately.
C. Windows PE32 Format
It is useful to know in some detail the layout of the Portable Executable file format for the Windows operating system in order to understand how to handle the executable files given as input. A Windows binary begins with the DOS header. This is placed at the beginning for legacy reasons, and the Windows loader, when executing a program, first checks whether the first two bytes of the file match the MS-DOS signature MZ, indicating a valid executable format. It then reads 4 bytes at offset 0x3c to find the location of the PE header, while ignoring the remaining bytes of the DOS header and the DOS stub (these contain a valid MS-DOS program to be executed if no PE header is found).
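For illustration, a small Python sketch of the same loader checks, locating the PE header from the two signatures and the 4-byte field at offset 0x3c (the helper name is illustrative; only the standard struct module is used):

```python
import struct

def pe_header_offset(data: bytes) -> int:
    """Return the file offset of the PE header, mimicking what the Windows
    loader checks: the 'MZ' signature and the 4-byte e_lfanew value at 0x3c."""
    if data[:2] != b'MZ':
        raise ValueError('not a valid MZ executable')
    e_lfanew, = struct.unpack_from('<I', data, 0x3c)
    if data[e_lfanew:e_lfanew + 4] != b'PE\x00\x00':
        raise ValueError('PE signature not found')
    return e_lfanew
```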
The PE header contains important metadata about which Windows subsystem can run the file and how. It includes information such as the entry point of the program and the sections the program is composed of, with their respective metadata. Each program is composed of multiple sections that may contain code, data, or resources; each section must start at a file offset aligned to at least 512 bytes, or more if stated in the header. Zero-byte padding is inserted to make the sections align properly. The figure below illustrates the header structure:
Design Choices
A. Choosing bytes to modify
The first design choice I had to make was how to change the provided malware to turn it into an adversarial example while preserving its functionality. An analysis of the PE format provides some candidate bytes. As stated in the previous chapter, the DOS header is completely ignored by the Windows loader as long as it finds two valid bytes in the MS-DOS signature and four valid bytes at offset 0x3c. Therefore, the whole DOS header, apart from those bytes, can be changed along with the DOS stub without modifying the functionality of the malware. While I could have fine-tuned the choice of header bytes to change, including some bytes in the PE header (for example, the compilation timestamp or reserved-but-unused bytes), I decided to leave them unchanged to avoid tying the approach to architecture- or version-specific details. In addition to the DOS header bytes, I added the padding bytes between sections to the set of mutable bytes. They are present in the executable image only to align the section starting points to 512-byte boundaries and are never accessed by the software, so any change to them goes unnoticed with respect to the binary's functionality. I opted not to change data inside the malware, such as strings in the executable: even though changing them would not affect the binary's functionality, it could affect how the malware is perceived by the victim (e.g., strings with instructions to pay a ransom in ransomware), so leaving them unchanged was a safe default. All the bytes described above amount to less than 1% of the bytes of the executable file, but that was still enough to mount the attack.

Existing works address the problem by appending an extra section at the end of the malware in which to place the adversarial noise [4]. I believe such an approach is not scalable: to make the network actually see the added section, you have to make sure that adding it does not exceed the input size the network can parse. Complex malware, such as "WannaCry", exceeds several MBs in size, which makes it impractical to build a network that can analyze such long sequences as a whole. The input is therefore usually cropped if it exceeds a threshold size, and this would remove any appended adversarial noise. My approach does not suffer from this problem.
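For concreteness, a sketch of how the set of editable offsets described above could be collected with the pefile library, under the assumption that the bytes between the end of a section's actual content (its virtual size) and its file-aligned raw size are unused alignment padding:

```python
import pefile

def editable_offsets(path):
    """Collect byte offsets that may be perturbed without touching the
    loader-relevant fields or the actual section contents (a sketch, not
    an exhaustive rule set)."""
    pe = pefile.PE(path)
    pe_header_off = pe.DOS_HEADER.e_lfanew
    # Whole DOS header and stub, except the 'MZ' signature and the
    # 4-byte e_lfanew field at offset 0x3c.
    protected = {0, 1, 0x3c, 0x3d, 0x3e, 0x3f}
    offsets = [i for i in range(pe_header_off) if i not in protected]
    # Alignment padding at the tail of each section's raw data: bytes past
    # the section's virtual size, up to its file-aligned raw size.
    for sec in pe.sections:
        content_end = sec.PointerToRawData + min(sec.Misc_VirtualSize,
                                                 sec.SizeOfRawData)
        raw_end = sec.PointerToRawData + sec.SizeOfRawData
        offsets.extend(range(content_end, raw_end))
    return offsets
```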
B. Facing the embedding layer Problem
While inserting an embedding layer in the network boosted the accuracy of malware detection, it was a challenging wall to break in order to mount the attack. Being non-differentiable, the embedding layer does not allow computing the gradient of the loss with respect to the raw input, so most existing attacks based on gradient descent are not applicable as they are. Following the solution proposed in [4], I chose to compute the loss with respect to the output of the embedding layer itself, avoiding the differentiation problem. Instead of obtaining an adversarial input, the output of the algorithm is therefore an adversarial embedding. The problem then becomes how to map the produced adversarial embedding back to valid bytes to obtain a new executable file: for the modified vectors there will, in general, be no byte that maps exactly to them through the embedding layer. In fact, the embedding layer maps the 256 possible byte values to 256 possible 8-dimensional real vectors, and a modified vector has only a negligible chance of exactly matching any of them. My approach was to map each 8-dimensional vector to its nearest neighbor among the embedding vectors and then take the corresponding byte back. Taking the nearest neighbor only approximates the solution found by the algorithm, but this was enough to build the attack.
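A minimal sketch of this nearest-neighbour mapping, assuming embedding_matrix holds the 256 rows of the trained embedding layer that correspond to real byte values (names are illustrative):

```python
import numpy as np

def embedding_to_bytes(adv_vectors, embedding_matrix):
    """Map each adversarial 8-dimensional vector to the byte value whose
    embedding is its nearest neighbour in Euclidean distance.

    adv_vectors:      (n, 8) array of modified embedding rows
    embedding_matrix: (256, 8) rows of the embedding layer for real bytes
    """
    dists = np.linalg.norm(
        adv_vectors[:, None, :] - embedding_matrix[None, :, :], axis=-1)
    return dists.argmin(axis=1).astype(np.uint8)
```

Applying the mapping only to the positions that were actually modified keeps both the memory cost and the approximation error confined to the editable bytes.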
C. Tuning existing algorithms
All the existing algorithms for constructing adversarial examples focus on making imperceptible changes to the input so that it is misclassified by the network, producing inputs that look perfectly normal and are indistinguishable to human perception. My goal differs: I am not interested in making changes invisible to humans; I only need to preserve the malware's functionality by avoiding changes to meaningful bytes. While this may not seem a problem, complications arise when applying the existing algorithms to the embedding layer as designed above. Since I have to map the adversarial embedding back to actual bytes using nearest neighbors, imperceptible changes to the embedding vectors mean that the closest neighbor of an adversarial embedding vector is, with high probability, the original embedding vector itself, resulting in little or no change to the input file. Given that the perturbations I applied were far larger than those recommended by state-of-the-art techniques, I wanted to be sure that, when mapping the embedding back to the input, I would still obtain an adversarial example with high probability. Therefore, I had to discard existing algorithms that rely on minimization problems to obtain a small perturbation [5], and I developed a slight modification of the targeted Fast Gradient Sign Method, empirically tuning the steps to obtain adversarial examples with the highest possible probability. I am not claiming that this modification of FGSM generalizes to other problems, but it is the algorithm that worked in practice, producing adversarial examples reliably. It is reported in Algorithm 1.
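Algorithm 1 itself is not reproduced here; purely as an illustration, the following is a rough sketch of an iterative, targeted FGSM-style loop restricted to the editable embedding positions. The split into embed_model/head_model, the step size, and the stopping criterion are assumptions, not the exact algorithm:

```python
import numpy as np
import tensorflow as tf

def craft_adv_embedding(embed_model, head_model, x_bytes, editable_idx,
                        target=0.0, step=0.5, max_iter=50):
    """Iteratively push the embedding of the editable positions in the
    direction that decreases the loss towards the target class.

    embed_model: raw bytes -> embedding (the embedding layer alone)
    head_model:  embedding -> P(malware) (the rest of MalConv)
    """
    emb = embed_model(x_bytes[None, :]).numpy()           # (1, L, 8)
    y_target = tf.constant([[target]], dtype=tf.float32)  # 0.0 = benign
    bce = tf.keras.losses.BinaryCrossentropy()
    for _ in range(max_iter):
        e = tf.convert_to_tensor(emb)
        with tf.GradientTape() as tape:
            tape.watch(e)
            loss = bce(y_target, head_model(e))
        grad = tape.gradient(loss, e).numpy()[0]
        # Targeted step: decrease the loss, but only on editable positions.
        emb[0, editable_idx, :] -= step * np.sign(grad[editable_idx, :])
        score = float(head_model(tf.convert_to_tensor(emb))[0, 0])
        if (score < 0.5) == (target < 0.5):
            break
    return emb[0]
```

The crafted embedding is then mapped back to bytes with the nearest-neighbour step from the previous section before rebuilding the executable.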
Implementation and evaluation
I decided to implement the whole process using the Keras deep learning library. I collected 302 malware samples from theZoo [7] to classify and on which to construct adversarial examples, and I used a pre-trained version of the MalConv network from [6]. Although both the original paper [3] and the paper providing the pre-trained model report over 90% detection rates, MalConv performed poorly on this specific dataset: among the 302 collected samples, only 118 were identified as malicious, a detection rate of only around 40%. Therefore, to demonstrate the effectiveness of the technique, I chose to apply it so as to invert whichever class the model predicted, not restricting myself to making malware appear benign: I also constructed adversarial examples for the malware that was already classified as benign, so that it would be classified as malicious. Out of the 302 samples, I was able to invert the classification for 298, a 98.7% success rate. Interestingly, I achieved a 100% success rate in the embedding domain, but the approximation introduced by the nearest-neighbor mapping sometimes ruined the result. An interesting implementation issue was that some malware samples had a null gradient of the loss with respect to the input, which would have prevented the method from working. Introducing random mutations in the editable bytes empirically always produced a non-null gradient, allowing the method to proceed; a sketch of this workaround is given below.
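A minimal sketch of the random-mutation workaround mentioned above; the function name and the number of mutated bytes are illustrative:

```python
import numpy as np

def jitter_editable_bytes(x_bytes, editable_idx, n=16, rng=None):
    """Randomly mutate a handful of editable bytes; empirically this was
    enough to escape inputs whose loss gradient was exactly zero."""
    rng = rng or np.random.default_rng()
    x = x_bytes.copy()
    chosen = rng.choice(editable_idx, size=min(n, len(editable_idx)),
                        replace=False)
    x[chosen] = rng.integers(0, 256, size=len(chosen), dtype=x.dtype)
    return x
```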
Black Box attacks
The attack I just implemented is the worst-case scenario for the defender: it assumes an attacker able to access every part of the neural network, including its weights, and is therefore called a white-box attack. Existing works also study the scenario in which the assumption of complete access to the network is dropped; these are called black-box attacks. They assume no access to the network topology, only a self-built approximation of it, with or without access to the original training set of the network. It is worth reporting that I failed to replicate any of the black-box attacks presented in the literature for classic adversarial examples. Existing approaches seem unsuitable for this problem, and I found no paper describing or proposing new approaches to bypass networks that include embedding layers. I took two slightly different implementations of the MalConv network to test the transferability of adversarial examples in the binary domain. While I was able to generate adversarial examples for each network separately, unfortunately no adversarial example generalized to both. I believe further research should focus on the problem of generating transferable adversarial examples in the binary domain: current techniques do not take into account the limited editing power over the bytes that can be changed, nor the approximation introduced when mapping the embedding domain back into the input domain while crafting the adversarial example. For completeness, I report the structure of both networks in Figure 1 and Figure 2.
Towards a better approach
I then started investigating how to optimize the number of bytes the algorithm edits. I relied on explainable machine learning to understand how much each input byte contributes to the final prediction. Among the various techniques, I chose saliency visualization [8], normally applied to understand which pixels of an image contribute to the network's decision for a particular outcome. I used it by mapping each position in the embedding layer back to the corresponding original byte, so that I could produce a saliency heatmap for the files correctly classified as malware; an example for the "WannaCry" malware is shown in the figure after the following sketch.
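A rough sketch of how a per-byte saliency value can be computed following [8], assuming the same illustrative embed_model/head_model split used earlier: the L2 norm of the gradient of the malware score with respect to each byte's embedding vector.

```python
import numpy as np
import tensorflow as tf

def byte_saliency(embed_model, head_model, x_bytes):
    """Return one saliency value per byte position: the L2 norm of the
    gradient of the malware score w.r.t. that byte's embedding vector."""
    emb = tf.convert_to_tensor(embed_model(x_bytes[None, :]).numpy())
    with tf.GradientTape() as tape:
        tape.watch(emb)
        score = head_model(emb)[0, 0]
    grad = tape.gradient(score, emb).numpy()[0]   # shape (L, 8)
    return np.linalg.norm(grad, axis=-1)          # shape (L,)
```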
Although the figure is dense and hard to read, it is clear that the highest saliency is localized in particular areas of the binary. While most of the high-saliency zones belong to non-editable parts of the binary, such as the beginning of the text, data, or resource sections, one highly salient zone belongs to the header. This was a shared feature among the vast majority of the saliency maps I examined. The figure below shows a zoom on the header of "WannaCry".
As can be noticed, the highly salient zone includes the PE header (bytes 0xF8 up to 0xFFF), which makes sense, as it is an important part of any executable file. But the DOS header (bytes 0x0 to 0xF7) is also included in the highly salient zone. Since the DOS header is always part of the editable bytes, this lets me modify a highly salient area without any constraint, which gives a strong primitive for localizing the modifications. Therefore, I repeated the experiments, limiting the modifications to the bytes of the DOS header only. Dropping the edits on padding bytes also makes the algorithm work in the worst-case scenario in which no padding bytes are present to be edited. This resulted in, on average, only 0.01% of the bytes of the executable file being changed. Without any further modification to the method, I was able to invert the MalConv prediction for 280 out of 302 malware samples, still a 93% success rate, with two orders of magnitude fewer editable bytes. Again, I observed a 100% success rate in the embedding domain, with the nearest-neighbor approximation degrading the final results.
Conclusion
In this paper, I presented an approach for mounting white-box adversarial example attacks against neural networks in the malware domain. I faced multiple challenges, including the restrictions the domain imposes on which bytes can be edited, and the difficulties that existing implementations introduce by relying on embedding layers to analyze the malware input. I showed how tuning and combining existing techniques allowed me to build approximations of adversarial examples that are misclassified in almost 99% of the cases. I also found that neither existing techniques nor the modifications I proposed were suitable for mounting black-box attacks in the malware domain. I argue that the embedding layer may be the cause of the non-transferability of the generated adversarial examples, since slight differences in the training set may produce completely different embedding domains that, in turn, affect the successive layers; further study in that direction would be interesting. I then exploited explainable machine learning techniques to further optimize the approach, reducing the number of bytes the algorithm needs to change. Restricting the noise to the DOS header bytes, which have high saliency, still produced a 93% success rate in crafting adversarial examples while reducing the edited bytes by two orders of magnitude; it also makes the approach work in the worst-case scenario in which no padding bytes are present to be edited. While I believe further research is needed on black-box attacks, the proposed approach achieves a high success rate when crafting approximations of adversarial examples in the malware domain.
REFERENCES
[1] Explaining and Harnessing Adversarial Examples, Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy, arXiv:1412.6572
[2] Practical Black-Box Attacks against Deep Learning Systems using Adversarial Examples, Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, 2016
[3] Malware Detection by Eating a Whole EXE, Edward Raff, Jon Barker, Jared Sylvester, arXiv:1710.09435
[4] Adversarial Examples on Discrete Sequences for Beating Whole-Binary Malware Detection, Shir Aviv-Reuven, Joseph Keshet, arXiv:1802.14528v1
[5] Towards Evaluating the Robustness of Neural Networks, Nicholas Carlini, David Wagner, arXiv:1608.04644v2
[6] EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models, Hyrum S. Anderson, Phil Roth, arXiv:1804.04637v2
[7] theZoo: A repository of LIVE malwares for your own joy and pleasure, https://github.com/ytisf/theZoo
[8] Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman, arXiv:1312.6034v2