Malware research is a discipline of information security that aims to provide protection against unwanted and dangerous software. Since the mid-1980s, researchers in this area have been locked in a technological arms race with the creators of malware. Many ideas have been proposed, to varying degrees of effectiveness, from traditional systems security and program analysis to the use of AI and machine learning. Nevertheless, with increasing technological complexity, and despite more sophisticated defenses, malware's impact has grown rather than shrunk. Defenders appear to be continually reacting to yesterday's threats, only to be surprised by today's minor variations of them.
This lack of robustness is most apparent in signature matching, where malware is represented by a characteristic substring. The fundamental limitation of this approach is its reliance on falsifiable evidence. Mutating the characteristic substring, i.e., falsifying the evidence, is effective in evading detection and cheaper than discovering the substring in the first place. Unsurprisingly, the same limitation applies to malware detectors based on machine learning, as long as they rely on falsifiable features for decision-making. Robust malware features are therefore necessary.
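The brittleness described above can be sketched in a few lines. This is a toy illustration only, not a real scanner; the signature name, the byte strings, and the `scan` helper are all invented for the example:

```python
# Toy substring-signature scanner illustrating falsifiable evidence.
# Signature bytes and names are hypothetical.
SIGNATURES = {
    "EvilBot": b"\x90\x90\xeb\x05connect_c2",  # hypothetical characteristic substring
}

def scan(sample):
    """Return the name of the first matching signature, or None."""
    for name, sig in SIGNATURES.items():
        if sig in sample:
            return name
    return None

original = b"...header..." + b"\x90\x90\xeb\x05connect_c2" + b"...payload..."
# Mutating a single byte of the signed region falsifies the evidence
# and evades the scanner, at far lower cost than deriving the signature.
mutated = original.replace(b"\xeb\x05", b"\xeb\x07")

print(scan(original))  # "EvilBot"
print(scan(mutated))   # None
```

In practice an attacker would apply a semantics-preserving transformation (e.g., equivalent instruction substitution or junk insertion) rather than a blind byte flip; the point here is only that the detector's evidence can be falsified far more cheaply than it was produced.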
Furthermore, robust methods for malware classification and analysis are needed across the board to overcome phenomena including, but not limited to, concept drift (malware evolution), polymorphism, new malware families, new anti-analysis techniques, and adversarial machine learning, while supporting robust explanations. This workshop solicits work that aims to advance robust malware analysis, with the goal of creating long-term solutions to the threats of today's digital environment. Potential research directions include malware detection, benchmark datasets, environments for malware arms race simulation, and exploring limitations of existing work, among others.
Topics of interest include (but are not limited to): Malware Analysis
We invite the following types of papers:
Submissions must be anonymous (double-blind review), and authors should refer to their previous work in the third person. Submissions must not substantially overlap with papers that have been published or that are simultaneously submitted to a journal or conference with proceedings.
Papers should be in LaTeX and we recommend using the ACM format. This format is required for the camera-ready version. Please follow the main CCS formatting instructions (except with page limits as described above). In particular, we recommend using the sigconf template, which can be downloaded from https://www.acm.org/publications/proceedings-template.
Accepted papers will be published by the ACM Digital Library and/or ACM Press. One author of each accepted paper is required to attend the workshop and present the paper for it to be included in the proceedings. Committee members are not required to read the appendices, so the paper should be intelligible without them. Submissions must be in English and properly anonymized.
Submissions should be made online at https://worma22.hotcrp.com.
|20:00 (15 min)||Opening speech||Fabio Pierazzi and Nedim Šrndić|
|20:15 (40 min)||Adaptive Malware Control: Decision-Based Attacks in the Problem Space of Dynamic Analysis||Ilias Tsingenopoulos, Ali Mohammad Shafiei, Lieven Desmet, Davy Preuveneers and Wouter Joosen||Adversarial malware has been widely explored, most often against detection based on static analysis and through feature-space manipulations. With the prevalence of encryption, obfuscation, and packing, dynamic behavior is considered much more revealing of a program's nature. At the same time, defining and performing attacks through the feature representation of malware faces several obstacles, especially in dynamic analysis. However, if program behavior is both malleable and indicative of malicious intent, we must ask how it can be adaptively controlled in order to evade detection.
In this work, we redefine adversarial attacks on malware behavior so that they can be performed directly by the original binary, thereby obviating the need to compute gradients through feature representations. We show that this is possible even in the fully black-box case where only the final, hard-label decision is known. Furthermore, we empirically evaluate our approach by training state-of-the-art sequence models for detecting malware behavior, constructing several malware manipulation environments, and training a host of reinforcement learning (RL) agents on them that learn evasive policies by interaction. Finally, we use the adversarial behavior learned by the RL agents to adversarially train the original detection models, and we show that while adversarial training is indispensable, the degree of robustness it imparts can be deceptive, especially when we consider adversaries with different action sets.|
|20:55 (20 min)||Position Paper: On Advancing Adversarial Malware Generation Using Dynamic Features||Ali Shafiei, Vera Rimmer, Ilias Tsingenopoulos, Lieven Desmet and Wouter Joosen||As malware detection systems evolve, adversaries develop sophisticated evasion techniques that render malicious samples undetectable. Especially against ML-based detection systems, an effective approach is to craft adversarial malware that evades detection. In this position paper, we conduct a critical review of existing adversarial attacks against malware detection and conclude that current research focuses mainly on evasion techniques against static analysis; generating adversarial Windows samples to evade dynamic analysis remains largely unexplored. In the context of black-box attack scenarios, we investigate an adversary's potential to carry out practical transformations in order to influence behavioral features observed by ML systems and security products. Moreover, we investigate the range of dynamic behavior transformations and identify critical properties and associated challenges that relate to feasibility, automation, technical costs, and detection risks. Through this discussion, we propose solutions to important challenges and present promising paths for future research on evasive malware under dynamic analysis.|
|21:15 (40 min)||Transformers for End-to-End InfoSec Tasks: A Feasibility Study||Ethan M. Rudd, Mohammad Saidur Rahman and Philip Tully||Training a machine learning (ML) model from raw information security (InfoSec) data involves utilizing distinct data types and input formats that require unique considerations compared to more conventional applications of ML like natural language processing (NLP) and computer vision (CV). In this paper, we assess the viability of transformer models in end-to-end InfoSec settings, in which no intermediate feature representations or processing steps occur outside the model. We implement transformer models for two distinct InfoSec data formats – specifically URLs and PE files – in a novel end-to-end approach, and explore a variety of architectural designs, training regimes, and experimental settings to determine the ingredients necessary for performant detection models.
We show that in contrast to conventional transformers trained on more standard NLP-related tasks, our URL transformer model requires a different training approach to reach high performance levels. Specifically, we show that 1) pre-training on a massive corpus of unlabeled URL data for an auto-regressive task does not readily transfer to binary classification of malicious or benign URLs, but 2) using an auxiliary auto-regressive loss improves performance when training from scratch. We introduce a method for mixed objective optimization, which dynamically balances contributions from both loss terms so that neither one of them dominates. We show that this method yields quantitative evaluation metrics comparable to those of several top-performing benchmark classifiers.
Unlike URLs, binary executables contain longer and more distributed sequences of information-rich bytes. To accommodate such lengthy byte sequences, we introduce additional context length into the transformer by providing its self-attention layers with an adaptive span similar to Sukhbaatar et al. We demonstrate that this approach performs comparably to well-established malware detection models on benchmark PE file datasets, but we also point out the need for further exploration into model improvements in scalability and compute efficiency.|
|21:55 (40 min)||Keynote speech: Open Challenges of Malware Detection under Concept Drift||Gang Wang||The security community today is increasingly using machine learning (ML) for malware detection for its ability to scale to a large number of files and capture patterns that are difficult to describe explicitly. However, deploying an ML-based malware detector is challenging in practice due to concept drift. As the behaviors of malware and goodware constantly evolve, the shift in their data distribution often leads to serious errors in the deployed detectors. In addition, such dynamic evolution further adds to the pressure of labeling new malware variants for model updating, which is already an expensive process.
In this talk, I will introduce our recent exploration of the challenges introduced by malware concept drift and the potential solutions. I will first discuss the problem of detecting drifting samples to proactively inform ML detectors when not to make decisions. We explore the idea of self-supervision for drift detection and design the corresponding explanation methods to make sense of the detected concept drift. Second, to facilitate malware labeling and model updating, I will share our recent results from combining cheap unsupervised methods with the existing limited/biased labels to generate higher-quality labels. Finally, I will discuss the emerging threat of poisoning and backdoor attacks that exploit the dynamic updating process of malware detectors, and potential directions to robustify this process.
|22:35 (5 min)||Closing speech||Fabio Pierazzi and Nedim Šrndić|