2010 - Identifying Dormant Functionality in Malware Programs
Information
Approach
REANIMATOR
Leverage behavior observed while dynamically executing a specific malware sample to identify similar functionality in other programs
Automatically extract and model the parts of the malware binary that are responsible for this behavior
Leverage these models to check whether similar code is present in other samples to statically identify dormant functionality (functionality that is not observed during dynamic analysis)
Static search of code that is not executed during dynamic analysis
Automated model generation
Existing automated model / signature generation approaches search for bytes, instructions, or subgraphs that appear frequently in malware
Functionality-aware model
Models are required to be functionality-aware, i.e. equipped with semantic information that indicates a malicious functionality
e.g., the fact that a malware sends spam, monitors keystrokes, or starts a web server to provide backdoor access to a compromised host
Steps
Generating models for malware behaviors
Dynamic Behavior Identification
Malware binary is executed in a dynamic analysis environment
Anubis records invocations of security-relevant system calls and Windows API functions
Taint analysis
Used to track data flow dependencies between system and function call arguments
Based on the recording, a set of specifications is used to identify different types of phenotypes, i.e. interesting security-relevant behaviors that a malware sample exhibits during dynamic analysis
Use rules that describe a malware phenotype in terms of the required system or API calls, their arguments, and the data flows between these arguments
Behavioral specifications for the different phenotypes are written manually
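A rule of this kind can be sketched as a check over a recorded call trace. This is a minimal illustration only: the `Call` type, the taint labels, and the "file contents sent over the network" rule are hypothetical and do not reflect the actual Anubis log format or the authors' specification language.

```python
from dataclasses import dataclass, field

@dataclass
class Call:
    name: str                                     # system or API call name
    in_taint: set = field(default_factory=set)    # taint labels on input arguments
    out_taint: set = field(default_factory=set)   # taint labels the call produces

def matches_phenotype(trace, source, sink):
    """True if data produced by a `source` call flows into a later `sink` call."""
    produced = set()
    for call in trace:
        if call.name == sink and produced & call.in_taint:
            return True
        if call.name == source:
            produced |= call.out_taint
    return False

# Hypothetical trace: a file is read and its contents are later sent out.
trace = [
    Call("ReadFile", out_taint={"t1"}),
    Call("CreateMutex"),
    Call("send", in_taint={"t1"}),
]
leaks_file = matches_phenotype(trace, source="ReadFile", sink="send")
```

The rule fires only when both the required calls and the data flow between their arguments are present, which is what separates a phenotype from a plain list of API calls.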
Extracting Genotype Models
Filtering
Techniques
Finding exclusive instructions
White-listing
The goal of this filtering step is to identify instructions that are not directly responsible for a malicious behavior
It is likely that a program slice contains code that is not directly related to the malicious behavior that was observed
Slicing
Identify all instructions that contribute to the input parameters of these system calls, as well as instructions that operate on their output parameters
Once this code is located, we can extract its CFG and generate the corresponding fingerprints. These fingerprints then serve as the genotype model for detecting dormant behaviors in other binaries
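The backward half of slicing (instructions that contribute to a call's input parameters) can be sketched by walking define-use chains in reverse. This is a simplified illustration under stated assumptions: each instruction is reduced to a hypothetical `(defs, uses)` pair of register sets, memory and control dependencies are ignored, and the forward step over output parameters is not shown.

```python
def backward_slice(instructions, criterion_uses):
    """Indices of instructions that contribute to the values in `criterion_uses`.

    instructions: list of (defs, uses) set pairs in execution order.
    criterion_uses: locations read by the system call of interest.
    """
    wanted = set(criterion_uses)
    in_slice = []
    for idx in range(len(instructions) - 1, -1, -1):
        defs, uses = instructions[idx]
        if defs & wanted:
            in_slice.append(idx)
            wanted -= defs   # this definition satisfies the demand
            wanted |= uses   # its own inputs are now needed
    return sorted(in_slice)

# Hypothetical trace leading up to a call that takes ecx as an argument.
executed = [
    ({"eax"}, set()),      # 0: mov eax, 5
    ({"ebx"}, set()),      # 1: mov ebx, 7   (unrelated)
    ({"ecx"}, {"eax"}),    # 2: mov ecx, eax
    ({"ebx"}, {"ebx"}),    # 3: add ebx, 1   (unrelated)
]
slice_indices = backward_slice(executed, {"ecx"})
```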
Genotype Models
In other words, a genotype model is not the colored CFG itself, but a set of fingerprints that represent it. To search a binary for the presence of a particular genotype, only the fingerprints are used.
An algorithm generates a subset of all possible k-node subgraphs of G and normalizes them. Each normalized k-node subgraph then serves as a succinct fingerprint of the code region that is modeled
Given a genotype, modeled as a colored CFG G, the problem of finding this genotype in a malware binary is reduced to finding an isomorphic subgraph of size k that is present both in G and in the binary under analysis
Genotypes are considered similar when their respective CFGs share at least one isomorphic subgraph that is sufficiently large
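The fingerprint idea can be sketched with a brute-force normalization. This is an illustration under assumptions, not the authors' algorithm: it enumerates every connected k-node induced subgraph rather than a subset, and it normalizes by trying all node orderings, which is only feasible for small k. All function and variable names are hypothetical.

```python
from itertools import combinations, permutations

def canonical(nodes, edges, color):
    """Smallest (colors, adjacency) encoding over all orderings of `nodes`."""
    best = None
    for order in permutations(nodes):
        cols = tuple(color[n] for n in order)
        adj = tuple((a, b) in edges for a in order for b in order)
        key = (cols, adj)
        if best is None or key < best:
            best = key
    return best

def is_connected(nodes, edges):
    """True if `nodes` form one component when edge direction is ignored."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        n = stack.pop()
        for m in nodes - seen:
            if (n, m) in edges or (m, n) in edges:
                seen.add(m)
                stack.append(m)
    return seen == nodes

def fingerprints(nodes, edges, color, k):
    """Normalized forms of all connected k-node induced subgraphs."""
    return {
        canonical(sub, edges, color)
        for sub in combinations(sorted(nodes), k)
        if is_connected(sub, edges)
    }

# Genotype G and a binary H that contains the same colored 3-node chain.
g_nodes = ["a", "b", "c", "d"]
g_edges = {("a", "b"), ("b", "c"), ("c", "d")}
g_color = {"a": 1, "b": 2, "c": 1, "d": 4}
h_nodes = [10, 11, 12, 13]
h_edges = {(10, 11), (11, 12), (12, 13)}
h_color = {10: 8, 11: 1, 12: 2, 13: 1}

shared = fingerprints(g_nodes, g_edges, g_color, 3) & fingerprints(h_nodes, h_edges, h_color, 3)
```

Because isomorphic colored subgraphs normalize to the same key, a non-empty `shared` set means G and H have an isomorphic k-node subgraph in common, regardless of how the nodes are named.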
Colored control flow graph
Nodes of the CFG we use are colored based on the classes of instructions that are present in the corresponding basic blocks, e.g. arithmetic, logic, data transfer
An edge is a possible control flow (e.g. jump or branch)
A node is a basic block
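Node coloring can be sketched as a bitmask with one bit per instruction class. The class table below is an illustrative assumption; the notes only name arithmetic, logic, and data transfer as examples, and the mnemonic lists are not taken from the paper.

```python
# Hypothetical instruction classes; each class gets one bit in the color.
INSTRUCTION_CLASSES = {
    "arithmetic": {"add", "sub", "mul", "inc", "dec"},
    "logic": {"and", "or", "xor", "not"},
    "data_transfer": {"mov", "push", "pop", "lea"},
    "control": {"jmp", "jz", "jnz", "call", "ret"},
}
CLASS_BIT = {name: 1 << i for i, name in enumerate(INSTRUCTION_CLASSES)}

def block_color(mnemonics):
    """OR together the class bits of every instruction in a basic block."""
    color = 0
    for m in mnemonics:
        for cls, members in INSTRUCTION_CLASSES.items():
            if m in members:
                color |= CLASS_BIT[cls]
    return color

color_a = block_color(["mov", "add", "jz"])
color_b = block_color(["add", "push", "jnz"])
```

Two blocks that use different instructions from the same classes receive the same color, so the color characterizes a block without depending on exact opcodes or operands.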
Need to be able to characterize binary code
Since the result of the slicing step is neither precise nor complete, it is post-processed in two ways: a filtering step removes parts not related to the behavior, and a germination step extends the slice to include code that slicing missed
Starts by identifying all instructions that contribute to the input parameters of the system calls previously discovered, using the program slicing step
Once the genotype is located, a model for it can be built
The goal is to locate the genotype, i.e. the part of the binary directly responsible for a certain phenotype previously discovered
Germination
A slice might be incomplete. In particular, a slice might fail to include instructions that are part of a behavior, simply because these instructions do not directly operate on tainted data or because they are not part of define-use chains
Consider an instruction as part of the code that implements a behavior when this instruction cannot be executed without executing at least one instruction that is part of the program slice
Finding Dormant Functionalities
Statically disassemble an unpacked sample and check the binary for dormant functionality using the previously created models
When a code region is found that matches one of the models, the sample is reported to contain dormant functionality that implements the behavior associated with the matching genotype
Packed / obfuscated code must be unpacked before the system can analyze it
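The matching step can be sketched as a set intersection between each genotype model and the fingerprints extracted from the disassembled sample. Fingerprints are treated as opaque hashable values here; the behavior names, the `observed` parameter, and the fingerprint strings are hypothetical.

```python
def find_dormant(models, binary_fingerprints, observed=()):
    """Behaviors whose model matches the binary but which were not seen at runtime.

    models: dict mapping behavior name to a set of fingerprints.
    binary_fingerprints: set of fingerprints extracted from the unpacked sample.
    observed: behaviors already seen during dynamic analysis of this sample.
    """
    hits = {}
    for behavior, prints in models.items():
        shared = prints & binary_fingerprints
        if shared and behavior not in observed:
            hits[behavior] = shared
    return hits

models = {
    "spam": {"fp1", "fp2"},
    "keylogger": {"fp3"},
    "backdoor": {"fp4", "fp5"},
}
sample = {"fp2", "fp3", "fp9"}
dormant = find_dormant(models, sample, observed={"spam"})
```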
System Overview
Anubis
Sandbox dynamic analysis tool built on top of QEMU
Focus on
Exploits the fact that many malware samples share the same code base, or at least, parts of their code.
Copying and pasting is a common practice in software development
Goals
Problem
In many cases, only a small subset of all possible malicious behaviors is observed within the short time frame that a malware sample is executed
Previous techniques to increase the coverage of dynamic malware analysis, such as multi-path or forced execution, are potentially expensive, as the number of paths that require analysis can grow exponentially.
Various heuristics are used to select the more promising continuations first. However, these heuristics rarely achieve full code coverage
Contribution
Introduce a novel technique to automatically identify and model code regions in binaries that are directly responsible for specific runtime behaviors.
Present a system that leverages models to statically check unknown programs for the presence of previously-seen, malicious functionality
Authors
PM Comparetti