2015 - Behavioral Analysis of Obfuscated Code
2015 - Behavioral Analysis of Obfuscated Code
Malware would be trigged by querying Windows API functions
Malware always operate system resources after they were triggered like
Identify interesting portions of the traces by finding patterns that depend on the original semantics of the program.
Information is inferable from memory and execution traces that is attributable to the behavior of the program and reveals information on its semantics, regardless of obfuscation
Which techniques are effective in highlighting this information and give useful insights in the business logic of the target program?
There are some characteristics of the execution that are strictly correlated with the underlying logic of the code and are invariant
Introducing different techniques for processing memory and execution traces
Describe behavior by recording in which sequence the instructions are executed and which memory accesses are performed.
These properties can be exploited for the purpose of reverse engineering, exploring side effects of the execution to gather insightful information.
Visualization of memory accesses
Memory trace is displayed in an interactive chart, where the x-axis represents the instruction count while the y-axis the address space
Allows the analyst to visually identify memory segments (data, heap,libraries and stack) and explore the trace for finding interesting patterns or accesses that leak confidential information
Discovery of repeating or distinctive patterns in the
Analysis of statistical properties of the data
The use of statistical properties of the data-flow also to identify parts of the binary that deal with data with distinctive characteristics
Expresses the average amount of information that is contained in a specific data stream
Encrypted or compressed data has very high entropy
Groups the memory accesses in chunks of selectable length
For each chunk the probability distribution of each possible byte value is computed
Entropy level = -(Sum of P(xi) log (xi))
P(xi) = Frequency of each byte in the observed data
Distinguish encrypted from random data
1 more item...
Visualization of entropy and randomness of the data flow can reveal patterns that enable the identification of distinctive operations performed by the code
Sequences of memory accesses are tightly coupled with the semantics of the program
Most obfuscation methods are concerned of concealing the program logic by substituting instructions with equivalent (but more complex) one or altering the control flow
Distinctive patterns in the memory accesses remain unvaried and part of the data that flows to and from the memory is also unchanged
Dynamic Taint Analysis
Some tests the implementation offered by PANDA is requiring too much memory and thus the analysis can be unfeasible
Computation of the difference between memory traces, recorded with different inputs
Alternate approach if DTA by PANDA proves unfeasible
Identifying which parts of the execution depend on our input can be helpful in order to isolate smaller parts of the code that will later be analyzed in detail.
Auto-correlation of memory accesses
It is used to identify repeating patterns in time series of power consumption
Memory trace needs to be transformed in a time series, we consider locations that were accessed over time
Visualizing an execution graph, loops or repetitions of basic blocks and by using graph analysis to counter obfuscations of the control-flow.
After obfuscation transformations, the control flow is often heavily modified in order to make static analysis more difficult
By recording concrete traces of the execution we intrinsically filter out all the dead/junk code and only focus on the parts of the program that were actually executed
Analyzing cryptographic implementations one or very few traces are often enough as the control-ow of a cryptographic function should not depend on the input data
Even though the original control-ow of the program is transformed, there are still some patterns in the execution trace that remain.
Multiple executions of parts of the code
Distinctive sequences of blocks that are run one after each other
Though this comes at the expense of not reaching complete coverage of the possible execution paths
To visualize the collected data for the analyst, a graph is built from the execution trace
A subset of CFG, considering only executed parts of it
Nodes are the basic blocks in the execution trace while edges represent a transition from one basic block to another
Analysis of the execution graph for countering control-flow flattening
Recorded every memory access and every execution of
basic blocks produced by target binary during one concrete execution.
To isolate the target process from others running on the same system we use the ASID (Address-Space Identifier) register, it determines the location of the page table associated to a process and it is univocally corresponding to it.
One for execution traces
One for recording memory traces
Since we are interested in all memory accesses and instructions that are executed by one single process
PANDA records at records at a system level, without having the notion of processes
PANDA only record non deterministic output/input to optimize traces for size
Malware often have anti-debugging techniques that detect if it is running in a sandbox or debugging environment
PANDA offers instrumentation at emulator level able to evade such technique
Dynamic Taint Analysis
Program is formed by a sequence of instructions that are executed by the processor, these instructions operate on the memory
Altering the flow of control within the code
Reordering statements ,methods, loops
Hiding actual control flow behind irrelevant conditional statements
Compiled code follows the principle of spatial locality of logically related basic blocks. Also, blocks that are usually executed near in time are placed adjacent in the code.
Transformations that involve reordering and unconditional branches break these properties.
Opaque predicate is a special conditional expression whose value is known to the obfuscator, but is difficult for an adversary to deduce statically
Used in combination with a conditional jump: the correct branch will lead to semantically relevant code, the other one to junk code
Functions inlining is the process of including a subroutine into the code of its caller
Function outlining means separating a function into smaller independent parts.
Using control flow constructs in an uncommon way is an effective way for making a control graph not very meaningful to an analyst.
Obfuscation of data structures used by the program
Applied at function-level
Function is modifed such that, basically, every branching
construct is replaced with a big switch statement (Different implementations use if-else constructs, calling of sub-functions, etc. but the underlying principle remains unaltered)
Implementation of a custom virtual machine
An ad-hoc instruction set is defined and selected parts of the program are converted to opcodes for the VM. At runtime the newly created bytecode will be interpreted by the virtual machine, achieving a semantically equivalent program.
An adversary needs to first reverse engineer the virtual machine implementation and understand the behavior of each opcode. Only after these operations it will be possible to decompile
the bytecode to actual machine code.