2016 - CodeXt: Automatic Extraction of Obfuscated Attack Code from Memory Dump
Information
CodeXt
Malware code extraction framework built upon selective symbolic execution (S2E)
Able to extract attack code from a memory dump and accurately pinpoint its exact start and boundaries, even if it is mingled with random bytes in the dump
Able to extract attack code even if it is protected by multiple layers of sophisticated encoders without using any signature or pattern of the decoder
Able to automatically collect relevant intermediate results during multi-layered decoding, revealing obfuscations used at each layer
Able to merge all hidden code fragments into logically related collections
Able to validate the extracted hidden code via symbolic execution, verifying that its execution leads to the detection conditions reported by the intrusion or malware detection system
Does not rely on any signature or pattern of any particular decoder
Approach
S2E: Selective symbolic execution
Supports in-vivo multi-path analysis and allows us to execute any basic block either concretely with QEMU or symbolically with KLEE
Uses a combination of symbolic and concrete execution during analysis
Symbolic Execution
Pinpoint the exact code start and boundaries by exploring all the legitimate execution start points and paths
Concrete execution
Handle potential dynamic binary transformation and self-modifying code
Assumption
Some intrusion or malware detection system can detect the execution of attack code in real time; it dumps the memory around the instruction where the attack was detected, along with other attack context information
Assume the attack context information includes some system call triggered by the attack code and corresponding register values
Dumped memory is large enough to contain all hidden attack code present in the runtime memory when the attack was detected
No infinite loop in the attack code; the system terminates after a configurable maximum number of instructions has been executed
Online component
S2E plugins which can monitor, track, and direct the selective symbolic execution of any given byte stream by exploring all execution paths from all offsets
Filters out impossible code snippets
Records those that are feasible and satisfy the attack context information given
Offline component
Further analyzes the online results to derive the hidden code’s start and boundaries
Locating hidden code
Determine the existence of, exact start, and the boundaries of any hidden code from a given memory dump
To leverage the system call information from the IDS, we have developed an S2E plugin to catch all the system calls triggered from within a given memory dump
The hidden code is usually mingled with random data/code
Every offset in the memory dump is treated as a possible logical start, or entry point, of the hidden code
Online kill conditions
To avoid unnecessary symbolic execution
Immediately terminate an offset’s execution
Condition
An instruction does not align with the known system call
Invalid memory access such as a segmentation fault
Exception due to an invalid instruction
Detected system call number or address does not match given context from the IDS
Execution of end-of-path system calls
Jumps out of bounds of the memory buffer
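The kill conditions above can be sketched as a single check applied to each executed step. This is a minimal illustration, not the S2E plugin itself: the `Step`, `Context`, and `kill_reason` names are hypothetical, only the system call number (not its address) is compared, and the instruction-alignment condition is left out because the notes do not define it precisely.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class Event(Enum):
    INSTRUCTION = auto()
    SEGFAULT = auto()
    INVALID_INSTRUCTION = auto()
    SYSCALL = auto()
    JUMP = auto()


@dataclass
class Step:
    """One observed execution step (hypothetical structure)."""
    event: Event
    target: int = 0            # jump target address, used for JUMP only
    syscall_number: int = -1   # used for SYSCALL only


@dataclass
class Context:
    """Attack context assumed to come from the IDS, plus buffer bounds."""
    buffer_start: int
    buffer_end: int            # exclusive
    expected_syscall: int
    end_of_path_syscalls: frozenset = frozenset()


def kill_reason(step: Step, ctx: Context) -> Optional[str]:
    """Return why this offset's execution should stop, or None to continue."""
    if step.event is Event.SEGFAULT:
        return "invalid memory access"
    if step.event is Event.INVALID_INSTRUCTION:
        return "invalid instruction"
    if step.event is Event.JUMP:
        if not (ctx.buffer_start <= step.target < ctx.buffer_end):
            return "jump out of bounds"
    if step.event is Event.SYSCALL:
        if step.syscall_number in ctx.end_of_path_syscalls:
            return "end-of-path system call"
        if step.syscall_number != ctx.expected_syscall:
            return "system call mismatch"
    return None
```

The system call numbers passed in are whatever the detection system reported; any concrete values used when calling this sketch are purely illustrative.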
Record the symbolically executed instructions that end with a system call as a code fragment for each starting offset
Any application level attack code must execute one or more segments of privileged code (i.e., system calls) to cause any real harm
To model code with multiple system calls, we define a code chunk as a sequence of code fragments in a control flow. To extract code with multiple system calls, we merge adjacent code fragments into a code chunk
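Merging fragments into chunks can be illustrated with a short sketch. It assumes, for illustration only, that each fragment is a non-overlapping `(start, end)` address range with an exclusive end, and that two fragments are "adjacent" when one begins exactly where the previous one ends; the notes do not spell out the adjacency test, and `merge_fragments` is a hypothetical name.

```python
def merge_fragments(fragments):
    """Group adjacent (start, end) fragments into chunks.

    Assumption: fragments do not overlap, `end` is exclusive, and a
    fragment is adjacent to the previous one when its start equals
    that fragment's end.
    """
    chunks = []
    for start, end in sorted(fragments):
        if chunks and chunks[-1][-1][1] == start:
            # Continues directly after the last fragment of the current chunk.
            chunks[-1].append((start, end))
        else:
            # Gap before this fragment: begin a new chunk.
            chunks.append([(start, end)])
    return chunks
```

With this definition, fragments at `(0x10, 0x20)` and `(0x20, 0x2c)` form one chunk, while a fragment at `(0x40, 0x48)` starts a separate one.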
Handling Self-Modifying Code
Recovering transient code involved in multiple layers of self-modification
Need to take snapshots for each layer of decoding
Self-modifying code
Executing dynamically generated instruction
Can be reliably identified if any instruction consists of bytes written by the code under observation
Achieved by tracking all the memory updates within the memory buffer range at run-time.
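The write-tracking idea can be sketched as follows. This is an illustrative model rather than the actual S2E plugin: `WriteTracker` and its methods are hypothetical names, and the caller is assumed to supply every memory write and every executed instruction's address and length.

```python
class WriteTracker:
    """Tracks bytes written inside the monitored buffer (hypothetical helper)."""

    def __init__(self, start, end):
        self.start = start      # first address of the memory buffer
        self.end = end          # one past the last address (exclusive)
        self.written = set()    # addresses written by the observed code

    def record_write(self, addr, size):
        """Remember each written byte that falls inside the buffer range."""
        for a in range(addr, addr + size):
            if self.start <= a < self.end:
                self.written.add(a)

    def is_dynamically_generated(self, pc, length):
        """True if any byte of the instruction at pc was written earlier."""
        return any(a in self.written for a in range(pc, pc + length))
```

An instruction counts as dynamically generated as soon as one of its bytes was previously written, so a partially patched instruction is also flagged.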
We do not want to take a snapshot for each dynamically generated instruction, as one layer of decoding normally consists of multiple correlated instruction blocks
Instead we developed a clustering based approach for obtaining appropriate snapshots of self-modifying code
Maintain a global counter of all executed instructions, and assign its current value to each instruction about to be executed as a unique sequence number that reflects the temporal order of execution
We treat one cluster of writes as one snapshot. We mark those snapshots from which we executed any instructions after the snapshot was created. These marked snapshots correspond to each layer of self-modifying code executed
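A minimal sketch of the snapshot clustering follows. The notes do not state the clustering rule, so this assumes a simple one: writes whose sequence numbers are at most `max_gap` apart belong to the same cluster. `Snapshot`, `cluster_writes`, `mark_executed`, and `max_gap` are all hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    first_seq: int                       # sequence number of the first write
    last_seq: int                        # sequence number of the last write
    written: set = field(default_factory=set)
    executed: bool = False               # set once code from it has run


def cluster_writes(writes, max_gap):
    """Group (seq, address) writes into snapshots by sequence-number gap."""
    snapshots = []
    for seq, addr in sorted(writes):
        if snapshots and seq - snapshots[-1].last_seq <= max_gap:
            snap = snapshots[-1]         # close enough in time: same cluster
            snap.last_seq = seq
        else:
            snap = Snapshot(seq, seq)    # large gap: start a new snapshot
            snapshots.append(snap)
        snap.written.add(addr)
    return snapshots


def mark_executed(snapshots, executions):
    """Mark snapshots whose written bytes were later executed.

    executions: iterable of (seq, address) for executed instruction bytes.
    Returns the marked snapshots, one per decoded layer that actually ran.
    """
    for seq, addr in executions:
        # Credit the most recent snapshot that finished before this
        # instruction ran and that wrote the executed address.
        for snap in reversed(snapshots):
            if snap.last_seq < seq and addr in snap.written:
                snap.executed = True
                break
    return [s for s in snapshots if s.executed]
```

Snapshots that are never executed from, such as plain data writes, stay unmarked and do not count as a decoding layer.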
By stringing the snapshots together, generate a memory map that shows the changes over time. Specifically, it shows every value of every memory byte translated, executed, or written, even if the same memory location was overwritten multiple times during execution
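One way to picture such a memory map is a per-address history ordered by sequence number. The sketch below is an assumption about the data layout, not the tool's actual output format; `build_memory_map` and the `(seq, address, kind, value)` event tuple are hypothetical.

```python
from collections import defaultdict


def build_memory_map(events):
    """Build a per-address history from (seq, address, kind, value) events.

    kind is a label such as "translate", "execute", or "write".
    Earlier values are kept, so overwritten bytes remain visible.
    """
    history = defaultdict(list)
    for seq, addr, kind, value in sorted(events):
        history[addr].append((seq, kind, value))
    return dict(history)
```

Looking up an address in the result lists every value that byte held, in execution order.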
What is?
in vivo
Author
Farley R
Wang X
Contribution
Goals
Problem
Automatically recovering malware attack code is critical to improving effective malware analysis, forensics, and reverse engineering
Existing methods involve substantial manual effort