2018 - Towards Generic Deobfuscation of Windows API Calls
Information
What is
eax register
It is used for I/O port access, arithmetic, interrupt calls
First-order logic
Monitoring API calls is a common way to gain insight into malware behavior
Malware authors obfuscate API calls; the IAT is left empty or populated with pointers to unrelated functions
A Windows executable stores the addresses of the API functions it depends on in the Import Address Table (IAT)
On Windows, API functions reside in Dynamic Link Libraries (DLLs)
Obfuscated API calls are resolved in an ad hoc manner, separately from the Windows loader
Tackling obfuscation in API calls
Static analysis
Writing a script that recovers the missing API names by reverse engineering the obfuscation scheme
Drawbacks
Time-consuming for malware families that deploy complex obfuscation routines
Inflexible: minor changes to the obfuscation scheme break the deobfuscation script
Dynamic analysis
Logging API calls as the malware executes in a controlled environment
Drawbacks
Explores only one execution path per run
Malware may employ anti-analysis techniques that detect the controlled environment and thwart the analysis
Approach
Static analysis approach
Generic deobfuscation of Windows API calls
Predicting the API function from its arguments and the context in which it is called
Symbolic Execution Engine
Supports a limited number of x86 instructions that are assumed to be crucial for API call recognition
Designed to process individual functions as opposed to full programs
Executes the longest path to prevent path explosion and determines whether a call address is an API function or a function jumping to an API function
Removes the arguments of the function from the stack
Sets the eax register to a dummy symbolic value
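The call handling described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the state layout, the `handle_call` name, the `arg_count` parameter, and the `ret_N` naming of the dummy value are all assumptions made for this example.

```python
def handle_call(state, arg_count):
    """Model the effect of a call on a (hypothetical) symbolic state.

    state["stack"] is a list whose last element is the top of the stack,
    state["regs"] maps register names to concrete or symbolic values, and
    state["calls"] counts the calls seen so far.
    """
    stack = state["stack"]
    # Remove the arguments from the stack; the top of the stack is taken
    # to be the first argument (arguments pushed right to left).
    args = [stack.pop() for _ in range(min(arg_count, len(stack)))]
    # The return value is unknown, so eax receives a dummy symbolic value.
    state["regs"]["eax"] = f"ret_{state['calls']}"
    state["calls"] += 1
    return args


state = {"stack": ["old", 0x40, "reg_0", 0], "regs": {"eax": 5}, "calls": 0}
collected = handle_call(state, 3)
```

The collected arguments are what a later stage would turn into features; how the real engine determines the argument count is not covered by this sketch.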
Vectorization
Hidden Markov Models
Statistical model for sequential data
Data Collection
Steps
Extract functions from a Windows executable using radare2; each function is represented as (virtual address, sequence of instructions)
Build the control-flow graph (CFG) of the function and find the longest execution path (largest number of edges)
Symbolically execute the sequence of instructions of the function
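The longest-path step above can be sketched in a few lines. This is a hedged illustration: the dictionary-based CFG format and the `longest_path` name are hypothetical, and dropping loop back edges to make the graph acyclic is an assumption of this sketch, not necessarily how the paper handles loops.

```python
def longest_path(cfg, entry):
    """Return the path with the most edges from entry, ignoring loops.

    cfg maps a basic-block id to a list of successor block ids.
    """
    # Pass 1: depth-first search that drops back edges (edges into a
    # block that is still on the DFS stack), leaving an acyclic graph.
    state = {}  # block -> "open" while on the stack, "done" afterwards
    dag = {}    # block -> successors with back edges removed

    def visit(block):
        state[block] = "open"
        dag[block] = []
        for succ in cfg.get(block, []):
            if state.get(succ) == "open":
                continue  # back edge: skip it to break the cycle
            dag[block].append(succ)
            if succ not in state:
                visit(succ)
        state[block] = "done"

    visit(entry)

    # Pass 2: longest path in the acyclic graph by memoized recursion.
    best = {}

    def walk(block):
        if block not in best:
            tails = [walk(succ) for succ in dag[block]]
            best[block] = [block] + max(tails, key=len, default=[])
        return best[block]

    return walk(entry)


# Block 3 loops back to block 0; that edge is ignored.
example_cfg = {0: [1, 2], 1: [3], 2: [4], 4: [3], 3: [5, 0], 5: []}
path = longest_path(example_cfg, 0)
```

Recursion keeps the sketch short; very large functions would need an iterative traversal to stay within Python's recursion limit.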
Argument representation
Collected argument types
Types
Symbolic expression
Can be any combination of symbolic values and supported operations between them
Symbolic value
Mapped onto a set of strings
Mapped to
reg
var
mem
ret
*
Integer
Predefined values
e.g. permission constants, flags, enumerations
Pointers
i.e. addresses in memory
Arbitrary values
e.g. size of memory, size of buffer to read into
Problem
Mixed dataset
The dataset is of mixed type
Mixed types complicate the modelling process
Vocabulary is too large
The three argument types are mapped into finite sets so that arguments can be modelled with a single categorical distribution
Simulating memory
Key-value store; the key is the memory address written to or read from
When the value of an address is unknown, emit a symbolic value
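A minimal sketch of such a simulated memory is shown below. The `SymbolicMemory` class, the `mem_N` naming, and the choice to remember the emitted symbol so that repeated reads agree are assumptions made for illustration, not details taken from the paper.

```python
import itertools


class SymbolicMemory:
    """Key-value store keyed by the memory address read or written."""

    def __init__(self):
        self._cells = {}                 # address -> concrete or symbolic value
        self._fresh = itertools.count()  # numbering for new symbolic values

    def write(self, address, value):
        self._cells[address] = value

    def read(self, address):
        # Unknown address: emit a fresh symbolic value and store it so a
        # second read of the same address returns the same symbol.
        if address not in self._cells:
            self._cells[address] = f"mem_{next(self._fresh)}"
        return self._cells[address]


memory = SymbolicMemory()
memory.write(0x1000, 42)
known = memory.read(0x1000)    # 42
unknown = memory.read(0x2000)  # "mem_0"
```

Any hashable key works as an address, so symbolic addresses represented as strings could be stored the same way.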
Prediction
Multinomial Logistic Regression
A predictive model, or classifier, is trained to learn a mapping from feature vectors onto API function names.
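The prediction side of multinomial logistic regression can be sketched in pure Python: each class gets a linear score, and a softmax turns the scores into probabilities. The function names, weights, features, and the two example labels are made up for illustration. Training is omitted, and none of the numbers come from the paper.

```python
import math


def predict_proba(weights, biases, features):
    """Softmax over one linear score per class."""
    scores = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    peak = max(scores)  # subtracted for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(labels, weights, biases, features):
    """Return the label with the highest predicted probability."""
    probs = predict_proba(weights, biases, features)
    return labels[probs.index(max(probs))]


# Illustrative values only: two classes, two features.
labels = ["CreateFileA", "VirtualAlloc"]
weights = [[2.0, 0.0], [0.0, 2.0]]
biases = [0.0, 0.0]
guess = predict(labels, weights, biases, [1.0, 0.0])
```

In practice the weights would be fitted on labelled feature vectors by a library implementation rather than written by hand.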
Symbolic Execution
Symbolic values
Unknown values are represented by symbols
Able to operate on symbolic values
Control flow path
Each control-flow path has a first-order logic formula
Describes the conditions that must be satisfied for the program to take that path
Drawbacks
Path Explosion
Expensive to run on large functions / programs
Goals
Problem
Malware authors employ obfuscation techniques to hide API calls
Focus
32-bit Windows executables and DLLs
25 most used API functions
Authors
Vadim Kotov
Contribution