Detecting Malicious Behaviors of Software through Analysis of API Sequence k-grams

Nowadays, software is widely applied to increase accuracy, efficiency, and convenience in various areas of our lives, so software has become essential in modern computing environments. Despite these valuable applications, malicious behaviors that exploit software vulnerabilities threaten the security of computing environments. It is therefore important to identify and detect malicious behaviors of software. In this paper, we propose an approach to detecting malicious behaviors of software by analyzing information about API function calls. API functions are used in software development to access the various services provided by operating systems or devices. In addition, API functions can describe the behaviors of software because they perform predefined, specific operations during program execution. We classify API functions in Microsoft Windows operating systems and propose an approach to representing malicious behaviors of software with API functions. Malicious behaviors are then detected by analyzing dynamic API function calls. To increase the efficiency and the tolerance of the analysis, malicious behaviors are abstracted as sets of k-grams, and they are identified by calculating the similarity between the sets of k-grams and a sequence of API function calls.


Introduction
Malware is an abbreviation of malicious software, that is, software whose execution includes malicious behaviors. Malware code can be added to a software system, or existing code can be modified, so that intended malicious functions adversely affect the system [4,5]. Malware therefore harms the normal execution of a software system. In modern computing environments, software is widely used in various areas to increase productivity, efficiency, and convenience. Despite these widespread useful applications, malicious behaviors that abuse software vulnerabilities are a major concern in maintaining safe computing environments. Detecting such malicious behaviors therefore helps improve the safety of computing environments.
Methods for detecting malicious behaviors fall into two classes according to how they identify the malicious code contained in programs. The first class is signature-based methods, which detect malicious code in programs by finding previously known signatures of malware. A malware signature is a specific pattern of code or data that is used to perform malicious operations in programs without the user's permission. To apply this approach, the signature set of a malicious behavior must be analyzed and defined before detection. This method is very effective for known malware that has already been analyzed, because such malware can be detected simply by confirming that its signatures exist in a program. However, it cannot detect newly appearing malware for which no signatures exist yet. The second class is behavior-based methods, which detect malicious behavior in programs by analyzing their static or dynamic behaviors. The behaviors of programs can be described in various ways to represent the essential goals and operations identified by static analysis or by dynamic analysis during program execution. So, if malicious behaviors of software can be described and identified through behavior analysis, they can be detected by confirming the existence of those behaviors. In this paper, we describe an approach to identifying malicious behaviors of software through dynamic analysis of API function calls. This paper is organized as follows. Section 2 summarizes related work on detecting malicious behaviors of software. Section 3 describes the proposed algorithm for tracing API function calls in the Microsoft Windows operating system. Section 4 describes the method for analyzing API function calls extracted during program execution.
Section 5 presents approaches to identifying malicious behavior of software through analysis of API function calls and k-gram abstraction. Finally, Section 6 concludes this paper.

Related Work
There have been several studies on detecting malicious behaviors of software. Wang et al. [1] presented an approach to virus detection based on analysis of the suspicious behaviors indicated by sequences of API calls in Windows environments. This approach applied a Bayes algorithm to identify suspicious behaviors and detect viruses. Beaucamps et al. [2] presented an approach to malware detection by abstraction of program behaviors. The technique abstracts program traces by rewriting the original information into abstract symbols that represent its functionality. Suspicious behaviors are then detected by comparing trace abstractions to identify malicious behaviors. Babic et al. [3] proposed an approach to learning and generalizing from observed malware behaviors based on tree automata generated from system call dataflow dependency graphs. Alazab et al. [5] presented an approach to analyzing and classifying the behavior of API function calls based on the malicious intent hidden within a packed program. The method defined six main categories of suspicious behavior in API call features and analyzed the API call distribution of malware samples. Veeramani et al. [8] proposed a malware detection method based on extracting relevant API calls from subcategories of malware. Malware is categorized by its infection mechanism and by the actions it performs, according to the types of API calls in each malware category. Wang et al. [9] presented an approach to malware detection based on analysis of the representative characteristics and a systematic description of the suspicious behaviors indicated by the sequences of APIs called under Windows. Faruki et al. [10] proposed a behavior model that represents an abstraction of a binary by analyzing the API strings made by Windows Portable Executable (PE) files. The method extracted temporal snapshots of malware and benign executables, known as API Call-grams.
Alazab et al. [12] proposed an approach to detecting obfuscated malware by investigating the structural and behavioral features of API calls. This method identified malicious behavior by using n-gram statistical analysis of binary content. Elhadi et al. [13] presented an approach to detecting malware using an API call graph. In this approach, a malware sample was represented as a data-dependent API call graph. Then, a graph matching algorithm based on the longest common subsequence algorithm was used to calculate the similarity between the input sample and the malware API call graph. Ravi et al. [14] proposed an approach to detecting malware by using the Windows API call sequence. This method used k-grams to model the API calls and applied an iterative machine learning process combined with run-time monitoring of program execution behavior.

Monitoring API Calls
In software development, functions are used to perform specific tasks during program execution. The design and use of functions increase reusability and productivity by modularizing software. From the viewpoint of software behavior, information about the function calls in software can be used to summarize the specification of that behavior. This calling information therefore helps in understanding how the software behaves and how its components interact with each other. By monitoring API function calls, the information can be collected and analyzed to discover the behavior of software. To monitor API calls in software, a hooking method can be used to intercept function calls and record the analyzed information of the calls [1]. API functions and system calls are related to services provided by operating systems. They support various key operations provided by operating systems, such as networking, security, system services, and file management. In addition, they include various functions for utilizing system resources, such as memory, the file system, the network, or graphics. Because software has no other way to access system resources managed by the operating system, API functions or system calls are necessarily used to access system resources during program execution. Given this characteristic, it is essential to analyze the patterns of API function calls that represent the interactions between software and operating systems. In addition, the patterns of API function calls provide key information for tracking the activity of software and representing its behaviors. So, analysis of API functions and system calls plays an important role in behavior analysis of malware.
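The idea of hooking can be illustrated in miniature. The following is a minimal sketch in Python, not the Windows hooking mechanism itself: a wrapper intercepts each call to a function and records it before passing control to the original. The names `hook`, `call_log`, `open_file`, and `close_file` are hypothetical stand-ins chosen for illustration.

```python
import functools

# Recorded calls, in the order they occurred: (function name, positional args)
call_log = []


def hook(func):
    """Wrap func so that each call is recorded before the original runs."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        call_log.append((func.__name__, args))
        return func(*args, **kwargs)
    return wrapper


@hook
def open_file(path):
    # Hypothetical stand-in for an API function that returns a handle
    return f"handle:{path}"


@hook
def close_file(handle):
    # Hypothetical stand-in for an API function that releases a handle
    return True


handle = open_file("a.txt")
close_file(handle)
called_names = [name for name, _ in call_log]
print(called_names)
```

The recorded list of names plays the role of the API call trace that later analysis steps consume.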
In detecting malicious behaviors of software, the API calling sequence also provides important information, because API function calls represent the behaviors that software performs to use system resources in operating systems, such as file management, network access, system service access, and memory access. To trace API function calls in Microsoft Windows runtime environments, we first analyze the static characteristics of the software through Windows PE file analysis. Then, we attach the runtime analyzer to the software to perform dynamic analysis. Figure 2 shows the main algorithm for tracing API functions called during program execution in the Microsoft Windows operating system. The analyzer consists of three phases, namely main_analyzer, API_manager, and API_monitor. First, main_analyzer initializes the overall analysis procedure before the program is executed. It prepares for the analysis by attaching the API_manager analyzer to the program. After calculating the original entry point (OEP) of the software, it sets a breakpoint at that location to initiate the API analysis procedure. The initialization procedure sets up the analysis environment and then begins execution of the program and the analysis.
API_manager is a procedure for handling the event that occurs at the breakpoint set on the OEP of the program at step (2). This procedure collects the information used in tracing API function calls during program execution, and it controls the continuation of the analysis and of the execution of the software. It obtains the information about the dynamic link libraries (DLLs) loaded during program execution and the list of API functions from the import address table through analysis of the PE file of the program. After calculating the memory load address of each API function, it saves the address and the function name in order to monitor the API function calls. Then, the procedure sets a breakpoint on the address of every API function and attaches the API_monitor procedure to analyze API traces when API functions are called during execution. After setting the breakpoints and attaching the monitor to the program, it continues execution to monitor and analyze the API function calls.
API_monitor handles the events that occur at the breakpoints set on the addresses of API functions. Whenever an API function is called, API_monitor is invoked to monitor the call and analyze the behavior of the software. API_monitor can determine the address at which the program is currently executing from EIP (the instruction pointer), and it can find the name of the API function loaded at that address from the information saved by API_manager at step (7). After processing the information of the API function call, it continues the execution to perform the next operations. The API_monitor steps of the algorithm are as follows:
(10) When the executing program reaches a breakpoint set on the memory load address of an API function, the analyzer stops the execution.
(11) The analyzer reads the current EIP (instruction pointer) and gets the API function name from the information saved at step (7).
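The bookkeeping performed by API_manager and API_monitor can be sketched as a toy simulation. This is a sketch under stated assumptions, not the analyzer itself: no process is debugged, and the class `ToyTracer`, the function `run`, the import-table offsets, and the load base are all hypothetical. It shows only how saved (address, name) pairs let the monitor resolve the instruction pointer to an API name.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTracer:
    """Toy model of the API_manager / API_monitor bookkeeping (no real debugging)."""
    breakpoints: dict = field(default_factory=dict)  # load address -> API name
    trace: list = field(default_factory=list)        # API names in call order

    def api_manager(self, import_table, load_base):
        # Save the load address and name of each imported API function;
        # saving an address also marks it as a breakpoint in this model.
        for name, offset in import_table.items():
            self.breakpoints[load_base + offset] = name

    def api_monitor(self, eip):
        # Invoked for each address reached: resolve EIP to an API name.
        name = self.breakpoints.get(eip)
        if name is not None:
            self.trace.append(name)
        return name


def run(tracer, executed_addresses):
    # Stand-in for program execution: each entry is a value that EIP takes.
    for eip in executed_addresses:
        tracer.api_monitor(eip)
    return tracer.trace


# Hypothetical import table: API name -> offset from the load base
imports = {"CreateFileW": 0x10, "ReadFile": 0x20, "CloseHandle": 0x30}
tracer = ToyTracer()
tracer.api_manager(imports, load_base=0x7000)
trace = run(tracer, [0x7010, 0x1234, 0x7020, 0x7030])
print(trace)
```

The address 0x1234 carries no breakpoint, so it leaves no entry in the trace; only calls that land on a saved API address are recorded.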

Analyzing API Calls
In this section, we describe an approach to analyzing the trace of API function calls to represent malicious behaviors of software. To abstract and summarize the behaviors of software, API functions are classified into several groups according to their functions. Then, malicious behaviors are represented as behavior automata that describe the specific behaviors of malicious software. In general, malware tries to hide its presence. However, while performing the operations intrinsic to its malicious behavior, it must exhibit suspicious behaviors that are distinct from the behaviors of benign programs. So, we can detect possible maliciousness if the software exhibits such suspicious behavior during program execution. Because most suspicious behaviors of this kind are performed in kernel mode or as services provided by operating systems, API functions and behaviors can be classified into several groups according to the characteristics and the specifications of the API functions. For example, Table 1 shows the group of API functions used to analyze behaviors that exploit file operations. Careful monitoring of classified API function calls helps identify the behaviors of software and detect malicious behaviors.
Because the length and the number of API function call sequences increase continuously as the program executes, an effective representation and identification method is required. The method should take a form that can efficiently express the behaviors of software during program execution. Monitoring and analyzing every kind of API function call is not appropriate for representing and detecting malicious behavior, because too many API functions are unrelated to malicious behaviors. The group of API functions related to malicious behaviors plays an important role in expressing the behavior of software efficiently. The first step is to refine the set of API functions to be analyzed for each individual malicious behavior, in order to avoid side effects caused by unrelated API function calls. By refining the API function set of a behavior, we can efficiently decrease the length and the number of API function call sequences. Table 1 shows an example of refining the set of API functions to analyze the behaviors that are related to file operations.
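The refinement step amounts to filtering a trace against a group of API functions. The following is a minimal sketch; the contents of Table 1 are not reproduced here, so the membership of `FILE_API_GROUP` is an assumption made for illustration (the names are Windows API functions, but the grouping is hypothetical), as is the helper name `refine_trace`.

```python
# Hypothetical file-operation group; the actual contents of Table 1 may differ.
FILE_API_GROUP = frozenset(
    {"CreateFileW", "ReadFile", "WriteFile", "DeleteFileW", "CloseHandle"}
)


def refine_trace(trace, api_group):
    """Keep only the calls that belong to the API group of interest."""
    return [call for call in trace if call in api_group]


raw_trace = ["LoadLibraryW", "CreateFileW", "GetTickCount",
             "WriteFile", "Sleep", "CloseHandle"]
refined = refine_trace(raw_trace, FILE_API_GROUP)
print(refined)
```

The refined trace is shorter than the raw one and preserves the relative order of the retained calls, which is what the later sequence analysis depends on.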
Each API function call performs a basic operation in the course of a complex task, which may be either malicious or benign. So, the patterns and sequences of API function calls can be classified into several groups according to the tasks of the software. Malicious behaviors are then extracted from the patterns and sequences of API function calls. The malicious behaviors are represented as behavior automata to describe and abstract the maliciousness effectively. Figure 2 shows a behavior automaton for an excerpt of the Allaple.A worm that pings remote hosts and scans local drives [2]. The behavior automaton represents the distinguishing malicious behavior of the software. It shows the operations of the Allaple.A worm, which pings remote hosts to find and infect other systems and scans local drives to infect and propagate to other systems without the owner's permission.
The automaton for a malicious behavior is constructed from the control flow graph of a program. First, the control flow graph of the program is abstracted by extracting the API calls in the program. From this abstract control flow graph of API functions, the malicious behavior is analyzed by tracing the API function calls of the behavior. Finally, the API flow graph of the malicious behavior is represented as an automaton. The resulting behavior automaton can be used to detect malicious behaviors when analyzing software during execution.

Identifying Malicious Behaviors
Malicious behaviors of software can be represented as automata, and a behavior is recognized by confirming that one of the behavior automata matches the sequence of API function calls made during program execution [3]. The recognition is accomplished by tracing the malicious behavior automaton against the API calling sequence extracted from a program execution. So, after a sequence of API function calls is extracted during program execution, the sequence is analyzed to decide whether or not it matches a malicious behavior.
In tracing the malicious behavior automaton of a program, the first transition starts from the start state of the behavior automaton. The current state of the behavior automaton moves to the next state by following each API function call. In the process of tracing the malicious behavior automaton with a sequence of API function calls, if the next state is one of the final states of the automaton, the sequence matches the behavior automaton. The malicious behavior is therefore identified by confirming that the sequence satisfies the behavior automaton. Figure 5 shows the analysis procedure for identifying malicious behaviors with a trace of API function calls.
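The tracing procedure can be sketched as a small state machine. This is a minimal sketch under two assumptions: the two-step automaton below is a hypothetical illustration (it is not the automaton of any figure in this paper), and calls that have no transition from the current state are simply ignored, which is one possible policy among several. The function name `matches_behavior` is also hypothetical.

```python
def matches_behavior(transitions, start, finals, trace):
    """Return True once the trace drives the automaton into a final state."""
    state = start
    for call in trace:
        # Assumption: a call with no outgoing transition leaves the state unchanged.
        state = transitions.get((state, call), state)
        if state in finals:
            return True
    return state in finals


# Hypothetical automaton: create an ICMP handle, then send an echo request.
transitions = {
    ("q0", "IcmpCreateFile"): "q1",
    ("q1", "IcmpSendEcho"): "q2",
}
finals = {"q2"}

suspicious = ["GetTickCount", "IcmpCreateFile", "Sleep", "IcmpSendEcho"]
out_of_order = ["IcmpSendEcho", "IcmpCreateFile"]

hit = matches_behavior(transitions, "q0", finals, suspicious)
miss = matches_behavior(transitions, "q0", finals, out_of_order)
print(hit, miss)
```

Because the match requires the exact order of transitions, a small reordering of calls is enough to defeat it, which motivates the more tolerant abstraction discussed next.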
As execution time passes, the number and the length of API function call sequences increase rapidly. So, it is preferable to complete the detection process at an early stage of the analysis. If detection of malicious behaviors does not succeed early in the analysis, the complexity of the analysis increases rapidly over time. To make matters worse, if there is a minor mutation in the behavior of malware, the analysis for identifying the malicious behavior will fail because of the minor differences in API function calls caused by the mutation. To mitigate this problem, it is helpful to abstract the behavior automata of malicious behaviors and to identify the behaviors by calculating the similarity between the abstract behaviors and the API function calls.
A k-gram denotes a contiguous substring of length k, which can consist of letters, words, or opcodes in a binary program [6]. In this analysis, the set of k-grams generated from the behavior automaton is used to abstract a malicious behavior. The set of k-grams is then compared with a sequence of API function calls to calculate the similarity between the behavior and the API function calls. The k-gram method considers only subsequences of k consecutive calls from the full sequence of API calls in the comparison. Even if there is a minor mutation in the behavior of malware, such as the insertion or omission of a few API function calls, the mutation affects only the few k-grams adjacent to the change. The remaining k-grams, which are not affected by the mutation, can still be matched. If the similarity between the set of k-grams and the sequence of API function calls is sufficiently high, it is good evidence that the software at least partially includes the malicious behavior.
Figure 6. The analysis procedure for identifying a behavior automaton through k-gram abstraction
Figure 6 shows the analysis procedure for identifying malicious behaviors by comparing a trace of API function calls with abstract behaviors represented as a k-gram set with k = 3. In the procedure, the original behavior automaton is abstracted into a set of k-grams, which are substrings of API functions of length k = 3 taken from the behavior automaton. The behavior k-grams are then compared with a trace of API function calls.
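The locality of a mutation's effect on k-grams can be checked with a short sketch. The helper `kgrams` is a hypothetical name, and single letters stand in for API function names; the sequences are illustrative and are not taken from Figure 6.

```python
def kgrams(sequence, k=3):
    """Return the set of contiguous length-k windows of a call sequence."""
    return {tuple(sequence[i:i + k]) for i in range(len(sequence) - k + 1)}


# Letters stand in for API function names.
original = list("ABCDEFGH")
mutated = list("ABCDXEFGH")   # one extra call "X" inserted after "D"

original_grams = kgrams(original)
mutated_grams = kgrams(mutated)
shared = original_grams & mutated_grams

print(len(original_grams), len(mutated_grams), len(shared))
```

With k = 3, the single insertion destroys only the two original k-grams that span the insertion point (C-D-E and D-E-F); the other four still match.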
Let S be the set of k-grams of a malicious behavior and A be the set of k-grams generated from a trace of API function calls during program execution. The similarity between the behavior and the API function calls is calculated from these two sets, where |S| denotes the number of elements in the set S. The similarity is thus determined by the number of identical k-grams shared by the behavior and the API function calls. Because the original behavior automaton is abstracted into k-grams, the analysis can tolerate minor differences in API function call sequences to a certain degree, which can be defined by a difference ratio between the behavior k-grams and the API function calls. In the example, the calculation gives a similarity of 80%. If the similarity value is higher than a predefined tolerance value, then we can identify the presence of the malicious behavior from the sequence of API function calls during program execution.
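One plausible reading of the similarity, assumed here rather than taken from the paper, is the fraction of behavior k-grams that also occur in the trace, |S ∩ A| / |S|. The sketch below uses that assumption; the function names, the letter sequences standing in for API names, and the threshold value 0.7 are all hypothetical.

```python
def kgrams(sequence, k=3):
    """Return the set of contiguous length-k windows of a call sequence."""
    return {tuple(sequence[i:i + k]) for i in range(len(sequence) - k + 1)}


def similarity(behavior_grams, trace_grams):
    """Assumed measure: share of behavior k-grams found in the trace, |S & A| / |S|."""
    if not behavior_grams:
        raise ValueError("behavior k-gram set must not be empty")
    return len(behavior_grams & trace_grams) / len(behavior_grams)


# Letters stand in for API function names.
S = kgrams(list("ABCDEFG"))      # behavior: 5 k-grams
A = kgrams(list("XYBCDEFGZ"))    # trace: the first behavior call is missing

score = similarity(S, A)
TOLERANCE = 0.7                  # hypothetical threshold
detected = score > TOLERANCE
print(score, detected)
```

Here four of the five behavior k-grams occur in the trace, so the score is 0.8 and exceeds the assumed threshold. Dividing by |S| rather than by the size of the union keeps the score from shrinking as the trace grows, which suits the case where a short behavior is sought inside a long trace.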

Conclusions
Nowadays, software is applied in various areas to increase accuracy, efficiency, and convenience in our lives, and it provides valuable operations in modern computing environments. In some cases, however, malicious behaviors that exploit software vulnerabilities harm these environments. Securing safe computing environments by detecting malicious behaviors of software is therefore a major concern. In this paper, we proposed an approach to detecting malicious behaviors of software by analyzing information about API function calls. A malicious behavior of software is represented as a behavior automaton generated from the API flow graph of a program. To recognize the malicious behavior, its behavior automaton is traced with a sequence of API function calls extracted during program execution. To increase the efficiency and the tolerance of the analysis, a behavior automaton is abstracted as a set of k-grams. A malicious behavior can then be identified by calculating the similarity between the set of k-grams and a sequence of API function calls.