Learning Program Semantics for Vulnerability Detection via Vulnerability-Specific Inter-procedural Slicing
Update: March 29, 2025
Overview
SnapVuln根据漏洞种类进行过程间切片,针对每种漏洞训练子模型,进行漏洞检测及漏洞种类预测。
Challenges
- single-function detection: Devign, REVEAL (TSE 2021), Draper (ICMLA 2018)
- 漏洞可能跨越多个函数(callee functions)
- 从vulnerable program points开始进行切片 (e.g., dangerous API calls / variables)
- noise:无法区分sink / source,并且包含所有与vulnerable points相关的statement。
- 由于错误执行顺序导致的漏洞语义:使用DDG / PDG切片,没有考虑CFG(程序执行顺序)。e.g., UAF。
Motivation
- 应该引入inter-procedural analysis用于获取更完整的漏洞语义
- 应该设计slicing algorithms用于获取精确的漏洞语义
- 没有考虑CFG,无法反映错误执行顺序导致的漏洞语义(e.g., UAF)。
- 包含与漏洞无关的statements:部分与漏洞无关的语句也依赖于vulnerability-related statements。
Approach - SnapVuln
- 从source code提取由PDG, CFG, CG (call graph) 结合的inter-procedural graph (IG)。
- 根据漏洞种类设计slicing algorithms用于获取漏洞语义,识别source & sink并进一步对IG进行操作,获得相应漏洞种类的IG subgraph。
- 本文针对6种常见漏洞类型设计了用于C/C++的切片算法
- 对每种漏洞种类使用一个submodel学习相应subgraph的代码表示信息
- 将所有submodel对各类型漏洞的预测结果集成,得到综合预测结果。
Evaluation
- RQ2: The Completeness of Vulnerability Semantics
- 不使用切片时,比较在single-function (intra)与muti-function (inter)上,SnapVuln的检测效果。
- RQ3: The Precision of Vulnerability Semantics
- 比较使用不同的切片策略(vuldeepecker-DDG, sysever-PDG, deepwukong-PDG with API / arithmetic operation as criterion),SnapVuln的检测效果。
Limitation
- Vulnerability types. “Vulnerabilities in real-world codebases are diverse, and the slicing algorithms in SnapVuln may not cover every single case.”
- Sample size. 难以有效处理有大量代码的large program
code: (1) https://sites.google.com/view/snapvuln (2) https://github.com/wubozhi/SnapVuln
SnapVuln为各漏洞类型制定的切片策略
基于SnapVuln对漏洞类型的扩展:https://sites.google.com/view/snapvuln/basic-tutorial-for-extention
- buffer overflow
- pattern: stack-based (arrays) & heap-based (pointers)
- source: allocation of arrays & pointers
- sink: function calls / assignments to manipulate pointers / arrays
- function calls: strcpy, memcpy, etc.
- assignment: assigns value to memory that pointer points to or array with index
- slicing: backward slicing on PDG: sink (write to array / pointers)->all sources
- memory leak
- pattern: does not su ciently track and release memory after it has been used
- source: 1) call standard function (i.e., malloc, calloc), 2) call wraper functions (e.g., “cifs_buf_get”, “AcquireQuantumInfo”)
- wrapper functions: analyzing the call graph of the program
- cifs_buf_get (call) mempool_alloc (call) pool->alloc
- sink: the end of the program
- slicing: forward slicing on PDG: source (assign to poiners & invoke callee) -> sink (program end)
- null pointer dereference
- pattern: accesses a pointer that expects to be valid but is NULL instead
- source: 1) call standard function, 2) call wraper functions
- sink: first statement using the pointer
- slicing: criterion - assign return of callee to pointer, 1) forward slicing on PDG, 2) forward slicing on CFG, 3) compute first statement
- integer overflow
- pattern: an integer value is incremented after calculation to a value that is too large to store in the associated representation
- source: assign value to a variable (int, long, short, etc.)
- sink: arithmetic operations
- slicing: backward slicing on PDG: sink (arithmetic operators) -> sources
- use after free
- pattern: allocated heap memory used after release
- source: 1) call standard function, 2) call wraper functions
- sink: first statement using the pointer after the free operation
- hard to identify free operation -> last statement uses the pointer
- slicing: criterion - assign to pointers & invoke callee, 1) forward slicing on PDG, 2) forward slicing on CFG, 3) compute last statement
- double free
- pattern: the same heap pointer is released twice or more in a program
- source: 1) call standard function, 2) call wraper functions
- sink: two or more free operations on pointers -> last statement uses the pointer
- slicing: forward slicing on PDG: source (assign to pointers & invoke callee) -> sink (last statement uses the pointer)