跳转至

The Data-Flow Framework

前置知识

Data Flow Analysis,可以理解为:How Data Flows on CFG?

每次语句的执行,如 Jimple 中的 stmt 执行,都会将输入状态转换为新的输出状态。数据流分析可以分为 forward Analysis 和 backward Analysis 两类。

在 intra-procedural Analysis 中,将不对方法调用进行细致分析。

在 CFG 中,结点有通常两种:

  • 单独的三地址码,例如 Jimple 就是一种三地址码
  • 基本块 Basic Blocks,是三地址码的集合,存在单进单出的性质等

此外,CFG 中分支 meet 时,处理方法有两种:

  • may analysis,此时采用并集 ∪,即可以误报,但不许漏报。
  • must analysis,此时采用交集 ∩ ,即可以漏报,但不许误报。

soot 实现

在 Soot 作者的文档中,其对做一个 flow analysis 分为四个步骤:

  • Decide what the nature of the analysis is. 即考虑 forward Analysis or backward Analysis?是否需要考虑分支信息?
  • Decide what is the intended approximation. 即考虑 may analysis or must analysis?这决定了在结点 meet 时,选择 ∪ or ∩
  • performing the actual flow. 本质上是为中间表示的每个语句编写等式,例如,如何处理赋值语句?
  • entry node and inner nodes 的初始状态. 通常初始状态为空集或全集。例如 available exporessions analysis 中,entry node 初始化为全集(1...1),inner nodes 初始化为空集(0...0)。

Step1: nature of the analysis

Soot 提供了三种实现:

  • ForwardFlowAnalysis,
  • BackwardFlowAnalysis
  • ForwardBranchedFlowAnalysis.

前两种的结果是两个 Map:

  • from nodes to IN sets
  • from nodes to OUT sets

最后一种还添加了分支信息,因此多了一个 Map:

  • from nodes to IN sets,
  • from nodes to fall-through OUT sets
  • from nodes to branch OUT sets.

上述三种都是通过 fixed-point mechanism using a worklist algorithm 来实现的,如果你想自己实现,可以通过继承如下三个类:AbstractFlowAnalysis (the top one), FlowAnalysis or BranchedFlowAnalysis.

举个例子:very-busy expressions,该分析是为了找到必定使用的表达式,例如:if(x>y) then a=1;b=2; else a=1;,此时a=1就是 very-busy expression。

为了进行 very-busy expressions analysis,需要使用 backward analysis,因此我们用于分析的类必须继承 BackwardFlowAnalysis。

然后为了利用 soot 提供的功能,需要提供一个实现如下功能的构造函数:

  • call the super’s constructor
  • invoke the fixed-point mechanism

具体实现如下:

class VeryBusyExpressionAnalysis extends BackwardFlowAnalysis{
    public VeryBusyExpressionAnalysis(DirectedGraph g) {
        super(g);
        doAnalysis();
    }
}

Step2: Approximation level

该步骤用于决定是 may or must analysis,其实就是决定在结点交汇时使用 ∪ or ∩。

仍旧以 very-busy expressions analysis 为例,它使用 must analysis,也就是在结点交汇(meet)时使用交集 ∩。

在 soot 的实现中,通过重写 merge() 方法,来确定 meet 的处理方法:

class VeryBusyExpressionAnalysis extends BackwardFlowAnalysis{
    @Override
    protected void merge(Object in1, Object in2, Object out) {
        FlowSet inSet1 = (FlowSet)in1,
        inSet2 = (FlowSet)in2,
        outSet = (FlowSet)out;
        inSet1.intersection(inSet2, outSet);
    }
    @Override
    protected void copy(Object source, Object dest) {
        FlowSet srcSet = (FlowSet)source,
        destSet = (FlowSet)dest;
        srcSet.copy(destSet);
    }
}

在上述的例子中可以看到,soot 抽象了 lattice element 的表示方式,本例中使用FlowSet。因为这种抽象,因此需要重写 copy()方法,用于将 lattice element 的内容复制给另一个 lattice element。

Step3: Performing flow

此处是数据流分析中的 transfer function,是核心部分。

该阶段的处理分为两个步骤:

  • kill():当流从 IN 经过 NODE 到 OUT 的时候,需要 kill 掉一些已经无效的数据。
  • gen():需要向 OUT 中加入 NODE 生成的数据。

也就是说,最终的 OUT[NODE] = genNODE ∪ ( IN[NODE] - killNODE )

class VeryBusyExpressionAnalysis extends BackwardFlowAnalysis{
    @Override
    protected void flowThrough(Object in, Object node, Object out) {
        FlowSet inSet = (FlowSet)source,
        outSet = (FlowSet)dest;
        Unit u = (Unit)node;
        kill(inSet, u, outSet);
        gen(outSet, u);
    }
    private void kill(FlowSet inSet, Unit u, FlowSet outSet){ /* 用户自定义实现 */ }
    private void gen(FlowSet outSet, Unit u) { /* 用户自定义实现 */ }
}

其中 kill()gen()方法需要用户根据实际的情况实现,可以参考示例源代码

Step4: Initial state

该步骤用于对 entry node 以及其 inner node 进行初始化,分别由entryInitialFlownewInitialFlow两个方法实现:

class VeryBusyExpressionAnalysis extends BackwardFlowAnalysis{
    @Override
    protected Object entryInitialFlow() {
        return new ValueArraySparseSet();
    }
    @Override
    protected Object newInitialFlow() {
        return new ValueArraySparseSet();
    }
}

在 very-busy expressions analysis 中,entry node 是最后一句语句,并用空集来初始化它,其余的 inner node 也全部都初始化为空集。

FlowSet

FlowSet 代表着与控制流图 CFG 中 结点 node 相关联的数据集。对于 very-busy expressions analysis,a node’s flow set 代表 a set of expressions busy at that node.

有两种 FlowSet:

  • bounded FlowSet:A bounded set is one that knows its universe of possible values,实现接口BoundedFlowSet<T>
  • unbounded FlowSet:与 bounded 相反,实现接口 FlowSet<T>

继承了FlowSet<T>的接口需要实现:

  • clone()
  • clear()
  • isEmpty()
  • copy(FlowSet dest) // deep copy of this into dest
  • union(FlowSet other, FlowSet dest) // dest <- this ∪ other
  • intersection(FlowSet other, FlowSet dest) // dest <- this ∩ other
  • difference(FlowSet other, FlowSet dest) // dest <- this - other

In addition, when implementing BoundedFlowSet<T>, it needs to provide methods for producing the set's complement and its topped set (i.e., a lattice element containing all the possible values).

soot 实现了四种 FlowSet,分别是:ArraySparseSet、ArrayPackedSet、ToppedSet 和 DavaFlowSet

  • ArraySparseSet is an unbounded flow set. The set is represented as an array of references

  • ArrayPackedSet is a bounded flow set. Requires that the programmer provides a FlowUniverse object.

A FlowUniverse object is simply a wrapper for some sort of collection or array, and it should contain all the possible values that might be put into the set.

  • ToppedSet wraps another flow set (bounded or not) adding information regarding whether it is the top set (⊤) of the lattice.

In our very-busy expressions example, we need to have flow sets containing expressions and as such we want them to be compared for equivalence — i.e., two different occurrences of a + b will be different instantiations of some class implementing BinopExpr; thus they will never compare equal. To remedy this, we use a modified version of ArraySparseSet, where we have changed the implementation of the contains method as such:

public boolean contains(Object obj) {
    for (int i = 0; i < numElements; i++)
        if (elements[i] instanceof EquivTo && ((EquivTo) elements[i]).equivTo(obj))
            return true;
        else if (elements[i].equals(obj))
            return true;
    return false;
}

CFG

soot 在 package soot.toolkits.graph中提供了若干种 CFG( Control Flow Graphs ),都是基于接口 DirectedGraph<N>,该接口定义了若干个 getter 方法:

  • the entry and exit points to the graph,
  • the successors and predecessors of a given node,
  • an iterator to iterate over the graph in some undefined orde and the graphs size (number of nodes).

此处讨论的 CFG 的 node 是 Soot Units,在 Jimple 中,Units 的实现是 Stmt。此外,仅讨论 intra-procedural flow

Abstract class:UnitGraph,提供了构建 CFG 的工具,此外,soot 还提供了三种具体实现:BriefUnitGraph, ExceptionalUnitGraph and TrapUnitGraph.

  • BriefUnitGraph:最简单的实现,it doesn’t have edges representing control flow due to exceptions being thrown
  • ExceptionalUnitGraph:最常用的实现,it includes edges from throw clauses to their handler(catch block, referred to in Soot as Trap), that is if the trap is local to the method body.
  • TrapUnitGraph:类似 ExceptionalUnitGraph,但会考虑可能引发的异常。

在 soot 中,可以为 body 生成 CFG,例如:

UnitGraph g = new ExceptionalUnitGraph(body);

Wrapping the results of the analysis

todo