secure-software-engineering/phasar

Possible error in function resultatinLLVMSSA()

Luweicai opened this issue · 6 comments

  • [√] I have searched open and closed issues for duplicates
  • [√] I made sure that I am not using an old project version (DO: pull Phasar, update git submodules, rebuild the project and check if the bug is still there)

The resultatinLLVMSSA() helps to deal with the problem that new generated fact will only vaild after the current instrcuction. For example, the load instruction, if %i is hold, the fact %0 will hold after the load instruction. The resultatinLLVMSSA() aims to solve this problem by querying the result at the next instruction.
%0 = load i32, i32* %i, align 4;
However, it seems not consider the problem that if the next insturction is a call statement. Due to the underlying theory (and respective implementation) not fact at call statement holds and will only holds at the corresponding return site, i.e. the statement after the call statement.

Hi @Luweicai, to get your question right: Are you saying that PhASAR does not give any results at a call-site or that PhASAR should not give results at a call statement?

The result[s]AtInLLVMSSA() functions view the LLVM IR from a different point of view than the analysis solver does:

Due to the structure of LLVM IR a call statement (and in fact any other non-void instruction) can be seen as two parts: the right side of the = that computes a new result and the left side of the = that holds the computed result.
So, if you have an instruction %0 = call i32 @foo() the actual call is the right hand side call i32 @foo() and you can interpret the instruction as an assignment of the return-value of the call to the new "variable" %0. From that point of view, the return-site of the call is represented by the same LLVM instruction.

The regular result[s]At functions, however, see each LLVM instruction as atomic: The facts reported at a instruction hold immediately before that instruction gets executed -- incorporating the effects of the previous instruction.

Can you please further explain, why the call instruction has an issue? Maybe an example could be helpful.

Hi @fabianbs96 , thank you for your reply.

I agree with your description about result[s]At() and result[s]AtInLLVMSSA() and their functional difference.

What I would like to say is that, there seems to be a slight mistake in the implementation of result[s]AtinLLVMSSA().

The different functionality of the result[s]AtinLLVMSSA() compared to the result[s]At() is achieved through querying the fact holds at the 'next statement', i.e. the successor node instead of the current node (Let's consider only the scenario where successor nodes exist).

Here is the example:

%1 = add nsw i32 %0, 1;
%2 = add nsw i32 %1, 1;
call void @tt(i32 %3);
%5 = add nsw i32 %4, 1;

Assume at the begin the %0 is tainted. We know that %1 holds at %1 = add nsw i32 %0, 1 and %2 holds at %2 = add nsw i32 %1, 1;. If we use the result[s]AtinLLVMSSA() function to check it, it works well with %1. But will get wrong result with %2.

Let's see what happens when we want to query whether %2 holds at %2 = add nsw i32 %1, 1;.

Becasue the implementation of PHASAR, call void @tt(i32 %5); will be processed as a call site and the following statement is it's return site. The fact holds at %2 = add nsw i32 %1, 1; will be propagate directly to %4 = add nsw i32 %3, 1. Meanwhile, the result[s]AtinLLVMSSA() query the successor node, which is call void @tt(i32 %5);. The result will show that %2 not holds.

MMory commented

Hi @Luweicai,

I am not sure whether I understand your problem correctly.

Given your example, %1 is generated at %1... as its statement effect, which means that the fact appears at %2... . %2 should hold at both the third and fourth statement.

Anyway, looking into the implementation of result[s]AtinLLVMSSA(), they look suspicious. Also they are not used anywhere in the existing code base, so it's more than possible that these functions are incorrect.

Thank you for reporting this, we'll look into it.

Hi @MMory.

Let me show more details about the origin of this problem.

Here is the ''process intra propagate'' of the IFDS analysis (extract from IDESolver.h). I omit the trivial part and only illustrate the problem involved part.

  virtual void processNormalFlow(const PathEdge<n_t, d_t> Edge) {
   ...
/// find the sucessor node
    for (const auto nPrime : ICF->getSuccsOf(n)) { 
// compute the fact.
      FlowFunctionPtrType FlowFunc =
          CachedFlowEdgeFunctions.getNormalFlowFunction(n, nPrime);
    ...
      const container_type Res = computeNormalFlowFunction(FlowFunc, d1, d2); 
      ...
/// propagate the generate fact to the next position
      for (d_t d3 : Res) { 
        EdgeFunctionPtrType g =
            CachedFlowEdgeFunctions.getNormalEdgeFunction(n, d2, nPrime, d3);
        EdgeFunctionPtrType fPrime = f->composeWith(g);
        ...
        propagate(d1, nPrime, d3, fPrime, nullptr, false);
      }
    }
  }

From the code that we know a fact generated at pos n will propagate to the successor of it ( in the source code, the inner for loop that for (const auto nPrime : ICF->getSuccsOf(n)) { ...}

There is no mistake here, this is how the classical IFDS algorithm is written. In this scenario, the taint result for one position (let's call it Instruction P) is actually the taint result that before P is executed. For example, stmt: %1 = add nsw i32 %0, 1; with %0 is tainted and %1 is not before this instruction. The taint set returned by resultsAt(stmt) will not contain %1.

But if one just want to know that the whether the left hand value of = is just taint at that position, like whether %1 is tainted at %1 = add nsw i32 %0, 1;

resultinAtLLMVSSA() deal with this problem. The source code is:

resultAtInLLVMSSA(
    ByConstRef<n_t> Stmt, d_t Value, bool AllowOverapproximation) const ->
    typename std::enable_if_t<
        std::is_same_v<std::decay_t<std::remove_pointer_t<NTy>>,
                       llvm::Instruction>,
        l_t> {
  if (Stmt->getType()->isVoidTy()) {
    return self().Results.get(Stmt, Value);
  }
  if (!Stmt->getNextNode()) {
...
  }
  assert(Stmt->getNextNode() && "Expected to find a valid successor node!");
  return self().Results.get(Stmt->getNextNode(), Value);
}

resultinAtLLMVSSA() deal with this problem by query the successor node (return self().Results.get(Stmt->getNextNode(), Value);). We just omit the situation that no successor node exist as it is not the focal point of this problem.

It is corrent in most cases. However, if the successor node is a call statement. The processCall() function will propagate the fact to the called function position and the Returnside of the call statement. And not save the fact at position of call statement. Which casue mistakes of resultinAtLLMVSSA().

At this point, I began to doubt whether there is an issue with my implementation of the inter-analysis section. Because the results of my analysis are always look like (print by the --emit-raw-result option):

for test case(%0 is tainted at the initial position):

%1 = add nsw i32 %0, 1;
%2 = add nsw i32 %1, 1;
call void @tt(i32 %3);
%5 = add nsw i32 %4, 1;

the result is:

N: %1 = add nsw i32 %0, 1;
-----------------------------------------------------
D: @zero_value
D: i32 %0

N: %2 = add nsw i32 %1, 1;
-----------------------------------------------------
D: @zero_value
D: i32 %0
%1 = add nsw i32 %0, 1;

N: call void @tt(i32 %3);
-----------------------------------------------------
D: @zero_value

N: %5 = add nsw i32 %4, 1;
-----------------------------------------------------
D: @zero_value
D: i32 %0
%1 = add nsw i32 %0, 1;
%2 = add nsw i32 %1, 1;

Hi @Luweicai, we debugged your example and indeed found a problem that prevents results from being reported at certain call sites. Can you check whether #632 fixes your issue?

We have also identified some shortcomings in result[s]AtInLLVMSSA() that will be fixed in the future.

Hi @fabianbs96. It works. Thank you.