pascal-lab/Tai-e

about java-benchmarks log4j configuration

Closed this issue · 9 comments

Description

In options.yml, pointer analysis configuration is: cs:ci;distinguish-string-constants:null; If I change to more sensitive context sensitivity, such as cs:1-obj, Tai-e will not find any tainted flow. I think this maybe a hint that Tai-e is using some spurious flow to perform the taint propagation. I am not very sure, but i think that Tai-e is not handling strLookupMap reflection in log4j correctly. If I change options.yml to cs:1-obj;distinguish-string-constants:reflection; Tai-e will find the tainted flow on any context sensitivity including 2-obj.

I also have another question. In tainted-config.yml, there are 3 log4j specific taint transfer function. But these three are merely some kind of field assignment. I think Tai-e have the ability to perform the analysis without these three taint transfer. In fact, i delete these three taint transfer and make a little modification on the existing jdk taint transfer function, Tai-e is still able to find the correct taint flow(I check the result mannually). I think for security researchers, it will mean a lot if Tai-e is not that relied on application specific taint transfer function.

Thank you for bringing this issue to our attention.
We will promptly verify and provide a timely response to your first question.
Regarding the second question, could you please provide the complete taint-config.yaml configuration options, .log file, options.yaml file, and any other relevant information? This will help us to reproduce and verify the issue.
We appreciate your cooperation and understanding.

Sure!
The options.yml changed one line:

pta: cs:1-obj;implicit-entries:false;merge-string-builders:false;distinguish-string-constants:reflection;reflection-inference:solar;taint-config:java-benchmarks/log4j/2.14.0/taint-config.yml;reflection-log:java-benchmarks/log4j/2.14.0/refl.log; #dump-ci:true;

The reason is explained above. merge-string-builders:false is for precision.

refl.log is the same.

taint-config.yaml is as follows:

sources:
  - { kind: call, method: "<Server: java.lang.String getInput()>", index: result }

sinks:
  - { method: "<javax.naming.InitialContext: java.lang.Object lookup(java.lang.String)>", index: 0 }

transfers:
  - { method: "<java.lang.StringBuilder: void <init>(java.lang.String)>", from: 0, to: base.value }
  - { method: "<java.lang.StringBuilder: java.lang.String substring(int,int)>", from: base.value, to: result }
  - { method: "<java.lang.StringBuilder: void getChars(int,int,char[],int)>", from: base.value, to: 2 }
  - { method: "<java.lang.String: java.lang.String substring(int)>", from: base, to: result }
  - { method: "<java.lang.String: void <init>(char[],int,int)>", from: 0, to: base }
  - { method: "<java.lang.StringBuilder: java.lang.StringBuilder append(java.lang.CharSequence)>", from: 0, to: base.value }
  - { method: "<java.lang.ThreadLocal: void set(java.lang.Object)>", from: 0, to: base }
  - { method: "<java.lang.ThreadLocal: java.lang.Object get()>", from: base, to: result}

sanitizers:
  - { kind: param, method: "<java.lang.ThreadLocal: void set(java.lang.Object)>", index: 0 }

I also have to change the code in pascal/taie/language/type/TypeSystemImpl.java, which is explained below.

let me explain the changes:

  1. remove the application specific transfer function. reason is explained above.

  2. modify and add some transfer function about Stringbuilder
    java.lang.StringBuilder append(java.lang.CharSequence) is used in log4j and have to be added.
    append have to propagate to base.value, otherwise in the follow condition the taint propagate won't function normally:

private CharSequence charSequence;
public void formatTo(final StringBuilder buffer) {
    buffer.append(charSequence);
}

in this example, if transfer function is from 0 to base, then the local variable buffer will add tainted object to its point to set, then will not propagate any further.
To fit the append transfer function, i have to change all Stringbuilder related transfer function.

  1. add 2 transfers and 1 sanitizer for ThreadLocal.
    Without these 2 transfer functions, tai-e is still able find the correct path. But add these three make the analysis a lot more precise. Considering the following example(which is similar to the real use in log4j):
public static void main(String[] args) {
    ThreadLocal<String> map1 = new ThreadLocal<>();
    ThreadLocal<String> map2 = new ThreadLocal<>();
    String tainted = source();
    String nontainted = "123";
    map1.set(tainted);
    map2.set(nontainted);
    sink(map1.get());
    sink(map2.get());
}

Every Threadlocal object eventually write to the same map, which key is the object aka map1/map2. If let Tai-e to analysis the code above, it will report a spurious flow, from tained to map2.get() to sink, because internally in jdk, map1 and map2 sit in the same map. With the transfer and sanitize, Tai-e will report the precise tainted flow. ThreadLocal is used a lot in log4j. By these we can improve precision.
But these transfer and sanitizer are conflict with typefitler. Above example is sufficient to demonstrate that. I have to add a dirty hack to pascal/taie/language/type/TypeSystemImpl.java:183, like this:

else if (subtype instanceof ClassType) {
            if (supertype instanceof ClassType) {
                if (subtype == OBJECT) {
                    return true;
                }
                return hierarchy.isSubclass(
                        ((ClassType) supertype).getJClass(),
                        ((ClassType) subtype).getJClass());
            }

Because java.lang.ThreadLocal: java.lang.Object get() returns a java.lang,object, then cast to the target type which varies and are not known before static analysis. Type filter will filter it out immediately. so i have to disable the typefilter on these case. I hope there will be a elegant way to solve this.

By these taint configuration, Tai-e is able to output the correct result. Tai-e result is based on the pointer flow graph. So it only tracks the flow of the String object, but not showing the flow of the container object(Message and LogEvent). In the output dot graph, field store is immediately followed by field load, emitting the information about the flow of the object which field is assigned to, that is the message object. But i think the flow information about Message and LogEvent can also be extract from pointer flow graph, in a separate pass.

Thank you for your prompt response and the provided information. We appreciate your cooperation. We will now carefully review the details you've shared to gain a better understanding of the reported issue and actively work towards a resolution. Please expect to hear back from us shortly with our findings.

I want to discuss another related thought on log4shell tainted analysis, about increasing precision.
First of all, i think bug hunter should review the tainted flow reported by tai-e, to make sure whether it is a false positive, or real finding. I think maybe i should review both in-app tainted flow and tainted flow that pass through third party libraries(including JDK), unless i am one hundred percent sure that tai-e models every third party calls.
Under this assemption, i checked all tainted flow of log4shell tainted analysis, a lot of precision loss (spurious tainted flow) is of this pattern:

public static void main(String[] args) {
        String tainted = source();
        File f = new File(tainted);
        String Name = f.getName();
        String t2 = tainted.substring(0);
        sink(t2);
    }
public static String source(){
    return "tt";
}
public static void sink (String tainted){
    return;
}

Tai-e may thinks that: String tainted flows to File constructor, then to substring(File calls substring internally), then return to t2 then to sink(which is unwanted a control flow merge, causing a spurious tainted flow). This controll flow merge will not solved by k-obj context sensitivity, because the base object "tainted" is allocationed in main function, which have only 1 depth of object context. So every call to tainted.substring will merge to one CSMethod.
2-callsite sensitivity may help in this case, but many jdk calls is deeper than two calls.
I think one way to solve this kind of problem is to use transfer function and sanitizer (I am not sure this is a proper/intended way to use sanitizer), that is,

transfers:
  - { method: "<java.lang.String: java.lang.String substring(int,int)>", from: base, to: result }
  - { method: "<java.lang.String: java.lang.String substring(int)>", from: base, to: result }
sanitizers: 
  something like: - { kind: base, method: "<java.lang.String: java.lang.String substring(int)>"}
  something like: - { kind: base, method: "<java.lang.String: java.lang.String substring(int,int)>"}

this will prevent the call to tainted.substring, thus eliminat the unwanted control flow merge.
But currently tai-e do not have sanitizer that stops base to result and base to arg transfers. It is unclear to me that how it can be elegantly implemented, because intuitively, sanitizer on base object should stop the base object to be assigned to this variable of callee. This is handled before plugin callbacks, and it is not a PFG so that can not be filtered.
I implement the sanitizer on base object, and it works as expected and do filter out many false positive(flow edges).
Is there better way to solve it? Thanks!

There is another interesting thing that i want to share, related to the precision of log4j tainted analysis.
Some of the supurious flows are caused by this pattern(such as org/apache/logging/log4j/core/config/composite/CompositeConfiguration.java:82):

public class HashMapTest {
    public static void main(String[] args) {
        HashMap<String,String> hm1 = new HashMap<>();
        HashMap<String,String> hm2 = new HashMap<>();

        String tainted = source();
        String tainted2 = "123";

        hm1.put("1",tainted);
        hm2.put("1",tainted2);

        for (final Map.Entry<String, String> entry : hm1.entrySet()) {
            sink(entry.getValue());
        }

        for (final Map.Entry<String, String> entry : hm2.entrySet()) {
            sink(entry.getValue());
        }
    }

    public static String source(){
        return "tt";
    }

    public static void sink(String str){
        return;
    }

It will need 3-obj sensitivity, to seperate the tainted flow with non-tainted flow, since entryset iterator is allocated in entryset, which is allocated in hashmap. The goal is to seperate two different hashmaps, so i have to record the allocation site of the allocator(entryset)'s allocator(hashmap). If we use 2-obj sensitivity, result will contains a spurious flow that source flows to hm2 then to sink, which is not going to happen dynamically.
It is not possible for me to able to enable 3-obj sensitivity in the log4j analysis(OOM). But i found that if i use 1-obj for most part of the program, 3-obj for objects that are of inner class type and methods that belong to inner class type, it is only a little slower than 1-obj(1.5x), and much more precise.
If i use 2-obj for inner class related, 1-obj for other, it is actually faster (0.6x) and more accurate than 1-obj for all.
Is this a valid heuristic? Or are there better solutions?
Looking forward to your reply! Thanks for the time.

Thank you very much for your valuable feedback!

Regarding your first question, we have been able to reproduce the relevant scenario and, after multiple tests, have identified potential inaccuracies in reflective resolution within contexts such as ci, 1-type, 2-type, 1-call, and 2-call. We are delving into an in-depth investigation of this issue and are considering employing more accurate default configuration schemes in subsequent versions to address this problem.

As for your second question, your perspective is correct. If our taint propagation relies less on the implementation of specific applications but more on the internal methods of the JDK, it will contribute to better supporting the analysis of different programs. This approach represents an ideal way to fill in the transfer.

However, we currently adopt a different approach, primarily for the consideration of security professionals, allowing them to manually trace vulnerabilities more effectively. Our filling method includes some container-related transfer methods (e.g., Message and LogEvent as you mentioned) designed to provide more detailed vulnerability context information. We think this design can offer a clearer display of the vulnerability propagation paths and examples, assisting researchers in better understanding the issues.

Additionally, we have noted your discussion on the sanitizer. Tai-e's current support for sanitizer is indeed not very comprehensive. Since the issues related to the sanitizer may bring fundamental changes to pointer analysis itself, we need to be very careful. We are currently taking a more conservative approach. We suggest opening a separate issue for further discussion on this matter.

Thanks for the reply!
By the way, what's your opinion about that Stringbuilder transfer function should propagate to base.value, not base? Seems tai-e have some back propagation algorithm, but does not cover the example i given before. Maybe with more complete back propagtion algorithm, it will be OK for Stringbuilder to propagate to base?
And what about the transfer function / TypeSystemImpl confliction i mentioned? ThreadLocal transfer function will not work at all, without partially disable the TypeSystemImpl, like i demonstrate before.

Thank you for your valuable feedback!

Regarding your first question about the transfer function of StringBuilder: In determining whether the transfer method should propagate from base.value or just base, the choice depends on the desired reporting of tainted path content in the taint analysis. This decision may also necessitate referencing how other transfer methods are filled within the project. Additionally, regarding the mentioned back propagation algorithm, we haven't fully got it. Could you provide additional details to illuminate this aspect?

Concerning your second question about the conflict between transfer methods and the type system: The current design of our transfer method configuration requires improvement. We may enhance the capacity of transfer method in future. In the meantime, we offer a temporary solution to address this issue. You can specify a particular type for the transfer method to handle this specific case without modifying the type system. Here is an illustrative example of how to articulate the transfer functions and sanitization for the specified case:

transfers:
  - { method: "<java.lang.ThreadLocal: void set(java.lang.Object)>", from: 0, to: base, type: "java.lang.String" }
  - { method: "<java.lang.ThreadLocal: java.lang.Object get()>", from: base, to: result, type: "java.lang.String" }

sanitizers:
  - { kind: param, method: "<java.lang.ThreadLocal: void set(java.lang.Object)>", index: 0 }

Thanks for the reply!
For the second question, i've also considered the the type solution. The things is that ThreadLocal is used to store a lot of type of things(in log4j), not only one type. It is both hard to know all the type list statically, and hard to express it in current Tai-e.
Anyway, i will try some temporary solution for now.
Thanks for the help!