This is a minimal implementation of a data provenance tracking algorithm that considers only dynamic data dependencies. The provenance sources are stdin and files; more generally, any read from a given file descriptor is treated as a provenance source. The provenance targets are all the program variables. The tool tracks data provenance from sources to targets across registers and VEX IR temporary variables.
- Install Valgrind
- Follow the instructions described here on setting up a new Valgrind tool
The implementation is based on hash maps. We use separate hash maps to store the data dependencies of memory locations and of temporary variables. For registers we use Valgrind's shadow register area, accessed through the `VG_(set_shadow_regs_area)` and `VG_(get_shadow_regs_area)` functions.
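As a rough illustration of this layout, the following toy Python model keeps one map per kind of location. It is only a sketch and not the tool's code: the class name `ShadowState`, its fields, and the use of a plain dictionary for registers (instead of Valgrind's shadow register area) are assumptions made for the example.

```python
class ShadowState:
    """Toy model of the shadow state: each map goes from a location
    to the set of source addresses that location depends on."""

    def __init__(self):
        self.mem = {}  # memory address -> set of source addresses
        self.tmp = {}  # VEX temporary name -> set of source addresses
        self.reg = {}  # register offset -> set of source addresses
                       # (the real tool keeps this in Valgrind's shadow registers)

    def provenance_of_mem(self, addr):
        # Untracked locations have empty provenance.
        return set(self.mem.get(addr, ()))


state = ShadowState()
state.mem[0x1000] = {0x1000}
print(state.provenance_of_mem(0x1000))
print(state.provenance_of_mem(0x2000))
```

Returning a copy of the set keeps callers from accidentally sharing one provenance set between two locations.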
Since the sources are reads from file descriptors, we detect them using the syscall APIs in Valgrind. Whenever there is a read syscall, we mark the buffer address of the read as tainted by itself, i.e., the provenance of the buffer is its own address.
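The source-marking step can be sketched in a few lines of Python. This is a toy model, not Valgrind code: the function name `on_read_syscall`, the `tracked_fd` parameter, and the choice to mark every byte of the buffer individually are assumptions (the text above does not say whether the tool tracks the buffer per byte or by its base address).

```python
def on_read_syscall(shadow_mem, fd, buf, nbytes, tracked_fd=0):
    """Mark the bytes filled by a read() on the tracked descriptor as sources.

    shadow_mem maps a memory address to the set of source addresses it
    depends on. A freshly read byte depends only on itself.
    """
    if fd != tracked_fd or nbytes <= 0:
        return  # not a provenance source, or nothing was read
    for addr in range(buf, buf + nbytes):
        shadow_mem[addr] = {addr}


shadow_mem = {}
on_read_syscall(shadow_mem, fd=0, buf=0x1000, nbytes=4)
print(sorted(shadow_mem))
```

Reads on other descriptors leave the shadow memory untouched, which matches the idea that only reads from a given file descriptor introduce provenance.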
Whenever there is a register write, we update the register's shadow location with all the addresses the written value depends on. Similarly, on a register read we return the corresponding address list to the reader so that it can update its own dependencies.
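A minimal Python sketch of these two rules follows, modelled on the VEX IR `PUT` and `GET` operations. The function names and the dictionary-based register shadow are assumptions for illustration; the tool itself stores this state in Valgrind's shadow registers.

```python
def put_reg(shadow_reg, shadow_tmp, offset, src_tmp):
    """Register write (PUT): the register inherits the provenance of the temp."""
    shadow_reg[offset] = set(shadow_tmp.get(src_tmp, ()))


def get_reg(shadow_reg, shadow_tmp, dst_tmp, offset):
    """Register read (GET): the reading temp inherits the register's provenance."""
    shadow_tmp[dst_tmp] = set(shadow_reg.get(offset, ()))


shadow_reg, shadow_tmp = {}, {"t1": {0x1000}}
put_reg(shadow_reg, shadow_tmp, offset=16, src_tmp="t1")
get_reg(shadow_reg, shadow_tmp, dst_tmp="t2", offset=16)
print(shadow_tmp["t2"])
```

A write replaces the old shadow value rather than merging with it, since the register's previous content is overwritten.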
A load reads a memory location and writes its content to a temp variable. The corresponding abstract state update is to look up the shadow memory and pass the taint address list of that location to the temp shadow map. A store writes the content of a temp variable to a memory address. In this case we first pass the provenance from the temp variable to the store address, and then output the address list, which is the final result of the tool.
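The load and store rules can be modelled in Python as below. This is a hedged sketch: the function names, the `report` callback, and the format of the reported result are assumptions, since the text does not describe how the tool prints its output.

```python
def load(shadow_mem, shadow_tmp, dst_tmp, addr):
    """Load: the destination temp inherits the provenance of the memory location."""
    shadow_tmp[dst_tmp] = set(shadow_mem.get(addr, ()))


def store(shadow_mem, shadow_tmp, addr, src_tmp, report=print):
    """Store: the memory location inherits the temp's provenance, which is then reported."""
    prov = set(shadow_tmp.get(src_tmp, ()))
    shadow_mem[addr] = prov
    report(addr, sorted(prov))  # stands in for the tool's final output
    return prov


shadow_mem, shadow_tmp = {0x1000: {0x1000}}, {}
load(shadow_mem, shadow_tmp, "t1", 0x1000)   # t1 now depends on 0x1000
store(shadow_mem, shadow_tmp, 0x2000, "t1")  # 0x2000 now depends on 0x1000
```

After the two calls, the variable stored at `0x2000` is reported as depending on the source byte at `0x1000`.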
ALU operations are arithmetic operations on temp variables and constants, whose result is assigned to another temp variable. For these we simply pass the provenance from the arguments of the ALU operation to the resulting temp variable.
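One way to read this rule is that the result depends on the union of its arguments' provenance, with constants contributing nothing. The Python sketch below models that reading. The function name `alu` and the convention that temps are strings and constants are integers are assumptions made for the example.

```python
def alu(shadow_tmp, dst_tmp, args):
    """ALU operation: the result temp depends on everything its temp arguments depend on.

    args mixes temp names (strings such as "t1") and constants (ints).
    Constants carry no provenance.
    """
    prov = set()
    for arg in args:
        if isinstance(arg, str):
            prov |= shadow_tmp.get(arg, set())
    shadow_tmp[dst_tmp] = prov


shadow_tmp = {"t1": {0x1000}, "t2": {0x1004}}
alu(shadow_tmp, "t3", ["t1", "t2"])  # e.g. t3 = Add32(t1, t2)
alu(shadow_tmp, "t4", ["t3", 1])     # e.g. t4 = Add32(t3, 0x1)
print(sorted(shadow_tmp["t4"]))
```

Because the union is built in a fresh set, later changes to the shadow of `t3` do not alter the provenance already recorded for `t1` or `t2`.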