Static analysis is a method of debugging in which the source code is examined automatically, without executing the program. It gives developers an understanding of their code base and helps ensure that it is compliant, safe, and secure. To find defects and policy violations, checkers analyze the code.
They operate by querying or traversing the model, looking for particular properties or patterns that indicate defects. Sophisticated symbolic execution techniques explore paths through a control-flow graph, the data structure that represents the paths a program might traverse during execution. A warning is generated if the path exploration notices an anomaly.
To model and explore the astronomical number of combinations of circumstances, scanners employ a variety of strategies to ensure scalability. For example, procedure summaries are refined and compacted during the analysis, and paths are explored in an order that minimizes paging.
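As a minimal, hypothetical illustration (not code from any particular checker), the function below contains the kind of path-dependent defect such exploration surfaces: on the path where the loop never finds a match, the final line dereferences `None`.

```python
def find_user(users, name):
    """Return the e-mail address of the first user whose name matches."""
    match = None
    for user in users:
        if user["name"] == name:
            match = user
            break
    # A path-sensitive checker explores the path on which the loop never
    # assigns `match`; subscripting None on that path raises a TypeError.
    return match["email"]
```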
- A system with at least 8 GB of RAM
- Linux or Windows (we recommend running the code on Linux)
- Docker
- Git
- Python3 and pip
- PHP
- Navigate to the backend directory
- Create a virtual environment
For both Windows and Linux, you can create a virtual environment with the commands below (on Windows, activate it with `myenv\Scripts\activate` instead of `source`):
python3 -m venv myenv
source myenv/bin/activate
- Then, run
./scaffold.sh
- Finally, run
./run.sh
This starts the Flask application that acts as a wrapper around the static code analysis tools
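As a rough sketch of what such a wrapper looks like (the route name, request field, and Semgrep invocation below are illustrative assumptions, not the project's actual API):

```python
# Illustrative sketch only; the real backend defines its own routes and options.
import subprocess
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/scan", methods=["POST"])  # hypothetical endpoint
def scan():
    target = request.json["path"]  # assumed field: path of the checked-out code
    # Run Semgrep with the default rule registry and return its JSON findings.
    result = subprocess.run(
        ["semgrep", "--config", "auto", "--json", target],
        capture_output=True, text=True, check=False,
    )
    return jsonify({"findings": result.stdout})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```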
- Clone a SOAR tool; in our case, we ran our project using
git clone git@github.com:JadenFurtado/SOARtool.git
# building
./dc-build.sh
# running (for other profiles besides mysql-rabbitmq look at https://github.com/DefectDojo/django-DefectDojo/blob/dev/readme-docs/DOCKER.md)
./dc-up.sh mysql-rabbitmq
# obtain admin credentials. the initializer can take up to 3 minutes to run
# use docker-compose logs -f initializer to track progress
docker-compose logs initializer | grep "Admin password:"
List of vulnerabilities:
Navigate to the fileMangaer directory and run:
php -S 0.0.0.0:5555
A high-level layout of our system is shown below.
We've made extensive use of Docker and Celery to handle the asynchronous nature of our task, i.e. scanning many code files of varying sizes across multiple repositories. A high-level architecture of Celery is shown below.
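As a rough sketch of this pattern (the broker URL, task name, and scan helper below are assumptions for illustration, not the project's actual code), each repository is submitted as an independent Celery task so that repositories of very different sizes can be scanned concurrently by the worker pool:

```python
# Illustrative sketch only; queue names and the Semgrep invocation are assumed.
import subprocess
from celery import Celery

app = Celery("scanre", broker="amqp://guest@localhost//")

@app.task
def scan_repository(repo_url: str) -> str:
    """Clone a repository and run Semgrep over it, returning JSON findings."""
    workdir = "/tmp/" + repo_url.rstrip("/").split("/")[-1]
    subprocess.run(["git", "clone", "--depth", "1", repo_url, workdir], check=True)
    result = subprocess.run(
        ["semgrep", "--config", "auto", "--json", workdir],
        capture_output=True, text=True, check=False,
    )
    return result.stdout

# Usage: queue scans for many repositories without blocking the caller.
# scan_repository.delay("https://github.com/example/project")
```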
To help improve the security posture of open-source software in the industry
We were inspired by the work showcased at NullCon22, Asia's largest cybersecurity conference. Links to the talks are in the references section.
Our ultimate motivation with this project, as mentioned before, was to prioritize research over usability as a product. We have made decisions to ensure that ScanRE works successfully as a product while also performing well for its original purpose, i.e. a tool to help researchers quickly scan multiple repositories and analyze the findings. We ran ScanRE over 3000 of the most popular repositories available on GitHub. Out of these, we prioritized 300 repositories as a sample. We got over 12,000 findings in them as shown, pending manual validation. ScanRE was able to identify insecure code patterns, and thus proves effective as a tool for enforcing coding standards too.
Since the underlying system is primarily built on top of Semgrep, our performance is mainly determined by the performance of Semgrep. Semgrep is able to outperform GitGuardian and other code analysis tools, both in terms of time taken and false positives flagged.
Tree matching has a nearly negligible cost compared to most deep program analysis techniques, such as pointer analysis or symbolic execution, so this was clearly a winning battle. As Semgrep grew more advanced, features were added that pushed it closer to the side of semantics, such as taint analysis and constant propagation.
These analyses cannot necessarily be done quickly. Taint analysis, in particular, requires running dataflow analysis over the entire control flow of a program, which can be huge when considering how large an entire codebase may be, with all of its function calls and tricky control-flow logic. To do taint analysis in this way would be to pick a losing battle.
Semgrep succeeds because it only carries out single-file analysis, so the control-flow graph never exceeds the size of a file. In addition, taint can be tracked incrementally. Functions have well-defined points where they begin and end, as well as generally well-defined entrances in terms of the data they accept (their arguments). Thus, Semgrep collects taint summaries which, per function, encode what taint may be possible depending on the taint of the inputs that flow in.
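For example (a hypothetical snippet, not Semgrep's internal representation), the summary for `build_query` below records that its return value is tainted whenever its argument is; when the route handler passes request data into it, intra-file taint tracking connects the source to the SQL sink without analyzing any other file:

```python
import sqlite3
from flask import Flask, request

app = Flask(__name__)

def build_query(name):
    # Taint summary for this function: the return value is tainted
    # whenever the `name` argument is tainted.
    return "SELECT * FROM users WHERE name = '" + name + "'"

@app.route("/user")
def show_user():
    # `request.args` is a taint source; `build_query` propagates it into
    # cursor.execute(), the sink, yielding a SQL-injection finding.
    cursor = sqlite3.connect("users.db").cursor()
    cursor.execute(build_query(request.args.get("name", "")))
    return str(cursor.fetchall())
```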
References: