LLM code scanning evals
A set of tests that measures how effectively LLMs identify security issues in code and generate fixes
LLMs:
Currently set up for OpenAI GPT-4 and GPT-3.5-Turbo
Dataset:
https://github.com/OWASP-Benchmark/BenchmarkJava
How to run:
- Download and open the Notebook (owasp_java_benchmark.ipynb) in your choice of Jupyter Notebook environment
- Install the dependencies using pip install -r requirements.txt
- Make sure the OpenAI API key is available in the execution environment as the environment variable OPENAI_API_KEY
- Select the value for LLM (gpt-4-0613 or gpt-3.5-turbo-0613). At the time of this writing, GPT-4 is the newest and most advanced model from OpenAI; GPT-3.5-Turbo is faster and cheaper
- Set the temperature (between 0 and 1). This is the attribute that adds variance to the model's output (highest variance at 1)
- Run all the cells in the Notebook
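The setup steps above can be checked before any API call is made. The following is a minimal sketch, not the notebook's actual code: the helper name load_settings and its env parameter are hypothetical, and only the model names, the OPENAI_API_KEY variable, and the 0 to 1 temperature range come from this README.

```python
import os

# Model names taken from this README; everything else here is illustrative.
SUPPORTED_MODELS = ("gpt-4-0613", "gpt-3.5-turbo-0613")


def load_settings(model, temperature, env=None):
    """Validate the run settings (hypothetical helper, not part of the notebook)."""
    # Allow a dict to be passed in for testing; default to the real environment.
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set in the environment")
    if model not in SUPPORTED_MODELS:
        raise ValueError(f"unsupported model: {model!r}")
    if not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be between 0 and 1")
    return {"model": model, "temperature": temperature, "api_key": api_key}
```

Calling load_settings("gpt-3.5-turbo-0613", 0.0) in the first cell would fail fast if the key is missing, instead of partway through a paid run.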
Running all 2470 testcases on GPT-4 costs around $100 with the OpenAI API at the time of this writing, and around $5 on GPT-3.5-Turbo. More info about pricing - https://openai.com/pricing#language-models
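To budget a partial run, the quoted figures can be scaled down. This is a rough sketch that assumes cost grows linearly with the number of testcases, which is only an approximation since testcases differ in size; the function name estimate_cost is hypothetical.

```python
# Figures quoted in this README (approximate, and subject to price changes).
TOTAL_TESTCASES = 2470
FULL_RUN_COST_USD = {"gpt-4": 100.0, "gpt-3.5-turbo": 5.0}


def estimate_cost(model, n_testcases):
    """Scale the quoted full-run cost linearly to a smaller number of testcases."""
    per_testcase = FULL_RUN_COST_USD[model] / TOTAL_TESTCASES
    return round(per_testcase * n_testcases, 2)
```

Under this assumption, a GPT-4 run over a tenth of the benchmark (247 testcases) would come to roughly a tenth of the full cost.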
How to read the results:
Columns from OWASP Benchmark
- metadata_vulnerability_exists (True/False) - This is the vulnerability tag from the XML file for the testcase that tells us if the testcase is exploitable
- expected_vuln_type - This is the category tag from the XML file that provides the category of the vulnerability in the testcase
Columns from the LLM
- vulnerability_found (True/False) - True if LLM finds a vulnerability in the code for the testcase, False otherwise
- vulnerability - This is the category the LLM classifies the vulnerability into if it finds one for the testcase
- vulnerable_code - This is the code sample from the testcase that the LLM thinks is vulnerable
- code_fix - This is the code generated by the LLM to fix the vulnerable code
- comment - Human-readable comment that helps explain the issue and the code fix
Columns from comparison
- vulnerability_type_matches - True if there is an 80%+ fuzzy match between expected_vuln_type (from OWASP Benchmark) and vulnerability (from the LLM)
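An 80% fuzzy match could be computed along the following lines. This README does not say which fuzzy-matching library the notebook uses, so this sketch swaps in difflib.SequenceMatcher from the Python standard library; the function name type_matches and the lowercasing step are assumptions, and scores from other libraries may differ slightly.

```python
from difflib import SequenceMatcher


def type_matches(expected, found, threshold=0.8):
    """Return True when the two category strings are at least `threshold` similar."""
    # A missing or empty category on either side can never match.
    if not expected or not found:
        return False
    a = expected.strip().lower()
    b = found.strip().lower()
    # ratio() is 2 * (matching characters) / (total characters), between 0 and 1.
    return SequenceMatcher(None, a, b).ratio() >= threshold
```

A threshold below 100% tolerates small differences in how the LLM spells a category, at the cost of occasionally matching two distinct but similarly named categories.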
How results are calculated:
- True Positive - TP = ((df['vulnerability_found'] == True) & (df['metadata_vulnerability_exists'] == True)).sum()
- True Negative - TN = ((df['vulnerability_found'] == False) & (df['metadata_vulnerability_exists'] == False)).sum()
- False Positive - FP = ((df['vulnerability_found'] == True) & (df['metadata_vulnerability_exists'] == False)).sum()
- False Negative - FN = ((df['vulnerability_found'] == False) & (df['metadata_vulnerability_exists'] == True)).sum()
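The four counts above can be cross-checked without pandas. The sketch below is a hypothetical helper (confusion_counts is not part of the notebook) that walks the two columns in parallel; it assumes both hold only True/False values, with no missing entries.

```python
def confusion_counts(found, exists):
    """Count TP, TN, FP and FN from two parallel sequences of booleans.

    found  -- values of vulnerability_found (the LLM's verdict)
    exists -- values of metadata_vulnerability_exists (the benchmark's label)
    """
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for f, e in zip(found, exists):
        if f and e:
            counts["TP"] += 1  # LLM flagged a real vulnerability
        elif not f and not e:
            counts["TN"] += 1  # LLM correctly stayed silent
        elif f and not e:
            counts["FP"] += 1  # LLM flagged safe code
        else:
            counts["FN"] += 1  # LLM missed a real vulnerability
    return counts
```

Because it only iterates over its arguments, it should also accept the DataFrame columns directly, for example confusion_counts(df['vulnerability_found'], df['metadata_vulnerability_exists']). The four counts always add up to the number of rows, which is a quick sanity check on the results.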
More information:
https://medium.com/p/9c2ca0312036
Bootstrapping this quickly. PRs welcome.