SecEval: A Comprehensive Benchmark for Evaluating Cybersecurity Knowledge of Foundation Models

Chinese version.

The advent of large language models (LLMs) has ignited a transformative era for the cybersecurity industry. Pioneering applications are being developed, deployed, and used in areas such as cybersecurity knowledge QA, vulnerability hunting, and alert investigation. Several studies have indicated that LLMs acquire their knowledge primarily during the pretraining phase, with fine-tuning serving mainly to align the model with user intentions and give it the ability to follow instructions. This suggests that the knowledge and skills embedded in the foundation model significantly influence the model's potential on specific downstream tasks.

Yet existing datasets lack a focused evaluation of cybersecurity knowledge. We address this by introducing "SecEval", the first benchmark created specifically for evaluating cybersecurity knowledge in foundation models. It offers over 2000 multiple-choice questions across 9 domains: Software Security, Application Security, System Security, Web Security, Cryptography, Memory Safety, Network Security, Vulnerability, and PenTest. SecEval generates questions by prompting OpenAI GPT-4 with authoritative sources such as open-licensed textbooks, official documentation, and industry guidelines and standards. The generation process is designed to ensure that the dataset meets rigorous quality, diversity, and impartiality criteria. You can explore our dataset on the explore page.

Using SecEval, we evaluate 10 state-of-the-art foundation models, providing new insights into their performance in the field of cybersecurity. The results indicate that there is still a long way to go before LLMs master cybersecurity. We hope that SecEval can serve as a catalyst for future research in this area.

Leaderboard
Dataset
Generation Process
Limitations
Future Work
Licenses
Citation
Credits

Leaderboard

#	Model	Creator	Access	Submission Date	System Security	Application Security	PenTest	Memory Safety	Network Security	Web Security	Vulnerability	Software Security	Cryptography	Overall
1	gpt-4-turbo	OpenAI	API, Web	2023-12-20	73.61	75.25	80.00	70.83	75.65	82.15	76.05	73.28	64.29	79.07
2	gpt-3.5-turbo	OpenAI	API, Web	2023-12-20	59.15	57.18	72.00	43.75	60.87	63.00	60.18	58.19	35.71	62.09
3	Yi-6B	01-AI	Weight	2023-12-20	50.61	48.89	69.26	35.42	56.52	54.98	49.40	45.69	35.71	53.57
4	Orca-2-7b	Microsoft	Weight	2023-12-20	46.76	47.03	60.84	31.25	49.13	55.63	50.00	52.16	14.29	51.60
5	Mistral-7B-v0.1	Mistralai	Weight	2023-12-20	40.19	38.37	53.47	33.33	36.52	46.57	42.22	43.10	28.57	43.65
6	chatglm3-6b-base	THUDM	Weight	2023-12-20	39.72	37.25	57.47	31.25	43.04	41.14	37.43	39.66	28.57	41.58
7	Aquila2-7B	BAAI	Weight	2023-12-20	34.84	36.01	47.16	22.92	32.17	42.04	38.02	36.21	7.14	38.29
8	Qwen-7B	Alibaba	Weight	2023-12-20	28.92	28.84	41.47	18.75	29.57	33.25	31.74	30.17	14.29	31.37
9	internlm-7b	Sensetime	Weight	2023-12-20	25.92	25.87	36.21	25.00	27.83	32.86	29.34	34.05	7.14	30.29
10	Llama-2-7b-hf	MetaAI	Weight	2023-12-20	20.94	18.69	26.11	16.67	14.35	22.77	21.56	20.26	21.43	22.15

Dataset

Format

The dataset is in JSON format. Each question has the following fields:

id: str # unique id for each question
source: str # the source where the question is generated from
question: str # the question description
choices: List[str] # the choices for the question
answer: str # the answer for the question
topics: List[QuestionTopic] # the topics for the question, each question can have multiple topics.
keyword: str # the keyword for the question
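
As a minimal sketch of how these fields might be consumed, the snippet below parses question records into typed objects. It assumes the file is a JSON array of objects and that topics are serialized as plain strings; neither is stated above. The sample record is invented for illustration and does not come from the dataset.

```python
import json
from dataclasses import dataclass, fields
from typing import List


@dataclass
class Question:
    id: str
    source: str
    question: str
    choices: List[str]
    answer: str
    topics: List[str]  # assumption: QuestionTopic values are stored as strings
    keyword: str


FIELD_NAMES = [f.name for f in fields(Question)]


def parse_questions(raw: str) -> List[Question]:
    """Parse a JSON array of question objects, keeping only the documented fields.

    Raises KeyError if a record lacks one of the documented fields.
    """
    records = json.loads(raw)
    return [
        Question(**{name: record[name] for name in FIELD_NAMES})
        for record in records
    ]


# Hypothetical record, made up for this example.
sample = json.dumps([
    {
        "id": "example-1",
        "source": "hypothetical source",
        "question": "Which property does encryption primarily provide?",
        "choices": ["A: Confidentiality", "B: Availability"],
        "answer": "A",
        "topics": ["Cryptography"],
        "keyword": "encryption",
    }
])

questions = parse_questions(sample)
print(questions[0].topics)  # ['Cryptography']
```

To read the real file, something like `parse_questions(open("questions.json", encoding="utf-8").read())` should work if the assumptions above hold.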

Question Distribution

Topic	No. of Questions
SystemSecurity	1065
ApplicationSecurity	808
PenTest	475
MemorySafety	48
NetworkSecurity	230
WebSecurity	773
Vulnerability	334
SoftwareSecurity	232
Cryptography	14
Overall	2126
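
The per-topic counts above add up to 3979, which is more than the 2126 questions overall. This is consistent with each question being allowed several topics, so a question is counted once per topic it carries. The sketch below shows one way such a table could be tallied; the three sample records are invented, and the function is an illustration, not the script used to produce the table.

```python
from collections import Counter
from typing import Dict, Iterable, List


def topic_distribution(questions: Iterable[Dict[str, List[str]]]) -> Counter:
    """Count how many questions carry each topic (a question counts once per topic)."""
    counts: Counter = Counter()
    for q in questions:
        counts.update(set(q["topics"]))  # set() guards against repeated topics
    return counts


# Hypothetical records, made up for this example.
sample = [
    {"id": "q1", "topics": ["WebSecurity", "ApplicationSecurity"]},
    {"id": "q2", "topics": ["WebSecurity"]},
    {"id": "q3", "topics": ["Cryptography"]},
]

counts = topic_distribution(sample)
# WebSecurity count, sum of per-topic counts, number of questions
print(counts["WebSecurity"], sum(counts.values()), len(sample))  # 2 4 3
```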

Download

You can download the JSON file of the dataset by running:

wget https://huggingface.co/datasets/XuanwuAI/SecEval/resolve/main/questions.json

Alternatively, you can load the dataset from Hugging Face.

Evaluate Your Model on SecEval

You can use our evaluation script to evaluate your model on the SecEval dataset.
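
For readers who want a feel for the scoring, here is a rough sketch of per-topic and overall accuracy on multiple-choice questions. It is not the official evaluation script and may not match how the leaderboard numbers are computed. It assumes that `answer` holds one or more choice letters and that the hypothetical `predict` callable returns only choice letters; the sample questions are invented.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def normalize(answer: str) -> str:
    """Reduce an answer to its sorted, upper-cased letters, e.g. ' b a' -> 'AB'."""
    return "".join(sorted(ch for ch in answer.upper() if "A" <= ch <= "Z"))


def evaluate(questions: List[dict], predict: Callable[[dict], str]) -> Dict[str, float]:
    """Return exact-match accuracy (in percent) per topic and under 'Overall'."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for q in questions:
        hit = normalize(predict(q)) == normalize(q["answer"])
        # Each question counts once for every topic it carries, plus once overall.
        for topic in set(q["topics"]) | {"Overall"}:
            total[topic] += 1
            correct[topic] += int(hit)
    return {topic: 100.0 * correct[topic] / total[topic] for topic in total}


# Hypothetical questions, made up for this example.
sample = [
    {"question": "Q1", "answer": "A", "topics": ["WebSecurity"]},
    {"question": "Q2", "answer": "BC", "topics": ["WebSecurity", "Cryptography"]},
]

# A stand-in "model" that always answers "a".
scores = evaluate(sample, lambda q: "a")
print(scores["Overall"])  # 50.0
```

In practice `predict` would wrap a call to the model under test and extract the chosen letters from its output.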

Generation Process

Data Collection

Textbook: We selected open-licensed textbooks from the Computer Security courses CS161 at UC Berkeley and 6.858 at MIT. These resources provide extensive information on network security, memory safety, web security, and cryptography.
Official Documentation: We utilized official documentation, such as Apple Platform Security, Android Security, and Windows Security, to integrate system security and application security knowledge specific to these platforms.
Industry Guidelines: To encompass web security, we referred to the Mozilla Web Security Guidelines. In addition, we used the OWASP Web Security Testing Guide (WSTG) and OWASP Mobile Application Security Testing Guide (MASTG) for insights into web and application security testing.
Industry Standards: The Common Weakness Enumeration (CWE) was employed to address knowledge of vulnerabilities. For penetration testing, we incorporated the MITRE ATT&CK and MITRE D3FEND frameworks.

Questions Generation

To facilitate the evaluation process, we designed the dataset in a multiple-choice question format. Our approach to question generation involved several steps:

Text Parsing: We began by parsing the texts according to their hierarchical structure, such as chapters and sections for textbooks, or tactics and techniques for frameworks like ATT&CK.
Content Sampling: For texts with extensive content, such as CWE or Windows Security Documentation, we employed a sampling strategy to maintain manageability. For example, we selected the top 25 most common weakness types and 175 random types from CWE.
Question Generation: Utilizing GPT-4, we generated multiple-choice questions based on the parsed text, with the level of detail adjusted according to the content's nature. For instance, questions stemming from the CS161 textbook were based on individual sections, while those from ATT&CK were based on techniques.
Question Refinement: We then prompted GPT-4 to identify and filter out questions with issues such as being too simplistic or not self-contained. Where possible, questions were revised; otherwise, they were discarded.
Answer Calibration: We refined the selection of answer options by presenting GPT-4 with both the question and the source text from which the question was derived. If the response generated by GPT-4 diverged from the previously established answer, we took the discrepancy as a sign that a consistent answer to the question was inherently hard to obtain, and we eliminated the question.
Classification: Finally, we organized the questions into 9 topics, and attached a relevant fine-grained keyword to each question.
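
Two of the steps above, content sampling and answer calibration, can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' pipeline: it assumes the random entries are drawn from outside the top-ranked ones (so no entry is picked twice), and `answer_with_source` is a hypothetical callable standing in for a GPT-4 request that sees the question together with its source text. All sample data is invented.

```python
import random
from typing import Callable, List


def sample_entries(ranked_ids: List[str], top_n: int, random_n: int, seed: int = 0) -> List[str]:
    """Keep the first top_n entries plus random_n drawn at random from the rest."""
    head = ranked_ids[:top_n]
    tail = ranked_ids[top_n:]
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    return head + rng.sample(tail, min(random_n, len(tail)))


def calibrate(questions: List[dict], answer_with_source: Callable[[dict], str]) -> List[dict]:
    """Keep only questions whose stored answer matches a freshly generated one."""
    kept = []
    for q in questions:
        fresh = answer_with_source(q).strip().upper()
        if fresh == q["answer"].strip().upper():
            kept.append(q)
    return kept


# Placeholder identifiers, ordered from most to least common (hypothetical).
ids = [f"entry-{i}" for i in range(1, 1001)]
selected = sample_entries(ids, top_n=25, random_n=175)
print(len(selected))  # 200

# Hypothetical questions and a stub model that always answers "A".
candidates = [
    {"id": "q1", "answer": "A", "source": "..."},
    {"id": "q2", "answer": "B", "source": "..."},
]
kept = calibrate(candidates, lambda q: "a")
print([q["id"] for q in kept])  # ['q1']
```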

Limitations

The dataset, while comprehensive, exhibits certain constraints:

Distribution Imbalance: The dataset presents an uneven distribution of questions across different domains, resulting in a higher concentration of questions in certain areas while others are less represented.
Incomplete Scope: Some cybersecurity topics, such as content security, reverse engineering, and malware analysis, are absent from the dataset. As such, it does not cover the full breadth of knowledge within the field.

Future Work

Improvement on Distribution: We aim to broaden the dataset's comprehensiveness by incorporating additional questions, thereby enriching the coverage of existing cybersecurity topics.
Improvement on Topic Coverage: Efforts will be made to include a wider array of cybersecurity topics within the dataset, which will help achieve a more equitable distribution of questions across various fields.

Licenses

The dataset is released under the CC BY-NC-SA 4.0 license. The code is released under the MIT license.

Citation

@misc{li2023seceval,
    title={SecEval: A Comprehensive Benchmark for Evaluating Cybersecurity Knowledge of Foundation Models},
    author={Li, Guancheng and Li, Yifeng and Wang, Guannan and Yang, Haoyu and Yu, Yang},
    publisher = {GitHub},
    howpublished= "https://github.com/XuanwuAI/SecEval",
    year={2023}
}

Credits

This work is supported by Tencent Security Xuanwu Lab. We also appreciate the Tencent Spark Talent Program for its help.
