Malware_Agent

An exploration into the potential misuse of large language models


Draft

Introduction

An exploration into the potential misuse of large language models (LLMs), such as GPT-4, exposes a susceptibility to manipulation that could result in the generation of malicious software. For this initial draft, we attempt to retrieve information on how to create a virus using GPT-4.

We demonstrate two distinct strategies designed to circumvent existing safeguards. Additional information illustrating these manipulative approaches and their implications can be found in the accompanying GitHub repository: Attached Link.

The first approach involves subtle requests masked as academic inquiries or segments of a classroom exercise. The LLM interprets these as benign requests for input and may unwittingly provide information that aids in the construction of harmful software.

The second strategy, like the first, disassembles the virus into its constituent components. We then employ 'translational subterfuge': a layer of obfuscation that substitutes conventional terminology with unrelated terms, misleading the LLM's censorship algorithms. In this way, the system bypasses the LLM's safeguards against explicit malware prompting.

Execution of the first exploit

  1. Inquire about the various architectural components of a virus.
  2. Query each component individually, intentionally omitting select lines of code.
  3. Request the model to generate the omitted sections.
  4. Repeat steps 2 and 3 iteratively until a complete virus has been assembled.

This methodology can facilitate the generation of a virus, enabling even technically inexperienced individuals to produce harmful content.

Execution of the second exploit

The preliminary stage mirrors the first exploit. A mapping of standard terms to unrelated names is then created. This mapping confuses the model, preventing it from recognising and censoring sensitive information.

Vulnerability

Overcoming the challenges posed by these vulnerabilities requires a critical assessment of the LLM: How can it distinguish between legitimate queries and those aimed at exploitation?

The second strategy unmasks a more profound concern. By effectively circumventing the model's censorship, the exploit presents no immediate sign that would allow interception. Hence, it poses a considerable internal risk to the LLM.

Limitations

The primary constraint on the first exploit is that it could be defeated by a blanket censorship measure on sensitive topics.

For the second exploit, on the other hand, the lack of attention within the model is a glaring issue. Denying the LLM access to certain words could also undermine legitimate discussions, limiting the model's conversational ability and restricting opportunities for learning and development. These vulnerabilities therefore need to be addressed carefully, keeping the learning efficacy of the model intact.

Further Research

We plan to extend our research into this category of exploit. Specifically, we will integrate methodologies such as the 'chain of thought' prompting technique into our investigative framework. Vetting this refined approach will further define the scope and depth of our research in this domain.

.
├── README.md
├── DRAFT.md
├── TODO.md
├── translationTable.md
├── setup.sh
├── fd.c
├── prompts.txt
├── papers
│  ├── SelfAttentiveFE.pdf
│  ├── VulnerabilityDataSet.pdf
│  ├── AutomatedVulnerabilityDetectioninSourceCode.pdf
│  └── CoT.pdf
├── viral_prompts
│  ├── ransomware.md
│  └── ransomware_translationTable.md
├── src
│  ├── scrapper.py
│  └── model
├── games.txt
├── requirements.txt
├── .gitignore
├── .python-version
└── try.txt