
FLAIR

This repository contains the datasets and evaluation framework for the paper "Bot or Human? Detecting ChatGPT Imposters with A Single Question". The paper proposes a new framework named FLAIR (Finding Large Language Model Authenticity via a Single Inquiry and Response) to detect conversational bots in an online manner. The approach aims to differentiate human users from bots using single-question scenarios.

Datasets

The questions are divided into two categories:

  • Questions that are easy for humans but difficult for bots (e.g., counting, substitution, positioning, random editing, noise injection, and ASCII art)
  • Questions that are easy for bots but difficult for humans (e.g., memorization and computation)

Below is a description of each FLAIR question category:

  1. Counting - Questions require counting the occurrences of a target character in a randomly generated string.
  2. Substitution - Questions require deciphering a string where each character is substituted with another character based on a substitution table.
  3. Positioning - Questions require finding the k-th character after the j-th appearance of a character c in a randomly generated string.
  4. Random Editing - Questions require performing drop, insert, swap, and substitute operations on a random string and providing three different outputs.
  5. Noise Injection - Questions are common sense questions with noise injected by inserting uppercase random words between the words of the question.
  6. ASCII Art - Questions present a piece of ASCII art and require providing the corresponding label as the answer.
  7. Memorization - Questions require enumerating items within a category or answering domain-specific questions that are difficult for humans to recall.
  8. Computation - Questions require calculating the product of two randomly sampled four-digit numbers.

Evaluation

(Evaluation results figure.)

Implementation Details

We chose ten users for our user study.

1. Counting

  1. To conduct this experiment, we will first generate a candidate character set by randomly sampling 3 to 5 letters from the entire alphabet.
  2. Using the generated character set, we will create a random string by sampling k times, where k is set to 30 for this experiment.
  3. Next, we will randomly select a character from the generated string and ask users to count the number of times it appears.
  4. Each participant is allocated 10 counting questions. Answers should match the results exactly (see the sketch below).
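
A minimal sketch of the generation procedure above, using Python's random module (function and parameter names are illustrative, not from the released code):

```python
import random
import string

def make_counting_question(k=30, seed=None):
    """Generate a random string and a target character to count."""
    rng = random.Random(seed)
    # Candidate set: 3 to 5 distinct letters sampled from the alphabet.
    charset = rng.sample(string.ascii_lowercase, rng.randint(3, 5))
    # Random string: k samples (with replacement) from the candidate set.
    s = "".join(rng.choices(charset, k=k))
    # Target: a character drawn from the string, so it appears at least once.
    target = rng.choice(s)
    return s, target, s.count(target)

s, target, answer = make_counting_question(seed=0)
print(f"How many '{target}'s are in '{s}'? (expected answer: {answer})")
```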

2. Substitution

  1. We randomly choose 100 different English words as the original strings.
  2. Then, we design a random substitution rule that maps each character to another character.
  3. Given a word and a substitution rule, participants should apply the substitution and output the correct result.
  4. To standardize the experiment, each user will be allocated 10 substitution questions. Answers should match the results exactly (see the sketch below).
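
A minimal sketch of one way to build such a question, assuming lowercase words and a full-alphabet substitution table (names are illustrative; the released code may differ):

```python
import random
import string

def make_substitution_question(word, seed=None):
    """Build a random substitution table and the expected output for a word."""
    rng = random.Random(seed)
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    rng.shuffle(shuffled)
    # Substitution rule: each letter maps to a randomly chosen letter.
    table = dict(zip(letters, shuffled))
    expected = "".join(table[c] for c in word)  # assumes lowercase input
    return table, expected

table, expected = make_substitution_question("flair", seed=0)
print(f"Applying the rule to 'flair' should yield '{expected}'")
```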

3. Positioning

  1. For our experiment, we will start by generating a candidate character set by randomly sampling 6 to 10 letters from the entire alphabet.
  2. Using the generated character set, we will create a random string by sampling k times, where k is set to 30 for this experiment.
  3. Next, we will randomly select a character from the generated string. Users should find the k-th character after the j-th occurrence of the selected character.
  4. Each participant is allocated 10 positioning questions. Answers should match the results exactly (see the sketch below).
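
A minimal sketch of the procedure above; to avoid overloading k (used both for the string length and the offset in the prose), the string length is named `length` here, and j and k are chosen so the answer stays inside the string (all names are illustrative):

```python
import random
import string

def make_positioning_question(length=30, seed=None):
    """Ask for the k-th character after the j-th occurrence of a character c."""
    rng = random.Random(seed)
    # Candidate set: 6 to 10 distinct letters sampled from the alphabet.
    charset = rng.sample(string.ascii_lowercase, rng.randint(6, 10))
    s = "".join(rng.choices(charset, k=length))
    # Pick c from everywhere but the last position so a character follows it.
    c = rng.choice(s[:-1])
    occurrences = [i for i, ch in enumerate(s) if ch == c]
    # Keep only occurrences that have at least one character after them.
    usable = [i for i in occurrences if i < len(s) - 1]
    pos = rng.choice(usable)
    j = occurrences.index(pos) + 1          # 1-based occurrence index
    k = rng.randint(1, len(s) - 1 - pos)    # offset stays inside the string
    return s, c, j, k, s[pos + k]

s, c, j, k, answer = make_positioning_question(seed=0)
print(f"In '{s}', the {k}-th character after occurrence {j} of '{c}' is '{answer}'")
```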

4. Random Edit

  1. For the first category of questions, we will randomly drop k zeros or ones from a sequence of 20 bits.
  2. For the second category of questions, we will randomly add k zeros or ones to a sequence of 20 bits.
  3. In the third category, we will randomly substitute k zeros with ones or k ones with zeros in a sequence of 20 bits.
  4. The fourth category of questions will involve randomly swapping zeros and ones k times in a sequence of 20 bits.
  5. Each participant is allocated 10 random edit questions from 2 categories. Answers should pass our answer checker (see the sketch below).
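
A minimal sketch of the four edit operations on a bit string (the exact sampling and the answer checker in the released code may differ; the swap branch assumes the string contains both zeros and ones):

```python
import random

def random_edit(bits, op, k, seed=None):
    """Apply one of the four edit operations k times to a bit string."""
    rng = random.Random(seed)
    s = list(bits)
    for _ in range(k):
        if op == "drop":            # remove a random bit
            s.pop(rng.randrange(len(s)))
        elif op == "insert":        # insert a random bit at a random position
            s.insert(rng.randrange(len(s) + 1), rng.choice("01"))
        elif op == "substitute":    # flip a random bit
            i = rng.randrange(len(s))
            s[i] = "1" if s[i] == "0" else "0"
        elif op == "swap":          # exchange a random '0' with a random '1'
            zeros = [i for i, b in enumerate(s) if b == "0"]
            ones = [i for i, b in enumerate(s) if b == "1"]
            i, j = rng.choice(zeros), rng.choice(ones)
            s[i], s[j] = s[j], s[i]
    return "".join(s)

bits = "".join(random.Random(0).choices("01", k=20))
print(bits, "->", random_edit(bits, "substitute", k=2, seed=1))
```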

5. Noise Injection

  1. To design our experiment, we first collected a set of 100 common sense questions along with their corresponding answers. Additionally, we generated a set of 400 random words to serve as noise.
  2. In order to inject noise into the common sense questions, we replaced the spaces within the questions with uppercase random words.
  3. Users will be presented with the noisy questions and are required to remove the random words and answer the questions correctly.
  4. Each participant is allocated 10 noise injection questions from 2 categories. Note that any answer that makes sense is considered correct (see the generation sketch below).
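
A minimal sketch of the noise-injection step, replacing each space with an uppercase word drawn from the noise set (names and the sample noise words are illustrative):

```python
import random

def inject_noise(question, noise_words, seed=None):
    """Replace every space in the question with an uppercase random word."""
    rng = random.Random(seed)
    tokens = question.split(" ")
    noisy = tokens[0]
    for token in tokens[1:]:
        noisy += rng.choice(noise_words).upper() + token
    return noisy

# Yields the question with its spaces replaced by uppercase noise words.
print(inject_noise("what color is the sky", ["apple", "river", "stone"], seed=0))
```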

6. ASCII Art

  1. To conduct our experiment, we first collected a set of 50 ASCII art images from https://www.asciiart.eu/.
  2. For the experiment, users are presented with the ASCII art and are required to identify what is depicted in each image.
  3. Each participant is allocated 5 ASCII art questions. Note that any answer that makes sense is considered correct.

7. Memorization

  1. We have collected 100 questions from various professional fields, including both numerical and knowledge-based questions.
  2. For numerical questions, users are required to provide an answer with an error margin of no more than 5%.
  3. For knowledge-based questions, users must provide accurate answers.
  4. Each participant is allocated 10 random memorization questions (see the tolerance-check sketch below).
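
For numerical questions, the 5% margin amounts to a relative-error check, sketched below (the actual grading code may differ):

```python
def within_tolerance(answer, truth, rel_tol=0.05):
    """Accept a numerical answer whose relative error is at most 5%."""
    return abs(answer - truth) <= rel_tol * abs(truth)

# A true value of 3950 accepts anything in [3752.5, 4147.5].
print(within_tolerance(4000, 3950))  # True
```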

8. Computation

  1. Users are required to complete a multiplication question involving two randomly generated four-digit numbers within a time limit of 10 seconds.
  2. Any answers submitted after the time limit will be marked as incorrect.
  3. In order for the answer to be considered correct, the margin of error must be within 5% (see the sketch below).
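
A minimal interactive sketch of this check; for simplicity the timer is evaluated after the reply is entered rather than interrupting input, and all names are illustrative:

```python
import random
import time

def computation_question(time_limit=10.0, rel_tol=0.05, seed=None):
    """Ask for the product of two random four-digit numbers under a time limit."""
    rng = random.Random(seed)
    a, b = rng.randint(1000, 9999), rng.randint(1000, 9999)
    start = time.monotonic()
    reply = input(f"What is {a} * {b}? ")
    elapsed = time.monotonic() - start
    if elapsed > time_limit:        # late answers are marked incorrect
        return False
    try:
        answer = float(reply)
    except ValueError:
        return False
    truth = a * b
    # Correct if the margin of error is within 5% of the true product.
    return abs(answer - truth) <= rel_tol * truth

print("correct" if computation_question(seed=0) else "incorrect")
```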

Contributing

We welcome contributions to expand the dataset and improve the detection of conversational bots. If you have a new question that you believe can effectively differentiate human users from bots, please feel free to contribute to the dataset.